How to Bulk Load Vectors Into SingleStoreDB

Eric Hanson

Director of Product Management

SingleStoreDB supports vector data and nearest-neighbor search, with powerful vector query processing in SQL.

Vector search is a foundational part of generative AI applications like chatbots, as well as other AI applications like semantic text search, face matching, object photo matching and many more. The basics of using SingleStoreDB for vector search are covered in our documentation.

Before you can demonstrate the power of AI with your vector data, you need to get it into your database! You can of course insert vector data with regular INSERT statements from your application. For larger data sets, you can bulk load vectors and associated data with LOAD DATA or PIPELINES. For example, suppose you have this table:

create table vectors(id int, vec blob, shard key(id));

And a file /data/vector_data.txt of data to load into it, containing these lines:

1       [0.18,0.36,0.54,0.73]
2       [0.62,0.70,0.15,0.31]

Tabs are used to separate the columns. It's convenient to use a tab (or some character other than a comma) as the column separator so the loader doesn't mistake the commas inside the JSON array of numbers for column separators.
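As an aside, if your application only needs to add a handful of rows, a plain INSERT works too. Here's a minimal sketch (the id and vector values are just made-up examples):

insert into vectors (id, vec)
values (3, json_array_pack('[0.10,0.20,0.30,0.40]'));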

To do a simple, one-time load of a data file, use LOAD DATA like this:

load data infile '/data/vector_data.txt'
into table vectors
fields terminated by '\t'
(id, @vec_text)
set vec = json_array_pack(@vec_text);

Using JSON_ARRAY_PACK is important here because you need to convert the text form of the vector (a JSON array of numbers) to the internal packed binary format (an array of 32-bit floats). That's the format that the DOT_PRODUCT and EUCLIDEAN_DISTANCE functions require as input.
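Once the vectors are stored in packed format, you can run a nearest-neighbor query directly against them. For example, a minimal sketch of a dot-product similarity search (the query vector literal here is just an example):

select id, dot_product(vec, json_array_pack('[0.18,0.36,0.54,0.73]')) as score
from vectors
order by score desc
limit 1;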

To validate that the vector data is there, run:

select id, json_array_unpack(vec)
from vectors;

You'll see this:

+------+---------------------------------------------------+
| id   | json_array_unpack(vec)                            |
+------+---------------------------------------------------+
|    1 | [0.180000007,0.360000014,0.540000021,0.730000019] |
|    2 | [0.620000005,0.699999988,0.150000006,0.310000002] |
+------+---------------------------------------------------+

The rounding errors compared with the input file are normal due to the inexact nature of floating-point conversions from decimal strings.

If you want to load a large data set, or want continuous streaming load, you can use a PIPELINE. For example:

truncate table vectors; /* remove the data from the previous load */

create pipeline pipe as
load data fs '/data/vector_data.txt'
into table vectors
fields terminated by '\t'
(id, @vec_text)
set vec = json_array_pack(@vec_text);

start pipeline pipe;
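If you want to confirm the pipeline is running and loaded the file without errors, a quick check (using the standard pipeline monitoring commands; not strictly required for this walkthrough) is:

show pipelines;

select * from information_schema.pipelines_errors;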

Now, we can see the data is there:

select id, json_array_unpack(vec)
from vectors;

And finally, since we're not using it any more, stop the pipeline:

stop pipeline pipe;

If you're running on our cloud service, things work almost the same way — but you'll likely want to put your input data in a cloud object store. Our documentation for LOAD DATA and PIPELINES shows how to adjust the file name to reference a file in S3, Azure Blob Store or GCS.
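For example, a pipeline that reads the same file from S3 might look roughly like this (a sketch only; the bucket name, region and credentials are placeholders you'd replace with your own):

create pipeline pipe_s3 as
load data s3 'my-bucket/vector_data.txt'
config '{"region": "us-east-1"}'
credentials '{"aws_access_key_id": "<key>", "aws_secret_access_key": "<secret>"}'
into table vectors
fields terminated by '\t'
(id, @vec_text)
set vec = json_array_pack(@vec_text);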

For more information on why SingleStoreDB is a great choice for working with vector data, explore our documentation and our blog on the topic.


Interested in trying these vector capabilities yourself? Get started with a free trial of SingleStoreDB today.

