
Working with Vector Data


Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

Using vector embeddings has become popular recently, but getting vector data into your database can leave you with a lot of questions. This notebook shows various ways to load vectors into SingleStoreDB from Python using the Python client, SQLAlchemy, pandas, and the SQL magic commands. It covers vectors in the form of numpy arrays or Python lists of numerics.

We'll use the following function to reset the vector data table between examples.

In [1]:

def reset_table():
    """Reset the table for use in the examples below."""
    with s2.connect() as conn:
        with conn.cursor() as cur:
            cur.execute('DROP TABLE IF EXISTS vectors;')
            cur.execute(r'''
                CREATE TABLE vectors (
                    vec_f32 BLOB
                );
            ''')

At any time, if you want to see the actual query being sent to the database, you can set the following environment variable before making the query to the server.

In [2]:

import os
# os.environ['SINGLESTOREDB_DEBUG_QUERIES'] = '1'

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Create a database for our examples.

In [3]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS vector_data;
    %sql CREATE DATABASE vector_data;

Action Required

Make sure to select the vector_data database from the drop-down menu at the top of this notebook. It updates the connection_url which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.

Generate numpy arrays containing vector data

The code below generates 1,000 rows of 10 random 32-bit float numbers in a numpy array. This data will be used in the following examples.

In [4]:

import numpy as np

In [5]:

vec_f32 = [np.random.rand(10).astype(np.float32) for _ in range(1000)]
vec_f32[:3]

Create a Python list of float values from the numpy array

We will show how to work with both numpy arrays and Python lists in the following examples. This cell creates a list of Python lists of floats equivalent to the numpy arrays above.

In [6]:

vec_f32_list = [[float(y) for y in x] for x in vec_f32]
vec_f32_list[:3]

Uploading and downloading data to SingleStoreDB

In the following sections, we'll describe how to use the SingleStoreDB Python client, SQLAlchemy, the %%sql magic, and pandas to upload and download vector data.

Using SingleStoreDB Python client

In [7]:

import singlestoredb as s2
conn = s2.connect()
cursor = conn.cursor()

Working with numpy arrays

The SingleStoreDB Python client supports numpy arrays natively. If a numpy array is passed as a parameter to a query, it will be converted to a byte string containing the contents of the array. The data type of the numpy array is preserved, so you need to ensure that it is the proper numpy dtype before uploading. You can change the data type of a numpy array by using the astype method.
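As a minimal database-free sketch, the dtype determines the byte layout that gets uploaded, so convert with astype before passing the array as a query parameter:

```python
import numpy as np

# np.random.rand returns float64 by default; convert to float32 before upload
vec = np.random.rand(10)
vec32 = vec.astype(np.float32)

# The serialized size depends on the dtype: 8 bytes vs. 4 bytes per value
assert vec.dtype == np.float64 and len(vec.tobytes()) == 80
assert vec32.dtype == np.float32 and len(vec32.tobytes()) == 40
```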

In [8]:

reset_table()

Recall that vec_f32 contained numpy arrays of float32 values.

In [9]:

vec_f32[:3]

The executemany method will insert multiple rows of data in a single SQL query.

In [10]:

cursor.executemany('INSERT INTO vectors(vec_f32) VALUES (%s)', vec_f32)

To download the vector data from SingleStoreDB, you simply execute a SELECT statement. The data is held in blob columns, so the result will simply contain byte strings.

In [11]:

cursor.execute('SELECT vec_f32 FROM vectors LIMIT 5')

Since we want to use the data as numpy arrays, we can "reconstitute" the arrays as we read the data using the np.frombuffer function.

In [12]:

out_f32 = [np.frombuffer(x[0], dtype=np.float32) for x in cursor]
out_f32
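The upload/download pair can be verified without a database: tobytes() produces exactly the blob that np.frombuffer reconstructs. A minimal round-trip sketch:

```python
import numpy as np

# Serialize to the blob format, then reconstruct the array from the bytes
original = np.random.rand(10).astype(np.float32)
blob = original.tobytes()                         # bytes stored in the BLOB column
restored = np.frombuffer(blob, dtype=np.float32)  # array read back from the blob

assert np.array_equal(original, restored)
```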

Working with Python lists

It is also possible to upload Python lists without going through a numpy array using the struct package. In this method, we convert the floats to a byte string and pass that byte string as the parameter to the INSERT statement. The possible format codes are as follows. The little-endian indicator (<) should also be used.

  • f - float32

  • d - float64

  • b - int8

  • h - int16

  • l - int32

  • q - int64
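The cells below use the f (float32) code; as a quick sketch of another format code, here is a float64 round trip using d (the sample values are chosen to be exactly representable in binary):

```python
import struct

# Pack three float64 values little-endian; '<3d' means 3 x 8-byte floats
values = [1.5, -2.25, 3.0]
fmt64 = '<{}d'.format(len(values))
blob = struct.pack(fmt64, *values)

assert len(blob) == 24                            # 3 values x 8 bytes each
assert list(struct.unpack(fmt64, blob)) == values
```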

In [13]:

reset_table()

In [14]:

import struct
# Construct the format for a vector of 10 32-bit floats, in this case it is '<10f'
fmt = '<{}f'.format(len(vec_f32_list[0]))
vec_f32_list_bytes = [struct.pack(fmt, *x) for x in vec_f32_list]
vec_f32_list_bytes[:3]

The INSERT and SELECT code is the same as for numpy arrays.

In [15]:

cursor.executemany('INSERT INTO vectors(vec_f32) VALUES (%s)', vec_f32_list_bytes)

In [16]:

cursor.execute('SELECT vec_f32 FROM vectors LIMIT 5')

To unpack the rows as Python lists, we use the struct package again.

In [17]:

out_f32_list = [list(struct.unpack(fmt, x[0])) for x in cursor]
out_f32_list[:3]

Using SQLAlchemy

In order to use SingleStoreDB with SQLAlchemy, you need to install the sqlalchemy-singlestoredb dialect as follows.

pip install sqlalchemy-singlestoredb

In [18]:

import sqlalchemy as sa
eng = sa.create_engine(connection_url)
conn = eng.connect()

The SQLAlchemy method works much like the SingleStoreDB method. However, SQLAlchemy (v2+) requires parameters to be in a dictionary, and the substitution syntax is of the form :var_name, where var_name is the key in the dictionary.

Working with numpy arrays

In [19]:

reset_table()

SQLAlchemy requires you to construct the query as a sa.text object. Parameters for inserting multiple rows are in a list of dictionaries.

In [20]:

query = sa.text('INSERT INTO vectors(vec_f32) VALUES (:vec_f32)')
conn.execute(query, [dict(vec_f32=x) for x in vec_f32])

Selecting the data works much as before as well.

In [21]:

result = conn.execute(sa.text('SELECT vec_f32 FROM vectors LIMIT 5'))

We can use the np.frombuffer function again to convert the byte strings to numpy arrays.

In [22]:

out_f32 = [np.frombuffer(x[0], dtype=np.float32) for x in result]
out_f32

Working with Python lists

To upload Python lists of values, you use the struct package to construct the byte strings as described in the "Working with Python lists" section above. The rest of the code here stays the same, with the exception of replacing vec_f32 with vec_f32_list_bytes as the query parameter for the INSERT query.

In [23]:

reset_table()

Recall that we created a list of bytes objects (packed vectors) in the previous example. This list can be passed to the INSERT just like the numpy arrays.

In [24]:

vec_f32_list_bytes[:3]

In [25]:

query = sa.text('INSERT INTO vectors(vec_f32) VALUES (:vec_f32)')
conn.execute(query, [dict(vec_f32=x) for x in vec_f32_list_bytes])

In [26]:

result = conn.execute(sa.text('SELECT vec_f32 FROM vectors LIMIT 5'))

Unpacking the Python lists works as before as well.

In [27]:

out_f32_list = [list(struct.unpack(fmt, x[0])) for x in result]
out_f32_list[:3]

Using pandas

The pandas package has utilities for working with databases. The two primary methods are DataFrame.to_sql, which uploads DataFrame data to a table, and pd.read_sql, which downloads data from a table.

In [28]:

reset_table()

In [29]:

import pandas as pd

First, we'll create a pandas DataFrame with our numpy arrays.

Working with numpy arrays

In [30]:

df = pd.DataFrame(dict(vec_f32=pd.Series(vec_f32)))
df.head()

We can use the to_sql method of the DataFrame to upload the data. Notice that we are using the SQLAlchemy connection we created in the previous section as the con parameter.

In [31]:

df.to_sql('vectors', con=conn, if_exists='append', index=False)

To read the data, we use the read_sql function. As before, we are getting byte strings back that will need to be converted into numpy arrays.

In [32]:

out_df = pd.read_sql('vectors', con=conn)
out_df.head(3)

We apply the np.frombuffer function to each element in the vec_f32 column to reconstruct the numpy array.

In [33]:

out_df['vec_f32'] = out_df['vec_f32'].apply(lambda x: np.frombuffer(x, dtype=np.float32))

In [34]:

out_df.head(3)

Working with Python lists

Because Python lists are not typed arrays like numpy arrays, we have to convert them to bytes before uploading them.

In [35]:

reset_table()

Construct a DataFrame using Python lists as the data.

In [36]:

df = pd.DataFrame(dict(vec_f32=vec_f32_list))
df.head(3)

Note that we are using our fmt value from a previous section to convert the Python lists to bytes using struct.pack.

In [37]:

fmt

In [38]:

df['vec_f32'] = df['vec_f32'].apply(lambda x: struct.pack(fmt, *x))

In [39]:

df['vec_f32'].head(3)

Use the to_sql method to upload the DataFrame.

In [40]:

df.to_sql('vectors', con=conn, if_exists='append', index=False)

In [41]:

out_df = pd.read_sql('vectors', con=conn)
out_df.head(3)

We now have to convert the byte strings back to Python lists.

In [42]:

out_df['vec_f32'] = out_df['vec_f32'].apply(lambda x: list(struct.unpack(fmt, x)))

In [43]:

out_df.head(3)

Using the %%sql / %sql magic commands

While the SQL magic commands are convenient for invoking basic SQL commands, they aren't as well suited for complex queries that insert data. The primary issue is that you must construct the query as a string and ensure that all of your data is properly escaped. We'll demonstrate some basics here, but the methods described in the previous sections are likely to work better.

In [44]:

reset_table()

Working with numpy arrays or Python lists

The SQL magic commands do not do any of the automatic conversions of data to query parameters, so this must be done manually before creating the query. This is done the same way whether the source is numpy arrays or Python lists. In either case, you must convert the objects to byte strings as we have in the previous sections, then convert that byte string into a hex literal that can be used in the query.

In [45]:

# Convert an element of the numpy array to a hex string
vec_f32[0].tobytes().hex()

In [46]:

# Convert an element of the Python list to a hex string
struct.pack(fmt, *vec_f32_list[0]).hex()

To construct the query string for the %%sql command, we need to build the entire list of values to insert in a separate step. We'll insert the X at the beginning of the string to indicate a hex literal to SingleStoreDB. We'll also add the parentheses around the value for inserting multiple rows of data using the INSERT statement.

In [47]:

params = ["(X'{}')".format(x.tobytes().hex()) for x in vec_f32]
params[:3]

In [48]:

%%sql
INSERT INTO vectors(vec_f32) VALUES {{ ','.join(params) }}

We can now select the data.

In [49]:

%%sql out <<
SELECT * FROM vectors LIMIT 5

In [50]:

out

At this point, there is nothing we can do with SQL magic commands to convert the data back into numpy arrays or Python lists. We need to drop to Python for that.

In [51]:

out_df = pd.DataFrame(out)

In [52]:

out_df['vec_f32'] = out_df['vec_f32'].apply(lambda x: np.frombuffer(x, dtype=np.float32))

In [53]:

out_df.head(3)

Using JSON

It is also possible to use JSON to create vectors; however, this method requires serializing and deserializing JSON on either end, which isn't quite as efficient as the techniques above. It also requires using the JSON_ARRAY_PACK and JSON_ARRAY_UNPACK functions in your queries to go back and forth between the vector bytes and JSON. Here is an example of inserting the Python list of floats.

In [54]:

import json

In [55]:

params = ['(JSON_ARRAY_PACK("{}"))'.format(json.dumps(x)) for x in vec_f32_list]
params[:3]

In [56]:

%%sql
INSERT INTO vectors(vec_f32) VALUES {{ ','.join(params) }}

If you use the JSON_ARRAY_UNPACK function in your SELECT statement, you can download the data as JSON.

In [57]:

%%sql out <<
SELECT JSON_ARRAY_UNPACK(vec_f32) AS 'vec_f32' FROM vectors LIMIT 5

In [58]:

out = pd.DataFrame(out)
out

In [59]:

out['vec_f32'][0]

Notice that since the data type of the column in the SELECT is JSON, it automatically gets converted to a Python list in the client.

In [60]:

type(out['vec_f32'][0])

Conclusion

As you can see, there are various interfaces available for uploading and downloading vector data. Depending on which Python framework you are using and what format your data is in, you can pick and choose which methods work for your use-case.

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [61]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS vector_data;

Details

Tags

#starter #vectordb #genai

License

This Notebook has been released under the Apache 2.0 open source license.