Fast and Easy Vector Ingestion from Python DataFrames

Arnaud Comet

Director, Product Management

DataFrames are an easy way to manipulate data with Python. As usage of our vector features has grown, we've had questions about ingesting embeddings into SingleStoreDB directly from Pandas DataFrames.

To do that, you need to convert the array containing the vector embedding into a packed byte string using numpy. Using DataFrames, you get both speed and simplicity when loading your embeddings as vectors into SingleStoreDB — especially if you use the approach presented here.
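For example, here is a minimal sketch of that conversion on a toy three-element vector (the values are illustrative; float32 is used because SingleStore's vector functions operate on 32-bit floats by default):

import numpy as np

# Pack a vector embedding into a byte string suitable for a BLOB column
embedding = [0.12, -0.53, 0.91]   # illustrative values
packed = np.array(embedding, dtype=np.float32).tobytes()
print(type(packed), len(packed))  # <class 'bytes'> 12 (3 floats x 4 bytes)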

Connection

If you run locally or on a different cloud service, you will need to create a connection string:

from sqlalchemy import *

database_name = 'mydatabase'
db_connection_str = "mysql+pymysql://" + connection_user + ":" + connection_password \
    + "@" + connection_host + ":" + connection_port + "/" + database_name \
    + "?ssl_cipher=HIGH"
db_connection = create_engine(db_connection_str)
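Before moving on, it can be worth sanity-checking the engine with a trivial query (a minimal sketch; connection_user, connection_password, connection_host and connection_port above are placeholders for your own credentials):

from sqlalchemy import text

# Run a trivial query to confirm the connection works
with db_connection.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 on success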

With a SingleStore notebook, you simply need to do the following and select the database you want to access:

from sqlalchemy import *

db_connection = create_engine(connection_url)

Getting the Dataset Example

Let’s get a sample dataset from the 2022 Beijing Winter Olympics that includes text chunks (first column) and their embeddings (second column). Make sure you have the required libraries installed with !pip3 install.

Note: SingleStore-hosted notebooks only require you to install the wget library.
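For example (the package list here is inferred from the imports below; os, ast and json ship with Python):

!pip3 install pandas wget numpy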

# Import libraries
import pandas as pd
import os
import wget
import ast
import numpy as np
import json

# Get the data
embeddings_url = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
embeddings_path = "winter_olympics_2022.csv"

if not os.path.exists(embeddings_path):
    wget.download(embeddings_url, embeddings_path)
    print("File downloaded successfully.")
else:
    print("File already exists in the local file system.")

# Import the data as a DataFrame, converting each embedding from a JSON
# string into a numpy array packed as a float32 byte string - if your
# embeddings are already arrays, you only need the packing step
df = pd.read_csv(
    "winter_olympics_2022.csv",
    converters=dict(
        embedding=lambda x: np.array(json.loads(x), dtype=np.float32).tobytes()
    )
)
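A quick sanity check on the loaded DataFrame (a minimal sketch; the integer division by 4 recovers the dimension count from the float32 byte length):

# Each embedding should now be a packed byte string
print(df.shape)
print(type(df['embedding'].iloc[0]))      # <class 'bytes'>
print(len(df['embedding'].iloc[0]) // 4)  # vector dimensions (float32 = 4 bytes)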

Create a Table

This DataFrame has two columns:

  • text: the text from the Wikipedia pages
  • embedding: the embedding of that text, which we will store in a BLOB column

We will create a simple table using SQL:

CREATE TABLE IF NOT EXISTS winter_olympics_2022 (
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);
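If you are working from a notebook, here is one way to run that DDL through the SQLAlchemy engine created earlier (a sketch; any SQL client works just as well):

from sqlalchemy import text

# Create the table from Python using the existing engine
with db_connection.connect() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS winter_olympics_2022 (
            text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
            embedding BLOB
        )
    """))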


Load Your DataFrame Into the SQL Table

Now we want to ingest the embeddings into the table we previously created. Because the converter already packed each embedding into a byte string, a single to_sql call is all it takes:

df.to_sql('winter_olympics_2022', con=db_connection, if_exists='append', index=False, chunksize=1000)
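To confirm the load worked, a quick check (the row count in the table should match the DataFrame):

from sqlalchemy import text

# Verify the ingest by comparing row counts
with db_connection.connect() as conn:
    n = conn.execute(text("SELECT COUNT(*) FROM winter_olympics_2022")).scalar()
print(n, len(df))  # the two numbers should match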

That’s it! The following alternative is slower for ingesting embeddings (JSON_ARRAY_PACK requires additional processing on the database side) and needs more lines of code:

def insert_records(
    df: pd.DataFrame,
    table_name: str,
    batch_size: int = 1000
):
    # This path sends each embedding as a JSON string and lets the database
    # pack it, so it expects the embedding column to still hold arrays
    # (skip the .tobytes() conversion if you use it)
    stmt = f"""
        INSERT INTO {table_name} (
            text,
            embedding
        )
        VALUES (
            %s,
            JSON_ARRAY_PACK(%s)
        )
    """

    record_arr = df.to_records(index=True)

    with db_connection.begin() as conn:
        for i in range(0, len(record_arr), batch_size):
            batch = record_arr[i:i+batch_size]
            # row[0] is the index; row[1] is text, row[2] the embedding array
            values = [(row[1], json.dumps(row[2].tolist())) for row in batch]
            conn.exec_driver_sql(stmt, values)

insert_records(df, "winter_olympics_2022")

Recap

Because the data came out of the CSV file as a string, we had to convert it into an array with the json module and pack it into a byte string with numpy: `converters=dict(embedding=lambda x: np.array(json.loads(x), dtype=np.float32).tobytes())`. If you already have your embeddings in the form of an array, you can skip the json.loads part of that conversion.
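For instance, if your embeddings come back from an embeddings API as plain Python lists, only the packing step remains (a sketch; the column name is assumed to match the table schema):

# Embeddings already in array form: skip json.loads and just pack to bytes
df['embedding'] = df['embedding'].apply(
    lambda arr: np.array(arr, dtype=np.float32).tobytes()
)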

Learn more about all the to_sql options in the Pandas documentation: pandas.DataFrame.to_sql — pandas 2.0.2 documentation

Get Started with SingleStoreDB

If you are looking to better understand our search and vector functionality, I recommend starting with the SingleStore Helios trial and picking one of the four available notebooks around semantic search.

