New

Inserting embeddings from multiple models into SingleStore Using External Functions

Notebook

SingleStore Notebooks

Inserting embeddings from multiple models into SingleStore Using External Functions

Overview

In the realm of database management and artificial intelligence, the ability to directly incorporate vector embeddings from leading providers into your database can significantly enhance your application's capabilities. This notebook delves into the practicalities of using SingleStoreDB’s external functions to dynamically fetch and update vector embeddings for textual data, focusing on models from OpenAI and Hugging Face. This approach simplifies the process of enriching your database with deep learning insights, directly within SQL queries.

SingleStoreDB facilitates the storage of vector embeddings in two primary ways: using the blob type with the JSON_ARRAY_PACK function to convert a JSON array of floating-point numbers into an encoded blob, and the vector type, introduced in version 8.5, designed specifically for efficient handling of high-dimensional data.

Blob Type and JSON_ARRAY_PACK

The blob type, combined with JSON_ARRAY_PACK, provides a method to store vector embeddings as encoded blobs. This approach allows for the flexibility of storing embeddings from multiple models in the same column by including a model_id column to identify which embedding corresponds to which model. However, it does not leverage SingleStoreDB's advanced vector operation capabilities.

Vector Type for Enhanced Efficiency

From version 8.5, SingleStoreDB supports the vector data type, offering an ordered collection of numeric values with a fixed number of dimensions. This type is optimized for various data representations, including embeddings from large language models (LLMs), making it easier to insert, load, and query vector data. The vector type requires specifying the dimension size and currently supports F32 as the element type. When choosing the vector type, it's important to note that, depending on the model of embedding, you will have to create a new column for each model since the dimensions must be specified for the vector type. This requirement can influence database schema design based on the diversity of embedding models used.

Vector Index and ANN Search

SingleStoreDB's support for vector similarity scoring and Approximate Nearest Neighbor (ANN) search enables efficient k-nearest neighbor queries, especially beneficial for large datasets and high concurrency requirements. While exact kNN search provides precise results, ANN search offers a faster, though approximate, alternative, striking a balance between accuracy and speed. This feature is ideal for applications such as semantic search of text, retrieval-augmented generation (RAG), and image matching based on vector embeddings similarity.

Key Considerations

  • Storage Flexibility: Using the blob type for embedding storage offers flexibility in handling multiple models within a single column but precludes the use of vector indices and ANN search capabilities.

  • Efficiency and Specificity: The vector type, along with vector indices, enables efficient ANN searches, necessitating separate columns for embeddings from different models due to dimension specification requirements.

Architecture diagram :

Architecture Diagram

Database Setup for Embeddings Demo

This section outlines the initial setup required for our embeddings demo, including the creation of databases and tables to store sentences and their corresponding embeddings.

Steps:

  1. Drop Existing Database (if exists): Ensures that there is no existing embeddings_demo database that might conflict with our setup.

  2. Create New Database: Initializes a fresh database named embeddings_demo for our demo.

  3. Switch to New Database: Sets the context to the newly created database for subsequent operations.

  4. Create random_sentences Table: This table is designed to store sentences along with a unique identifier (uuid) and a timestamp. The structure supports the insertion of sample sentences for which we will generate embeddings.

  5. Insert Sample Data: Populates the random_sentences table with a variety of sentences. These sentences serve as our data source for embedding generation.

  6. Create random_sentences_embeddings Table: Designed to store the sentences along with their embeddings generated by two different models - OpenAI's Ada model and the Hugging Face MiniLM L12 v2 model. The embeddings are stored in vector format with specified dimensions, utilizing SingleStore's vector data type for efficient embedding storage and operations.

  7. Create random_sentences_embeddings_2 Table: A table intended to store sentences and their embeddings in a blob format, providing an alternative structure for embedding storage.

Purpose:

  • The random_sentences table serves as the input for our embedding generation process.

  • The random_sentences_embeddings table showcases how to store embeddings in a structured vector format, allowing for direct operations on embeddings within the database. This table leverages SingleStore's vector data type, enabling ordered collections of numeric values with fixed dimensions for embeddings.

  • The duplicated creation command for random_sentences_embeddings_2 appears to be an oversight and should be considered for removal to avoid confusion.

In [1]:

%%sql
DROP DATABASE IF EXISTS embeddings_demo;
CREATE DATABASE embeddings_demo;
USE embeddings_demo;

In [2]:

%%sql
CREATE TABLE random_sentences (
uuid VARCHAR(256),
sentence VARCHAR(512),
timestamp TIMESTAMP DEFAULT NOW()
);
-- inserting sample data
INSERT INTO random_sentences VALUES (uuid(), 'Hello there, how are you today?', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'The quick brown fox jumped over the lazy dog', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'She sells seashells by the seashore', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'The early bird gets the worm', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'Fortune favors the bold', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'A penny saved is a penny earned', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'You cant teach an old dog new tricks', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'The grass is always greener on the other side', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'Birds of a feather flock together', DEFAULT);
INSERT INTO random_sentences VALUES (uuid(), 'Actions speak louder than words', DEFAULT);

In [3]:

%%sql
SELECT * FROM random_sentences;

In [4]:

%%sql
CREATE TABLE random_sentences_embeddings (
uuid VARCHAR(256),
sentence VARCHAR(512),
openai_ada002_embeddings VECTOR(1536),
hf_miniLM_L12_v2_embeddings VECTOR(384)
);

In [5]:

%%sql
CREATE TABLE random_sentences_embeddings_2 (
uuid VARCHAR(256),
sentence VARCHAR(512),
model_id VARCHAR(512),
embedding BLOB
);

Inserting Embeddings into SingleStoreDB

This segment of the notebook demonstrates the activation of external functions within SingleStoreDB and outlines the steps for creating and utilizing an external function, get_embedding, to dynamically generate text embeddings. Additionally, we detail the implementation of stored procedures designed to automate the insertion of these embeddings into specified tables.

The get_embedding External Function

We introduce an external function, get_embedding, that communicates with a remote service to obtain embeddings for given text inputs. This function signifies the bridge between SingleStoreDB and machine learning models, facilitating the direct insertion of complex embeddings into the database.

  • About External Functions: External functions in SingleStoreDB allow for operations to be executed outside the database process, supporting both scalar and table-valued returns. This feature is instrumental in integrating machine learning insights into SQL workflows, enhancing data with vector embeddings from models like OpenAI or Hugging Face. Check more about external functions here

  • Code for external function: To explore and test the demo, please navigate to the 'singlestore spaces' repository. Within the designated notebook's repository, you will discover the relevant code file external_function_api.py. This code is ready for use and has been set up to facilitate an interactive demonstration. For quicker access check appendix at the end of the notebook

In [6]:

%%sql
CREATE OR REPLACE EXTERNAL FUNCTION get_embedding(str1 TEXT, model_name TEXT) RETURNS TEXT
AS REMOTE SERVICE 'http://<your_ip_address_where_api_is_running>:5000/functions/get_embedding'
FORMAT JSON;
-- Test external function
-- SELECT get_embedding("blueberry", 'openai_embedding')AS res ;

Dynamic Data Insertion with InsertDynamicData_1

A stored procedure, InsertDynamicData_1, dynamically constructs and executes an SQL query to insert generated embeddings into the random_sentences_embeddings_2 table. This process exemplifies the seamless integration of machine learning embeddings into database records, leveraging the get_embedding external function.

In [7]:

%%sql
CREATE OR REPLACE PROCEDURE InsertDynamicData_1(source_table_id TEXT, target_table_id TEXT, source_column_id TEXT, embedding_model_id TEXT)
AS
DECLARE
sql_query TEXT;
BEGIN
sql_query = CONCAT('INSERT INTO ', target_table_id, ' (uuid, sentence, model_id, embedding) SELECT uuid, sentence, ''', embedding_model_id, ''' ,JSON_ARRAY_PACK(get_embedding(', source_column_id, ', ''', embedding_model_id, ''')) FROM ', source_table_id);
EXECUTE IMMEDIATE sql_query;
END;

In [8]:

%%sql
CALL InsertDynamicData_1('random_sentences', 'random_sentences_embeddings_2', 'sentence', 'openai_embedding');

In [9]:

%%sql
CALL InsertDynamicData_1('random_sentences', 'random_sentences_embeddings_2', 'sentence', 'hf_embedding');

In [10]:

%%sql
select COUNT(*) from random_sentences_embeddings_2;

Advanced Embedding Insertion with InsertDynamicData_2

Another stored procedure, InsertDynamicData_2, is designed to handle multiple embedding models, inserting their outputs into designated vector columns within the random_sentences_embeddings table. This procedure illustrates the flexibility and power of SingleStoreDB in accommodating complex data types like vectors, directly derived from machine learning embeddings.

In [11]:

%%sql
CREATE OR REPLACE PROCEDURE InsertDynamicData_2(source_table_id TEXT, target_table_id TEXT, source_column_id TEXT)
AS
DECLARE
sql_query TEXT;
BEGIN
sql_query = CONCAT(
'INSERT INTO ', target_table_id,
' (uuid, sentence, openai_ada002_embeddings, hf_miniLM_L12_v2_embeddings) ',
'SELECT uuid, sentence, ',
'get_embedding(', source_column_id, ', ''openai_embedding''), ',
'get_embedding(', source_column_id, ', ''hf_embedding'') ',
'FROM ', source_table_id
);
EXECUTE IMMEDIATE sql_query;
END;

In [12]:

%%sql
CALL InsertDynamicData_2('random_sentences', 'random_sentences_embeddings', 'sentence');

In [13]:

%%sql
SELECT COUNT(*) FROM random_sentences_embeddings;

Lets try to query these embeddings to get matching score using dot_product

In [14]:

%%sql
SELECT
r1.sentence AS sentence1,
r2.sentence AS sentence2,
ROUND( DOT_PRODUCT(r1.openai_ada002_embeddings, r2.openai_ada002_embeddings), 2) AS openai_score,
ROUND( DOT_PRODUCT(r1.hf_miniLM_L12_v2_embeddings, r2.hf_miniLM_L12_v2_embeddings), 2) AS hf_score
FROM
random_sentences_embeddings r1,
random_sentences_embeddings r2
ORDER BY
openai_score DESC, hf_score DESC;

Appendix

Code for external function API.

In [15]:

%%writefile external_function_api.py
import json
import time
from concurrent.futures import ThreadPoolExecutor
import openai
import torch
from flask import Flask
from flask import request
from openai import OpenAI
from transformers import AutoModel
from transformers import AutoTokenizer
# Set up OpenAI
api_key = 'add your openai key'
client = OpenAI(api_key=api_key)
# Load Hugging Face model
model_name = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
hf_model = AutoModel.from_pretrained(model_name)
hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
# Hugging Face embedding function
def get_hf_embedding(texts):
embeddings = []
for text in texts:
inputs = hf_tokenizer(text, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
embedding = hf_model(**inputs).last_hidden_state.mean(dim=1)
norm = torch.linalg.vector_norm(embedding, ord=2, dim=1, keepdim=True)
normalized_embedding = embedding / norm
embeddings.append(normalized_embedding.squeeze().tolist())
return embeddings
# OpenAI embedding function
def get_ada_002_embedding(texts, model='text-embedding-ada-002'):
responses = openai.embeddings.create(input=texts, model=model)
return [response.embedding for response in responses.data]
def process_batch(batch, model_name):
texts = [text for text in batch if isinstance(text, str) and text.strip()]
if not texts:
return []
if model_name == 'openai_embedding':
try:
return get_ada_002_embedding(texts, 'text-embedding-ada-002')
except Exception as e:
print(f'Error in OpenAI processing: {e}')
return []
elif model_name == 'hf_embedding':
return get_hf_embedding(texts)
else:
print(f'Invalid model name: {model_name}')
return []
app = Flask(__name__)
@app.route('/functions/get_embedding', methods=['POST'])
def get_embedding():
""" incoming data is this format :
{"data":
[[<row id>, <data string >, <model_name string>],
[<row id>, <data string >, <model_name string>],
... ]}
"""
start_time = time.time()
row_ids, args, model_names = [], [], []
for row_id, data, model_name in request.json['data']:
row_ids.append(row_id)
args.append(data)
model_names.append(model_name)
batch_size = 1024
futures = []
with ThreadPoolExecutor(max_workers=len(args) // batch_size) as executor:
for i in range(0, len(args), batch_size):
batch = args[i:i + batch_size]
# Assuming all texts in the batch use the same model
model_name = model_names[i]
futures.append(executor.submit(process_batch, batch, model_name))
flat_results = [future.result() for future in futures]
time_taken = time.time() - start_time
app.logger.info(f'Time taken: {time_taken} seconds')
res = map(json.dumps, flat_results)
return dict(data=list(zip(row_ids, res)))
if __name__ == '__main__':
app.run(debug=True, host='0.0.0.0', port=5000)

Details

Tags

#vectordb#genai#openai#rag#huggingface

License

This Notebook has been released under the Apache 2.0 open source license.