Semantic Visualization and Vector Datatype
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
SingleStoreDB supports vector database processing, which allows you to store and search vector data. Vectors usually come from objects: text, images, video, audio, etc. In a vector space model, words with similar meanings, such as "happy" and "joyful," are represented by vectors that lie in proximity, reflecting their semantic similarity. Vector database searches find data based on its content or meaning, even without exact matches.
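As a toy illustration of this idea (hypothetical 3-dimensional vectors, not real model output), cosine similarity scores semantically close vectors higher than distant ones:

import numpy as np

# Hypothetical toy embeddings; real models produce hundreds of dimensions
happy = np.array([0.9, 0.1, 0.3])
joyful = np.array([0.8, 0.2, 0.3])
table = np.array([0.1, 0.9, 0.7])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(happy, joyful))  # high: semantically close
print(cosine(happy, table))   # lower: semantically distant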
1. Create a workspace in your workspace group
S-00 is sufficient.
2. Create a database named db_vector
Action Required
If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
In [1]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS db_vector;
    %sql CREATE DATABASE db_vector;
Action Required
Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Create the table words and insert the words into the table.
In [2]:
%%sql
CREATE TABLE /* Creating table for sample data. */ words(word varchar(25));

INSERT INTO words VALUES ("red"), ("potatoes"), ("soda"), ("cheese"),
("water"), ("blue"), ("crispy"), ("hamburger"),
("coffee"), ("green"), ("milk"), ("la croix"),
("yellow"), ("chocolate"), ("french fries"),
("latte"), ("cake"), ("brown"), ("cheeseburger"),
("espresso"), ("cheesecake"), ("black"), ("mocha"),
("fizzy"), ("carbon"), ("banana"), ("sunshine"),
("orange carrot"), ("sun"), ("hay"), ("cookies"),
("fish"), ('king'), ('man'), ('woman'), ('queen'),
('Paris'), ('France'), ('Poland'), ('Warsaw'),
('prince'), ('throne'), ('Elizabeth'), ('ruler');
In [3]:
%%sql
SHOW TABLES EXTENDED;
3. Install and import required libraries
In this section, we will set up the necessary environment by installing the required libraries. The install process may take a couple of minutes.
In [4]:
!pip3 install --upgrade sentence-transformers torch tensorflow pandarallel --quiet
Import several libraries for data manipulation (e.g., Pandas, NumPy), database connectivity (SQLAlchemy, SingleStoreDB), machine learning (PyTorch, Transformers), and parallel processing (pandarallel).
In [5]:
import json
import ibis
import numpy as np
import pandas as pd
import sqlalchemy as sa
import singlestoredb as s2
import torch
from pandarallel import pandarallel
from transformers import AutoTokenizer
from transformers import AutoModel
from sqlalchemy import *

db_connection = create_engine(connection_url)
pandarallel.initialize(nb_workers=2, progress_bar=True)
4. Load the Sentence Transformer model along with its tokenizer
This loads the model and its tokenizer, making them ready for use in tasks like sentence embeddings or similarity calculations.
In [6]:
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
5. Load the data from the database table into a DataFrame
In [7]:
result = %sql select * from words;
df = pd.DataFrame(result)
df
6. Function to retrieve the embedding
This function, named get_embedding, takes a sentence as input and returns its embedding using the pre-trained tokenizer and model. It tokenizes the sentence, runs it through the model, mean-pools the last hidden state across tokens, and returns the resulting embedding as a NumPy array with a float32 data type.
In [8]:
import numpy as np

def get_embedding(sentence: str) -> np.ndarray:
    """Retrieve embedding for given sentence."""
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool the last hidden state across tokens into a single vector
        embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().tolist()
    return np.array(embedding, dtype='<f4')
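As a quick sanity check (a sketch, assuming the model and tokenizer cells above have run; the MiniLM model used here produces 384-dimensional vectors):

emb = get_embedding("sunshine")
print(emb.dtype, emb.shape)  # expected: float32 (384,)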
7. Apply the function get_embedding
This adds a new column word_embeddings to the DataFrame df by applying the function get_embedding in parallel to the word column, calculating and storing an embedding for each word. The parallel_apply function from pandarallel leverages parallel processing for efficient computation.
In [9]:
# Apply the function
df['word_embeddings'] = df['word'].parallel_apply(get_embedding)
The code below writes the DataFrame df to a SQL table named words_table using SingleStoreDB (s2). It replaces the table if it already exists, does not include an index column, and specifies the data type for the word_embeddings column as LargeBinary.
In [10]:
# Create the table with vector embeddings
df.to_sql(
    'words_table',
    s2.create_engine().connect(),
    if_exists='replace',
    index=False,
    dtype=dict(word_embeddings=sa.LargeBinary),
)
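Because the column is declared as LargeBinary, each embedding is stored as raw little-endian float32 bytes. A minimal sketch of that round trip, using pure NumPy and a toy vector, independent of the database:

import numpy as np

vec = np.array([0.1, 0.2, 0.3], dtype='<f4')  # toy 3-dimensional example
blob = vec.tobytes()                          # the bytes that land in the BLOB column
restored = np.frombuffer(blob, dtype='<f4')   # decode back to float32 values
assert np.array_equal(vec, restored)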
8. Visualizing words
The code below transforms the word embeddings stored in the word_embeddings column of DataFrame df using t-SNE (t-Distributed Stochastic Neighbor Embedding), reducing the dimensionality to 2 components. The transformed data is stored in the variable vis_dims, a two-dimensional representation of the word embeddings suitable for plotting.
In [11]:
matrix = df["word_embeddings"]
matrix = matrix.tolist()
matrix
In [12]:
matrix = np.asarray(matrix)

from sklearn.manifold import TSNE

# Create a t-SNE model and transform the data
tsne = TSNE(n_components=2, perplexity=10, random_state=42, init='random', learning_rate=200)
vis_dims = tsne.fit_transform(matrix)
vis_dims.shape
Install the Matplotlib library using the pip package manager, allowing for the visualization of data and plots in Python.
In [13]:
!pip install matplotlib
Use Matplotlib to create a scatter plot visualizing the 2D representation of the word embeddings obtained from t-SNE. The code iterates through each word in the DataFrame df, extracts its coordinates from vis_dims, and annotates the corresponding point on the scatter plot with the word label. Finally, the plot is displayed using plt.show().
In [14]:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np

labels = []
x = [x for x, y in vis_dims]
y = [y for x, y in vis_dims]

for word in range(len(df)):
    first_value = df['word'].iat[word]
    labels.append(first_value)
print(len(df))

plt.scatter(x, y, alpha=0.3)
for i in range(len(x)):
    plt.scatter(x[i], y[i])
    plt.annotate(
        labels[i],
        xy=(x[i], y[i]),
        xytext=(5, 2),
        textcoords='offset points',
        ha='right',
        va='bottom',
    )
plt.show();
You can insert words of your choice in the words table and execute all the cells above to visualize the semantic patterns.
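For example, a hypothetical extra insert (any words work) before re-running the cells:

%%sql
INSERT INTO words VALUES ("tea"), ("ocean"), ("winter");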
9. Introducing Vector Datatype
We can see below that the word_embeddings column has the blob datatype.
In [15]:
%%sql
DESC words_table;
In [16]:
%%sql
SELECT word, word_embeddings FROM words_table LIMIT 2;
The line of code below executes a SQL query on the words_table, selecting the word column and the hexadecimal representation of the word_embeddings column for the first row in the table using the LIMIT 1 clause.
In [17]:
%%sql
SELECT word, HEX(word_embeddings) FROM words_table LIMIT 1;
The query below extracts the word column and unpacks the JSON array stored in the word_embeddings column for the first row in the words_table, providing a more readable representation of the word embeddings.
In [18]:
%%sql
SELECT word, JSON_ARRAY_UNPACK(word_embeddings) FROM words_table LIMIT 1;
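Equivalently, the blob can be decoded client-side. A sketch, assuming the db_connection engine created in the imports cell:

import numpy as np
import pandas as pd

# Read one raw blob and reinterpret it as little-endian float32 values
row = pd.read_sql("SELECT word, word_embeddings FROM words_table LIMIT 1", db_connection)
emb = np.frombuffer(row['word_embeddings'][0], dtype='<f4')
print(row['word'][0], emb[:5])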
10. Transition from BLOB to Vector datatype
1. Add a new vector column to the right of the blob column.
2. Update the vector column with the data from the blob column.
3. Drop the blob column.
4. Rename the new vector column to the old blob column name. This will ensure any previous queries will still work, or at least require fewer changes.
In [19]:
%%sql
SELECT VECTOR_NUM_ELEMENTS(word_embeddings) FROM words_table LIMIT 1;
In [20]:
%%sql
ALTER TABLE words_table ADD COLUMN emb2 vector(384) AFTER word_embeddings;
UPDATE words_table SET emb2=word_embeddings;
In [21]:
%%sql
SELECT word, emb2, JSON_ARRAY_UNPACK(word_embeddings) FROM words_table LIMIT 1;
In [22]:
%%sql
ALTER TABLE words_table DROP COLUMN word_embeddings;
ALTER TABLE words_table CHANGE emb2 word_embeddings;
In [23]:
%%sql
DESC words_table;
11. Semantic search of the word 'sunshine' using an infix operator
Performing a semantic search for the word 'sunshine' to find contextually similar or related words and phrases based on their semantic meanings rather than exact lexical matches.
The infix operators <*> and <-> can be used to perform DOT_PRODUCT and EUCLIDEAN_DISTANCE operations, respectively, providing a more concise query syntax than the equivalent built-in functions DOT_PRODUCT(a, b) and EUCLIDEAN_DISTANCE(a, b).
In [24]:
%%sql
SELECT word_embeddings INTO @c FROM words_table WHERE word LIKE 'sunshine%';

SELECT word, (@c <*> word_embeddings) AS score
FROM words_table
ORDER BY score DESC
LIMIT 3;
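For intuition, the score above is just a dot product. A client-side sketch of the same quantity (assuming the get_embedding function defined earlier):

import numpy as np

# DOT_PRODUCT and the <*> operator compute this same value inside the database
a = get_embedding("sunshine")
b = get_embedding("sun")
print(float(np.dot(a, b)))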
Clean up
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [25]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS db_vector;
Details
About this Template
SingleStoreDB leverages vector database processing to enable storage and search of vector data, allowing for semantic similarity-based searches to find content based on meaning rather than exact matches, such as locating related words like "happy" and "joyful."
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.