Semantic Visualization and Vector Datatype

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

SingleStoreDB supports vector database processing, which allows you to store and search vector data. Vectors usually come from objects: text, images, video, audio, etc. In a vector space model, words with similar meanings, such as "happy" and "joyful," are represented by vectors that lie in proximity, reflecting their semantic similarity. Vector database searches find data based on its content or meaning, even without exact matches.
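
As a quick illustration of the idea (toy 3-dimensional vectors, not real model output), vectors pointing in similar directions score higher under cosine similarity:

import numpy as np

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions
happy = np.array([0.9, 0.1, 0.2], dtype=np.float32)
joyful = np.array([0.85, 0.15, 0.25], dtype=np.float32)
table = np.array([0.1, 0.9, 0.4], dtype=np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(happy, joyful))  # ~0.996: semantically close
print(cosine_similarity(happy, table))   # ~0.28: less related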

1. Create a workspace in your workspace group

S-00 is sufficient.

2. Create a database named db_vector

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

In [1]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS db_vector;
    %sql CREATE DATABASE db_vector;

Action Required

Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Create a table named words and insert the words into it.

In [2]:

%%sql
CREATE TABLE words(word varchar(25));
INSERT INTO words VALUES ("red"), ("potatoes"), ("soda"), ("cheese"),
("water"), ("blue"), ("crispy"), ("hamburger"),
("coffee"), ("green"), ("milk"), ("la croix"),
("yellow"), ("chocolate"), ("french fries"),
("latte"), ("cake"), ("brown"), ("cheeseburger"),
("espresso"), ("cheesecake"), ("black"), ("mocha"),
("fizzy"), ("carbon"), ("banana"), ("sunshine"),
("orange carrot"), ("sun"), ("hay"), ("cookies"),
("fish"), ('king'), ('man'), ('woman'), ('queen'),
('Paris'), ('France'), ('Poland'), ('Warsaw'),
('prince'), ('throne'), ('Elizabeth'), ('ruler');

In [3]:

%%sql
SHOW TABLES EXTENDED;

3. Install and import required libraries

In this section, we set up the environment by installing the required libraries.

The install process may take a couple of minutes.

In [4]:

!pip3 install --upgrade sentence-transformers torch tensorflow pandarallel --quiet

Import several libraries for data manipulation (e.g., Pandas, NumPy), database connectivity (SQLAlchemy, SingleStoreDB), machine learning (PyTorch, Transformers), and parallel processing (pandarallel).

In [5]:

import json
import ibis
import numpy as np
import pandas as pd
import sqlalchemy as sa
import singlestoredb as s2
import torch
from pandarallel import pandarallel
from transformers import AutoTokenizer
from transformers import AutoModel
from sqlalchemy import *

# connection_url is provided by the notebook environment for the selected database
db_connection = create_engine(connection_url)
pandarallel.initialize(nb_workers=2, progress_bar=True)

4. Load the Sentence Transformers model

Load the model along with its tokenizer, making them ready for use in tasks like sentence embeddings or similarity calculations.

In [6]:

model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

5. Load the data into a DataFrame

Load the data from the words table into a pandas DataFrame.

In [7]:

result = %sql select * from words;
df = pd.DataFrame(result)
df

6. Function to retrieve the embedding

This function, named get_embedding, takes a sentence as input and returns its embedding using the pre-trained tokenizer and model. It tokenizes the sentence, runs it through the model, mean-pools the last hidden state across tokens, and returns the result as a NumPy array with the float32 data type.

In [8]:

import numpy as np

def get_embedding(sentence: str) -> np.ndarray[np.float32]:
    """Retrieve embedding for given sentence."""
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Mean-pool the last hidden state across tokens to get one vector per sentence
        embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().tolist()
    return np.array(embedding, dtype='<f4')
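
As a quick sanity check (the word here is just an example), each embedding from this MiniLM model should have 384 dimensions, which matters later when declaring the vector column:

emb = get_embedding("sunshine")
print(emb.shape, emb.dtype)  # expected: (384,) float32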

7. Apply the function get_embedding

This adds a new column word_embeddings to the DataFrame df by applying the get_embedding function in parallel to the word column, computing and storing an embedding for each word. The parallel_apply function leverages pandarallel's parallel processing for efficient computation.

In [9]:

# Apply the function
df['word_embeddings'] = df['word'].parallel_apply(get_embedding)

The code below writes the DataFrame df to a SQL table named words_table using the SingleStoreDB connection (s2). It replaces the table if it already exists, does not include an index column, and stores the word_embeddings column as LargeBinary (a blob).

In [10]:

# Create the table with vector embeddings
df.to_sql(
    'words_table',
    s2.create_engine().connect(),
    if_exists='replace',
    index=False,
    dtype=dict(word_embeddings=sa.LargeBinary),
)
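
As an optional round-trip check (a sketch, assuming the blob holds the raw little-endian float32 buffer written above), the stored bytes can be decoded back into a NumPy array:

# Fetch one stored embedding and reinterpret its bytes as float32
with db_connection.connect() as conn:
    row = conn.execute(sa.text("SELECT word_embeddings FROM words_table LIMIT 1")).fetchone()
emb = np.frombuffer(row[0], dtype='<f4')
print(emb.shape)  # expected: (384,)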

8. Visualizing words

The code below transforms the word embeddings stored in the word_embeddings column of the DataFrame df using t-SNE (t-Distributed Stochastic Neighbor Embedding), reducing the dimensionality to 2 components. The transformed data is stored in the variable vis_dims, a two-dimensional representation of the word embeddings suitable for plotting.

In [11]:

matrix = df["word_embeddings"]
matrix = matrix.tolist()
matrix

In [12]:

from sklearn.manifold import TSNE

matrix = np.asarray(matrix)

# Create a t-SNE model and transform the data
tsne = TSNE(n_components=2, perplexity=10, random_state=42, init='random', learning_rate=200)
vis_dims = tsne.fit_transform(matrix)
vis_dims.shape

Install the Matplotlib library using the pip package manager, allowing for the visualization of data and plots in Python.

In [13]:

!pip install matplotlib

Use Matplotlib to create a scatter plot visualizing the 2D representation of the word embeddings obtained from t-SNE. The code iterates through each word in the DataFrame df, extracts its coordinates from vis_dims, and annotates the corresponding point on the scatter plot with the word label. Finally, the plot is displayed using plt.show().

In [14]:

import matplotlib.pyplot as plt

# Collect the word labels and the 2D t-SNE coordinates
labels = []
x = [x for x, y in vis_dims]
y = [y for x, y in vis_dims]
for word in range(len(df)):
    first_value = df['word'].iat[word]
    labels.append(first_value)
print(len(df))

plt.scatter(x, y, alpha=0.3)
for i in range(len(x)):
    plt.scatter(x[i], y[i])
    plt.annotate(
        labels[i],
        xy=(x[i], y[i]),
        xytext=(5, 2),
        textcoords='offset points',
        ha='right',
        va='bottom',
    )
plt.show()

You can insert words of your choice in the words table and execute all the cells above to visualize the semantic patterns.
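
For example, a minimal cell along these lines (the words here are just placeholders) adds new rows before re-running:

%%sql
INSERT INTO words VALUES ('tea'), ('river'), ('mountain');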

9. Introducing the Vector Datatype

We can see below that the word_embeddings column currently has the blob datatype.

In [15]:

%%sql
DESC words_table;

In [16]:

%%sql
SELECT word, word_embeddings FROM words_table LIMIT 2;

The query below selects the word column and the hexadecimal representation of the word_embeddings column for the first row in the table, using the LIMIT 1 clause.

In [17]:

%%sql
SELECT word, HEX(word_embeddings) FROM words_table LIMIT 1;

The query below extracts the word column and unpacks the JSON array stored in the word_embeddings column for the first row of words_table, providing a more readable representation of the word embeddings.

In [18]:

%%sql
SELECT word, JSON_ARRAY_UNPACK(word_embeddings) FROM words_table LIMIT 1;

10. Transition from BLOB to Vector datatype

1. Add a new vector column to the right of the blob column.

2. Update the vector column with the data from the blob column.

3. Drop the blob column.

4. Rename the new vector column to the old blob column name. This will ensure any previous queries still work, or at least require fewer changes.

First, check how many elements the existing blob embeddings contain; this determines the dimension to declare for the new vector column (384 for this model).

In [19]:

%%sql
SELECT VECTOR_NUM_ELEMENTS(word_embeddings) FROM words_table LIMIT 1;

In [20]:

%%sql
ALTER TABLE words_table ADD COLUMN emb2 vector(384) AFTER word_embeddings;
UPDATE words_table SET emb2=word_embeddings;

In [21]:

%%sql
SELECT word, emb2, JSON_ARRAY_UNPACK(word_embeddings) FROM words_table LIMIT 1;

In [22]:

%%sql
ALTER TABLE words_table DROP COLUMN word_embeddings;
ALTER TABLE words_table CHANGE emb2 word_embeddings;

In [23]:

%%sql
DESC words_table;

11. Semantic Search of the word 'sunshine' using an Infix Operator

Performing a semantic search for the word 'sunshine' to find contextually similar or related words and phrases based on their semantic meanings rather than exact lexical matches.

The infix operators <*> and <-> can be used to facilitate DOT_PRODUCT and EUCLIDEAN_DISTANCE operations, respectively, providing a more concise query syntax compared to using the existing built-in functions such as DOT_PRODUCT(a, b) and EUCLIDEAN_DISTANCE(a, b).

In [24]:

%%sql
SELECT word_embeddings INTO @c FROM words_table WHERE word LIKE 'sunshine%';
SELECT word, (@c <*> word_embeddings) AS score
FROM words_table
ORDER BY score DESC
LIMIT 3;
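
As a sketch of the second operator mentioned above, the same search can be written with <-> (EUCLIDEAN_DISTANCE); a smaller distance means a closer match, so results are ordered ascending:

%%sql
SELECT word_embeddings INTO @c FROM words_table WHERE word LIKE 'sunshine%';
SELECT word, (@c <-> word_embeddings) AS distance
FROM words_table
ORDER BY distance ASC
LIMIT 3;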

Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [25]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS db_vector;

Details

Tags

#starter #vectordb #genai #huggingface #semanticsearch

License

This Notebook has been released under the Apache 2.0 open source license.