Semantic Search with Hugging Face Models and Datasets

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this notebook.

In this notebook, we will demonstrate an example of conducting semantic search on SingleStoreDB with SQL! Unlike traditional keyword-based search methods, semantic search algorithms take into account the relationships between words and their meanings, enabling them to deliver more accurate and relevant results – even when search terms are vague or ambiguous.

SingleStoreDB’s built-in parallelization and Intel SIMD-based vector processing take care of the heavy lifting involved in processing vector data. This allows you to run your ML algorithms right in your database, extremely efficiently, with just one line of SQL!

In this example, we use a Hugging Face model to create embeddings for our dataset and run a semantic search using the dot_product vector matching function!

1. Create a workspace in your workspace group

S-00 is sufficient.

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

2. Create a database named semantic_search

In [1]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS semantic_search;
    %sql CREATE DATABASE semantic_search;

Action Required

Make sure to select the semantic_search database from the drop-down menu at the top of this notebook. It updates the connection_url, which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.

3. Install and import required libraries

We will use an embedding model on Hugging Face with the Sentence Transformers library. We will run semantic search over reviews that viewers left for selected movies. This dataset is available on Hugging Face, and to use it we will need the datasets library.

The install process may take a couple of minutes.

In [2]:

!pip3 install --upgrade sentence-transformers torch tensorflow datasets --quiet
import json
import ibis
import numpy as np
import pandas as pd
import sqlalchemy as sa
import singlestoredb as s2
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import AutoModel

4. Load Sentence Transformer library and create a function called get_embedding()

To vectorize and embed the reviews that viewers of the movies left, we will use the sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model. We will create a function called get_embedding() that calls this model and returns the vectorized version of the sentence.

In [3]:

# Load Sentence Transformers model
model_name = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Add a function to compute the embedding. The result will be a numpy array of 32-bit floats.

In [4]:

def get_embedding(sentence: str) -> np.ndarray[np.float32]:
    """Retrieve embedding for given sentence."""
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        embedding = model(**inputs).last_hidden_state.mean(dim=1).squeeze().tolist()
    return np.array(embedding, dtype='<f4')
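
As a quick sanity check, you can embed a short test sentence and inspect the result. This is a minimal sketch (the example sentence is arbitrary); this MiniLM model produces 384-dimensional vectors.

# Embed a test sentence and check the output shape and dtype
vec = get_embedding('A heartwarming story with great acting.')
print(vec.shape, vec.dtype)  # expected: (384,) float32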

5. Load the dataset on movie reviews from Hugging Face into a DataFrame

We will sample 100 random reviews from the "train" split of the imdb-movie-reviews dataset.

In [5]:

# Load the dataset into a pandas DataFrame
dataset = load_dataset("ajaykarthick/imdb-movie-reviews")
dataframe = dataset["train"].to_pandas()
sample_size = 100 # Adjust the desired sample size
random_sample = dataframe.sample(n=sample_size)
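
Before embedding, you can optionally peek at the sample to confirm it has the columns we expect (this check is an addition, not part of the original flow):

# Inspect the sampled rows: column names and the start of one review
print(random_sample.columns.tolist())
print(random_sample['review'].iloc[0][:200])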

6. Generate embeddings of the reviews left by customers and add them to your DataFrame

We want to embed the entries in the review column and add the embeddings to the data. We will do this with pandas and our get_embedding() function. Embeddings are stored as a numpy array.

In [6]:

random_sample['review_embeddings'] = random_sample['review'].apply(get_embedding)
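
Optionally, you can verify that every review produced an embedding of the same dimension; a minimal check:

# All embeddings should be float32 vectors of the same length (384 for this model)
lengths = random_sample['review_embeddings'].map(len)
assert lengths.nunique() == 1, 'embeddings have inconsistent dimensions'
print('embedding dimension:', lengths.iloc[0])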

7. Insert data into SingleStoreDB

You can seamlessly bring this data into your SingleStoreDB table directly from your DataFrame. SingleStore ♥️ Python.

We will bring this data into a table called reviews. Notice how you don't have to write any SQL for this: we will infer the schema from your DataFrame and, under the hood, configure how to bring this DataFrame into our database. Since numpy arrays don't map directly to a database type, we give pandas a type hint to create a blob column for the review_embeddings column.

In [7]:

random_sample.to_sql('reviews',
                     s2.create_engine().connect(),
                     if_exists='replace',
                     index=False,
                     dtype=dict(review_embeddings=sa.LargeBinary))
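
To verify the insert, you can count the rows in the new table; the result should match sample_size. A quick optional check:

# Confirm that all sampled rows landed in the `reviews` table
row_count = %sql SELECT COUNT(*) FROM reviews;
print(row_count)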

In [8]:

# Create a database connection and display the `CREATE TABLE` statement
conn = s2.connect()
conn.show.create_table('reviews')

8. Run the semantic search algorithm with just one line of SQL

We will utilize SingleStoreDB's distributed architecture to efficiently compute the dot product of the input string's embedding (computed from searchstring) against each entry in the database and return the top 5 reviews with the highest score. When two vectors are normalized to length 1, their dot product equals their cosine similarity, an appropriate nearness metric. SingleStoreDB makes this extremely fast because it compiles queries to machine code and runs dot_product using SIMD instructions.
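
Note that get_embedding() above applies mean pooling without an explicit normalization step. If you want DOT_PRODUCT to be exactly cosine similarity, you could L2-normalize each vector before storing and searching; a minimal sketch of such a helper (an addition, not part of the original flow):

def normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# For example, the query embedding could then be prepared as:
# search_embedding = normalize(get_embedding(searchstring)).tobytes().hex()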

In [9]:

searchstring = input('Please enter a search string:')
search_embedding = get_embedding(searchstring).tobytes().hex()
results = %sql SELECT review, DOT_PRODUCT(review_embeddings, X'{{search_embedding}}') AS Score \
    FROM reviews ORDER BY Score DESC LIMIT 5;
print()
for i, res in enumerate(results):
    print(f'{i + 1}: {res[0]} Score: {res[1]:.2f}\n')

9. Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [10]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS semantic_search;

Details

Tags

#starter #vectordb #huggingface

License

This Notebook has been released under the Apache 2.0 open source license.