Hybrid Search
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this notebook.
Source: OpenAI Cookbook
Hybrid search integrates keyword-based search and semantic search to combine the strengths of both and give users a more comprehensive and efficient search experience. This notebook shows how to perform hybrid search with SingleStore's database and notebooks.
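At its core, the hybrid ranking used later in this notebook is just the average of a semantic similarity score and a full-text keyword score. The sketch below illustrates that idea with made-up placeholder scores; the real scores come from the SQL query in the Hybrid search section.

# Illustrative sketch only: the scores below are placeholders, not real query results.
articles = [
    {'title': 'A', 'semantic_score': 0.62, 'keyword_score': 1.10},
    {'title': 'B', 'semantic_score': 0.71, 'keyword_score': 0.00},
    {'title': 'C', 'semantic_score': 0.40, 'keyword_score': 2.30},
]

for article in articles:
    # Hybrid score = simple average of the semantic and keyword scores
    article['combined_score'] = (article['semantic_score'] + article['keyword_score']) / 2

# Rank articles by the combined score, highest first
ranked = sorted(articles, key=lambda a: a['combined_score'], reverse=True)
print([a['title'] for a in ranked])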
Setup
Let's first install the necessary libraries.
In [1]:
%pip install wget openai==1.3.3 --quiet
In [2]:
import json
import os

import pandas as pd
import wget
In [3]:
# Import the library for vectorizing the data (Up to 2 minutes)
%pip install sentence-transformers --quiet

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')
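If you want a quick sanity check that the model loaded correctly, you can encode a single sentence and inspect the result; this mpnet-base model should produce a 768-dimensional vector (the sample sentence here is arbitrary).

# Optional sanity check: encode one sentence and inspect the embedding shape
sample_embedding = model.encode('SingleStore supports hybrid search.')
print(sample_embedding.shape)  # expected to be (768,) for this mpnet-base model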
Import data from CSV file
This CSV file holds the title, summary, and category of approximately 2,000 news articles.
In [4]:
# Download the AG News sample CSV file
csv_file_path = 'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv'
file_path = 'AG_news_samples.csv'

if not os.path.exists(file_path):
    wget.download(csv_file_path, file_path)
    print('File downloaded successfully.')
else:
    print('File already exists in the local file system.')
In [5]:
df = pd.read_csv('AG_news_samples.csv')
df
In [6]:
data = df.to_dict(orient='records')
data[0]
Action Required
If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
Set up the database
Set up the SingleStoreDB database that will hold your data.
In [7]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;
    %sql CREATE DATABASE news;
Action Required
Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.
In [8]:
%%sql
DROP TABLE IF EXISTS news_articles;
CREATE TABLE IF NOT EXISTS news_articles /* Creating table for sample data. */ (
    title TEXT,
    description TEXT,
    genre TEXT,
    embedding BLOB,
    FULLTEXT (title, description)
);
Get embeddings for every row based on the description column
In [9]:
# Will take around 3.5 minutes to get embeddings for all 2000 rows
descriptions = [row['description'] for row in data]
all_embeddings = model.encode(descriptions)
all_embeddings.shape
Merge embedding values into data rows.
In [10]:
for row, embedding in zip(data, all_embeddings):
    row['embedding'] = embedding
Here's an example of one row of the combined data.
In [11]:
data[0]
Populate the database
In [12]:
%sql TRUNCATE TABLE news_articles;

import sqlalchemy as sa
from singlestoredb import create_engine

# Use create_engine from singlestoredb since it uses the notebook connection URL
conn = create_engine().connect()

statement = sa.text('''
    INSERT INTO news.news_articles (
        title,
        description,
        genre,
        embedding
    )
    VALUES (
        :title,
        :description,
        :label,
        :embedding
    )
''')

conn.execute(statement, data)
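Optionally, a quick row count can confirm the insert succeeded; the exact number depends on the CSV, but it should be roughly 2,000.

# Optional sanity check: count the rows that were just inserted
row_count = conn.execute(sa.text('SELECT COUNT(*) FROM news.news_articles')).scalar()
print(row_count)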
Semantic search
Connect to OpenAI
In [13]:
import openai

EMBEDDING_MODEL = 'text-embedding-ada-002'
GPT_MODEL = 'gpt-3.5-turbo'
In [14]:
import getpass

openai.api_key = getpass.getpass('OpenAI API Key: ')
Run semantic search and get scores
In [15]:
search_query = 'Articles about Aussie captures'
search_embedding = model.encode(search_query)

# Create the SQL statement.
query_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS score
    FROM news.news_articles
    ORDER BY score DESC
    LIMIT 10
''')

# Execute the SQL statement.
results = pd.DataFrame(conn.execute(query_statement, dict(embedding=search_embedding)))
results
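As an optional cross-check, a score can be reproduced locally with NumPy, since DOT_PRODUCT in the query is simply the dot product of a stored embedding and the search embedding. This assumes all_embeddings[0] is the embedding that was inserted for data[0].

import numpy as np

# Optional cross-check: recompute the semantic score for the first article locally
local_score = float(np.dot(all_embeddings[0], search_embedding))
print(data[0]['title'], local_score)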
Hybrid search
This search averages the score from the semantic search and the score from the keyword search, then sorts the news articles by this combined score to produce an effective hybrid result.
In [16]:
hyb_query = 'Articles about Aussie captures'
hyb_embedding = model.encode(hyb_query)

# Create the SQL statement.
hyb_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS semantic_score,
        MATCH(title, description) AGAINST (:query) AS keyword_score,
        (semantic_score + keyword_score) / 2 AS combined_score
    FROM news.news_articles
    ORDER BY combined_score DESC
    LIMIT 10
''')

# Execute the SQL statement.
hyb_results = pd.DataFrame(conn.execute(hyb_statement, dict(embedding=hyb_embedding, query=hyb_query)))
hyb_results
Clean up
Action Required
If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.
In [17]:
shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;
Details
About this Template
Hybrid search combines keyword search with semantic search, aiming to provide more accurate results.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.