New

Hybrid Search

Notebook

SingleStore Notebooks

Hybrid Search

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this notebook.

Source: OpenAI Cookbook

Hybrid search combines keyword-based search and semantic search, drawing on the strengths of each to give users a more comprehensive and efficient search experience. This notebook is an example of how to perform hybrid search with SingleStore's database and notebooks.

Setup

Let's first install the necessary libraries.

In [1]:

%pip install wget openai==1.3.3 --quiet

In [2]:

import json
import os
import pandas as pd
import wget

In [3]:

# Import the library for vectorizing the data (Up to 2 minutes)
!pip install sentence-transformers --quiet
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')

Import data from CSV file

This CSV file holds the title, summary, and category of approximately 2000 news articles.

In [4]:

# Download the AG news CSV file
csv_file_path = 'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv'
file_path = 'AG_news_samples.csv'

if not os.path.exists(file_path):
    wget.download(csv_file_path, file_path)
    print('File downloaded successfully.')
else:
    print('File already exists in the local file system.')

In [5]:

df = pd.read_csv('AG_news_samples.csv')
df

In [6]:

data = df.to_dict(orient='records')
data[0]

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. This updates the connection_url to connect to that database.

Set up the database

Set up the SingleStoreDB database that will hold your data.

In [7]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;
    %sql CREATE DATABASE news;

Action Required

Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

In [8]:

%%sql
DROP TABLE IF EXISTS news_articles;
CREATE TABLE IF NOT EXISTS news_articles (
    title TEXT,
    description TEXT,
    genre TEXT,
    embedding BLOB,
    FULLTEXT (title, description)
);
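The embedding column is a plain BLOB. SingleStore's vector functions such as DOT_PRODUCT interpret such a BLOB as a packed sequence of little-endian 32-bit floats (the same layout JSON_ARRAY_PACK produces). A minimal stdlib-only sketch of that layout, using a hypothetical 4-dimensional vector in place of a real 768-dimensional embedding:

```python
import struct

# Hypothetical 4-dimensional vector standing in for a 768-dimensional embedding
vector = [0.25, -1.0, 0.5, 2.0]

# Pack the floats as little-endian 32-bit values -- the BLOB layout
# SingleStore's vector functions operate on
blob = struct.pack('<4f', *vector)

# 4 floats x 4 bytes each = 16 bytes
print(len(blob))

# Unpack to confirm the values round-trip (these are exactly representable in float32)
restored = list(struct.unpack('<4f', blob))
print(restored == vector)
```

In practice you don't pack the bytes yourself here: the singlestoredb driver serializes the NumPy float32 arrays produced by `model.encode` when they are bound as parameters.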

Get embeddings for every row based on the description column

In [9]:

# Will take around 3.5 minutes to get embeddings for all 2000 rows
descriptions = [row['description'] for row in data]
all_embeddings = model.encode(descriptions)
all_embeddings.shape

Merge embedding values into data rows.

In [10]:

for row, embedding in zip(data, all_embeddings):
    row['embedding'] = embedding

Here's an example of one row of the combined data.

In [11]:

data[0]

Populate the database

In [12]:

%sql TRUNCATE TABLE news_articles;

import sqlalchemy as sa
from singlestoredb import create_engine

# Use create_engine from singlestoredb since it uses the notebook connection URL
conn = create_engine().connect()

statement = sa.text('''
    INSERT INTO news.news_articles (
        title,
        description,
        genre,
        embedding
    )
    VALUES (
        :title,
        :description,
        :label,
        :embedding
    )
''')

conn.execute(statement, data)

Semantic search

Connect to OpenAI

In [13]:

import openai
EMBEDDING_MODEL = 'text-embedding-ada-002'
GPT_MODEL = 'gpt-3.5-turbo'

In [14]:

import getpass
openai.api_key = getpass.getpass('OpenAI API Key: ')

Run semantic search and get scores

In [15]:

search_query = 'Articles about Aussie captures'
search_embedding = model.encode(search_query)

# Create the SQL statement.
query_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS score
    FROM news.news_articles
    ORDER BY score DESC
    LIMIT 10
''')

# Execute the SQL statement.
results = pd.DataFrame(conn.execute(query_statement, dict(embedding=search_embedding)))
results
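DOT_PRODUCT in the query above scores each article by the dot product of its stored embedding and the query embedding, so articles whose vectors point in the same direction as the query rank highest. The same ranking logic can be sketched in plain Python, using toy 4-dimensional vectors in place of the model's 768-dimensional embeddings:

```python
# Toy 4-dimensional vectors standing in for real 768-dimensional embeddings
query_vec = [0.1, 0.3, 0.5, 0.2]
doc_vecs = {
    'doc_a': [0.1, 0.3, 0.5, 0.2],   # identical to the query
    'doc_b': [0.5, 0.1, 0.2, 0.4],
    'doc_c': [0.0, 0.0, 0.1, 0.0],
}

def dot_product(a, b):
    """Same scoring as SQL's DOT_PRODUCT: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

# Score every document and sort descending, mirroring ORDER BY score DESC
scores = {name: dot_product(vec, query_vec) for name, vec in doc_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # 'doc_a', the vector most aligned with the query
```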

Hybrid search

This search averages the semantic-search score and the keyword-search score for each article, then sorts the news articles by the combined score to perform an effective hybrid search.

In [16]:

hyb_query = 'Articles about Aussie captures'
hyb_embedding = model.encode(hyb_query)

# Create the SQL statement.
hyb_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS semantic_score,
        MATCH(title, description) AGAINST (:query) AS keyword_score,
        (semantic_score + keyword_score) / 2 AS combined_score
    FROM news.news_articles
    ORDER BY combined_score DESC
    LIMIT 10
''')

# Execute the SQL statement.
hyb_results = pd.DataFrame(conn.execute(hyb_statement, dict(embedding=hyb_embedding, query=hyb_query)))
hyb_results
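Note that the query above averages two scores that live on different scales: DOT_PRODUCT similarities and full-text relevance values. A common refinement, not used in this notebook, is to min-max normalize each score list to [0, 1] before blending, optionally with a weight to favor one signal. A plain-Python sketch of that idea, with made-up scores:

```python
def minmax(values):
    """Rescale a list of scores to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def hybrid_scores(semantic, keyword, weight=0.5):
    """Blend normalized score lists; weight=0.5 is a plain average."""
    sem_n, kw_n = minmax(semantic), minmax(keyword)
    return [weight * s + (1 - weight) * k for s, k in zip(sem_n, kw_n)]

# Made-up scores on deliberately different scales
semantic = [0.82, 0.75, 0.60]   # dot-product similarities
keyword  = [4.0, 0.0, 6.0]      # full-text relevance values

combined = hybrid_scores(semantic, keyword)
print(combined.index(max(combined)))  # 0: best overall after normalization
```

This keeps one signal from dominating simply because its raw values are larger; the same normalization could be pushed into the SQL with window functions if you prefer to keep the ranking server-side.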

Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [17]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;

Details

Tags

#starter #openai #genai #vectordb

License

This Notebook has been released under the Apache 2.0 open source license.