Hybrid Search

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this notebook.

Source: OpenAI Cookbook

Hybrid search integrates keyword-based search and semantic search to combine the strengths of both and provide a more comprehensive and efficient search experience. This notebook is an example of how to perform hybrid search with SingleStore's database and notebooks.
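As a rough illustration of the idea before diving in (a self-contained sketch with made-up scores, not the notebook's own data), hybrid search ranks each candidate by a combination of its semantic-similarity score and its keyword-relevance score; the rest of this notebook computes the same combination in SQL.

# Minimal sketch: average a (hypothetical) semantic score and keyword score, then rank by it
def hybrid_score(semantic_score: float, keyword_score: float) -> float:
    return (semantic_score + keyword_score) / 2

candidates = [
    {'title': 'A', 'semantic': 0.82, 'keyword': 0.10},
    {'title': 'B', 'semantic': 0.55, 'keyword': 0.90},
]
ranked = sorted(candidates, key=lambda c: hybrid_score(c['semantic'], c['keyword']), reverse=True)
print([c['title'] for c in ranked])  # ['B', 'A'] with these made-up scores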

Setup

Let's first install the necessary libraries.

In [1]:

%pip install wget openai==1.3.3 --quiet

In [2]:

import json
import os
import pandas as pd
import wget

In [3]:

# Import the library for vectorizing the data (Up to 2 minutes)
%pip install sentence-transformers --quiet
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')
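As an optional sanity check (not part of the original notebook), you can encode a single sentence and inspect the vector size; this MPNet-base model should produce 768-dimensional embeddings.

# Optional check: encode one sentence and confirm the embedding dimensionality
example_embedding = model.encode('A quick test sentence.')
print(example_embedding.shape)  # expected: (768,) for this MPNet-base model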

Import data from CSV file

This CSV file holds the title, summary, and category of approximately 2000 news articles.

In [4]:

# Download the news articles CSV file
csv_file_path = 'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv'
file_path = 'AG_news_samples.csv'

if not os.path.exists(file_path):
    wget.download(csv_file_path, file_path)
    print('File downloaded successfully.')
else:
    print('File already exists in the local file system.')

In [5]:

df = pd.read_csv('AG_news_samples.csv')
df

In [6]:

data = df.to_dict(orient='records')
data[0]

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Set up the database

Set up the SingleStoreDB database which will hold your data.

In [7]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;
    %sql CREATE DATABASE news;

Action Required

Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

In [8]:

%%sql
DROP TABLE IF EXISTS news_articles;
CREATE TABLE IF NOT EXISTS news_articles /* Creating table for sample data. */(
    title TEXT,
    description TEXT,
    genre TEXT,
    embedding BLOB,
    FULLTEXT (title, description)
);
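If you want to confirm that the table and its full-text index were created (an optional check, not part of the original flow), SingleStore supports the MySQL-style SHOW INDEXES statement:

%%sql
SHOW INDEXES FROM news_articles;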

Get embeddings for every row based on the description column

In [9]:

# Will take around 3.5 minutes to get embeddings for all 2000 rows
descriptions = [row['description'] for row in data]
all_embeddings = model.encode(descriptions)
all_embeddings.shape
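A brief aside on scoring (an optional variation, not a required step): DOT_PRODUCT over unit-length vectors is equivalent to cosine similarity, so if you prefer cosine-style scores you can L2-normalize the embeddings before merging and inserting them. The variable name below is hypothetical.

import numpy as np

# Optional: L2-normalize so that DOT_PRODUCT behaves like cosine similarity
norms = np.linalg.norm(all_embeddings, axis=1, keepdims=True)
all_embeddings_normalized = all_embeddings / norms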

Merge embedding values into data rows.

In [10]:

for row, embedding in zip(data, all_embeddings):
    row['embedding'] = embedding

Here's an example of one row of the combined data.

In [11]:

data[0]

Populate the database

In [12]:

%sql TRUNCATE TABLE news_articles;

import sqlalchemy as sa
from singlestoredb import create_engine

# Use create_engine from singlestoredb since it uses the notebook connection URL
conn = create_engine().connect()

statement = sa.text('''
    INSERT INTO news.news_articles (
        title,
        description,
        genre,
        embedding
    )
    VALUES (
        :title,
        :description,
        :label,
        :embedding
    )
''')

conn.execute(statement, data)
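As a quick verification (an optional addition to the notebook), you can count the rows that were just inserted; it should match the number of records in data.

# Optional check: confirm all rows were inserted
row_count = conn.execute(sa.text('SELECT COUNT(*) FROM news.news_articles')).scalar()
print(row_count, len(data))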

Semantic search

Connect to OpenAI

In [13]:

import openai
EMBEDDING_MODEL = 'text-embedding-ada-002'
GPT_MODEL = 'gpt-3.5-turbo'

In [14]:

import getpass
openai.api_key = getpass.getpass('OpenAI API Key: ')

Run semantic search and get scores

In [15]:

search_query = 'Articles about Aussie captures'
search_embedding = model.encode(search_query)

# Create the SQL statement.
query_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS score
    FROM news.news_articles
    ORDER BY score DESC
    LIMIT 10
''')

# Execute the SQL statement.
results = pd.DataFrame(conn.execute(query_statement, dict(embedding=search_embedding)))
results

Hybrid search

This search averages the score from the semantic search and the score from the keyword search, then sorts the news articles by the combined score to perform an effective hybrid search.

In [16]:

hyb_query = 'Articles about Aussie captures'
hyb_embedding = model.encode(hyb_query)

# Create the SQL statement.
hyb_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS semantic_score,
        MATCH(title, description) AGAINST (:query) AS keyword_score,
        (semantic_score + keyword_score) / 2 AS combined_score
    FROM news.news_articles
    ORDER BY combined_score DESC
    LIMIT 10
''')

# Execute the SQL statement.
hyb_results = pd.DataFrame(conn.execute(hyb_statement, dict(embedding=hyb_embedding, query=hyb_query)))
hyb_results
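The equal-weight average above is only one choice. If you want to favor one signal over the other (a variation on this notebook's query, not something it requires), you can weight the two scores instead; the semantic_weight below is a hypothetical value to tune for your data.

semantic_weight = 0.7  # hypothetical weight between 0 and 1
weighted_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS semantic_score,
        MATCH(title, description) AGAINST (:query) AS keyword_score,
        :w * semantic_score + (1 - :w) * keyword_score AS combined_score
    FROM news.news_articles
    ORDER BY combined_score DESC
    LIMIT 10
''')
weighted_results = pd.DataFrame(conn.execute(weighted_statement, dict(embedding=hyb_embedding, query=hyb_query, w=semantic_weight)))
weighted_results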

Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [17]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;

Details


About this Template

Hybrid search combines keyword search with semantic search, aiming to provide more accurate results.

This Notebook can be run in Shared Tier, Standard and Enterprise deployments.

Tags

starter, openai, genai, vectordb

License

This Notebook has been released under the Apache 2.0 open source license.