Hybrid Search

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

Source: OpenAI Cookbook

Hybrid search integrates keyword-based search and semantic search to combine the strengths of both and provide users with a more comprehensive and efficient search experience. This notebook is an example of how to perform hybrid search with SingleStore's database and notebooks.

Setup

Let's first install the necessary libraries.

In [1]:

%pip install wget openai==1.3.3 --quiet

In [2]:

import json
import os

import pandas as pd
import wget

In [3]:

# Import the library for vectorizing the data (Up to 2 minutes)
%pip install sentence-transformers --quiet

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')
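As a quick sanity check, you can encode a single sentence and inspect the result; this mpnet-base model is expected to produce 768-dimensional float32 vectors. A minimal sketch (not part of the original cells):

# Encode one sample sentence and confirm the embedding dimensionality.
sample_vec = model.encode('hybrid search combines keyword and semantic retrieval')
print(sample_vec.shape, sample_vec.dtype)  # expected: (768,) float32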

Import data from CSV file

This CSV file holds the title, description, and category of approximately 2,000 news articles.

In [4]:

# Download the news CSV file
csv_file_path = 'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv'
file_path = 'AG_news_samples.csv'

if not os.path.exists(file_path):
    wget.download(csv_file_path, file_path)
    print('File downloaded successfully.')
else:
    print('File already exists in the local file system.')

In [5]:

df = pd.read_csv('AG_news_samples.csv')
df

In [6]:

data = df.to_dict(orient='records')
data[0]

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Set up the database

Set up the SingleStoreDB database which will hold your data.

In [7]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;
    %sql CREATE DATABASE news;

Action Required

Make sure to select a database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

In [8]:

%%sql
DROP TABLE IF EXISTS news_articles;
CREATE TABLE IF NOT EXISTS news_articles /* Creating table for sample data. */(
    title TEXT,
    description TEXT,
    genre TEXT,
    embedding BLOB,
    FULLTEXT (title, description)
);
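The embedding column is declared as BLOB because SingleStore's DOT_PRODUCT operates on packed 32-bit floats. A minimal sketch (assuming NumPy float32 vectors, as produced by the model above) of how a vector maps to that byte representation:

import numpy as np

vec = np.array([0.1, 0.2, 0.3], dtype=np.float32)
blob = vec.tobytes()                              # packed float32 bytes, as stored in the BLOB column
restored = np.frombuffer(blob, dtype=np.float32)  # round-trips back to the same vector
assert np.allclose(vec, restored)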

Get embeddings for every row based on the description column

In [9]:

# Will take around 3.5 minutes to get embeddings for all 2000 rows

descriptions = [row['description'] for row in data]
all_embeddings = model.encode(descriptions)
all_embeddings.shape

Merge embedding values into data rows.

In [10]:

for row, embedding in zip(data, all_embeddings):
    row['embedding'] = embedding

Here's an example of one row of the combined data.

In [11]:

data[0]

Populate the database

In [12]:

%sql TRUNCATE TABLE news_articles;

import sqlalchemy as sa
from singlestoredb import create_engine

# Use create_engine from singlestoredb since it uses the notebook connection URL
conn = create_engine().connect()

statement = sa.text('''
        INSERT INTO news.news_articles (
            title,
            description,
            genre,
            embedding
        )
        VALUES (
            :title,
            :description,
            :label,
            :embedding
        )
    ''')

conn.execute(statement, data)
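Passing the full list of row dictionaries runs a bulk (executemany-style) insert. For larger datasets, one alternative is to insert in batches instead; a sketch of that option (the batch size is an arbitrary assumption, and this would replace, not follow, the single call above):

batch_size = 500
for start in range(0, len(data), batch_size):
    conn.execute(statement, data[start:start + batch_size])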

Semantic search

Connect to OpenAI

In [13]:

import openai

EMBEDDING_MODEL = 'text-embedding-ada-002'
GPT_MODEL = 'gpt-3.5-turbo'

In [14]:

import getpass

openai.api_key = getpass.getpass('OpenAI API Key: ')
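The model constants above are not exercised in the cells shown here; the search below uses the local sentence-transformers model. If you instead wanted to generate a query embedding with OpenAI, a hedged sketch using the openai==1.3.3 client interface (an assumption based on that library version, not part of the original cells):

from openai import OpenAI

client = OpenAI(api_key=openai.api_key)
response = client.embeddings.create(model=EMBEDDING_MODEL, input='Articles about Aussie captures')
openai_embedding = response.data[0].embedding  # a list of floats (1536 dims for text-embedding-ada-002)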

Run semantic search and get scores

In [15]:

search_query = 'Articles about Aussie captures'
search_embedding = model.encode(search_query)

# Create the SQL statement.
query_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS score
    FROM news.news_articles
    ORDER BY score DESC
    LIMIT 10
    ''')

# Execute the SQL statement.
results = pd.DataFrame(conn.execute(query_statement, dict(embedding=search_embedding)))
results
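DOT_PRODUCT returns a raw, unbounded dot product. If you prefer scores bounded in [-1, 1], you can L2-normalize vectors before storing and before querying, which makes the dot product equal to cosine similarity. A small illustrative sketch (not part of the original notebook):

import numpy as np

def to_unit(v):
    v = np.asarray(v, dtype=np.float32)
    return v / np.linalg.norm(v)

# The dot product of two unit vectors is their cosine similarity.
a, b = to_unit([1.0, 2.0, 3.0]), to_unit([2.0, 0.5, 1.0])
print(float(np.dot(a, b)))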

Hybrid search

This search averages the semantic-search score and the keyword-search score for each article, then sorts the news articles by this combined score to perform an effective hybrid search.

In [16]:

hyb_query = 'Articles about Aussie captures'
hyb_embedding = model.encode(hyb_query)

# Create the SQL statement.
hyb_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        DOT_PRODUCT(embedding, :embedding) AS semantic_score,
        MATCH(title, description) AGAINST (:query) AS keyword_score,
        (semantic_score + keyword_score) / 2 AS combined_score
    FROM news.news_articles
    ORDER BY combined_score DESC
    LIMIT 10
    ''')

# Execute the SQL statement.
hyb_results = pd.DataFrame(conn.execute(hyb_statement, dict(embedding=hyb_embedding, query=hyb_query)))
hyb_results
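Note that the semantic score (a dot product) and the keyword score (a full-text relevance score) are on different scales, so a plain average can let one dominate. One way to compensate, sketched below on the returned DataFrame (an assumption, not part of the original notebook), is to min-max rescale each column before averaging:

# Rescale each score column to [0, 1], then average the rescaled scores.
def min_max(col):
    rng = col.max() - col.min()
    return (col - col.min()) / rng if rng else col * 0

rescored = hyb_results.copy()
rescored['combined_score'] = (min_max(rescored['semantic_score']) + min_max(rescored['keyword_score'])) / 2
rescored.sort_values('combined_score', ascending=False)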

Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [17]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS news;

Details


About this Template

Hybrid search combines keyword search with semantic search, aiming to provide more accurate results.

This Notebook can be run in Shared Tier, Standard and Enterprise deployments.

Tags

starter openai genai vectordb

License

This Notebook has been released under the Apache 2.0 open source license.

See Notebook in action

Launch this notebook in SingleStore and start executing queries instantly.