Semantic Search with OpenAI QA

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

In this Notebook you will use a combination of Semantic Search and a Large Language Model (LLM) to build a basic Retrieval Augmented Generation (RAG) application. For a great introduction to what RAG is, please read A Beginner's Guide to Retrieval Augmented Generation (RAG).

Prerequisites for interacting with ChatGPT

Install OpenAI package

Let's start by installing the openai Python package.

In [1]:

!pip install openai==1.3.3 --quiet

Connect to ChatGPT and display the response

In [2]:

import openai
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

You will need an OpenAI API key in order to use the openai Python library.

In [3]:

import getpass
import os
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
client = openai.OpenAI()

Test the connection.

In [4]:

response = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the gold medal for curling in Olympics 2022?"},
    ],
)
print(response.choices[0].message.content)
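Here the model answers from its training data alone; later in this notebook we supply retrieved context instead. As an optional, illustrative check, the openai v1 response object also carries usage metadata if you want to see how many tokens the request consumed:

print(response.usage)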

Get the data about the Winter Olympics and provide the information to ChatGPT as context

1. Install and import libraries

In [5]:

!pip install tabulate tiktoken wget --quiet

In [6]:

import json
import numpy as np
import os
import pandas as pd
import wget

2. Fetch the CSV data and read it into a DataFrame

Download pre-chunked text and pre-computed embeddings. This file is ~200 MB, so it may take a minute depending on your connection speed.

In [7]:

embeddings_url = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
embeddings_path = "winter_olympics_2022.csv"

if not os.path.exists(embeddings_path):
    wget.download(embeddings_url, embeddings_path)
    print("File downloaded successfully.")
else:
    print("File already exists in the local file system.")

Here we use the converters= parameter of pd.read_csv to convert the JSON array strings in the CSV file into numpy arrays.

In [8]:

def json_to_numpy_array(x: str | None) -> np.ndarray | None:
    """Convert JSON array string into numpy array."""
    return np.array(json.loads(x)) if x else None

df = pd.read_csv(embeddings_path, converters=dict(embedding=json_to_numpy_array))
df
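As a quick illustration (not part of the original data flow), the converter turns a JSON array string into a numpy array:

sample = json_to_numpy_array('[0.1, 0.2, 0.3]')
print(type(sample), sample.shape)  # <class 'numpy.ndarray'> (3,)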

In [9]:

df.info(show_counts=True)

3. Set up the database

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Create the database.

In [10]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;
    %sql CREATE DATABASE winter_wikipedia;
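If you are on a Standard or Premium workspace, you can optionally verify that the database was created (an extra check, not part of the original flow):

%sql SHOW DATABASES LIKE 'winter_wikipedia';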

Action Required

Make sure to select the winter_wikipedia database from the drop-down menu at the top of this notebook. It updates the connection_url, which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.

In [11]:

%%sql
CREATE TABLE IF NOT EXISTS winter_olympics_2022 (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);

4. Populate the table with our DataFrame

Create a SQLAlchemy connection.

In [12]:

import singlestoredb as s2
conn = s2.create_engine().connect()

Use the DataFrame's to_sql method to upload the data to the winter_olympics_2022 table.

In [13]:

df.to_sql(
    'winter_olympics_2022',
    con=conn,
    index=True,
    index_label='id',
    if_exists='append',
    chunksize=1000,
)
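As an optional sanity check (illustrative, not part of the original flow), confirm that the number of rows in the table matches the DataFrame:

import sqlalchemy as sa

row_count = conn.execute(sa.text('SELECT COUNT(*) FROM winter_olympics_2022')).scalar()
print(f'{row_count} rows in table; {len(df)} rows in DataFrame')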

5. Run a semantic search with the same question as above and send the results to OpenAI as context

The search embeds the question with the same embedding model, then lets SingleStoreDB rank the stored article sections by similarity, computing DOT_PRODUCT_F64 between the packed query embedding and each stored embedding.

In [14]:

import sqlalchemy as sa


def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    """Return the embedding of the given text."""
    return client.embeddings.create(input=[text], model=model).data[0].embedding


def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    table_name: str,
    top_n: int = 100,
) -> tuple:
    """Return a list of strings and relatednesses, sorted from most related to least."""
    # NOTE: df is accepted for signature compatibility; the ranking itself
    # happens in the database.

    # Get the embedding of the query.
    query_embedding = get_embedding(query, EMBEDDING_MODEL)

    # Create the SQL statement.
    stmt = sa.text(f"""
        SELECT
            text,
            DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(:embedding), embedding) AS score
        FROM {table_name}
        ORDER BY score DESC
        LIMIT :limit
    """)

    # Execute the SQL statement.
    results = conn.execute(stmt, dict(embedding=json.dumps(query_embedding), limit=top_n))

    strings = []
    relatednesses = []

    for row in results:
        strings.append(row[0])
        relatednesses.append(row[1])

    # Return the results.
    return strings[:top_n], relatednesses[:top_n]

In [15]:

from tabulate import tabulate

strings, relatednesses = strings_ranked_by_relatedness(
    "curling gold medal",
    df,
    "winter_olympics_2022",
    top_n=5,
)

for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    print(tabulate([[string]], headers=['Result'], tablefmt='fancy_grid'))
    print('\n\n')

In [16]:

import tiktoken


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int,
) -> str:
    """Return a message for GPT, with relevant source texts pulled from SingleStoreDB."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df, "winter_olympics_2022")
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        # Stop adding articles once the token budget would be exceeded.
        if num_tokens(message + next_article + question, model=model) > token_budget:
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answer a query using GPT and a table of relevant texts and embeddings in SingleStoreDB."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

In [17]:

print(ask('Who won the gold medal for curling in Olympics 2022?'))
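To inspect the exact prompt, including the retrieved article sections, that gets sent to the model, pass print_message=True:

print(ask('Who won the gold medal for curling in Olympics 2022?', print_message=True))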

Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [18]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;

Details

Tags

#starter #openai #vectordb #genai

License

This Notebook has been released under the Apache 2.0 open source license.