Semantic Search with OpenAI QA

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

In this Notebook you will use a combination of Semantic Search and a Large Language Model (LLM) to build a basic Retrieval Augmented Generation (RAG) application. For a great introduction to what RAG is, please read A Beginner's Guide to Retrieval Augmented Generation (RAG).

Prerequisites for interacting with ChatGPT

Install OpenAI package

Let's start by installing the openai Python package.

In [1]:

!pip install openai==1.3.3 --quiet

Connect to ChatGPT and display the response

In [2]:

import openai
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

You will need an OpenAI API key in order to use the openai Python library.

In [3]:

import getpass
import os
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
client = openai.OpenAI()

Test the connection.

In [4]:

response = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the gold medal for curling in Olympics 2022?"},
    ],
)
print(response.choices[0].message.content)
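Here the model answers from its training data alone; later in this notebook we supply retrieved context instead. As an optional, illustrative check, the openai v1 response object also carries usage metadata if you want to see how many tokens the request consumed:

print(response.usage)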

Get the data about the Winter Olympics and provide the information to ChatGPT as context

1. Install and import libraries

In [5]:

!pip install tabulate tiktoken wget --quiet

In [6]:

import json
import numpy as np
import os
import pandas as pd
import wget

2. Fetch the CSV data and read it into a DataFrame

Download pre-chunked text and pre-computed embeddings. This file is ~200 MB, so it may take a minute depending on your connection speed.

In [7]:

embeddings_url = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
embeddings_path = "winter_olympics_2022.csv"

if not os.path.exists(embeddings_path):
    wget.download(embeddings_url, embeddings_path)
    print("File downloaded successfully.")
else:
    print("File already exists in the local file system.")

Here we use the converters= parameter of pd.read_csv to convert the JSON array strings in the CSV file into numpy arrays.

In [8]:

def json_to_numpy_array(x: str | None) -> np.ndarray | None:
    """Convert JSON array string into numpy array."""
    return np.array(json.loads(x)) if x else None

df = pd.read_csv(embeddings_path, converters=dict(embedding=json_to_numpy_array))
df
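As a quick illustration (not part of the original data flow), the converter turns a JSON array string into a numpy array:

sample = json_to_numpy_array('[0.1, 0.2, 0.3]')
print(type(sample), sample.shape)  # <class 'numpy.ndarray'> (3,)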

In [9]:

df.info(show_counts=True)

3. Set up the database

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. It updates the connection_url to connect to that database.

Create the database.

In [10]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;
    %sql CREATE DATABASE winter_wikipedia;
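If you are on a Standard or Premium workspace, you can optionally verify that the database was created (an extra check, not part of the original flow):

%sql SHOW DATABASES LIKE 'winter_wikipedia';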

Action Required

Make sure to select the winter_wikipedia database from the drop-down menu at the top of this notebook. It updates the connection_url, which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.

In [11]:

%%sql
CREATE TABLE IF NOT EXISTS winter_olympics_2022 (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);

4. Populate the table with our DataFrame

Create a SQLAlchemy connection.

In [12]:

import singlestoredb as s2
conn = s2.create_engine().connect()

Use the DataFrame's to_sql method to upload the data to the winter_olympics_2022 table.

In [13]:

df.to_sql(
    'winter_olympics_2022',
    con=conn,
    index=True,
    index_label='id',
    if_exists='append',
    chunksize=1000,
)
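As an optional sanity check (illustrative, not part of the original flow), confirm that the number of rows in the table matches the DataFrame:

import sqlalchemy as sa

row_count = conn.execute(sa.text('SELECT COUNT(*) FROM winter_olympics_2022')).scalar()
print(f'{row_count} rows in table; {len(df)} rows in DataFrame')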

5. Run a semantic search with the same question as above and send the results to OpenAI as context

The search embeds the question with the same embedding model, then lets SingleStoreDB rank the stored article sections by similarity, computing DOT_PRODUCT_F64 between the packed query embedding and each stored embedding.

In [14]:

import sqlalchemy as sa


def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> list[float]:
    """Return the embedding of the given text."""
    return client.embeddings.create(input=[text], model=model).data[0].embedding


def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    table_name: str,
    top_n: int = 100,
) -> tuple:
    """Return a list of strings and relatednesses, sorted from most related to least."""
    # NOTE: df is accepted for signature compatibility; the ranking itself
    # happens in the database.

    # Get the embedding of the query.
    query_embedding = get_embedding(query, EMBEDDING_MODEL)

    # Create the SQL statement.
    stmt = sa.text(f"""
        SELECT
            text,
            DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(:embedding), embedding) AS score
        FROM {table_name}
        ORDER BY score DESC
        LIMIT :limit
    """)

    # Execute the SQL statement.
    results = conn.execute(stmt, dict(embedding=json.dumps(query_embedding), limit=top_n))

    strings = []
    relatednesses = []

    for row in results:
        strings.append(row[0])
        relatednesses.append(row[1])

    # Return the results.
    return strings[:top_n], relatednesses[:top_n]

In [15]:

from tabulate import tabulate

strings, relatednesses = strings_ranked_by_relatedness(
    "curling gold medal",
    df,
    "winter_olympics_2022",
    top_n=5,
)

for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    print(tabulate([[string]], headers=['Result'], tablefmt='fancy_grid'))
    print('\n\n')

In [16]:

import tiktoken


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int,
) -> str:
    """Return a message for GPT, with relevant source texts pulled from SingleStoreDB."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df, "winter_olympics_2022")
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        # Stop adding articles once the token budget would be exceeded.
        if num_tokens(message + next_article + question, model=model) > token_budget:
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answer a query using GPT and a table of relevant texts and embeddings in SingleStoreDB."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content

In [17]:

print(ask('Who won the gold medal for curling in Olympics 2022?'))
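To inspect the exact prompt, including the retrieved article sections, that gets sent to the model, pass print_message=True:

print(ask('Who won the gold medal for curling in Olympics 2022?', print_message=True))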

Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [18]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;

Details

Tags

#starter #openai #vectordb #genai

License

This Notebook has been released under the Apache 2.0 open source license.