Semantic Search with OpenAI QA

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left navigation. You can also use your existing Standard or Premium workspace with this Notebook.

In this Notebook you will use a combination of Semantic Search and a Large Language Model (LLM) to build a basic Retrieval Augmented Generation (RAG) application. For a great introduction to RAG, please read A Beginner's Guide to Retrieval Augmented Generation (RAG).
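Before building the real thing, the RAG flow can be sketched end-to-end with toy stand-ins. In the sketch below, embed, retrieve, and build_prompt are hypothetical helpers (word-overlap scoring instead of a vector model and database), but the flow — embed the question, retrieve related text, and combine it into a prompt — is the same one this notebook implements with OpenAI embeddings and SingleStoreDB.

```python
def embed(text: str) -> set:
    # Toy "embedding": the set of lowercased words (stand-in for a vector model).
    return set(text.lower().split())

def retrieve(query: str, documents: list, top_n: int = 2) -> list:
    # Rank documents by word overlap with the query (stand-in for a vector search).
    scored = [(len(embed(query) & embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def build_prompt(query: str, context: list) -> str:
    # Combine retrieved context with the user's question for the LLM.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

docs = [
    "Sweden won gold in men's curling at the 2022 Winter Olympics.",
    "The 2022 Winter Olympics were held in Beijing.",
    "Curling has been an Olympic sport since 1998.",
]
prompt = build_prompt("Who won gold in curling?", retrieve("Who won gold in curling?", docs))
print(prompt)
```

In the real pipeline below, the retrieval step becomes a SQL query over vector embeddings and the final prompt is sent to ChatGPT.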

Prerequisites for interacting with ChatGPT

Install OpenAI package

Let's start by installing the openai Python package.

In [1]:

!pip install openai==1.3.3 --quiet

Connect to ChatGPT and display the response

In [2]:

import openai

EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"

You will need an OpenAI API key in order to use the openai Python library.

In [3]:

import getpass
import os

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

client = openai.OpenAI()

Test the connection.

In [4]:

response = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the gold medal for curling in the 2022 Olympics?"},
    ]
)

print(response.choices[0].message.content)

Get the data about Winter Olympics and provide the information to ChatGPT as context

1. Install and import libraries

In [5]:

!pip install tabulate tiktoken wget --quiet

In [6]:

import json
import numpy as np
import os
import pandas as pd
import wget

2. Fetch the CSV data and read it into a DataFrame

Download pre-chunked text and pre-computed embeddings. This file is ~200 MB, so the download may take a minute depending on your connection speed.

In [7]:

embeddings_url = "https://cdn.openai.com/API/examples/data/winter_olympics_2022.csv"
embeddings_path = "winter_olympics_2022.csv"

if not os.path.exists(embeddings_path):
    wget.download(embeddings_url, embeddings_path)
    print("File downloaded successfully.")
else:
    print("File already exists in the local file system.")

Here we use the converters= parameter of pd.read_csv to convert the JSON arrays in the CSV file to numpy arrays.

In [8]:

def json_to_numpy_array(x: str | None) -> np.ndarray | None:
    """Convert a JSON array string into a numpy array."""
    return np.array(json.loads(x)) if x else None

df = pd.read_csv(embeddings_path, converters=dict(embedding=json_to_numpy_array))
df
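As a quick sanity check on what the converter produces, here is a standalone sketch using a short hypothetical JSON string; the real embedding column holds much longer vectors (1,536 values for text-embedding-ada-002).

```python
import json
import numpy as np

def json_to_numpy_array(x):
    """Convert a JSON array string into a numpy array, as in the converter above."""
    return np.array(json.loads(x)) if x else None

# A short sample in the style of the CSV's embedding column (real rows are much longer).
vec = json_to_numpy_array("[0.1, -0.2, 0.3]")
print(type(vec).__name__, vec.shape, vec.dtype)  # ndarray (3,) float64
```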

In [9]:

df.info(show_counts=True)

3. Set up the database

Action Required

If you have a Free Starter Workspace deployed already, select the database from the drop-down menu at the top of this notebook. This updates the connection_url to connect to that database.

Create the database.

In [10]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;
    %sql CREATE DATABASE winter_wikipedia;

Action Required

Make sure to select the winter_wikipedia database from the drop-down menu at the top of this notebook. This updates the connection_url, which is used by the %%sql magic command and SQLAlchemy to make connections to the selected database.

In [11]:

%%sql
CREATE TABLE IF NOT EXISTS winter_olympics_2022 /* Creating table for sample data. */ (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);
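The embedding column is a plain BLOB: the vector functions used later store each vector as consecutive 64-bit floats. The sketch below mirrors that layout locally with numpy; the little-endian float64 packing is our assumption about what JSON_ARRAY_PACK_F64 produces, and the round trip and dot product are computed client-side purely for illustration.

```python
import numpy as np

# Pack a float64 vector into bytes the way (we assume) JSON_ARRAY_PACK_F64
# stores it: consecutive little-endian 64-bit floats in a BLOB.
vec = np.array([0.1, -0.2, 0.3], dtype='<f8')
blob = vec.tobytes()
print(len(blob))  # 24 bytes: 3 values x 8 bytes each

# Round-trip the blob back into a vector and take a dot product,
# mirroring what DOT_PRODUCT_F64 computes server-side.
restored = np.frombuffer(blob, dtype='<f8')
print(float(restored @ vec))
```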

4. Populate the table with our DataFrame

Create a SQLAlchemy connection.

In [12]:

import singlestoredb as s2

conn = s2.create_engine().connect()

Use the to_sql method of the DataFrame to upload the data to the requested table.

In [13]:

df.to_sql(
    'winter_olympics_2022',
    con=conn,
    index=True,
    index_label='id',
    if_exists='append',
    chunksize=1000,
)
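The same to_sql pattern can be exercised against an in-memory SQLite database, which serves here only as a stand-in for the SingleStoreDB connection; demo_table and the two-row frame are made up for illustration.

```python
import pandas as pd
import sqlalchemy as sa

# In-memory SQLite as a stand-in for the SingleStoreDB engine used above.
engine = sa.create_engine("sqlite://")

demo = pd.DataFrame({"text": ["alpha", "beta"], "embedding": [b"\x00" * 8, b"\x01" * 8]})

# Same keyword arguments as the real upload: the DataFrame index becomes the id column.
demo.to_sql("demo_table", con=engine, index=True, index_label="id",
            if_exists="append", chunksize=1000)

with engine.connect() as local_conn:
    count = local_conn.execute(sa.text("SELECT COUNT(*) FROM demo_table")).scalar()
print(count)  # 2
```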

5. Run a semantic search with the same question from above and send the results to OpenAI as context

In [14]:

import sqlalchemy as sa


def get_embedding(text: str, model: str = 'text-embedding-ada-002') -> list[float]:
    """Return the embedding for the given text."""
    return [x.embedding for x in client.embeddings.create(input=[text], model=model).data][0]


def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    table_name: str,
    top_n: int = 100,
) -> tuple:
    """Return a list of strings and relatednesses, sorted from most related to least."""

    # Get the embedding of the query.
    query_embedding_response = get_embedding(query, EMBEDDING_MODEL)

    # Create the SQL statement.
    stmt = sa.text(f"""
        SELECT
            text,
            DOT_PRODUCT_F64(JSON_ARRAY_PACK_F64(:embedding), embedding) AS score
        FROM {table_name}
        ORDER BY score DESC
        LIMIT :limit
    """)

    # Execute the SQL statement.
    results = conn.execute(stmt, dict(embedding=json.dumps(query_embedding_response), limit=top_n))

    strings = []
    relatednesses = []

    for row in results:
        strings.append(row[0])
        relatednesses.append(row[1])

    # Return the results.
    return strings[:top_n], relatednesses[:top_n]
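Locally, the ranking the SQL performs reduces to a dot product and a descending sort. A toy sketch with made-up two-dimensional vectors (the row texts and numbers below are hypothetical) shows the same ORDER BY score DESC / LIMIT logic; because OpenAI's ada-002 embeddings are normalized to unit length, the dot product behaves like cosine similarity.

```python
import numpy as np

# Toy corpus: (text, embedding) pairs standing in for the table rows.
rows = [
    ("curling results", np.array([0.9, 0.1])),
    ("opening ceremony", np.array([0.1, 0.9])),
    ("medal table", np.array([0.7, 0.3])),
]
query_embedding = np.array([1.0, 0.0])

# Score every row by dot product and sort descending — the client-side
# equivalent of DOT_PRODUCT_F64 ... ORDER BY score DESC in the query above.
scored = sorted(
    ((text, float(emb @ query_embedding)) for text, emb in rows),
    key=lambda pair: pair[1],
    reverse=True,
)
top_n = 2  # plays the role of LIMIT
for text, score in scored[:top_n]:
    print(f"{score:.2f}  {text}")
```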

In [15]:

from tabulate import tabulate

strings, relatednesses = strings_ranked_by_relatedness(
    "curling gold medal",
    df,
    "winter_olympics_2022",
    top_n=5
)

for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    print(tabulate([[string]], headers=['Result'], tablefmt='fancy_grid'))
    print('\n\n')

In [16]:

import tiktoken


def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from SingleStoreDB."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df, "winter_olympics_2022")
    introduction = 'Use the below articles on the 2022 Winter Olympics to answer the subsequent question. If the answer cannot be found in the articles, write "I could not find an answer."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nWikipedia article section:\n"""\n{string}\n"""'
        if num_tokens(message + next_article + question, model=model) > token_budget:
            break
        message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answer a query using GPT and a table of relevant texts and embeddings in SingleStoreDB."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about the 2022 Winter Olympics."},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
    return response.choices[0].message.content
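The budget loop in query_message is easiest to see with a stand-in tokenizer. In the sketch below, fake_num_tokens counts words instead of calling tiktoken, and build_message and the articles are hypothetical; with a 13-"token" budget, only the first article fits before the loop stops.

```python
# A simplified version of the token-budget loop in query_message,
# with a word-count stand-in for tiktoken so it runs offline.

def fake_num_tokens(text: str) -> int:
    return len(text.split())

def build_message(strings, question, token_budget):
    message = "Use the articles below."
    for string in strings:
        next_article = f"\n\nArticle:\n{string}"
        # Stop adding articles once the prompt would exceed the budget.
        if fake_num_tokens(message + next_article + question) > token_budget:
            break
        message += next_article
    return message + question

articles = ["one two three four five", "six seven eight nine ten"]
msg = build_message(articles, "\n\nQuestion: who won?", token_budget=13)
print(msg)
```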

In [17]:

print(ask('Who won the gold medal for curling in the 2022 Olympics?'))

Clean up

Action Required

If you created a new database in your Standard or Premium Workspace, you can drop the database by running the cell below. Note: this will not drop your database for Free Starter Workspaces. To drop a Free Starter Workspace, terminate the Workspace using the UI.

In [18]:

shared_tier_check = %sql show variables like 'is_shared_tier'
if not shared_tier_check or shared_tier_check[0][1] == 'OFF':
    %sql DROP DATABASE IF EXISTS winter_wikipedia;

Details


About this Template

Provide context to ChatGPT using data stored in SingleStoreDB.

This Notebook can be run in Shared Tier, Standard and Enterprise deployments.

Tags

starter, openai, vectordb, genai

License

This Notebook has been released under the Apache 2.0 open source license.
