Using SingleStoreDB as a Vector Database for Q&A Chatbots

SingleStoreDB has long supported vector functions like dot_product, which make it a good fit for AI applications that require text similarity matching. An example of this type of AI application is a chatbot that answers questions from a corpus of information.

In this blog post, we’ll demonstrate how we use SingleStoreDB, along with AI models like Whisper and ChatGPT, to create a chatbot that uses the YCombinator YouTube channel to answer questions about startups and give startup-related advice. We initially built this as a side project using another vector database, but recently converted it to SingleStoreDB.

The bot is accessible here: https://transcribe.param.codes/ask/yc-s2.

How We Built It

Step 1. Transcribing the videos

We first used OpenAI’s Whisper model to transcribe all the videos on the YC YouTube channel. Instead of running the model ourselves, we used Replicate to run it and return the transcriptions, which we stored in a simple SQLite database.

Step 2. Creating embeddings from the transcriptions

Because models like ChatGPT have a limited context length of 4096 tokens, we cannot just give ChatGPT all the transcriptions and ask it questions based on the entire corpus. So, when we get a question, we need to find the parts of the transcriptions that are most relevant to the question and only give those to ChatGPT in the prompt. To do this, we need to first create embeddings for the text in the transcriptions.

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness, and large distances suggest low relatedness. Here’s a nice video explaining how this works.
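To make the idea concrete, here is a minimal sketch using made-up three-dimensional vectors (real text-embedding-ada-002 vectors have 1536 dimensions). The helper names and the numbers are ours, for illustration only. Note that the dot product used later in this post is a similarity score rather than a distance, so a higher value means the texts are more related.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy "embeddings" (hypothetical values, not real model output)
funding = normalize([0.9, 0.1, 0.0])
investors = normalize([0.8, 0.2, 0.1])
cooking = normalize([0.0, 0.1, 0.9])

# For unit-length vectors the dot product equals the cosine similarity,
# so the related pair scores higher than the unrelated pair.
print(dot(funding, investors) > dot(funding, cooking))  # True
```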

We used OpenAI’s API to create embeddings for our text. This is as simple as:

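import openai  # this snippet uses the pre-1.0 openai library interface

# `group` is one window of transcript text (see the note below)
embedding = openai.Embedding.create(
    input=group,
    model="text-embedding-ada-002"
)

# We sent a single input, so the response holds a single embedding
vector = embedding['data'][0]['embedding']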

Note: There are a few other minor implementation details here. Before passing text to the embedding model, we split long transcripts into windows that are small enough to fit within the model’s context size.
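The windowing step could look something like the sketch below. It is not the code we ran: the function name, the word-based limit and the overlap are assumptions for illustration. Counting words is only a rough proxy for counting tokens, so a real implementation would leave some headroom or use a tokenizer.

```python
def split_into_windows(text, max_words=200, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text
        if start + max_words >= len(words):
            break
    return windows

# 450 words with a 200-word window and 20-word overlap
groups = split_into_windows("word " * 450, max_words=200, overlap=20)
print(len(groups))  # 3
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one window, at the cost of embedding some text twice.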

Step 3. Storing the embeddings in SingleStoreDB

Next, we need to store these embeddings somewhere. We need to be able to find distances between two vectors quickly to retrieve the most relevant text for our question answering bot.

SingleStoreDB supports this well. We can store the vectors in a blob column and use the dot_product function for retrieval.

Our table looks like this:

Field	Type	Nullable
vector	blob	no
text_data	text	no
src	text	no

Here is the SQL command to create this table:

CREATE TABLE embeddings (
       vector blob not null,
       text_data text not null,
       src text not null
);

Here, the text_data column stores the text corresponding to the vector and src stores the link of the YouTube video that the text is from.
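Writing a row into this table might look like the following sketch. The helper names are hypothetical, and `conn` is assumed to be a MySQL-compatible DB-API connection (for example, one created with pymysql). The `<` in the format string pins little-endian byte order, which we assume matches what the database's vector functions expect; treat that as an assumption to verify against your setup.

```python
import struct

def pack_vector(vector):
    """Pack floats as little-endian 32-bit values for the blob column."""
    return struct.pack(f"<{len(vector)}f", *vector)

def unpack_vector(blob):
    """Inverse of pack_vector: recover the floats from a blob."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

def insert_embedding(conn, vector, text_data, src):
    """Insert one row into the embeddings table (conn is an assumed DB-API connection)."""
    query = "INSERT INTO embeddings (vector, text_data, src) VALUES (%s, %s, %s)"
    with conn.cursor() as cursor:
        cursor.execute(query, (pack_vector(vector), text_data, src))
    conn.commit()

blob = pack_vector([0.5, -1.0, 2.0])
print(len(blob))  # 12 bytes: 3 floats x 4 bytes each
```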

Step 4. Retrieval

Now when we get a question, it’s very easy to find the text that’s most relevant using a simple SQL query. Our retrieval method looks like this:

import struct

def get_closest(conn, vector, limit=1):
    # Convert the array of floats to binary format (packed 32-bit floats)
    binary_vector = struct.pack(f'{len(vector)}f', *vector)

    # DOT_PRODUCT returns a similarity score (higher means closer),
    # so we sort in descending order despite the "distance" alias
    query = '''
       SELECT text_data, src, DOT_PRODUCT(vector, %s) AS distance
       FROM embeddings
       ORDER BY distance DESC
       LIMIT %s
    '''

    # Execute the SQL query
    with conn.cursor() as cursor:
        cursor.execute(query, (binary_vector, limit))
        return cursor.fetchall()
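To see what the query computes, here is a pure-Python reference that ranks rows the same way without a database. It is only a sketch for understanding or unit testing: the function name, sample rows and URLs are made up, and it assumes the blobs hold little-endian 32-bit floats.

```python
import struct

def unpack_vector(blob):
    """Decode a blob of little-endian 32-bit floats into a list."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

def get_closest_in_memory(rows, vector, limit=1):
    """Rank (blob, text_data, src) rows by dot product with vector, best first."""
    scored = []
    for blob, text_data, src in rows:
        stored = unpack_vector(blob)
        score = sum(a * b for a, b in zip(stored, vector))
        scored.append((text_data, src, score))
    # Highest dot product first, mirroring ORDER BY ... DESC
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:limit]

# Hypothetical two-dimensional data for illustration
rows = [
    (struct.pack("<2f", 1.0, 0.0), "about fundraising", "https://example.com/a"),
    (struct.pack("<2f", 0.0, 1.0), "about hiring", "https://example.com/b"),
]
print(get_closest_in_memory(rows, [0.0, 1.0])[0][0])  # about hiring
```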

Step 5. Using ChatGPT to answer the question

Once we get the most relevant text, we can send it over to ChatGPT and ask it to answer the question based on the text. This is a simple OpenAI call:

import openai

# Wrapped in a function (the name is ours) so the return statement is valid
def answer_question(question, context, source):
    # context: the most relevant transcript text; source: its YouTube link
    prompt = f"""
You are a bot that answers questions about videos from YCombinator's YouTube channel.
The context will provide text from transcriptions of the channel. You should be as
helpful as possible.

Answer the question based on the context below. Try to be detailed and explain things
thoroughly.
If the question can't be answered based on the context, say "I don't know"

Context: {context}

---

Question: {question}
Answer:
"""
    response = openai.ChatCompletion.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-3.5-turbo"
    )

    return response["choices"][0]["message"]["content"], source

Here, context is built from the results of the get_closest function described earlier. It contains the text from the transcriptions that is most relevant to the question, and source is the link to the video that text came from. After ChatGPT sends us the answer, we display it to the user with the source link.
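One way to turn the rows returned by get_closest into a context string is sketched below. The function name and the character budget are assumptions, not part of our implementation, and the rows are assumed to be plain tuples in the order selected by the query (text, source, score).

```python
def build_context(rows, max_chars=6000):
    """Join the best-matching texts, stopping before a rough size budget is exceeded."""
    parts, sources, used = [], [], 0
    for text_data, src, _score in rows:
        if used + len(text_data) > max_chars:
            break
        parts.append(text_data)
        used += len(text_data)
        # Keep each source link once, in rank order
        if src not in sources:
            sources.append(src)
    return "\n\n".join(parts), sources

# Hypothetical rows for illustration
rows = [
    ("Talk to users.", "https://example.com/a", 0.91),
    ("Launch early.", "https://example.com/b", 0.88),
]
context, sources = build_context(rows)
print(context)
```

A character budget is a crude stand-in for the model's token limit, so it is worth leaving room for the prompt instructions and the answer.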

What’s Next?

SingleStoreDB is great at vector operations, and the ability to use plain SQL queries makes it very attractive for developers. However, all the steps we outlined here can also be done with a library like llama-index, without implementing the details yourself. While llama-index currently does not support SingleStoreDB, we are going to explore adding SingleStoreDB support to it.

We think SingleStoreDB is a great tool to add to your toolkit if you require a general-purpose database that can manage both transactional and analytical workloads with great performance. Its compatibility with MySQL also makes it easy to use, and its support for vector functions makes it easy to build AI applications on top of it.

Want to get started building your own chatbot? Get started with SingleStoreDB free.
