Searching all of Wikipedia

Note

This tutorial is meant for Standard and Premium workspaces. You cannot run it on a Free Starter Workspace due to restrictions on storage. Create a workspace using +group in the left nav and select Standard for this notebook. Gallery notebooks tagged with "Starter" are suitable to run on a Free Starter Workspace.

In this notebook, we implement semantic search and Retrieval-Augmented Generation (RAG) across Wikipedia's extensive database, using SingleStore's indexed ANN search capabilities.

We focus on a subset of Wikipedia: we scraped 1,800 video game articles to obtain real OpenAI embeddings for about 40,000 paragraphs, supplemented by 10 million mock vectors for a scaled-up simulation. We then stored these vectors in a SingleStore database, applying different indexing options to improve search efficiency, and saw query response times drop to under 100 milliseconds with indexing.

We also integrated this system with a RAG-based chat, where you can ask questions and retrieve contextual answers based on the video game information in the database. The technical details and resources, including the Python code and data links, are shared in this GitHub repository and an AWS S3 bucket (S3 URI: s3://wikipedia-video-game-data/video-game-embeddings(1).csv).

Part 0: Creating your workspace and database

For the purposes of this demo, we generate only 10 million vectors to search through. By our estimate, Wikipedia as a whole contains around 160 million paragraphs. A quick heuristic for sizing your workspace:

  • 160 million vectors can be handled by an S-32 Workspace

  • 20 million vectors can be handled by an S-4 Workspace

You can extrapolate the workspace size you need from the number of vectors you plan to generate, as the sketch below illustrates.
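As a back-of-the-envelope illustration of that heuristic, the short Python sketch below interpolates a suggested size from a target vector count. It is only a rough model built from the two reference points above (about 5 million vectors per workspace unit); the helper name suggested_workspace is ours, not a SingleStore API.

def suggested_workspace(num_vectors: int) -> str:
    """Rough sizing from the reference points above: an S-4 handles ~20M
    vectors and an S-32 handles ~160M, i.e. about 5M vectors per unit."""
    vectors_per_unit = 5_000_000
    units = -(-num_vectors // vectors_per_unit)  # ceiling division
    size = 1
    while size < units:  # round up to the next power-of-two S-size
        size *= 2
    return f"S-{size}"

print(suggested_workspace(10_000_000))   # S-2 for this demo's 10M vectors
print(suggested_workspace(160_000_000))  # S-32 for all of Wikipedia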

In [1]:

%%sql
drop database if exists video_game_wikipedia;
create database video_game_wikipedia;
use video_game_wikipedia;

-- create our table (note the usage of the vector data type)
create table vecs(
    id bigint(20),
    url text default null,
    paragraph text default null,
    v vector(1536) not null,
    shard key(id),
    key(id) using hash,
    fulltext (paragraph)
);
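As a quick sanity check that the vector(1536) column accepts the JSON-array text form we will use throughout this notebook, here is a minimal Python sketch. It assumes the connection_url variable that SingleStore notebooks provide (used again in Part 6); the test row with id = 0 is hypothetical and is removed at the end.

import json
import sqlalchemy as sa

engine = sa.create_engine(connection_url)  # `connection_url` comes from the notebook environment
with engine.begin() as conn:
    vec = json.dumps([0.0] * 1535 + [1.0])  # a unit vector as a JSON-array string
    conn.execute(sa.text("insert into vecs (id, v) values (0, :v :> vector(1536))"), {"v": vec})
    row = conn.execute(sa.text("select v from vecs where id = 0")).fetchone()
    print(str(row[0])[:48], "...")  # the vector comes back in its text form
    conn.execute(sa.text("delete from vecs where id = 0"))  # clean up the test row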

Part 1: Generating the Mock Vectors

The vecs table created above has fields for the ID, URL, paragraph text, and the embedding vector.

Then, we define several functions:

  • randbetween: generates random floats in a specified range

  • gen_vector: creates a vector of given length filled with random values

  • normalize: adjusts a vector to unit length

  • norm1536: normalizes a vector of dimension 1536

  • nrandv1536: generates a normalized, random vector of dimension 1536

Finally, we populate vecs with 10,000,000 rows of mock vectors.

In [2]:

%%sql
-- generates a random float in the range [a, b)
create or replace function randbetween(a float, b float) returns float
as
begin
    return (rand() * (b - a) + a);
end;

In [3]:

%%sql
-- set the sql_mode so that || concatenates strings (needed by the function below)
set sql_mode = pipes_as_concat;

-- creates a vector of given length filled with random values
create or replace function gen_vector(length int) returns text as
declare s text = "[";
begin
    if length < 2 then
        raise user_exception("length too short: " || length);
    end if;
    for i in 1..length-1 loop
        s = s || randbetween(-1, 1) || ",";
    end loop;
    s = s || randbetween(-1, 1) || "]";
    return s;
end;

In [4]:

%%sql
-- adjusts a vector to unit length
create or replace function normalize(v blob) returns blob as
declare
    squares blob = vector_mul(v, v);
    length float = sqrt(vector_elements_sum(squares));
begin
    return scalar_vector_mul(1/length, v);
end;
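For intuition, the same normalization in plain Python with NumPy (an equivalent sketch for illustration, not part of the notebook's SQL pipeline):

import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    # mirrors the SQL above: divide by sqrt(vector_elements_sum(vector_mul(v, v)))
    length = np.sqrt(np.sum(v * v))
    return v / length

v = np.random.uniform(-1, 1, 1536)  # like gen_vector(1536)
print(np.linalg.norm(normalize(v)))  # ~1.0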

In [5]:

%%sql
-- normalizes a vector of dimension 1536
create or replace function norm1536(v vector(1536)) returns vector(1536) as
begin
    return normalize(v) :> vector(1536);
end;

In [6]:

%%sql
-- generates a normalized, random vector of dimension 1536
create or replace function nrandv1536() returns vector(1536) as
begin
    return norm1536(gen_vector(1536));
end;

In [7]:

%%sql
-- procedure to populate `vecs` with `num_rows` vectors
create or replace procedure insert_vectors(num_rows bigint) as
declare c int;
begin
    select count(*) into c from vecs;
    loop
        insert into vecs (id, v)
            select id + (select max(id) from vecs), nrandv1536()
            from vecs
            where id <= 128 * 1024; /* chunk size 128K so we can see progress */
        select count(*) into c from vecs;
        if c >= num_rows then
            exit;
        end if;
    end loop;
end;
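Note how the procedure grows the table: each pass copies at most 128K existing rows, so the row count doubles until it reaches the 128K chunk cap and then grows by 128K per pass. A minimal Python model of that growth (a sketch only; actual wall-clock time depends on your workspace) shows why reaching 10 million rows takes on the order of a hundred passes:

CHUNK = 128 * 1024
rows, passes = 1, 0
while rows < 10_000_000:
    rows += min(rows, CHUNK)  # each pass copies at most 128K existing rows
    passes += 1
print(passes, rows)  # 93 passes, ending at 10,092,544 rows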

In [8]:

%%sql
-- seed the table with one row, then run the procedure to populate
-- `vecs` with 10,000,000 vectors (this will take around 20 minutes)
insert into vecs (id, v) values (1, nrandv1536());
call insert_vectors(10000000);

As a quick aside: if you want to generate the full 160 million vectors, simply change num_rows to 160,000,000: call insert_vectors(160000000);
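While insert_vectors runs, you can watch its progress from a separate connection. A minimal sketch, assuming the connection_url variable that SingleStore notebooks provide:

import time
import sqlalchemy as sa

engine = sa.create_engine(connection_url)  # `connection_url` comes from the notebook environment
with engine.connect() as conn:
    for _ in range(5):  # poll a few times; raise the count for longer runs
        n = conn.execute(sa.text("select count(*) from vecs")).scalar()
        print(f"{n:,} rows so far")
        time.sleep(30)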

Part 2: Getting the Wikipedia video game data

We will use a SingleStore pipeline named wiki_pipeline to import data from an S3 bucket into vecs. The pipeline is configured to load data from a CSV file located at s3://wikipedia-video-game-data/video-game-embeddings(1).csv. Since the S3 bucket is open, the credentials section is left empty.

In [9]:

%%sql
-- since the bucket is open, you can leave the credentials clause as it is
create or replace pipeline `wiki_pipeline` as
load data S3 's3://wikipedia-video-game-data/video-game-embeddings(1).csv'
config '{"region":"us-west-1"}'
credentials '{"aws_access_key_id": "",
"aws_secret_access_key": ""}'
skip duplicate key errors
into table `vecs`
format csv
fields terminated by ','
enclosed by '"'
lines terminated by '\r\n';

In [10]:

%%sql
-- start the pipeline!
start pipeline `wiki_pipeline`;

In [11]:

%%sql
-- monitor the pipeline!
select DATABASE_NAME, PIPELINE_NAME, BATCH_ID, BATCH_STATE, START_TIME, ROWS_STREAMED, ROWS_PER_SEC
from information_schema.PIPELINES_BATCHES_SUMMARY
order by BATCH_ID;

Part 3: Building the vector indexes

Now that all the data is in our table vecs, let's build our vector indexes. SingleStore offers several index types with many tunable parameters; here we create three indexes with default parameters: AUTO (which lets SingleStore choose the index type) plus the two IVF variants, IVF_FLAT and IVF_PQ.

In [12]:

%%sql
alter table vecs add vector index auto (v) INDEX_OPTIONS '{"index_type":"AUTO"}';

In [13]:

%%sql
alter table vecs add vector index ivf_flat (v) INDEX_OPTIONS '{"index_type":"IVF_FLAT"}';

In [14]:

%%sql
alter table vecs add vector index ivf_pq (v) INDEX_OPTIONS '{"index_type":"IVF_PQ"}';

Part 4: Testing our indexes

Now that we have indexed our 10M vector dataset, let us now run some queries to test index performance!

We have chosen a test vector whose paragraph is about Nintendo's Rad Racer video game. We will compare the performance of an exact K-nearest neighbor search to the searches with our ANN indexes.

As we will see, using an index gives an order-of-magnitude improvement over the exact KNN search! (A small Python timing harness after the four queries below puts the numbers side by side.)

In [15]:

%%sql
set @qv = (select v from vecs where id = 1125899906845489);
-- NO INDEX: exact kNN search
select paragraph, v <*> @qv as sim
from vecs
order by sim use index () desc
limit 5;

In [16]:

%%sql
set @qv = (select v from vecs where id = 1125899906845489);
-- AUTO index
select paragraph, v <*> @qv as sim
from vecs
order by sim use index (auto) desc
limit 5;

In [17]:

%%sql
set @qv = (select v from vecs where id = 1125899906845489);
-- IVF_FLAT
select paragraph, v <*> @qv as sim
from vecs
order by sim use index (ivf_flat) desc
limit 5;

In [18]:

%%sql
set @qv = (select v from vecs where id = 1125899906845489);
-- IVF_PQ
select paragraph, v <*> @qv as sim
from vecs
order by sim use index (ivf_pq) desc
limit 5;
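Here is the timing harness mentioned above: a Python sketch that runs all four variants and reports wall-clock time. It assumes the connection_url notebook variable and that the selected vector comes back in its JSON text form, which we re-bind and cast back to vector(1536), mirroring the pattern used in Part 6.

import time
import sqlalchemy as sa

engine = sa.create_engine(connection_url)  # `connection_url` comes from the notebook environment

variants = {
    "no index": "use index ()",
    "auto": "use index (auto)",
    "ivf_flat": "use index (ivf_flat)",
    "ivf_pq": "use index (ivf_pq)",
}

with engine.connect() as conn:
    qv = conn.execute(sa.text("select v from vecs where id = 1125899906845489")).scalar()
    for name, hint in variants.items():
        stmt = sa.text(
            f"select paragraph, v <*> :qv :> vector(1536) as sim "
            f"from vecs order by sim {hint} desc limit 5"
        )
        start = time.time()
        conn.execute(stmt, {"qv": qv}).fetchall()
        print(f"{name:9s} {time.time() - start:6.3f} s")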

Part 5: Hybrid Search in SingleStore

Let's now see how to implement a "hybrid search" in SingleStore: a single query that combines two powerful tools, full-text search and semantic search.

In [19]:

%%sql
-- set the query vector
set @v_mario = (select v
                from vecs
                where url = "https://en.wikipedia.org/wiki/Super_Mario_Kart"
                order by id
                limit 1);

-- build the hybrid search
with fts as (
    select id, paragraph, match (paragraph) against ('Mario Kart') as score
    from vecs
    where match (paragraph) against ('Mario Kart')
    order by score desc
    limit 200
),
vs as (
    select id, paragraph, v <*> @v_mario as score
    from vecs
    order by score use index (auto) desc
    limit 200
)
select vs.id,
       vs.paragraph,
       0.3 * ifnull(fts.score, 0) + 0.7 * vs.score as hybrid_score,
       vs.score as vec_score,
       ifnull(fts.score, 0) as ft_score
from fts full outer join vs on fts.id = vs.id
order by hybrid_score desc
limit 5;
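The 0.3/0.7 weighting between the full-text and vector scores is a starting point rather than a rule. The sketch below wraps the same query so the weights become parameters you can tune; it assumes the connection and get_embedding helper defined in Part 6 below, and embeds the query text instead of looking up a stored vector.

import sqlalchemy as sa

def hybrid_search(query_text, ft_weight=0.3, vec_weight=0.7, limit=5):
    """Same hybrid query as the cell above, with tunable weights."""
    stmt = sa.text(f'''
        with fts as (
            select id, match (paragraph) against (:q) as score
            from vecs
            where match (paragraph) against (:q)
            order by score desc
            limit 200
        ),
        vs as (
            select id, paragraph, v <*> :emb :> vector(1536) as score
            from vecs
            order by score use index (auto) desc
            limit 200
        )
        select vs.paragraph,
               {ft_weight} * ifnull(fts.score, 0) + {vec_weight} * vs.score as hybrid_score
        from fts full outer join vs on fts.id = vs.id
        order by hybrid_score desc
        limit :lim''')
    return connection.execute(
        stmt, {"q": query_text, "emb": get_embedding(query_text), "lim": limit}
    ).fetchall()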

Part 6: Chatting with the Video Game data!

  • search_wiki_page function:

    • Conducts a semantic search in the database.

    • Embeds the query input using get_embedding.

    • Finds the top k paragraphs in the vecs table, ranked by similarity to the query.

    • Measures and prints the search execution time.

  • ask_wiki_page function:

    • Uses the results from search_wiki_page as context for the chatbot.

    • Generates a query for an OpenAI GPT-model-based chatbot.

In [20]:

!pip3 install openai --quiet

In [21]:

import sqlalchemy as sa
from openai import OpenAI
import getpass
import os
import time
import json

In [22]:

# OpenAI connection
OPENAI_API_KEY = os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
client = OpenAI()
EMBEDDING_MODEL = 'text-embedding-ada-002'
GPT_MODEL = 'gpt-3.5-turbo'
# SingleStore connection (`connection_url` is provided by the notebook environment)
engine = sa.create_engine(connection_url)
connection = engine.connect()

In [23]:

def get_embedding(text, model=EMBEDDING_MODEL):
    '''Generates the OpenAI embedding from an input `text`.'''
    if isinstance(text, str):
        response = client.embeddings.create(input=[text], model=model)
        return json.dumps(response.data[0].embedding)

In [24]:

def search_wiki_page(query, limit=5):
    '''Returns the top `limit` matches to the query as a list of rows, ordered by similarity.'''
    query_embedding_vec = get_embedding(query)
    statement = sa.text(
        '''select paragraph, v <*> :query_embedding :> vector(1536) as similarity
           from vecs
           order by similarity use index (auto) desc
           limit :limit;'''
    )
    print("Searching for matches...")
    start_time = time.time()
    results = connection.execute(
        statement, {"query_embedding": query_embedding_vec, "limit": limit}
    )
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Search complete in {execution_time} seconds.")
    rows = results.fetchall()
    return rows

In [25]:

def ask_wiki_page(query, limit=5, temp=0.0):
    '''Uses RAG to answer a question from the video game wiki data.'''
    results = search_wiki_page(query, limit)
    print("Asking Chatbot...")
    prompt = f'''Excerpt from the conversation history:
    {results}
    Question: {query}
    Based on the conversation history, try to provide the most accurate answer to the question.
    Consider the details mentioned in the conversation history to formulate a response that is as
    helpful and precise as possible. Please provide links to WIKIPEDIA ARTICLES TO LOOK AT FOR MORE INFORMATION.
    Most importantly, IF THE INFORMATION IS NOT PRESENT IN THE CONVERSATION HISTORY, DO NOT MAKE UP AN ANSWER.'''
    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant who is answering questions about an article."},
            {"role": "user", "content": prompt}
        ],
        temperature=temp
    )
    response_message = response.choices[0].message.content
    return response_message

In [26]:

query = input('Ask me a question about video games!')
ask_wiki_page(query)
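For a non-interactive run, you can also call the function directly with a question string (a made-up example; any topic covered by the scraped video game articles works):

print(ask_wiki_page("Who developed Rad Racer, and on which console was it released?"))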

Details

Tags

#vector #advanced #vectordb #hybridsearch #rag #ann

License

This Notebook has been released under the Apache 2.0 open source license.