New

Hybrid Full-text and Vector Search

Notebook

SingleStore Notebooks

Hybrid Full-text and Vector Search

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.

What's in this notebook:

  1. Create and use a database.

  2. Create a table and load data.

  3. Create a full-text and a vector index.

  4. Similarity search.

  5. Hybrid search.

  6. Clean up.

Questions?

Reach out to us through our forum.

1. Create and use a database.

To use this notebook, you need to have an active workspace and have selected a database to use. Please select a database using the dropdown above.

2. Create a table and load data.

This example uses a dataset of Wikipedia articles about video games. The dataset contains approximately 41,000 vectors based on 1,800 articles from Wikipedia. The data set is available under the Creative Commons Attribution-ShareAlike License 4.0. Refer to Hybrid Search and Re-ranking for more details on this example and information about hybrid search over vectors.

Create a table to hold the video games data using the SQL below. This table stores the text of the paragraphs and stores the vectors created for those paragraphs using the Vector Type.

In [1]:

%%sql
CREATE TABLE video_games(
id BIGINT(20),
url TEXT DEFAULT NULL,
paragraph TEXT DEFAULT NULL,
v VECTOR(1536) NOT NULL,
SHARD KEY(id), KEY(id) USING HASH
);

Create and run the following pipeline using the CREATE PIPELINE command to load data into the video_games table. The CREATE PIPELINE command may take around 30 seconds to run.

In [2]:

%%sql
-- since the bucket is open, you can leave the credentials clause as-is
CREATE OR REPLACE PIPELINE wiki_pipeline AS
load data S3
's3://wikipedia-video-game-data/video-game-embeddings(1).csv'
config '{"region":"us-west-1"}'
credentials '{"aws_access_key_id": "", "aws_secret_access_key": ""}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE video_games
FORMAT csv
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';
START PIPELINE wiki_pipeline FOREGROUND;

Verify the data was loaded using the query below.

Wait for the pipeline to finish before running the COUNT query.

In [3]:

%%sql
SELECT COUNT(*)
FROM video_games;

There should be 40,027 rows in the video_games table when the PIPELINE is finished.

3. Create a full-text and a vector index.

Use the following SQL to create full-text and vector indexes on the video_games table. Indexes can improve query performance on large vector data sets. Refer to Vector Indexing for more information on vector indexes and CREATE TABLE</code) for more information on full-text indexes.

In [4]:

%%sql
ALTER TABLE video_games ADD FULLTEXT ft_para(paragraph);
ALTER TABLE video_games ADD VECTOR INDEX ivf_v(v)
INDEX_OPTIONS '{"index_type":"IVF_FLAT"}';

Optionally optimize the table for best performance.

Wait for the ALTER TABLE commands to finish before running the OPTIMIZE command.

In [5]:

%%sql
OPTIMIZE TABLE video_games FULL;

4. Similarity search.

Similarity search finds a set of vectors that are most similar to a query vector. This example finds vectors representing paragraphs that are similar to a vector about the Mario Kart Game. The vector for the first paragraph about Mario Kart as our query vector. This is a good semantic query vector for Mario Kart.

To find the most similar vectors in a query vector, use an ORDER BY… LIMIT… query. The ORDER BY command will arrange the vectors by their similarity score produced by a vector similarity function, with the closest matches at the top.

The SQL below finds three paragraphs that are the most similar to the first paragraph about Mario Kart, a semantic similarity search for information about Mario Kart.

In [6]:

%%sql
SET @v_mario_kart = (SELECT v FROM video_games
WHERE URL = "https://en.wikipedia.org/wiki/Super_Mario_Kart"
ORDER BY id LIMIT 1);
SELECT id, paragraph, v <*> @v_mario_kart AS SCORE
FROM video_games
ORDER BY score DESC
LIMIT 3;

5. Hybrid search.

Hybrid Search combines multiple search methods in one query and blends full-text search (which finds keyword matches) and vector search (which finds semantic matches) allowing search results to be (re-)ranked by a score that combines full-text and vector rankings.

In [7]:

%%sql
SET @v_mario_kart = (SELECT v FROM video_games
WHERE URL = "https://en.wikipedia.org/wiki/Super_Mario_Kart"
ORDER BY id LIMIT 1);
WITH fts AS (
SELECT id, paragraph,
MATCH(paragraph) AGAINST("mario kart") AS SCORE
FROM video_games
WHERE MATCH(paragraph) AGAINST("mario kart")
ORDER BY SCORE desc
LIMIT 200
),
vs AS (
SELECT id, paragraph, v <*> @v_mario_kart AS SCORE
FROM video_games
ORDER BY score DESC
LIMIT 200
)
SELECT vs.id, SUBSTRING(vs.paragraph,0,25),
FORMAT(IFNULL(fts.score, 0) * .3
+ IFNULL(vs.score, 0) * .7, 4) AS score,
FORMAT(fts.score, 4) AS fts_s,
FORMAT(vs.score, 4) AS vs_s
FROM fts FULL OUTER JOIN vs ON fts.id = vs.id
ORDER BY score DESC
LIMIT 5;

6. Clean up.

The command below will drop the table created as part of this notebook. Dropping this table will allow you to rerun the notebook from the beginning.

In [8]:

%%sql
DROP PIPELINE wiki_pipeline;
DROP TABLE video_games;

Details

Tags

#starter#openai#genai#vectordb

License

This Notebook has been released under the Apache 2.0 open source license.