Hybrid Full-text and Vector Search
Notebook
Note
This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this Notebook.
What's in this notebook:
Create and use a database.
Create a table and load data.
Create a full-text and a vector index.
Similarity search.
Hybrid search.
Clean up.
Questions?
Reach out to us through our forum.
1. Create and use a database.
To use this notebook, you need to have an active workspace and have selected a database to use. Please select a database using the dropdown above.
2. Create a table and load data.
This example uses a dataset of Wikipedia articles about video games. The dataset contains approximately 41,000 vectors based on 1,800 articles from Wikipedia. The data set is available under the Creative Commons Attribution-ShareAlike License 4.0. Refer to Hybrid Search and Re-ranking for more details on this example and information about hybrid search over vectors.
Create a table to hold the video games data using the SQL below. This table stores the text of the paragraphs and stores the vectors created for those paragraphs using the Vector Type.
In [1]:
%%sqlCREATE TABLE video_games(id BIGINT(20),url TEXT DEFAULT NULL,paragraph TEXT DEFAULT NULL,v VECTOR(1536) NOT NULL,SHARD KEY(id), KEY(id) USING HASH);
Create and run the following pipeline using the CREATE PIPELINE
command to load data into the video_games
table. The CREATE PIPELINE
command may take around 30 seconds to run.
In [2]:
%%sql-- since the bucket is open, you can leave the credentials clause as-isCREATE OR REPLACE PIPELINE wiki_pipeline ASload data S3's3://wikipedia-video-game-data/video-game-embeddings(1).csv'config '{"region":"us-west-1"}'credentials '{"aws_access_key_id": "", "aws_secret_access_key": ""}'SKIP DUPLICATE KEY ERRORSINTO TABLE video_gamesFORMAT csvFIELDS TERMINATED BY ','ENCLOSED BY '"'LINES TERMINATED BY '\r\n';START PIPELINE wiki_pipeline FOREGROUND;
Verify the data was loaded using the query below.
Wait for the pipeline to finish before running the COUNT
query.
In [3]:
%%sqlSELECT COUNT(*)FROM video_games;
There should be 40,027 rows in the video_games
table when the PIPELINE
is finished.
3. Create a full-text and a vector index.
Use the following SQL to create full-text and vector indexes on the video_games
table. Indexes can improve query performance on large vector data sets. Refer to Vector Indexing for more information on vector indexes and CREATE TABLE</code) for more information on full-text indexes.
In [4]:
%%sqlALTER TABLE video_games ADD FULLTEXT ft_para(paragraph);ALTER TABLE video_games ADD VECTOR INDEX ivf_v(v)INDEX_OPTIONS '{"index_type":"IVF_FLAT"}';
Optionally optimize the table for best performance.
Wait for the ALTER TABLE
commands to finish before running the OPTIMIZE
command.
In [5]:
%%sqlOPTIMIZE TABLE video_games FULL;
4. Similarity search.
Similarity search finds a set of vectors that are most similar to a query vector. This example finds vectors representing paragraphs that are similar to a vector about the Mario Kart Game. The vector for the first paragraph about Mario Kart as our query vector. This is a good semantic query vector for Mario Kart.
To find the most similar vectors in a query vector, use an ORDER BY… LIMIT…
query. The ORDER BY
command will arrange the vectors by their similarity score produced by a vector similarity function, with the closest matches at the top.
The SQL below finds three paragraphs that are the most similar to the first paragraph about Mario Kart, a semantic similarity search for information about Mario Kart.
In [6]:
%%sqlSET @v_mario_kart = (SELECT v FROM video_gamesWHERE URL = "https://en.wikipedia.org/wiki/Super_Mario_Kart"ORDER BY id LIMIT 1);SELECT id, paragraph, v <*> @v_mario_kart AS SCOREFROM video_gamesORDER BY score DESCLIMIT 3;
5. Hybrid search.
Hybrid Search combines multiple search methods in one query and blends full-text search (which finds keyword matches) and vector search (which finds semantic matches) allowing search results to be (re-)ranked by a score that combines full-text and vector rankings.
In [7]:
%%sqlSET @v_mario_kart = (SELECT v FROM video_gamesWHERE URL = "https://en.wikipedia.org/wiki/Super_Mario_Kart"ORDER BY id LIMIT 1);WITH fts AS (SELECT id, paragraph,MATCH(paragraph) AGAINST("mario kart") AS SCOREFROM video_gamesWHERE MATCH(paragraph) AGAINST("mario kart")ORDER BY SCORE descLIMIT 200),vs AS (SELECT id, paragraph, v <*> @v_mario_kart AS SCOREFROM video_gamesORDER BY score DESCLIMIT 200)SELECT vs.id, SUBSTRING(vs.paragraph,0,25),FORMAT(IFNULL(fts.score, 0) * .3+ IFNULL(vs.score, 0) * .7, 4) AS score,FORMAT(fts.score, 4) AS fts_s,FORMAT(vs.score, 4) AS vs_sFROM fts FULL OUTER JOIN vs ON fts.id = vs.idORDER BY score DESCLIMIT 5;
6. Clean up.
The command below will drop the table created as part of this notebook. Dropping this table will allow you to rerun the notebook from the beginning.
In [8]:
%%sqlDROP PIPELINE wiki_pipeline;DROP TABLE video_games;
Details
About this Template
Example of similarity search over vector data and a hybrid search that combines full-text search with an indexed vector search.
This Notebook can be run in Shared Tier, Standard and Enterprise deployments.
Tags
License
This Notebook has been released under the Apache 2.0 open source license.