Hybrid Full-text and Vector Search

Note

This notebook can be run on a Free Starter Workspace. To create a Free Starter Workspace, navigate to Start using the left nav. You can also use your existing Standard or Premium workspace with this notebook.

What's in this notebook:

Create and use a database.
Create a table and load data.
Create a full-text and a vector index.
Similarity search.
Hybrid search.
Clean up.

Questions?

Reach out to us through our forum.

1. Create and use a database.

To use this notebook, you need to have an active workspace and have selected a database to use. Please select a database using the dropdown above.

2. Create a table and load data.

This example uses a dataset of Wikipedia articles about video games. The dataset contains approximately 41,000 vectors based on 1,800 articles from Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License 4.0. Refer to Hybrid Search and Re-ranking for more details on this example and information about hybrid search over vectors.

Create a table to hold the video games data using the SQL below. This table stores the text of the paragraphs and stores the vectors created for those paragraphs using the Vector Type.

In [1]:

%%sql
CREATE TABLE video_games(
    id BIGINT(20),
    url TEXT DEFAULT NULL,
    paragraph TEXT DEFAULT NULL,
    v VECTOR(1536) NOT NULL,
    SHARD KEY(id), KEY(id) USING HASH
);

Create and run the following pipeline using the CREATE PIPELINE command to load data into the video_games table. The CREATE PIPELINE command may take around 30 seconds to run.

In [2]:

%%sql
-- since the bucket is open, you can leave the credentials clause as-is
CREATE OR REPLACE PIPELINE wiki_pipeline AS
LOAD DATA S3
's3://wikipedia-video-game-data/video-game-embeddings(1).csv'
CONFIG '{"region":"us-west-1"}'
CREDENTIALS '{"aws_access_key_id": "", "aws_secret_access_key": ""}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE video_games
FORMAT CSV
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';

START PIPELINE wiki_pipeline FOREGROUND;
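
The FIELDS and ENCLOSED BY options in the pipeline describe a conventional quoted CSV layout. As a rough illustration only, the Python sketch below parses a made-up two-row sample with the standard csv module. The sample rows, the example.com URLs, and the bracketed vector text are assumptions for illustration and are not taken from the actual file.

```python
import csv
import io

# Hypothetical sample shaped like the table columns (id, url, paragraph, v).
# The real file's contents and vector encoding are not shown in this notebook.
sample = (
    '1,"https://example.com/a","Racing, karts, and items","[0.1, 0.2]"\r\n'
    '2,"https://example.com/b","A second paragraph","[0.3, 0.4]"\r\n'
)

# delimiter and quotechar mirror FIELDS TERMINATED BY ',' and ENCLOSED BY '"'.
# csv.reader recognizes '\r\n' line endings on its own; newline="" stops
# StringIO from translating them first.
reader = csv.reader(io.StringIO(sample, newline=""), delimiter=",", quotechar='"')
rows = list(reader)

for row in rows:
    print(len(row), row[2])
```

Commas inside a quoted field, such as the paragraph text in the first row, stay part of that field rather than splitting it into extra columns.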

Verify the data was loaded using the query below.

Wait for the pipeline to finish before running the COUNT query.

In [3]:

%%sql
SELECT COUNT(*)
FROM video_games;

There should be 40,027 rows in the video_games table when the PIPELINE is finished.

3. Create a full-text and a vector index.

Use the following SQL to create full-text and vector indexes on the video_games table. Indexes can improve query performance on large vector data sets. Refer to Vector Indexing for more information on vector indexes and CREATE TABLE for more information on full-text indexes.

In [4]:

%%sql
ALTER TABLE video_games ADD FULLTEXT ft_para(paragraph);

ALTER TABLE video_games ADD VECTOR INDEX ivf_v(v)
   INDEX_OPTIONS '{"index_type":"IVF_FLAT"}';
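
For intuition about the IVF_FLAT index type, the Python sketch below shows the general inverted-file idea: vectors are grouped under their nearest centroid, and a query scans only the closest groups instead of the whole table. This is a toy under stated assumptions, not SingleStore's implementation. The function names, the hand-picked centroids, the nprobe parameter, and the use of Euclidean distance are all illustrative choices.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def search_ivf(query, vectors, centroids, lists, nprobe=1, k=2):
    """Scan only the nprobe closest lists, then rank those candidates exactly."""
    probe = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: dist(query, vectors[vid]))[:k]

# Toy 2-D data; centroids are fixed by hand rather than learned from the data.
vectors = {1: (0.0, 0.1), 2: (0.2, 0.0), 3: (5.0, 5.1), 4: (5.2, 4.9)}
centroids = [(0.0, 0.0), (5.0, 5.0)]
lists = build_ivf(vectors, centroids)

print(lists)
print(search_ivf((4.9, 5.0), vectors, centroids, lists))
```

Scanning fewer lists is faster but can miss a true nearest neighbor that sits in an unprobed list, which is why indexed vector search is approximate.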

Optionally optimize the table for best performance.

Wait for the ALTER TABLE commands to finish before running the OPTIMIZE command.

In [5]:

%%sql
OPTIMIZE TABLE video_games FULL;

4. Similarity search.

Similarity search finds a set of vectors that are most similar to a query vector. This example finds vectors representing paragraphs that are similar to a vector about the Mario Kart game. It uses the vector for the first paragraph about Mario Kart as the query vector. This is a good semantic query vector for Mario Kart.

To find the vectors most similar to a query vector, use an ORDER BY… LIMIT… query. The ORDER BY clause arranges the rows by the similarity score produced by a vector similarity function; sorting in descending order puts the closest matches at the top.

The SQL below finds the three paragraphs that are most similar to the first paragraph about Mario Kart, which amounts to a semantic similarity search for information about Mario Kart.

In [6]:

%%sql
SET @v_mario_kart = (SELECT v FROM video_games
         WHERE URL = "https://en.wikipedia.org/wiki/Super_Mario_Kart"
         ORDER BY id LIMIT 1);

SELECT id, paragraph, v <*> @v_mario_kart AS SCORE
FROM video_games
ORDER BY score DESC
LIMIT 3;
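
The `<*>` operator in the query above computes a dot product, and ORDER BY score DESC LIMIT 3 keeps the three highest scores. The Python sketch below mimics that ranking on a few hand-made vectors. The ids and the 3-dimensional vectors are hypothetical stand-ins for the 1536-dimensional column v.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical unit-length vectors keyed by row id.
rows = {
    101: (1.0, 0.0, 0.0),
    102: (0.8, 0.6, 0.0),
    103: (0.0, 1.0, 0.0),
    104: (0.6, 0.0, 0.8),
}
query = rows[101]  # reuse one stored vector as the query, as the SQL does

# Score every row, sort by score descending, keep the top three.
scored = [(rid, dot(vec, query)) for rid, vec in rows.items()]
top3 = sorted(scored, key=lambda r: r[1], reverse=True)[:3]

print(top3)
```

Because the query vector is itself a row in the data, that row scores highest against itself and appears first in the result.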

5. Hybrid search.

Hybrid search combines multiple search methods in one query. It blends full-text search (which finds keyword matches) and vector search (which finds semantic matches), allowing search results to be re-ranked by a score that combines full-text and vector rankings.

In [7]:

%%sql
SET @v_mario_kart = (SELECT v FROM video_games
         WHERE URL = "https://en.wikipedia.org/wiki/Super_Mario_Kart"
         ORDER BY id LIMIT 1);

WITH fts AS (
 SELECT id, paragraph,
   MATCH(paragraph) AGAINST("mario kart") AS SCORE
 FROM video_games
 WHERE MATCH(paragraph) AGAINST("mario kart")
 ORDER BY SCORE desc
 LIMIT 200
),
vs AS (
 SELECT id, paragraph, v <*> @v_mario_kart AS SCORE
 FROM video_games
 ORDER BY score DESC
 LIMIT 200
)
SELECT vs.id, SUBSTRING(vs.paragraph,0,25),
     FORMAT(IFNULL(fts.score, 0) * .3
            + IFNULL(vs.score, 0) * .7, 4) AS score,
     FORMAT(fts.score, 4) AS fts_s,
     FORMAT(vs.score, 4) AS vs_s
FROM fts FULL OUTER JOIN vs ON fts.id = vs.id
ORDER BY score DESC
LIMIT 5;
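
The final SELECT above blends the two candidate lists with weights of 0.3 (full-text) and 0.7 (vector), treating a missing score as 0. The Python sketch below reproduces that arithmetic on made-up scores. The function name hybrid and the sample values are assumptions for illustration, not results from the video_games table.

```python
def hybrid(fts, vs, w_fts=0.3, w_vs=0.7, limit=5):
    """Blend two {id: score} maps; an id missing from one side scores 0 there."""
    ids = set(fts) | set(vs)  # union of ids plays the role of FULL OUTER JOIN
    blended = {i: w_fts * fts.get(i, 0.0) + w_vs * vs.get(i, 0.0) for i in ids}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)[:limit]

# Hypothetical scores keyed by row id.
fts = {1: 1.0, 2: 0.5}   # keyword matches only
vs = {2: 0.9, 3: 0.8}    # semantic matches only

ranked = hybrid(fts, vs)
for row_id, score in ranked:
    print(row_id, round(score, 4))
```

Row 2 appears in both lists and so collects both weighted contributions, which is how a result that is good on both measures can outrank one that is strong on only one.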

6. Clean up.

The commands below drop the pipeline and the table created as part of this notebook. Dropping them allows you to rerun the notebook from the beginning.

In [8]:

%%sql
DROP PIPELINE wiki_pipeline;

DROP TABLE video_games;

Details

About this Template

Example of similarity search over vector data and a hybrid search that combines full-text search with an indexed vector search.

This Notebook can be run in Shared Tier, Standard and Enterprise deployments.

License

This Notebook has been released under the Apache 2.0 open source license.
