Dot product optimisation for similarity search

Hey :wave:

I have a large table of vectors, stored as JSON in a column-store. I need to compute the dot product between a single query vector and a subset of over 10k other vectors from this table (all vectors are normalised and have dimension 384). We compute the dot product over this large set of vectors to find the most similar one.

Currently, converting the JSON arrays of the vectors to binary with JSON_ARRAY_PACK() is efficient, but each DOT_PRODUCT() call takes around 0.5 to 1 second.
Is there a more efficient way to perform this operation? Do you have any pointers on indexing a set of vectors, or other suggestions for finding similar vectors?

Thanks!

You should store the vectors in a blob-type column as packed binary numbers, not in the JSON string format. Apply JSON_ARRAY_PACK() before you insert each vector, so it is stored in binary. DOT_PRODUCT() will then be much faster. E.g.

create table t(id int, v blob);

-- pack each JSON array into binary at insert time
insert into t values (1, json_array_pack('[0.5, 0.2, 0, 0.3]'));
insert into t values (2, json_array_pack('[0, 0.25, 0.4, 0.3]'));

-- pack the query vector once, then scan the table
set @qv = json_array_pack('[0,1,0,0]');
select id, dot_product(v, @qv) as sim_score from t;

+------+---------------------+
| id   | sim_score           |
+------+---------------------+
|    1 | 0.20000000298023224 |
|    2 |                0.25 |
+------+---------------------+

For only 10,000 vectors, you don’t need any indexing. It should be very fast.
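
If you already have a table with vectors stored as JSON text, a one-time backfill along these lines converts them in place. This is a minimal sketch; the table and column names (vectors_json, v_json, v_bin) are hypothetical stand-ins for your schema:

-- hypothetical: vectors_json(id, v_json) holds the JSON-string vectors
alter table vectors_json add column v_bin longblob;
update vectors_json set v_bin = json_array_pack(v_json);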


Hi @hanson !

Thank you for the help!

Testing your suggestion on the storage format, we get about a 50% reduction in query time, which is great!

In absolute terms: using the JSON format, we compute the dot_product of one vector against a set of 10,000 vectors in 379ms. If we store the vectors in a longblob as json_array_pack(vector), the time is reduced to 189ms.

Does this align with your performance expectations or do you think there is any other optimisation we could apply to reduce the time further?

Thank you!

189ms still sounds high, but it depends on the query and hardware. Make sure you have a good shard key to spread the rows evenly across partitions; you want all your cores involved in the parallel scan.
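
For example, a minimal sketch of the table DDL with an explicit shard key, assuming SingleStore's SHARD KEY clause and the column names used in this thread:

create table vectors (
  id int,
  v longblob,
  shard key (id)  -- rows are distributed across partitions by hash of id
);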

What is your full query syntax? How many rows does your query return?

Hi, @hanson!

Thanks for the further hint! We will look into optimising the shard key.

The test query that takes 189ms is below; for this test we return all results:

-- use an arbitrary stored vector as the query vector
select v from vectors limit 1 into @qv1;
select id, dot_product(v, @qv1) as sim_score from vectors;

In the actual full query we return only the top 10. When we adjust the test query to return only 10 rows, as below, the time drops to 60ms.

select id, dot_product(v, @qv1) as sim_score 
from vectors
order by sim_score desc
limit 10; 

Yeah, that makes sense. Returning a lot of rows takes time.

One trick to make that query faster is to pre-filter by a threshold, so not as many rows flow into the LIMIT (top-sort) operator at the top of the query plan. E.g.

select id, dot_product(v, @qv1) as sim_score 
from vectors
where sim_score > 0.1
order by sim_score desc
limit 10; 

You may have to play with the pre-filter level. I’ve used 0.1 before with success.
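
One note: referencing the sim_score alias in the WHERE clause works as an engine extension in some dialects; if yours rejects it, the same pre-filter can be written by repeating the expression:

-- portable variant: repeat the dot_product expression in the filter
select id, dot_product(v, @qv1) as sim_score
from vectors
where dot_product(v, @qv1) > 0.1
order by sim_score desc
limit 10;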
