The Power of SQL for Vector Database Operations, Part 1

Here’s how to use one SQL query to get the top K vector matches in each category.

At SingleStore, our position on vector database processing is that you should have all the benefits of a modern, full-featured DBMS available to you when working with vectors^[Han23]. This includes full SQL support.

We're working with a prospective user of SingleStoreDB for vector database operations related to an application with a semantic search component. He asked how we could do the following easily: find the top K items in each category. This is something he didn't have an easy time with in Milvus, a specialty vector database.

With Milvus, he was finding the categories in one query, looping through them and finding the top K elements for one category at a time with a separate query. This is not easily parallelizable, and requires more work from the application side than many would prefer.

Here's how you can do this in a single SQL query in SingleStoreDB:

/* Make some items in multiple categories, with associated
   vector embeddings. */

create table items(id int, category varchar(50), vector blob);
insert into items values
  (1, "food", json_array_pack('[0,0,0,1]')),
  (2, "food", json_array_pack('[0,0.5,0.3,0.05]')),
  (3, "food", json_array_pack('[0,0.5,0.2,0]')),
  (4, "facilities", json_array_pack('[0,0,1,0]')),
  (5, "facilities", json_array_pack('[0,0.6,0.1,0.05]')),
  (6, "facilities", json_array_pack('[0,0.4,0.3,0]'));

-- query vector
set @qv = json_array_pack('[0,0.4,0.3,0]');

-- get top 2 in each category using ranking
with scored as(
  select id, category, dot_product(vector, @qv) as score
  from items
),
ranked as (
select
  row_number() over(partition by category order by score desc)
    as rank, *
  from scored
)
select *
from ranked
where rank &lt;= 2
order by category, rank;

These are the results:

+------+------+------------+---------------------+
| rank | id   | category   | score               |
+------+------+------------+---------------------+
|    1 |    4 | facilities | 0.30000001192092896 |
|    2 |    5 | facilities | 0.27000001072883606 |
|    1 |    2 | food       |  0.2900000214576721 |
|    2 |    3 | food       | 0.25999999046325684 |
+------+------+------------+---------------------+

It’s important to note we're just focusing on ease of expression here, not performance. For this particular application, the scope is usually a few million vectors at most — so a full-scan, exact-nearest-neighbor approach is plenty fast.

When you choose a tool for vector processing for nearest-neighbor search applications like semantic search, chatbots, other LLM applications, face matching, object matching and more, we think it's a good idea to consider the power of the query language — and having full SQL available just makes things easier.