The database market is seeing a proliferation of specialty vector databases.
People who buy these products and plumb them into their data architectures may find initial excitement with what they can do with them to query for vector similarity. But eventually, they will regret bringing yet another component into their application environment.
Vectors and vector search are a data type and query processing approach, not a foundation for a new way of processing data. Using a specialty vector database (SVDB) will lead to the usual problems we see (and solve) again and again with our customers who use multiple specialty systems: redundant data, excessive data movement, lack of agreement on data values among distributed components, extra labor expense for specialized skills, extra licensing costs, limited query language power, programmability and extensibility, limited tool integration, and poor data integrity and availability compared with a true DBMS.
Instead of using a SVDB, we believe that application developers using vector similarity search will be better served by building their applications on a general, modern data platform that meets all their database requirements, not just one. SingleStoreDB is such a platform.
SingleStoreDB
SingleStoreDB is a high-performance, scalable, modern SQL DBMS and cloud service that supports multiple data models including structured data, semi-structured data based on JSON, time-series, full text, spatial, key-value and vector data. Our vector database subsystem, first made available in 2017 and subsequently enhanced, allows extremely fast nearest-neighbor search to find objects that are semantically similar, easily using SQL. Moreover, so-called "metadata filtering" (which is billed as a virtue by SVDB providers) is available in SingleStoreDB in far more powerful and general form than they provide — simply by using SQL filters, joins and all other SQL capabilities.
The beauty of SingleStoreDB for vector database management is that it excels at vector-based operations and it is truly a modern database management system. It has all the benefits one expects from a DBMS including ANSI SQL, ACID transactions, high availability, disaster recovery, point-in-time recovery, programmability, extensibility and more. Plus, it is fast and scalable, supporting both high-performance transaction processing and analytics in one distributed system.
SingleStoreDB Support for Vectors
SingleStoreDB supports vectors and vector similarity search using dot_product (for cosine similarity) and euclidean_distance functions. These functions are used by our customers for applications including face recognition, visual product photo1 search and text-based semantic search [Aur23]. With the explosion of generative AI technology, these capabilities form a firm foundation for text-based AI chatbots.
The SingleStore vector database engine implements vector similarity matching extremely efficiently using Intel SIMD instructions.
Getting Vectors Into SingleStoreDB
To insert vectors into SingleStoreDB for use with the DOT_PRODUCT() function, you'll need to create a table with BLOB-typed columns to store the vectors. You can use the JSON_ARRAY_PACK() function to easily insert properly formatted vectors into the table. Here's a step-by-step example:
Create a table with two BLOB-typed columns:
CREATE TABLE dp_t (a BLOB, b BLOB);
Insert your vectors into the table using the JSON_ARRAY_PACK() function:
INSERT INTO dp_t VALUES (JSON_ARRAY_PACK('[0.7, 0.2, 1.7]'),
JSON_ARRAY_PACK('[1.0, 0.5, 2.0]'));
At this point, the table dp_t contains two binary vectors in columns a and b. You can use the JSON_ARRAY_UNPACK() function to return the table elements in JSON format for viewing:
SELECT JSON_ARRAY_UNPACK(dp_t.a), JSON_ARRAY_UNPACK(dp_t.b) FROM dp_t;
Now, you can use the DOT_PRODUCT() function to compute the dot product of the vectors in the table:
SELECT DOT_PRODUCT(dp_t.a, dp_t.b) FROM dp_t;
Note that you can specify the datatype of the vector elements by adding a suffix to the function (e.g., _I8, _I16, _I32, _I64, _F32, or _F64). The default vector element type is F32.
You can also insert vectors in binary format directly from your application, or using UNHEX to convert hex strings to binary.
High-dimensional vectors derived from deep neural networks are available for many use cases. For example, for semantic text search and chatbots you can use OpenAI's embeddings API.
Nearest-Neighbor Search in SQL
Nearest-neighbor search can be done in SQL in SingleStoreDB with an ORDER BY/LIMIT query that uses vector similarity functions to get a nearness metric to order by. For example, suppose we have this table of filenames of face images with their associated vector (in this case obtained using the Facenet system):
create table people (filename varchar(255), vector blob);
To get the top five closest matches to a query image, we can do the following:
/* get the query vector (first photo of Emma Thompson) */
select vector
into @v
from people
where filename = "Emma_Thompson/Emma_Thompson_0001.jpg";
/* get the 5 nearest neighbors based on cosine similarity */
select filename, dot_product(vector, @v) as score
from people
order by score desc
limit 5;
Results:
+--------------------------------------------------------------------------+--------------------+
| filename | score |
+--------------------------------------------------------------------------+--------------------+
| Emma_Thompson/Emma_Thompson_0001.jpg | 0.9999999403953552 |
| Emma_Thompson/Emma_Thompson_0002.jpg | 0.6130160093307495 |
| Maria_Soledad_Alvear_Valenzuela/Maria_Soledad_Alvear_Valenzuela_0003.jpg | 0.6115712523460388 |
| Emma_Thompson/Emma_Thompson_0003.jpg | 0.583335280418396 |
| Benazir_Bhutto/Benazir_Bhutto_0005.jpg | 0.5819507837295532 |
+--------------------------------------------------------------------------+--------------------+
Here are the query photo and the nearest match:
In a test described in our recent blog, a query similar to the second query took only 0.005 seconds on a 16 vcpu machine to process 16M vectors. This shows that exact nearest neighbor search on large data sets with interactive response time is achievable today with SingleStoreDB.
Hybrid Nearest-Neighbor/Metadata Vector Search in SQL
Hybrid search based on vector nearness and descriptive properties is easy in SingleStoreDB, because all the query capabilities of SQL are available. For example, suppose you have a table:
create table comments(id int, comment text, vector blob,
category varchar(64));
Suppose these are some available categories: "enthusiastic agreement", "agreement", "neutral" and "disagreement". To find the top 100 matches to a query vector @v (only considering categories that are about positive "agreement") you can write this SQL query:
select id, comment, category, dot_product(@v, vector) as score
from comments
where category in ("agreement", "enthusiastic agreement")
order by score desc
limit 100;
More sophisticated metadata filtering examples are possible using joins, subqueries and more.
Vector Joins
Since SingleStoreDB supports joins, you can do set-based nearest-neighbor search. For example, you can create a table:
create table query_text_blocks(id int, block text, vector blob);
This might contain, say, 10 text blocks of interest, and you want to retrieve the top 50 matches for any of these in a single query. There's no need to write ten separate queries, one for each block. You can use a join (in this case a cross join) instead, e.g.:
select c.id, q.id, dot_product(c.vector, q.vector) as score
from comments c, query_text_blocks q
order by score desc
limit 50;
Order of Operations for K-Nearest-Neighbor Search
SingleStoreDB can do fast K-Nearest-Neighbor search with ORDER BY/LIMIT K queries using dot_product and euclidean_distance metrics, combined with arbitrary SQL for metadata filtering. The SingleStoreDB query optimizer will determine an efficient plan for doing this.
For example, metadata filters can be done using indexes (such as hash indexes and skip lists), and more expensive vector comparison functions can be applied only to the remaining rows. Also, with SingleStoreDB columnstore tables, fast vectorized execution (not to be confused with vector similarity search), segment elimination and operations on encoded data enable efficient metadata filters without any indexes.
SingleStoreDB also will reorder filters at runtime to put the most selective and least-expensive filters first. For example, for this filter:
t.msg LIKE "%Tampa%" and dot_product(t.vector, @query_vector) > 0.1
If the LIKE "%Tampa%" filter is highly selective (meaning few rows qualify), it will do that first and not do the dot_product at all. However, if almost all msg values contain "Tampa" and the dot_product(...) > 0.1 filter is highly selective, it will reorder the filters to do that one first.
Reranking results with DOT_PRODUCT and MATCH
To rerank semantic search results in SingleStoreDB while having some additional context and tagging, users can leverage a combination of vector and full-text search features. Here's an example to help explain the process:
CREATE TABLE articles(
id INT UNSIGNED,
title VARCHAR(200),
body TEXT,
vector blob,
tags VARCHAR(200),
SORT KEY (id),
FULLTEXT(title, body, tags)
);
Now, let's say the user wants to filter and reorder the results based on a search query and additional context. This can be done by using MATCH and DOT_PRODUCT functions.
Here's an example query that looks for articles that match a certain term and then reranks them based on their vector similarity to a constant vector:
WITH search_results AS (
SELECT *, MATCH(title, body, tags) AGAINST ('search_terms') as
match_score
FROM articles
WHERE MATCH(title, body, tags) AGAINST ('search_terms')
),
reranked_results AS (
SELECT *, DOT_PRODUCT(vector, @vector_to_compare") AS similarity_score
FROM search_results
)
SELECT *
FROM reranked_results
ORDER BY similarity_score DESC;
This query first retrieves the search results that match the given term, and then it calculates the similarity score using the DOT_PRODUCT function with the constant vector. Finally, the query orders the results based on the similarity score.
This example demonstrates the use of SingleStoreDB features like MATCH for full-text search and DOT_PRODUCT for vector similarity calculation.
The Story of a Chatbot Startup: Getting Accurate Results in a Question-Answering Generative AI with GPT
This story is inspired by real events.
Once upon a time, there was a startup building finely tuned bots to aid developers with highly technical content. It required a system that could perform the following tasks:
- Rapidly process and convert semi-structured data into vectors
- Employ similarity matching to find locally indexed documents that match user inquiries
- Enhance the matching results with additional context and re-sort them
- Transmit the context to GPT-4, receive the generated response and present it to the user
The basic operational flows for this application are listed in the following diagram. However, the key factor that distinguishes this system is not merely the similarity matching capability, but also the ability to enrich the matching results with supplementary information — ultimately re-sorting the outcomes and obtaining the most accurate answer from GPT.
Challenges with a Specialty Vector Database
Initially, the startup was using an SVDB but soon realized it had its limitations. The SVDB could only provide similarity results for a specific text or question and had a very small amount of tags that each embeddings can store, whereas the startup's approach required iterating at scale and re-ranking frequently. For instance, being able to rank based on a user's specific context (like asking a question about a particular version of software) was a crucial feature for providing personalized support to developers.
As their data architecture became more complex, they had to supplement the SVDB with an ElasticSearch database. User feedback and events were stored in PostgreSQL, and fed into ElasticSearch to refine the ranking. Essentially, the SVDB became an (expensive) feature of a database.
The issue with relying solely on the SVDBs vector similarity capability was that it could only provide a record ID and a score, which was insufficient for delivering accurate results. The records had to be matched with Elasticsearch to provide a more precise version with context, which could then be sent to GPT.
Solution and Benefits with SingleStoreDB
To deliver more accurate results at a lower cost per question answered, the startup required a streamlined architecture that supported semantic search and matching with reranking and refinement, which necessitated analytics.
SingleStoreDB offered the optimal solution, as it provided superior performance for processing and analyzing semi-structured data such as JSON. SingleStoreDB can also index text, store and match vectors, and re-rank and refine matching results based on additional context.
A Thought Experiment on Product Roadmaps
We're not disclosing any specific product roadmap items here. But let's think about the roadmaps for SingleStoreDB vs. the SVDBs who address vector search application needs today. With respect to vector database applications, what do you think SVDB customers are asking for? We believe they ask for things around two broad areas:
Query processing, programmability and ingest:
- 10x+ speedup for bulk load of vectors
- Real-time ingest
- Fast export
- High-level-language extensibility to be executed as stored procedures or user-defined functions (UDF)
- SQL or the equivalent (rich query language, rich update DML)
- Query optimization
- High-performance select/filter/join/aggregation
- Separation of storage and compute
- Elastic scaling
Data integrity, availability and manageability:
- ACID transactions
- High availability
- Disaster recovery
- Point-in-time recovery
- A cloud service on all three major cloud providers
We could go on, but you get the point.
What do you think SingleStore's customers are asking for with respect to vector databases? One might posit that there are only a few items:
- Built-in support for approximate nearest-neighbor (ANN) search using an indexed approach
- Deeper partner integrations with popular LLM tools
- Typed vectors
- Syntactic sugar for handling vectors with less keystrokes
Who do you think will finish first?
By the way, SingleStoreDB can already do exact nearest-neighbor search incredibly fast via efficient, indexed metadata filtering, distributed parallel scans and SIMD. You can also do ANN search in a way that does not require searching all vectors — with a little extra work — by creating clusters, and only examining vectors in clusters nearby a query vector. Also, most partner integrations can be easily built by the customer on top of SingleStoreDB because they are client application-side integrations that use partner services, and then just interact with the DBMS via SQL.
What can SingleStoreDB do to enable your vector database applications? Try it for free in the cloud or self-hosted today, and find out.
References
[SDB23] SingleStoreDB Vector Functions
[Aur23] P. Aurora, E. Hanson, and C. Malassa, AI-Powered Semantic Search in SingleStoreDB, SingleStore Blog, April, 2023.
[Han23] E. Hanson and S. Gupta, Image Matching in SQL with SingleStoreDB, January, 2023.
[SDB22] Demo: Recognizing Faces Using a SQL Database, YouTube video, December, 2022.