Vector Search with Matryoshka Embeddings


Introduction

Matryoshka Representation Learning (MRL) is named after those famous nesting dolls 🪆. Crack one open and there's a smaller one inside, and another inside that, and another. The innermost doll is a miniature version of the outermost doll.

MRL embeddings work the same way. A 1024-dimension MRL vector contains a 128-dimension vector, which contains a 64-dimension vector. Each vector is itself a complete embedding that captures much of the meaning of the original vector. 

By leveraging this beautiful mathematical property, we get dramatically smaller indexes, faster searches, and recall that stays surprisingly close to the original. Combined with SingleStore's F16 vector support, MRL lets you shrink vector index memory and build better AI applications without paying for it in accuracy.

Our tests show that building indexes on 128-dimension vector prefixes of a 1024-dimension vector and using two-stage search with reranking reduces memory by up to 87% and increases throughput by up to 6.6x, while still preserving recall.

 

| Index Configuration | Memory Reduction | Throughput (QPS) Improvement | Recall@5 |
|---|---|---|---|
| 128-dim, IVF_FLAT | 87% less | 6.6x faster | 96.0% (-1.8%) |
| 128-dim, HNSW_FLAT | 82% less | 1.2x faster | 94.2% (-1.2%) |

Table 1: Memory, Queries per Second (QPS), and Recall of MRL Sub-Vector Indexing versus Full-Vector Indexing (1024-dim).

 

Maintaining fast, accurate search for large vector data sets is challenging. Vector indexes are necessary to avoid exhaustive, expensive k-nearest neighbors (kNN) search. High-dimension vector indexes with uncompressed vectors consume large amounts of memory, while indexes with compressed vectors, often using product quantization (PQ), reduce memory usage but lead to a drop in accuracy.

MRL provides an alternative that preserves high-accuracy search while using significantly less memory. MRL embeddings store different levels of detail inside a single embedding with most of the important information stored in the first few dimensions. 

When you need reduced memory without a significant loss in recall, or when an IVF_PQFS index is not providing sufficient recall, consider using an MRL embedding, an IVF_FLAT or HNSW_FLAT index, and a two-stage vector search.

What is Matryoshka Representation Learning?

Matryoshka Representation Learning (MRL) is a training technique for embedding models. Traditional embedding models produce fixed-length vectors where every dimension matters. Truncating a standard 1024-dimension embedding to 128 dimensions discards critical information and degrades search quality.

MRL uses a different approach. An MRL-trained model produces a full-size vector, and it is trained to produce accurate representations at smaller sizes as well. The result is a single embedding that works at multiple scales. The full 1024 dimensions capture the finest detail, but the first 128 or 64 dimensions alone can capture meaningful semantics. 

As a result, vector indexes built on the first N dimensions of an MRL embedding produce relatively accurate results. In contrast, traditional embedding models require indexing the full vector, and truncating leads to loss of information.

MRL embeddings enable a two-stage search strategy. 

  1. Search a sub-vector index, built on the first N dimensions, to quickly retrieve a set of candidates. 
  2. Re-rank those candidates using the full-dimension vectors for precision to obtain the final result set. 

The sub-vector index is smaller and faster to search, while re-ranking preserves the accuracy of the full embedding.
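The two stages above can be sketched in a few lines of NumPy. This is a toy illustration, not SingleStore's implementation: random vectors stand in for MRL embeddings (they lack MRL's prefix property), and a brute-force dot product stands in for the ANN sub-vector index, but the candidate-then-rerank mechanics are the same.

```python
import numpy as np

# Toy corpus: 10,000 normalized 1024-dim vectors standing in for MRL embeddings.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 1024)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Stage-1 data: the first 128 dimensions of each vector, re-normalized.
sub = corpus[:, :128].copy()
sub /= np.linalg.norm(sub, axis=1, keepdims=True)

def two_stage_search(query, k=5, candidates=100):
    """Stage 1: rank by dot product on the 128-dim prefixes (a brute-force
    stand-in for the sub-vector index) and keep `candidates` rows.
    Stage 2: re-rank those candidates with the full 1024-dim vectors."""
    q_sub = query[:128] / np.linalg.norm(query[:128])
    cand = np.argsort(sub @ q_sub)[-candidates:]          # candidate row IDs
    order = np.argsort(corpus[cand] @ query)[-k:][::-1]   # re-rank, best first
    return cand[order]

top5 = two_stage_search(corpus[42])  # query with a corpus vector; row 42 ranks first
```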

MRL Compared to PQ

MRL and PQ are both methods for getting smaller vector indexes while limiting the recall loss. The following table compares MRL and PQ on several aspects.

Note: SingleStore does not recommend using MRL with PQ indexes, as the combination leads to reduced recall. Indexing with non-PQ indexes on prefixes of an MRL embedding can reduce index memory usage on its own, eliminating the need for the compression obtained with PQ.

 

| Aspect | MRL | PQ |
|---|---|---|
| Availability in SingleStore | MRL can be implemented on SingleStore using IVF_FLAT or HNSW_FLAT indexes with a two-stage search. | SingleStore provides IVF_PQFS indexes out of the box. |
| Method of Data Reduction | Builds an index on the first N dimensions of an MRL embedding. | Compresses vectors and builds an index on the compressed vectors. |
| Data Reduction | Reduces data size based on the number of dimensions used: 128 of 1024 dimensions gives 8x compression; 64 of 1024 dimensions gives 16x compression. | Provides 8x-32x compression depending on the configuration. |
| Recall | Recall is expected to degrade less with MRL than with PQ; MRL delivers higher recall at comparable memory savings. | PQ compresses vectors, which results in some recall degradation. |
| Models | Popular models that have adopted MRL: OpenAI text-embedding-3-small/large, nomic-embed-text-v1.5, Jina Embeddings v3, mrl-resnet50. | Can theoretically use any model. |

 

When to Use Which Index

IVF_PQFS: IVF_PQFS indexes are built into SingleStore. Use IVF_PQFS if memory usage is a key consideration and some recall loss is acceptable, or if your embeddings do not come from an MRL model. If IVF_PQFS is not providing sufficient recall, consider tuning its parameters or using MRL.

MRL: Use MRL embeddings and a two-stage search if memory usage is a consideration and recall is of high importance. In addition, if the embedding model supports MRL, use the two-stage search with an HNSW_FLAT or IVF_FLAT index built on a reduced number of dimensions. 

HNSW_FLAT: HNSW_FLAT is built into SingleStore. Use HNSW_FLAT if recall is of utmost importance and memory usage is less of a consideration. (HNSW_FLAT does not compress the vectors so the index takes as much memory as the vectors themselves.)

Performance Experiments

We tested the performance of MRL embeddings with the two-stage sub-vector search strategy on a 10-million row dataset. 

Our experiments examine how indexing only the first 64 and 128 dimensions of an MRL embedding and re-ranking with the full 1024-dimension vector affects memory consumption, throughput, and recall using FLAT indexes. 

MRL uses a two-stage search which first searches a sub-vector index to retrieve a set of candidates, then re-ranks those candidates using the full-dimension vectors.

The two-stage search can be accomplished with a straightforward SQL query.

In the following query:

  1. The inner query searches the vector index (ivfflat_64) to retrieve 100 candidates from vector_tbl. 
  2. The outer query re-ranks those candidates using the dot product function (@query_vec <*> vec) and outputs the top 5.
```sql
SELECT id FROM
( SELECT id, vec
  FROM vector_tbl
  ORDER BY @query_vec_64 <*> vec_64 USE INDEX(ivfflat_64) DESC
  LIMIT 100 ) AS candidates_list
ORDER BY @query_vec <*> vec DESC
LIMIT 5;
```

In our experiments, we tested 100 and 1000 candidates for the first stage of the search. We selected 100 candidates (20x the number of final results) as the starting point. A value of 1,000 candidates (10x the starting point) was used to test how recall and throughput behaved with a higher number of candidates.

Common practice recommends starting with 10-20x the number of final results and then adjusting the candidate count to balance recall and performance. Increasing the number of candidates improves recall but decreases throughput. In general, smaller sub-vectors require more candidates: smaller sub-vectors capture coarse semantics, while larger sub-vectors capture finer details. A 64-dim sub-vector provides less precise first-stage matches than a 128-dim sub-vector, so retrieve more candidates with smaller sub-vectors and fewer candidates with larger sub-vectors.
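As a rough illustration of this heuristic, the helper below derives a candidate count from the final result size and the sub-vector dimension. The 20x base and the scaling rule are assumptions chosen to reproduce the experiment's starting point (100 candidates at 128 dims for top-5 results), not a prescription:

```python
def candidate_count(k, sub_dim, full_dim=1024):
    """Heuristic sketch (assumed, not from the benchmark): start at 20x the
    number of final results, then scale up as the sub-vector shrinks, since
    coarser first-stage matches need a deeper candidate pool."""
    base = 20 * k
    # 128 of 1024 dims -> 1x scale; 64 of 1024 dims -> 2x scale.
    scale = max(1.0, (full_dim / sub_dim) / 8)
    return int(base * scale)

c128 = candidate_count(5, 128)   # 100, the experiment's starting point
c64 = candidate_count(5, 64)     # 200, deeper pool for the coarser sub-vector
```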

Test Setup 

Two types of indexes were tested: IVF_FLAT and HNSW_FLAT. The default index building parameters were used; the metric search parameters were set as follows:

HNSW_FLAT: {metric_type: DOT_PRODUCT, ef: 120}

IVF_FLAT: {metric_type: DOT_PRODUCT, nprobe: 20}

The max_vector_index_cache_memory_percent was set at 99%.   

 

The following table shows the test setup. Candidate List is the number of results retrieved in the first stage of the two-stage search.

 

| Parameter | Value |
|---|---|
| Dataset | https://huggingface.co/datasets/rasdani/cohere-wikipedia-2023-11-en (10M of 41.5M rows) |
| Embedding Model | sentence-transformers/static-retrieval-mrl-en-v1 (Hugging Face) |
| Base Dimensions | 1024 |
| MRL Sub-dimensions Used | 64 and 128 |
| Candidate List | 100 and 1000 |

 

Methodology 

For each configuration, the two-stage search retrieves a set of candidates (100 or 1000) using the sub-vector index, then re-ranks them against the full 1024-dimension vectors. 

  • Accuracy is measured as Recall@5 against kNN search.
  • Throughput is measured in queries per second (QPS) across 1,000 test queries on four concurrent threads. 
  • The DOT_PRODUCT metric was used for all similarity calculations.

The test database contained a table with 1024-dimension vectors. A random sample of 1,000 vectors was taken from that table to serve as test queries. Each of the 1,000 queries was run against the full 1024-dimension vectors using exact kNN. The top 5 results for each query were saved as the ground truth for recall calculation. 

The two-stage search was set up by creating sub-vectors (128- and 64-dimensions), normalizing these sub-vectors, and then creating an index for the sub-vectors.
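The truncate-and-normalize preparation can be expressed compactly outside the database as well. The sketch below is a NumPy analogue of the SQL steps detailed in the Appendix (the function name is ours), useful for prototyping sub-vectors before loading them:

```python
import numpy as np

def make_subvectors(vecs, n_dims):
    """Truncate full embeddings to their first n_dims dimensions and
    re-normalize each row -- the Python analogue of the SQL
    vector_subvector + normalize steps in the Appendix."""
    sub = np.asarray(vecs, dtype=np.float32)[:, :n_dims].copy()
    return sub / np.linalg.norm(sub, axis=1, keepdims=True)

vecs = np.random.default_rng(1).standard_normal((1000, 1024))
sub_128 = make_subvectors(vecs, 128)   # shape (1000, 128), unit-norm rows
```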

Full methodology details, including table schemas, SQL examples, and index configurations, are provided in the Appendix.          

Results

Memory Reduction

The MRL sub-vector indexes reduce memory use by 82%-93% versus uncompressed indexes.

Using 64 dimensions, IVF_FLAT memory use drops from 38.0 GB to 2.52 GB, a 93% reduction, and HNSW_FLAT drops from 41.5 GB to 4.88 GB, an 88% reduction. 

Using 128 dimensions, IVF_FLAT reduces memory 87% (38.0 GB to 5.0 GB), and HNSW_FLAT reduces memory 82% (41.5 GB to 7.33 GB).

Memory consumption increases with sub-vector dimensionality. 

 

 

QPS and Recall@5

The table shows the throughput (QPS) and accuracy (Recall@5) for:

  • ANN search on FLAT indexes on 1024-dimension vectors.
  • MRL using 64-dimension sub-vectors and 100 and 1000 candidates.
  • MRL using 128-dimension sub-vectors and 100 and 1000 candidates.

The columns "QPS vs Baseline" and "Recall vs Baseline" compare the MRL approach against the 1024-dimension ANN FLAT index (baseline) of the same index type.

 

 

| Dim | Index Type | Candidates | QPS | Recall@5 | QPS vs Baseline | Recall vs Baseline |
|---|---|---|---|---|---|---|
| *1024-Dim ANN Search* | | | | | | |
| 1024 | IVF_FLAT | - | 11.3 | 97.80% | - | - |
| 1024 | HNSW_FLAT | - | 119.6 | 95.40% | - | - |
| *64-Dim MRL Sub-vector (Two-Stage Search)* | | | | | | |
| 64 | IVF_FLAT | 100 | 114.9 | 87.80% | 10.2x faster | -10% |
| 64 | IVF_FLAT | 1000 | 28.6 | 95.00% | 2.5x faster | -2.8% |
| 64 | HNSW_FLAT | 100 | 153.2 | 87.60% | 1.3x faster | -7.8% |
| 64 | HNSW_FLAT | 1000 | 28.7 | 95.00% | 4.2x slower | -0.4% |
| *128-Dim MRL Sub-vector (Two-Stage Search)* | | | | | | |
| 128 | IVF_FLAT | 100 | 74.9 | 96.00% | 6.6x faster | -1.8% |
| 128 | IVF_FLAT | 1000 | 26.9 | 97.70% | 2.4x faster | -0.1% |
| 128 | HNSW_FLAT | 100 | 142.8 | 94.20% | 1.2x faster | -1.2% |
| 128 | HNSW_FLAT | 1000 | 27 | 95.80% | 4.4x slower | +0.4% |

 

64-Dimension Sub-Vector 

The 64-dimension configuration delivers the highest raw speed improvement. With IVF_FLAT and 100 candidates, QPS increases from 11.3 to 114.9 (10.2×). HNSW_FLAT reaches 153.2 QPS, which is 1.3× faster than its 1024-dimension result of 119.6.

However, this comes with a drop in accuracy. Recall@5 falls to 87.8% (IVF_FLAT) and 87.6% (HNSW_FLAT) with 100 candidates, representing a decrease of 10% and 7.8%, respectively.

Increasing the candidate count to 1,000 improves recall to 95.0% for both index types. However, the speed advantage is reduced: IVF_FLAT becomes 2.5× faster than the 1024-dimension baseline, while HNSW_FLAT becomes 4.2× slower than the 1024-dimension baseline of 119.6.

128-Dimension Sub-Vector 

The 128-dimension configuration provides the best balance between speed, accuracy, and memory usage.

With IVF_FLAT and 100 candidates, QPS reaches 74.9 (6.6× faster than the 1024-dimension baseline), while Recall@5 is 96.0%, only 1.8% lower than the 1024-dimension result (97.8%).

Increasing the candidate count to 1,000 raises IVF_FLAT recall to 97.7%, within 0.1% of the 1024-dimension baseline, while still maintaining 2.4× higher throughput.

With HNSW_FLAT and 100 candidates, QPS reaches 142.8 (1.2× faster than the 1024-dimension result), with Recall@5 at 94.2%, only 1.2% lower than the 1024-dimension baseline (95.4%). Increasing to 1,000 candidates improves recall to 95.8% (0.4% higher than the 1024-dimension result), but QPS drops to 27, making it 4.4× slower. As with the 64-dim case, this happens because the re-ranking cost at 1,000 candidates outweighs the speed gained from the smaller sub-vector index.

Results Summary

The MRL methodology reduces memory 82%-93%. The different options provide different tradeoffs between memory use, throughput, and recall. 

  • Sub-vector dimension: Higher dimensions improve recall but reduce speed gains and increase index memory.

  • Candidate count: More candidates improve recall because more true positives survive Stage 1. But more candidates also means more re-ranking work in Stage 2, which reduces QPS.

    • The results confirm that smaller sub-vectors require more candidates for comparable recall. 

    • At 64 dimensions with IVF_FLAT indexes, increasing candidates from 100 to 1,000 raised Recall@5 from 87.8% to 95.0%, while at 128 dimensions, 100 achieved 96.0% Recall@5.

Conclusion

MRL embeddings paired with two-stage sub-vector search offer a practical way to reduce vector index memory use on SingleStore while maintaining accuracy and often improving throughput.

The 128-dimension configuration with the IVF_FLAT index delivers the best balance: it achieves 6.6x higher throughput and 87% lower memory usage while maintaining 96% Recall@5 which is within 2% of the full-vector baseline.

MRL and SingleStore's F16 vectors are complementary techniques that can be combined to further decrease memory usage and improve performance. MRL dramatically reduces the index size by allowing indexes to be built on fewer dimensions. F16 vectors cut vector storage requirements in half. Together, they reduce index memory and storage costs, while preserving recall and often improving performance.
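A back-of-envelope calculation shows how the two savings compose. These figures count raw vector bytes only; the actual index sizes reported above include additional index-structure overhead, so treat them as lower bounds:

```python
# Raw vector payload for 10M rows, before any index overhead.
n, full_dim, sub_dim = 10_000_000, 1024, 128

full_f32_gb = n * full_dim * 4 / 1e9   # 40.96 GB of float32 full vectors
sub_f32_gb  = n * sub_dim  * 4 / 1e9   #  5.12 GB: 8x smaller via the MRL prefix
sub_f16_gb  = n * sub_dim  * 2 / 1e9   #  2.56 GB: 16x smaller with F16 on top
```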

 

Save time, save money, and improve your vector search with MRL and F16 vectors on SingleStore.       

Appendix: Methodology 

This section provides a description of the benchmark methodology.

System Information

Python was used to run the benchmark.

 

| Component | Specification |
|---|---|
| OS | Oracle Linux 8.10 |
| Aggregator | 1 node: 4 vCPU, 8 GB RAM, 128 GB Disk |
| Leaves | 2 nodes: 4 vCPU, 30 GB RAM, 150 GB Disk |
| DB Partitions | 2 |
| Python Version | 3.11.11 |
| SingleStore Version | 9.0.18 |
 

Two-Stage Search Procedure

Stage 1 (Candidate Retrieval): An ANN search is performed on the sub-vector index to retrieve a set of candidates.

Stage 2 (Precision Re-ranking): The full 1024-dimension vectors of the candidate set are compared using exact kNN, and the final top-K results are returned.

Evaluation Metrics

  • Distance Metric: DOT_PRODUCT was used for all similarity calculations.

  • Recall@5 is the primary accuracy metric. It measures the overlap between the top-5 results from the two-stage search and the top-5 results from the baseline full-vector kNN search. 

Recall@5 = (number of matching results) / 5 × 100%.

  • QPS (Queries Per Second): 1,000 randomly sampled queries were run sequentially on 4 concurrent threads, and the total time was used to calculate throughput. 

QPS = 1000 / total elapsed time (seconds).
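The two metrics above reduce to a few lines of Python; the function names here are ours, but the formulas are exactly those stated:

```python
def recall_at_5(retrieved, ground_truth):
    """Percentage of the ground-truth top-5 IDs present in the retrieved top-5."""
    return 100.0 * len(set(retrieved) & set(ground_truth)) / len(ground_truth)

def qps(num_queries, elapsed_seconds):
    """Throughput: queries completed per second of wall-clock time."""
    return num_queries / elapsed_seconds

# The appendix's worked recall example: 4 of 5 IDs overlap.
r = recall_at_5([9999, 9991, 10031, 302874, 680405],
                [9999, 9991, 10031, 10008, 302874])  # -> 80.0
```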

Table Schema

The table used:

```sql
CREATE TABLE vector_tbl (
    id BIGINT NOT NULL,
    texts LONGTEXT,
    metadata JSON,
    vec VECTOR(1024),
    SORT KEY (id), SHARD KEY(id)
);
```

Query Generation

1,000 vectors were randomly sampled from the dataset to serve as test queries:

```sql
SELECT vec FROM vector_tbl ORDER BY RAND() LIMIT 1000;
```

Ground Truth (kNN) 

Each of the 1,000 queries was run against the full 1024-dimension vectors using exact kNN. The top-5 results for each query were saved as the ground truth for recall calculation:

```sql
SELECT id FROM vector_tbl
ORDER BY vec <*> @query_vec USE INDEX() DESC
LIMIT 5;
```

Full 1024-Dimension Vector Index Creation

ANN indexes were created on the full 1024-dimension vector column, then optimized to build cross-partition index segments:

```sql
ALTER TABLE vector_tbl ADD VECTOR INDEX ivfflat(vec)
  INDEX_OPTIONS '{"index_type":"IVF_FLAT",
  "metric_type": "DOT_PRODUCT", "nprobe":20}';
OPTIMIZE TABLE vector_tbl VECTOR_INDEX ivfflat;
```

Two-Stage Search Preparation

To enable the two-stage search, the full 1024-dimension embeddings are transformed into smaller sub-vectors that can be indexed separately. The process involves three steps.

Step 1: Create Sub-vectors

A new column is added to store the truncated sub-vector. The vector_subvector function extracts the first N dimensions from the full embedding:

```sql
ALTER TABLE vector_tbl ADD vec_64 VECTOR(64);
UPDATE vector_tbl SET vec_64 = vector_subvector(vec :>blob, 0, 64):>vector(64);
```
Step 2: Normalize Sub-vectors

After truncation, sub-vectors were re-normalized.

```sql
DELIMITER //
CREATE OR REPLACE FUNCTION normalize(v VECTOR(64))
  RETURNS VECTOR(64) AS
DECLARE
  squares VECTOR(64) = vector_mul(v,v);
  length FLOAT = sqrt(vector_elements_sum(squares));
BEGIN
  RETURN scalar_vector_mul(1/length, v);
END //
DELIMITER ;
UPDATE vector_tbl SET vec_64 = normalize(vec_64);
```
Step 3: Create Index for Sub-vectors

An ANN index was created on the sub-vector column, using the same index type and parameters as the full-vector index:

```sql
ALTER TABLE vector_tbl ADD VECTOR INDEX ivfflat_64(vec_64)
  INDEX_OPTIONS '{"index_type":"IVF_FLAT",
  "metric_type": "DOT_PRODUCT", "nprobe":20}';
OPTIMIZE TABLE vector_tbl VECTOR_INDEX ivfflat_64;
```
Recall Calculation Example

 

  • Full 1024-dimension kNN. 

```sql
SELECT id FROM vector_tbl ORDER BY vec <*> @query_vec USE INDEX() DESC LIMIT 5;
```

This query returns IDs: 9999, 9991, 10031, 10008, 302874.

  • Two-stage search (64-dim ANN with 100 candidates, re-ranked with full 1024-dimension) 

```sql
SELECT id FROM
( SELECT id, vec FROM vector_tbl
  ORDER BY @query_vec_64 <*> vec_64
  USE INDEX(ivfflat_64) DESC LIMIT 100 ) AS candidates_list
ORDER BY @query_vec <*> vec DESC LIMIT 5;
```

This query returns: 9999, 9991, 10031, 302874, 680405.

Four of the five results match, so Recall@5 = (4 / 5) × 100 = 80%.

QPS Calculation Example

If 1,000 queries completed in 13.4 seconds, then QPS = 1000 / 13.4 ≈ 74.6 queries per second.
