Image Matching in SQL With SingleStoreDB

Vector functions in SingleStoreDB make it possible to solve AI problems, including face matching, product photo matching, object recognition, text similarity matching and sentiment analysis.

In this article, we’ll demonstrate how we use the dot_product function (for cosine similarity) to find a matching image of a celebrity from among 16 million records in just 5 milliseconds! And it's easy – SingleStoreDB does the heavy lifting of parallelization and SIMD-based vector processing for you so you can worry about your application, not your data infrastructure.

Other vector functions supported in SingleStoreDB include euclidean distance calculation, transforming JSON arrays to binary vectors, vector math and vector manipulation.

Want to see our YouTube video on this topic instead of reading about it? Check it out here.

Interest in vector similarity search has increased a lot over time as the AI community has made tremendous strides in representing images, words, documents and much more, with vectors derived from deep neural networks. Witness the increasing interest in the term "cosine similarity" indicated by Google Trends over the last ten years, as shown in Figure 1.

Figure 1. Search frequency trend for the term "cosine similarity" from 2013-2023

vector-matching-in-sql-with-single-store-dbVector Matching in SQL with SingleStoreDB

In SingleStoreDB, you can match images in SQL with vector similarity search. SingleStoreDB is a real-time, distributed SQL database that's great for unified transactions and analytics. It runs in the cloud, or self-hosted. Our vector similarity matching is incredibly fast because it's based on Intel SIMD instructions, plus it benefits from our native parallel execution architecture.

Why use a database to do vector similarity testing, rather than doing it in your application? So you can worry about your application logic and let someone else worry about how to run query logic in a fast, parallel way. 

SingleStoreDB applies filters and joins before vector matching operations if it makes sense. And, you can express the filters in SQL. That's not easy for an application developer to do themselves.

example-of-face-matching-with-sqlExample of Face Matching with SQL

We loaded 16,784,377 rows into this table:

create table people(
  id bigint not null primary key,
  filename varchar(255),
  vector blob
);

Each row represents one image of a celebrity, and contains a unique ID number, the file name where the image is stored and a 128-element floating point vector representing the meaning of the face. This vector was obtained using facenet, a pre-trained neural network for creating vector embeddings from a face image.

Don't worry, you don't need to understand the AI to use this kind of approach – you just need to use somebody else's pre-trained neural network, or any tool that can provide you summary vectors for an object.

Now, we query this table using:

select vector
into @v
from people
where filename = "Emma_Thompson/Emma_Thompson_0001.jpg";

select filename, dot_product(vector, @v) as score
from people where score > 0.1
order by score desc
limit 5;

The first query gets a query vector @v for the image Emma_Thompson_0001.jpg. The second query finds the top five closest matches:

Emma_Thompson_0001.jpg is a perfect match for itself, so the score is close to 1. But interestingly, the next closest match is Emma_Thompson_0002.jpg. Here are the query image and closest match:

Moreover, the search speed we obtained was truly incredible. The 2nd query took only 0.005 seconds on a 16 vcpu machine. And it processed all 16M vectors. This is a rate of over 3.3 billion vector matches per second. That's the beauty of our use of parallelism and Intel SIMD instructions to do the dot_product.

how-to-test-out-face-matching-for-yourself-in-single-store-dbHow to Test Out Face Matching For Yourself in SingleStoreDB

This demo is intriguing. But how do you get vectors to represent objects and store them in a SingleStoreDB database, so you can do something similar in a real application? The vectors need to come from application code, and must be inserted into a blob field of a table so you can use them with our dot_product() and euclidean_distance() functions.

We used facenet to get the images used in the database and queries we described. In your application environment you may be able to use facenet, or other vector-generation technologies from open source, third-party companies, or your own organization's AI team. One example of many is the article by Filip Saina on how to do this for a large collection of images using AWS tools.

Once you have a vector, you can insert it into SingleStoreDB — either directly as binary data (as packed 32-bit floating point when working with dot_product(), euclidean_distance()) or using our json_array_pack() function:

set @x = json_array_pack('[0,0,1,0]');
select length(@x); 
  ⇒ 16
select json_array_unpack(@x);
  ⇒ [0,0,1,0]

Keep reading to dive into the details of how to get vectors for images using facenet, and load or insert them into SingleStoreDB.

environment-setupEnvironment Setup

First, create a machine with Ubuntu 22.04 installed. You can also use a virtual machine running this Ubuntu version, using Docker.

On the Ubuntu machine, follow these steps to install the required software:

Install Python 3.9

sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.9
python3.9 --version

Install pip for Python 3.9

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo apt install python3.9-distutils
python3.9 get-pip.py

Install Tensorflow

python3.9 -m pip install tensorflow

Install Scipy

python3.9 -m pip install scipy

Install Facenet

python3.9 -m pip install facenet

Install imageio

python3.9 -m pip install imageio

Update facenet.py file

vi /home/ubuntu/.local/lib/python3.9/site-packages/facenet/src/facenet.py

Replace misc.imread with imageio.imread

Add the following lines to the file in the import section (at the beginning of script):

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

import imageio

Download the contents of this publicly-accessible Google Drive folder to:

/home/ubuntu/Face_matching/

Run the following command to generate vectors of the dataset:

python3.9 
/home/ubuntu/Face_matching/facenet/util/new_faces_to_vectors.py 
--inpath /home/ubuntu/Face_matching/images/lfw/ --outpath 
/home/ubuntu/Face_matching/images/lfw-out/lfw-out.log --mdlpath
/home/ubuntu/Face_matching/models/20170512-110547

A JSON file will be generated in outpath:

/home/ubuntu/Face_matching/images/lfw-out/lfw-out.log

You can convert this JSON file data into txt file format using the following commands. These commands contain the input file paths in the source code:

python3.9 /home/ubuntu/Face_matching/facenet/util/json_parse.py
python3.9 /home/ubuntu/Face_matching/facenet/util/merge.py

celebrity_data.txt would be generated in the home directory which contains all data  with : as field separator.

Create a table in a SingleStoreDB database to load image file names and vectors:

create table celebrity_face_vector(name text, vector text);

Load the data into the SingleStoreDB table:

LOAD DATA LOCAL INFILE "path to celebrity_data.txt" INTO TABLE
celebrity_face_vector  COLUMNS TERMINATED BY ':';

Run the following script to get a similarity score, and verify you have meaningful vectors in the file:

select vector
into @json_text_v
from celebrity_face_vector
where name = "Emma_Thompson/Emma_Thompson_0001.jpg";


select name,
dot_product(JSON_ARRAY_PACK(@json_text_v), JSON_ARRAY_PACK(celebrity_face_vector.vector)) AS similarity
from celebrity_face_vector
order by similarity desc
limit 5

To get high-performance matching, you'll need to convert the JSON string-based format of the vectors into a binary format, and store the binary format in a table. This is the approach used in the examples with the "people" table in the first part of this blog. You can use JSON_ARRAY_PACK to convert text (JSON array) vectors to binary, then insert the result in a table:

create table celebrity_face_vector2(name text, vector blob); 

insert into celebrity_face_vector2
select name, json_array_pack(vector)
from celebrity_face_vector;

You can modify the input data presented to the code in this section to get vectors for face images of your own. To explore further, see the Facenet code repo from Sandberg.

a-shortcut-using-pre-made-vectorsA Shortcut Using Pre-made Vectors

If you'd like to take a shortcut and query data for the celebrity face image files using pre-created vectors — so you don't have to download the necessary software and build them using the procedure described in the preceding section — download this file: celebrity_data.sql. It contains premade vectors for the celebrity photos, in convenient INSERT statements.

Then, at the singlestore command line, create this table:

create database facematch;
use facematch;
create table people (filename varchar(255), vector blob, shard(filename));

Now, source the file to create the rows. There are around 7,161 rows in the file.

source celebrity_data.sql;

Now, query the table to find close matches to a query image:

/* Find the 5 closest picture matches to the second picture of 
   Emma Thompson. */

set @v = (select vector 
      from people 
      where filename = "Emma_Thompson/Emma_Thompson_0002.jpg");

select filename, dot_product(vector, @v) as score
from people
order by score desc
limit 5;

Share