Exploring the World of Vector Databases

Kovid Rathee

Data and Infrastructure Engineer

In this article, you'll learn about vector databases and how they work. You'll explore their various use cases, challenges to using them and solutions to some of those challenges.

Exploring the world of vector databases

Traditional relational and NoSQL databases were developed around transactions and flexible schemas. Most can natively represent data in either a tabular or a key-value form, and some have been made to handle analytical workloads. This works well for database lookups and text-search queries. However, if you're searching for context and meaning, you need a different way of representing data.

Extracting meaning from natural language requires a different approach from the one traditional relational databases take: a different method of storing data as well as of interacting with and processing it.

Enter vector databases.

Let's dive right in!

Understanding vector databases

Just as a relational database stores data in rows and columns and a document database stores data in JSON documents, a vector database stores arrays of numeric values called vectors. These vectors, also called embeddings, are numeric representations of data such as words, sentences or paragraphs, arranged so that items with similar meanings get similar vectors.

Embeddings can be represented in a high-dimensional space to enable similarity and nearest-neighbor search patterns, which we'll explore in more detail later. First, let's look at how vector databases compare to relational databases in representing data.

Difference in data representation in relational databases and vector databases

The core ability to easily represent multidimensional vector space distinguishes a vector database from a relational database. Theoretically, you can implement a data model in your traditional relational database to mimic a vector database. However, it would be computationally very costly because the architecture of relational databases is not meant for fuzzy searches — i.e., searches that don't work on deterministic, rule-based queries but on finding entities with similar meanings.
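To make the cost argument concrete, here is a minimal sketch of mimicking a vector store on top of a relational engine. The `docs` table, the three-dimensional toy embeddings and the `most_similar` helper are all assumptions made up for illustration. Because the engine has no notion of similarity, every query has to read and score every row.

```python
import json
import math
import sqlite3

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical table: each embedding is stored as a JSON string.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, embedding TEXT)")
rows = [
    ("Annual leave policy", [0.9, 0.1, 0.0]),
    ("Expense reimbursement", [0.1, 0.9, 0.1]),
    ("Holiday calendar", [0.8, 0.2, 0.1]),
]
conn.executemany(
    "INSERT INTO docs (title, embedding) VALUES (?, ?)",
    [(title, json.dumps(vec)) for title, vec in rows],
)

def most_similar(query_vec, limit=2):
    """Full table scan: similarity is computed in application code for every row."""
    scored = []
    for title, raw in conn.execute("SELECT title, embedding FROM docs"):
        scored.append((cosine_similarity(query_vec, json.loads(raw)), title))
    scored.sort(reverse=True)
    return [title for _, title in scored[:limit]]

print(most_similar([1.0, 0.0, 0.0]))
```

With three rows this is instant, but the work grows linearly with the number of rows and with the vector dimension, which is the overhead a purpose-built vector index is meant to avoid.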

Similarity search works by calculating the distance between vectors in the vector space. When words, phrases and sentences are represented as vectors, their similarity is determined by checking whether they are near neighbors, using approaches like k-nearest neighbors (KNN), approximate nearest neighbor (ANN) search and Scalable Nearest Neighbors (ScaNN).
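As a rough illustration of the exact variant, the sketch below runs a brute-force k-nearest-neighbor lookup with Euclidean distance. The word vectors are invented three-dimensional values, not the output of a real embedding model, and `k_nearest` is a hypothetical helper name.

```python
import heapq
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(query, vectors, k):
    """Exact k-nearest-neighbor search: compare the query against every stored vector."""
    return heapq.nsmallest(k, vectors, key=lambda name: euclidean(query, vectors[name]))

# Toy embeddings, made up for this example.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "dog": [0.8, 0.3, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(k_nearest(embeddings["cat"], embeddings, k=2))
```

Approximate methods trade a little accuracy for speed by avoiding the full comparison that `k_nearest` performs here.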

Once you have a system to run queries in the vector space with the preceding algorithms, the gate to fuzzy and semantic searching opens.

Use cases for vector databases

Semantic search is the most fundamental and widely exploited use case for vector databases. It also lays the foundation for other use cases, such as natural language processing (NLP), computer vision and more.

Recommendation systems

Recommendation systems can be found everywhere these days: eCommerce stores, over-the-top content platforms, music apps, meditation apps and more. These recommendation systems require more than just keywords and tags to be accurate.

Recommending items like films, TV series, products, games, courses and videos is subjective, meaning that you need to cater to a particular person or group. The likelihood of correctly predicting what someone might like depends on the different types of data the recommendation system has about them.

This is where the ability of vector databases to store and retrieve high-dimensional data comes in. Approximate nearest neighbor (ANN) algorithms can search the vector space efficiently and surface items that sit close to what a person already likes.
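One simple way to picture this, sketched under the assumption that each catalog item already has an embedding: average the vectors of the items a person liked into a profile vector, then rank the remaining items by cosine similarity to it. The catalog values and the `recommend` helper are hypothetical, and the ranking here is exhaustive rather than approximate.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recommend(liked, catalog, top_n=2):
    """Rank unseen items by similarity to the average of the liked items."""
    liked_vecs = [catalog[name] for name in liked]
    profile = [sum(dim) / len(liked_vecs) for dim in zip(*liked_vecs)]
    candidates = [name for name in catalog if name not in liked]
    candidates.sort(key=lambda name: cosine_similarity(profile, catalog[name]), reverse=True)
    return candidates[:top_n]

# Toy item embeddings, made up for this example.
catalog = {
    "space documentary": [0.9, 0.1, 0.2],
    "sci-fi thriller": [0.8, 0.3, 0.1],
    "romantic comedy": [0.1, 0.9, 0.3],
    "astronomy course": [0.85, 0.05, 0.4],
    "cooking show": [0.2, 0.4, 0.9],
}

print(recommend(["space documentary", "sci-fi thriller"], catalog))
```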

Image and video recognition

Just as you can run a fuzzy search on text, you can run one on images and video, since they can also be represented as vectors.

Deep learning models such as convolutional neural networks (CNNs) can be used to generate vector embeddings from images, in the same way that recurrent neural networks (RNNs) and Bidirectional Encoder Representations from Transformers (BERT) generate embeddings from text.

Vectorizer libraries like img2vec and data2vec help convert images and videos into embeddings that can be stored in the vector space in the same way as word embeddings derived from text data sources. You can group images into different categories, compress images and more using these libraries.

You can also use many other x2vec libraries to vectorize different data types.

Anomaly detection

Anomaly detection is a direct application of vector databases. It works because embeddings of related items form clusters in the vector space, so values that fall outside these clusters might be considered anomalous.

If you perform anomaly detection in relational or time series data, you might calculate a moving average over multiple dimensions, say, network traffic, geolocation and so forth. Any value outside a predetermined threshold might be considered anomalous.

With vector data, the same idea applies in the embedding space: a point whose distance from the cluster of normal points exceeds a threshold is regarded as anomalous.
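A minimal sketch of such a distance threshold follows. It assumes a set of known-normal vectors and flags anything much farther from their centroid than they are on average. The `factor` value and the two-dimensional sample data are arbitrary choices for illustration, not recommended settings.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def is_anomalous(vector, normal_vectors, factor=3.0):
    """Flag a vector whose distance to the centroid exceeds
    `factor` times the mean distance of the normal points."""
    center = centroid(normal_vectors)
    mean_dist = sum(math.dist(v, center) for v in normal_vectors) / len(normal_vectors)
    return math.dist(vector, center) > factor * mean_dist

# Made-up feature vectors standing in for embeddings of normal activity.
normal_traffic = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0]]

print(is_anomalous([1.05, 0.95], normal_traffic))  # close to the cluster
print(is_anomalous([5.0, 4.0], normal_traffic))    # far from the cluster
```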

Anomaly detection can be used for fraud detection, security breach detection, medical disorder detection and more.

Bioinformatics

Analyzing biological data has always been challenging because of its high dimensionality. On top of that, many types of biological data are also time series, adding an additional layer of complexity as high-dimensional data changes rapidly over time.

Like regular images, medical images such as MRI or CT scans can be converted into embeddings, and domain-specific vectorization libraries can produce more accurate ones. Having this data in a vector space can help detect biological anomalies, leading to early detection of diseases and better treatment.

Natural language processing

One of vector databases' most important use cases is generating, processing and understanding natural language.

Consider a simple use case, such as looking up your company's intranet documentation. In most current implementations, it works by using tags, categories, descriptions, SQL-style query filters or full-text search powered by Elasticsearch-like inverted indexes. None of these methods understands the context of a search. Instead, they perform the search based on rules.
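The limitation of rule-based lookup is easy to reproduce with a toy inverted index. In this sketch the two documents and both helper functions are hypothetical: a query that shares tokens with a document matches it, while a query with the same intent but different wording returns nothing.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents that contain every query token."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

docs = {
    1: "how to request annual leave",
    2: "expense reimbursement policy",
}
index = build_inverted_index(docs)

print(keyword_search(index, "annual leave"))   # shares tokens with document 1
print(keyword_search(index, "vacation days"))  # same intent, no shared tokens
```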

By allowing you to perform semantic searches, vector databases open many other doors to natural language processing use cases like text summarization, translation and creative writing. Semantic search also forms the basis for conversational chatbots.

Specialty vector databases

The increased popularity of use cases that call for vector databases has given rise to specialty vector databases (SVDBs), which are purpose-built primarily for storing and processing vector data.

The jury is still out on whether an SVDB is better than a general-purpose multi-model database. If you're considering an SVDB, here are some challenges and potential solutions to keep in mind.

Challenges of SVDBs

The main challenges of using SVDBs relate to cost, effort and skills, with security also coming into play.

The critical feature of vector databases — high-dimensional data — soon becomes a problem for businesses. As the data's dimensionality increases, the vector space volume increases, which, in some cases, can make the vector space very sparse. Sparse embeddings are good for keyword-based searches, inverted indexes and other rule-based approaches to data — but not for semantic searches.

One direct implication of sparseness is that as the dimensionality of the data and the space it occupies grow, the computing power needed to process the data, and therefore its cost, grows steeply as well.
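One way to observe the effect of growing dimensionality is to measure how the nearest and farthest neighbors of a random query compare. The sketch below uses uniformly random points, which is an assumption made for illustration and not a model of real embeddings. `distance_contrast` is a hypothetical helper: a ratio near 0 means the nearest point clearly stands out, and a ratio near 1 means every point looks almost equally far away.

```python
import math
import random

def distance_contrast(dimensions, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dimensions)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)

for d in (2, 32, 512):
    print(d, round(distance_contrast(d), 3))
```

As the ratio approaches 1, simple distance comparisons discriminate less between candidates, which is part of why high-dimensional search needs more work per query.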

In addition to these direct costs, SVDBs can also add indirect costs as they require extra effort and new skill sets on your team.

The specialized nature of SVDBs makes it challenging to get them to interact with other types of databases and applications that don't support vector databases. This means you need to convert non-vector data to vector embeddings, requiring more effort and costs.

Using an SVDB means you'll need to broaden your engineering stack to include specialty testing, profiling and monitoring tools for vector data. Most major data quality tools like Great Expectations, Deequ and dbt are built on the foundation of databases, data warehouses, data lakes and data lakehouses, so they don't have a way to run tests on embeddings.

Traditional database-type indexing also doesn't work with vector databases. You need specialized indexing techniques such as Locality-Sensitive Hashing (LSH) and Hierarchical Navigable Small World (HNSW) graphs, often used through libraries such as Facebook AI Similarity Search (FAISS).

And lastly, because SVDBs are still at a nascent stage, they do not yet offer the privacy and security of traditional databases. This is an important consideration since many of the primary use cases for vector databases deal with highly sensitive data such as personally identifiable information and personal health information.
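To give a feel for how one of these techniques differs from a B-tree style index, here is a minimal sketch of the random-hyperplane variant of locality-sensitive hashing. Function names, the seed and the bit count are assumptions for illustration. Each vector is reduced to a short bit signature, and vectors pointing in similar directions tend to share a signature and land in the same bucket, so a query only needs to inspect its own bucket.

```python
import random

def make_hyperplanes(n_bits, dimensions, seed=42):
    """Draw one random hyperplane (normal vector) per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dimensions)] for _ in range(n_bits)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def build_buckets(vectors, hyperplanes):
    """Group vector names by signature so lookups touch a single bucket."""
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(name)
    return buckets

planes = make_hyperplanes(n_bits=8, dimensions=3)
vectors = {"a": [1.0, 2.0, 3.0], "a_scaled": [2.0, 4.0, 6.0], "opposite": [-1.0, -2.0, -3.0]}
print(build_buckets(vectors, planes))
```

Because the signature depends only on direction, `a` and `a_scaled` always share a bucket, while `opposite` falls on the other side of the planes. Production systems typically use several hash tables to reduce missed neighbors.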

Solutions and workarounds for the challenges of SVDBs

Vector data is here to stay, so discarding specialty vector databases as niche isn't an option.

However, in most cases, it's still better to use a general-purpose multi-model database that supports relational, time series, graph and vector data as your vector database. This means your team doesn't have to learn new skills, and you're saving your business additional licensing costs. Most importantly, it reduces the need for out-of-network data movement, which significantly reduces network costs and data security risks for your business.

If you do choose to use specialty vector databases, ensure that the database is built on open standards and integrates with various tools to extract the most value from your investment. Also be sure to apply the necessary data masking, de-identification and anonymization processes so you don't end up with a security and privacy nightmare.
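As one possible starting point for the masking step, the sketch below pseudonymizes identifying metadata fields with a keyed hash before a record is stored next to its embedding. The field names, the key handling and the `mask_record` helper are assumptions for illustration. It is not a complete de-identification process.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash so records can still be joined."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_record(record, secret_key, sensitive_fields=("patient_id", "email")):
    """Return a copy of the metadata with sensitive fields pseudonymized."""
    return {
        key: pseudonymize(str(val), secret_key) if key in sensitive_fields else val
        for key, val in record.items()
    }

# Hypothetical record; in practice the key would come from a secrets manager.
record = {"patient_id": "P-1001", "email": "jane@example.com", "embedding": [0.12, 0.58, 0.33]}
print(mask_record(record, secret_key=b"replace-with-a-real-secret"))
```

This only covers metadata stored alongside the vector. The source text or image should be de-identified before it is embedded, since the embedding is derived from that content.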

To save money on storage, network and computing, make sure that you choose the optimal file format for storing vector embeddings. Several vector-specific formats and wrappers exist, such as Magnitude.
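To show why the format matters, this sketch compares a JSON text encoding of a single embedding with a packed 32-bit float encoding from Python's standard `array` module. The 768-dimension vector is synthetic, and the helper names are hypothetical. It does not represent the Magnitude format or any other specific vector file format.

```python
import json
from array import array

def to_float32_bytes(vector):
    """Pack a list of floats as contiguous 32-bit floats."""
    return array("f", vector).tobytes()

def from_float32_bytes(blob):
    """Unpack bytes produced by to_float32_bytes back into a list of floats."""
    restored = array("f")
    restored.frombytes(blob)
    return list(restored)

embedding = [i / 1000 for i in range(768)]  # synthetic values
as_json = json.dumps(embedding).encode("utf-8")
as_binary = to_float32_bytes(embedding)

print(len(as_json), len(as_binary))
```

The binary form is smaller and avoids text parsing, at the cost of rounding each value to 32-bit precision, which is a trade-off to check against your accuracy requirements.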

Conclusion

With the rise of generative AI and natural language processing, the number of applications that require what vector databases have to offer (a native vector space representation of your data and the ability to query it in ever more powerful ways with semantic search) will only increase.

It's best to get ahead of the curve now and find ways to solve the challenges of adopting vector databases, such as increased costs, more effort and the need for more diverse skills.

At the moment, the best solution is not to adopt a specialized vector database but to opt for a general-purpose multi-model database that supports relational, time series, graph and vector data. If you do opt for a specialty vector database, follow the advice in this article to get the most value from the solution and to ensure the security and privacy of your data.

