Full-Text Search vs. Semantic Search: The Good, Bad and Ugly

Zhou Sun

Director of Engineering, Query Execution

As a query engine engineer at SingleStore, I was fortunate enough to have both developed SingleStore’s full-text search engine, and initiated the vector/semantic search effort. I wanted to share my insights, and fact-check some widely held beliefs about semantic and full-text search.

First, Some Basics…

In full-text search, the emphasis is primarily on keyword matching. It retrieves documents or web pages that contain the exact keywords specified in the query. On the other hand, semantic search aims to understand the user’s intent behind a query by analyzing its semantics, context and the relationships between words or concepts. Full-text search typically treats each keyword independently, without considering the relationships between them or the overall context.

Semantic search, however, takes into account the context, synonyms, related terms and overall meaning of the query to retrieve more relevant results. It focuses on understanding the intent and meaning behind the query, rather than relying solely on specific keywords.

Fact Check: I Need Semantic Search to Build a Chatbot on My Own Data

Wrong. This is orthogonal to full-text versus semantic search. In fact, we built our very own chatbot, SQrL, using both semantic and full-text search. The core of this chatbot is very simple — for each user question, it searches a large number of documents, finds relevant information and feeds those documents to OpenAI’s Chat Completions API.

In other words, it doesn’t matter too much how you search the data. You can use semantic or full-text search, or even run some regular filters to get those documents. Here’s the flow for semantic search:

question → (model) → embedding → (vector search) → relevant docs

And here’s what it looks like without semantic search:

question → (tokenizer or model) → extracted terms / property filters → (full-text search) → relevant docs

Both work, and I am not convinced one is universally better than the other.
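To make this concrete, here is a minimal sketch of both flows as SQL against SingleStoreDB. The docs table, its columns and the toy three-dimensional embedding are all hypothetical — in practice, the question embedding comes out of your model, not a literal:

-- Hypothetical table holding the documents and their embeddings
CREATE TABLE docs (
    id BIGINT,
    content TEXT,
    embedding BLOB,     -- packed float32 vector produced by the embedding model
    FULLTEXT (content)  -- index used by the full-text flow
);

-- Semantic flow: embed the question with the model, then vector search
SET @question_embedding = JSON_ARRAY_PACK('[0.12, -0.03, 0.98]');
SELECT id, content
FROM docs
ORDER BY DOT_PRODUCT(embedding, @question_embedding) DESC
LIMIT 5;

-- Full-text flow: terms extracted from the question by a tokenizer
SELECT id, content
FROM docs
WHERE MATCH(content) AGAINST ('vector search performance')
LIMIT 5;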

Fact Check: Semantic Search Is Simply Better Than Full-Text Search

Totally wrong. Semantic search is not a silver bullet — and what’s more, you can’t treat your embedding model as a silver bullet either.

Depending on the use case, full-text search can be simple, reliable and efficient. A good example is log analysis. People usually look for exact keywords, patterns or even a specific object like a database name. Full-text search is guaranteed to find all the entries that match the search criteria, and works much better than semantic search here. A log entry looks like:

“38:46.030   TEST: Thread 115107 (ntid 693666, conn id -1):
ResolveTwoPCTransactionsForDb: db_0: Encountered 0 prepared attached”
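As a sketch of why this fits full-text search so well (assuming a hypothetical logs table with a FULLTEXT index on its message column), finding every entry for a specific database is one exact-keyword query:

-- Guaranteed to return every entry mentioning both terms
SELECT message
FROM logs
WHERE MATCH(message) AGAINST ('+db_0 +ResolveTwoPCTransactionsForDb');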

Fact Check: Semantic Search Is as Easy to Use as Full-Text Search

Yes and no.

It’s very easy to overlook the difficulty of a less familiar component. For semantic search (and from a database perspective), people usually ignore the machine learning part of the work. Believe it or not, an MLE’s job is no easier than a DBA’s.

You need the right segmentation of properly normalized data — run through the right model with the right training — to get reasonable semantic search results.

Fact Check: I Am Using OpenAI for Building My Chatbot, so I Should Also Use the OpenAI Embedding

No. You don’t need to use the same system for embedding and text completion. You could argue that OpenAI is so good that its embedding API is a no-brainer — and it is a good starting point, especially for a personal project. However, for an enterprise use case, there are a few things worth double-checking:

  1. The OpenAI API can be a lot more expensive than hosting your own model.
  2. It might not be a good idea to depend on a versioned external service. This is particularly important for semantic search, since the same model must be used for indexing and querying.
  3. The OpenAI embedding might not be the best or most suitable for your data. In fact, there is active research comparing embedding models, many of them available through Hugging Face. OpenAI’s Ada embedding is pretty good in general, but I wouldn't consider it state-of-the-art.

In short, enterprise use cases require machine learning expertise to evaluate, pick and train the best model (and that might be OpenAI, or not).

A Head-to-Head Comparison

Dataset and config

Here, we’re going to use the famous Yelp dataset. More specifically, we are focusing on the reviews table with ~7 million rows, and working with a very small cluster: 2 vCPUs and 16GB of memory.

Simple search workloads — those without complex join and aggregation query shapes — scale almost perfectly in a distributed system like SingleStoreDB. So for a larger dataset with tighter latency requirements, we can easily scale the cluster to hit the target performance.

To avoid introducing additional variables in this particular test, let’s work with the simplest vector search: exact KNN (k-nearest neighbors), done by simply storing the vectors in an execution-optimized format (similar to a Faiss flat index). Introducing an ANN (approximate nearest neighbor) search algorithm usually implies a larger index size, longer indexing time (on the order of hours) and faster search performance (with slightly worse accuracy). If you are interested in comparing ANN search across different index implementations, stay tuned for a future blog post.
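Concretely, exact KNN is just a full scan ordered by similarity — there is no index structure to build or tune on the vector side. A sketch of the setup (the column names are assumptions, not the exact test schema):

-- Reviews with embeddings stored as packed float32 blobs
CREATE TABLE reviews (
    review_id   VARCHAR(22),
    business_id VARCHAR(22),
    stars       FLOAT,
    text        TEXT,
    embedding   BLOB,   -- e.g., 384 dims for MiniLM, 1536 for OpenAI Ada
    FULLTEXT (text)     -- for the full-text side of the comparison
);

-- Exact KNN: scan every row, rank by dot product, keep the top 10.
-- @target_embedding is the query vector produced offline by the model;
-- a real one has hundreds of dimensions, not three.
SET @target_embedding = JSON_ARRAY_PACK('[0.1, 0.2, 0.3]');
SELECT review_id, text
FROM reviews
ORDER BY DOT_PRODUCT(embedding, @target_embedding) DESC
LIMIT 10;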

Index:

                 Full-text     Self-hosted embedding    OpenAI
Embedding time   N/A           4 hours*                 20 minutes**
Cost             N/A           ~$2                      ~$300
Index time       ~2 minutes
Size             2.6GB         21GB for encodings       81GB for encodings

*Using SingleStore Notebooks or Google Colab, with a GPU

**Based on a standard user's TPM (tokens per minute) limit, assuming batched, parallel requests

Query:

It’s really hard to evaluate and compare the query experience of semantic search. I am not an MLE and don’t have the bandwidth to label the dataset — but let’s try to define a smart quantitative approach. In addition to the reviews themselves, the dataset has other properties of each business. So we can form a query that implies certain business properties, run it as a semantic search over the reviews, and then check those properties against the results.

More precisely, this is our practice (which I believe is a good way to test any semantic search model): run the semantic search, then use the structured properties to validate the results.

Examples:

Semantic:

“Affordable Italian pizza with good review”

SQL validation:

-- target_embedding: the query vector for "Affordable Italian pizza with good review"
SELECT name, categories LIKE '%pizza%', categories LIKE '%italian%',
       business.stars, attributes::$RestaurantsPriceRange2
FROM business JOIN (SELECT * FROM reviews
                    ORDER BY DOT_PRODUCT(embedding, target_embedding) DESC
                    LIMIT 10) r
WHERE business.business_id = r.business_id

Full-text query for getting a similar result:

+pizza +italian good best great -expensive -bad cheap affordable

(incorrectly returns a review emphasizing 'price is high')

Semantic:

“find a restaurant with average review, not good, not bad, but average”

SQL validation:

-- target_embedding: the query vector for the "average review" question above
SELECT name, business.stars, r.stars, text
FROM business JOIN (SELECT * FROM reviews
                    ORDER BY DOT_PRODUCT(embedding, target_embedding) DESC
                    LIMIT 10) r
WHERE business.business_id = r.business_id

Full-text query for getting a similar result:

+restaurant mediocre average common -best -great -worst -terrible

(not working very well)

Results

I tried vector-similarity-based queries of two different types, five of each. I used the OpenAI and MiniLM models to get vectors for the text; MiniLM is available from Hugging Face.

For the first type of query (informational/objective search), both models perform pretty well — slightly better than the corresponding full-text queries I could come up with. For the second type (subjective search), the OpenAI embedding is noticeably better on average, and both semantic search models work much better than full-text search.

Figure 1: Recall evaluation for objective and subjective queries using vector-based search. Each pair of bars is for an individual query; there are five queries of each type. The Y axis shows how many of the returned results actually match the search question.

More interesting findings during our test

Most general-purpose models, like OpenAI’s and Sentence Transformers’, work best when you separate informational, objective search from emotional, subjective search.

For example:

Full-text match (the exact phrase “escape room”) + semantic search (“scary, good for adult”)

works way better than pure semantic search (“find me a scary, good for adult escape room”).
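Here is a sketch of that hybrid query in SQL (same hypothetical reviews schema as above): filter by the exact phrase first, then rank the survivors semantically.

-- Hybrid search: exact full-text phrase filter + semantic ranking.
-- @scary_embedding: hypothetical vector for "scary, good for adult"
SELECT review_id, text
FROM reviews
WHERE MATCH(text) AGAINST ('"escape room"')
ORDER BY DOT_PRODUCT(embedding, @scary_embedding) DESC
LIMIT 10;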

Tentative conclusion

When building an application with ‘search’ requirements (if the dataset is large enough), always try full-text search first. It’s much cheaper and super reliable.

If regular full-text search is not enough — you need fuzzy search a lot, you sometimes don’t know how to order the outputs, or you combine complex full-text query clauses and still can't find the desired result — it’s time to try semantic search.

For keyword-style semantic search, or searching for docs related to a certain topic, most models will work fine. Pick a simple and cheap one (or anything you are familiar with), embed your dataset and see if it works. Here’s an example search query:

  • Search for all docs related to ‘SingleStore vector search’

But if the goal is to really extract the ‘semantics’ — to search by the meaning or attitude of docs — it becomes very important to use a good, well-tuned model, and don’t expect it to just work out of the box. Either start with a high-quality paid service, or look into training or fine-tuning a model yourself. Here’s an example search query:

  • Search for all docs that support ‘SingleStore vector search is better than existing vector databases’

Under the Hood

How does full-text search work?

You have probably seen the index at the back of a book before: it lists all the keywords, along with the chapters/pages where you can look them up.

We do something similar in databases. Here’s how:

  • Tokenize the documents
  • Build a global dictionary with potential speedup structures (set or hashtable)
  • Build an inverted index, with one posting list per dictionary term
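Since we are in a database anyway, the structure itself can be sketched as two hypothetical tables — the dictionary and the posting lists — and a keyword lookup becomes a join:

-- The global dictionary: one row per distinct term, hash-indexed for lookup
CREATE TABLE dictionary (
    term_id BIGINT,
    term    VARCHAR(128),
    KEY (term) USING HASH
);

-- The inverted index: one posting list per dictionary term
CREATE TABLE postings (
    term_id BIGINT,
    doc_id  BIGINT
);

-- "Which documents contain the term 'prepared'?"
SELECT p.doc_id
FROM dictionary d JOIN postings p ON d.term_id = p.term_id
WHERE d.term = 'prepared';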

How does ANN semantic search work?

You have probably played with puzzles or Legos — but have you ever tried to reassemble a Lego or puzzle set a second time, when the pieces were no longer pre-packaged? You would first divide all the pieces into groups by color, and when looking for a particular piece, search only the group with the matching color.

When doing this in databases we:

  • Vectorize the documents
  • Divide all the vectors into groups
  • Build a global index of centroids, with potential speedup structures (e.g., an HNSW index)
  • Build an inverted index, with one posting list per centroid
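The same two-table sketch works here, with centroids playing the role of dictionary terms (again, every name is hypothetical):

-- The global index: one row per centroid
CREATE TABLE centroids (
    centroid_id BIGINT,
    centroid    BLOB    -- packed float32 vector for the group's center
);

-- The inverted index: one posting list per centroid
CREATE TABLE vector_postings (
    centroid_id BIGINT,
    doc_id      BIGINT,
    embedding   BLOB
);

-- ANN search, probing only the single closest centroid:
-- find the nearest centroid, then run exact KNN inside its posting list
SELECT doc_id
FROM vector_postings
WHERE centroid_id = (
    SELECT centroid_id FROM centroids
    ORDER BY DOT_PRODUCT(centroid, @query_embedding) DESC
    LIMIT 1
)
ORDER BY DOT_PRODUCT(embedding, @query_embedding) DESC
LIMIT 10;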

The Grand Unified Theory of Database Indexes

A global index plus an inverted index is indeed how most full-text and vector search indexes are built. It’s also how SingleStore’s patented columnstore secondary hash index is implemented. You can check out our SIGMOD paper on the topic here.

When these are implemented in the same system, on top of similar underlying structures, optimizations open up: we can easily reorder execution, intersect indexes and more…

So, would it be possible to combine all of them and get the best of both worlds?

No one has done it yet — at least not in any mature distributed database system. But never say never 😃.

Want to start testing these searches for yourself? Try SingleStoreDB free today.
