Full-Text Search vs. Semantic Search: The Good, Bad and Ugly

Zhou Sun

Director of Engineering, Query Execution

As a query engine engineer at SingleStore, I was fortunate enough to have both developed SingleStore’s full-text search engine, and initiated the vector/semantic search effort. I wanted to share my insights, and fact-check some widely held beliefs about semantic and full-text search.

First, Some Basics…

In full-text search, the emphasis is primarily on keyword matching. It retrieves documents or web pages that contain the exact keywords specified in the query. On the other hand, semantic search aims to understand the user’s intent behind a query by analyzing its semantics, context and the relationships between words or concepts. Full-text search typically treats each keyword independently, without considering the relationships between them or the overall context.

Semantic search, however, takes into account the context, synonyms, related terms and overall meaning of the query to retrieve more relevant results. It focuses on understanding the intent and meaning behind the query, rather than relying solely on specific keywords.

Fact Check: I Need Semantic Search to Build a Chatbot on My Own Data

Wrong. This is orthogonal to full-text versus semantic search. In fact, we built our very own chatbot, SQrL, using both semantic and full-text search. The core of this chatbot is very simple — for each user question, it searches a large number of documents, finds relevant information and feeds those documents to OpenAI’s Chat Completions API.

In other words, it doesn’t matter too much how you search the data. You can use semantic or full-text search, or even run some regular filters to get those documents. Here’s the flow for semantic search:

question → (model) → embedding → (vector search) → relevant docs

And here’s what it looks like without semantic search:

question → (tokenizer or model) → extracted terms / property filters → (full-text search) → relevant docs

Both work, and I am not convinced one is universally better than the other.
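To make this concrete, here is a minimal sketch of both flows as SQL against SingleStoreDB. The docs table, its columns and the toy three-dimensional embedding are all hypothetical — in practice, the question embedding comes out of your model, not a literal:

-- Hypothetical table holding the documents and their embeddings
CREATE TABLE docs (
    id BIGINT,
    content TEXT,
    embedding BLOB,     -- packed float32 vector produced by the embedding model
    FULLTEXT (content)  -- index used by the full-text flow
);

-- Semantic flow: embed the question with the model, then vector search
SET @question_embedding = JSON_ARRAY_PACK('[0.12, -0.03, 0.98]');
SELECT id, content
FROM docs
ORDER BY DOT_PRODUCT(embedding, @question_embedding) DESC
LIMIT 5;

-- Full-text flow: terms extracted from the question by a tokenizer
SELECT id, content
FROM docs
WHERE MATCH(content) AGAINST ('vector search performance')
LIMIT 5;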

Fact Check: Semantic Search Is Simply Better Than Full-Text Search

Totally wrong. Semantic search is not a silver bullet — and what’s more, you can’t treat your embedding model as a silver bullet either.

Depending on the use case, full-text search can be simple, reliable and efficient. A good example is log analysis. People usually look for exact keywords, patterns or even a specific object like a database name. Full-text search is guaranteed to find all the entries that match the search criteria, and works much better than semantic search here. A log entry looks like:

“38:46.030   TEST: Thread 115107 (ntid 693666, conn id -1):
ResolveTwoPCTransactionsForDb: db_0: Encountered 0 prepared attached”
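As a sketch of why this fits full-text search so well (assuming a hypothetical logs table with a FULLTEXT index on its message column), finding every entry for a specific database is one exact-keyword query:

-- Guaranteed to return every entry mentioning both terms
SELECT message
FROM logs
WHERE MATCH(message) AGAINST ('+db_0 +ResolveTwoPCTransactionsForDb');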

Fact Check: Semantic Search Is as Easy to Use as Full-Text Search

Yes and no.

It’s very easy to overlook the difficulty of a less familiar component. For semantic search (and from a database perspective), people usually ignore the machine learning part of the work. Believe it or not, an MLE’s job is no easier than a DBA’s.

You need the right segmentation of properly normalized data — run through the right model with the right training — to get reasonable semantic search results.

Fact Check: I Am Using OpenAI for Building My Chatbot, so I Should Also Use the OpenAI Embedding

No. You don’t need to use the same system for embedding and text completion. You could argue that OpenAI is so good that its embedding API is a no-brainer — and it is a good starting point, especially for a personal project. However, for an enterprise use case, there are a few things worth double-checking:

  1. The OpenAI API can be a lot more expensive than hosting your own model.
  2. It might not be a good idea to depend on a versioned external service. This is particularly important for semantic search, since the same model must be used for indexing and querying.
  3. The OpenAI embedding might not be the best or most suitable for your data. In fact, there is active research comparing embedding models, many of them available through Hugging Face. OpenAI’s Ada embedding is pretty good in general, but I wouldn't consider it state-of-the-art.

In short, enterprise use cases require machine learning expertise to evaluate, pick and train the best model (and that might be OpenAI, or not).

A Head-to-Head Comparison

Dataset and config

Here, we’re going to use the famous Yelp dataset. More specifically, we are focusing on the reviews table with ~7 million rows, and working with a very small cluster: 2 vCPUs and 16GB of memory.

Simple search workloads — those without complex join and aggregation query shapes — scale almost perfectly in a distributed system like SingleStoreDB. So for a larger dataset with tighter latency requirements, we can easily scale the cluster to hit the target performance.

To avoid introducing additional variables in this particular test, let’s work with the simplest vector search: exact KNN (k-nearest neighbors), done by simply storing the vectors in an execution-optimized format (similar to a Faiss flat index). Introducing an ANN (approximate nearest neighbor) search algorithm usually implies a larger index size, longer indexing time (on the order of hours) and faster search performance (with slightly worse accuracy). If you are interested in comparing ANN search across different index implementations, stay tuned for a future blog post.
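Concretely, exact KNN is just a full scan ordered by similarity — there is no index structure to build or tune on the vector side. A sketch of the setup (the column names are assumptions, not the exact test schema):

-- Reviews with embeddings stored as packed float32 blobs
CREATE TABLE reviews (
    review_id   VARCHAR(22),
    business_id VARCHAR(22),
    stars       FLOAT,
    text        TEXT,
    embedding   BLOB,   -- e.g., 384 dims for MiniLM, 1536 for OpenAI Ada
    FULLTEXT (text)     -- for the full-text side of the comparison
);

-- Exact KNN: scan every row, rank by dot product, keep the top 10.
-- @target_embedding is the query vector produced offline by the model;
-- a real one has hundreds of dimensions, not three.
SET @target_embedding = JSON_ARRAY_PACK('[0.1, 0.2, 0.3]');
SELECT review_id, text
FROM reviews
ORDER BY DOT_PRODUCT(embedding, @target_embedding) DESC
LIMIT 10;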

Index:

                 Full-text     Self-hosted embedding    OpenAI
Embedding time   N/A           4 hours*                 20 minutes**
Cost             N/A           ~$2                      ~$300
Index time       ~2 minutes
Size             2.6GB         21GB for encodings       81GB for encodings

*Using SingleStore Notebooks or Google Colab, with a GPU

**Based on a standard user's TPM (tokens per minute) limit, assuming batched, parallel requests

Query:

It’s really hard to evaluate and compare the query experience of semantic search. I am not an MLE and don’t have the bandwidth to label the dataset — but let’s try to define a smart quantitative approach. In addition to the reviews themselves, the dataset has other properties of each business. So we can form a query that implies certain business properties, run it as a semantic search over the reviews, and then check those properties against the results.

More precisely, this is our practice (which I believe is a good way to test any semantic search model): run the semantic search, then use the structured properties to validate the results.

Examples:

Semantic:

“Affordable Italian pizza with good review”

SQL validation:

-- target_embedding: the query vector for "Affordable Italian pizza with good review"
SELECT name, categories LIKE '%pizza%', categories LIKE '%italian%',
       business.stars, attributes::$RestaurantsPriceRange2
FROM business JOIN (SELECT * FROM reviews
                    ORDER BY DOT_PRODUCT(embedding, target_embedding) DESC
                    LIMIT 10) r
WHERE business.business_id = r.business_id

Full-text query for getting a similar result:

+pizza +italian good best great -expensive -bad cheap affordable

(incorrectly returns a review emphasizing 'price is high')

Semantic:

“find a restaurant with average review, not good, not bad, but average”

SQL validation:

-- target_embedding: the query vector for the "average review" question above
SELECT name, business.stars, r.stars, text
FROM business JOIN (SELECT * FROM reviews
                    ORDER BY DOT_PRODUCT(embedding, target_embedding) DESC
                    LIMIT 10) r
WHERE business.business_id = r.business_id

Full-text query for getting a similar result:

+restaurant mediocre average common -best -great -worst -terrible

(not working very well)

Results

I tried vector-similarity-based queries of two different types, five of each. I used the OpenAI and MiniLM models to get vectors for the text; MiniLM is available from Hugging Face.

For the first type of query (informational/objective search), both models perform pretty well — slightly better than the corresponding full-text queries I could come up with. For the second type (subjective search), the OpenAI embedding is noticeably better on average, and both semantic search models work much better than full-text search.

Figure 1: Recall evaluation for objective and subjective queries using vector-based search. Each pair of bars is for an individual query; there are five queries of each type. The Y axis shows how many of the returned results actually match the search question.

More interesting findings during our test

Most general-purpose models, like OpenAI’s and Sentence Transformers’, work best when you separate informational, objective search from emotional, subjective search.

For example:

Full-text match (the exact phrase “escape room”) + semantic search (“scary, good for adult”)

works way better than pure semantic search (“find me a scary, good for adult escape room”).
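Here is a sketch of that hybrid query in SQL (same hypothetical reviews schema as above): filter by the exact phrase first, then rank the survivors semantically.

-- Hybrid search: exact full-text phrase filter + semantic ranking.
-- @scary_embedding: hypothetical vector for "scary, good for adult"
SELECT review_id, text
FROM reviews
WHERE MATCH(text) AGAINST ('"escape room"')
ORDER BY DOT_PRODUCT(embedding, @scary_embedding) DESC
LIMIT 10;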

Tentative conclusion

When building an application with ‘search’ requirements (if the dataset is large enough), always try full-text search first. It’s much cheaper and super reliable.

If regular full-text search is not enough — you need fuzzy search a lot, you sometimes don’t know how to order the outputs, or you combine complex full-text query clauses and still can't find the desired result — it’s time to try semantic search.

For keyword-style semantic search, or searching for docs related to a certain topic, most models will work fine. Pick a simple and cheap one (or anything you are familiar with), embed your dataset and see if it works. Here’s an example search query:

  • Search for all docs related to ‘SingleStore vector search’

But if the goal is to really extract the ‘semantics’ — to search by the meaning or attitude of docs — it becomes very important to use a good, well-tuned model, and don’t expect it to just work out of the box. Either start with a high-quality paid service, or look into training or fine-tuning a model yourself. Here’s an example search query:

  • Search for all docs that support ‘SingleStore vector search is better than existing vector databases’

Under the Hood

How does full-text search work?

You have probably seen the index at the back of a book before: it lists all the keywords, along with the chapters/pages where you can look them up.

We do something similar in databases. Here’s how:

  • Tokenize the documents
  • Build a global dictionary with potential speedup structures (set or hashtable)
  • Build an inverted index, with one posting list per dictionary term
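Since we are in a database anyway, the structure itself can be sketched as two hypothetical tables — the dictionary and the posting lists — and a keyword lookup becomes a join:

-- The global dictionary: one row per distinct term, hash-indexed for lookup
CREATE TABLE dictionary (
    term_id BIGINT,
    term    VARCHAR(128),
    KEY (term) USING HASH
);

-- The inverted index: one posting list per dictionary term
CREATE TABLE postings (
    term_id BIGINT,
    doc_id  BIGINT
);

-- "Which documents contain the term 'prepared'?"
SELECT p.doc_id
FROM dictionary d JOIN postings p ON d.term_id = p.term_id
WHERE d.term = 'prepared';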

How does ANN semantic search work?

You have probably played with puzzles or Legos — but have you ever tried to reassemble a Lego or puzzle set a second time, when the pieces were no longer pre-packaged? You would first divide all the pieces into groups by color, and when looking for a particular piece, search only the group with the matching color.

When doing this in databases we:

  • Vectorize the documents
  • Divide all the vectors into groups
  • Build a global index of centroids, with potential speedup structures (e.g., an HNSW index)
  • Build an inverted index, with one posting list per centroid
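The same two-table sketch works here, with centroids playing the role of dictionary terms (again, every name is hypothetical):

-- The global index: one row per centroid
CREATE TABLE centroids (
    centroid_id BIGINT,
    centroid    BLOB    -- packed float32 vector for the group's center
);

-- The inverted index: one posting list per centroid
CREATE TABLE vector_postings (
    centroid_id BIGINT,
    doc_id      BIGINT,
    embedding   BLOB
);

-- ANN search, probing only the single closest centroid:
-- find the nearest centroid, then run exact KNN inside its posting list
SELECT doc_id
FROM vector_postings
WHERE centroid_id = (
    SELECT centroid_id FROM centroids
    ORDER BY DOT_PRODUCT(centroid, @query_embedding) DESC
    LIMIT 1
)
ORDER BY DOT_PRODUCT(embedding, @query_embedding) DESC
LIMIT 10;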

The Grand Unified Theory of Database Indexes

A global index plus an inverted index is indeed how most full-text and vector search indexes are built. It’s also how SingleStore’s patented columnstore secondary hash index is implemented. You can check out our SIGMOD paper on the topic here.

When these are implemented in the same system, on top of similar underlying structures, optimizations open up: we can easily reorder execution, intersect indexes and more…

So, would it be possible to combine all of them and get the best of both worlds?

No one has done it yet — at least not in any mature distributed database system. But never say never 😃.

Want to start testing these searches for yourself? Try SingleStoreDB free today.
