Data + AI in 2024: Three Problems for Now and for the Future

Although they are closely related, the data and AI communities have historically operated in parallel, each advancing in its own right — but often separately.

As we continue into 2024, this trend has dramatically changed. We're witnessing a convergence that is not only reshaping existing frameworks, but also forging new pathways for technological innovation.

After working in data for almost 10 years, and studying AI for some time, I want to share my thoughts on three big problems in data and AI.

Disclaimer: I speak mostly for myself in this blog post. Some parts might be biased toward SingleStore, but in general I want to discuss the bigger picture.

A mostly solved problem: RAG

To me, retrieval-augmented generation (RAG) is special because it proves that databases are relevant in the AI world.

Simple RAG will be part of the LLM solution

Some fun facts:

Wikipedia is around 20GB
All public codebases on GitHub contain, in total, around 200GB of text

There is just not that much 'valuable' textual data: it costs next to nothing to store, whether in memory or even on a GPU. So hosting the data, either as raw data or in vector form, is a much easier problem than correctly extracting information and embedding the data. Who does the best job of extracting and embedding? Model providers.
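To put the fun facts above into numbers, here is a back-of-the-envelope sketch of how much space the embeddings for those corpora might take. The chunk size (4,000 bytes of text per embedding) and the embedding shape (768 float32 dimensions) are assumptions picked for illustration, not figures from any particular model provider.

```python
GB = 10**9

# Corpus sizes quoted in this post.
CORPORA = {"Wikipedia": 20 * GB, "Public GitHub text": 200 * GB}


def embedding_bytes(corpus_bytes, chunk_bytes=4000, dims=768, bytes_per_dim=4):
    """Estimate raw storage for one embedding per fixed-size text chunk.

    chunk_bytes, dims and bytes_per_dim are illustrative assumptions.
    """
    chunks = -(-corpus_bytes // chunk_bytes)  # ceiling division
    return chunks * dims * bytes_per_dim


def report(corpora=CORPORA):
    """Return the estimated embedding size in GB for each corpus."""
    return {name: embedding_bytes(size) / GB for name, size in corpora.items()}


if __name__ == "__main__":
    for name, size_gb in report().items():
        print(f"{name}: about {size_gb:.1f} GB of embeddings")
```

Under these assumptions the vectors land in the same order of magnitude as the raw text, which is why storage is not the hard part here.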

The OpenAI Assistants API already provides a built-in solution for up to 10,000 files, and I don't see any fundamental limitation that would keep that number from growing.

Specialized vector databases will be irrelevant

Vector search isn’t difficult enough to stay a specialty: every database vendor is releasing its own. SingleStore proves it’s possible for general-purpose databases to reach approximate nearest neighbor (ANN) search performance similar to that of state-of-the-art, specialized vector databases. Check out how SingleStore brings high performance to vector search, and stay tuned for our research paper.

This also leads to the dilemma of vector search: if the workload really needs the highest possible vector search performance, people should not use a database. Rather, they should use the most suitable vector search library on the right hardware. Otherwise, vector search performance doesn't matter much, since in general it is already very fast.
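For a sense of how small the core of vector search is, here is a minimal sketch of exact (brute-force) nearest-neighbor search in plain Python. It is not ANN and not how SingleStore or any vector library implements search; the function names and the toy vectors are made up for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k vectors most similar to the query.

    This scans every vector, so it is exact but linear in the corpus size.
    """
    scored = ((cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [doc_id for _score, doc_id in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(top_k([1.0, 0.0], toy_vectors))  # ['a', 'b']
```

ANN indexes exist to avoid the full scan above; the point here is only that the baseline is a few lines of code.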

In the near term, people will still pick solutions based on ANN benchmarks. But as we understand the use cases better, I predict vector databases will go one of two ways, vector or database: they will either go back to being index libraries or try to become proper databases, but they will no longer be vector databases.

Complex RAG will stay for a while

Will all the RAG startups die? I don't think so.

In general, RAG startups should not be viewed as AI companies — they should all be viewed as data companies, since the differentiator for them is what can be retrieved. And there is a lot you can do with data:

The majority of high-value data is not in textual format
In the near future, calling LLMs and creating new vector embeddings will remain slow and expensive
For most RAG apps, the amount of data returned is extremely small, so any improvement in retrieval quality can be the differentiator

This requires RAG startups to innovate in retrieval; most RAG apps will be built on a complex data stack, and might eventually become agentic apps.

A half-solved problem: Data landscape

In this picture, the data land is a divided nation: one side a harmonious federation, bound by a common thread, and the other a fragmented confederation, riddled with discord and strife. Think of the Roman Empire on one hand, and the Gauls on the other.

Enterprise-scale lakehouse

We’re seeing a trend of enterprises adopting solutions built around Apache Iceberg and Delta Lake, where data is shared between various analytics and AI products. Look at the hot vendors in this space: Snowflake x Iceberg, Databricks Lakehouse and Microsoft OneLake.

In my opinion, lakehouse architectures, or modern data lakes, largely solve the offline data storage problem for machine learning training and data warehousing. In this architecture, data is easily transferable, copyable and shareable.

Many ways to get interactive speed

On the operational side, each web app, agent and dashboard is likely powered by a database, and this is what most AI agents will interact with.

But the problem is there are too many solutions:

App database. Postgres, Cockroach, MongoDB®, SingleStore 🙂…
Accelerated analytics. ClickHouse, DuckDB, Dremio, Rockset, SingleStore 😀…
Retrieval systems. Elastic, Vector databases, SingleStore 😂…

Efforts will be made to unify the speedy layer

Key players have already started:

MongoDB is marketed as transactions, search and AI
Rockset is marketed as analytics, search and multi-model
Redis is marketed as cache, vector and database

And as one of the engineers who started the SingleStore project, I still believe in the name.

Here are some criteria of this unified speedy layer:

Foundations:

It will be a database and ideally a system of record, so it can power applications
It will be a cache and accelerator, so it can support analytics or retrieval

Key properties:

It will be multi-model, so JSON, vector, full-text, graph and traditional relational data all exist here
It will support multi-tenancy
- An LLM is a multi-tenant system, so the infra is shared
- AI enables personalization, so there will be more per-tenant data
It will be connected to the stack
- It needs to be seamlessly connected to the previous lakehouse layer
- It needs to be more seamlessly connected to LLMs, which I will talk about in the next section

A mostly unsolved problem: AI's interaction with data

When talking with some LLM researchers, I learned an interesting perspective: traditional NLP was built with the goal of learning the grammar, semantics and rules of languages. But LLMs built on transformer models have a more general goal: to predict the next token.

Successful LLMs like GPT have most likely learned the semantics, grammar and rules internally. In other words, they learned what is needed to form dialogue from nothing, so in some sense we can call it general intelligence. With this assumption, LLMs are more than language models: the text input and token generation output might actually be a bottleneck, and they need better ways to interact with the rest of the world.

I am skeptical that we will trust AI to interact with the physical world in the near future. I, for one, would not entrust AI with control over my home's power systems. Therefore, it might be easier for AI technologies to first gain expanded access to data, both for reading and writing.

Current text_to_sql will die

The goal of the transformer model is to predict the next token. And compared to everything else, SQL is among the hardest things to write one word at a time, or even one line at a time.

Although at a glance SQL generation seems very close to a classic NLP task, it is not. It might be harder than general code generation since it is truly one-shot (you need one single query, two pages long). So I would say SQL generation won't be easier than video generation, and so far no one has put enough resources into this area. As a result, text_to_sql will remain a 'shiny toy' until either LLMs improve to another level, or someone spends an order of magnitude more resources on its development.

AI-powered analytics

This is my biggest disappointment in the current wave so far. We know that AI can do things humans can never do, for example, read one million tokens in a second or try a million SQL queries overnight. But we are only trying to use AI to replace junior DBAs on tasks humans handle perfectly fine (and it is still not going very well).

I want to see some fundamentally different ways to analyze data that were not possible before:

Read a million-row table, row by row, to give me some insights
Read all documents or tables in one system, to extract information and decide how it should be stored

Agentic app

Andrew Ng recently gave a talk about agents. You may think agentic apps are just another kind of AI application, but after some research and talking to the AI community, I am more or less convinced they are something more fundamental.

There are two ways to picture an agentic app:

It is trying to build an AI workflow to solve a given problem
It is trying to solve the problem of problem-solving

To some extent, we can view the second as building some kind of AGI! So, the prediction is that more and more resources will be put into agentic apps, and these will be the kind of applications that land in the enterprise world.

Recall some key properties of agentic apps from Andrew’s talk:

Reflection. The LLM examines its own work to come up with ways to improve it.
Tool use. The LLM is given tools like web search, code execution or any other function to help it gather information, take action or process data.
Planning. The LLM comes up with and executes a multi-step plan to achieve a goal (for example, writing an outline for an essay, doing online research, writing a draft and so on).
Multi-agent collaboration. Multiple AI agents work together, splitting up tasks and discussing and debating ideas to come up with better solutions than a single agent would.

Agentic apps bring some unique and interesting data requirements, including databases as tools, databases as the memory of a single agent and databases as a shared drive among agents.
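To make those three roles a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is hypothetical: the table names, the function names and the sample orders data are invented for illustration, and the SELECT-only check is a naive guard rather than real access control.

```python
import sqlite3


def open_store():
    """Create an in-memory database with made-up tables for the three roles."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")  # business data
    conn.execute("CREATE TABLE memory (agent TEXT, note TEXT)")  # per-agent memory
    conn.execute("CREATE TABLE shared (key TEXT PRIMARY KEY, value TEXT)")  # shared drive
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.5)])
    return conn


def query_tool(conn, sql):
    """Database as a tool: let an agent run a read-only query."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()


def remember(conn, agent, note):
    """Database as memory: append a note that belongs to one agent."""
    conn.execute("INSERT INTO memory VALUES (?, ?)", (agent, note))


def recall(conn, agent):
    """Return the notes one agent has stored, in insertion order."""
    rows = conn.execute("SELECT note FROM memory WHERE agent = ? ORDER BY rowid", (agent,))
    return [note for (note,) in rows]


def publish(conn, key, value):
    """Database as a shared drive: write a value any agent can read."""
    conn.execute("INSERT OR REPLACE INTO shared VALUES (?, ?)", (key, value))


def read_shared(conn, key):
    """Read a shared value, or None if no agent has published it."""
    row = conn.execute("SELECT value FROM shared WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = open_store()
    total = query_tool(conn, "SELECT SUM(amount) FROM orders")[0][0]
    remember(conn, "analyst", f"order total is {total}")
    publish(conn, "order_total", str(total))
    print(recall(conn, "analyst"), read_shared(conn, "order_total"))
```

A single database plays all three roles here only because the example is tiny; whether one system should do so in production is exactly the unification question raised earlier in this post.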

Conclusion

In 2024, the distinctions between database and AI companies are increasingly blurred. AI companies are recognizing the need to enhance their data handling capabilities, just as database companies are integrating more AI functions. Looking ahead, the most successful enterprises will likely be those that effectively merge the strengths of both databases and AI.

As someone deeply involved in the database sector, I am both excited and apprehensive about what the future holds, when suddenly we have 1 trillion 'people' that are capable of interacting with data.
