A Beginner's Guide to Retrieval Augmented Generation (RAG)


Pavan Belagatti

Developer Evangelist

New to the world of Retrieval Augmented Generation (RAG)? We've got you covered with this in-depth guide on what it is, its advantages and real-time use cases.


Large language models (LLMs) are becoming the backbone of many organizations as the world transitions toward AI. For all their strengths, LLMs have drawbacks when used carelessly: they can produce unexpected responses, including made-up or biased information. This can happen for various reasons, and the generation of such misinformation is known as hallucination. Notable approaches to mitigating LLM hallucinations include fine-tuning, prompt engineering and retrieval augmented generation (RAG). RAG has been the most talked-about of these approaches. Today we will show you how the RAG approach works.

What is Retrieval Augmented Generation (RAG)?

LLMs sometimes produce hallucinated answers, and RAG is one technique for mitigating them. For a user query, RAG retrieves relevant information from the data you provide, which is stored in a vector database. A vector database is a specialized database, distinct from traditional databases, designed to store vector data. Vector data takes the form of embeddings, which capture the context and meaning of the objects they represent.

For example, think of a scenario where you would like to get custom responses from your AI application. First, the organization’s documents are converted into embeddings through an embedding model and stored in a vector database. When a query is sent to the AI application, it is converted into a query embedding and compared against the vector database to find the most similar objects through vector similarity search. The retrieved content is passed to the LLM along with the query, so your LLM-powered application is far less likely to hallucinate: it has been instructed to answer from your custom data.
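The similarity search step can be illustrated with a minimal sketch. The three-dimensional vectors below are hypothetical stand-ins for real embeddings (which an embedding model would produce, usually with hundreds or thousands of dimensions), and the function names are made up for illustration. A vector database performs the same kind of ranking at much larger scale, typically with specialized indexes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k stored vectors closest to the query."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical documents and hand-written "embeddings" (assumptions, not model output)
documents = ["refund policy", "shipping times", "password reset"]
doc_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
]
query_embedding = [0.8, 0.2, 0.1]  # stands in for an embedded user query

results = top_k_similar(query_embedding, doc_embeddings, k=2)
for index, score in results:
    print(f"{documents[index]}: {score:.3f}")
```

The query vector points in nearly the same direction as the first document's vector, so that document ranks first.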

One simple use case is a customer support application. The custom data is stored in a vector database, and when a user query comes in, the application generates the most appropriate response about your products or services rather than a generic answer. RAG is being applied in many other fields in the same way.

The RAG pipeline involves three critical components: retrieval, augmentation and generation.

  • Retrieval: This component fetches information relevant to a given user query from an external knowledge base, such as a vector database. It is crucial because it is the first step in producing meaningful, contextually correct responses.

  • Augmentation: This part adds the retrieved information to the user query as context, so the prompt sent to the model contains what it needs to answer.

  • Generation: Finally, a large language model (LLM) produces the output presented to the user. The LLM combines its own knowledge with the provided context to come up with an apt response to the user’s query.

These three components form the basis of a RAG pipeline that helps users get the contextually rich, accurate responses they are looking for. That is why RAG is so well suited to building chatbots, question-answering systems and similar applications.
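The three components can be sketched as three small functions. This is an illustrative outline under simplifying assumptions, not a production pipeline: retrieval here ranks passages by shared words instead of embeddings, the knowledge base is a hypothetical list of strings, and `generate` is a placeholder where a real application would call an LLM.

```python
def retrieve(question, knowledge_base, top_k=2):
    """Rank passages by how many words they share with the question."""
    question_words = set(question.lower().split())
    scored = []
    for passage in knowledge_base:
        overlap = len(question_words & set(passage.lower().split()))
        scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only passages that share at least one word with the question
    return [passage for overlap, passage in scored[:top_k] if overlap > 0]

def augment(question, passages):
    """Build a prompt that places the retrieved passages before the question."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

def generate(prompt):
    """Placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM response for a prompt of {len(prompt)} characters]"

# Hypothetical knowledge base for a support chatbot
knowledge_base = [
    "Orders ship within two business days.",
    "Refunds are issued within 14 days of a return.",
    "Support is available by email around the clock.",
]

question = "How long do refunds take?"
passages = retrieve(question, knowledge_base)
prompt = augment(question, passages)
print(prompt)
print(generate(prompt))
```

Keeping the stages separate makes it easy to swap the word-overlap retriever for a vector search, or the placeholder for a real model, without touching the other stages.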

The key advantage of RAG is that it allows the model to pull in real-time information from external sources, making it more dynamic and adaptable to new information. It's particularly useful for tasks where the model needs to reference specific details that might not be present in its pre-trained knowledge, like fact-checking or answering questions about recent events.

Advantages of Retrieval Augmented Generation

RAG offers several notable advantages:

  • Scalability. The RAG approach lets you scale what a model knows by simply updating or adding external/custom data in your external database (vector database), without retraining the model.
  • Memory efficiency. Traditional models like GPT are limited to what they learned during training, so they cannot pull in fresh or updated information, and holding all knowledge in the model's parameters is not memory efficient. RAG leverages external databases like a vector database, allowing it to pull in fresh, updated or detailed information quickly when needed.
  • Flexibility. By updating or expanding the external knowledge source, you can adapt RAG to build a wide range of AI applications.

Retrieval Augmented Generation (RAG) applications

RAG can be extremely useful in scenarios where detailed, context-aware answers are required, including:

  • Question answering systems. Providing detailed and contextually correct answers to user queries by pulling from extensive knowledge bases.
  • Content creation. Assisting writers, authors and other creators by providing relevant, up-to-date information or facts to enrich their content creation process.
  • Research assistance. Instead of searching through many documents and websites on the internet, researchers can use RAG to quickly access pertinent data or studies related to their query.

RAG real-time use case example

RAG has a range of potential applications, and one real-life use case is chat applications. RAG enhances chatbot capabilities by integrating real-time data. Consider a sports league chatbot. Traditional LLMs can answer historical questions but struggle with recent events, like the details of last night's game.

RAG allows the chatbot to access up-to-date databases, news feeds and player bios. This means users receive timely, accurate responses about recent games or player injuries. For instance, Cohere's chatbot provides real-time details about Canary Islands vacation rentals, from beach accessibility to nearby volleyball courts. Essentially, RAG bridges the gap between static LLM knowledge and dynamic, current information.

RAG using LangChain

LangChain streamlines RAG by providing the interface between large data repositories and large language models (LLMs). Documents are split into manageable chunks, which are embedded as vectors and kept in a vector store for fast retrieval. When a user inputs a prompt, LangChain queries the vector store and pinpoints the relevant data.

This focused data is then passed to the LLM, which crafts a precise, context-rich response. Combining LangChain's data management with the LLM's generation capabilities helps users receive accurate, data-backed responses. LangChain is an open-source framework, which makes this approach to context-aware generation and retrieval widely accessible.
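To make the idea of breaking data into smaller pieces concrete, here is a minimal sketch of fixed-size chunking with overlap, written in plain Python. It is not LangChain's implementation (LangChain's splitters also respect separators such as paragraph breaks); the function name and the parameter values are assumptions chosen for illustration.

```python
def split_into_chunks(text, chunk_size=40, chunk_overlap=10):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample_text = (
    "Retrieval augmented generation pairs a retriever with a language model. "
    "Long documents are split into chunks before they are embedded."
)
for number, chunk in enumerate(split_into_chunks(sample_text), start=1):
    print(f"Chunk {number}: {chunk!r}")
```

The overlap means text near a chunk boundary appears in both neighboring chunks, which reduces the chance that a relevant sentence is cut in half and missed at retrieval time.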

Take a look at SingleStore’s integration with LangChain.

The diagram represents the RAG process for AI applications. The flow starts with end users posing a query or "ask" as represented by step 1. This inquiry is directed to a gen AI app, which then proceeds to search and retrieve relevant information from a company data repository, as denoted by step 2. Once the data is fetched, it is used as a prompt to instruct the LLMs in step 3.

The LLMs then generate an appropriate response based on the prompt and the initial query, synthesizing the retrieved data to provide a coherent and informed answer back to the end user. This RAG process combines the capabilities of information retrieval with the advanced generative capabilities of language models to offer detailed, contextually accurate answers.

Fine-tuning vs. Retrieval Augmented Generation

Fine-tuning refers to the process of adapting a pre-existing, broadly trained model to a specific task or domain. Initially, an LLM is trained on a vast corpus of data to understand language structures, patterns and nuances, a phase often referred to as "pre-training."

Once this generalized understanding is established, the model can be further refined, or fine-tuned, on a smaller, specialized dataset tailored to a particular application, such as medical text generation, legal document analysis or customer support responses. This fine-tuning step enables the model to leverage its broad knowledge from pre-training while specializing in the nuances and specifics of the target domain, improving performance on the desired task.

RAG and fine-tuning are both techniques to adapt pre-trained language models to specific tasks or domains. Here's a comparison of the two:

| | Retrieval Augmented Generation (RAG) | Fine-tuning |
| --- | --- | --- |
| Definition | Combines large-scale knowledge retrieval with sequence generation. Retrieves relevant documents and generates an answer using them. | Refines a pre-trained model on specific tasks using a smaller dataset. Adjusts the weights of the model to specialize it for a particular task. |
| Advantages | Can leverage vast external knowledge | Can achieve strong performance on specific tasks; efficient, especially when limited data is available for the task |
| Challenges | Requires an efficient retrieval mechanism; potential to retrieve irrelevant documents; computationally more intensive | Risk of overfitting if not enough data is available, or if training is too aggressive; knowledge is limited to the last training cut-off |
| Use cases | Open-domain question answering (QA); dynamic responses in chatbots; situations where new data emerges frequently | Task-specific applications like sentiment analysis; niche domains with unique datasets |
| Examples | The original RAG model from Facebook AI Research for QA | Fine-tuning GPT models for specific domains or tasks |


Integrating SingleStore with RAG in an AI application can be a powerful combination. SingleStoreDB is a distributed, relational database that excels in high-performance, real-time analytics. Integrating it gives the RAG pipeline fast, efficient access to vast amounts of data, which can be crucial for real-time response generation.

With this combination, your chat application can draw on real-time analytics and fast data retrieval to provide timely and relevant responses to user queries.

For more in-depth understanding, follow SingleStore CMO Madhukar Kumar’s talk 'Building a Generative AI App on Private Enterprise Data With Retrieval Augmented Generation (RAG)' .

Also, check out this article by Ronny Hoesada, DevRel Engineer at Unstructured.io (a SingleStore partner), on Building a Q+A Retrieval Augmented Generation (RAG) System with Slack Data Using Unstructured and SingleStoreDB.

RAG tutorial

Let’s build a simple AI application that fetches contextually relevant information from our own data for any given user query.

Sign up for SingleStore to use it as your AI database. Once you sign up, you need to create a workspace — which is easy and free.

Once you create your workspace, create a database with any name you choose.

Create the database from the ‘Create Database’ tab on the right side. Then go to ‘Develop’ to use our Notebooks feature (similar to Jupyter Notebooks).

Create a new Notebook, and name it whatever you’d like.

Before doing anything, select your workspace and database from the dropdown on Notebooks.

Now, start adding the following code snippets to the Notebook you just created.

Install the required libraries

!pip install openai numpy pandas singlestoredb langchain==0.1.8 langchain-community==0.0.21 langchain-core==0.1.25

Vector embeddings example

def word_to_vector(word):
    # Define some basic rules for our vector components
    vector = [0] * 5  # Initialize a vector of 5 dimensions
    # Rule 1: Length of the word (normalized to a max of 10 characters for simplicity)
    vector[0] = len(word) / 10
    # Rule 2: Number of vowels in the word (normalized to the length of the word)
    vowels = 'aeiou'
    vector[1] = sum(1 for char in word if char in vowels) / len(word)
    # Rule 3: Whether the word starts with a vowel (1) or not (0)
    vector[2] = 1 if word[0] in vowels else 0
    # Rule 4: Whether the word ends with a vowel (1) or not (0)
    vector[3] = 1 if word[-1] in vowels else 0
    # Rule 5: Percentage of consonants in the word
    vector[4] = sum(1 for char in word if char not in vowels and char.isalpha()) / len(word)
    return vector

# Example usage
word = "example"
vector = word_to_vector(word)
print(f"Word: {word}\nVector: {vector}")

Vector similarity example

import numpy as np

def cosine_similarity(vector_a, vector_b):
    # Calculate the dot product of vectors
    dot_product = np.dot(vector_a, vector_b)
    # Calculate the norm (magnitude) of each vector
    norm_a = np.linalg.norm(vector_a)
    norm_b = np.linalg.norm(vector_b)
    # Calculate cosine similarity
    similarity = dot_product / (norm_a * norm_b)
    return similarity

# Example usage (reuses word_to_vector from the previous snippet)
word1 = "example"
word2 = "sample"
vector1 = word_to_vector(word1)
vector2 = word_to_vector(word2)
# Calculate and print cosine similarity
similarity_score = cosine_similarity(vector1, vector2)
print(f"Cosine similarity between '{word1}' and '{word2}': {similarity_score}")
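For reference, the cosine_similarity function above computes the standard cosine similarity between two vectors. Written out for two n-dimensional vectors a and b (n = 5 for the word vectors in this example), the formula is:

```latex
\[
\operatorname{cosine\_similarity}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

The numerator corresponds to np.dot and each factor in the denominator to np.linalg.norm. The result ranges from -1 to 1, and because every component of these word vectors is non-negative, the score here falls between 0 and 1, with values closer to 1 indicating more similar vectors.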

Embedding models

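from openai import OpenAI

OPENAI_KEY = "your-openai-api-key"  # Placeholder: replace with your OpenAI API key
client = OpenAI(api_key=OPENAI_KEY)

def openAIEmbeddings(input):
    # The model name is an assumption; use whichever embedding model you prefer
    response = client.embeddings.create(input=input, model="text-embedding-3-small")
    return response.data[0].embedding

print(openAIEmbeddings("Golden Retriever"))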

Creating a vector database with SingleStore

We will be using the LangChain framework, with SingleStore as the vector database to store our embeddings, and a public .txt file of a Sherlock Holmes story as the source data.

Add your OpenAI API key as an environment variable.

import os
os.environ['OPENAI_API_KEY'] = 'mention your openai api key'

Then add the following code.

import os
import requests
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores.singlestoredb import SingleStoreDB

# URL of the public .txt file you want to use
file_url = "https://sherlock-holm.es/stories/plain-text/stud.txt"
# Send a GET request to the file URL
response = requests.get(file_url)
# Proceed if the file was successfully downloaded
if response.status_code == 200:
    file_content = response.text
    # Save the content to a file
    file_path = 'downloaded_example.txt'
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(file_content)
    # Load and process documents
    loader = TextLoader(file_path)  # Use the downloaded document
    documents = loader.load()
    # chunk_overlap=0 is an assumed value; tune it for your data
    text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    # Generate embeddings using the OpenAI API key set in the environment earlier
    embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
    # Create the vector database; replace "scarlet" with your own table name
    vector_database = SingleStoreDB.from_documents(docs, embeddings, table_name="scarlet")
    # Retrieve the chunks most similar to the query and print the best match
    query = "which university did he study?"
    docs = vector_database.similarity_search(query)
    print(docs[0].page_content)
else:
    print("Failed to download the file. Please check the URL and try again.")

Once you’ve run the code, you will see a tab to enter the query/question you would like to ask related to Sherlock Holmes.

We retrieved the relevant information from the provided data, using it to guide the response generation process. By converting our file into embeddings and storing them in a SingleStore database, we created a retrievable corpus of information, ensuring the responses are not only relevant but also rich in content derived from the provided dataset.


Retrieval Augmented Generation represents a significant leap in the evolution of language models. By combining the power of retrieval mechanisms with sequence-to-sequence generation, RAG models can provide richer, more detailed and contextually relevant outputs. As the field advances, we can expect to see even more sophisticated integrations of these components, paving the way for AI models that are not just knowledgeable, but also resourceful.