How Retrieval and Generation Work Hand in Hand in RAG

Written by metinyurdev | Published 2025/08/21
Tech Story Tags: artificial-intelligence | rag-architecture | techinnovation | data-retrieval | machine-learning | rag-systems | rag-systems-llm | bert

TL;DR: RAG systems are built on top of LLMs. They use tools such as BERT, Sentence Transformers, or the OpenAI Embeddings API, and rely on a two-part architecture: the Retriever and the Generator.

In the first article of our series, we saw the impressive performance of RAG systems built on LLMs. While demonstrating that performance, we briefly touched on the Retriever and Generator tools but didn't go into much technical detail. Now, in the second article of the series, we'll take a closer look at how they operate under the hood. Let's begin by examining the architecture of RAG systems in a way that's both technical and accessible.

The Core Idea Behind RAG

At first glance, we see two main components that do all the work together: the Retriever and the Generator. These two components work in close synchronization and play as a team. They're no different from two athletes in a relay race: the Retriever, like the first runner, carries the baton and hands it off, while the Generator, like the second runner, carries it across the finish line. To better understand the technical structure, let's examine these two components separately.

Retriever: The Gateway to Knowledge

We could say that the Retriever is the first step of the system. If we recall the mental model of RAG systems, the model first goes outside its existing memory to search for information. The tool that makes this search possible is the first relay runner, the Retriever component. The Retriever's most fundamental task begins with the question, "Where can I find it?", and its primary goal is to answer that question. For a better understanding, let's take an example. When a user asks your system, "How long do lions live on average?", the Retriever steps in first and searches for the answer in designated databases, forum sites, or any other research source. It asks itself that magic question and gets to work. Now that we've understood the general process, let's walk through the technical steps one by one.

Step 1: Indexing

When a Retriever starts its journey, it first prepares the documents and resources it will access. To transform these documents into a format the LLM can work with, it splits them into groups of roughly 100-300 words, called "chunks." Like other software technologies, LLMs operate on mathematical operations: for a large language model to process data, the data must be represented numerically. Therefore, each chunk is converted into a numerical vector. This process is called "embedding." Completing these steps constitutes the first stage of a Retriever's work, and tools such as BERT, Sentence Transformers, or the OpenAI Embeddings API are generally used in practice. Below is an example code snippet illustrating how these processes are put into practice.

from openai import OpenAI
 
#Initialize OpenAI client
client = OpenAI(api_key="YOUR_API_KEY")
 
#Example document
document = "RAG (Retrieval-Augmented Generation) helps LLMs access external knowledge. First, the text is split into chunks and converted into embeddings."
 
#Simple chunking function
def chunk_text(text, chunk_size=20):
    words = text.split()
    for i in range(0, len(words), chunk_size):
        yield " ".join(words[i:i + chunk_size])
 
#Generate embeddings for each chunk
embeddings = []
for chunk in chunk_text(document):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk
    )
    embeddings.append(response.data[0].embedding)
 
print(f"{len(embeddings)} embeddings generated.")

Step 2: Vectorizing the Question

In the first step, the chunks were vectorized. The same operation must now be performed on the user's question: to produce a good answer, the question has to be matched against the chunks. Like the raw documents, the user's question is written in natural language, so it must be converted into the same vector space with an embedding model before matching can take place.

Similar work was done in the first two steps described so far; together, these steps are called "Dense Retrieval." Vectorization captures the meanings of words, not the words themselves. In other words, instead of a model that produces wrong answers by merely matching surface-level words, the aim is a model that works accurately and effectively by matching meanings.

#User question
user_question = "What is RAG and how does it work?"
 
#Convert the user question to an embedding
question_embedding_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=user_question
)
question_embedding = question_embedding_response.data[0].embedding
 
print("User question embedding size:", len(question_embedding))

Step 3: Similarity Search

So far, we've performed segmentation and then vectorization: we have the document fragments and the user question, both already embedded as vectors. From here, the task is delegated to vector databases. Vector databases such as FAISS, ScaNN, Annoy, Weaviate, and Qdrant find the vectorized document fragments closest to the user question. This process is known as ANN (approximate nearest neighbor) search: it quickly measures statistical closeness by exploring the nearest neighbors first. Because ANN is both fast and efficient, it is widely used in RAG systems.

import faiss
import numpy as np
 
#Convert embeddings list into a NumPy array (float32 is required by FAISS)
embedding_matrix = np.array(embeddings).astype("float32")
 
#Dimension of the embeddings
dimension = embedding_matrix.shape[1]
 
#Build FAISS index (using L2 distance or inner product for similarity)
index = faiss.IndexFlatIP(dimension)  # Inner product (cosine similarity after normalization)
 
#Normalize embeddings for cosine similarity
faiss.normalize_L2(embedding_matrix)
 
#Add document embeddings to the index
index.add(embedding_matrix)
 
print(f"Number of vectors in the index: {index.ntotal}")
 
#Normalize the question embedding as well
question_vector = np.array([question_embedding]).astype("float32")
faiss.normalize_L2(question_vector)
 
#Perform ANN search (k = number of nearest neighbors to retrieve)
k = 3
distances, indices = index.search(question_vector, k)
 
print("Most similar chunk indices:", indices[0])
print("Similarity scores:", distances[0])

Step 4: Top-k Selection

All the steps the Retriever needs in order to make a selection have been completed; now it's time to choose. The "k" documents most relevant to the question are selected. Several factors are taken into account when determining the "k" parameter: the context window of the LLM being used, cost, latency, the difficulty or simplicity of the problem, and the quality and diversity of the documents. Choosing a "k" value arbitrarily, without weighing these factors, produces a system that is inefficient, slow, or needlessly resource-hungry. Once "k" is chosen, the selected documents are presented to the LLM, which reads all the submitted document fragments and generates the most logical and effective answer to the question in natural language.
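Continuing the FAISS example above, the sketch below shows one way to map the returned indices back to the original text and gather the top-k passages for the Generator. It assumes the chunks produced in Step 1 were kept in a list (the chunks variable here is introduced purely for illustration); in a real system, the vector database usually stores the text or an ID alongside each vector.

#Keep the original chunks so index positions can be mapped back to text
chunks = list(chunk_text(document))
 
#Collect the top-k chunks that will be handed to the Generator
#(FAISS returns -1 when fewer than k vectors are available)
top_k_chunks = [chunks[i] for i in indices[0] if i != -1]
 
for rank, (idx, score) in enumerate(zip(indices[0], distances[0]), start=1):
    if idx == -1:
        continue
    print(f"{rank}. similarity={score:.3f} | {chunks[idx]}")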

These were the technical steps of the Retriever's work. To perform them well, we need a high-quality Retriever, and several factors determine that quality. Let's briefly touch on these factors in the table below.

Factor | Effect
Embedding Model Quality | The better it represents meaning, the more accurate the matching will be
Data Cleaning and Chunking | Nonsensical or overly long chunks can mislead the Retriever
Top-k Value | More documents mean more comprehensive results, but also a higher information load
Database Structure | Strong infrastructures like FAISS accelerate queries

The Retriever has a golden rule: "Quality, accurate data yields a reliable answer." The data presented to the Retriever directly impacts the answer; incorrect or irrelevant data leads to wasted time and effort right from the start. Furthermore, a poorly configured Retriever will return incorrect information. Therefore, the configuration of the Retriever is as important as the data for a correctly working model.

Generator: The Path from Information to Answer

The second of the two fundamental components of RAG systems is the "Generator." The Retriever only takes the job to a certain point; it performs the part of the work that, like the bulk of an iceberg, stays below the surface. The Generator, on the other hand, produces the natural language responses we actually see. Its mission is to make information speak: it transforms the retrieved pieces of information into a coherent response. For these operations, sequence-to-sequence (seq2seq) models, designed to transform an input sequence into an output sequence, are commonly used; examples include T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformers), while decoder-only models such as the GPT series are also widely used as generators. Let's examine the Generator's operations step by step.

Step 1: Prepare the Context

The most relevant text documents received from the Retriever are selected; these form the "context." The Generator then assembles a new input from these pieces of information: the "k" most relevant documents provided by the Retriever are combined with the user's question into a single input sequence. This sequence is the "prompt," a concept we hear a lot about these days, and one for which a specialized engineering discipline has even emerged.

Input = [Question] + [Document₁] + [Document₂] + ... + [Documentₖ]

This generated "input" is processed by a model built on an encoder-decoder architecture, such as T5. Let's examine this through an example and code.

Input Sequence: [Question] + [Document₁] + [Document₂]

Question: “What are the advantages of RAG systems?”

Document 1: "RAG provides up-to-date information thanks to its ability to retrieve real-time external data sources."

Document 2: "Compared to traditional models, RAG can generate more transparent and auditable answers."

Building a "prompt" from the user question and retrieved documents is demonstrated in the code block below, which puts the theoretical example above into practice.

#Example user question
user_question = "What are the advantages of RAG?"
 
#Example retrieved documents (from similarity search)
retrieved_docs = [
    "RAG provides up-to-date information thanks to its ability to retrieve real-time external data sources.",
    "Compared to traditional models, RAG can generate more transparent and auditable answers."
]
 
#Build the input prompt
prompt = f"Question: {user_question}\nContext:\n"
for i, doc in enumerate(retrieved_docs, start=1):
    prompt += f"- Document {i}: {doc}\n"
 
print("Final Prompt:\n")
print(prompt)

The output of the above code is given below.

Question: What are the advantages of RAG?
Context:
- Document 1: RAG provides up-to-date information thanks to its ability to retrieve real-time external data sources.
- Document 2: Compared to traditional models, RAG can generate more transparent and auditable answers.

As seen in the example and the code, the input sequence we call the "prompt" consists entirely of the user's question and the documents returned by the Retriever. This completes the Generator's first step.

Step 2: Encoding

"Transformer" models built on an encoder-decoder architecture enable RAG systems to interpret user questions and documents and produce answers in natural language. First, the constructed input reaches the encoder, where it is converted into latent representations. This process is called "encoding." The context and the question are encoded together. The main goal is to generate an answer to the question based on the retrieved context; this keeps the model from relying solely on the knowledge stored in its parameters, because the documents from the Retriever are supplied as explicit context.

from transformers import T5Tokenizer

#Load tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small")
 
#Encode the prompt (Encoding step)
inputs = tokenizer.encode(
    prompt,
    return_tensors="pt",     # PyTorch tensor
    max_length=512,          # maximum sequence length
    truncation=True          # truncate if longer than max_length
)
 
print("Encoded input IDs:\n", inputs)
print("Shape:", inputs.shape)

As the code example shows, the T5 tokenizer converts the prompt text into token IDs, which the model's encoder-decoder stack will process in the next step.

Step 3: Decoding

In the encoding step, the "prompt" text was turned into token IDs and passed through the encoder. The decoder now uses the encoder's output to generate tokens sequentially: at each step, it predicts the next word based on the tokens it has already produced and the encoded context vectors, building up a natural language response. We can formulate this generation as follows:

P(y | q, d₁, …, dₖ) = ∏ₜ P(yₜ | y₁, …, yₜ₋₁, q, d₁, …, dₖ)

In the formula above, "y" represents the generated tokens, "d₁, …, dₖ" the "k" selected documents, and "q" the user's question. Taken together with the explanation, the formula makes the decoder's structure clearer: the answer is generated step by step, conditioned on the question and the retrieved documents. Once the decoding phase is complete, the Generator has produced the answer the user expects, and the work of the Retriever + Generator combination is finished.

from transformers import T5ForConditionalGeneration
 
#Load the seq2seq model that matches the tokenizer used above
model = T5ForConditionalGeneration.from_pretrained("t5-small")
 
#Decoder: generate output tokens step by step
output_ids = model.generate(
    inputs,                 # token IDs produced in the encoding step
    max_length=50,          # maximum generation length
    num_beams=4,            # beam search usually gives better results
    early_stopping=True
)
 
#Translate token IDs back to natural language
decoded_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
 
print("Generated Output IDs:\n", output_ids)
print("Decoded Output:\n", decoded_output)

The code example for the decoder shows clearly how the encoding performed earlier is followed by decoding and the construction of the final answer.

Maintaining Reliability: Preventing Hallucinations

Hallucination is when a model undermines its usefulness by generating fabricated responses. Forcing the model to draw on the retrieved sources, rather than answering purely from its parameters, suppresses this hallucination effect. A natural language response is generated from the "prompt" constructed in this way, demonstrating the defining philosophy of RAG systems: "generation from external information." This process is called "factual grounding."
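One common way to reinforce this grounding in practice is to state the constraint explicitly in the prompt. The sketch below reuses the user_question and retrieved_docs variables from the earlier example; the instruction wording is illustrative, not a fixed standard.

#Illustrative grounding instruction (wording is an example, not a fixed standard)
grounded_prompt = (
    "Answer the question using ONLY the context below.\n"
    "If the answer is not contained in the context, say you do not know.\n\n"
    "Context:\n"
    + "\n".join(f"- {doc}" for doc in retrieved_docs)
    + f"\n\nQuestion: {user_question}\nAnswer:"
)
 
print(grounded_prompt)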

In conclusion, we emphasized that RAG systems consist of two fundamental components, the Retriever and the Generator, and that these two components operate sequentially. We observed that the Retriever accesses data through indexing, vectorizes the user question, performs a similarity search, and selects the "k" most relevant documents. As a result of these processes, the user question and the related documents are handed to the Generator in the appropriate form. The Generator handles preparing the context and building a prompt, encoding the context and question, and generating a natural language response through decoding. We also briefly summarized how hallucination affects RAG systems. With this article, we have added a piece explaining the technical details of RAG systems to our series. In the third and final article of the series, we will examine the training stages of RAG systems in detail. I recommend that readers who want to put this technique into practice follow my GitHub account.

See you in the last article of the series:

Metin YURDUSEVEN.

Further Reading

We would also like to acknowledge the foundational contribution of Facebook AI's 2020 RAG paper, which significantly informed this article's perspective.


Written by metinyurdev | LLM/RAG/AI Agent Engineer. Building scalable GenAI solutions and sharing practical insights.
Published by HackerNoon on 2025/08/21