Advanced RAG: Optimizing Retrieval with Additional Context & MetaData using LlamaIndex🦙

Using Open Source LLM Zephyr-7b-alpha and Instruct Embeddings hkunlp/instructor-large

Akash Mathur
12 min readDec 23, 2023

Welcome to the Advanced RAG 📚Learning Series!

Dive deeper into the fascinating world of Retrieval-Augmented Generation with this comprehensive series of articles. This series delves into cutting-edge techniques and strategies to elevate your understanding and mastery of RAG applications. Explore the following articles to enhance your skills and Stay tuned 🔔 for more articles in this series as we continue to delve deeper into the world of Advanced RAG and unlock its boundless potential.

Don’t miss out on any discoveries! Bookmark🏷️ this article and check back often for the latest installments in this exciting learning series.

Topics covered so far:

  1. Optimizing Retrieval with Additional Context & MetaData using LlamaIndex (you are here!)
  2. Enhancing Retrieval Efficiency through Rerankers using LlamaIndex
  3. Query Augmentation for Next-Level Search using LlamaIndex
  4. Smart Tracking and Debugging of Document Changes using LlamaIndex
An image by the author

RAG (Retrieval Augmented Generation) is a technique for augmenting LLM knowledge with additional, often private or real-time, data.

LLMs are trained on enormous bodies of data but they aren’t trained on your data. RAG solves this problem by adding your data to the data LLMs already have access to.

If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of bringing the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).

In RAG, your data is loaded and prepared for queries or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM responds.

Stages within RAG

Stages within RAG

There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

  1. Loading: this refers to getting your data from where it lives — whether it’s text files, PDFs, another website, a database, or an API — into your pipeline. LlamaHub provides hundreds of connectors to choose from.

2. Spliting: If we were to use these large sections, then we’d be inserting a lot of noisy/unwanted context and because all LLMs have a maximum context length, we wouldn’t be able to fit too much other relevant context. So instead, we’re going to split the text within each section into smaller chunks. Intuitively, smaller chunks will encapsulate single/few concepts and will be less noisy compared to larger chunks.

Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t in a model’s finite context window. Empirically models struggle to find the relevant context in very long prompts.

3. Indexing: this means creating a data structure that allows for querying the data. For LLMs, this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.

4. Storing: once your data is indexed you will almost always want to store your index, as well as other metadata, to avoid having to re-index it.

5. Querying: for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries, and hybrid strategies.

6. Evaluation: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are.

Advanced RAG

Parent-Child Chunks Retrieval

Now we’ll dive into the overview of ONE of the advanced RAG techniques related to Context Enhancement i.e. Parent-Child Chunks Retrieval.

The concept here is to retrieve smaller chunks for better search quality, but add up surrounding context for LLM to reason upon.

Let’s understand with an example-

Imagine you have a really big book, and you want to find specific chapters or paragraphs quickly. This Retriever works like a smart librarian who carefully breaks down this book into smaller, understandable chunks and remembers where they came from. When you ask for one of those smaller pieces, it not only gives you that bit but also knows exactly which big section or even the whole book it belongs to. It’s like having a shortcut to the exact spot in the big book, saving time and effort while making sure you understand everything in context.

It fetches smaller chunks during retrieval first, then references the parent IDs, and returns the bigger chunks.

Documents are split into a hierarchy of chunks and then the smallest leaf chunks are sent to the index. At the retrieval time, we retrieve k leaf chunks, and if n chunks refer to the same parent chunk, we replace them with this parent chunk and send it to LLM for answer generation.

In this blog post, we will dive into the implementations of this method in LlamaIndex. It fetches smaller chunks during retrieval first, then if more than n chunks in top k retrieved chunks are linked to the same parent node (larger chunk), we replace the context fed to the LLM by this parent node — works like auto merging a few retrieved chunks into a larger parent chunk, hence the method name. Just to note — the search is performed just within the child nodes index.

Let’s deeper dive with the help of code.

Implementation (Open Source LLM and Embedding)

When you first perform retrieval, you may want to retrieve the reference as opposed to the raw text. You can have multiple references point to the same node.

In this blog, we will explore some different usages of node references:

  • Chunk references: Different chunk sizes refer to a bigger chunk
  • Metadata references: Summaries + Generated Questions referring to a bigger chunk

We will use Open Source LLM zephyr-7b-alpha and embedding hkunlp/instructor-large

This notebook shows how you can use recursive retrieval to traverse node relationships and fetch nodes based on “references”.

Let’s implement it step-by-step:

A. Load Data — Before your chosen LLM can act on your data you need to load it. The way LlamaIndex does this is via data connectors, also called Reader. Data connectors ingest data from different data sources and format the data into Document objects. A Document is a collection of data (currently text, and in the future, images, and audio) and metadata about that data.

PDFReader = download_loader("PDFReader")
loader = PDFReader()
docs = loader.load_data(file=Path("./data/QLoRa.pdf"))

# combine all the text
doc_text = "\n\n".join([d.get_content() for d in docs])
documents = [Document(text=doc_text)]

B. Chunking- We will useSentenceSplitterthat split the text while respecting the boundaries of sentences. This function is used to break down large bodies of text into smaller sections, ensuring that sentences aren't split in the middle and making it easier to process or analyze text data in manageable portions. We will create an initial set of nodes (chunk size 1024).

node_parser = SentenceSplitter(chunk_size=1024)

base_nodes = node_parser.get_nodes_from_documents(documents)
# set node ids to be a constant
for idx, node in enumerate(base_nodes):
node.id_ = f"node-{idx}"

# print all the node ids corrosponding to all the chunks
for node in base_nodes:

C. Open Source LLM and Embedding — We will use Open Source LLM zephyr-7b-alpha and will quantify it for memory and computation. This should run on a T4 GPU in the free tier on Colab.

In this example, we will use hkunlp/instructor-large. This is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) by simply providing the task instruction, without any finetuning. Instructor👨‍ ranks at #14 on the MTEB leaderboard!

Now, we will be setting up the ServiceContextobject, and will be using it to construct an index and query. The input documents will be broken into nodes, and the embedding model will generate an embedding for each node. Then, at query time, the embedding model will be used again to embed the query text.

from google.colab import userdata

# huggingface api token for downloading zephyr-7b
hf_token = userdata.get('hf_token')

quantization_config = BitsAndBytesConfig(

def messages_to_prompt(messages):
prompt = ""
for message in messages:
if message.role == 'system':
prompt += f"<|system|>\n{message.content}</s>\n"
elif message.role == 'user':
prompt += f"<|user|>\n{message.content}</s>\n"
elif message.role == 'assistant':
prompt += f"<|assistant|>\n{message.content}</s>\n"

# ensure we start with a system prompt, insert blank if needed
if not prompt.startswith("<|system|>\n"):
prompt = "<|system|>\n</s>\n" + prompt

# add final assistant prompt
prompt = prompt + "<|assistant|>\n"

return prompt

llm = HuggingFaceLLM(
model_kwargs={"quantization_config": quantization_config},
# tokenizer_kwargs={},
generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},

# Embedding
embed_model = HuggingFaceInstructEmbeddings(
model_name="hkunlp/instructor-large", model_kwargs={"device": DEVICE}

# set your ServiceContext for all the next steps
service_context = ServiceContext.from_defaults(
llm=llm, embed_model=embed_model

1. Baseline Retriever

Let’s define a baseline retriever that simply fetches the top-k raw text nodes by embedding similarity. But, we have to index our data first:

a) Indexing — With our data loaded, we now have a list of Document objects (or a list of Nodes). It’s time to build an Index over these objects so you can start querying them.

In LlamaIndex terms, an Index is a data structure composed of Document objects, designed to enable querying by an LLM.

# index
base_index = VectorStoreIndex(base_nodes, service_context=service_context)

b) Base Retriever- It simply fetches the top-k raw text nodes by embedding similarity.

# find top 2 nodes
base_retriever = base_index.as_retriever(similarity_top_k=2)

retrievals = base_retriever.retrieve(
"Can you tell me about the Paged Optimizers?"

for n in retrievals:
display_source_node(n, source_length=1500)

See the output node-id and similarity score.

c) Querying — Now we’ve loaded our data, and built an index, we’re ready to get to the most significant part of an LLM application: querying!

The most important thing to know about querying is that it is just a prompt to an LLM: so it can be a question and get an answer, or a request for summarization, or a much more complex instruction.

The basis of all querying is the QueryEngine. The simplest way to get a QueryEngine is to get your index to create one for you, like this:

# create a query engine
query_engine_base = RetrieverQueryEngine.from_args(
base_retriever, service_context=service_context

# query
response = query_engine_base.query(
"Can you tell me about the Paged Optimizers?"

2. Chunk References: Smaller Child Chunks Referring to Bigger Parent Chunk

Before, we used big pieces of text, each about 1024 characters long, for finding and putting together information. Now, we’ll try something different. Instead of using those big chunks, we’ll break them down into smaller pieces, like making smaller puzzles from a big one.

Here’s how we’ll do it:

Imagine each big piece of text is like a big jigsaw puzzle piece. We’ll split that big piece into smaller ones:

  • From each big piece, we’ll create 4 slightly bigger pieces, each around 256 characters long.
  • Next, we’ll have 2 even bigger pieces, each about 512 characters long.

And of course, we’ll still keep the original big piece of 1024 characters. So, instead of just having one big piece, we’ll have lots of smaller ones. This helps us find specific things more easily, and when we need to see the bigger picture, we can always refer back to the original big piece.

sub_chunk_sizes = [256, 512]
sub_node_parsers = [SentenceSplitter(chunk_size=c) for c in sub_chunk_sizes]

all_nodes = []
for base_node in base_nodes:
for n in sub_node_parsers:
sub_nodes = n.get_nodes_from_documents([base_node])
sub_inodes = [
IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes

# also add original node to node
original_node = IndexNode.from_text_node(base_node, base_node.node_id)

all_nodes_dict = {n.node_id: n for n in all_nodes}

Now, let's take a look at all the ALL nodes dictionary containing parent and child nodes.

ALL nodes dictionary

Let’s check one child node and it’s a reference to its parent node. See, this index node (child) is referenced to node-0 (parent).

Similarly, you will see that these many smaller chunks (IndexNode) are associated with each of the original text chunks(TextNode) for example node-0. All of the smaller chunks reference to the large chunk in the metadata with index_id pointing to the index ID of the larger chunk.

a) Indexing (from these smaller chunks)

vector_index_chunk = VectorStoreIndex(
all_nodes, service_context=service_context

vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)

When we perform retrieval, we want to retrieve the reference as opposed to the raw text. You can have multiple references point to the same

b) Recursive Retriever — When we perform retrieval, we want to retrieve the reference as opposed to the raw text. You can have multiple references point to the same node.

# Recursive Retriever
retriever_chunk = RecursiveRetriever(
retriever_dict={"vector": vector_retriever_chunk},

# query
nodes = retriever_chunk.retrieve(
"Can you tell me about the Paged Optimizers?"

for node in nodes:
display_source_node(node, source_length=2000)

c) Querying — Let’s start querying!

# Retriever Query Engine
query_engine_chunk = RetrieverQueryEngine.from_args(
retriever_chunk, service_context=service_context

# query
response = query_engine_chunk.query(
"Can you tell me about the Paged Optimizers?"

Metadata References: Summaries and Generated Questions referring to a bigger chunk

Now, we will add some additional context that references the source node.

This additional context includes summaries as well as generated questions. Due to the limited compute, I am only extracting questions, but you can uncomment the summarizer to extract summaries.

During query time, we retrieve smaller chunks, but we follow references to bigger chunks. This allows us to have more context for synthesis.

import nest_asyncio

extractors = [
# SummaryExtractor(summaries=["self"], llm=llm, show_progress=True),
QuestionsAnsweredExtractor(questions=1, llm=llm, show_progress=True),

# run metadata extractor across base nodes, get back dictionaries
metadata_dicts = []
for extractor in extractors:

Let’s add additional metadata to our existing nodes:

# all nodes consists of source nodes, along with metadata
import copy

all_nodes = copy.deepcopy(base_nodes)
for idx, d in enumerate(metadata_dicts):

# QnA node
inode_q = IndexNode(

# Summary node
# inode_s = IndexNode(
# text=d["section_summary"], index_id=base_nodes[idx].node_id)

all_nodes.extend([inode_q]) #, inode_s

# all_nodes_dict
all_nodes_dict = {n.node_id: n for n in all_nodes}

a) Indexing (with MetaData)

vector_index_metadata = VectorStoreIndex(all_nodes, service_context=service_context)
vector_retriever_metadata = vector_index_metadata.as_retriever(similarity_top_k=2)

b) Recursive Retriever (with MetaData)

retriever_metadata = RecursiveRetriever(
retriever_dict={"vector": vector_retriever_metadata},

nodes = retriever_metadata.retrieve(
"Can you tell me about the Paged Optimizers?"

for node in nodes:
display_source_node(node, source_length=2000)

c) Querying (with MetaData)

query_engine_metadata = RetrieverQueryEngine.from_args(
retriever_metadata, service_context=service_context
response = query_engine_metadata.query(
"Can you tell me about the Paged Optimizers?"


Once you have data loaded and indexed, you will probably want to store it to avoid the time and cost of re-indexing it. By default, your indexed data is stored only in memory.

We will use Chroma to store the embeddings from a VectorStorendex, you need to:

  • initialize the Chroma client
  • create a Collection to store your data in Chroma
  • assign Chroma as the vector_store in a StorageContext
  • initialize your VectorStoreIndex using that StorageContext

Here’s what that looks like:

# initialize client, setting path to save data
db = chromadb.PersistentClient(path="./chroma_db")

# create collection
chroma_collection = db.get_or_create_collection("QLoRa_knowledge_database")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
vector_index_metadata_db = VectorStoreIndex(

Compare the Outputs

Finally, we can run evaluations on each of the retrievers (either chunk or metadata) to measure the hit rate and MRR on our specific use case and select the optimal retriever.

Refer to the notebook code on Github:

To refer to other advanced RAG methods, refer to this repo:


Thank you for reading this article, I hope it added some pieces to your knowledge stack! Before you go, if you enjoyed reading this article:

👉 Be sure to clap and follow me, and let me know if any feedback.

👉I built versatile Generative AI applications using the Large Language Model (LLM), covered advanced RAG concepts, and serverless AWS architectures for Big Data processing. You’re welcome to take a look at the repo and star⭐it.

👉Follow me: LinkedIn | GitHub | Medium | Portfolio