Advanced RAG: Query Augmentation for Next-Level Search using LlamaIndex🦙

Using Open Source LLM Zephyr-7b-alpha and BGE Embeddings bge-large-en-v1.5

Akash Mathur
13 min read · Jan 18, 2024

Welcome to the Advanced RAG 📚Learning Series!

Dive deeper into the fascinating world of Retrieval-Augmented Generation with this comprehensive series of articles. The series covers cutting-edge techniques and strategies to elevate your understanding and mastery of RAG applications. Explore the articles listed below to enhance your skills, and stay tuned 🔔 for more as we continue to explore Advanced RAG and unlock its potential.

Don’t miss out on any discoveries! Bookmark🏷️ this article and check back often for the latest installments in this exciting learning series.

Topics covered so far:

  1. Optimizing Retrieval with Additional Context & MetaData using LlamaIndex
  2. Enhancing Retrieval Efficiency through Rerankers using LlamaIndex
  3. Query Augmentation for Next-Level Search using LlamaIndex (you are here!)
  4. Smart Tracking and Debugging of Document Changes using LlamaIndex

In the realm of information retrieval, Retrieval-Augmented Generation (RAG) models have marked a paradigm shift, empowering large language models (LLMs) to generate contextually rich and accurate responses. However, the path to unlocking RAG’s full potential often lies beyond the limits of its default query-retrieval-generation framework.

This article delves into the transformative power of advanced query transformation techniques designed to bridge the gap between initial user prompts and the most relevant information within vast databases.

The Challenge of Retrieval Misalignment

At the heart of query transformations lies a fundamental challenge: user-generated prompts often lack the precise language or structure that aligns seamlessly with the wording of relevant documents. This misalignment can hinder retrieval efforts, leading to suboptimal responses from even the most sophisticated LLMs. Query transformations address this challenge by strategically modifying queries before the retrieval stage, enhancing their relevance, and guiding the LLM towards better information extraction.

Problem with Zero-Shot Challenges

Recent research has shed light on the benefits of breaking down complex queries into smaller, more manageable steps, a technique particularly effective for queries that require knowledge augmentation. However, fully zero-shot dense retrieval systems, where relevance labels are absent, continue to pose significant challenges. Advanced query transformations are a promising way to address them.

The idea behind query transformations is that a user’s initial prompt may not be the best text for retrieving semantically similar documents. Instead of using the prompt as-is, the system modifies the query to increase its relevance to our sources, retrieves with the modified query, and then feeds the retrieved documents to the language model.

There are many techniques for enhancing RAG, creating the additional challenge of knowing when to apply each. In this article, we will analyze 5 powerful query transformation techniques and will see how they can help to bridge the retrieval gap and perform next-level search.

  1. Hypothetical Document Embeddings (HyDE)
  2. Sub-Question Query Engine
  3. Router Query Engine
  4. Single-Step Query Decomposition
  5. Multi-Step Query Decomposition

Knowledge and Action, Hand in Hand

This journey won’t end with theoretical insights alone. Alongside each technique, you’ll find references to a dedicated GitHub repository, providing the code samples and implementation details.

Let’s dive into it.

Open Source LLM and Embedding

LLM: Throughout this exploration, we’ll harness the power of Zephyr-7b-alpha, a state-of-the-art open-source LLM renowned for its remarkable capabilities in understanding and generating text.

Embeddings: We will use BGE embeddings (bge-large-en-v1.5), a general-purpose embedding model to enable effective semantic search and knowledge extraction.

This embedding model ranks 5th in the MTEB Embedding Benchmark. Also, check out their repo

Let’s jump into the code.

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

import json
import torch
from pathlib import Path
import pandas as pd
# None removes the column-width limit (-1 is rejected by recent pandas versions)
pd.set_option("display.max_colwidth", None)

from copy import deepcopy

# transformers
from transformers import BitsAndBytesConfig

# llama_index
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM
from llama_index import download_loader, Document, VectorStoreIndex, ServiceContext
from llama_index.node_parser import SentenceSplitter
from langchain.embeddings import HuggingFaceEmbeddings

from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

from IPython.display import Markdown, display
from llama_index.response.notebook_utils import display_source_node

from llama_index.query_engine import RetrieverQueryEngine
from IPython.display import Markdown, display, HTML
from llama_index.retrievers import VectorIndexRetriever

from sentence_transformers import SentenceTransformer

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
#Load Data

PDFReader = download_loader("PDFReader")
loader = PDFReader()
docs = loader.load_data(file=Path("QLoRa.pdf"))

# create chunks
node_parser = SentenceSplitter(chunk_size=256)
nodes = node_parser.get_nodes_from_documents(docs)
#Load Open Source LLM (zephyr-7b-alpha)

from google.colab import userdata

# huggingface api token
hf_token = userdata.get('hf_token')

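# Assumed settings: a common 4-bit NF4 configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)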

def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == 'system':
            prompt += f"<|system|>\n{message.content}\n"
        elif message.role == 'user':
            prompt += f"<|user|>\n{message.content}\n"
        elif message.role == 'assistant':
            prompt += f"<|assistant|>\n{message.content}\n"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n\n" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt

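llm = HuggingFaceLLM(
    # assumed: Hugging Face model id for the Zephyr-7b-alpha model named above
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95, "do_sample": True},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",  # assumed device placement
)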
#Load Open Embedding (bge-large-en-v1.5)

embed_model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
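
To make the chat template concrete, here is a small, self-contained sketch of the same formatting logic applied to a one-message conversation. `ChatMessage` below is a hypothetical stand-in dataclass (not the LlamaIndex class), used only so the snippet runs without any dependencies.

```python
from dataclasses import dataclass


@dataclass
class ChatMessage:
    """Hypothetical stand-in for a chat message with a role and content."""
    role: str
    content: str


def messages_to_prompt(messages):
    """Format messages with the same role tags as the helper above."""
    prompt = ""
    for message in messages:
        if message.role in ("system", "user", "assistant"):
            prompt += f"<|{message.role}|>\n{message.content}\n"

    # ensure the prompt starts with a system block, insert a blank one if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n\n" + prompt

    # the trailing assistant tag cues the model to write its reply
    return prompt + "<|assistant|>\n"


demo_prompt = messages_to_prompt([ChatMessage("user", "What is QLoRA?")])
print(demo_prompt)
```

Because the conversation has no system message, the helper prepends an empty system block before the user turn.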

Before applying any query transformation, configure the Index and Retriever

# ServiceContext
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,  # assumed: the BGE embedding model loaded above
)

# index
vector_index = VectorStoreIndex(
    nodes, service_context=service_context
)

Let’s discuss and apply the above-mentioned query transformations one by one.

1. Hypothetical Document Embeddings (HyDE)

HyDE — Under the hood

HyDE (Hypothetical Document Embeddings) is a novel approach to dense retrieval that involves two distinct phases:

1. Generating a Hypothetical Answer

  • Instead of directly searching for relevant documents based on the raw query, HyDE first constructs a hypothetical document that might answer the query.
  • This is achieved by utilizing an instruction-following language model, tasked with generating a likely response to the query.
  • While this hypothetical document may not be factually accurate in every detail, it serves as a valuable example of what a relevant document could look like, capturing the essence of relevance.

2. Encoding and Retrieval

  • The hypothetical document is then processed by an unsupervised contrastive encoder, which distills its key features into a compact embedding vector.
  • Importantly, the encoder’s dense bottleneck acts as a lossy compressor, filtering out irrelevant details.
  • This embedding vector is then compared against a database of corpus embeddings, representing actual documents.
  • The retrieval process leverages document-document similarity encoded within the inner product during contrastive training, enabling the identification of documents that closely align with the hypothetical answers.
  • The most similar real documents are retrieved and presented as potential responses to the query, enhancing retrieval accuracy.
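
The two phases above can be sketched in a few lines of plain Python. This is a toy illustration, not the LlamaIndex implementation: `fake_llm` is a hypothetical stand-in that returns a canned hypothetical answer, `embed` is a bag-of-words counter standing in for a real dense encoder such as bge-large-en-v1.5, and the three corpus sentences are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (made-up sentences for illustration)
corpus = [
    "BFloat16 keeps the float32 exponent range but has fewer mantissa bits so precision is lower",
    "The paged optimizer avoids memory spikes during gradient checkpointing",
    "Double quantization quantizes the quantization constants to save memory",
]


def embed(text):
    """Toy encoder: bag-of-words counts (assumption, stands in for a dense model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm(query):
    """Hypothetical stand-in for the instruction-following LLM of phase 1."""
    return "BFloat16 has the same exponent range as float32 but lower precision from fewer mantissa bits"


def hyde_retrieve(query, top_k=1):
    hypothetical_doc = fake_llm(query)   # phase 1: generate a hypothetical answer
    query_vec = embed(hypothetical_doc)  # phase 2: embed the hypothetical document
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


top_docs = hyde_retrieve("What are the trade-offs of BFloat16?")
print(top_docs[0])
```

Note that the lookup compares document to document: the user's short question never gets embedded, only the longer hypothetical answer does.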

Let’s jump into the code.

First, we query without any transformation: the same query string is used for both the embedding lookup and the summarization.

query_str = "Describe the trade-offs between using BFloat16 as the computation data type and other possible choices. When would you choose one over the other?"

query_engine = vector_index.as_query_engine()
response = query_engine.query(query_str)


Response without HyDE

Let’s apply the HyDE transformation and see the results.

hyde = HyDEQueryTransform(include_original=True, llm=llm)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)


Response with HyDE

Let’s look at the hypothetical document. We use `HyDEQueryTransform` to generate a hypothetical document and use it for embedding lookup.

query_bundle = hyde(query_str)
hyde_doc = query_bundle.embedding_strs[0]

Conclusion: HyDE noticeably improves the output quality here. The hypothetical answer, even if partly hallucinated, resembles the relevant passages closely enough to improve the embedding lookup and, in turn, the final output.

Link to the GitHub code repository

2. Sub-Question Query Engine

Understanding Traditional Query Engines

Normal query engines are designed to locate relevant information within vast datasets. They act as intermediaries between users’ questions and stored data. When a user poses a query, the engine carefully analyzes it, pinpoints relevant data, and presents a comprehensive response.

Limitations of Traditional Query Engines

While traditional query engines excel at straightforward questions, they often face challenges when confronted with multi-faceted questions spanning multiple documents.

Simply merging documents and extracting top k elements frequently fails to capture the nuances required for truly informative responses.

Enter Sub-Question Query Engines

Decomposition Strategy: To address this complexity, Sub-Question Query Engines adopt a divide-and-conquer approach. They elegantly decompose complex queries into a series of sub-questions, each targeting specific aspects of the original inquiry.

The implementation involves defining a Sub-Question Query Engine for each data source. Instead of treating all documents equally, the engine strategically addresses sub-questions specific to each data source. To generate the final response, a top-level Sub-Question Query Engine is then employed to synthesize the results from individual sub-questions.

Given the initial complex question, we use the LLM to generate sub-questions and execute them on the selected data sources. The engine then gathers all sub-responses and synthesizes the final response.
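
As a rough mental model of this flow, here is a dependency-free sketch. It is not how LlamaIndex implements the engine: `generate_sub_questions` is a hypothetical stand-in for the LLM call, the two "engines" return canned answers, and the tool names are invented for the example.

```python
# Toy per-source "query engines": each maps a sub-question to a short answer.
def precision_engine(question):
    return "BFloat16 trades mantissa precision for a wider exponent range."


def memory_engine(question):
    return "16-bit types halve memory use compared with 32-bit floats."


# Hypothetical tool names, one per data source
tools = {"precision_notes": precision_engine, "memory_notes": memory_engine}


def generate_sub_questions(query):
    """Hypothetical stand-in for the LLM call that decomposes the query."""
    return [
        ("precision_notes", "How does BFloat16 affect numerical precision?"),
        ("memory_notes", "How does BFloat16 affect memory use?"),
    ]


def sub_question_query(query):
    sub_answers = []
    for tool_name, sub_question in generate_sub_questions(query):
        answer = tools[tool_name](sub_question)  # run each sub-question on its data source
        sub_answers.append((sub_question, answer))
    # A real engine would ask the LLM to synthesize; here we simply concatenate.
    final = " ".join(answer for _, answer in sub_answers)
    return sub_answers, final


sub_answers, final_answer = sub_question_query("What are the trade-offs of BFloat16?")
print(final_answer)
```

The key design point is that each sub-question carries the name of the tool that should answer it, so different parts of one question can be served by different data sources.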

Let’s jump into the code.

from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

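import nest_asyncio

# allow nested event loops so the async sub-question calls can run inside a notebook
nest_asyncio.apply()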

# Using the LlamaDebugHandler to print the trace of the sub questions
# captured by the SUB_QUESTION callback event type
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

# ServiceContext
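service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,  # assumed: the BGE embedding model loaded above
    callback_manager=callback_manager,  # assumed: the debug handler defined above
)

# vector query engine
vector_query_engine = VectorStoreIndex.from_documents(
    docs, service_context=service_context, use_async=True
).as_query_engine()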

Construct sub-question query engine and run some queries!

# setup base query engine as tool
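query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_query_engine,
        metadata=ToolMetadata(
            name="qlora_paper",  # assumed tool name
            description="Efficient Finetuning of Quantized LLMs",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools, service_context=service_context, use_async=True
)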

response = query_engine.query("Describe the trade-offs between using BFloat16 as the computation data type and other possible choices. When would you choose one over the other?")

Below are the sub-questions generated:


Sub-Question Query Engine Response

Link to the GitHub code repository

3. Router Query Engine

Now, we will define a router query engine that selects one out of several candidate query engines to execute a query.

Router Query Engine

A Router Query Engine serves as a powerful decision-making module that plays a crucial role in selecting the most appropriate choices based on user queries and metadata-defined options. These routers are versatile modules that can operate independently as “selector modules” or can be utilized as query engines or retrievers on top of other query engines or retrievers.

Routers excel in various use cases, including selecting the appropriate data source from a diverse range of options and deciding whether to perform summarization or semantic search based on the user query. They can also handle more complex tasks like trying out multiple choices simultaneously and combining the results using multi-routing capabilities.

We also define a “selector”. Users can easily employ routers as query engines or retrievers, with the router taking on the responsibility of selecting query engines or retrievers to route user queries effectively.

There are several selectors available, each with some distinct attributes.

  1. The LLM selectors use the LLM to output a JSON that is parsed and the corresponding indexes are queried.
  2. The Pydantic selectors (currently only supported by gpt-4 and gpt-3.5 (the default)) use the OpenAI Function Call API to produce pydantic selection objects, rather than parsing raw JSON.
  3. For each type of selector, there is also the option to select one index to route to, or multiple.
  4. Then, define the RouterQueryEngine with a desired selector module. Here, we use the LLMSingleSelector, which uses LLM to choose an underlying query engine to route the query to.
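
The routing idea itself is small enough to sketch without any library. In the toy version below, `select_single` scores each candidate by word overlap between the query and the tool description; that scoring is an assumption standing in for the LLM's JSON choice, and the two engines simply echo the query so the routing decision is visible.

```python
def summary_engine(query):
    return f"[summary engine] {query}"


def vector_engine(query):
    return f"[vector engine] {query}"


# Each candidate carries a description, which is what the selector reads.
choices = [
    ("Useful for summarization questions about the paper", summary_engine),
    ("Useful for retrieving specific context such as definitions", vector_engine),
]


def select_single(query):
    """Toy selector: index of the description sharing the most words with the query
    (assumption, stands in for the LLM selector)."""
    query_words = set(query.lower().split())
    scores = [len(query_words & set(desc.lower().split())) for desc, _ in choices]
    return scores.index(max(scores))


def route(query):
    index = select_single(query)     # pick exactly one candidate
    return choices[index][1](query)  # and forward the query to it


routed = route("Give me a summarization of the paper")
print(routed)
```

A multi-selector variant would return every index above some score threshold and then combine the individual answers.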

Let’s jump into the code.

We will define a custom router query engine that selects one out of several candidate query engines to execute a query.

from llama_index import VectorStoreIndex, SummaryIndex, SimpleKeywordTableIndex

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,  # assumed: the BGE embedding model loaded above
)

### Define all the different indexes over same data ###

# vector index
vector_index = VectorStoreIndex(
    nodes, service_context=service_context
)

# summary index
summary_index = SummaryIndex(
    nodes, service_context=service_context
)

# keyword index
keyword_index = SimpleKeywordTableIndex(nodes, service_context=service_context)

Next, we define Query Engines for each Index. We then wrap these with QueryEngineTool.

summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",  # assumed response mode
    use_async=True,  # assumed
)

vector_query_engine = vector_index.as_query_engine(service_context=service_context)

keyword_query_engine = keyword_index.as_query_engine(service_context=service_context)

from llama_index.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description=(
        "Useful for summarization questions related to Efficient Finetuning QLORA research paper"
    ),
)

vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description=(
        "Useful for retrieving specific context from QLORA research paper related to Efficient Finetuning"
    ),
)

keyword_tool = QueryEngineTool.from_defaults(
    query_engine=keyword_query_engine,
    description=(
        "Useful for retrieving specific context from QLORA research paper related to Efficient Finetuning "
        "using entities mentioned in query"
    ),
)

Then, we will use LLM selectors, which can use OpenAI or any other LLM and parse the generated JSON under the hood to select a sub-index for routing.


# LLMSingleSelector

from llama_index.query_engine.router_query_engine import RouterQueryEngine
from llama_index.selectors.llm_selectors import LLMSingleSelector, LLMMultiSelector

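router_query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(service_context=service_context),
    query_engine_tools=[summary_tool, vector_tool, keyword_tool],  # assumed: all three tools defined above
    service_context=service_context,
)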

response = router_query_engine.query("What is Double Quantization?")


Router Query Engine Response


If we want to route our query to multiple indexes, we can use a multi selector. The multi selector sends the query to multiple sub-indexes and then aggregates all responses using a summary index to form a complete answer.

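router_query_engine = RouterQueryEngine(
    selector=LLMMultiSelector.from_defaults(service_context=service_context),
    query_engine_tools=[summary_tool, vector_tool, keyword_tool],  # assumed: all three tools defined above
    service_context=service_context,
)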


Link to the GitHub code repository

4. Single-Step Query Decomposition

Recent studies have demonstrated that LLMs tend to perform better when they break down complex questions into smaller, more manageable steps. In cases where a query is complex, different parts of the knowledge base may be relevant to answer distinct “subqueries” within the overall question. The single-step query transformation acknowledges this and aims to address each subquery independently.

The single-step query decomposition feature is designed to transform a complicated question into a simpler one, specifically tailored to extract relevant information from the data collection. By breaking down the original question into smaller, more focused subqueries, the model can provide sub-answers that collectively contribute to addressing the complexity of the original question.

Image from LlamaIndex documentation
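
A minimal sketch of the single-step idea, under stated assumptions: `decompose_query` is a hypothetical stand-in for the LLM call that rewrites the question using a summary of the data source, and `toy_query_engine` is a keyword lookup over a one-entry knowledge base rather than a real retriever.

```python
def decompose_query(question, index_summary):
    """Hypothetical stand-in for the LLM call that rewrites a complex question
    into one simpler question the described data source can answer."""
    if "compare" in question.lower() and "QLoRA" in index_summary:
        return "What is 4-bit NormalFloat?"
    return question  # already simple enough, pass through unchanged


def toy_query_engine(question):
    """Toy query engine: keyword lookup over a one-entry knowledge base."""
    knowledge = {"4-bit normalfloat": "NF4 is a 4-bit data type introduced in the QLoRA paper."}
    for key, answer in knowledge.items():
        if key in question.lower():
            return answer
    return "No relevant context found."


def single_step_query(question, index_summary):
    simpler_question = decompose_query(question, index_summary)  # one transformation step
    return toy_query_engine(simpler_question)                    # then a normal query


question = "Compare NF4 with the other 4-bit data types"
direct_answer = toy_query_engine(question)
decomposed_answer = single_step_query(question, "The QLoRA paper on efficient finetuning")
print(direct_answer)
print(decomposed_answer)
```

In this toy setup the original wording misses the knowledge base entirely, while the simplified question finds it, which is the retrieval gap the transformation is meant to close.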

5. Multi-Step Query Decomposition

The multi-step query transformation represents an innovative approach known as the self-ask method. This method is rooted in the concept of a language model asking and answering follow-up questions to itself before providing an answer to the original query. The goal is to empower the model to seamlessly combine the information it has learned independently.

The model connects scattered facts, synthesizes insights, and uncovers relationships that might have remained obscured in a single-step approach.

Hence, Multi-Step Query Transformations overcome a common limitation of LLMs: the difficulty in combining separate facts to draw new conclusions. By iteratively exploring knowledge, the model uncovers connections that might otherwise remain hidden.

Image from LlamaIndex documentation
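
The self-ask loop can be illustrated with a small, dependency-free sketch. Everything here is a stand-in: `next_sub_question` plays the role of the LLM step-decompose call with a scripted list of follow-ups, and `toy_query_engine` is a keyword lookup over two hard-coded facts.

```python
def next_sub_question(question, reasoning):
    """Hypothetical stand-in for the LLM step-decompose call: returns the next
    follow-up question given the steps so far, or None when nothing is left."""
    scripted = [
        "Which data type does QLoRA use for storage?",
        "Which data type does QLoRA use for computation?",
    ]
    return scripted[len(reasoning)] if len(reasoning) < len(scripted) else None


def toy_query_engine(question):
    """Toy query engine: keyword lookup over two hard-coded facts."""
    facts = {
        "storage": "QLoRA stores weights in 4-bit NormalFloat.",
        "computation": "QLoRA computes in BFloat16.",
    }
    return next((a for k, a in facts.items() if k in question.lower()), "Unknown.")


def multi_step_query(question, max_steps=3):
    reasoning = []  # (sub-question, answer) pairs gathered so far
    for _ in range(max_steps):
        sub_question = next_sub_question(question, reasoning)
        if sub_question is None:  # nothing left to ask
            break
        reasoning.append((sub_question, toy_query_engine(sub_question)))
    # A real engine would synthesize with the LLM; here we join the step answers.
    return reasoning, " ".join(answer for _, answer in reasoning)


steps, combined = multi_step_query("How do storage and computation data types differ in QLoRA?")
print(combined)
```

Unlike the single-step version, each follow-up is chosen with the earlier answers in hand, and the loop stops either when no follow-up remains or when `max_steps` is reached.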

Let’s jump into the code.

from llama_index.indices.query.query_transform.base import StepDecomposeQueryTransform
from llama_index.query_engine.multistep_query_engine import MultiStepQueryEngine

# set Logging to DEBUG for more detailed outputs
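from llama_index.query_engine.multistep_query_engine import (
    MultiStepQueryEngine,
)

step_decompose_transform = StepDecomposeQueryTransform(llm=llm, verbose=True)
query_engine = vector_index.as_query_engine(service_context=service_context)

query_engine = MultiStepQueryEngine(
    query_engine=query_engine,
    query_transform=step_decompose_transform,
    index_summary="Used to answer questions about the QLoRA paper",  # assumed summary text
)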

Note: While running MultiStepQueryEngine, I was getting a ValueError: Could not load OpenAI model.

It looks like MultiStepQueryEngine supports only the OpenAI GPT-4 and GPT-3.5 models as of now. I will keep looking into this and update the code accordingly. I am also leaving it as an open question for everyone, so please share your thoughts in the comment section.

Which One Is Right for Me?

Both Sub-Question Query Engine and Single/Multi-step query decomposition in RAG tackle complex queries, but they approach the problem from different angles:

Sub-Question Query Engine

It focuses on the divide-and-conquer approach. It decomposes a complex query into a series of smaller, focused sub-questions. Each sub-question is sent to a dedicated Sub-Question Query Engine that retrieves relevant information from its specific data source.

Hence, it ensures each sub-question gets the appropriate data source, leading to more precise results. It provides comprehensive answers by aggregating insights from various sub-questions to provide a holistic response.

Single/Multi-step query decomposition

It focuses on the sequential refinement of the query. It breaks down the complex query into intermediate steps, progressively enriching the search with the retrieved information. Each step searches for relevant documents based on the current query state, updating the query with extracted knowledge.

Hence, it avoids redundant retrieval by refining the query with each step.


This exploration of advanced query augmentation is not an end, but an exciting beginning. While the approaches we covered in this article offer transformative powers, the boundaries of information retrieval stretch ever further.

Future research may delve into hybrid approaches, combining these techniques for even greater synergy. We may witness the rise of personalized transformations, adapting to individual user needs and preferences. Ultimately, the journey towards bridging the retrieval gap is a continuous one, fuelled by innovation and driven by the desire to connect users with the most accurate and insightful information, wherever it may reside.

Refer to the complete code on Github:

To refer to other advanced RAG methods, refer to this repo:

Thank you for reading this article, I hope it added some pieces to your knowledge stack! Before you go, if you enjoyed reading this article:

👉 Be sure to clap and follow me, and let me know if you have any feedback.

👉I built versatile Generative AI applications using the Large Language Model (LLM), covered advanced RAG concepts, and serverless AWS architectures for Big Data processing. You’re welcome to take a look at the repo and star⭐it.

👉Follow me: LinkedIn | GitHub | Medium | Portfolio