Comprehensive Guide to Customize your Llama2 ChatBot using LlamaIndex and Streamlit

Akash Mathur
12 min read · Nov 28, 2023

Problem Statement

In today’s information-driven landscape, extracting valuable insights locked within numerous data sources (APIs, PDFs, documents, CSV, SQL, etc.) presents a formidable challenge. The wealth of knowledge residing in these documents often remains untapped due to the sheer volume and the time-consuming nature of manual extraction.

Solution

LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models. It provides the key tools to augment your LLM applications with data.

LLMs come pre-trained on huge amounts of publicly available data like Wikipedia, mailing lists, textbooks, source code and more. However, they are not trained on your data, which may be private or specific to the problem you’re trying to solve. It’s behind APIs, in SQL databases, or trapped in PDFs and slide decks.

LlamaIndex solves this problem by connecting to these data sources and adding your data to the data LLMs already have. This is often called Retrieval-Augmented Generation (RAG). RAG enables you to use LLMs to query your data, transform it, and generate new insights. You can ask questions about your data, create chatbots, build semi-autonomous agents, and more.

Use Case

In this blog, we will cover two use cases: Q&A over a dataset of multiple PDFs, and a chatbot that can handle multiple back-and-forth queries and answers, asking for clarification or answering follow-up questions. At the end, we will deploy our solution on Streamlit.

LlamaIndex gives you the tools to build knowledge-augmented chatbots. The sections in this blog are arranged in the order you'll perform these steps while building your app:

  1. Models
  2. Loading
  3. Indexing
  4. Storing
  5. Querying

Let’s deep dive into each section through our use case:

Models

LLMs

One of the first decisions when building an LLM-based application is which LLM to use. LLMs are a core component of LlamaIndex. They can be used as standalone modules or plugged into other core LlamaIndex modules (indices, retrievers, query engines).

Let’s look at how LLMs are used at multiple different stages of your pipeline:

  1. During Indexing, you may use an LLM to determine the relevance of data (whether to index it at all) or you may use an LLM to summarize the raw data and index the summaries instead.

  2. During Querying, LLMs can be used in two ways:

  • During Retrieval (fetching data from your index) LLMs can be given an array of options (such as multiple different indices) and make decisions about where best to find the information you’re looking for. An agentic LLM can also use tools at this stage to query different data sources.
  • During Response Synthesis (turning the retrieved data into an answer) an LLM can combine answers to multiple sub-queries into a single coherent answer, or it can transform data, such as from unstructured text to JSON or another programmatic output format.

Usually, you will instantiate an LLM and pass it to a ServiceContext, which you then pass to other stages of the pipeline.

Quick Note on Tokenization

By default, LlamaIndex uses a global tokenizer for all token counting. This defaults to cl100k from tiktoken, which matches the default LLM, gpt-3.5-turbo.

If you change the LLM, you may need to update this tokenizer to ensure accurate token counts, chunking, and prompting (see the sketch after the LLM setup below).

In this example, we will use the open-source meta-llama/Llama-2-7b-chat-hf as our LLM and quantize it to reduce memory and compute requirements. This should run on a T4 GPU in the free tier on Colab.

import torch
from transformers import BitsAndBytesConfig
from llama_index.prompts import PromptTemplate
from llama_index.llms import HuggingFaceLLM

# Hugging Face API token for downloading Llama 2
hf_token = "your-hugging-face-token"

# 4-bit quantization config so the 7B model fits on a T4 GPU
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    query_wrapper_prompt=PromptTemplate("<s> [INST] {query_str} [/INST] "),
    context_window=3900,
    model_kwargs={"token": hf_token, "quantization_config": quantization_config},
    tokenizer_kwargs={"token": hf_token},
    device_map="auto",
)
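Since we replaced the default LLM, we can also point LlamaIndex's token counting at the Llama 2 tokenizer, as noted in the tokenization section above. This is an optional sketch, assuming the 0.9.x-era set_global_tokenizer helper:

# Optional: align LlamaIndex's global token counting with the Llama 2 tokenizer.
# A minimal sketch assuming llama_index 0.9.x exposes set_global_tokenizer.
from llama_index import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf", token=hf_token
    ).encode
)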

Embeddings

Embeddings are used in LlamaIndex to represent your documents using a sophisticated numerical representation. Embedding models take text as input and return a long list of numbers used to capture the semantics of the text. These embedding models have been trained to represent text this way and help enable many applications, including search.

When calculating the similarity between embeddings, there are many methods to use (dot product, cosine similarity, etc.). By default, LlamaIndex uses cosine similarity when comparing embeddings.
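To make the idea concrete, here is a small illustrative computation (plain NumPy, not LlamaIndex-specific) of cosine similarity between two toy embedding vectors:

import numpy as np

# Toy vectors standing in for two text embeddings (illustrative only)
a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

# Cosine similarity: dot product normalized by the vector magnitudes
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine_similarity)  # closer to 1.0 means more semantically similar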

There are many embedding models to pick from. By default, LlamaIndex uses text-embedding-ada-002 from OpenAI.

In this example, we will use HuggingFaceInstructEmbeddings. This is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) by simply providing the task instruction, without any finetuning. Instructor ranks at #14 on the MTEB leaderboard!

We will use SentenceSplitter, which splits text while respecting sentence boundaries. It breaks large bodies of text into smaller sections, ensuring that sentences aren't cut in the middle and making the text easier to process and analyze in manageable chunks.

The most common usage for an embedding model is to set it in the ServiceContext object, and then use it to construct an index and query. The input documents will be broken into nodes, and the embedding model will generate an embedding for each node. Then, at query time, the embedding model will be used again to embed the query text.

import torch
from llama_index import ServiceContext
from llama_index.text_splitter import SentenceSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Instruction-finetuned embedding model (hkunlp/instructor-large)
embed_model = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large", model_kwargs={"device": DEVICE}
)

# The default chunk_size/chunk_overlap values work just fine here
text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    text_splitter=text_splitter,
)
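Optionally, llama_index 0.9.x also lets you register this configuration globally so you don't have to pass it to every call; a small convenience sketch:

# Optional: register the ServiceContext globally (llama_index 0.9.x helper).
# Index and query calls fall back to this context when none is passed.
from llama_index import set_global_service_context

set_global_service_context(service_context)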

Loading

Before your chosen LLM can act on your data, you need to load it. The way LlamaIndex does this is via data connectors, also called Readers. Data connectors ingest data from different data sources and format it into Document objects. A Document is a collection of data (currently text, and in the future, images and audio) and metadata about that data.

Loading using SimpleDirectoryReader

The easiest reader to use is the SimpleDirectoryReader, which creates documents out of every file in a given directory. It is built into LlamaIndex and can read a variety of formats including Markdown, PDFs, Word documents, images, etc.

from llama_index import SimpleDirectoryReader

# Load data
documents = SimpleDirectoryReader('./sample_data/pdfs').load_data()
len(documents)
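Each entry returned is a Document object. As an optional sanity check, you can inspect its text and metadata (for PDFs, SimpleDirectoryReader typically attaches the file name and page label):

# Inspect the first loaded Document (optional sanity check)
print(documents[0].metadata)   # e.g., file name and page label for PDFs
print(documents[0].text[:500]) # first 500 characters of the extracted text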

Indexing

With your data loaded, you now have a list of Document objects (or a list of Nodes). It’s time to build an Index over these objects so you can start querying them.

In LlamaIndex terms, an Index is a data structure composed of Document objects, designed to enable querying by an LLM.

Parsing Documents into Nodes

Under the hood, indexers split your Document into Node objects, which are similar to Documents (they contain text and metadata) but have a relationship to their parent Document.

A Node corresponds to a chunk of text from a Document. LlamaIndex takes in Document objects and internally parses/chunks them into Node objects.

The way in which your text is split up can have a large effect on the performance of your application in terms of the accuracy and relevance of the results returned. The defaults work well for simple text documents, but depending on what your data looks like, you will sometimes want to modify how your documents are split up.

Remember, a ServiceContext is a simple bundle of configuration data passed to many parts of LlamaIndex.

Indexes have a .from_documents() method which accepts an array of Document objects and will correctly parse and chunk them up.

from llama_index import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
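If you want to inspect or control the intermediate Node objects yourself, you can run the splitter explicitly and build the index from the nodes. A minimal sketch, assuming the 0.9.x API where text splitters also act as node parsers:

# A sketch assuming the llama_index 0.9.x API, where text splitters double as
# node parsers and expose get_nodes_from_documents().
nodes = text_splitter.get_nodes_from_documents(documents)
print(f"{len(documents)} documents were split into {len(nodes)} nodes")

# Indexes can also be built directly from nodes instead of documents
vector_index = VectorStoreIndex(nodes, service_context=service_context)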

When you want to search your embeddings, your query is itself turned into a vector embedding, and then a mathematical operation is carried out by the VectorStoreIndex to rank all the embeddings by how semantically similar they are to your query.

Reference: Colab Notebook

Storing

Once you have data loaded and indexed, you will probably want to store it to avoid the time and cost of re-indexing it. By default, your indexed data is stored only in memory.

Persisting to disk

The simplest way to store your indexed data is to use the built-in .persist() method of every Index, which writes all the data to disk at the location specified. This works for any type of index.

Under the hood, LlamaIndex also supports swappable storage components that allow you to customize:

  • Document stores: where ingested documents (i.e., Node objects) are stored.
  • Index stores: where index metadata is stored.
  • Vector stores: where embedding vectors are stored.
  • Graph stores: where knowledge graphs are stored (e.g., for the KnowledgeGraphIndex).

Using Vector Stores

A VectorStoreIndex is by far the most frequent type of Index you'll encounter. The VectorStoreIndex takes your Documents and splits them up into Nodes. It then creates vector embeddings of the text of every node, ready to be queried by an LLM. In this example, we'll be using Chroma, an open-source vector store.

import os
from llama_index import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# check if storage already exists
if not os.path.exists("./storage"):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()
    vector_index = VectorStoreIndex.from_documents(documents)
    # store it for later
    vector_index.storage_context.persist()
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    vector_index = load_index_from_storage(storage_context)

Using Chroma

To use Chroma to store the embeddings from a VectorStoreIndex, you need to:

  • initialize the Chroma client
  • create a Collection to store your data in Chroma
  • assign Chroma as the vector_store in a StorageContext
  • initialize your VectorStoreIndex using that StorageContext

Here’s what that looks like:

import chromadb
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext

# To create an ephemeral (in-memory, short-lived) client instead:
# db = chromadb.EphemeralClient()

# initialize the persistent client, setting the path where data is saved
db = chromadb.PersistentClient(path="./chroma_db")

# create (or fetch) the collection
chroma_collection = db.get_or_create_collection("bank_earnings_database")

# assign Chroma as the vector_store in the StorageContext
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
vector_index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)
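Because the embeddings now live in Chroma, a later session can rebuild the index object without re-embedding anything. A sketch assuming the 0.9.x VectorStoreIndex.from_vector_store constructor:

# In a later session: reconnect to the persisted Chroma collection
db = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = db.get_or_create_collection("bank_earnings_database")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Rebuild the index from the stored vectors (no re-embedding needed)
vector_index = VectorStoreIndex.from_vector_store(
    vector_store, service_context=service_context
)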

Reference: Colab Notebook

Querying

Query Engine

Now that you've loaded your data, built an index, and stored that index for later, you're ready for the most significant part of an LLM application: querying!

The most important thing to know about querying is that it is just a prompt to an LLM: it can be a question that gets an answer, a request for summarization, or a much more complex instruction.

The basis of all querying is the QueryEngine. The simplest way to get a QueryEngine is to get your index to create one for you, like this:

# create a query engine and query
query_engine = vector_index.as_query_engine()
response = query_engine.query("Enter your query")
print(response)

Stages of querying

However, there is more to querying than initially meets the eye. Querying consists of three distinct stages:

  • Retrieval is when you find and return the most relevant documents for your query from your Index. The most common type of retrieval is top_k semantic retrieval, but there are many other retrieval strategies.
  • Postprocessing is when the Nodes retrieved are optionally reranked, transformed, or filtered, for instance by requiring that they have specific metadata such as keywords attached.
  • Response synthesis is when your query, your most relevant data and your prompt are combined and sent to your LLM to return a response.

Node Postprocessors

LlamaIndex also supports advanced Node filtering and augmentation that can further improve the relevancy of the retrieved Node objects. This can reduce the number of LLM calls (and thus time and cost) or improve response quality. For example:

  • KeywordNodePostprocessor: filters nodes by required_keywords and exclude_keywords.
  • SimilarityPostprocessor: filters nodes by setting a threshold on the similarity score (thus only supported by embedding-based retrievers)

The full list of node postprocessors is documented in the Node Postprocessor Reference.

Response Synthesis

A response synthesizer generates a response from the LLM using your query and the retrieved Nodes. You can also specify different response modes, covered below.

from llama_index import get_response_synthesizer
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.postprocessor import SimilarityPostprocessor, KeywordNodePostprocessor

# configure retriever
retriever = VectorIndexRetriever(
    index=vector_index,
    similarity_top_k=6,
)

# configure node postprocessors
s_processor = SimilarityPostprocessor(similarity_cutoff=0.79)
k_processor = KeywordNodePostprocessor(
    exclude_keywords=["environmental"]
)

# configure response synthesizer
response_synthesizer = get_response_synthesizer(service_context=service_context)

# assemble the query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[s_processor, k_processor],
    response_synthesizer=response_synthesizer,
)
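Running a query through this composed engine works just like before; the returned source nodes carry similarity scores, so you can check the effect of the postprocessors. The query text below is purely illustrative:

# Illustrative query against the composed engine
response = query_engine.query("Summarize the key financial highlights in the reports.")
print(response)

# Each source node used for the answer carries its similarity score and metadata
for node_with_score in response.source_nodes:
    print(round(node_with_score.score, 3), node_with_score.node.metadata)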

Before moving on, let's quickly look at the response modes LlamaIndex supports:

from llama_index.response.pprint_utils import pprint_response

# compact (default): pack as many retrieved chunks as fit into each LLM call
query_engine = vector_index.as_query_engine(response_mode="compact")
response = query_engine.query("What is the Q2FY24 Net revenue for HDFC Bank?")
pprint_response(response, show_source=True)

# refine: go through the retrieved chunks sequentially, refining the answer each time
query_engine = vector_index.as_query_engine(response_mode="refine")
response = query_engine.query("What is the Q2FY24 Net revenue for HDFC Bank?")
pprint_response(response, show_source=True)

# tree_summarize: recursively summarize the chunks in a tree; good for summarization
query_engine = vector_index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("What is the Q2FY24 Net revenue for HDFC Bank?")
pprint_response(response, show_source=True)

# no_text: only run the retriever to fetch the nodes that would have been
# sent to the LLM, without actually sending them
query_engine = vector_index.as_query_engine(response_mode="no_text")
response = query_engine.query("What is the Q2FY24 Net revenue for HDFC Bank?")
print(response.source_nodes)

Chat Engine

So far you have only been working with the query engine. LlamaIndex also supports a chat engine. So how do they differ?

A chat engine is a high-level interface for having a conversation with your data. A chat is multiple back-and-forth exchanges with your data instead of a single question and answer.

By keeping track of the conversation history, it can answer questions with the past context in mind.

The chat engine has several chat modes. We will use the condense_question mode.

In this mode, the chat history and the latest user message are condensed into a standalone query for the index, and the query engine's response to that query is returned to the user. Let's use it to see how the engine remembers the history of the chat.

chat_engine = vector_index.as_chat_engine(chat_mode="condense_question")
response = chat_engine.chat("Can you provide important highlights of ICICI Bank's earnings report?")
print(response)
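
# (Illustrative follow-up turn) the condense_question mode rewrites this
# follow-up, using the chat history, into a standalone query for the index
response = chat_engine.chat("How does that compare with its previous quarter?")
print(response)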

chat_engine.reset()

Reference: Colab Notebook

Putting It All Together Into a Streamlit App‍🚀

You've loaded your data, indexed it, stored your index, and queried your index. Now we will build and deploy a custom web app with a simple interface using Streamlit.

Streamlit UI

Refer to the Streamlit code on GitHub:
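As a rough idea of the structure, here is a minimal sketch rather than the full app in the repo. It assumes the index was persisted to ./storage earlier and that the LLM/embedding setup from the Models section lives in a hypothetical rag_setup module:

import streamlit as st
from llama_index import StorageContext, load_index_from_storage

# Hypothetical module holding the llm/embedding/ServiceContext setup from the
# Models section; in practice this lives alongside the Streamlit script.
from rag_setup import service_context

st.title("Chat with your documents 💬")

@st.cache_resource(show_spinner=False)
def load_chat_engine():
    # Reload the index persisted earlier and wrap it in a chat engine
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context, service_context=service_context)
    return index.as_chat_engine(chat_mode="condense_question", verbose=True)

chat_engine = load_chat_engine()

# Keep the running conversation in session state across reruns
if "messages" not in st.session_state:
    st.session_state.messages = [
        {"role": "assistant", "content": "Ask me anything about your documents!"}
    ]

# Replay the conversation so far
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# Handle a new user turn
if prompt := st.chat_input("Your question"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"), st.spinner("Thinking..."):
        response = chat_engine.chat(prompt)
        st.write(response.response)
    st.session_state.messages.append({"role": "assistant", "content": response.response})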

🌟Bonus! 🌟

Since you've come this far, let me share a few things:

  1. If you want to use LangChain to build a similar chatbot, refer to this repo:

  2. If you want to use LangChain to chat with CSV files, refer to this repo:

  3. Let me also show how we can use LLMs to automate metadata extraction and improve our retrieval process!

Automated Metadata Extraction for Better Retrieval

Automated Metadata Extraction in LlamaIndex refers to the process of automatically identifying and extracting key information from documents to create descriptive metadata. This metadata serves as tags or labels that help categorize and organize documents, enabling more efficient indexing and retrieval.

We can use LLMs to automate metadata extraction with the following Metadata Extractor modules:

  1. SummaryExtractor — automatically extracts a summary over a set of Nodes
  2. QuestionsAnsweredExtractor — extracts a set of questions that each Node can answer
  3. TitleExtractor — extracts a title over the context of each Node
  4. EntityExtractor — extracts entities (i.e. names of places, people, things) mentioned in the content of each Node

Then you can chain the Metadata Extractors with the node parser. Refer to the notebook below:
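For reference, here is a minimal sketch of that chaining, assuming the llama_index 0.9.x extractors and ingestion-pipeline API (import paths differ in older releases), reusing the llm and text_splitter defined earlier:

from llama_index.extractors import TitleExtractor, QuestionsAnsweredExtractor
from llama_index.ingestion import IngestionPipeline

# Chain the splitter with LLM-powered metadata extractors; each resulting node
# carries the extracted title and candidate questions in its metadata.
pipeline = IngestionPipeline(
    transformations=[
        text_splitter,                                    # SentenceSplitter from above
        TitleExtractor(llm=llm, nodes=5),                 # document-level title
        QuestionsAnsweredExtractor(llm=llm, questions=3), # questions each node answers
    ]
)

nodes = pipeline.run(documents=documents)
print(nodes[0].metadata)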

Congratulations on building your first LLM-powered Chatbot! 🎉👏🎊

If you enjoyed reading this article, comment "Hell Yes!" in the comments section and let me know if you have any feedback.

You're welcome to take a look at the repo and star ⭐ it.

Feel free to follow me on Medium and GitHub, or say Hi on LinkedIn. I am excited to discuss anything across AI, ML, NLP, and MLOps!
