Elevating Mistral-7B’s Performance through QLoRA

Akash Mathur
13 min read · Dec 15, 2023

In recent advancements within the realm of Natural Language Processing (NLP), fine-tuning pre-trained language models has emerged as a pivotal technique for tailoring models to specific tasks and domains. Leveraging the power of large-scale language models such as the GPT (Generative Pre-trained Transformer) series, fine-tuning facilitates the adaptation of these models to particular contexts, yielding superior performance in various language understanding and generation tasks.

🌟A Quick Note on Fine-Tuning LLMs🌟

Training LLMs is computationally intensive. Full fine-tuning requires memory not just to store the model, but also for the various other quantities that are needed during the training process.

Even if your computer can hold the model weights, which are now on the order of hundreds of gigabytes for the largest models, you must also be able to allocate memory for optimizer states, gradients, forward activations, and temporary memory throughout the training process.

These additional components can be many times larger than the model and can quickly become too large to handle on consumer hardware.
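For a rough sense of scale, here is a back-of-the-envelope estimate (an illustrative sketch, assuming mixed-precision training with the Adam optimizer; exact numbers vary with the implementation):

# Rough memory estimate for full fine-tuning a 7B-parameter model with Adam
# in mixed precision (illustrative assumption, not an exact measurement).
params = 7e9

weights = params * 2      # fp16/bf16 model weights (2 bytes each)
gradients = params * 2    # fp16/bf16 gradients
optimizer = params * 12   # fp32 master weights + two Adam moment estimates

total_gb = (weights + gradients + optimizer) / 1e9
print(f"~{total_gb:.0f} GB before even counting activations")  # ~112 GB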

In contrast to full fine-tuning where every model weight is updated during supervised learning, Parameter-Efficient Fine-Tuning (PEFT) methods only update a small subset of parameters. Some techniques freeze most of the model weights and focus on fine-tuning a subset of existing model parameters, for example, particular layers or components.

Other techniques don’t touch the original model weights at all, and instead add a small number of new parameters or layers and fine-tune only the new components. With PEFT, most if not all of the LLM weights are kept frozen.

As a result, the number of trained parameters is much smaller than the number of parameters in the original LLM; in some cases, just 15–20% of the original LLM weights. This makes the memory requirements for training much more manageable. In fact, PEFT can often be performed on a single GPU. And because the original LLM is only slightly modified or left unchanged, PEFT is less prone to the catastrophic forgetting problems of full fine-tuning.

Full fine-tuning results in a new version of the model for every task you train on. Each of these is the same size as the original model, so it can create an expensive storage problem if you’re fine-tuning for multiple tasks. With PEFT, you train only a small number of weights, which results in a much smaller footprint overall, as small as megabytes depending on the task. The new parameters are combined with the original LLM weights for inference.

The PEFT weights are trained for each task and can be easily swapped out for inference, allowing efficient adaptation of the original model to multiple tasks. There are several methods you can use for parameter efficient fine-tuning, each with trade-offs on parameter efficiency, memory efficiency, training speed, model quality, and inference costs.
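To illustrate the adapter-swapping idea, here is a minimal sketch using the PEFT library (the adapter repository names "my-org/finance-adapter" and "my-org/summarization-adapter" are hypothetical placeholders for your own trained adapters):

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Attach one adapter, load a second one, and switch between them at inference time.
model = PeftModel.from_pretrained(base, "my-org/finance-adapter", adapter_name="finance")
model.load_adapter("my-org/summarization-adapter", adapter_name="summarization")

model.set_adapter("finance")        # use the finance adapter for the next generation
model.set_adapter("summarization")  # ...or swap to the summarization adapter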

We will discuss one of the reparameterization methods which reduces the number of parameters to train by creating new low-rank transformations of the original network weights. A commonly used technique of this type is LoRA (Low-rank Adaptation), which we’ll explore in this article.

🌟(Q)Low-Rank Adaptation (LoRA)🌟

LoRA is a parameter-efficient fine-tuning technique that falls into the re-parameterization category.

LoRA freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.

In transformers, there are two kinds of neural networks: self-attention and feed-forward networks. The weights of these networks are learned during pre-training. After the embedding vectors are created, they’re fed into the self-attention layers, where a series of weights are applied to calculate the attention scores. During full fine-tuning, every parameter in these layers is updated.

Figure: a basic neural network

A low-rank approximation of a matrix aims to approximate the original matrix as closely as possible, but with a lower rank. The rank of a matrix is a value that gives you an idea of the matrix’s complexity; a lower-rank matrix reduces computational complexity, and thus increases efficiency of matrix multiplications. Low-rank decomposition refers to the process of effectively approximating a matrix A by deriving low-rank approximations of A. Singular Value Decomposition (SVD) is a common method for low-rank decomposition.
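As a toy illustration of a rank-r approximation via SVD (a standalone sketch, not part of the fine-tuning workflow):

import torch

A = torch.randn(1000, 1000)           # original full-rank matrix
U, S, Vh = torch.linalg.svd(A)

r = 8                                  # keep only the top-r singular values
A_approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

print(A.shape, A_approx.shape)         # same shape, but A_approx has rank <= 8
print(1000 * 1000, "vs", 1000 * r + r + r * 1000)  # storage cost of each representation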

More details for the mathematically inclined:

For example, suppose we have an LLM with 7B parameters represented in a weight matrix W. (In reality, the model parameters are, of course, distributed across different matrices in many layers, but for simplicity, we refer to a single weight matrix here). During backpropagation, we learn a ΔW matrix, which contains information on how much we want to update the original weights to minimize the loss function during training.

The weight update is then as follows:

W_updated = W + ΔW

If the weight matrix W contains 7B parameters, then the weight update matrix ΔW also contains 7B parameters, and computing the matrix ΔW can be very compute and memory intensive.

The LoRA method proposed by Hu et al. decomposes the weight changes, ΔW, into a lower-rank representation. To be precise, it does not require computing ΔW explicitly. Instead, LoRA learns the decomposed representation of ΔW directly during training, which is where the savings come from, as shown in the figure below.

The rank of a matrix is the linear space spanned by its rows or columns; to be precise, it is the maximal number of linearly independent columns (or rows) in a matrix. Low-rank decomposition approximates a given matrix A with lower-rank matrices by a product of two matrices U and V, which can then be further decomposed into matrices of lower dimensions.

Figure: the weight update with LoRA

As illustrated above, the decomposition of ΔW means that we represent the large matrix ΔW with two smaller LoRA matrices, A and B. If A has the same number of rows as ΔW and B has the same number of columns as ΔW, we can write the decomposition as ΔW = AB. (AB is the matrix multiplication result between matrices A and B.)

How much memory does this save? It depends on the rank r, which is a hyperparameter. For example, if ΔW has 10,000 rows and 20,000 columns, it stores 200,000,000 parameters. If we choose A and B with r=8, then A has 10,000 rows and 8 columns, and B has 8 rows and 20,000 columns, that’s 10,000×8 + 8×20,000 = 240,000 parameters, which is about 830× less than 200,000,000.
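You can sanity-check this arithmetic with a few lines of Python:

rows, cols, r = 10_000, 20_000, 8

full_params = rows * cols               # parameters in ΔW
lora_params = rows * r + r * cols       # parameters in A and B

print(full_params)                      # 200,000,000
print(lora_params)                      # 240,000
print(round(full_params / lora_params)) # ~833x fewer parameters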

Of course, A and B can’t capture all the information that ΔW could capture, but this is by design. When using LoRA, we hypothesize that the model requires W to be a large matrix with full rank to capture all the knowledge in the pretraining dataset. However, when we fine-tune an LLM, we don’t need to update all the weights; we can capture the core information for the adaptation in a smaller number of weights than ΔW would have. Hence, we have the low-rank updates via AB.

QLoRA

In QLoRA, the original model’s parameters are first quantized to lower-bit values based on a user-defined quantization configuration. This makes the model more compact. Subsequently, LoRA is applied to the model’s layers to further optimize for the specific task. This combination in QLoRA allows for fine-tuning on significantly less computational power, which essentially democratizes the ability to fine-tune models.

Because this model has the same number of parameters as the original, there is little to no impact on inference latency. Researchers have found that applying LoRA to just the self-attention layers of the model is often enough to fine-tune for a task and achieve performance gains. In principle, you can also use LoRA on other components like the feed-forward layers. However, since most of the parameters of LLMs are in the attention layers, you get the biggest savings in trainable parameters by applying LoRA to these weight matrices.

🌟Let’s Start Execution!🌟

In this notebook and tutorial, we will fine-tune the Mistral 7B model — which outperforms Llama 2 13B on all tested benchmarks.

We will load the large model in 4-bit using bitsandbytes and train it with LoRA using the PEFT library from Hugging Face 🤗.

If you get an error like this: OutOfMemoryError: CUDA out of memory, tweak your parameters to make the model less computationally intensive.

1. Model

Let’s now load Mistral — mistralai/Mistral-7B-v0.1 — using 4-bit quantization and set up the tokenizer. Add padding on the left as it makes training use less memory.

For model_max_length, it's helpful to get a distribution of your data lengths.
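For instance, you could tokenize your prompts and inspect their length distribution (a small sketch that assumes the tokenizer configured below and the dataset with its "prompt" column prepared in the Dataset section):

import numpy as np

# Tokenize each prompt and look at the distribution of token counts
# before committing to a value for model_max_length.
lengths = [len(tokenizer(p)["input_ids"]) for p in data["prompt"]]
print(f"mean={np.mean(lengths):.0f}, p95={np.percentile(lengths, 95):.0f}, max={np.max(lengths)}")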

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Set up the tokenizer. Add padding on the left as it makes training use less memory.
# https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    model_max_length=512,
    padding_side="left",
    add_eos_token=True,
)

tokenizer.pad_token = tokenizer.eos_token

This is what the model looks like. Notice the self_attn layers, which we will target with LoRA:

Figure: self_attn layers before LoRA

You can check the percentage of trainable model parameters using the helper function below:

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(model))

2. Dataset

We will leverage the Hugging Face datasets library to preprocess financial data for training a language model. The load_dataset function fetches the financial dataset, and a custom function generate_prompt is defined to create prompt texts based on task instructions and corresponding responses in the dataset. These prompts are then added as a new column called "prompt" to the dataset.

The dataset is shuffled using a specified seed to randomize the order of examples. Subsequently, the tokenizer function processes the "prompt" column's text data in batches, tokenizing it for model training.

Lastly, the dataset is split into training and testing subsets, with 90% of the data allocated for training and 10% for testing, facilitating supervised learning on the language model. The resulting training and testing datasets (train_data and test_data) are prepared for subsequent model training and evaluation, respectively.

from datasets import load_dataset

data = load_dataset("gbharti/finance-alpaca", split='train')

# Define a function to generate a prompt text based on a data point
def generate_prompt(data_point):
    """
    Generate input text based on a prompt, task instruction, (context info.), and answer

    :param data_point: dict: Data point
    :return: str: prompt text
    """
    # Create a text with just instruction and response
    text = 'Below is an instruction that describes a task. Write a response that ' \
           'appropriately completes the request.\n\n'
    text += f'### Instruction:\n{data_point["instruction"]}\n\n'
    text += f'### Response:\n{data_point["output"]}'
    return text

# Add the "prompt" column in the dataset by applying the generate_prompt function to each data point
text_column = [generate_prompt(data_point) for data_point in data]
data = data.add_column("prompt", text_column)

# Shuffle the dataset with a specified seed
data = data.shuffle(seed=1234)

# Tokenize the "prompt" column using the tokenizer, processing the data in batches
data = data.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

# Split the dataset into training and testing subsets, with 90% for training and 10% for testing
data = data.train_test_split(test_size=0.1)
train_data = data["train"]
test_data = data["test"]

3. Setup the PEFT/LoRA model for Fine-Tuning

Now we need to set up the LoRA model for fine-tuning with a new layer/parameter adapter. Using PEFT/LoRA, we will be freezing the underlying LLM and only training the adapter. Have a look at the LoRA configuration below. Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained.

As we saw in the model layers earlier, we will apply QLoRA to the self_attn layers of the model. Those layers are q_proj, k_proj, v_proj, o_proj.

Now, we define the LoRA config.

  • r is the rank of the low-rank matrix used in the adapters, which thus controls the number of parameters trained. A higher rank will allow for more expressivity, but there is a compute tradeoff.
  • alpha is the scaling factor for the learned weights. The weight matrix is scaled by alpha/r, and thus a higher value for alpha assigns more weight to the LoRA activations.

We will use r=8 and lora_alpha=16 so that we have more emphasis on the new fine-tuned data while also reducing computational complexity.

from peft import LoraConfig, get_peft_model

lora_target_modules = [
    "q_proj",
    "up_proj",
    "o_proj",
    "k_proj",
    # "down_proj",
    # "gate_proj",
    # "v_proj",
]

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=lora_target_modules,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

# Add LoRA adapter layers/parameters to the original LLM to be trained.
model = get_peft_model(model, lora_config)

print(print_number_of_trainable_model_parameters(model))

Now, look at Mistral’s layers after applying LoRA:

Figure: self_attn layers after LoRA

4. Start Training

Define training arguments and create a `Trainer` instance. A note on training:

You can set max_steps to be high initially and examine at what step your model’s performance starts to degrade. That is where you’ll find the sweet spot for how many steps to perform. For example, say you start with 500 steps and find that at around 100 steps the model starts overfitting: the validation loss goes up (bad) while the training loss goes down significantly, meaning the model is learning the training set really well but is unable to generalize to new data points. Therefore, 100 steps would be your sweet spot, so you would use the checkpoint-100 model in your output dir as your final model (see the sketch below for one way to load it).
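For example, once training has finished and you have identified that sweet spot, you can attach the adapter weights saved at that step to a fresh copy of the base model (a minimal sketch; "output/checkpoint-100" is a hypothetical path, and bnb_config is the quantization config defined earlier):

from peft import PeftModel

# Reload the 4-bit base model, then attach the adapter weights saved at the
# step where validation loss was lowest.
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
best_model = PeftModel.from_pretrained(base_model, "output/checkpoint-100")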

import transformers

OUTPUT_DIR = "output"

peft_training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,  # Want a small lr for fine-tuning
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir=OUTPUT_DIR,
    max_steps=5,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

Save model

After your model is finished training, you can save it to a directory and push it to the Hugging Face Hub.

model.save_pretrained("output_dir")

from huggingface_hub import notebook_login
notebook_login()

model.push_to_hub("akash2212/mistral_7b_finetuned", use_auth_token=True)
tokenizer.push_to_hub("akash2212/mistral_7b_finetuned", use_auth_token=True)

5. Inference Time!

Let’s load the fine-tuned adapter weights and configuration for inference.

from peft import PeftConfig, PeftModel

PEFT_MODEL = "akash2212/mistral_7b_finetuned"

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Load the LoRA model
model = PeftModel.from_pretrained(model, PEFT_MODEL)

The function below generates a completion for a given query by preparing a prompt, tokenizing it, feeding it into the model, and returning the generated text.

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # DEVICE was not defined earlier; set it here

def get_completion(query: str, model, tokenizer) -> str:
    prompt_template = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Question:
{query}

### Answer:
"""
    prompt = prompt_template.format(query=query)

    encodeds = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    model_inputs = encodeds.to(DEVICE)

    generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
    decoded = tokenizer.batch_decode(generated_ids)
    return decoded[0]


result = get_completion(query="Pay off car loan entirely or leave $1 until the end of the loan period?", model=model, tokenizer=tokenizer)
print(result)

🌟Additional Notes🌟

  1. Balancing LoRA Hyperparameters: R and Alpha

As the original LoRA paper outlines, LoRA introduces an additional scaling coefficient for applying the LoRA weights to the pretrained weights during the forward pass. The scaling involves the rank parameter r, which we discussed earlier, as well as another hyperparameter α (alpha) that is applied as follows:

scaling = alpha / r
weight += (lora_B @ lora_A) * scaling

As we can see in the code formula above, the larger alpha is, the larger the influence of the LoRA weights.

In this experiment, I used r=8 and alpha=16, which resulted in a 2-fold scaling. Choosing alpha as two times r is a common rule of thumb when using LoRA for LLMs, and I was curious whether this still holds for larger r values; in practice, “alpha = 2×rank” really does seem to be a sweet spot.

Choosing alpha twice as large as r may often result in the best outcomes, but it also doesn’t hurt to experiment with different ratios.

2. Enable LoRA for More Layers

The configuration above enabled LoRA only for select weight matrices, i.e., some of the linear projections in each transformer layer. In addition, we can also enable LoRA for the remaining attention projections, the other linear layers in the feed-forward blocks, and the linear output layer.

If we enable LoRA for all these additional layers, we increase the number of trainable parameters, which comes with a larger memory requirement but can increase the modeling performance noticeably. It might be worthwhile exploring these combinations in future experiments; one possible configuration is sketched below.
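For instance, extending this article’s configuration to every linear projection in Mistral’s decoder blocks could look like this (a sketch; whether the extra trainable parameters pay off is task-dependent):

lora_config_all_linear = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # self-attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward (MLP) projections
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)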

To conclude, here’s a simplified explanation of how LoRA works:

  1. Initial Model: Start with a large pre-trained model (e.g., Llama 2, Mistral, etc.).
  2. Low-Rank Matrix: Introduce low-rank approximations for the matrices that will be used to adapt the model for the specific task at hand. Low-rank matrices (adapters) are typically formed for all linear layers of the model, but this can vary based on the model architecture and the task.
  3. Transform Layers: Instead of directly modifying the original weights of the model, LoRA applies a transformation using the low-rank matrices to the outputs of affected layers.
  4. Fine-tuning: During the fine-tuning process, only the parameters in the low-rank matrices are updated. The rest of the model’s parameters are kept fixed. Again, updating only low-rank matrices allows for fine-tuning on smaller, cheaper GPUs.
  5. Prediction: For making predictions, the adapted layers are used in conjunction with the original pre-trained model. The low-rank adapted layers act as a kind of “add-on” to the existing architecture, adjusting its behavior for the specific task.
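To make steps 2–4 concrete, here is a minimal, self-contained sketch of a LoRA-adapted linear layer (illustrative only; this is not the PEFT library’s actual implementation):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear                       # frozen pretrained weight W
        self.linear.weight.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(linear.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # Original output plus the low-rank update: W x + (alpha/r) * B A x
        return self.linear(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 4096))
print(out.shape)  # torch.Size([2, 4096])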

🌟Some benefits of LoRA🌟

  1. Efficiency: Because it only updates a small subset of parameters, fine-tuning is faster and requires less computational power. According to the LoRA paper, it can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.
  2. Specialization: The low-rank adaptation allows the model to specialize in a particular task without a complete overhaul of the original weights.
  3. Preservation: By keeping the bulk of the model fixed, LoRA helps preserve the generalization capabilities of the original pre-trained model while still enabling task-specific adaptations.

Refer to the notebook code on GitHub:

Thank you for reading this article, I hope it added some pieces to your knowledge stack! Before you go, if you enjoyed reading this article:

  • Be sure to clap and follow me, and let me know if you have any feedback.
  • I have built versatile applications using Large Language Models (LLMs) and serverless AWS architectures for Big Data processing. You’re welcome to take a look at the repo and star ⭐ it.
  • Follow me: LinkedIn | GitHub | Medium
