LoRA and QLoRA — Efficient Fine-tuning

Fine-tuning large language models has long been the domain of those with massive computational resources. LoRA and QLoRA change that: they make it practical to adapt even very large, recent models on modest hardware with far less memory.

What is LoRA and Why We Need It

Low-Rank Adaptation (LoRA) is a parameter-efficient approach to fine-tuning large language models. Traditional fine-tuning updates every model parameter; LoRA freezes the pretrained weights and trains only a small set of added low-rank matrices. This sharply reduces the memory and compute needed for training.

The principle is to express the change to a weight matrix as the product of two smaller matrices, B and A. For an original matrix W of size d × k, B has size d × r and A has size r × k, and the adapted weight is W' = W + BA. The rank r is typically between 1 and 64, orders of magnitude smaller than d and k.
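
To make the decomposition concrete, here is a minimal NumPy sketch of the update W' = W + BA. The sizes (d_out=64, d_in=48, r=4) and variable names are illustrative assumptions, not values from any particular model. B starts at zero, as is common in LoRA implementations, so the adapted layer initially behaves exactly like the base layer.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 48, 4          # illustrative sizes (assumed)

W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(size=(r, d_in))      # trainable, r x d_in
B = np.zeros((d_out, r))            # trainable, d_out x r, zero-initialised

x = rng.normal(size=(d_in,))

# Apply the update without ever forming the full d_out x d_in matrix BA.
y_base = W @ x
y_lora = W @ x + B @ (A @ x)

# Parameter counts: full update vs. low-rank update.
full_params = W.size                # d_out * d_in
lora_params = A.size + B.size       # r * (d_in + d_out)

print(full_params, lora_params)     # 3072 448
```

Even at these toy sizes the low-rank pair holds about a seventh of the parameters of the full matrix; the gap widens quickly as the layer dimensions grow.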

Practical Advantages of LoRA

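Memory efficiency: Typically only 0.1-1% of the parameters are trained, so gradients and optimizer state shrink accordingly
Faster training: Fewer trainable parameters mean less work per optimizer step and much smaller checkpoints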
Modularity: Adapters can be easily swapped for different tasks
Preservation of original model: Base model remains unchanged
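
As a rough check on the parameter fraction quoted above, the following sketch counts LoRA parameters for a hypothetical decoder with 32 layers and hidden size 4096, adapting two square projection matrices per layer. The model shape, the 7-billion base parameter count, and the helper name lora_param_count are assumptions for illustration only.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


# Hypothetical model shape (assumed, not taken from a specific checkpoint).
hidden_size = 4096
num_layers = 32
adapted_per_layer = 2        # e.g. the query and value projections
base_params = 7_000_000_000  # rough total parameter count
rank = 8

trainable = (
    lora_param_count(hidden_size, hidden_size, rank)
    * adapted_per_layer
    * num_layers
)
fraction = trainable / base_params

print(f"{trainable:,} trainable parameters ({fraction:.2%} of the base model)")
# 4,194,304 trainable parameters (0.06% of the base model)
```

Doubling the rank to 16 doubles the count to about 0.12%, and adapting more modules per layer raises it further.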

Implementing LoRA with PEFT

The Hugging Face PEFT (Parameter-Efficient Fine-Tuning) library provides a simple LoRA implementation. Here's an example of preparing a model for text classification:

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

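# Load the tokenizer and base model
model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 style tokenizers define no pad token
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    # Keep the default float32 weights: fp16=True in the Trainer applies mixed precision
)
model.config.pad_token_id = tokenizer.pad_token_id  # needed for batched classification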

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,  # rank
    lora_alpha=32,  # scaling parameter
    lora_dropout=0.1,
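    target_modules=["c_attn"],  # DialoGPT is GPT-2 based: attention uses a fused c_attn layer
    fan_in_fan_out=True,  # GPT-2 stores these weights as Conv1D (transposed)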
)

# Apply LoRA to model
peft_model = get_peft_model(model, lora_config)

# Display number of trainable parameters
peft_model.print_trainable_parameters()
# Prints the trainable parameter count, the total parameter count, and the trainable percentage

Key LoRA Parameters

Rank (r): Determines the inner dimension of the adaptation matrices. Lower values (4-16) are more memory efficient; higher values (64-128) provide greater expressivity.
Alpha (lora_alpha): Scaling factor. The update is scaled by lora_alpha / r, and a common starting point is an alpha 2-4× the rank.
Target modules: Specifies which layers are adapted, usually the attention projections.
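
In the standard LoRA formulation the low-rank update is multiplied by lora_alpha / r before it is added to the frozen weight, so rank and alpha interact. The helper below is a hypothetical illustration (not a PEFT function) that computes this factor for the two configurations used in this article.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r


# Values taken from the two configurations in this article.
classification_scale = lora_scaling(lora_alpha=32, r=8)   # 4.0
qlora_scale = lora_scaling(lora_alpha=16, r=64)           # 0.25

print(classification_scale, qlora_scale)
```

Raising the rank while keeping alpha fixed therefore shrinks the factor, which is worth keeping in mind when comparing runs with different ranks.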

QLoRA: Quantization + LoRA

QLoRA (Quantized LoRA) combines LoRA with 4-bit quantization of the frozen base model; its authors report fine-tuning a 65B-parameter model on a single 48 GB GPU. It uses 4-bit NormalFloat (NF4), a quantization data type designed for normally distributed neural network weights.
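
The idea behind 4-bit block-wise quantization can be sketched in plain Python. This is a simplified illustration under stated assumptions: it uses a uniform 16-level codebook rather than the actual NF4 levels (which are derived from quantiles of a normal distribution), and the function names and the block size of 64 are chosen for the example.

```python
import random

# 16 evenly spaced levels in [-1, 1]; a stand-in for the real NF4 codebook.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_block(block):
    """Return (absmax scale, list of 4-bit codes) for one block of floats."""
    scale = max(abs(v) for v in block) or 1.0
    codes = [
        min(range(16), key=lambda i: abs(CODEBOOK[i] - v / scale))
        for v in block
    ]
    return scale, codes


def dequantize_block(scale, codes):
    """Map 4-bit codes back to approximate float values."""
    return [CODEBOOK[c] * scale for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
block_size = 64

restored = []
for start in range(0, len(weights), block_size):
    scale, codes = quantize_block(weights[start:start + block_size])
    restored.extend(dequantize_block(scale, codes))

max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max absolute error: {max_error:.5f}")
```

In practice the bitsandbytes integration performs the quantization when the model is loaded, configured as follows: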

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# QLoRA configuration
qlora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, qlora_config)

QLoRA Advantages

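Storing the base weights in 4 bits instead of 16 cuts weight memory by roughly 75%, and the QLoRA authors report that this matches the quality of 16-bit fine-tuning on their benchmarks. As a result, Llama-2 7B can be fine-tuned on a GPU with about 12 GB of memory, depending on sequence length and batch size, whereas its weights alone occupy about 28 GB in 32-bit precision.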
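
The weight-memory part of these savings is simple arithmetic. The sketch below assumes 7 billion parameters and counts only the stored weights; it ignores quantization constants, activations, gradients, optimizer state, and the LoRA adapter itself, so real usage is higher.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7_000_000_000  # assumed size of a "7B" model

for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(num_params, bits):.1f} GB")
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  4-bit: 3.5 GB

saving_vs_16bit = 1 - weight_memory_gb(num_params, 4) / weight_memory_gb(num_params, 16)
print(f"saving vs 16-bit: {saving_vs_16bit:.0%}")  # 75%
```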

Training and Optimization

For efficient training with LoRA/QLoRA, we recommend specific optimizer and scheduler settings:

from transformers import Trainer, TrainingArguments

# Optimized training arguments for LoRA
training_args = TrainingArguments(
    output_dir="./lora-model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    max_steps=1000,  # overrides num_train_epochs, so the epoch count is omitted
    learning_rate=2e-4,  # higher LR for LoRA
    fp16=True,  # use bf16=True instead for the QLoRA setup above
    logging_steps=50,
    save_strategy="steps",
    save_steps=500,
    remove_unused_columns=False,
    gradient_checkpointing=True,  # memory saving
)

# Create trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Start training
trainer.train()

# Save only LoRA adapter (few MB)
peft_model.save_pretrained("./my-lora-adapter")
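
The batch settings above combine into an effective batch size. The following sketch works through that arithmetic; the variable names mirror the TrainingArguments fields, and num_gpus=1 is an assumption.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1          # assumption: single-GPU training
max_steps = 1000

# Each optimizer step consumes this many examples.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Total examples processed if training runs for max_steps optimizer steps.
examples_seen = effective_batch_size * max_steps

print(effective_batch_size, examples_seen)  # 16 16000
```

Gradient accumulation keeps per-step memory at the level of a batch of 4 while the optimizer sees gradients averaged over 16 examples.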

Loading and Inference

The advantage of LoRA is the ability to quickly switch between different adapters for different tasks:

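import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model the adapter was trained on.
# generate() needs a causal-LM adapter (task_type=CAUSAL_LM), not the
# sequence-classification adapter from the earlier example.
base_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name)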

# Load LoRA adapter
lora_model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Inference
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    outputs = lora_model.generate(**inputs, max_length=50)

# Switch to another adapter
lora_model.load_adapter("./another-task-adapter", adapter_name="task2")
lora_model.set_adapter("task2")

Merging Adapter with Base Model

For production deployment, you can merge the LoRA adapter directly into the original model:

# Merge and save complete model
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
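
Conceptually, merging folds the scaled low-rank update into the base weight so that inference needs a single matrix multiply. The NumPy sketch below checks that equivalence on random matrices; the shapes, the scaling value, and the variable names are illustrative assumptions and do not reproduce PEFT's internal implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

d_out, d_in, r = 32, 24, 4
scaling = 32 / 8                      # lora_alpha / r, as in the first example

W = rng.normal(size=(d_out, d_in))    # frozen base weight
A = rng.normal(size=(r, d_in))        # trained LoRA matrices (random stand-ins)
B = rng.normal(size=(d_out, r))
x = rng.normal(size=(5, d_in))        # batch of 5 inputs

# Unmerged: base path plus a separate low-rank path.
y_adapter = x @ W.T + scaling * (x @ A.T) @ B.T

# Merged: fold the update into one weight matrix, then drop A and B.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))  # True
```

The merged model has the same shape and inference cost as the original, at the price of no longer being able to swap adapters on a shared base.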

Best Practices and Tips

Rank selection: Start with r=8-16 for most tasks. Use r=32-64 for complex adaptations.
Target modules: For transformer models, start with the attention projections (named q_proj and v_proj in Llama-style models; names vary by architecture). For specific tasks, experiment with the feed-forward layers as well.
Learning rate: LoRA typically needs a higher learning rate (1e-4 to 2e-4) than full fine-tuning.
Batch size: Smaller batch sizes often work well, which is sometimes attributed to LoRA's regularizing effect.

Summary

LoRA and QLoRA substantially lower the cost of fine-tuning large models. LoRA enables efficient adaptation with minimal resource requirements, while QLoRA extends this to much larger models through quantization. Together they make advanced fine-tuning accessible to smaller teams with limited resources.

CORE SYSTEMS team
