Pretrained LLMs have general knowledge, but to make them experts on your specific data or task (e.g., classifying your company's support tickets), you need to adapt them. Let's compare the traditional method with modern, more efficient approaches.
1. Full Fine-Tuning
The Idea: Take a general-purpose pretrained model (like BERT or GPT) and continue the training process on your own smaller, labeled dataset. This updates all of the model's original weights to specialize it for your task.
The Process:
- Load a pretrained model.
- Add a new "head" layer on top that is specific to your task (e.g., a classification layer).
- Train the entire model on your dataset with a low learning rate. The model's weights, already good at understanding language, are slightly adjusted to master your specific problem.
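The three steps above can be sketched in a few lines of PyTorch. This is a toy illustration, not a real recipe: the tiny `encoder` here is a hypothetical stand-in for a pretrained body like BERT, and the point is only that one optimizer covers *every* parameter, old and new, at a low learning rate.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained encoder (in practice: BERT, GPT, etc.)
encoder = nn.Sequential(
    nn.Embedding(1000, 64),   # token embeddings
    nn.Flatten(1),            # flatten sequence of length 8
    nn.Linear(64 * 8, 64),
    nn.ReLU(),
)

# Step 2: add a new task-specific head (here: 2-class classification)
head = nn.Linear(64, 2)
model = nn.Sequential(encoder, head)

# Step 3: train *all* parameters with a low learning rate
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

x = torch.randint(0, 1000, (4, 8))   # a batch of 4 token sequences
labels = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(model(x), labels)
loss.backward()
optimizer.step()                     # every weight in the model moves
```

Because the optimizer was built from `model.parameters()`, both the pretrained body and the fresh head receive gradient updates, which is exactly what makes full fine-tuning expensive.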
Pros:
- High Performance: Often achieves the highest task accuracy of the adaptation methods discussed here, since every weight can specialize.
Cons:
- Expensive: Requires significant computational resources (powerful GPUs) and time.
- Catastrophic Forgetting: The model might "forget" some of its general knowledge.
- Storage Inefficient: For each new task, you must save a complete, multi-gigabyte copy of the entire fine-tuned model.
2. Parameter-Efficient Fine-Tuning (PEFT)
The drawbacks of full fine-tuning have led to the rise of PEFT methods. The core idea is to freeze the vast majority of the pretrained LLM's weights and only train a very small number of new parameters.
Popular PEFT Techniques:
- Prompt Tuning: This method doesn't modify the model's weights at all. Instead, it learns a small "soft prompt" — a short sequence of trainable vectors prepended to the input embeddings. It's like learning the perfect instruction to give the frozen model to get it to perform your task.
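The mechanics of a soft prompt can be shown in a few lines of PyTorch. This is a minimal sketch under toy assumptions (a plain embedding table standing in for the frozen model's input embeddings; the sizes are arbitrary): the only trainable tensor is the prompt itself.

```python
import torch
import torch.nn as nn

d_model, prompt_len = 64, 10

# Frozen embedding table standing in for the frozen LLM's input embeddings
embed = nn.Embedding(1000, d_model)
embed.weight.requires_grad_(False)

# The *only* trainable parameters: a small "soft prompt"
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

token_ids = torch.randint(0, 1000, (4, 16))           # batch of 4 sequences
token_embeds = embed(token_ids)                       # (4, 16, 64)

# Prepend the learned prompt to every sequence in the batch
prompt = soft_prompt.unsqueeze(0).expand(4, -1, -1)   # (4, 10, 64)
inputs = torch.cat([prompt, token_embeds], dim=1)     # (4, 26, 64)
```

The concatenated `inputs` would then flow through the frozen model as usual; gradients only reach `soft_prompt`.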
- LoRA (Low-Rank Adaptation): This is one of the most popular and effective PEFT methods. Instead of updating the huge weight matrices of the Transformer, LoRA learns two small, "low-rank" matrices whose product represents the change to the original weights. At inference time, this low-rank update is added to (or merged into) the original frozen weights, adding little or no extra latency. You only need to train and store these tiny adapter matrices, which can be just a few megabytes in size.
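The core LoRA computation can be written out directly. The sketch below (toy dimensions, plain PyTorch, not the `peft` library) shows the two conventions that make LoRA practical: the "up" matrix `B` is zero-initialized, so the adapter is a no-op before training, and the update can later be merged into the frozen weight.

```python
import torch

d, r, alpha = 64, 8, 16
W = torch.randn(d, d)            # frozen pretrained weight matrix
A = torch.randn(r, d) * 0.01     # LoRA "down" projection (d -> r)
B = torch.zeros(d, r)            # LoRA "up" projection (r -> d), zero-initialized

def lora_forward(x):
    # Original path plus the low-rank update, scaled by alpha / r
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = torch.randn(4, d)
# Because B starts at zero, the adapter changes nothing before training:
assert torch.allclose(lora_forward(x), x @ W.T)

# After training, the update can be folded into W for zero-overhead inference:
W_merged = W + (alpha / r) * (B @ A)
```

Only `A` and `B` (2 * r * d values) would be trained and stored, versus d * d values for the full matrix.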
Pros of PEFT (like LoRA):
- Computationally Cheap: Drastically reduces the memory and compute required for training.
- Storage Efficient: You only store a tiny adapter file (megabytes) for each task, not the full model (gigabytes). One base model can be used for many tasks just by swapping adapters.
- Portable and Modular: Easy to share, deploy, and manage different task-specific adapters.
- Avoids Catastrophic Forgetting: The original model's knowledge is preserved.
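The storage claim is easy to verify with back-of-envelope arithmetic. Assuming rank r=16 applied to the query and value matrices in all 12 layers of bert-base (hidden size 768) — the same setup as the example below in this section:

```python
hidden, r, layers, matrices = 768, 16, 12, 2   # bert-base; LoRA on query + value

# Each adapted matrix gets two low-rank factors: A (r x hidden) and B (hidden x r)
lora_params = layers * matrices * 2 * r * hidden   # trainable LoRA parameters

# Stored in fp32 (4 bytes each), the adapter is a few megabytes,
# versus roughly 440 MB for a full copy of bert-base.
adapter_mb = lora_params * 4 / 1e6
print(lora_params, f"{adapter_mb:.1f} MB")
```

Roughly 590 thousand adapter parameters, or about 2.4 MB per task, against ~110 million parameters for the base model.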
Practical Example with Hugging Face peft
Let's see how easy it is to set up LoRA using the Hugging Face peft library.
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

# 1. Load the base pretrained model
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Define the LoRA configuration
# Apply LoRA to the query and value projection matrices
# in the self-attention layers of the BERT model.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification task
    r=16,                        # rank of the update matrices
    lora_alpha=32,               # LoRA scaling factor
    target_modules=["query", "value"],
    lora_dropout=0.1,
)

# 3. Wrap the base model with the PEFT model
peft_model = get_peft_model(model, lora_config)

# Print the number of trainable parameters
peft_model.print_trainable_parameters()
# Output will be something like:
# trainable params: 669,890 || all params: 109,482,242 || trainable%: 0.6118

# Now you can train `peft_model` just like any other Hugging Face model.
# Only the ~0.6% of trainable parameters (the LoRA adapters, plus the new
# classification head) are updated; the frozen base model weights are not.
```