
Four Methods for Fine-Tuning Large Models

Author: Adrian | September 26, 2025

Overview

Fine-tuning adjusts the parameters of a large language model (LLM) to adapt it to a specific task by training on a task-related dataset. The amount of fine-tuning required depends on the complexity of the task and the size of the dataset. Fine-tuning is a common technique in deep learning for improving the performance of pretrained models, and it applies to many pretrained models, not just ChatGPT-style LLMs.

What is PEFT

PEFT (Parameter-Efficient Fine-Tuning) is an open-source library from Hugging Face for parameter-efficient fine-tuning of large models. It implements several fine-tuning methods, including the four covered below, that can approach the results of full-parameter fine-tuning while updating only a small number of parameters, making fine-tuning feasible when GPU resources are limited.
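
The workflow is the same for all of the methods described below: load a base model, wrap it with a method-specific configuration, and train only the injected parameters. A minimal sketch of that pattern (the model name and the choice of prompt tuning here are placeholders, not recommendations from this article):

from transformers import AutoModelForSequenceClassification
from peft import PromptTuningConfig, get_peft_model

# Wrap a frozen base model with a PEFT method; only the injected parameters are trained.
base_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
peft_config = PromptTuningConfig(task_type="SEQ_CLS", num_virtual_tokens=10)
model = get_peft_model(base_model, peft_config)
model.print_trainable_parameters()  # typically well under 1% of all weights are trainable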

Fine-Tuning Approaches

Fine-tuning can be categorized as full fine-tuning or repurposing:

  • Full Fine-tuning: Update all parameters of the pretrained model. All layers and weights are optimized to fit the target task. This approach suits tasks that differ substantially from the pretraining objectives or require high adaptability. Full fine-tuning typically demands significant compute and time but can yield superior performance.
  • Repurposing (Partial Fine-tuning): Update only the top layers or a few layers while keeping lower-layer parameters fixed. This preserves the pretrained model's general knowledge and adapts higher-level representations to the target task. Repurposing is useful when the target task is similar to the pretraining domain or when the task dataset is small. It requires less compute and time than full fine-tuning but may underperform in some cases.

Common fine-tuning strategies include:

  • Fine-tune all layers: involve every layer of the pretrained model in training.
  • Fine-tune top layers: update only the upper layers of the model.
  • Freeze bottom layers: keep lower layers fixed and fine-tune only the top layers (see the code sketch after this list).
  • Layer-wise fine-tuning: progressively unfreeze and fine-tune layers from bottom to top.
  • Transfer learning: transfer pretrained knowledge to a new task, typically by fine-tuning top layers or freezing bottom layers.
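
As a concrete illustration of the "fine-tune top layers / freeze bottom layers" strategies above, the following sketch freezes a BERT classifier and then unfreezes only its top two encoder layers and the classification head. The model name, number of unfrozen layers, and learning rate are illustrative assumptions:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze every parameter first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze only the top two encoder layers and the classification head.
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

# Optimize only the parameters that remain trainable.
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-5)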

Classic Fine-Tuning

Classic fine-tuning continues training a pretrained model with a small amount of task-specific data. The pretrained weights are updated to better fit the task. The amount of fine-tuning required depends on the similarity between the pretraining corpus and the task-specific data: higher similarity often requires less fine-tuning, while lower similarity requires more.
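
In practice, classic full fine-tuning is just an ordinary training loop over the task-specific data with all pretrained weights unfrozen. A minimal PyTorch sketch (the model name and the toy batch are placeholders, not from the original text):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # every weight is updated

# One illustrative gradient step on a tiny task-specific batch.
batch = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])
model.train()
loss = model(**batch, labels=labels).loss  # pretrained weights continue training on task data
loss.backward()
optimizer.step()
optimizer.zero_grad()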

Prompt Tuning (P-tuning)

Prompt Tuning was introduced in the 2021 paper "The Power of Scale for Parameter-Efficient Prompt Tuning." It is a simple parameter-efficient approach: the pretrained model's weights are kept frozen, and only a small set of prompt embedding parameters prepended to the input is trained, achieving low-cost fine-tuning of large models.

Classic prompt tuning does not update any core model parameters. Instead, it learns input prompts that steer the frozen pretrained model toward the desired outputs. In the P-tuning implementation, a prompt encoder such as a BiLSTM+MLP encodes a set of pseudo-prompt tokens into continuous prompt embeddings, which are concatenated with the input embeddings. LSTM-based reparameterization was used to accelerate training, and a few natural-language anchor tokens (e.g., "Britain") were introduced to further improve performance. Only the prompt encoder and the prompt embeddings it produces are optimized.
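
A minimal sketch of such a prompt encoder (the dimensions and the two-layer BiLSTM are illustrative assumptions; the actual P-tuning implementation differs in details):

import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Encodes trainable pseudo-prompt tokens into continuous prompt embeddings."""
    def __init__(self, num_prompt_tokens: int, hidden_size: int):
        super().__init__()
        self.embedding = nn.Embedding(num_prompt_tokens, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self) -> torch.Tensor:
        ids = torch.arange(self.embedding.num_embeddings, device=self.embedding.weight.device)
        hidden, _ = self.lstm(self.embedding(ids).unsqueeze(0))  # (1, num_prompts, hidden)
        return self.mlp(hidden)

# The resulting embeddings are concatenated with the frozen model's input embeddings;
# only this encoder (and the prompt embeddings it produces) is optimized.
prompt_embeddings = PromptEncoder(num_prompt_tokens=10, hidden_size=768)()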

P-tuning v1 has two notable limitations: it does not generalize well across tasks and it is sensitive to model scale. For some complex natural language understanding tasks, performance is poor unless the pretrained model is sufficiently large. The original paper reported:

  • Prompt length effect: At sufficient model scale, a prompt length of 1 can be reasonable; length 20 achieves strong performance.
  • Prompt initialization: Random uniform initialization performs worse than other methods, but differences diminish at large model scales.
  • Pretraining strategy: Language model adaptation helps, but at large model scales differences are reduced.
  • Number of fine-tuning steps: Smaller models benefit from more steps; at very large scales, zero-shot performance can already be strong, and at around 10B parameters results match full-parameter fine-tuning.

Code example:

from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, get_peft_model

# Prompt tuning: the base model stays frozen; only the virtual-token embeddings are trained.
peft_config = PromptTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=10)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, return_dict=True)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

Prefix Tuning

Prefix Tuning was proposed in the 2021 paper "Prefix-Tuning: Optimizing Continuous Prompts for Generation." Instead of updating all model parameters, this method constructs a sequence of task-specific virtual tokens as a prefix before the input tokens and trains only the prefix parameters while keeping the rest of the Transformer fixed.

Compared with full fine-tuning, prefix tuning optimizes only a small, learnable, continuous task-specific vector (the prefix) rather than the entire model. The idea is related to prompt design, but manually designed prompts are discrete, explicit, and not trainable, whereas the prefix is an implicit sequence of learnable continuous vectors. Searching for optimal discrete prompts is difficult, and discrete prompts can be suboptimal for the model's continuous representations.

Code example:

from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model

# Prefix tuning: 20 trainable virtual prefix tokens are prepended at every layer;
# the rest of the Transformer stays frozen.
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, return_dict=True)
model = get_peft_model(model, peft_config)

P-tuning v2

P-tuning v2 builds on P-tuning and prefix-tuning by introducing deep prompt encoding and multi-task learning. Experiments show that by tuning only about 0.1% of parameters, models from 330M to 10B parameters can match the performance of full fine-tuning across various tasks.

The goal of P-tuning v2 is to make prompt tuning comparable to fine-tuning across model scales and tasks. Earlier prompt tuning methods were limited by model scale and task type: they worked well only when the pretrained model was sufficiently large, and they performed poorly on sequence tagging tasks. P-tuning v2 addresses these limitations by inserting continuous prompts at each layer and applying a learned prompt at the sequence front. Key improvements include:

  • Removal of the reparameterization trick used to accelerate training.
  • Use of multi-task learning: pretrain prompts on multi-task datasets for better initialization before adapting to downstream tasks.
  • Avoidance of verbalizers for vocabulary mapping; instead, use classification heads similar to BERT by applying a randomly initialized classification head at the first token, improving compatibility with sequence labeling tasks.

Key design factors in P-tuning v2:

  • Reparameterization: MLPs used in Prefix Tuning and P-tuning to construct trainable embeddings can produce inconsistent results across tasks and datasets in natural language understanding.
  • Prompt length: Optimal prompt length varies by task; simple classification may work best with length=20, while more complex tasks require longer prompts.
  • Multi-task learning: Optional but useful for better initialization and improved performance.
  • Classification head: Using an LM head to predict tokens is central to some prompt tuning approaches, but it is unnecessary in full data settings and incompatible with sequence tagging. P-tuning v2 uses a classification head at the first token like BERT.

Code example:

from transformers import AutoModelForSequenceClassification
from peft import PrefixTuningConfig, get_peft_model

# P-tuning v2 corresponds to deep prefix tuning with a classification head;
# prefix_projection=False skips the reparameterization MLP, as in the paper.
peft_config = PrefixTuningConfig(task_type="SEQ_CLS", num_virtual_tokens=20, prefix_projection=False)
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, return_dict=True)
model = get_peft_model(model, peft_config)

AdaLoRA

Different weight matrices in a pretrained language model contribute unequally to downstream tasks. AdaLoRA therefore allocates the parameter budget adaptively: it parameterizes the low-rank weight increments in an SVD-like form and adjusts their singular values during training, pruning the increments that matter least. This concentrates updates on the increments that most affect model performance, improving both quality and parameter efficiency during fine-tuning.
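
Concretely, the AdaLoRA paper writes each weight increment in SVD-like form, W + ΔW = W + P Λ Q, where P and Q act as left and right singular vectors (kept approximately orthogonal by a regularizer) and the diagonal Λ holds the singular values; during training, the singular values of less important increments are pruned so the rank budget concentrates on the modules that matter most.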

Code example:

from transformers import AutoModelForSeq2SeqLM
from peft import AdaLoraConfig, get_peft_model

# AdaLoRA on a seq2seq model; target_modules "q"/"v" are T5-style attention
# projections and should match the module names of your base model.
peft_config = AdaLoraConfig(task_type="SEQ_2_SEQ_LM", r=8, lora_alpha=32,
                            target_modules=["q", "v"], lora_dropout=0.01,
                            total_step=1000)  # placeholder: total planned training steps
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path, return_dict=True)
model = get_peft_model(model, peft_config)

Fine-Tuning Categories

Common categories for large-model fine-tuning include (a rough mapping to PEFT config classes follows the list):

  1. Adapter-based methods
  2. Prompt-based methods
  3. Low-rank adaptation
  4. Sparse methods
  5. Other hybrid or specialized approaches
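
A rough, partial mapping from these categories to PEFT configuration classes (illustrative only, not an official taxonomy; adapter-based and sparse methods are omitted because they do not appear in the examples above):

from peft import (AdaLoraConfig, LoraConfig, PrefixTuningConfig,
                  PromptEncoderConfig, PromptTuningConfig)

# Illustrative, partial mapping from the categories above to PEFT config classes.
category_examples = {
    "prompt-based": [PromptTuningConfig, PromptEncoderConfig, PrefixTuningConfig],
    "low-rank adaptation": [LoraConfig, AdaLoraConfig],
}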

Summary of Large Model Fine-Tuning Steps

Although fine-tuning workflows vary by method, most large-model fine-tuning follows these primary steps and preparations (an end-to-end code sketch follows the list):

  • Prepare the dataset.
  • Select a pretrained/base model.
  • Define a fine-tuning strategy.
  • Set hyperparameters.
  • Initialize model parameters or adapters.
  • Perform fine-tuning training.
  • Evaluate and tune the model.
  • Test model performance on held-out data.
  • Deploy and apply the model.
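
The sketch below strings these steps together using prompt tuning through PEFT. The model ("roberta-base"), dataset ("imdb"), and hyperparameters are illustrative assumptions, not choices made in this article:

from datasets import load_dataset
from peft import PromptTuningConfig, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-base"                                      # select a base model
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("imdb")                                   # prepare the dataset
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=256),
    batched=True,
)

peft_config = PromptTuningConfig(task_type="SEQ_CLS",            # define the strategy
                                 num_virtual_tokens=10)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model = get_peft_model(model, peft_config)                       # initialize adapters

args = TrainingArguments(output_dir="out", num_train_epochs=3,   # set hyperparameters
                         per_device_train_batch_size=16, learning_rate=1e-3)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"])
trainer.train()                                                  # fine-tuning training
print(trainer.evaluate())                                        # evaluate on held-out data
model.save_pretrained("out/prompt-adapter")                      # save the adapter for deployment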