Fine-Tuning Language Models

A conceptual guide to adapting LMs to downstream use cases

Language Models
Fine-Tuning
Author

Andreas Stöffelbauer

Published

July 26, 2025

🚧 This article is a work in progress; the sections on fine-tuning are still in draft form and have been left out so far.

1 Introduction

Adapting language models to real-world use cases is a crucial step for anyone trying to build something valuable with them. While larger models sometimes work well almost out of the box, you’re unlikely to hit the performance ceiling without being more thoughtful than that, and there are plenty of cases where using or fine-tuning a smaller model is actually the better option (e.g., at scale).

In this conceptual guide, I’ll describe various ways to adapt and fine-tune language models for downstream use cases. While fine-tuning is the ultimate theme here, there is a whole range of ways to customize and mold a model’s behavior. In fact, fine-tuning itself is not just one method, but many. And while the internet already abounds with advice and tutorials, none comprehensively describes, organizes, and links all these ideas in a way that would hold me back from writing this guide.

This initial version is not so much a hands-on tutorial on how to adapt or fine-tune models (even though I’d like it to eventually become one). For now, it’s an attempt at organizing (my own) knowledge and ideas around what you can do and why. All of this is very much an active, fast-changing area of research (much of it empirical rather than theoretical), with lots of room for improvement and new ideas. Nevertheless, the topics I discuss here are fundamental ideas that have already stood the test of time and will remain useful at least for the foreseeable future.

When you read this, I generally assume that you have good prior knowledge of language models. However, this guide is mostly about building up a broad understanding of the options available to you, so don’t let that stop you from reading it.

2 Preliminaries

Some general background before we get to the main parts.

2.1 Learning Concepts

The first thing I want to do is define and differentiate various types of learning, which should help establish a common understanding of the mechanisms behind how language models “learn”.

Supervised learning: Learning from labels, i.e., learning a function \(y = f(x)\). For example, fine-tuning a language model on a classification task using a set of labeled examples is supervised learning.
Unsupervised learning: Learning without access to labels. It often involves finding patterns in data, such as clustering. Techniques like nearest neighbors, dimensionality reduction, and embedding models also fall under this category.
Self-supervised learning: Deriving labels from the data itself, such as predicting the next token in a sequence. Other examples include diffusion models, masked language modeling (e.g., BERT), and auto-encoders.
Transfer learning: Using a model trained on one task (or data domain) A to solve a different task B. It can be more effective than learning task B directly when A has abundant data but B does not. The goal is for learned representations to generalize to downstream tasks.
Multi-task learning: Training a single model on multiple tasks, which encourages (or even forces) the learning of shared representations. This has a strong regularization effect, which helps explain why large language models, which are inherently trained on diverse tasks, perform so well.
Meta learning: Also known as “learning to learn,” meta learning teaches a model to learn from entire datasets rather than single data points. It is conceptually related to in-context learning.
In-context learning: Language models can learn from examples provided in the input context, without updating their weights. This powerful mechanism is not yet fully understood.
Contrastive learning: Embedding inputs such that similar inputs are close together and dissimilar ones are far apart in representation space, according to some distance metric such as cosine similarity. This is especially effective for training embedding models.
Domain adaptation: Adapting a model trained in one domain to perform well in another. Continued pre-training on new domain data (e.g., medical texts) is one common method.
Continual learning: Unlike static learning, continual learning allows models to update their knowledge over time, akin to human learning. This remains a challenging and largely unsolved area.
Curriculum learning: Learning simpler concepts before harder ones, similar to human education. While current models don’t fully follow this approach, some aspects (e.g., higher-quality data later in training) resemble it.
Model distillation: Training a smaller model (the student) using outputs from a larger model (the teacher). The teacher provides richer signals than traditional next-token prediction.
Reinforcement learning: Learning through interaction with an environment via rewards. Actions are selected according to a policy. In language models, tokens act as actions.
On-policy vs. off-policy learning: On-policy learning uses data generated by the model’s own behavior; off-policy learning uses external data.

2.2 Model Architectures

In this section, I’ll briefly review the most common model architectures used today. This is important insofar as different architectures are (pre)-trained in different ways, and thus need to be adapted and/or fine-tuned in different ways.

The two dominant model architectures in the current era of modern NLP are decoder and encoder models. Both are modern variants of the transformer architecture originally introduced in 2017 by the famous paper “Attention Is All You Need.”

2.2.1 Decoder Models

Most of the attention today is on decoder models. These are characterized by their causal attention mechanism, which allows them to predict one token at a time. This is also why they are called generative or auto-regressive language models. As it turns out, this paradigm not only works incredibly well, it’s also very desirable because it is so flexible: almost any task can be framed as a next-token prediction task. This has given rise to an entirely new paradigm in machine learning. Most of what I’ll cover in this article will primarily and implicitly apply to decoder language models, except when noted otherwise.
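
To make this concrete, here’s a minimal sketch of auto-regressive generation using the Hugging Face transformers text-generation pipeline (gpt2 is just a small example checkpoint):

Code
# Minimal sketch: auto-regressive (next-token) generation with a decoder model.
# Requires the Hugging Face transformers package; gpt2 is just a small example checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts the next token given everything generated so far.
output = generator("Yesterday I went to the", max_new_tokens=10)
print(output[0]["generated_text"])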

2.2.2 Encoder Models

Encoder models, unlike decoder models, have bi-directional attention. Many encoder-based language models are trained using masked language modelling, where a random subset of tokens is masked and predicted by the model (see below), but other pre-text tasks exist too. Usually, these models are not as useful out of the box, but they can be fine-tuned to perform many different tasks. Popular use cases are classification, or any task where you need one prediction per token (e.g., named entity recognition).

<|CLS|> Yesterday I <|M|> to the <|M|> <|EOS|>
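
To make the masked-token objective concrete, here’s a minimal sketch using the Hugging Face fill-mask pipeline (bert-base-uncased is just an example checkpoint):

Code
# Minimal sketch: masked language modeling with an encoder model.
# Requires the Hugging Face transformers package; bert-base-uncased is just an example checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the literal [MASK] token as its mask placeholder.
for prediction in fill_mask("Yesterday I [MASK] to the store.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))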

Encoders are also a common choice for (text) embedding models (no need for causal attention there either), which are often trained using contrastive learning. A popular multi-modal embedding model example is CLIP, which is trained to align images with their respective captions.
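
As a quick illustration of the embedding use case, here’s a minimal sketch using the sentence-transformers library (all-MiniLM-L6-v2 is just an example of an encoder-based embedding model trained with a contrastive objective):

Code
# Minimal sketch: encoder-based text embeddings and cosine similarity.
# Requires the sentence-transformers package; the checkpoint is just an example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline was resting on the rug.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences)

# Semantically similar sentences should end up close together in embedding space.
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low similarity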

More recently, even for tasks like classification or embeddings, which were previously encoders’ territory, decoder models are increasingly used, for example by adapting an already pre-trained decoder language model. Most of the research today goes into decoder-like models, since the convenience of using one architecture to solve all problems is quite alluring (very much to the detriment of idea diversity).

2.2.3 Other Architectures

Just for the sake of completeness, transformers in encoder or decoder form are not the only architectures in use today. In particular, more efficient models that can scale even further are needed, since transformer models suffer from quadratic complexity in their attention mechanism.

For example, state-space models (popularized by a series of models named “Mamba”) also exist and have shown some promise. Linear attention mechanisms have also been proposed. However, these models remain niche and have so far not been used in frontier models (to the extent we can even say that about closed-source models, of course).

2.3 Pre & Post-Training of LLMs

I’ll assume you know how today’s language models are trained, so this will be brief. However, I will also try to connect some of the concepts from earlier, which will be a good segue into inference-time and fine-tuning techniques.

Modern LLMs are trained in (three) stages. The first stage is massive pre-training using next-token prediction. This is self-supervised learning, essentially a classification task over the token vocabulary (often between 30k and 250k tokens in practice) using cross-entropy loss. Pre-training language models shouldn’t just be seen as next-token prediction, since that isn’t the full story. The main advantage is representation learning, i.e., having the model learn rich, useful, and general representations which it can use to solve many different tasks, but this could also be achieved otherwise (e.g., via masked language modelling or even diffusion).
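
To make the “classification over tokens” framing concrete, here’s a minimal PyTorch sketch of the pre-training loss; the logits are random stand-ins for a model’s output, and the shapes and vocabulary size are purely illustrative:

Code
# Minimal sketch: next-token prediction as classification over the vocabulary.
# The logits are random stand-ins for a real model's output; shapes are illustrative.
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 16, 50_000

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # input token ids
logits = torch.randn(batch_size, seq_len, vocab_size)            # model output (stand-in)

# At each position t, the model is scored on predicting token t+1,
# so the targets are the inputs shifted left by one.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_targets = token_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_targets)
print(loss.item())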

Following that comes post-training. Traditionally, this involved supervised instruction tuning (to teach the model to respond rather than be a pure text completer) followed by reinforcement learning (to align the model with human preferences and bring about reasoning abilities), though the boundaries between all three stages are increasingly blurring.

Both pre-training and, in particular, instruction tuning are essentially massive multi-task learning. It turns out there are also key elements of meta learning at play, i.e., the model learns to learn from what is provided in its context. This is the underlying force that gives rise to emergent behaviors such as few-shot learning and in-context learning (and, I would argue, chain-of-thought and reasoning too). Training phases often also incorporate elements of curriculum learning; for example, higher-quality data and long-context data typically appear later in training.

During the RL phase, both on-policy and off-policy methods may be applied, with on-policy methods generally having the advantage. Once a large and strong foundation model has been trained, a common practice is to distill its capabilities into smaller language models.

Increasingly, the lines between these three stages are blurring or even vanishing. Today, RL has been successfully applied on top of base models (i.e., pre-trained-only models). And remember that even pre-training and SFT can be viewed through the lens of reinforcement learning as “behavior cloning”.

3 Inference-Time Techniques

As you know, language models are quite steerable, and there are many techniques that can help improve performance even without updating their weights. If you look more closely, as we’ll do here, you’ll notice that this isn’t just “prompting”, but there are many interesting and effective techniques.

3.1 Zero-Shot Prompting

Language models can perform many tasks in a zero-shot manner, given just instructions. This is already an example of in-context learning, i.e., the LM learns the task from your instructions alone.

While this seems trivial at first, the way I think about prompting is through activating different sub-networks of the language model. If you change the instructions (i.e. the prompt), even just a little bit, the language model will use different neurons in its network and the results will likely change (sometimes drastically).

Asking a language model to “think step by step” can also help boost performance, especially on tasks that require system-2-style thinking (i.e., deliberate problem solving), like math or composite problems. This and countless other prompt engineering techniques have been proposed over time (some more relevant and useful than others). However, this is not meant to be a prompt engineering guide, so just know that prompting is one of the many ways to adapt a language model.
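
As a minimal example, here’s what a zero-shot prompt (with a step-by-step instruction) might look like via the OpenAI Python client; the model name is just a placeholder, and any instruction-tuned model works similarly:

Code
# Minimal sketch: zero-shot prompting with a step-by-step instruction.
# The model name is a placeholder; any instruction-tuned model works similarly.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Classify the sentiment of the following review as positive, negative, or mixed. "
    "Think step by step, then give your final answer on the last line.\n\n"
    "Review: The battery lasts forever, but the screen scratches far too easily."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)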

To get some ideas of what makes a good prompt, see for example…

3.2 Reasoning

What started with the discovery of chain-of-thought prompting has led to the development of explicit “reasoning” models that think natively, i.e., even without you asking for it (and some models like Qwen3 even allow you to enable/disable the thinking mode, a feature I really like). Typically, the model response looks something like this:

<think>...</think>
<answer>...</answer>
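
If you work with such traces programmatically, you typically separate the thinking block from the final answer; here’s a minimal sketch assuming the tag format shown above:

Code
# Minimal sketch: separating the thinking trace from the final answer.
# Assumes the <think>/<answer> tag format shown above.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    thinking = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return thinking, final

trace, final = split_reasoning("<think>2 + 2 equals 4.</think>\n<answer>4</answer>")
print(final)  # "4"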

Producing such long reasoning traces is sometimes also called scaling “inference-time compute”, i.e., spending more time (or rather tokens) on thinking, and it has been shown to improve performance massively in many cases, especially hit@1 rates (it has also been shown to decrease performance on some tasks due to “overthinking”). While it’s related to chain-of-thought, it goes further than that and is often characterized by certain cognitive behaviors such as reflection, self-verification, and backtracking.

Reasoning models today are trained primarily using reinforcement learning from verifiable rewards (RLVR). While reasoning can to some degree also be (and is) trained into language models using SFT (e.g., during instruction fine-tuning), the cognitive behaviors just mentioned appear to emerge naturally under RLVR. It’s not fully understood yet why and how exactly they emerge, but it’s an exciting and useful development.

See more on that in the RLVR fine-tuning section later.

3.3 Few-Shot Learning

Few-shot learning is another well-known and proven technique that is extremely useful in practice. This is “in-context learning” as it’s most commonly understood.

In fact, research has shown that something akin to gradient descent occurs within the network when doing few-shot learning (which relates it to meta learning). However, others are more skeptical and argue that there is no real learning happening at all, and that it’s just task retrieval (meaning it merely helps the model retrieve the right pattern to apply).

Either way, the performance gains are real. These gains have been shown to continue even into the hundreds and thousands of in-context examples (“many-shot learning” if you will). However, at that point, context length and its associated cost are likely to become a problem, in which case fine-tuning would probably be a good idea.

Code
# Example of few-shot learning
---

# General instructions
{instructions}

# Examples
Input: {input1}
Label: {label1}

Input: {input2}
Label: {label2}

# Target question

Input: {input3}
Label: <model generated answer>

3.4 Self-Consistency Sampling

There are also various sampling-related techniques that can be very useful in practice. For instance, with self-consistency sampling, you sample N responses and take a majority vote. You probably want to use a somewhat higher temperature so the samples aren’t all identical (other sampling techniques like beam search may help too).

Code
# Sample 1
<think> ... </think>
<answer> B </answer>

# Sample 2
<think> ... </think>
<answer> A </answer>

# Sample 3
<think> ... </think>
<answer> B </answer>

# Majority Voting
Selected answer: B
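
For completeness, here’s a minimal sketch of the voting logic itself; sample_answer is a stand-in for a single model call at temperature > 0 that returns a parsed final answer:

Code
# Minimal sketch: self-consistency via majority voting.
# sample_answer() stands in for one model call (temperature > 0) returning a parsed answer.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Stand-in for an actual LLM call; replace with your model of choice.
    return random.choice(["A", "B", "B"])

def self_consistent_answer(question: str, n: int = 5) -> str:
    answers = [sample_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # the most frequent answer wins

print(self_consistent_answer("Which option is correct?"))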

Some cool research from Google showed that simply scaling parallel sampling and parallel verification can match or even exceed the performance of reasoning models. An important part of the explanation is the generation-verification gap: verifying a correct answer is generally easier than generating one. So if you generate many answers, one of them is likely to be correct, and that one will likely be easy to verify too.

If you don’t have answers you can directly verify and compare (as with multiple-choice or math answers), you can also evaluate/rank the responses using an LLM-as-a-judge approach (either in parallel or all at once, as below).

Code
Given the following 3 generated answers to a question, select the best answer you think is most likely to be correct.

Q: {question}

A1: {answer1}
A2: {answer2}
A3: {answer3}

---

# Model response
<think> ... </think>
<answer> A2 </answer>

3.5 Retrieval Augmentation

Language models don’t know everything, and in fact tend to make things up when they don’t (getting them to reliably say “I don’t know” is still an open research question). Instead, it often helps to provide the LM with relevant context to ground its response.

There are many ways to do RAG, including query expansion and query augmented generation (QuAG). Some may refer to this as “agentic search”, but the main idea is to have the LM write its own search query.
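
As a minimal sketch of the retrieval step, here’s a toy example that embeds a handful of documents, retrieves the most relevant ones for a query, and builds a grounded prompt (the documents and embedding model are just placeholders):

Code
# Minimal sketch: retrieval-augmented prompting with a toy document store.
# Requires sentence-transformers; documents and checkpoint are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping to the EU usually takes 3-5 business days.",
    "Premium support is available on the Enterprise plan only.",
]
doc_embeddings = model.encode(documents)

query = "How long do I have to return an item?"
scores = util.cos_sim(model.encode(query), doc_embeddings)[0]
top_k = scores.argsort(descending=True)[:2]  # indices of the 2 most similar documents

context = "\n".join(documents[int(i)] for i in top_k)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)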

To be done…

3.6 Tool Use

Equipping language models with the ability to use tools or call functions has been one of the most influential developments, and it is perhaps the main reason language models are actually becoming useful in the real world.

Today’s frontier models can even combine tool use and reasoning. This makes sense insofar as it allows the model to select tools thoughtfully, and to reflect on its outputs. Modern workflows can require tens or hundreds of tool calls within a single turn.
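
As an illustration, here’s a minimal function-calling sketch using the OpenAI Python client; the tool schema and model name are just examples:

Code
# Minimal sketch: exposing a tool to the model via function calling.
# The model name and tool schema are just examples.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Do I need an umbrella in London today?"}],
    tools=tools,
)

# If the model decides to call the tool, it returns the call (name + JSON arguments)
# instead of a text answer; you execute the tool and send the result back in the next turn.
print(response.choices[0].message.tool_calls)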

To be done…

3.7 Structured Generation

When unconstrained, LMs can generate any sequence of tokens. In practice, however, you often want the response to follow an exact structure. The most popular example is JSON, but you can constrain the model’s outputs to strictly follow any structure, whether that’s XML or a regex pattern.
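
For example, many inference APIs let you constrain the output directly; here’s a minimal sketch using the OpenAI client’s JSON mode (exact schema-enforcement options vary by provider, so treat this as illustrative):

Code
# Minimal sketch: constraining the output to valid JSON via the client's JSON mode.
# The model name is a placeholder; JSON mode requires mentioning JSON in the prompt.
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "Extract the product and sentiment from this review as JSON "
            'with keys "product" and "sentiment": '
            "'The new headphones sound amazing.'"
        ),
    }],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)  # guaranteed to be parseable JSON
print(data)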

To be done…

3.8 Automatic Prompt Optimization

To be done…

4 Fine-Tuning Techniques

At last, welcome to the core part of this article. You may want to consider adapting the weights of a language model in the following cases:

  • Performance: you may not reach your desired performance using only inference-time strategies
  • Reliability: you may get unreliable results due to the stochastic nature of LLMs (sometimes even at zero temperature)
  • Behavior: you may want to change the behavior of your model
  • Knowledge: injecting new knowledge is an open problem, and fine-tuning may or may not be sufficient
  • Latency & Throughput: a large LLM is often not economical to use at scale, and latency may be too high for many real-world use cases, so a smaller model can be fine-tuned to match or exceed the performance of larger models on specific tasks
  • Cost: you may have very detailed instructions (for example with few-shot examples), which add cost to every call

The idea here is essentially to apply transfer learning: language models are pre-trained extensively and come with a lot of world knowledge, general language understanding capabilities, and problem-solving skills, and fine-tuning transfers these general capabilities to your specific downstream task.

4.1 Supervised Fine-Tuning (SFT)

To be done…

4.2 Reinforcement Learning from Human Feedback (RLHF)

To be done…

4.3 Reinforcement Learning from Verifiable Rewards (RLVR)

To be done…

4.4 Model Distillation

To be done…

4.5 Contrastive Learning

To be done…

5 Conclusion

To be done…

6 Appendix

To be done…

6.1 Parameter-Efficient Fine-Tuning (PEFT)

To be done…

6.2 Sampling

To be done…

6.3 Encoder Model Fine-Tuning

To be done…