Fine-Tuning Language Models

A conceptual guide to adapting LMs to downstream use cases

Language Models
Fine-Tuning
Author

Andreas Stöffelbauer

Published

July 26, 2025

🚧 This article is a work in progress; the sections on fine-tuning are still in draft form and have been left out so far.

1 Introduction

Adapting language models to real-world use cases is a crucial step for anyone trying to build something valuable with them. While larger models sometimes work well almost out of the box, you’re unlikely to hit the performance ceiling without being more thoughtful than that, and there are plenty of cases where using or fine-tuning a smaller model is actually the better option (e.g., at scale).

In this conceptual guide, I’ll describe various ways to adapt and fine-tune language models for downstream use cases. While fine-tuning is the ultimate theme here, there is a whole range of ways to customize and mold a model’s behavior. In fact, fine-tuning itself is not just one method, but many. And while the internet already abounds with advice and tutorials, none comprehensively describes, organizes, and links all these ideas in a way that would hold me back from writing this guide.

This initial version is not so much a hands-on tutorial on how to adapt or fine-tune models (even though I’d like it to eventually become one). For now, it’s an attempt at organizing (my own) knowledge and ideas around what you can do and why. All of this is very much an active, fast-changing area of research (much of it empirical rather than theoretical), with lots of room for improvement and new ideas. Nevertheless, the topics I discuss here are fundamental ideas that have already stood the test of time and will remain useful at least for the foreseeable future.

When you read this, I generally assume that you have good prior knowledge of language models. However, this guide is mostly about building up a broad understanding of the options available to you, so don’t let that stop you from reading it.

2 Preliminaries

Some general background before we get to the main parts.

2.1 Learning Concepts

The first thing I want to do is define and differentiate various types of learning, which should help establish a common understanding of the mechanisms behind how language models “learn”.

Supervised learning: Learning from labels, i.e., learning a function \(y = f(x)\). For example, fine-tuning a language model on a classification task using a set of labeled examples is supervised learning.
Unsupervised learning: Learning without access to labels. It often involves finding patterns in data, such as clustering. Techniques like nearest neighbors, dimensionality reduction, and embedding models also fall under this category.
Self-supervised learning: Deriving labels from the data itself, such as predicting the next token in a sequence. Other examples include diffusion models, masked language modeling (e.g., BERT), and auto-encoders.
Transfer learning: Using a model trained on one task (or data domain) A to solve a different task B. It can be more effective than learning task B directly when A has abundant data but B does not. The goal is for learned representations to generalize to downstream tasks.
Multi-task learning: Training a single model on multiple tasks, which encourages (or even forces) the learning of shared representations. This has a strong regularization effect, which helps explain why large language models, which are inherently trained on diverse tasks, perform so well.
Meta learning: Also known as “learning to learn,” meta learning teaches a model to learn from entire datasets rather than single data points. It is conceptually related to in-context learning.
In-context learning: Language models can learn from examples provided in the input context, without updating their weights. This powerful mechanism is not yet fully understood.
Contrastive learning: Embedding inputs such that similar inputs are close together and dissimilar ones are far apart in representation space, according to some distance metric such as cosine similarity. This is especially effective for training embedding models.
Domain adaptation: Adapting a model trained in one domain to perform well in another. Continued pre-training on new domain data (e.g., medical texts) is one common method.
Continual learning: Unlike static learning, continual learning allows models to update their knowledge over time, akin to human learning. This remains a challenging and largely unsolved area.
Curriculum learning: Learning simpler concepts before harder ones, similar to human education. While current models don’t fully follow this approach, some aspects (e.g., higher-quality data later in training) resemble it.
Model distillation: Training a smaller model (the student) using outputs from a larger model (the teacher). The teacher provides richer signals than traditional next-token prediction.
Reinforcement learning: Learning through interaction with an environment via rewards. Actions are selected according to a policy. In language models, tokens act as actions.
On-policy vs. off-policy learning: On-policy learning uses data generated by the model’s own behavior; off-policy learning uses external data.

2.2 Model Architectures

In this section, I’ll briefly review the most common model architectures used today. This is important insofar as different architectures are (pre)-trained in different ways, and thus need to be adapted and/or fine-tuned in different ways.

The two dominant model architectures in the current era of modern NLP are decoder and encoder models. Both are modern variants of the transformer architecture originally introduced in 2017 by the famous paper “Attention Is All You Need.”

2.2.1 Decoder Models

Most of the attention today is on decoder models. These are characterized by their causal attention mechanism, which allows them to predict one token at a time. This is also why they are called generative or auto-regressive language models. As it turns out, this paradigm not only works incredibly well, it’s also very desirable because it is so flexible: almost any task can be framed as a next-token prediction task. This has given rise to an entirely new paradigm in machine learning. Most of what I’ll cover in this article will primarily and implicitly apply to decoder language models, except when noted otherwise.
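
To make this concrete, here’s a minimal sketch of auto-regressive generation using the Hugging Face transformers text-generation pipeline (gpt2 is just a small example checkpoint):

Code
# Minimal sketch: auto-regressive (next-token) generation with a decoder model.
# Requires the Hugging Face transformers package; gpt2 is just a small example checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts the next token given everything generated so far.
output = generator("Yesterday I went to the", max_new_tokens=10)
print(output[0]["generated_text"])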

2.2.2 Encoder Models

Encoder models, unlike decoder models, have bi-directional attention. Many encoder-based language models are trained using masked language modelling, where a random subset of tokens is masked and predicted by the model (see below), but other pre-text tasks exist too. Usually, these models are not as useful out of the box, but they can be fine-tuned to perform many different tasks. Popular use cases are classification, or any task where you need one prediction per token (e.g., named entity recognition).

<|CLS|> Yesterday I <|M|> to the <|M|> <|EOS|>
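
To make the masked-token objective concrete, here’s a minimal sketch using the Hugging Face fill-mask pipeline (bert-base-uncased is just an example checkpoint):

Code
# Minimal sketch: masked language modeling with an encoder model.
# Requires the Hugging Face transformers package; bert-base-uncased is just an example checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the literal [MASK] token as its mask placeholder.
for prediction in fill_mask("Yesterday I [MASK] to the store.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))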

Encoders are also a common choice for (text) embedding models (no need for causal attention there either), which are often trained using contrastive learning. A popular multi-modal embedding model example is CLIP, which is trained to align images with their respective captions.
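
As a quick illustration of the embedding use case, here’s a minimal sketch using the sentence-transformers library (all-MiniLM-L6-v2 is just an example of an encoder-based embedding model trained with a contrastive objective):

Code
# Minimal sketch: encoder-based text embeddings and cosine similarity.
# Requires the sentence-transformers package; the checkpoint is just an example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline was resting on the rug.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences)

# Semantically similar sentences should end up close together in embedding space.
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low similarity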

More recently, even for tasks like classification or embeddings, which were previously encoders’ territory, decoder models are increasingly used, for example by adapting an already pre-trained decoder language model. Most of the research today goes into decoder-like models, since the convenience of using one architecture to solve all problems is quite alluring (very much to the detriment of idea diversity).

2.2.3 Other Architectures

Just for the sake of completeness, transformers in encoder or decoder form are not the only architectures in use today. In particular, more efficient models that can scale even further are needed, since transformer models suffer from quadratic complexity in their attention mechanism.

For example, state-space models (popularized by a series of models named “Mamba”) also exist and have shown some promise. Linear attention mechanisms have also been proposed. However, these models remain niche and have so far not been used in frontier models (to the extent we can even say that about closed-source models, of course).

2.3 Pre & Post-Training of LLMs

I’ll assume you know how today’s language models are trained, so this will be brief. However, I will also try to connect some of the concepts from earlier, which will be a good segue into inference-time and fine-tuning techniques.

Modern LLMs are trained in (three) stages. The first stage is massive pre-training using next-token prediction. This is self-supervised learning, essentially a classification task over the token vocabulary (often between 30k and 250k tokens in practice) using cross-entropy loss. Pre-training language models shouldn’t just be seen as next-token prediction, since that isn’t the full story. The main advantage is representation learning, i.e., having the model learn rich, useful, and general representations which it can use to solve many different tasks, but this could also be achieved otherwise (e.g., via masked language modelling or even diffusion).
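
To make the “classification over tokens” framing concrete, here’s a minimal PyTorch sketch of the pre-training loss; the logits are random stand-ins for a model’s output, and the shapes and vocabulary size are purely illustrative:

Code
# Minimal sketch: next-token prediction as classification over the vocabulary.
# The logits are random stand-ins for a real model's output; shapes are illustrative.
import torch
import torch.nn.functional as F

batch_size, seq_len, vocab_size = 2, 16, 50_000

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # input token ids
logits = torch.randn(batch_size, seq_len, vocab_size)            # model output (stand-in)

# At each position t, the model is scored on predicting token t+1,
# so the targets are the inputs shifted left by one.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_targets = token_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_targets)
print(loss.item())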

Following that comes post-training. Traditionally, this involved supervised instruction tuning (to teach the model to respond rather than be a pure text completer) followed by reinforcement learning (to align the model with human preferences and bring about reasoning abilities), though the boundaries between all three stages are increasingly blurring.

Both pre-training and, in particular, instruction tuning are essentially massive multi-task learning. It turns out there are also key elements of meta learning at play, i.e., the model learns to learn from what is provided in its context. This is the underlying force that gives rise to emergent behaviors such as few-shot learning and in-context learning (and, I would argue, chain-of-thought and reasoning too). Training phases often also incorporate elements of curriculum learning; for example, higher-quality data and long-context data typically appear later in training.

During the RL phase, both on-policy and off-policy methods may be applied, with on-policy methods generally having the advantage. Once a large and strong foundation model has been trained, a common practice is to distill its capabilities into smaller language models.

Increasingly, the lines between these three stages are blurring or even vanishing. Today, RL has been successfully applied on top of base models (i.e., pre-trained-only models). And remember that even pre-training and SFT can be viewed through the lens of reinforcement learning as “behavior cloning”.

3 Inference-Time Techniques

As you know, language models are quite steerable, and there are many techniques that can help improve performance even without updating their weights. If you look more closely, as we’ll do here, you’ll notice that this isn’t just “prompting”, but there are many interesting and effective techniques.

3.1 Zero-Shot Prompting

Language models can perform many tasks in a zero-shot manner, given just instructions. This is already an example of in-context learning, i.e., the LM learns the task from your instructions alone.

While this seems trivial at first, the way I think about prompting is through activating different sub-networks of the language model. If you change the instructions (i.e. the prompt), even just a little bit, the language model will use different neurons in its network and the results will likely change (sometimes drastically).

Asking a language model to “think step by step” can also help boost performance, especially on tasks that require system-2-style thinking (i.e., deliberate problem solving), like math or composite problems. This and countless other prompt engineering techniques have been proposed over time (some more relevant and useful than others). However, this is not meant to be a prompt engineering guide, so just know that prompting is one of the many ways to adapt a language model.
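
As a minimal example, here’s what a zero-shot prompt (with a step-by-step instruction) might look like via the OpenAI Python client; the model name is just a placeholder, and any instruction-tuned model works similarly:

Code
# Minimal sketch: zero-shot prompting with a step-by-step instruction.
# The model name is a placeholder; any instruction-tuned model works similarly.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Classify the sentiment of the following review as positive, negative, or mixed. "
    "Think step by step, then give your final answer on the last line.\n\n"
    "Review: The battery lasts forever, but the screen scratches far too easily."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)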

To get some ideas of what makes a good prompt, see for example…

3.2 Reasoning

What started with the discovery of chain-of-thought prompting has led to the development of explicit “reasoning” models that think natively, i.e., even without you asking for it (and some models like Qwen3 even allow you to enable/disable the thinking mode, a feature I really like). Typically, the model response looks something like this:

<think>...</think>
<answer>...</answer>
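
If you work with such traces programmatically, you typically separate the thinking block from the final answer; here’s a minimal sketch assuming the tag format shown above:

Code
# Minimal sketch: separating the thinking trace from the final answer.
# Assumes the <think>/<answer> tag format shown above.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    thinking = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else text.strip()
    return thinking, final

trace, final = split_reasoning("<think>2 + 2 equals 4.</think>\n<answer>4</answer>")
print(final)  # "4"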

Producing such long reasoning traces is sometimes also called scaling “inference-time compute”, i.e., spending more time (or rather tokens) on thinking, and it has been shown to improve performance massively in many cases, especially hit@1 rates (it has also been shown to decrease performance on some tasks due to “overthinking”). While it’s related to chain-of-thought, it goes further than that and is often characterized by certain cognitive behaviors such as reflection, self-verification, and backtracking.

Reasoning models today are trained primarily using reinforcement learning from verifiable rewards (RLVR). While reasoning can to some degree also be (and is) trained into language models using SFT (e.g., during instruction fine-tuning), the cognitive behaviors just mentioned appear to emerge naturally under RLVR. It’s not fully understood yet why and how exactly they emerge, but it’s an exciting and useful development.

See more on that in the RLVR fine-tuning section later.

3.3 Few-Shot Learning

Few-shot learning is another well-known and proven technique that is extremely useful in practice. This is “in-context learning” as it’s most commonly understood.

In fact, research has shown that something akin to gradient descent occurs within the network when doing few-shot learning (which relates it to meta learning). However, others are more skeptical and argue that there is no real learning happening at all, and that it’s just task retrieval (meaning it merely helps the model retrieve the right pattern to apply).

Either way, the performance gains are real. These gains have been shown to continue even into the hundreds and thousands of in-context examples (“many-shot learning” if you will). However, at that point, context length and its associated cost are likely to become a problem, in which case fine-tuning would probably be a good idea.

Code
# Example of few-shot learning
---

# General instructions
{instructions}

# Examples
Input: {input1}
Label: {label1}

Input: {input2}
Label: {label2}

# Target question

Input: {input3}
Label: <model generated answer>

3.4 Self-Consistency Sampling

There are also various sampling-related techniques that can be very useful in practice. For instance, with self-consistency sampling, you sample N responses and take a majority vote. You probably want to use a somewhat higher temperature so the samples aren’t all identical (other sampling techniques like beam search may help too).

Code
# Sample 1
<think> ... </think>
<answer> B </answer>

# Sample 2
<think> ... </think>
<answer> A </answer>

# Sample 3
<think> ... </think>
<answer> B </answer>

# Majority Voting
Selected answer: B
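
For completeness, here’s a minimal sketch of the voting logic itself; sample_answer is a stand-in for a single model call at temperature > 0 that returns a parsed final answer:

Code
# Minimal sketch: self-consistency via majority voting.
# sample_answer() stands in for one model call (temperature > 0) returning a parsed answer.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Stand-in for an actual LLM call; replace with your model of choice.
    return random.choice(["A", "B", "B"])

def self_consistent_answer(question: str, n: int = 5) -> str:
    answers = [sample_answer(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]  # the most frequent answer wins

print(self_consistent_answer("Which option is correct?"))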

Some cool research from Google showed that simply scaling parallel sampling and parallel verification can match or even exceed the performance of reasoning models. An important part of the explanation is the generation-verification gap: verifying a correct answer is generally easier than generating one. So if you generate many answers, one of them is likely to be correct, and that one will likely be easy to verify too.

If you don’t have answers you can directly verify and compare (as with multiple-choice or math answers), you can also evaluate/rank the responses using an LLM-as-a-judge approach (either in parallel or all at once, as below).

Code
Given the following 3 generated answers to a question, select the best answer you think is most likely to be correct.

Q: {question}

A1: {answer1}
A2: {answer2}
A3: {answer3}

---

# Model response
<think> ... </think>
<answer> A2 </answer>

3.5 Retrieval Augmentation

Language models don’t know everything, and in fact tend to make things up when they don’t (getting them to reliably say “I don’t know” is still an open research question). Instead, it often helps to provide the LM with relevant context to ground its response.

There are many ways to do RAG, including query expansion and query augmented generation (QuAG). Some may refer to this as “agentic search”, but the main idea is to have the LM write its own search query.
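
As a minimal sketch of the retrieval step, here’s a toy example that embeds a handful of documents, retrieves the most relevant ones for a query, and builds a grounded prompt (the documents and embedding model are just placeholders):

Code
# Minimal sketch: retrieval-augmented prompting with a toy document store.
# Requires sentence-transformers; documents and checkpoint are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping to the EU usually takes 3-5 business days.",
    "Premium support is available on the Enterprise plan only.",
]
doc_embeddings = model.encode(documents)

query = "How long do I have to return an item?"
scores = util.cos_sim(model.encode(query), doc_embeddings)[0]
top_k = scores.argsort(descending=True)[:2]  # indices of the 2 most similar documents

context = "\n".join(documents[int(i)] for i in top_k)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)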

To be done…

3.6 Tool Use

Equipping language models with the ability to use tools or call functions has been one of the most influential developments, and it is perhaps the main reason language models are actually becoming useful in the real world.

Today’s frontier models can even combine tool use and reasoning. This makes sense insofar as it allows the model to select tools thoughtfully, and to reflect on its outputs. Modern workflows can require tens or hundreds of tool calls within a single turn.
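
As an illustration, here’s a minimal function-calling sketch using the OpenAI Python client; the tool schema and model name are just examples:

Code
# Minimal sketch: exposing a tool to the model via function calling.
# The model name and tool schema are just examples.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Do I need an umbrella in London today?"}],
    tools=tools,
)

# If the model decides to call the tool, it returns the call (name + JSON arguments)
# instead of a text answer; you execute the tool and send the result back in the next turn.
print(response.choices[0].message.tool_calls)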

To be done…

3.7 Structured Generation

When unconstrained, LMs can generate any sequence of tokens. In practice, however, you often want the response to follow an exact structure. The most popular example is JSON, but you can constrain the model’s outputs to strictly follow any structure, whether that’s XML or a regex pattern.
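
For example, many inference APIs let you constrain the output directly; here’s a minimal sketch using the OpenAI client’s JSON mode (exact schema-enforcement options vary by provider, so treat this as illustrative):

Code
# Minimal sketch: constraining the output to valid JSON via the client's JSON mode.
# The model name is a placeholder; JSON mode requires mentioning JSON in the prompt.
import json
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "Extract the product and sentiment from this review as JSON "
            'with keys "product" and "sentiment": '
            "'The new headphones sound amazing.'"
        ),
    }],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)  # guaranteed to be parseable JSON
print(data)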

To be done…

3.8 Automatic Prompt Optimization

To be done…

4 Fine-Tuning Techniques

At last, welcome to the core part of this article. You may want to consider adapting the weights of a language model in the following cases:

  • Performance: you may not reach your desired performance using only inference-time strategies
  • Reliability: you may get unreliable results due to the stochastic nature of LLMs (sometimes even at zero temperature)
  • Behavior: you may want to change the behavior of your model
  • Knowledge: injecting new knowledge is an open problem, and fine-tuning may or may not be sufficient
  • Latency & Throughput: a large LLM is often not economical to use at scale, and latency may be too high for many real-world use cases, so a smaller model can be fine-tuned to match or exceed the performance of larger models on specific tasks
  • Cost: you may have very detailed instructions (for example with few-shot examples), which add cost to every call

The idea here is essentially to apply transfer learning: language models are pre-trained extensively and come with a lot of world knowledge, general language understanding capabilities, and problem-solving skills, and fine-tuning transfers these general capabilities to your specific downstream task.

4.1 Supervised Fine-Tuning (SFT)

To be done…

4.2 Reinforcement Learning from Human Feedback (RLHF)

To be done…

4.3 Reinforcement Learning from Verifiable Rewards (RLVR)

To be done…

4.4 Model Distillation

To be done…

4.5 Contrastive Learning

To be done…

5 Conclusion

To be done…

6 Appendix

To be done…

6.1 Parameter-Efficient Fine-Tuning (PEFT)

To be done…

6.2 Sampling

To be done…

6.3 Encoder Model Fine-Tuning

To be done…