22 Jul 2025

Fine-tuning vs RAG vs prompting: a practical decision guide

Every few months, a business asks us to fine-tune an AI model for their use case. Sometimes that's exactly the right call. More often, the problem they're trying to solve is better addressed by a different approach — RAG, or simply better prompt engineering — that's faster to deploy and cheaper to maintain.

The confusion is understandable. "Fine-tuning" has become a catch-all term for "making AI work better for our situation." In practice it refers to a specific technical intervention with specific trade-offs, appropriate for a specific subset of problems. Understanding when fine-tuning is the right tool — versus retrieval-augmented generation, versus prompt engineering, versus just using a more capable base model — is the decision that determines whether your AI investment delivers.

What each approach actually does

These three interventions operate at different levels of the stack, which is why they suit different problems.

Prompt engineering changes the framing — the instructions, examples, and context you give the model in each request. The model itself doesn't change. If the task is something the model can already do reasonably well but isn't doing exactly right, prompt engineering is almost always the right first move. It costs nothing to iterate, takes hours rather than weeks, and can produce substantial improvements across a wide range of tasks: document classification, email drafting, structured data extraction, content summarisation. The ceiling is the model's existing capability — if the base model fundamentally can't do what you need, no prompt will change that.
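The mechanics are simple enough to sketch. Here is a hypothetical few-shot prompt for support-ticket routing; the categories and example tickets are invented for illustration, and you would substitute your own taxonomy and representative examples.

```python
# A minimal sketch of a few-shot classification prompt. Everything here
# (categories, example tickets) is illustrative, not from a real system.

FEW_SHOT_EXAMPLES = [
    ("I can't log in to my account", "authentication"),
    ("When will my order arrive?", "shipping"),
    ("The invoice amount looks wrong", "billing"),
]

def build_routing_prompt(ticket: str) -> str:
    """Assemble a system instruction plus labelled examples for the model."""
    lines = [
        "You route customer support tickets into exactly one category:",
        "authentication, shipping, or billing.",
        "Reply with the category name only.",
        "",
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        lines += [f"Ticket: {text}", f"Category: {label}", ""]
    lines += [f"Ticket: {ticket}", "Category:"]
    return "\n".join(lines)

prompt = build_routing_prompt("My password reset email never arrived")
```

The whole intervention lives in this string: change the instructions or swap the examples, re-run, and inspect the outputs. That iteration loop measured in minutes is why prompting comes first.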

Retrieval-Augmented Generation (RAG) changes the context — instead of relying on the model's training data, it retrieves relevant documents from a knowledge base and includes them in the prompt. The model's weights don't change; what changes is what it can see when answering. RAG is the right architecture when the limiting factor is access to information: internal documentation, recent data, customer records, product specifications, proprietary research. It handles knowledge freshness naturally — update the knowledge base, and the model automatically works from the updated information without any retraining. The limitation is retrieval quality: the model can only work with what the retrieval layer surfaces.
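The shape of the architecture fits in a few lines. In this sketch the retrieval step is naive keyword overlap over an invented three-document knowledge base; a production pipeline would use embeddings and a vector store, but the structure, retrieve then splice into the prompt, is the same.

```python
# A minimal RAG sketch: rank documents against the question, then build
# a grounded prompt from the top matches. Documents are illustrative.

KNOWLEDGE_BASE = [
    "Returns are accepted within 30 days of delivery with proof of purchase.",
    "Standard shipping to EU addresses takes 3-5 business days.",
    "Enterprise plans include a dedicated account manager.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_rag_prompt(question: str) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(question))
    return (
        "Answer using only the context below. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_rag_prompt("How long does shipping to the EU take?")
```

Note that updating the answer means editing `KNOWLEDGE_BASE`, not retraining anything, which is exactly the freshness property described above.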

Fine-tuning changes the weights — the model is retrained on examples from your domain, which modifies how it processes inputs at a fundamental level. This is appropriate when the limiting factor isn't knowledge (RAG handles that) or framing (prompting handles that), but behaviour: the model needs to learn a pattern or style that can't be adequately expressed in a prompt or retrieved from a document.
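Concretely, supervised fine-tuning consumes a dataset of instruction-response pairs, commonly serialised as JSONL with one example per line. The chat-style `messages` shape below is the format used by several hosted fine-tuning services, though field names vary by provider; the legal-summary example is invented.

```python
# A minimal sketch of a supervised fine-tuning dataset in JSONL form:
# one JSON object per line, each pairing a prompt with the desired
# completion. The example content is illustrative only.

import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "Summarise clauses in plain English."},
            {"role": "user", "content": "The Licensee shall indemnify the Licensor against all claims."},
            {"role": "assistant", "content": "You agree to cover the licensor's losses if someone sues over your use."},
        ]
    },
]

# Serialise to JSONL, then parse it back as a round-trip sanity check.
jsonl = "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)
loaded = [json.loads(line) for line in jsonl.splitlines()]
```

A real dataset repeats this structure hundreds or thousands of times; the training run then adjusts the weights so the assistant turns become the model's default behaviour for that kind of input.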

When fine-tuning actually makes sense

Fine-tuning earns its overhead when one of these conditions holds.

Consistent tone or style at scale. If you're generating thousands of customer communications, legal clause summaries, or product descriptions and every one needs to match a specific voice, prompt-based style guidance works but adds tokens to every request. Baking the style into the weights once through fine-tuning is more efficient at scale and produces more consistent results across long generation runs.

High-frequency classification or extraction that prompt engineering can't nail. For narrow, high-frequency classification tasks — routing support tickets, detecting intent in short messages, extracting specific field types from semi-structured text — a fine-tuned smaller model often outperforms a prompted larger model at a fraction of the inference cost. The model has learned the classification boundary from examples rather than inferring it from instructions each time.

Edge or on-premises deployment where size and latency are constraints. Fine-tuning a small model (3B–7B parameters) for a specific task can produce something that runs on local hardware without internet access — relevant for manufacturing, healthcare, and legal environments where data confidentiality or latency requirements rule out cloud API calls. We covered the infrastructure for this in our post on small language models.

Teaching the model a genuinely new pattern. If your task type doesn't appear in the base model's training data — a proprietary document format, a domain-specific reasoning chain, a classification scheme that maps to no public taxonomy — fine-tuning is the way to instil that pattern. Prompting can't teach the model something it has no prior exposure to.

When fine-tuning is the wrong choice

The most common and expensive fine-tuning mistakes all follow the same logic: choosing fine-tuning because the technology sounds sophisticated, rather than because it addresses the actual bottleneck.

Fine-tuning when you need knowledge. A model fine-tuned on your product documentation doesn't "know" your products — it's learned stylistic patterns from examples. When the documentation changes, the fine-tuned model gives confidently wrong answers based on outdated training data. RAG with a maintained knowledge base handles this correctly and updates automatically. Fine-tuning is a snapshot; RAG is a live connection.

Fine-tuning to fix bad prompts. If a carefully engineered prompt can solve the problem, fine-tuning adds weeks of work and ongoing maintenance overhead without a qualitative step up. The test: invest serious time in prompt engineering first. Fine-tuning is for the gap between "good prompt" and "still not right."

Fine-tuning without enough data. Supervised fine-tuning requires a minimum of several hundred high-quality instruction-response pairs to produce reliable improvements; ideally several thousand. If you're working from a small, hastily assembled dataset, the fine-tuned model's quality will be erratic — sometimes better, sometimes worse, in ways that are hard to predict or debug. The data curation problem is almost always harder than the training problem.
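A few mechanical checks catch the worst dataset problems before any compute is spent. The thresholds below are illustrative starting points, not hard rules, and the function is a hypothetical helper, not part of any framework.

```python
# A quick sanity-check sketch for a fine-tuning dataset: enough examples,
# no exact duplicates, no wildly skewed label distribution.

from collections import Counter

def dataset_checks(pairs: list[tuple[str, str]], min_size: int = 500) -> list[str]:
    """Return a list of warnings; an empty list means the basics pass."""
    warnings = []
    if len(pairs) < min_size:
        warnings.append(f"only {len(pairs)} examples; aim for {min_size}+")
    if len(set(pairs)) < len(pairs):
        warnings.append("duplicate examples found")
    label_counts = Counter(label for _, label in pairs)
    if label_counts:
        most = max(label_counts.values())
        least = min(label_counts.values())
        if least and most / least > 10:
            warnings.append("label distribution is heavily skewed")
    return warnings

issues = dataset_checks([("ticket text", "billing")] * 3)
```

Checks like these are cheap; the expensive part, reviewing whether each response is actually the answer you want the model to learn, still requires human eyes.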

Fine-tuning when you need explainability. A fine-tuned model's behaviour is harder to audit than a prompted model's. With prompting, you can trace output back to specific instructions or retrieved context. With a fine-tuned model, behaviour is encoded in the weights in ways that aren't directly inspectable. For regulated contexts where you need to document why the model produced a specific output, RAG with cited sources or prompted chains-of-thought are more tractable architectures.

The GDPR and AI Act dimension

Fine-tuning has a compliance surface that prompting and RAG don't have — a distinction that gets too little attention in capability discussions.

When you fine-tune on company data, you're using that data as training material. Under GDPR, if the fine-tuning dataset contains personal data — customer interactions, employee records, any information relating to identifiable individuals — that processing requires a lawful basis under Article 6. Using customer support logs or HR data as training examples without a legal basis assessment, purpose limitation analysis, and data minimisation review is a compliance risk easy to overlook when the team is focused on benchmarks.

The EU AI Act adds another layer for fine-tuned models used in high-risk contexts. Annex III categories include employment decisions, access to essential services, and administration of justice. If a fine-tuned model feeds into any of these processes, the Act's requirements for risk management systems, technical documentation, logging, and human oversight apply to the modified model — not just to the commercial base model you started from. Modifying weights for a regulated use case triggers compliance obligations that apply from August 2026.

For most European businesses, the practical path is fine-tuning on synthetic data or carefully anonymised examples. The model learns the patterns it needs; the compliance surface stays manageable. This requires upfront investment in generating or anonymising training data, but it's significantly less expensive than retroactively addressing a GDPR gap after deployment.

What fine-tuning actually costs

Data preparation dominates the timeline. Creating a high-quality instruction-response dataset for your specific task — writing representative examples, reviewing quality, handling edge cases, balancing class distributions — typically takes more elapsed time than the training run itself. For a meaningful dataset of 1,000–5,000 examples, expect weeks of careful curation, not days.

Training costs scale with model size. Fine-tuning a 7B parameter model on a few thousand examples runs in hours on a single modern GPU. Fine-tuning a 70B+ model requires meaningfully more compute. Cloud fine-tuning services (OpenAI's fine-tuning API, AWS Bedrock, Google Vertex AI) abstract the infrastructure complexity, but the per-token costs add up at scale — and the resulting model remains hosted externally, which may conflict with data residency requirements.

Maintenance is ongoing. A fine-tuned model requires re-evaluation whenever the base model is updated, whenever task requirements shift, and whenever the distribution of real-world inputs drifts from the training distribution. Prompt-based approaches inherit base model improvements automatically; fine-tuned models don't. Every base model update is a potential retraining decision.

Evaluation is harder to instrument. Testing a prompt change takes minutes — run it on a sample set, inspect outputs. Evaluating a fine-tuning change requires maintaining a held-out test set and running full training-evaluation pipelines. This infrastructure has to be built and maintained as a first-class engineering concern.
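The core of that infrastructure is a held-out test set that never enters training, plus a metric computed identically for every candidate. In this sketch `baseline` is a stand-in callable from input text to label; in practice you would plug in your prompted baseline and your fine-tuned model and compare them on the same split.

```python
# A minimal held-out evaluation sketch: reserve a fixed, reproducible
# test split before training, then score any model (any text -> label
# callable) on it. Data and the baseline model are stand-ins.

import random

def split_holdout(examples, holdout_frac=0.2, seed=42):
    """Deterministically reserve a test set before any training happens."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    correct = sum(1 for text, label in test_set if model(text) == label)
    return correct / len(test_set)

data = [(f"ticket {i}", "billing" if i % 2 else "shipping") for i in range(100)]
train, test = split_holdout(data)

def baseline(text):
    # Stand-in for a prompted or fine-tuned model.
    return "billing"

score = accuracy(baseline, test)
```

The fixed seed matters: if the split drifts between evaluation runs, you can no longer tell whether a score change came from the model or from the data.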

A practical decision sequence

Before committing to fine-tuning, work through this in order:

1. Can a well-crafted prompt get you to acceptable quality? Invest real time — well-designed system prompt, few-shot examples, chain-of-thought instructions. If yes, ship the prompted solution and revisit fine-tuning when you have scale evidence it's needed.

2. Is the limiting factor knowledge rather than behaviour? If the model gets things wrong because it doesn't have access to the right information, build a RAG pipeline with a maintained knowledge base first.

3. Do you have enough high-quality training examples? 500 is a floor; 2,000–5,000 is realistic for reliable results. If the answer is no, invest in data curation first — fine-tuning on thin data produces unreliable results.

4. Does your training data involve personal data or regulated use cases? Resolve the GDPR and AI Act compliance architecture before training. Synthetic or anonymised data is usually the more sustainable path.

5. Are there latency, cost-at-scale, or on-premises requirements that justify the overhead? If yes, fine-tuning is a legitimate architectural decision — size it to the task, not to the most capable model available.
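The five questions above can be caricatured as a decision function. The boolean inputs and the 500-example threshold are simplifications for illustration; real decisions involve judgment, not flags.

```python
# The decision sequence, sketched as code. Each check mirrors one of
# the numbered questions, evaluated in order.

def choose_approach(
    prompt_is_good_enough: bool,
    bottleneck_is_knowledge: bool,
    num_quality_examples: int,
    compliance_resolved: bool,
    scale_or_latency_pressure: bool,
) -> str:
    if prompt_is_good_enough:
        return "prompt engineering"
    if bottleneck_is_knowledge:
        return "RAG"
    if num_quality_examples < 500:
        return "invest in data curation first"
    if not compliance_resolved:
        return "resolve GDPR / AI Act architecture first"
    if scale_or_latency_pressure:
        return "fine-tuning"
    return "re-examine the prompted baseline"

choice = choose_approach(False, False, 2000, True, True)
```

The ordering is the point: fine-tuning is only reached after every cheaper intervention has been ruled out.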

The organisations that get the most value from fine-tuning are the ones who've already shipped a prompted baseline, measured precisely where it falls short, and have both the data and the operational discipline to manage a fine-tuned model as a maintained artefact rather than a one-time build.

Evaluating whether fine-tuning is the right approach for your use case?

We help European businesses map their AI use cases to the right technical approach — whether that's prompt engineering, RAG, fine-tuning, or a combination — and build the proof of concept that validates the choice before committing to the full architecture. If you're working through this decision, a scoping conversation is the right starting point.

Let's talk about your AI architecture →
Lino Moretto
RAAS Impact

Drawing on more than 20 years of experience as a Fractional Innovation Manager, I love bridging diverse knowledge areas while fostering seamless collaboration among internal departments, external agencies, and providers. My approach combines a collaborative, engaging management style, strong negotiation skills, and a clear vision for preemptively addressing operational risks.

No guesswork.
No slide decks.
Just impact.

Ready to move from AI hype to a working system? In a free 30-minute call we'll identify your highest-impact use case and tell you exactly what it takes to get there.

No upfront cost · Italy · Malta · Europe · English & Italian