How to build a domain-specific LLM without training from scratch
A domain-specific LLM almost never means a model trained from a blank slate. It means adapting a strong base model to your field, and the real skill is picking the cheapest method that clears your accuracy bar. This guide walks the options ladder from prompting and RAG up to from-scratch training, the build path for the realistic case, what to budget, and how to evaluate before you commit.

The short version
- A domain-specific LLM is usually an adapted base model rather than one trained from scratch. The decision that matters is which rung of the options ladder clears your accuracy bar at the lowest cost.
- Match the method to the problem. Inject facts with RAG. Shape behavior with fine-tuning. Ovadia et al. (EMNLP 2024) found RAG consistently outperformed unsupervised fine-tuning for injecting new factual knowledge.
- Fine-tuning is now cheap. QLoRA (Dettmers et al., 2023) fine-tuned a 65B-parameter model on a single 48GB GPU while preserving the task performance of full 16-bit fine-tuning, and AWS suggests starting with as few as 50 to 100 high-quality examples.
- Training from scratch is the wrong default. BloombergGPT used a 700B+ token corpus and roughly 1.3 million GPU-hours on 512 A100s. The gap between that and one GPU is the whole argument for climbing the ladder instead of jumping to the top.
- Do not skip evaluation. Build a held-out, in-domain test set and benchmark the tuned model against both the base model and the RAG baseline before declaring success.
The options ladder for a domain-specific LLM
A domain-specific LLM is a language model whose behavior or knowledge has been specialized for a narrow field, such as medicine, law, finance, or your own product, so it uses the right vocabulary and is more reliable on in-domain tasks than a general model. In practice it almost always means adapting an existing strong base model instead of training one from scratch. The four ways to do that form a ladder from cheapest and fastest to most expensive and most controlled: prompting plus RAG, fine-tuning or LoRA, continued pretraining, then training from scratch.
The honest thesis of this page is that most teams should climb the ladder one rung at a time and resist jumping to the top. Start at the bottom rung; move up only when evaluation proves the cheaper rung is insufficient. The vast majority of domain-specific LLM projects are solved at the first two rungs, and from-scratch training is almost never the right call for an individual company. The table below is the spine of the rest of this guide, and the chart that follows plots the same four rungs against cost and control.
| Rung | What it is | When it is justified | Cost and effort |
|---|---|---|---|
| 1. Prompting + RAG | Keep the base model frozen; inject domain context through system prompts, few-shot examples, and retrieval from your own documents at inference time | The gap is knowledge: facts, policies, and docs that change often, where you need source citations and have little training data | Lowest. Days to weeks, no GPUs to train |
| 2. Fine-tune / LoRA (PEFT) | Update the model, or small adapter weights, on curated input-to-output examples to teach behavior, format, tone, or a narrow skill | The model knows the domain but will not behave the way you need, or latency, privacy, and offline use matter | Moderate. Adapters trainable on one GPU |
| 3. Continued pretraining | A second pretraining phase on a large corpus of raw in-domain text, then optionally fine-tune | The domain language itself is far from the base model distribution, with rare jargon or specialized notation, and you hold a large unlabeled corpus | High. Multi-GPU, large corpus |
| 4. Train from scratch | Build a foundation model from a fresh initialization | Almost never for one company. Only when no suitable base exists and you have web-scale data, a large ML org, and seven-figure compute | Highest. Millions of dollars, months |
The decision rule to carry through the whole build is short. Inject facts with RAG. Shape behavior with fine-tuning. Adapt the language itself with continued pretraining. Train from scratch only if you are at the scale of a Bloomberg and no base model fits, and even Bloomberg started from a known architecture rather than a blank slate. When the goal is a tuned model rather than a guide, that scoping is the work our custom LLM development team does first.
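The decision rule above can be written down as a few lines of Python. This is a minimal sketch, not a definitive procedure: the `Gap` fields and the `recommend_rungs` function are hypothetical names introduced here for illustration, and the answers to each field still have to come from your own evaluation.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    """Hypothetical description of what the base model is missing."""
    needs_new_facts: bool = False
    needs_behavior_change: bool = False
    language_far_from_base: bool = False
    has_large_raw_corpus: bool = False
    suitable_base_exists: bool = True
    has_web_scale_resources: bool = False


def recommend_rungs(gap: Gap) -> list[str]:
    """Return the rungs to try, cheapest first, following the decision rule."""
    # From scratch only when no base model fits AND the resources exist.
    if not gap.suitable_base_exists and gap.has_web_scale_resources:
        return ["train from scratch"]
    # The bottom rung is always the starting point and the baseline to beat.
    rungs = ["prompting + RAG"]
    if gap.needs_behavior_change:
        rungs.append("fine-tune / LoRA")
    if gap.language_far_from_base and gap.has_large_raw_corpus:
        rungs.append("continued pretraining")
    return rungs


print(recommend_rungs(Gap(needs_new_facts=True)))
print(recommend_rungs(Gap(needs_behavior_change=True)))
```

The function always returns the bottom rung first because every higher rung is only justified once evaluation shows the cheaper one falls short.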
Knowledge versus behavior, the axis that decides the rung
The single most useful question is whether your gap is knowledge or behavior. If the model needs facts it does not have, reach for RAG. If the model has the knowledge but will not behave the way you need (a consistent format, a house style, a classification, or a tool-call shape), reach for fine-tuning. Research backs this split directly: for injecting new factual knowledge, retrieval beats unsupervised fine-tuning.
Ovadia et al., in Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs (EMNLP 2024), found that RAG consistently outperformed unsupervised fine-tuning for both knowledge seen during training and entirely new knowledge, and that models struggle to learn new factual information through unsupervised fine-tuning [1]. A separate 2024 study, FineTuneBench, reinforced the point: commercial fine-tuning APIs were limited at infusing genuinely new knowledge [2].
The practical takeaway is that fine-tuning is for behavior and RAG is for facts, and the strongest production architecture is frequently a hybrid: fine-tune the model for behavior and format, then attach RAG for the facts that change. We unpack that trade in depth in our companion guide on fine-tuning versus RAG, and the broader picture of tool-using systems built on these models lives in the AI agents guide.
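The retrieval half of that hybrid can be sketched in plain Python. The example below is illustrative only: it uses keyword overlap as a stand-in for the embedding search a production RAG system would use, the document ids and texts are made up, and the call to the fine-tuned model is deliberately left out because it depends on your serving stack.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Return up to k (doc_id, text) pairs ranked by shared-word count."""
    q = tokenize(query)
    ranked = sorted(docs.items(), key=lambda kv: len(q & tokenize(kv[1])), reverse=True)
    # Drop documents that share no words with the query at all.
    return [(doc_id, text) for doc_id, text in ranked[:k] if q & tokenize(text)]


def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    """Put the changing facts in the prompt, tagged with ids so answers can cite them."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )


# Hypothetical knowledge base.
docs = {
    "refund-policy": "Refunds are issued within 14 days of purchase.",
    "shipping": "Orders ship within 2 business days.",
}
question = "How many days do refunds take?"
print(build_prompt(question, retrieve(question, docs, k=1)))
```

The prompt would then go to the fine-tuned model, which supplies the format and tone while the retrieved passages supply the facts.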
The build path for the realistic case
For the rung most teams land on, fine-tuning a strong base model, the build path has five stages: curate the data, choose the method, train, evaluate, and deploy. Data curation is the make-or-break step, and parameter-efficient methods like LoRA make the training stage far cheaper than most teams expect.
Work through it in order.
- Data curation. For behavior fine-tuning you need input-to-output pairs that look exactly like production traffic, and quality matters far more than volume. AWS guidance for fine-tuning Anthropic's Claude 3 Haiku in Amazon Bedrock recommends starting with a small but high-quality dataset, calling 50 to 100 rows a reasonable start, and stresses that a clean, high-quality dataset is of paramount importance [3]. The concrete steps are to dedupe, decontaminate by removing anything that overlaps your eval set, normalize the format, balance classes, hold out a test split before training, and have domain experts review samples.
- Choose the method. Default to parameter-efficient fine-tuning. HuggingFace's PEFT library fine-tunes only a small number of extra parameters, which it states decreases computational and storage costs while yielding performance comparable to a fully fine-tuned model, making it possible to train and store large models on consumer hardware [4]. LoRA (Hu et al., 2021) injects trainable low-rank matrices while freezing the original weights, and the paper reports up to 10,000 times fewer trainable parameters and roughly 3 times less GPU memory versus full fine-tuning on the models it tested [5]. QLoRA (Dettmers et al., 2023) backpropagates through a frozen 4-bit model into LoRA adapters, enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving the task performance of full 16-bit fine-tuning [6]. Full fine-tuning is worth the cost only when you have a very large, high-quality dataset and adapters demonstrably underperform; a 2024 study even argues LoRA and full fine-tuning can differ structurally, so verify on your own eval instead of assuming equivalence [7].
- Train. Pick a base model whose license permits commercial use and your deployment mode, then configure hyperparameters. AWS calls out that careful adjustment of the learning-rate multiplier and batch size plays a crucial role [3]. Watch for overfitting on small datasets by monitoring validation loss and using early stopping.
- Evaluate. Compare the tuned model against the base model and against the RAG baseline on a held-out, in-domain eval set before declaring success. The section "How to evaluate a domain-specific LLM" below covers how.
- Deploy. Choose between a managed inference service that hosts the adapter weights and a self-hosted GPU serving stack for privacy, latency, or cost at scale. LoRA adapters are small and can be swapped on top of a shared base, so many domain variants are cheap to host. Add guardrails, monitoring, and a feedback loop that collects new training data from production.
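The mechanical parts of the data-curation stage (dedupe, decontaminate against the eval set, hold out a test split) can be sketched with the standard library alone. This is one possible shape under stated assumptions: examples are dicts with an `input` key, matching is exact after whitespace and case normalization, and the `curate` helper and its parameters are hypothetical. Near-duplicate detection, class balancing, and expert review are not covered.

```python
import hashlib
import random


def fingerprint(example: dict) -> str:
    """Hash of the input after collapsing whitespace and case."""
    normalized = " ".join(example["input"].lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def curate(examples, eval_examples, test_fraction=0.2, seed=0):
    """Dedupe, drop anything that overlaps the eval set, then split train/test."""
    eval_keys = {fingerprint(e) for e in eval_examples}
    seen, kept = set(), []
    for ex in examples:
        key = fingerprint(ex)
        if key in seen or key in eval_keys:
            continue  # exact duplicate, or contaminates the eval set
        seen.add(key)
        kept.append(ex)
    # Shuffle with a fixed seed so the held-out split is reproducible.
    random.Random(seed).shuffle(kept)
    n_test = max(1, int(len(kept) * test_fraction))
    return kept[n_test:], kept[:n_test]  # (train, test)


raw = [
    {"input": "Classify: wire transfer flagged", "output": "review"},
    {"input": "classify:  wire transfer flagged ", "output": "review"},  # duplicate
    {"input": "Classify: card declined", "output": "ok"},
    {"input": "Classify: new payee added", "output": "review"},
    {"input": "Classify: login from new device", "output": "review"},
    {"input": "Classify: salary deposit", "output": "ok"},
    {"input": "Classify: refund issued", "output": "ok"},
]
eval_set = [{"input": "Classify: card declined", "output": "ok"}]
train, test = curate(raw, eval_set)
print(len(train), len(test))
```

Splitting before any training run matters more than the exact matching rule: once a test example has been seen in training, the eval numbers stop meaning anything.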
The data and compute to actually budget
The gap between fine-tuning and from-scratch training is enormous, and it is the entire argument for climbing the ladder. Fine-tuning with LoRA or QLoRA can run on a single GPU with hundreds of high-quality examples. Training from scratch needs hundreds of GPUs and hundreds of billions of tokens. BloombergGPT, a real domain model for finance, used a corpus of more than 700 billion tokens and roughly 1.3 million GPU-hours on 512 A100s, taking about 53 days.
Set expectations against the numbers below. The first two rows are within reach of a small team; the bottom row is a different category of project entirely.
| Approach | Data needed | Compute (directional) |
|---|---|---|
| Prompting + RAG | Your existing documents and knowledge base; no labeled training data | Inference only; embed and index your corpus |
| LoRA / QLoRA fine-tune | Hundreds to low thousands of high-quality pairs (AWS suggests starting near 50 to 100, then scaling as eval demands) | Often a single GPU; QLoRA fine-tuned a 65B model on one 48GB GPU |
| Continued pretraining | A large raw in-domain corpus, typically billions of tokens, varying widely by domain | Multi-GPU cluster, days |
| Train from scratch | Web-scale corpus. BloombergGPT used 700B+ tokens, 363B of them finance-specific | BloombergGPT: roughly 1.3M GPU-hours on 512 A100s, about 53 days |
Costs follow the same shape. RAG carries the lowest build cost but recurring inference and retrieval infrastructure. LoRA and QLoRA add a modest one-time training cost, feasible on a single GPU, and the small adapters are cheap to serve. Continued pretraining is a large one-time compute spend justified only by a genuine vocabulary or distribution gap. From-scratch training sits in the multi-million-dollar range by industry estimates, which is why it is the wrong default for nearly every company. Steering a client to the cheapest rung that meets the bar is the core of how we approach custom LLM development.
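A little arithmetic shows why the single-GPU figures are plausible. The sketch below is a back-of-the-envelope estimate, not a sizing tool: it counts weight storage only (no activations, optimizer state, or quantization overhead), and the 4096 by 4096 layer and rank 8 are hypothetical values chosen for illustration.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank matrices, (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


n_params = 65_000_000_000  # the 65B model from the QLoRA result

print(weight_memory_gb(n_params, 16))  # 130.0, far beyond one 48GB GPU
print(weight_memory_gb(n_params, 4))   # 32.5, leaves headroom on a 48GB GPU

# Hypothetical single projection layer, 4096 x 4096, adapted at rank 8.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, rank=8)
print(full, adapter, full // adapter)  # 16777216 65536 256
```

The same per-layer ratio is why adapters are small to store and cheap to swap on top of a shared base model.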
How to evaluate a domain-specific LLM
General benchmarks like MMLU measure broad knowledge, not your domain, so they cannot tell you whether your model is ready. You need a held-out, in-domain eval set, supplemented where they exist by public domain benchmarks, and you should always compare the tuned model against the base model and the RAG baseline on the same set.
Real, citable domain benchmarks exist for the regulated fields where domain models are most common. Medical work has MedQA, MedMCQA, and MultiMedQA. Legal reasoning has LegalBench. Finance has FinBen and FinEval [8]. Use them for orientation, but treat your own held-out set as the deciding test, because public benchmarks rarely match your exact task distribution.
What to measure: task accuracy on your eval split, format and schema validity, faithfulness for RAG (does the answer trace to a retrieved source), latency, and cost per request. The discipline that separates a model that ships from one that quietly hallucinates is treating evaluation as a gate, not a formality, and measuring the cheaper rung as the baseline every time you consider climbing higher.
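Treating evaluation as a gate can be as simple as scoring every candidate on the same held-out set and refusing to climb unless the higher rung beats the best cheaper baseline. The sketch below assumes a classification-style task where each model returns a JSON string with a `label` field; the function names, the `label` key, and the `min_gain` threshold are all hypothetical and should be replaced with your own task and bar.

```python
import json


def parse(output):
    """Return the parsed dict if the output is a JSON object, else None."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None


def score(predictions, references, required_keys=frozenset({"label"})):
    """Task accuracy and schema validity over one held-out eval set."""
    valid = correct = 0
    for pred, ref in zip(predictions, references):
        parsed = parse(pred)
        if parsed is None or not required_keys.issubset(parsed):
            continue  # malformed output counts as wrong
        valid += 1
        correct += parsed["label"] == ref
    n = len(references)
    return {"accuracy": correct / n, "schema_validity": valid / n}


def passes_gate(candidate, baselines, min_gain=0.02):
    """Climb only if the candidate beats the best cheaper rung by min_gain."""
    best = max(b["accuracy"] for b in baselines)
    return candidate["accuracy"] >= best + min_gain


refs = ["fraud", "ok", "ok", "ok"]
base = score(['{"label": "ok"}', '{"label": "ok"}', '{"label": "fraud"}', "oops"], refs)
tuned = score(['{"label": "fraud"}', '{"label": "ok"}', "not json", '{"label": "ok"}'], refs)
print(base, tuned, passes_gate(tuned, [base]))
```

Faithfulness, latency, and cost per request would be tracked alongside these two numbers; they are omitted here to keep the example short.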
Sources
- [1] Ovadia et al., Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs, EMNLP 2024.
- [2] FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs? (2024).
- [3] AWS, Best practices for fine-tuning Anthropic’s Claude 3 Haiku on Amazon Bedrock (2024).
- [4] HuggingFace, PEFT (Parameter-Efficient Fine-Tuning) documentation (accessed 2026).
- [5] Hu et al., LoRA: Low-Rank Adaptation of Large Language Models (2021).
- [6] Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs (2023).
- [7] LoRA vs Full Fine-tuning: An Illusion of Equivalence (2024).
- [8] Wu et al., BloombergGPT: A Large Language Model for Finance (2023). Domain benchmarks: MedQA, MedMCQA, MultiMedQA (medical), LegalBench (legal), FinBen, FinEval (finance).
