How to build a domain-specific LLM without training from scratch

A domain-specific LLM almost never means a model trained from a blank slate. It means adapting a strong base model to your field, and the real skill is picking the cheapest method that clears your accuracy bar. This guide walks the options ladder from prompting and RAG up to from-scratch training, the build path for the realistic case, what to budget, and how to evaluate before you commit.

By Kanika Mathur, Head of Service Delivery

Reviewed by Resourcifi engineering · Published May 26, 2026 · Updated May 26, 2026 · 12 min read

Key takeaways

The short version

A domain-specific LLM is usually an adapted base model rather than one trained from scratch. The decision that matters is which rung of the options ladder clears your accuracy bar at the lowest cost.
Match the method to the problem. Inject facts with RAG. Shape behavior with fine-tuning. Ovadia et al. (EMNLP 2024) found RAG consistently outperformed unsupervised fine-tuning for injecting new factual knowledge.
Fine-tuning is now cheap. QLoRA (Dettmers et al., 2023) fine-tuned a 65B-parameter model on a single 48GB GPU while preserving full 16-bit task performance, and AWS suggests starting with as few as 50 to 100 high-quality examples.
Training from scratch is the wrong default. BloombergGPT used a 700B+ token corpus and roughly 1.3 million GPU-hours on 512 A100s. The gap between that and one GPU is the whole argument for climbing the ladder instead of jumping to the top.
Do not skip evaluation. Build a held-out, in-domain test set and benchmark the tuned model against both the base model and the RAG baseline before declaring success.

The options ladder for a domain-specific LLM

A domain-specific LLM is a language model whose behavior or knowledge has been specialized for a narrow field, such as medicine, law, finance, or your own product, so it uses the right vocabulary and is more reliable on in-domain tasks than a general model. In practice it almost always means adapting an existing strong base model instead of training one from scratch. The four ways to get there form a ladder from cheapest and fastest to most expensive and most controlled: prompting plus RAG, fine-tuning or LoRA, continued pretraining, then training from scratch.

The honest thesis of this page is that most teams should climb the ladder one rung at a time and resist jumping to the top. Start at the bottom rung; move up only when evaluation proves the cheaper rung is insufficient. The vast majority of domain-specific LLM projects are solved at the first two rungs, and from-scratch training is almost never the right call for an individual company. The table below is the spine of the rest of this guide, and the chart that follows plots the same four rungs against cost and control.

The four rungs, from cheapest to most controlled

Each rung adds cost and control. Read top to bottom, and stop at the first rung that clears your accuracy bar.

The domain-specific LLM options ladder
Rung	What it is	When it is justified	Cost and effort
1. Prompting + RAG	Keep the base model frozen; inject domain context through system prompts, few-shot examples, and retrieval from your own documents at inference time	The gap is knowledge: facts, policies, and docs that change often, where you need source citations and have little training data	Lowest. Days to weeks, no GPUs to train
2. Fine-tune / LoRA (PEFT)	Update the model, or small adapter weights, on curated input-to-output examples to teach behavior, format, tone, or a narrow skill	The model knows the domain but will not behave the way you need, or latency, privacy, and offline use matter	Moderate. Adapters trainable on one GPU
3. Continued pretraining	A second pretraining phase on a large corpus of raw in-domain text, then optionally fine-tune	The domain language itself is far from the base model distribution, with rare jargon or specialized notation, and you hold a large unlabeled corpus	High. Multi-GPU, large corpus
4. Train from scratch	Build a foundation model from a fresh initialization	Almost never for one company. Only when no suitable base exists and you have web-scale data, a large ML org, and seven-figure compute	Highest. Millions of dollars, months

Source: Resourcifi delivery framework, with method definitions following HuggingFace PEFT documentation and the LoRA and QLoRA papers cited below.

Cost and effort versus control as you climb

A stepped view of the same four rungs. The y-axis is degree of model control and specialization; the x-axis is cost and effort. The step up to rung four is far larger than the steps below it.

Source: Resourcifi delivery framework. Rung annotations reflect the QLoRA single-GPU result and the BloombergGPT compute figures cited in this guide.

The decision rule to carry through the whole build is short. Inject facts with RAG. Shape behavior with fine-tuning. Adapt the language itself with continued pretraining. Train from scratch only if you are at the scale of a Bloomberg and no base model fits, and even Bloomberg started from a known architecture rather than a blank slate. When the goal is a tuned model rather than a guide, that scoping is the work our custom LLM development team does first.
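The decision rule above can be written down as a few lines of code. This is a minimal sketch, assuming evaluation has already told you which kind of gap you have; the `Gap` fields and the `recommend` function are hypothetical names invented for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Gap:
    """What evaluation says is missing. Field names are hypothetical."""
    needs_new_facts: bool = False         # facts, policies, or docs the model lacks
    needs_behavior_change: bool = False   # format, tone, or a narrow skill is off
    language_far_from_base: bool = False  # rare jargon or specialized notation
    no_suitable_base: bool = False        # no usable base model exists at all


def recommend(gap: Gap) -> list[str]:
    """Map a diagnosed gap to rungs of the ladder, cheapest first."""
    if gap.no_suitable_base:
        return ["train from scratch"]
    rungs = []
    if gap.needs_new_facts:
        rungs.append("prompting + RAG")
    if gap.needs_behavior_change:
        rungs.append("fine-tune / LoRA")
    if gap.language_far_from_base:
        rungs.append("continued pretraining")
    # With no diagnosed gap, start at the bottom rung and measure.
    return rungs or ["prompting + RAG"]


print(recommend(Gap(needs_new_facts=True, needs_behavior_change=True)))
# → ['prompting + RAG', 'fine-tune / LoRA']
```

Returning a list rather than a single rung reflects that the methods combine: a gap in both facts and behavior points to a hybrid, not a choice between the two.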

Knowledge versus behavior, the axis that decides the rung

The single most useful question is whether your gap is knowledge or behavior. If the model needs facts it does not have, reach for RAG. If the model has the knowledge but will not behave the way you need, a consistent format, a house style, a classification, or a tool-call shape, reach for fine-tuning. Research backs this split directly: for injecting new factual knowledge, retrieval beats unsupervised fine-tuning.

Ovadia et al., in Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs (EMNLP 2024), found that RAG consistently outperformed unsupervised fine-tuning for both knowledge seen during training and entirely new knowledge, and that models struggle to learn new factual information through unsupervised fine-tuning.¹ A separate 2024 study, FineTuneBench, reinforced the point: commercial fine-tuning APIs were limited at infusing genuinely new knowledge.²

The practical takeaway is that fine-tuning is for behavior and RAG is for facts, and the strongest production architecture is frequently a hybrid: fine-tune the model for behavior and format, then attach RAG for the facts that change. We unpack that trade in depth in our companion guide on fine-tuning versus RAG, and the broader picture of tool-using systems built on these models lives in the AI agents guide.

The build path for the realistic case

For the rung most teams land on, fine-tuning a strong base model, the build path has five stages: curate the data, choose the method, train, evaluate, and deploy. Data curation is the make-or-break step, and parameter-efficient methods like LoRA make the training stage far cheaper than most teams expect.

Work through it in order.

Data curation. For behavior fine-tuning you need input-to-output pairs that look exactly like production traffic, and quality matters far more than volume. AWS guidance for fine-tuning Anthropic's Claude 3 Haiku in Amazon Bedrock recommends starting with a small but high-quality dataset, calling 50 to 100 rows a reasonable start, and stresses that a clean, high-quality dataset is of paramount importance.³ The concrete steps are to dedupe, decontaminate by removing anything that overlaps your eval set, normalize the format, balance classes, hold out a test split before training, and have domain experts review samples.
Choose the method. Default to parameter-efficient fine-tuning. HuggingFace's PEFT library fine-tunes only a small number of extra parameters, which it states decreases computational and storage costs while yielding performance comparable to a fully fine-tuned model, making it possible to train and store large models on consumer hardware.⁴ LoRA (Hu et al., 2021) injects trainable low-rank matrices while freezing the original weights, and the paper reports up to 10,000 times fewer trainable parameters and roughly 3 times less GPU memory versus full fine-tuning on the models it tested.⁵ QLoRA (Dettmers et al., 2023) backpropagates through a frozen 4-bit model into LoRA adapters, enough to fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit task performance.⁶ Full fine-tuning is worth the cost only when you have very large high-quality data and adapters demonstrably underperform; a 2024 study even argues LoRA and full fine-tuning can differ structurally, so verify on your own eval instead of assuming equivalence.⁷
Train. Pick a base model whose license permits commercial use and your deployment mode, then configure hyperparameters. AWS calls out that careful adjustment of the learning-rate multiplier and batch size plays a crucial role.³ Watch for overfitting on small datasets by monitoring validation loss and using early stopping.
Evaluate. Compare the tuned model against the base model and against the RAG baseline on a held-out, in-domain eval set before declaring success. The section on how to evaluate a domain-specific LLM, later in this guide, covers how.
Deploy. Choose between a managed inference service that hosts the adapter weights and a self-hosted GPU serving stack for privacy, latency, or cost at scale. LoRA adapters are small and can be swapped on top of a shared base, so many domain variants are cheap to host. Add guardrails, monitoring, and a feedback loop that collects new training data from production.
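The data-curation stage above (dedupe, decontaminate against the eval set, hold out a test split) can be sketched in a few lines. This is a minimal illustration, assuming each example is a dict with an `"input"` key and that exact match after whitespace and case normalization is a good enough duplicate test. Real pipelines often add near-duplicate detection, and the function names here are hypothetical.

```python
import hashlib
import random


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def curate(examples, eval_inputs, test_fraction=0.2, seed=0):
    """Dedupe, drop anything that overlaps the eval set, then split.

    Returns (train, test). The split happens before any training.
    """
    blocked = {fingerprint(t) for t in eval_inputs}
    seen = set()
    kept = []
    for ex in examples:
        fp = fingerprint(ex["input"])
        if fp in blocked or fp in seen:
            continue  # contaminated or duplicate
        seen.add(fp)
        kept.append(ex)
    random.Random(seed).shuffle(kept)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(kept) * test_fraction)) if kept else 0
    return kept[n_test:], kept[:n_test]


examples = [
    {"input": "Reset my password", "output": "A"},
    {"input": "reset  my PASSWORD", "output": "A"},  # duplicate after normalizing
    {"input": "Cancel order", "output": "B"},        # overlaps the eval set
    {"input": "Track parcel", "output": "C"},
    {"input": "Refund status", "output": "D"},
    {"input": "Change address", "output": "E"},
]
train, test = curate(examples, eval_inputs=["cancel order"], test_fraction=0.25)
print(len(train), len(test))  # → 3 1
```

Class balancing and expert review are left out on purpose: both depend on the task and do not reduce to a generic function.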

The data and compute to actually budget

The gap between fine-tuning and from-scratch training is enormous, and it is the entire argument for climbing the ladder. Fine-tuning with LoRA or QLoRA can run on a single GPU with hundreds of high-quality examples. Training from scratch needs hundreds of GPUs and hundreds of billions of tokens. BloombergGPT, a real domain model for finance, used a corpus of more than 700 billion tokens and roughly 1.3 million GPU-hours on 512 A100s, taking about 53 days.

Set expectations against the numbers below. The first two rows are within reach of a small team; the bottom row is a different category of project entirely.

What each rung demands

Directional requirements by approach. Fine-tuning figures are anchored to the Bedrock guidance and the QLoRA result; from-scratch figures are BloombergGPT's published numbers.

Data and compute by approach
Approach	Data needed	Compute (directional)
Prompting + RAG	Your existing documents and knowledge base; no labeled training data	Inference only; embed and index your corpus
LoRA / QLoRA fine-tune	Hundreds to low thousands of high-quality pairs (AWS suggests starting near 50 to 100, then scaling as eval demands)	Often a single GPU; QLoRA fine-tuned a 65B model on one 48GB GPU
Continued pretraining	A large raw in-domain corpus, typically billions of tokens, varying widely by domain	Multi-GPU cluster, days
Train from scratch	Web-scale corpus. BloombergGPT used 700B+ tokens, 363B of them finance-specific	BloombergGPT: roughly 1.3M GPU-hours on 512 A100s, about 53 days

Sources: AWS Bedrock fine-tuning guidance (2024), QLoRA (Dettmers et al., 2023), and BloombergGPT (Wu et al., 2023). Industry estimates put from-scratch training in the multi-million-dollar range; treat that as directional guidance and not a quoted figure.
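A quick back-of-envelope calculation shows why 4-bit quantization is what brings a 65B model within reach of one 48GB GPU. The sketch below counts memory for the weights alone, which is an assumption made for simplicity: it ignores the LoRA adapters, activations, optimizer state, and quantization overhead, so real usage is higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


params = 65e9  # the 65B-parameter model from the QLoRA result

fp16 = weight_memory_gb(params, 16)
int4 = weight_memory_gb(params, 4)

print(f"16-bit weights: {fp16:.1f} GB")  # → 16-bit weights: 130.0 GB
print(f"4-bit weights:  {int4:.1f} GB")  # → 4-bit weights:  32.5 GB
print("fits in 48 GB at 4-bit:", int4 < 48)   # → True
print("fits in 48 GB at 16-bit:", fp16 < 48)  # → False
```

At 16 bits the weights alone exceed a single 48GB card several times over. At 4 bits they leave roughly 15 GB of headroom for everything else.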

Costs follow the same shape. RAG carries the lowest build cost but recurring inference and retrieval infrastructure. LoRA and QLoRA add a modest one-time training cost, feasible on a single GPU, and the small adapters are cheap to serve. Continued pretraining is a large one-time compute spend justified only by a genuine vocabulary or distribution gap. From-scratch training sits in the multi-million-dollar range by industry estimates, which is why it is the wrong default for nearly every company. Steering a client to the cheapest rung that meets the bar is the core of how we approach custom LLM development.
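The cost shape is easier to see with the arithmetic spelled out. This sketch converts GPU counts and days into GPU-hours and then into dollars. The hourly rate is a placeholder assumption, not a quoted price, and the one-GPU, one-day fine-tuning run is likewise a hypothetical for scale. Note that 512 GPUs running for 53 days comes to about 0.65 million GPU-hours, roughly half of the 1.3 million cited, so the larger figure presumably covers a wider compute budget than that single run.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Wall-clock GPU-hours for a run that keeps every GPU busy."""
    return num_gpus * days * 24


ASSUMED_USD_PER_GPU_HOUR = 2.00  # hypothetical rate; substitute your own

scenarios = [
    ("LoRA fine-tune, 1 GPU x 1 day (hypothetical)", gpu_hours(1, 1)),
    ("512 A100s x 53 days", gpu_hours(512, 53)),
    ("BloombergGPT figure cited in this guide", 1_300_000),
]

for label, hours in scenarios:
    cost = hours * ASSUMED_USD_PER_GPU_HOUR
    print(f"{label}: {hours:,.0f} GPU-hours, about ${cost:,.0f}")
```

Whatever rate you plug in, the ratio between the first row and the last is on the order of tens of thousands to one, which is the point of the comparison.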

How to evaluate a domain-specific LLM

General benchmarks like MMLU measure broad knowledge, not your domain, so they cannot tell you whether your model is ready. You need a held-out, in-domain eval set, supplemented by public benchmarks for your domain where they exist, and you should always compare the tuned model against the base model and the RAG baseline on the same set.

Real, citable domain benchmarks exist for the regulated fields where domain models are most common. Medical work has MedQA, MedMCQA, and MultiMedQA. Legal reasoning has LegalBench. Finance has FinBen and FinEval.⁸ Use them for orientation, but treat your own held-out set as the deciding test, because public benchmarks rarely match your exact task distribution.

What to measure: task accuracy on your eval split, format and schema validity, faithfulness for RAG (does the answer trace to a retrieved source), latency, and cost per request. The discipline that separates a model that ships from one that quietly hallucinates is treating evaluation as a gate, not a formality, and measuring the cheaper rung as the baseline every time you consider climbing higher.
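Treating evaluation as a gate can look like the sketch below. It scores several systems on the same held-out set for task accuracy and format validity, then passes the candidate only if it clears the bar and beats every cheaper baseline. The assumptions are stated plainly: each system is a hypothetical callable from input text to output text, the expected output is JSON with a `"label"` key, and the function names are invented for this example. Faithfulness, latency, and cost per request are left out for brevity.

```python
import json
from typing import Callable


def parse_label(text: str):
    """Return the "label" field if the output is valid JSON with one, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    return obj.get("label") if isinstance(obj, dict) else None


def evaluate(model: Callable[[str], str], eval_set: list[dict]) -> dict:
    """Accuracy and format validity over a held-out set."""
    valid = correct = 0
    for ex in eval_set:
        label = parse_label(model(ex["input"]))
        if label is not None:
            valid += 1
            correct += label == ex["expected_label"]
    n = len(eval_set)
    return {"accuracy": correct / n, "format_validity": valid / n}


def compare(systems: dict[str, Callable[[str], str]], eval_set: list[dict]) -> dict:
    """Score every system on the same examples."""
    return {name: evaluate(model, eval_set) for name, model in systems.items()}


def clears_gate(report: dict, candidate: str, baselines: list[str], bar: float) -> bool:
    """Pass only if the candidate meets the bar and beats each cheaper baseline."""
    acc = report[candidate]["accuracy"]
    return acc >= bar and all(acc > report[b]["accuracy"] for b in baselines)
```

Usage is `report = compare({"base": base, "rag": rag, "tuned": tuned}, eval_set)` followed by `clears_gate(report, "tuned", ["base", "rag"], bar=0.9)`, where the three callables and the 0.9 bar are placeholders for your own systems and threshold.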

Frequently asked

Domain-specific LLM questions

What is a domain-specific LLM?

A domain-specific LLM is a language model adapted to a particular field, such as medicine, law, or finance, so it uses correct terminology and is more reliable on in-domain tasks than a general model. It is usually an adapted general model, reached through RAG, fine-tuning, or continued pretraining instead of being built from scratch.

Do I need to train an LLM from scratch for my industry?

Almost never. Training from scratch costs millions and needs web-scale data; BloombergGPT used a corpus of more than 700 billion tokens and roughly 1.3 million GPU-hours on 512 A100s. For nearly all companies, RAG or fine-tuning on a strong base model delivers the result at a tiny fraction of that cost.

What is the difference between fine-tuning and RAG for a domain LLM?

Fine-tuning changes the model weights to shape behavior, style, and format, while RAG retrieves your documents at inference to supply facts. Ovadia et al. (2024) found that RAG is the better way to inject new knowledge; fine-tuning remains the tool for consistent behavior, and the two are often combined in production.

How much data do I need to fine-tune a domain-specific LLM?

With parameter-efficient methods like LoRA you can start small. AWS suggests 50 to 100 high-quality examples as a reasonable starting point for fine-tuning in Amazon Bedrock, then scaling up based on evaluation. Quality and representativeness matter far more than raw volume.

What do LoRA and PEFT do, and why do they matter?

LoRA freezes the base model and trains small low-rank adapter matrices, cutting trainable parameters and memory sharply; the LoRA paper reports up to 10,000 times fewer trainable parameters and about 3 times less GPU memory. QLoRA extends this to fine-tune a 65B-parameter model on a single 48GB GPU, which is what makes domain fine-tuning affordable.
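The parameter saving follows directly from the shape of the adapter. For a frozen weight matrix of size d_out × d_in, LoRA trains two low-rank factors of sizes d_out × r and r × d_in. The sketch below counts both for a single matrix. The 4096 dimensions and rank 8 are hypothetical values chosen for illustration. The per-matrix ratio here is not the paper's headline figure, which refers to the specific models and configurations it tested.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


d_out = d_in = 4096  # hypothetical projection size
rank = 8             # hypothetical LoRA rank

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tune: {full:,} trainable parameters")  # → 16,777,216
print(f"LoRA adapter:   {lora:,} trainable parameters")  # → 65,536
print(f"reduction:      {full // lora}x")                # → 256x
```

The reduction grows with matrix size and shrinks with rank, and it grows further when adapters are attached to only some of a model's matrices while the rest stay frozen.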

How long does it take to build a domain-specific LLM?

It depends on which rung of the ladder you land on. A RAG prototype over your existing documents can be running in days. A LoRA fine-tuning run on a clean dataset of a few hundred pairs typically takes one to three weeks end to end, including data curation, training, and evaluation. Continued pretraining on a large corpus adds weeks of GPU cluster time. Training from scratch, as BloombergGPT showed, takes around 53 days on 512 A100s, plus months of data preparation. For most projects the realistic timeline is two to eight weeks from scoping to a production-ready fine-tuned model.

Kanika Mathur

Head of Service Delivery, Resourcifi

Kanika Mathur is Head of Service Delivery at Resourcifi, where her engineering pods adapt base models to client domains through RAG, LoRA fine-tuning, and evaluation harnesses built on held-out, in-domain test sets. She has scoped the data-curation and eval reviews that decide whether a model ships reliably in production or quietly hallucinates, and that economics-first lens is the one this guide is written from.
