How to train a custom AI model, and whether you should

Most teams asking how to train an AI model do not need to train anything. The honest first step is ruling it out: a hosted API, a better prompt, or retrieval will solve the task faster and cheaper in the common case. This guide gives you the decision framework, the full seven-stage workflow, the data and compute you will actually need, and the pitfalls that quietly wreck a model.

By Kanika Mathur, Head of Service Delivery

Reviewed by Resourcifi engineering | Published Feb 28, 2026 | Updated Feb 28, 2026 | 13 min read

Key takeaways

The short version

Most teams should not train from scratch. Take the cheapest path that meets the requirement: a pretrained model or API, then prompting, then retrieval (RAG), then fine-tuning, and only then training from scratch.
Two honest do-not-train triggers: you lack labeled data, or your real problem is missing knowledge (a RAG problem) instead of missing skill.
The workflow is seven stages: frame the problem; collect, label, and split the data before anything else; build the representation; select a model against a baseline; train; evaluate on held-out data; then deploy and monitor for drift.
Match the family to the data: tabular and explainable points to classical ML, perceptual data with many labels points to deep learning, language where a base model is almost right points to fine-tuning an LLM. Fine-tuning a checkpoint needs far less compute, data, and time than pretraining, and PEFT/LoRA makes it affordable.
The pitfalls that ruin a model are predictable: data leakage, inconsistent preprocessing, overfitting (including tuning against the test set), picking the wrong metric, and uncontrolled randomness. Each has a known fix.

Do you even need to train a custom model?

Probably not, and ruling it out is the honest first step. Take the cheapest path that meets your requirement: use a pretrained model or hosted API as-is, then prompt engineering, then retrieval-augmented generation if the gap is missing knowledge, then fine-tuning, and only train from scratch when no suitable pretrained model exists for your data and task. Most teams never reach the last step, and saying so is the answer most guides avoid.

Think of it as an escalation ladder where each rung costs more in time, data, and money than the one below it. Climb only as far as the task forces you to.

Use a pretrained model or API as-is. When a general model already does the task acceptably, you are done. Zero training, fastest to ship.
Prompt engineering (LLMs only). Adjust instructions, examples, and structure when the model already knows how to do the task and you just need to steer it. Hours to days.
Retrieval-augmented generation (RAG) (LLMs). Reach for this when the task needs knowledge the base model lacks, such as internal docs, product specs, or policies, especially when that knowledge changes often. RAG injects fresh facts at query time with no retraining. If your problem is really that the model lacks knowledge, read fine-tuning vs RAG before training anything.
Fine-tune a pretrained model (classical or LLM). Use this when you need a specific behavior, style, format, or domain adaptation, or to compress a long prompt into the weights. OpenAI's own guidance treats fine-tuning as the step to reach for once prompting proves insufficient. It is useful when you need to handle more examples than fit in a context window, cut cost with shorter prompts, embed proprietary data, or make a smaller model perform like a larger one.⁹
Train from scratch (classical ML, or rarely a domain deep-learning model). Justified only when no suitable pretrained model exists: a tabular prediction unique to your business, a specialized signal, or a regulated domain you must own end to end. Training an LLM from scratch is almost never justified on cost grounds, since fine-tuning a pretrained checkpoint is "identical to pretraining except you don't start with random weights" and "requires far less compute, data, and time."¹

Two triggers should stop you outright. First, if you lack labeled data: supervised training needs labeled examples, so without quality labels, training is not your next step. Second, if your real problem is missing knowledge, not missing skill: a model that would be fine "if only it knew X" has a retrieval problem, not a training problem. The decision table below is where you commit.

Do you need to train a custom model?

Match your situation to the cheapest approach that solves it. Most rows do not require training at all.

The train-or-not decision table
Your situation	Recommended approach	Train a model?
A general or API model already does it well	Use pretrained / API as-is	No
LLM is nearly right, just needs steering	Prompt engineering	No
Model lacks knowledge that changes often	RAG (retrieval)	No
You have no labeled data for the task	Get labels first, or rethink the task	No (yet)
Need consistent style, format, or domain behavior	Fine-tune a pretrained model	Yes (light)
Unique tabular or business prediction problem	Train classical ML	Yes
No suitable pretrained model for your data	Train from scratch (rare)	Yes (heavy)

Sources: OpenAI model-optimization guidance and Hugging Face fine-tuning docs. The escalation order is the practitioner consensus; do not read the rows as cost multiples.
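The table reads as a ladder you walk from the cheapest rung up, and that logic can be sketched in a few lines of Python. This is only an illustration: the `Situation` fields and the `recommend` function are hypothetical names invented here, and the order of the checks is our reading of the table, not an official decision procedure.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical yes/no answers mirroring the rows of the decision table."""
    existing_model_good_enough: bool = False
    llm_needs_only_steering: bool = False
    gap_is_missing_knowledge: bool = False
    has_labeled_data: bool = True
    needs_consistent_behavior: bool = False
    unique_tabular_problem: bool = False


def recommend(s: Situation) -> tuple[str, str]:
    """Return (approach, train a model?) by checking the cheapest rungs first."""
    if s.existing_model_good_enough:
        return ("Use pretrained / API as-is", "No")
    if s.llm_needs_only_steering:
        return ("Prompt engineering", "No")
    if s.gap_is_missing_knowledge:
        return ("RAG (retrieval)", "No")
    # Everything below this point involves training, which needs labels.
    if not s.has_labeled_data:
        return ("Get labels first, or rethink the task", "No (yet)")
    if s.needs_consistent_behavior:
        return ("Fine-tune a pretrained model", "Yes (light)")
    if s.unique_tabular_problem:
        return ("Train classical ML", "Yes")
    return ("Train from scratch (rare)", "Yes (heavy)")


# Example: the model would be fine "if only it knew X".
print(recommend(Situation(gap_is_missing_knowledge=True)))
```

The point of the ordering is that a cheaper rung short-circuits the more expensive ones, so a team only reaches a training recommendation after every no-training answer has been ruled out.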

The seven-stage AI model training workflow

Training a custom model follows seven ordered stages: frame the problem, collect and label data and split it before anything else, build the feature or token representation, select a model against a baseline, train, evaluate on held-out data, then deploy and monitor for drift. The same lifecycle maps onto both classical ML and LLM fine-tuning; the single most common way teams ruin results is getting the order wrong, especially splitting the data too late.

Problem framing. Turn the business goal into a precise, measurable question before any code: what exactly is predicted, what the input is, the task type (classification, regression, ranking, generation), and the single metric you will optimize. Confirm the data you need actually exists.⁷ What good looks like: one sentence stating the prediction target and the metric.
Data collection, labeling, and splits. Collect data representative of the real conditions the model will see in production; the core ML assumption is that production data resembles training data, and the model fails when that breaks.⁷ Label every example with its target, since label quality caps model quality. Then split before you touch the data: hold out a test set first, before any preprocessing or feature selection, using a three-way train / validation / test split. Training fits the model, validation tunes hyperparameters, and the test set is touched once for a final unbiased estimate.⁴ What good looks like: a test set that has never influenced any choice.
Feature or representation. For classical ML, clean, encode, and scale inputs and engineer features, fitting any transformer (scaler, encoder, selector) on the training split only, then applying it to validation and test. Deep learning and LLMs learn representations from raw data, so this stage becomes tokenization, normalization, and data formatting instead of hand-built features.⁴ What good looks like: identical preprocessing at train and inference, ideally wrapped in a Pipeline so a step is never forgotten.
Model selection. Pick the smallest model class that can plausibly solve the task, and establish a baseline first (a trivial model, or "use the API as-is") so you can prove the trained model is actually better. For LLMs, OpenAI's recommended loop is to build evals and measure a baseline before fine-tuning.⁹
Training. Fit the model on the training set. For an LLM fine-tune the controls include epochs, batch size, learning rate, mixed precision, gradient accumulation, and gradient checkpointing. Hugging Face's documented example for a small-LLM fine-tune uses 3 epochs, a 2e-5 learning rate, and bf16 mixed precision, with gradient accumulation to simulate a larger effective batch.¹ Treat those as illustrative example values rather than universal recommendations.
Evaluation. Measure on held-out data with the metric chosen in stage one, and watch the gap between training and validation performance, since a large gap means overfitting. Use k-fold cross-validation when data is limited, because a single split wastes data and depends on one arbitrary partition. Never tune against the test set: if you do, "knowledge about the test set can leak into the model and evaluation metrics no longer report on generalization performance."⁵ What good looks like: a single pre-committed metric and an honest train-validation gap.
Deployment and monitoring. Ship the model behind an API or batch job, then monitor live inputs and outputs for drift, meaning production data moving away from the training distribution, and for quality regressions. ML is iterative, so loop back to data and training as performance degrades or requirements change.⁷ What good looks like: alerting on input drift and a defined retrain trigger.
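Stage seven's "alerting on input drift" can be sketched with one common drift statistic, the Population Stability Index (PSI). PSI is our choice for illustration and is not prescribed by the sources cited here; the `psi` function, the bin count, and the alert threshold are all assumptions you would validate on your own data.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a live sample.

    Bins are equal-width over the range of the training (expected) values.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so live values outside the training range land in edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(n_bins - 1, idx))] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Toy data: one numeric feature at training time, then two live samples.
train_values = [i / 100 for i in range(1000)]
live_same = list(train_values)
live_shifted = [v + 5.0 for v in train_values]

ALERT_THRESHOLD = 0.25  # conventional rule of thumb, not a universal constant

same_score = psi(train_values, live_same)
shifted_score = psi(train_values, live_shifted)
print(same_score, shifted_score, shifted_score > ALERT_THRESHOLD)
```

In practice you would compute a statistic like this per feature on a schedule, and tie the alert to the retrain trigger your team has agreed on.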

Two guardrails sit on top of these stages and prevent most self-inflicted failures: split the data before any preprocessing, and tune on validation (or with cross-validation), never on the test set. If you would rather have a team you trust scope, train, and ship this with you, see how our machine learning development team builds custom models and how our custom LLM development team handles the fine-tuning side.
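The two guardrails can be sketched with scikit-learn. This is a minimal example under stated assumptions: the data is synthetic (`make_classification`), and the logistic regression, the `C` grid, and the balanced-accuracy metric are placeholders for whatever your task calls for.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)

# Guardrail 1: hold out the test set before any preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler sits inside the Pipeline, so during cross-validation it is
# fitted on each fold's training portion only and never sees held-out rows.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Guardrail 2: tune with 5-fold cross-validation on the training split only.
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="balanced_accuracy",
)
search.fit(X_train, y_train)

# Trivial baseline, so the trained model has something to beat.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# The test set is read once, after every choice has been made.
baseline_score = balanced_accuracy_score(y_test, baseline.predict(X_test))
test_score = balanced_accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, baseline_score, test_score)
```

Keeping the scaler and the model in one `Pipeline` also means the identical preprocessing is applied at inference, which is the stage-three requirement.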

The data and compute you actually need

There is no universal minimum dataset size; it depends on task difficulty, the number of classes or features, and model size, and "it depends" is the honest answer rather than a fabricated number. Compute splits cleanly: classical ML on tabular data usually trains on CPU or a single modest machine, while deep learning and LLM fine-tuning need GPUs, where memory is the binding constraint and PEFT/LoRA is the standard way to keep cost down.

On data, the rule of thumb tracks how much the model has to learn from your examples. Training from scratch needs the most labeled data, because the model learns everything from you. Fine-tuning a pretrained model needs far less, because it already encodes general knowledge.¹ LLM fine-tuning can start with surprisingly few examples; OpenAI frames it around the behaviors and examples that exceed a prompt rather than a fixed dataset size, and the practical loop is to add and improve examples based on eval feedback.⁹ Across all of these, quality beats quantity: clean, correctly labeled, representative data sets the ceiling, and bad labels cap it.

On compute, classical models such as logistic regression, gradient-boosted trees, and random forests typically train on CPU, which is the cheap, fast, often-correct default for tabular data. Deep learning and LLM fine-tuning need GPUs, and memory is usually the limiting factor more than raw speed. Documented levers to fit training on smaller GPUs are mixed precision (bf16 or fp16), gradient accumulation, and gradient checkpointing, which trades compute for memory by recomputing activations.¹ The single biggest cost lever for large models is parameter-efficient fine-tuning: PEFT methods "only fine-tune a small number of (extra) model parameters, significantly decreasing computational and storage costs, while yielding performance comparable to a fully fine-tuned model" and make it feasible "on consumer hardware."²
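Gradient accumulation is the easiest of those levers to show with arithmetic: several small micro-batches are processed one at a time and their gradients averaged before a single optimizer step, which yields the same gradient as one large batch. The sketch below uses a toy one-parameter model and made-up numbers in plain Python; real frameworks expose this as a setting (for example `gradient_accumulation_steps` in Hugging Face's `TrainingArguments`) and do not require hand-written loops.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average equal-sized micro-batch gradients before one optimizer step."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch must divide evenly into micro-batches")
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        part = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
        # Scale each micro-batch gradient by 1/steps so the sum is a mean.
        total += grad_mse(w, xs[part], ys[part]) / steps
    return total


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

full = grad_mse(0.5, xs, ys)                               # one batch of 8
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)  # four batches of 2
print(full, accum)
```

Only one micro-batch's activations need to be held in memory at a time, which is why the technique helps when GPU memory, not speed, is the binding constraint.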

What LoRA cuts versus full fine-tuning of GPT-3 175B

The headline figures reported in the LoRA paper are tabulated below. The reductions span orders of magnitude, which is why most teams fine-tune with PEFT instead of fully retraining.

Metric (vs full fine-tune of GPT-3 175B)	LoRA result
Trainable parameters	Up to 10,000× fewer
GPU memory requirement	3× lower
Added inference latency	None; quality on par or better

Source: Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," arXiv:2106.09685 (2021). Figures are the authors' reported results for GPT-3 175B.
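Where the parameter savings come from is simple arithmetic. LoRA freezes a d × k weight matrix and trains two low-rank factors of shapes d × r and r × k instead, so the trainable count per matrix drops from d·k to r·(d + k). The sketch below uses hypothetical dimensions (4096 × 4096, rank 8) chosen for illustration; it does not reproduce the paper's GPT-3 figure, which also depends on which matrices are adapted and at what rank.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for rank-r factors B (d x r) and A (r x k)."""
    return r * (d + k)


# Hypothetical layer size and rank, for illustration only.
d, k, r = 4096, 4096, 8

full = full_finetune_params(d, k)
lora = lora_trainable_params(d, k, r)
reduction = full / lora

print(f"full: {full:,}  lora: {lora:,}  reduction: {reduction:.0f}x")
```

Because the frozen weights need no gradients or optimizer state, fewer trainable parameters also means less GPU memory during training.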

Classical ML vs deep learning vs LLM fine-tuning

Match the model family to the data, not to the trend. Use classical ML for structured tabular data, smaller datasets, and when interpretability matters; use deep learning for unstructured perceptual data such as images, audio, and raw signals when you have many labels and a GPU budget; and fine-tune an LLM for language and generation tasks where a base model is almost right. The one-line heuristic: tabular and need to explain it points to classical ML, images or audio with lots of labels points to deep learning, and language where a base model is nearly right points to fine-tuning an LLM, after checking RAG first.

Classical ML (linear and logistic regression, SVM, random forest, gradient boosting) is best for structured and tabular data, smaller datasets, when you need to explain decisions, and when you want fast, cheap CPU training. It is often the strongest baseline and frequently the right final answer for business prediction such as churn, fraud scoring, demand, and risk; scikit-learn is the canonical library for these estimators and documents their evaluation and pitfalls.⁴

Deep learning trained for your task (CNNs, transformers, and similar architectures trained heavily on your data) suits unstructured perceptual data where features are hard to hand-engineer and you have substantial labels and GPU budget, so the model learns its own representations; PyTorch is the canonical framework here.⁸

LLM fine-tuning adapts a pretrained foundation model and is best for language tasks where the base model is close but you need a consistent style, format, tone, or domain behavior, or want to bake repeated instructions into the weights to shorten prompts. Reach for it only after prompting (and RAG, if the gap is knowledge) proves insufficient, and use PEFT/LoRA to keep it affordable.²

The pitfalls that quietly ruin a model

The failures that ruin a custom model are predictable and each has a known fix: data leakage, inconsistent preprocessing, overfitting (including overfitting the test set), choosing the wrong metric, and uncontrolled randomness. None of these show up as an error message; they show up as a model that looks great in evaluation and fails in production, which is why disciplined practice matters more than model choice. RAND Corporation research (2024) identified "inadequate or poor data to train the model" as one of five root causes behind the majority of AI project failures studied across 65 interviews of data scientists and engineers,¹⁰ confirming that execution practice, not architecture, is where projects stall.

Data leakage. Using information at training time that would not be available at prediction time produces "overly optimistic performance estimates," and the classic cause is fitting a scaler or feature selector on all data before splitting. The fix: split first, fit_transform on train only, transform on validation and test, and "never call fit on the test data."³
Inconsistent preprocessing. Applying different transforms at train versus inference means "the feature space will change and the model will not perform effectively"; scikit-learn's own example shows mean squared error jumping from 0.90 (correct) to 62.80 (wrong) purely from skipping the test-set transform. The fix: wrap preprocessing and model in a Pipeline.³
Overfitting, including the test set. A model that memorizes training data fails to generalize, and tuning hyperparameters against the test set leaks it into the model. The fix: tune on a validation set or with k-fold cross-validation, keep the test set for a single final read, watch the train-validation gap, and prefer simpler, regularized models.⁵
The wrong metric. Accuracy is misleading on imbalanced data, where a model can look great while failing the minority class. The fix: match the metric to the goal, using precision to avoid false positives, recall to catch all positives, F1 to balance them, ROC-AUC for a threshold-agnostic view, or balanced accuracy, which "avoids inflated performance estimates on imbalanced datasets." For regression, use MAE, RMSE, or R-squared.⁶
Uncontrolled randomness. Different seeds give different results and break fold-to-fold comparison. The fix: set seeds deliberately, passing integers to cross-validation splitters for comparable splits and using RandomState instances for estimators when you want robust averaging.³
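The wrong-metric pitfall is easy to reproduce by hand. The sketch below scores a "model" that always predicts the majority class on made-up data with 5 positives in 100 examples; the `binary_metrics` helper is a hypothetical name written in plain Python for illustration, and in practice you would use a library's metric functions.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall, specificity, and balanced accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall,
        "specificity": specificity,
        # Balanced accuracy is the mean of recall on each class.
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Made-up imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A useless model that always predicts the majority class.
y_pred = [0] * 100

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy comes out at 0.95 while recall is 0.0 and balanced accuracy is 0.5, which is no better than guessing: the model never finds a single positive.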

What it really costs

There is no honest single dollar figure to train a custom model; cost depends on the approach, the data, and the model size. The structural facts are clear: classical ML on tabular data is the cheap end with no GPU bill, fine-tuning a pretrained model costs far less than training from scratch, and PEFT/LoRA cuts the cost of fine-tuning large models sharply. The recurring cost is rarely the training run itself but the data labeling, evaluation, monitoring, and retraining that dominate a model's total cost of ownership.

Put the structural facts ahead of the dollar figures. Fine-tuning a checkpoint needs far less compute, data, and time than pretraining,¹ PEFT/LoRA dramatically reduces trainable parameters and GPU memory for large models,² and classical ML on tabular data runs on a single modest machine. The variable spend, meaning cloud GPU rates and fine-tuning costs, varies widely by provider and model size. GPU-pricing trackers in 2025 and 2026 commonly report cloud H100 rates from roughly the low single digits of dollars per GPU-hour on independent providers up to the high single or low double digits on hyperscalers, and small-model (around 7B-parameter) fine-tunes that take tens of GPU-hours. Treat those only as ranges reported by pricing trackers, never as a fixed quote, since the figures move fast. The honest answer is that it depends on data, model size, and approach, and the cost drivers above are what to budget for.
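The compute part of the budget is plain multiplication, which makes it easy to turn a quoted range into your own low and high estimate. Every number in the sketch below is a placeholder picked from inside the loose ranges described above, not a quote or a measurement, and `compute_cost` is a hypothetical helper; substitute the rate your provider actually charges and the GPU-hours your own runs take.

```python
def compute_cost(gpu_hours: float, rate_per_gpu_hour: float, n_runs: int = 1) -> float:
    """Compute-only cost: GPU-hours per run x hourly rate x number of runs."""
    return gpu_hours * rate_per_gpu_hour * n_runs


# Placeholder inputs for illustration only; replace with your own figures.
GPU_HOURS_PER_RUN = 40   # "tens of GPU-hours" for a small fine-tune
RATE_LOW = 2.0           # dollars per GPU-hour, low end (placeholder)
RATE_HIGH = 10.0         # dollars per GPU-hour, high end (placeholder)
N_RUNS = 3               # you rarely get it right in a single run

low = compute_cost(GPU_HOURS_PER_RUN, RATE_LOW, N_RUNS)
high = compute_cost(GPU_HOURS_PER_RUN, RATE_HIGH, N_RUNS)
print(f"compute-only estimate: ${low:,.0f} to ${high:,.0f}")
```

This covers the training runs only. Labeling, evaluation, monitoring, and retraining are separate line items, and they are the ones that tend to dominate total cost of ownership.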

Frequently asked

Training a custom AI model: common questions

How much data do I need to train a custom model?

There is no fixed minimum; it depends on task difficulty, the number of classes or features, and model size. Training from scratch needs the most labeled data. Fine-tuning a pretrained model needs far less, because it already encodes general knowledge, and clean, correctly labeled, representative data matters more than raw volume.

Should I train a custom model or use an existing API?

Use the cheapest path that meets your requirement. Try a pretrained model or API first, then prompt engineering, then RAG if the gap is missing knowledge, then fine-tuning, and only train from scratch when no suitable pretrained model exists for your data and task. Most teams never need to train from scratch.

What is the difference between training and fine-tuning a model?

Training from scratch starts from random weights and learns everything from your data. Fine-tuning continues training a pretrained model on your smaller, task-specific dataset, so it is identical to pretraining except you do not start with random weights, and it needs far less compute, data, and time. Fine-tuning is the common, affordable choice.

How do I avoid overfitting when training a model?

Hold out a test set before any preprocessing, tune hyperparameters on a validation set or with k-fold cross-validation instead of the test set, watch the gap between training and validation scores, and prefer simpler, regularized models. Never tune against the test set, or its results stop reflecting real-world generalization.

What does it cost to train a custom AI model?

It depends on approach, data, and model size. Classical ML on tabular data can run on CPU for little. Fine-tuning a pretrained model with PEFT or LoRA is far cheaper than full retraining and can run on modest GPUs. Beyond compute, budget for data labeling, evaluation, and ongoing monitoring and retraining.

Kanika Mathur

Head of Service Delivery, Resourcifi

Kanika Mathur is Head of Service Delivery at Resourcifi, where her ML pods scope, train, and ship custom models, from gradient-boosted models on tabular data to PEFT fine-tunes of foundation LLMs. She has run the data-split reviews and baseline-versus-trained eval gates that decide whether a model earns its place in production or should have stayed an API call, which is the lens this guide is written from.
