EnGen Practice · Discipline 01

LLM Development
& Fine-tuning.

The Model Knows the World. Fine-tuning Makes It Know Your Business.

Every large language model trained on public data carries the same blind spot — it has never read an organisation's internal documents, never understood its pricing logic, never absorbed the precise vocabulary its customers use when something goes wrong. Fine-tuning closes that gap. The result is not a smarter version of a general AI. It is an AI that thinks and communicates as though it has spent years inside the organisation.

Why generic AI
underperforms in enterprise.

There is a version of AI adoption that disappoints almost every enterprise that pursues it. The organisation connects a frontier model to its systems, runs a demonstration that impresses everyone, and then deploys it — only to discover that the model hedges on domain-specific questions, uses the wrong terminology in customer-facing outputs, misunderstands internal classification schemes, and occasionally produces answers that are technically coherent but operationally wrong.

This is not a failure of AI. It is a failure of fit.

A model trained on the public internet has absorbed an enormous breadth of knowledge. It has not absorbed the organisation's proprietary logic. It does not know that "priority one escalation" means something specific in a particular support workflow. It does not know that a certain regulatory clause applies differently across product lines. It does not know that customers in one geography use a different phrase for the same problem. These gaps look small in a demonstration and feel very large in production.

Fine-tuning is the discipline of closing that gap systematically — not by adding instructions, not by stuffing context windows with documentation, but by adjusting the model's actual parameters so that domain knowledge becomes intrinsic to how it reasons.

The difference between
prompting and fine-tuning.

Prompt-based instruction works well for tasks that are bounded, well-defined, and do not require deep domain fluency. It is faster to implement and easier to update. For many use cases, it is the right starting point.

Fine-tuning becomes the correct choice when:

- The task demands domain fluency that instructions and context windows cannot supply
- Terminology, classification schemes, or internal logic are proprietary and precise
- Outputs must carry a consistent organisational voice and format at scale
- Inference volume makes a smaller, specialised model materially cheaper to run
- Correctness is operational, not merely plausible, and errors carry real consequences

When these conditions are present, fine-tuning does not merely improve performance — it makes certain categories of application possible at all.

How Entiovi approaches
fine-tuning.

Entiovi's fine-tuning practice is built around a core principle: the model is selected and shaped to fit the organisation's requirements, not the other way around.

01

Task and data audit first

Before any model is selected, the specific tasks are documented in precise terms. The data available for training is audited for volume, quality, coverage, and any privacy or compliance constraints. These two inputs — task definition and data reality — determine everything that follows.

02

Model selection as a business decision

Entiovi evaluates models against a matrix covering task performance, context window requirements, inference cost at projected volume, data residency constraints, and latency SLAs. The choice between a frontier proprietary model, an open-weight model, and a smaller specialist model is a commercial calculation. A well-fine-tuned 7-billion-parameter model regularly outperforms a 70-billion-parameter general model on narrow domain tasks, at a fraction of the inference cost.

03

Supervised Fine-tuning (SFT)

High-quality examples of the correct input-output behaviour are curated and used to adjust the model's weights directly. The quality of these examples matters far more than their quantity — a carefully curated set of 2,000 examples consistently outperforms 20,000 poorly labelled ones.
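As an illustrative sketch only (the ticket categories and field names below are hypothetical), a curated SFT dataset is typically serialised as prompt-completion pairs in JSONL, one example per line:

```python
import json

# A hypothetical snippet of a curated SFT dataset: each record pairs a
# domain-specific prompt with the exact output the model should learn.
examples = [
    {
        "prompt": "Classify this support ticket: 'Invoice total does not match the quote.'",
        "completion": "Category: Billing discrepancy. Priority: P2.",
    },
    {
        "prompt": "Classify this support ticket: 'Production line halted, fault code E-114.'",
        "completion": "Category: Equipment fault. Priority: P1.",
    },
]

# Serialise to JSONL, the line-per-example format most fine-tuning
# toolchains accept as training input.
jsonl = "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
print(jsonl.splitlines()[0])
```

Curation happens at this level: every line is a behaviour the model will internalise, which is why two thousand verified examples beat twenty thousand noisy ones.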

04

Direct Preference Optimisation (DPO)

Aligns the model to preferred outputs without requiring a separate reward model. Particularly effective when the organisation has strong opinions about tone, format, or reasoning style — a legal team that needs responses written with particular precision, or a customer service operation that needs a consistent voice.
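The objective behind DPO can be sketched in a few lines. The following is an illustrative implementation of the per-pair DPO loss, with made-up log-probability values standing in for real model outputs:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimisation loss for one preference pair.

    Log-probabilities come from the policy being trained and a frozen
    reference copy of it; beta controls how far the policy may drift.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log sigmoid(beta * margin): small when the policy prefers the
    # chosen answer more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy already favours the chosen response, the loss is low:
low = dpo_loss(-2.0, -8.0, -5.0, -5.0)
# When it favours the rejected response, the loss is high:
high = dpo_loss(-8.0, -2.0, -5.0, -5.0)
print(round(low, 4), round(high, 4))  # → 0.4375 1.0375
```

Because the reference model anchors the comparison, no separate reward model is trained — the preference data itself supplies the gradient signal.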

05

Reinforcement Learning from Human Feedback (RLHF)

Applied when the definition of a 'good' response is complex enough to require ongoing human judgement — typically in high-stakes domains where correctness is multi-dimensional and not fully capturable in static examples.

06

Low-Rank Adaptation (LoRA / QLoRA)

Allows large models to be fine-tuned with dramatically lower computational requirements by adding small trainable matrices to existing weights rather than adjusting all parameters. A 70-billion-parameter model can be adapted in hours on a single GPU — with performance on domain-specific tasks indistinguishable from full fine-tuning, at a fraction of the cost.
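The parameter savings are simple arithmetic. A sketch, using an illustrative 4096 x 4096 projection matrix and a rank-8 adapter:

```python
def lora_param_counts(d, k, r):
    """Compare full fine-tuning against a rank-r LoRA adapter for one
    frozen weight matrix of shape (d, k)."""
    full = d * k            # every weight updated in full fine-tuning
    lora = r * (d + k)      # only B (d x r) and A (r x k) are trained
    return full, lora

# A single 4096 x 4096 attention projection with a rank-8 adapter:
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
# → full: 16,777,216  lora: 65,536  ratio: 256x
```

Multiplied across every adapted layer, this is why a 70-billion-parameter model becomes trainable on a single GPU: the base weights stay frozen and only the small adapter matrices carry gradients.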

07

Evaluation before deployment

No fine-tuned model leaves Entiovi's process without a rigorous evaluation against domain-specific benchmarks built from the organisation's own data. Domain evaluation suites are built from real examples, edge cases drawn from operational history, and adversarial cases that probe the boundaries of correct behaviour.
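At its simplest, a domain evaluation suite reduces to scoring a model against curated input-output pairs. A minimal sketch, with a stub function standing in for the fine-tuned model under test (the tickets and labels are invented):

```python
def evaluate(model_fn, test_cases):
    """Score a model against a domain evaluation suite: a list of
    (input, expected_output) pairs drawn from real operational data."""
    passed = sum(1 for inp, expected in test_cases if model_fn(inp) == expected)
    return passed / len(test_cases)

# A stub 'model' standing in for the system under test:
def stub_model(ticket):
    return "P1" if "halted" in ticket else "P2"

suite = [
    ("Production line halted, fault code E-114.", "P1"),
    ("Invoice total does not match the quote.", "P2"),
    ("Line halted after power cycle.", "P1"),
    ("Customer reports total outage by phone.", "P1"),  # edge case the stub misses
]
print(f"accuracy: {evaluate(stub_model, suite):.2f}")  # → accuracy: 0.75
```

The edge cases and adversarial examples are what make the suite meaningful: a model that only passes the easy cases has not been evaluated.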

Where fine-tuning makes
the clearest difference.

Financial services

Compliance and regulatory language

Regulatory documents use highly specific language where small differences in phrasing carry large legal and financial consequences. A fine-tuned model trained on the organisation's own regulatory corpus understands those distinctions as deeply as a trained analyst.

Healthcare and life sciences

Clinical and research language

Clinical documentation, diagnostic reasoning, and research literature operate in a vocabulary that diverges sharply from general language. Fine-tuned models validated against domain expert review produce outputs that are operationally usable rather than requiring extensive human correction.

Legal

Contract analysis and drafting

Legal language is precise in ways that general models systematically misread. Fine-tuning on a firm's own matter history, standard clauses, and deviation flags produces a model that understands not just legal language generally but the firm's specific approach to it.

Manufacturing and engineering

Technical documentation

Maintenance manuals, fault codes, and engineering specifications carry domain logic that general models approximate rather than understand. Fine-tuned models trained on actual technical documentation dramatically reduce errors in AI-assisted maintenance, quality control, and process guidance.

Customer operations

Tone, policy, and product knowledge

Customer-facing AI needs to speak with the organisation's voice, reflect its current policies accurately, and handle product-specific questions with precision. Fine-tuning on interaction history, product documentation, and policy texts produces an AI that behaves like a well-trained team member.

Research perspective

What the research frontier
is telling enterprises.

Smaller models are closing the gap on domain tasks

Research from Microsoft, Google DeepMind, and leading academic groups has demonstrated consistently that models with 7 to 13 billion parameters, when fine-tuned on high-quality domain data, match or exceed the performance of models ten times their size on narrow tasks. For enterprises, this means the cost of fine-tuning and running a domain-adapted small model is a fraction of querying a frontier large model at scale.

Synthetic data is expanding what fine-tuning can achieve

A persistent constraint in enterprise fine-tuning has been the scarcity of high-quality labelled examples. Recent research demonstrates that larger models can generate synthetic training examples that — when carefully curated — produce fine-tuned models with performance comparable to those trained on human-annotated data. Entiovi applies this technique when client data is rich but labelled examples are scarce.
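The curation step is what makes synthetic data usable. A minimal sketch of such a filter, with hypothetical field names; a production pipeline would add model-based quality scoring on top of these mechanical checks:

```python
def curate(synthetic_examples, min_len=20, max_len=500):
    """Filter synthetic training examples: drop duplicates and
    completions outside a plausible length range."""
    seen, kept = set(), []
    for ex in synthetic_examples:
        key = ex["completion"].strip().lower()
        if key in seen or not (min_len <= len(key) <= max_len):
            continue
        seen.add(key)
        kept.append(ex)
    return kept

raw = [
    {"prompt": "q1", "completion": "A detailed, plausible domain answer here."},
    {"prompt": "q2", "completion": "A detailed, plausible domain answer here."},  # duplicate
    {"prompt": "q3", "completion": "too short"},
]
print(len(curate(raw)))  # → 1
```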

Constitutional AI and preference learning are changing alignment

Research into aligning model behaviour to values and preferences without large volumes of human feedback has made it practical to fine-tune models for consistency of tone and reasoning style even when feedback data is limited. For enterprises needing a consistent organisational voice, this delivers results that previously required sustained human annotation programmes.

Catastrophic forgetting is a largely solved engineering problem

The concern that fine-tuning degrades a model's general capabilities while improving domain performance — known as catastrophic forgetting — has been substantially addressed by parameter-efficient techniques. LoRA and its variants confine weight changes to small adapter matrices, leaving the base model's broader knowledge intact. The result is a model simultaneously more capable in the domain and no less capable outside it.

Five questions that
shape the outcome.

Organisations that approach fine-tuning with clear answers to these questions consistently achieve better results than those that do not.

Q1

What specific behaviour needs to change?

The sharper this answer, the more effective the fine-tuning. "The model needs to classify incoming claims into our 22 internal categories with greater accuracy than our current rules-based system" will produce a better outcome than "we want the AI to understand our insurance business better."

Q2

What data exists and what is its quality?

Volume matters less than quality and relevance. Five hundred carefully curated examples of correct behaviour are more valuable than five thousand noisy ones. The data audit is one of the most important steps in the process — and one of the most frequently skipped.
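A data audit can begin as a handful of mechanical checks before any human review. An illustrative sketch over a hypothetical labelled dataset (field names invented for the example):

```python
from collections import Counter

def audit(dataset):
    """A minimal first-pass data audit over labelled examples,
    assuming dicts with 'text' and 'label' fields."""
    labels = Counter(ex["label"] for ex in dataset)
    duplicates = len(dataset) - len({ex["text"] for ex in dataset})
    empty = sum(1 for ex in dataset if not ex["text"].strip())
    return {"volume": len(dataset), "label_counts": dict(labels),
            "duplicates": duplicates, "empty_texts": empty}

data = [
    {"text": "Fault E-114 on line 3", "label": "equipment"},
    {"text": "Fault E-114 on line 3", "label": "equipment"},  # duplicate record
    {"text": "", "label": "billing"},                         # empty record
]
print(audit(data))
```

Counts like these do not replace expert review of label quality, but they surface the volume, duplication, and coverage problems that derail training runs later.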

Q3

Where does this model need to live?

If data cannot leave the organisation's infrastructure — as is the case for most regulated industries — the model must be built and served internally. This shapes both the model selection and the infrastructure design from the outset.

Q4

What does success look like in measurable terms?

Fine-tuning without a defined success metric produces a model that seems better without being verifiably better. Defining evaluation criteria before training begins makes the outcome assessable and improvable.

Q5

What happens when the model is wrong?

Every deployed model will make mistakes. Designing the human review and correction workflow before deployment — rather than after the first production failure — is the difference between a resilient system and a reputational risk.

Proof points
68% Reduction in document processing time — financial services client, no data leaving internal infrastructure
4.2× Faster developer velocity with a code assistant fine-tuned on the organisation's own codebase
91% Answer accuracy in a healthcare information system vs 67% with a general model baseline

How Entiovi engages.

Phase 01 · 2–3 weeks

Discovery and data audit

A structured assessment of the target tasks, available data, infrastructure constraints, and success metrics. This produces a recommendation — which technique, which base model, what infrastructure — with a business case attached.

Phase 02 · 4–8 weeks

Fine-tuning and evaluation

Model training, domain evaluation suite development, iterative refinement, and a formal evaluation report comparing the fine-tuned model against the baseline on the organisation's own test cases.

Phase 03 · 2–4 weeks

Deployment and handover

Integration into the target system, inference infrastructure setup, monitoring configuration, and full documentation. The organisation receives a model it understands and can operate — not a black box.

Phase 04 · Continuous

Ongoing model stewardship

As the organisation's data evolves and the model landscape shifts, Entiovi provides structured re-evaluation and re-training cycles to ensure the model remains accurate, aligned, and cost-efficient over time.

Ready when you are

The right model for
the right organisation.

Generic AI produces generic outcomes. The organisations seeing the clearest return from Generative AI are those that have invested in making the technology genuinely theirs — trained on their data, shaped to their requirements, deployed in their infrastructure.

Entiovi's team can assess, in a structured two-week engagement, whether fine-tuning is the right approach for a given use case — and if so, exactly what that process looks like, what it costs, and what it will deliver.

Entiovi · Orion Practice · Discipline 01