Images Carry Signal. Models Extract It. Systems Act on It — Reliably, Under Real-World Conditions.
Computer vision is the enterprise discipline of turning pixels into decisions. It is not the discipline of winning an academic benchmark — plenty of models win benchmarks and fail on the factory floor. It is the discipline of building perception systems that work at 4 a.m. on the third shift, under flickering light, with the wrong lens, while the conveyor vibration is slightly different from yesterday, and the network has just lost the cloud uplink. Entiovi's Mintaka practice builds vision systems that survive those conditions — because that is the only kind of vision system that matters in production.
A large share of enterprise information is visual and has been systematically underused. Production lines generate images faster than any human inspection team can examine. Warehouses generate video that sits in dead storage. Instruments generate imagery that operators scroll past. Documents carry structure in their layout that OCR engines flatten. Shelves, drones, clinical scanners, security feeds, vehicles, meter displays — all of them carry signal that rarely makes it into a decision system.
Computer vision converts that signal into something an enterprise can act on. Detection, segmentation, classification, tracking, reading, counting — each with a measured accuracy, a measured latency, and an audit trail. Humans are expensive, inconsistent, and rate-limited at examining visual information. Well-engineered vision systems are fast, consistent, and auditable. The difference between a demo and a deployment is the rest of this page.
We build vision systems that work on the factory floor at 4 a.m., not only in the curated validation set.
Mintaka vision engagements cluster in a handful of well-defined application patterns. Each has a distinct engineering profile.
Vision model selection is a function of the use case, the latency envelope, and the deployment target — not of which architecture the team last worked with.
YOLO (v8, v9, v10, v11), Detectron2, SAM, Mask R-CNN, and successors — chosen for the specific accuracy-latency-size envelope the deployment demands.
ViT, DINOv2, Swin Transformer, EVA — where representation quality, transfer to new tasks, and scale matter more than raw inference speed.
CLIP-style and instruction-tuned models for captioning, grounding, zero-shot classification, visual question answering, and open-vocabulary detection. Mintaka uses these for defect classes where a fixed-label training set is impractical.
DINOv2, MAE, and related approaches where unlabelled imagery is plentiful and labelling is expensive — typically industrial inspection and clinical domains.
OpenCV, morphological operations, template matching, edge and contour methods — where determinism, cost, regulatory constraint, or explainability demand it. The best vision stacks frequently combine classical pre-processing with deep models.
Classical pre-processing feeding deep detectors; deep detectors followed by rule-based post-processing; deep models blended with explicit geometric constraints. The factory floor rarely rewards purity.
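That blend can be expressed as plain function composition. The sketch below is illustrative, not Mintaka's production code: `preprocess`, `detect`, and the geometric bounds are stand-ins for whatever classical stage, deep detector, and physical constraints a given line requires.

```python
def within_geometry(box, min_area=50.0, aspect_range=(0.5, 2.0)):
    """Rule-based post-filter: reject detections whose geometry is
    physically implausible for the part being inspected.
    Box format is (x1, y1, x2, y2); bounds are illustrative."""
    w, h = box[2] - box[0], box[3] - box[1]
    if w <= 0 or h <= 0:
        return False
    aspect = w / h
    return w * h >= min_area and aspect_range[0] <= aspect <= aspect_range[1]

def run_pipeline(image, preprocess, detect, postfilter):
    """Compose classical pre-processing, a deep detector, and
    rule-based post-processing as interchangeable callables."""
    return [box for box in detect(preprocess(image)) if postfilter(box)]
```

Because each stage is just a callable, a classical edge filter, a YOLO wrapper, or a template matcher can be swapped in without disturbing the rest of the stack.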
Vision is a stack, not a model. Most failures are not model failures — they are capture failures, pre-processing failures, or integration failures that no model can compensate for.
Camera choice, lens, lighting, exposure, shutter, sensor placement, synchronisation. The single highest-leverage engineering stage. A well-engineered capture stack makes the modelling problem tractable; a poor one makes it impossible.
Calibration, de-skew, colour normalisation, ROI extraction, augmentation policy. The bridge between the sensor and the model.
Detection, segmentation, classification, tracking — with explicit latency, throughput, and memory budgets per deployment target. Model and runtime chosen together.
Non-maximum suppression, temporal smoothing, tracking association, business-rule overlays, confidence thresholds tuned to the cost of a false positive versus a false negative.
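Two of those stages can be sketched in a few lines of pure Python. Greedy NMS keeps the highest-scoring box and drops overlapping duplicates; the decision threshold follows from the cost ratio when the model's score approximates the probability of a defect. The detection format and cost figures are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    discard any lower-scoring box that overlaps it too much."""
    dets = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    for d in dets:
        if all(iou(d["box"], k["box"]) < iou_thresh for k in kept):
            kept.append(d)
    return kept

def decision_threshold(cost_fp, cost_fn):
    """Cost-aware cutoff: if the score approximates P(defect), flag
    whenever score * cost_fn > (1 - score) * cost_fp."""
    return cost_fp / (cost_fp + cost_fn)
```

When a missed defect costs nine times a false alarm, the cutoff drops to 0.1: the system flags far more aggressively than a symmetric 0.5 threshold would.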
The consuming system — MES, WMS, clinical record, ticketing platform, operator console — receives a structured, auditable output, not a raw prediction. Integration is a first-class engineering output.
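A minimal sketch of what "structured, auditable" means in practice: every prediction leaves the vision system stamped with the model version and capture time, serialised to a schema the consuming system can validate. The field names here are hypothetical, not a Mintaka schema.

```python
import dataclasses
import json

@dataclasses.dataclass
class InspectionResult:
    """One auditable vision output: traceable to a frame, a model
    version, and a capture timestamp. Field names are illustrative."""
    frame_id: str
    model_version: str
    label: str
    confidence: float
    captured_at: str  # ISO-8601 timestamp from the capture stack

    def to_json(self):
        return json.dumps(dataclasses.asdict(self))
```

The consuming MES or WMS parses this record, not a raw tensor, and the same record is what lands in the audit log.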
Human corrections, hard-negative capture, and active learning loops that make the model better in production under the conditions it actually encounters — not the conditions the validation set anticipated.
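One common active-learning primitive is least-confidence sampling: route the frames the model is least sure about to human labellers first. This is a minimal sketch of that selection step, assuming each prediction carries a class-probability vector.

```python
def select_for_labelling(predictions, budget=10):
    """Least-confidence sampling: rank frames by the model's top
    class probability, lowest first, and take as many as the
    labelling budget allows."""
    ranked = sorted(predictions, key=lambda p: max(p["probs"]))
    return ranked[:budget]
```

Under a fixed labelling budget, this spends human attention on the uncertain and anomalous frames rather than on a random stream the model already handles well.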
Vision deployments span a spectrum from pure cloud to pure edge, and the choice has real consequences.
For high-throughput batch and streaming workloads where connectivity and cost permit — back-office document processing, aerial imagery analysis, video analytics on recorded footage.
On-camera, on-gateway, or on-device — where latency, privacy, bandwidth, or connectivity mandate it. Factory floors, medical devices, retail cameras, drones, vehicles. Many Mintaka engagements run entirely at the edge because no image ever leaves the device.
Edge for low-latency decisioning, cloud for retraining, monitoring, and escalation. The dominant pattern for distributed operational deployments.
Quantisation (INT8, INT4), pruning, and distillation tuned to the target hardware. ONNX, TensorRT, OpenVINO, CoreML, and TFLite runtimes selected for each deployment profile.
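The arithmetic behind INT8 quantisation is simple enough to show directly. This is a toy symmetric per-tensor scheme in pure Python, not what TensorRT or OpenVINO do internally, but the scale-and-round principle is the same.

```python
def quantise_int8(weights):
    """Symmetric per-tensor INT8 quantisation: map the float range
    [-max|w|, +max|w|] onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from INT8 values."""
    return [v * scale for v in q]
```

The round trip loses at most half a quantisation step per weight, which is why INT8 usually costs little accuracy while cutting model size fourfold against FP32.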
Versioned rollouts, remote retraining, drift monitoring, and audit logging across thousands of distributed endpoints. Edge at scale is a fleet problem before it is a model problem.
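Drift monitoring across a fleet often reduces to comparing a binned production distribution against the training-time baseline. The Population Stability Index is one common statistic for this; the 0.2 retrain trigger below is a widely used rule of thumb, not a universal constant.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions,
    each given as bin fractions summing to 1. A common heuristic:
    PSI > 0.2 signals material drift worth investigating."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))
```

Run on both the image-statistics stream and the output-label distribution, the same statistic catches a drifting camera and a drifting defect mix alike.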
In vision, data strategy is half the engineering.
Mintaka designs labelling schemas and ontologies jointly with domain experts before scale labelling begins. Inter-annotator agreement is measured and improved before any labelling budget is committed at scale — because disagreement between annotators on the training set becomes noise the model inherits. Synthetic data and simulation are used where rare events must be represented and real examples cannot be collected in sufficient volume. Active learning loops prioritise the images that will most improve the model — the uncertain, the anomalous, the out-of-distribution — rather than a random stream. Data versioning is tied to every trained model, so any training run can be reproduced from the exact data state that produced it.
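Inter-annotator agreement is typically reported as Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. A minimal two-annotator implementation:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    labels = set(a) | set(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_chance == 1.0:          # degenerate: only one label in play
        return 1.0
    return (p_obs - p_chance) / (1.0 - p_chance)
```

A kappa well below the raw agreement rate is the warning sign: annotators who agree 75 percent of the time on a heavily imbalanced defect class may be agreeing mostly by chance, and that ambiguity goes straight into the training labels.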
A model validated on a clean test set will not survive the factory floor. Mintaka evaluation harnesses test against the conditions the system will actually face — occlusion, motion blur, variable lighting, sensor drift, seasonal change, adversarial presentations — and measure accuracy, calibration, and robustness separately.
Stratified evaluation by site, shift, line, camera, operator, and season — because a model that is 96 percent accurate in aggregate can be 72 percent accurate on the night shift on line 4 in winter, and that is the only number that matters for the night-shift supervisor. Failure-mode inventory — documented, reproducible, and regression-tested — catches regressions before they reach production. Robustness testing under occlusion, noise, blur, and distribution shift. Drift monitoring wired to production on both the image stream and the output distribution. Human override and escalation paths for low-confidence or high-stakes predictions.
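The stratified part of that harness is mechanically simple: group the evaluation records by a stratum key and compute accuracy per group, so an aggregate number cannot hide a failing slice. A minimal sketch, with illustrative record fields:

```python
from collections import defaultdict

def stratified_accuracy(records, key):
    """Accuracy per stratum (site, shift, line, camera, ...).
    Each record carries the stratum value plus `pred` and `truth`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        k = r[key]
        totals[k] += 1
        hits[k] += r["pred"] == r["truth"]
    return {k: hits[k] / totals[k] for k in totals}
```

The same grouping runs per camera, per season, or per operator; the release gate is the worst stratum, not the mean.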
Mintaka vision engagements span industrial, retail, clinical, operational, and back-office domains.
Use case, decision, cost of error, environment, capture feasibility, regulatory constraints. The deliverable is a feasibility report grounded in a site visit, not a desk study.
Capture stack, labelling schema, candidate model families, deployment target, evaluation harness. A capture-stack recommendation that changes the lighting or the lens is a common and valuable early deliverable.
Data pipelines, training, calibration, post-processing, integration into the consuming system.
Stratified evaluation, robustness tests, failure-mode inventory, pilot hold-out against live operations.
Edge, cloud, or hybrid rollout; fleet management; monitoring hooks; change-management for operators.
Active learning, retraining, drift response, governance reviews, capability extension as new classes and new sites come online.