Images Carry Signal. Models Extract It. Systems Act on It — Reliably, Under Real-World Conditions.
Computer vision is the enterprise discipline of turning pixels into decisions. It is not the discipline of winning an academic benchmark — plenty of models win benchmarks and fail on the factory floor. It is the discipline of building perception systems that work at 4 a.m. on the third shift, under flickering light, with the wrong lens, while the conveyor vibration is slightly different from yesterday, and the network has just lost the cloud uplink. Entiovi's Mintaka practice builds vision systems that survive those conditions — because that is the only kind of vision system that matters in production.
A large share of enterprise information is visual and has been systematically underused. Production lines generate images faster than any human inspection team can examine. Warehouses generate video that sits in dead storage. Instruments generate imagery that operators scroll past. Documents carry structure in their layout that OCR engines flatten. Shelves, drones, clinical scanners, security feeds, vehicles, meter displays — all of them carry signal that rarely makes it into a decision system.
Computer vision converts that signal into something an enterprise can act on. Detection, segmentation, classification, tracking, reading, counting — each with a measured accuracy, a measured latency, and an audit trail. Humans are expensive, inconsistent, and rate-limited at examining visual information. Well-engineered vision systems are fast, consistent, and auditable. The difference between a demo and a deployment is the rest of this page.
We build vision systems that work on the factory floor at 4 a.m., not only in the curated validation set.
Mintaka vision engagements cluster in a handful of well-defined application patterns. Each has a distinct engineering profile.
Vision model selection is a function of the use case, the latency envelope, and the deployment target — not of which architecture the team last worked with.
YOLO (v8, v9, v10, v11), Detectron2, SAM, Mask R-CNN, and successors — chosen for the specific accuracy-latency-size envelope the deployment demands.
ViT, DINOv2, Swin Transformer, EVA — where representation quality, transfer to new tasks, and scale matter more than raw inference speed.
CLIP-style and instruction-tuned models for captioning, grounding, zero-shot classification, visual question answering, and open-vocabulary detection. Mintaka uses these for defect classes where a fixed-label training set is impractical.
DINOv2, MAE, and related approaches where unlabelled imagery is plentiful and labelling is expensive — typically industrial inspection and clinical domains.
OpenCV, morphological operations, template matching, edge and contour methods — where determinism, cost, regulatory constraint, or explainability demand it. The best vision stacks frequently combine classical pre-processing with deep models.
Classical pre-processing feeding deep detectors; deep detectors followed by rule-based post-processing; deep models blended with explicit geometric constraints. The factory floor rarely rewards purity.
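That blend can be expressed as plain function composition. The sketch below is illustrative, not Mintaka's production code: `preprocess`, `detect`, and the geometric bounds are stand-ins for whatever classical stage, deep detector, and physical constraints a given line requires.

```python
def within_geometry(box, min_area=50.0, aspect_range=(0.5, 2.0)):
    """Rule-based post-filter: reject detections whose geometry is
    physically implausible for the part being inspected.
    Box format is (x1, y1, x2, y2); bounds are illustrative."""
    w, h = box[2] - box[0], box[3] - box[1]
    if w <= 0 or h <= 0:
        return False
    aspect = w / h
    return w * h >= min_area and aspect_range[0] <= aspect <= aspect_range[1]

def run_pipeline(image, preprocess, detect, postfilter):
    """Compose classical pre-processing, a deep detector, and
    rule-based post-processing as interchangeable callables."""
    return [box for box in detect(preprocess(image)) if postfilter(box)]
```

Because each stage is just a callable, a classical edge filter, a YOLO wrapper, or a template matcher can be swapped in without disturbing the rest of the stack.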
Vision is a stack, not a model. Most failures are not model failures — they are capture failures, pre-processing failures, or integration failures that no model can compensate for.
Camera choice, lens, lighting, exposure, shutter, sensor placement, synchronisation. The single highest-leverage engineering stage. A well-engineered capture stack makes the modelling problem tractable; a poor one makes it impossible.
Calibration, de-skew, colour normalisation, ROI extraction, augmentation policy. The bridge between the sensor and the model.
Detection, segmentation, classification, tracking — with explicit latency, throughput, and memory budgets per deployment target. Model and runtime chosen together.
Non-maximum suppression, temporal smoothing, tracking association, business-rule overlays, confidence thresholds tuned to the cost of a false positive versus a false negative.
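Two of those stages can be sketched in a few lines of pure Python. Greedy NMS keeps the highest-scoring box and drops overlapping duplicates; the decision threshold follows from the cost ratio when the model's score approximates the probability of a defect. The detection format and cost figures are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    discard any lower-scoring box that overlaps it too much."""
    dets = sorted(detections, key=lambda d: d["score"], reverse=True)
    kept = []
    for d in dets:
        if all(iou(d["box"], k["box"]) < iou_thresh for k in kept):
            kept.append(d)
    return kept

def decision_threshold(cost_fp, cost_fn):
    """Cost-aware cutoff: if the score approximates P(defect), flag
    whenever score * cost_fn > (1 - score) * cost_fp."""
    return cost_fp / (cost_fp + cost_fn)
```

When a missed defect costs nine times a false alarm, the cutoff drops to 0.1: the system flags far more aggressively than a symmetric 0.5 threshold would.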
The consuming system — MES, WMS, clinical record, ticketing platform, operator console — receives a structured, auditable output, not a raw prediction. Integration is a first-class engineering output.
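A minimal sketch of what "structured, auditable" means in practice: every prediction leaves the vision system stamped with the model version and capture time, serialised to a schema the consuming system can validate. The field names here are hypothetical, not a Mintaka schema.

```python
import dataclasses
import json

@dataclasses.dataclass
class InspectionResult:
    """One auditable vision output: traceable to a frame, a model
    version, and a capture timestamp. Field names are illustrative."""
    frame_id: str
    model_version: str
    label: str
    confidence: float
    captured_at: str  # ISO-8601 timestamp from the capture stack

    def to_json(self):
        return json.dumps(dataclasses.asdict(self))
```

The consuming MES or WMS parses this record, not a raw tensor, and the same record is what lands in the audit log.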
Human corrections, hard-negative capture, and active learning loops that make the model better in production under the conditions it actually encounters — not the conditions the validation set anticipated.
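One common active-learning primitive is least-confidence sampling: route the frames the model is least sure about to human labellers first. This is a minimal sketch of that selection step, assuming each prediction carries a class-probability vector.

```python
def select_for_labelling(predictions, budget=10):
    """Least-confidence sampling: rank frames by the model's top
    class probability, lowest first, and take as many as the
    labelling budget allows."""
    ranked = sorted(predictions, key=lambda p: max(p["probs"]))
    return ranked[:budget]
```

Under a fixed labelling budget, this spends human attention on the uncertain and anomalous frames rather than on a random stream the model already handles well.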
Vision deployments span a spectrum from pure cloud to pure edge, and the choice has real consequences.
For high-throughput batch and streaming workloads where connectivity and cost permit — back-office document processing, aerial imagery analysis, video analytics on recorded footage.
On-camera, on-gateway, or on-device — where latency, privacy, bandwidth, or connectivity mandate it. Factory floors, medical devices, retail cameras, drones, vehicles. Many Mintaka engagements run entirely at the edge because no image ever leaves the device.
Edge for low-latency decisioning, cloud for retraining, monitoring, and escalation. The dominant pattern for distributed operational deployments.
Quantisation (INT8, INT4), pruning, and distillation tuned to the target hardware. ONNX, TensorRT, OpenVINO, CoreML, and TFLite runtimes selected for each deployment profile.
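The arithmetic behind INT8 quantisation is simple enough to show directly. This is a toy symmetric per-tensor scheme in pure Python, not what TensorRT or OpenVINO do internally, but the scale-and-round principle is the same.

```python
def quantise_int8(weights):
    """Symmetric per-tensor INT8 quantisation: map the float range
    [-max|w|, +max|w|] onto integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    """Recover approximate float weights from INT8 values."""
    return [v * scale for v in q]
```

The round trip loses at most half a quantisation step per weight, which is why INT8 usually costs little accuracy while cutting model size fourfold against FP32.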
Versioned rollouts, remote retraining, drift monitoring, and audit logging across thousands of distributed endpoints. Edge at scale is a fleet problem before it is a model problem.
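Drift monitoring across a fleet often reduces to comparing a binned production distribution against the training-time baseline. The Population Stability Index is one common statistic for this; the 0.2 retrain trigger below is a widely used rule of thumb, not a universal constant.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions,
    each given as bin fractions summing to 1. A common heuristic:
    PSI > 0.2 signals material drift worth investigating."""
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))
```

Run on both the image-statistics stream and the output-label distribution, the same statistic catches a drifting camera and a drifting defect mix alike.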
In vision, data strategy is half the engineering.
Mintaka designs labelling schemas and ontologies jointly with domain experts before scale labelling begins. Inter-annotator agreement is measured and improved before any labelling budget is committed at scale — because disagreement between annotators on the training set becomes noise the model inherits. Synthetic data and simulation are used where rare events must be represented and real examples cannot be collected in sufficient volume. Active learning loops prioritise the images that will most improve the model — the uncertain, the anomalous, the out-of-distribution — rather than a random stream. Data versioning is tied to every trained model, so any training run can be reproduced from the exact data state that produced it.
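Inter-annotator agreement is typically reported as Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. A minimal two-annotator implementation:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    labels = set(a) | set(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_chance == 1.0:          # degenerate: only one label in play
        return 1.0
    return (p_obs - p_chance) / (1.0 - p_chance)
```

A kappa well below the raw agreement rate is the warning sign: annotators who agree 75 percent of the time on a heavily imbalanced defect class may be agreeing mostly by chance, and that ambiguity goes straight into the training labels.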
A model validated on a clean test set will not survive the factory floor. Mintaka evaluation harnesses test against the conditions the system will actually face — occlusion, motion blur, variable lighting, sensor drift, seasonal change, adversarial presentations — and measure accuracy, calibration, and robustness separately.
Stratified evaluation by site, shift, line, camera, operator, and season — because a model that is 96 percent accurate in aggregate can be 72 percent accurate on the night shift on line 4 in winter, and that is the only number that matters for the night-shift supervisor. Failure-mode inventory — documented, reproducible, and regression-tested — catches regressions before they reach production. Robustness testing under occlusion, noise, blur, and distribution shift. Drift monitoring wired to production on both the image stream and the output distribution. Human override and escalation paths for low-confidence or high-stakes predictions.
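The stratified part of that harness is mechanically simple: group the evaluation records by a stratum key and compute accuracy per group, so an aggregate number cannot hide a failing slice. A minimal sketch, with illustrative record fields:

```python
from collections import defaultdict

def stratified_accuracy(records, key):
    """Accuracy per stratum (site, shift, line, camera, ...).
    Each record carries the stratum value plus `pred` and `truth`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        k = r[key]
        totals[k] += 1
        hits[k] += r["pred"] == r["truth"]
    return {k: hits[k] / totals[k] for k in totals}
```

The same grouping runs per camera, per season, or per operator; the release gate is the worst stratum, not the mean.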
Mintaka vision engagements span industrial, retail, clinical, operational, and back-office domains.
Use case, decision, cost of error, environment, capture feasibility, regulatory constraints. The deliverable is a feasibility report grounded in a site visit, not a desk study.
Capture stack, labelling schema, candidate model families, deployment target, evaluation harness. A capture-stack recommendation that changes the lighting or the lens is a common and valuable early deliverable.
Data pipelines, training, calibration, post-processing, integration into the consuming system.
Stratified evaluation, robustness tests, failure-mode inventory, pilot hold-out against live operations.
Edge, cloud, or hybrid rollout; fleet management; monitoring hooks; change-management for operators.
Active learning, retraining, drift response, governance reviews, capability extension as new classes and new sites come online.