
A deep comparative study of three state-of-the-art segmentation architectures applied to a real mixed-reality dataset — with pipeline design, training tricks, and honest benchmark results.

Abdoul-Wahabou H. Tiambou
MSc Computer Science · AI/ML & Computer Vision
Introduction
Mixed reality devices like the Microsoft HoloLens 2 open a fascinating frontier: they can perceive the physical world through depth sensors and RGB cameras, yet they need to understand that world — not just see it. For our research at LIARA lab (Laboratoire d'Intelligence Ambiante pour la Reconnaissance d'Activités, UQAC), the key question was: can we automatically segment every object in a room, reconstruct it in 3D, and overlay digital information on top in real time?
The first critical step is instance segmentation — distinguishing not just what is in the scene (semantic segmentation) but which individual instance of each class each pixel belongs to. This is harder, computationally heavier, and far more useful for AR overlays where each physical object needs its own digital twin.
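To make the distinction concrete, here is a toy sketch in plain Python. The six-pixel row and the helper name `instances_from_semantic` are made up for illustration; real models predict instance masks directly rather than deriving them from a semantic map.

```python
# Toy 1x6 image row containing two mugs separated by background.
# Semantic segmentation: every pixel gets a class id (0 = background, 1 = mug).
semantic_row = [1, 1, 0, 0, 1, 1]


def instances_from_semantic(row):
    """Give each contiguous run of a non-background class its own instance id.

    This 1-D connected-components pass is only an illustration: touching or
    overlapping objects of the same class cannot be separated this way, which
    is why dedicated instance segmentation models are needed.
    """
    instance_row = []
    next_id = 0
    previous_class = 0
    for cls in row:
        if cls == 0:
            instance_row.append(0)
        else:
            if cls != previous_class:
                next_id += 1
            instance_row.append(next_id)
        previous_class = cls
    return instance_row


instance_row = instances_from_semantic(semantic_row)
```

The semantic row says "mug" four times; the instance row separates the left mug (id 1) from the right mug (id 2), which is what an AR overlay needs in order to attach one digital twin per object.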
🔬 Research context
This article is drawn from applied research conducted during my MSc at UQAC (2024–2025) within the LIARA laboratory. The work was submitted to UCAmI 2025 (17th Int. Conf. on Ubiquitous Computing and Ambient Intelligence, Springer LNNS). All experiments were run on our custom HoloLens 2 dataset.
We evaluated three leading architectures: YOLOv8-seg (the fastest), Mask R-CNN with a ResNet-50-FPN backbone (the established baseline), and SAM (Segment Anything Model by Meta, the most recent zero/few-shot approach). Each has a radically different design philosophy, and the results — spoiler: SAM does not automatically win — were genuinely surprising.
Dataset
No public dataset captured with a HoloLens 2 existed at the scale we needed. We collected 1,400+ RGB frames across 8 indoor scenes (office, kitchen, corridor, lab benches…), then annotated 200+ object instances using polygon masks in CVAT. The annotation pipeline was the most labour-intensive part — roughly 3 weeks of part-time work for two people.
Dataset summary: total frames (RGB @ 1920×1080), annotated masks (polygon instances), object classes (indoor objects), scenes (unique environments).
The dataset exhibits three challenges not present in standard benchmarks like COCO: (1) near-infrared sensor noise from the depth camera bleeding into colour channels, (2) motion blur from users moving their heads naturally while wearing the device, and (3) extreme occlusion since the device captures from a first-person perspective where hands and arms frequently occlude objects.
⚠️ Data imbalance
Eight classes (chair, table, laptop, mug, bottle, keyboard, monitor, book) account for 71% of all instances. Rare classes like 'microscope' or 'projector' had fewer than 40 training samples. We addressed this with targeted augmentation and class-weighted loss.
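The article does not state which weighting scheme was used, so the following is only a sketch of one common choice, inverse-frequency weighting. The function name and the instance counts are hypothetical, not the real dataset statistics.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class_name: weight} with weight = N / (K * count).

    N is the total number of instances and K the number of classes, so a
    perfectly balanced dataset gets weight 1.0 for every class.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * count) for name, count in counts.items()}


# Hypothetical instance labels, chosen only to show the effect of imbalance.
example_labels = ["chair"] * 6 + ["mug"] * 3 + ["microscope"] * 1
weights = inverse_frequency_weights(example_labels)
```

Rare classes receive larger weights, so each of their few samples contributes more to the loss than a sample from a dominant class.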
The train/val/test split was 70/15/15, made at the scene level rather than the frame level, so the test set contains entirely unseen environments. This makes the benchmark more realistic, and harder, than a random per-frame split would be.
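A scene-level split can be sketched as follows. The `(scene_id, frame_name)` layout, the function name, and the one-scene-each choice for validation and test are assumptions for illustration, not the exact procedure used in the study.

```python
import random


def split_by_scene(frames, val_scenes=1, test_scenes=1, seed=0):
    """Split (scene_id, frame_name) pairs so that no scene spans two subsets."""
    scenes = sorted({scene for scene, _ in frames})
    random.Random(seed).shuffle(scenes)
    test = set(scenes[:test_scenes])
    val = set(scenes[test_scenes:test_scenes + val_scenes])
    split = {"train": [], "val": [], "test": []}
    for scene, name in frames:
        if scene in test:
            split["test"].append((scene, name))
        elif scene in val:
            split["val"].append((scene, name))
        else:
            split["train"].append((scene, name))
    return split


# Hypothetical frame list: 8 scenes with 5 frames each.
frames = [(f"scene_{s}", f"frame_{s}_{i}.png") for s in range(8) for i in range(5)]
split = split_by_scene(frames)
```

Because whole scenes are assigned to a subset, near-duplicate consecutive frames can never leak from training into the test set.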
Architectures
Before comparing numbers, it is worth understanding why the three architectures are built so differently — because the design choices directly explain the benchmark results.
YOLOv8 (Ultralytics, 2023) extends the YOLO detection family with a lightweight segmentation head. A CSPDarknet backbone extracts features, a Path Aggregation Network (PANet) fuses multi-scale representations, and a prototype-based mask head predicts instance masks in a single forward pass. The model has ~11M parameters in the -n variant and can run at 80+ FPS on a mid-range GPU.
```python
from ultralytics import YOLO

# Load YOLOv8n-seg pretrained on COCO
model = YOLO("yolov8n-seg.pt")

results = model.train(
    data="hololens2.yaml",
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=1e-3,
    lrf=1e-2,
    mosaic=0.8,
    mixup=0.1,
    copy_paste=0.3,
    degrees=10.0,
    translate=0.1,
    scale=0.5,
    cls=0.5,
    device="cuda:0",
    workers=8,
    project="runs/segment",
    name="yolov8n_holo",
)
```

Mask R-CNN (He et al., 2017) remains the gold standard for instance segmentation in constrained settings. It adds a small FCN mask branch in parallel to the box-regression head of Faster R-CNN. The Region Proposal Network generates candidate boxes, RoIAlign extracts fixed-size feature maps, and the mask head generates a 28×28 binary mask per proposal. This two-stage approach sacrifices speed but gains significant robustness on small and occluded objects.
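The core idea of RoIAlign, sampling the feature map at fractional coordinates instead of rounding them, can be sketched in plain Python. This is a simplified illustration rather than the library implementation: it takes a single sample at the centre of each bin (the original method aggregates several samples per bin), works on one channel, and assumes the box is already expressed in feature-map units.

```python
def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D list `feature` at fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, box, out_size):
    """Pool box = (y_min, x_min, y_max, x_max) into an out_size x out_size grid.

    Simplification: one bilinear sample at the centre of each bin.
    """
    y_min, x_min, y_max, x_max = box
    bin_h = (y_max - y_min) / out_size
    bin_w = (x_max - x_min) / out_size
    return [
        [
            bilinear_sample(feature, y_min + (i + 0.5) * bin_h, x_min + (j + 0.5) * bin_w)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Tiny 2x2 feature map; the box covers the region between the four cell centres.
pooled = roi_align([[0.0, 1.0], [2.0, 3.0]], (0.0, 0.0, 1.0, 1.0), 2)
```

Avoiding coordinate rounding is what keeps the pooled features aligned with the original pixels, which matters most for small and thin objects.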
Segment Anything Model (Kirillov et al., Meta AI, 2023) is a prompt-based foundation model trained on 1.1 billion masks. SAM does not learn class-specific detectors; instead it accepts a point, a box, or a coarse mask as a prompt and returns a mask (text prompts were explored in the paper but are not supported by the released model). For our use case, we used the automatic mask generator mode. SAM ViT-H has 636M parameters and is orders of magnitude larger than the other two models.
💡 SAM fine-tuning with SAM-2
Meta released SAM 2 (2024) which supports video and can be fine-tuned. We experimented with it briefly but the full fine-tuning pipeline was outside our compute budget. The numbers reported here use the original SAM ViT-H in automatic mode without fine-tuning.
Training
Training on a small, domain-shifted dataset demands careful regularisation. We applied a multi-stage augmentation pipeline designed specifically for the HoloLens 2 failure modes.
| Augmentation | YOLO | Mask R-CNN | SAM (auto) |
|---|---|---|---|
| Horizontal flip | ✓ | ✓ | N/A |
| Colour jitter (HSV) | ✓ | ✓ | N/A |
| Gaussian blur (σ≤2) | ✓ | ✓ | N/A |
| Mosaic (4-image) | ✓ | ✗ | N/A |
| Copy-paste | ✓ | ✗ | N/A |
| Random scale (0.5–1.5×) | ✓ | ✓ | N/A |
| Synthetic motion blur | ✓ | ✓ | N/A |
| MixUp (α=0.1) | ✓ | ✗ | N/A |
| Simulated IR noise | ✓ | ✓ | N/A |
Augmentation strategies applied per model. SAM was used in zero-shot mode.
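The synthetic motion blur row can be illustrated with a minimal horizontal box blur on a grayscale image stored as nested lists. This is a sketch under simplifying assumptions (purely horizontal motion, uniform kernel, clamped edges); the function name and kernel size are hypothetical, and the actual augmentation used in the study may differ.

```python
def horizontal_motion_blur(image, kernel_size=5):
    """Average each pixel with its horizontal neighbours (edges are clamped)."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    half = kernel_size // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            # Clamp indices so pixels near the border reuse the edge value.
            window = [row[min(max(x + k, 0), width - 1)] for k in range(-half, half + 1)]
            new_row.append(sum(window) / kernel_size)
        blurred.append(new_row)
    return blurred


# A single bright pixel is smeared across its neighbours.
smeared = horizontal_motion_blur([[0, 0, 9, 0, 0]], kernel_size=3)
```

A longer kernel simulates a faster head movement; rotating the kernel would simulate blur in other directions.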
For Mask R-CNN, we used a step LR schedule (warm-up for 1k iterations, then decay at epochs 60 and 80), SGD with momentum=0.9, weight_decay=1e-4, and a base learning rate of 0.005. Training for 100 epochs on a single NVIDIA T4 took approximately 4 hours.
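The schedule described above can be sketched as a pure function of the iteration counter. The base learning rate, warm-up length, and milestones come from the text; the linear warm-up shape, its starting factor, the decay factor of 0.1, and the number of iterations per epoch are assumptions made for illustration.

```python
def learning_rate(iteration, iters_per_epoch, base_lr=0.005,
                  warmup_iters=1000, milestones=(60, 80), gamma=0.1,
                  warmup_start_factor=0.001):
    """Linear warm-up followed by step decay at the given epoch milestones."""
    if iteration < warmup_iters:
        # Ramp linearly from warmup_start_factor * base_lr up to base_lr.
        alpha = iteration / warmup_iters
        factor = warmup_start_factor * (1 - alpha) + alpha
        return base_lr * factor
    epoch = iteration // iters_per_epoch
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** passed


# Hypothetical epoch length of 100 iterations.
lr_start = learning_rate(0, iters_per_epoch=100)
lr_after_warmup = learning_rate(1000, iters_per_epoch=100)
lr_epoch_60 = learning_rate(6000, iters_per_epoch=100)
lr_epoch_80 = learning_rate(8000, iters_per_epoch=100)
```

The warm-up protects the pretrained backbone from large early updates, and the two decay steps let the model settle once the loss plateaus.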
Results
All three models were evaluated on the held-out test split using mask AP at IoU=0.50:0.95 (COCO standard), mask AP50, inference time, and GPU memory.
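As a reminder of what these metrics build on, the sketch below computes the IoU between two binary masks and counts how many of the ten COCO IoU thresholds (0.50 to 0.95 in steps of 0.05) a prediction would satisfy. It only illustrates the matching criterion; the full AP computation also ranks predictions by confidence and integrates the precision-recall curve, which is omitted here. The helper names are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 0.0


# The ten thresholds averaged by mAP@50:95; mAP@50 uses only the first one.
COCO_IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]


def thresholds_passed(iou):
    """Number of COCO IoU thresholds at which this match counts as a true positive."""
    return sum(1 for threshold in COCO_IOU_THRESHOLDS if iou >= threshold)


predicted = [[1, 1, 0], [1, 1, 0]]
ground_truth = [[1, 1, 1], [1, 1, 1]]
iou = mask_iou(predicted, ground_truth)  # 4 shared pixels out of 6
```

A mask that covers two thirds of the object counts as a hit at IoU 0.50 but fails the stricter thresholds, which is why mAP@50:95 is always lower than mAP@50 and rewards precise boundaries.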
Headline metrics: YOLOv8n-seg mAP@50 (best accuracy), Mask R-CNN mAP@50 (most stable), SAM (auto) mAP@50 (zero-shot baseline), YOLOv8 FPS (RTX 4060, 32GB RAM, 640px).
| Model | mAP@50 | mAP@50:95 | FPS (GPU) | Params | GPU Mem |
|---|---|---|---|---|---|
| YOLOv8n-seg | 79.4% | 52.3% | 94 | 11M | 1.8 GB |
| YOLOv8m-seg | 80.1% | 54.7% | 61 | 27M | 3.2 GB |
| Mask R-CNN R50 | 72.1% | 44.8% | 22 | 44M | 4.1 GB |
| SAM ViT-H | 41.8% | 29.3% | 4.2 | 636M | 16+ GB |
Benchmark results on the HoloLens 2 test set. FPS measured with batch=1, RTX 4060, 32GB RAM.
🔬 Why does SAM underperform?
SAM was trained on natural web images and its automatic mode tends to over-segment: it creates dozens of tiny fragments per object rather than clean instance masks. It also has no class awareness — it cannot distinguish a 'mug' from a 'bowl'. In our use case where we need both precise instance boundaries AND class labels, SAM in zero-shot mode is simply not competitive.
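To illustrate what over-segmentation looks like in practice, the sketch below filters a list of mask records by relative area. It assumes each record is a dict with an `area` key in pixels, as returned by SAM's automatic mask generator; the threshold, the function name, and the example records are hypothetical. This post-processing was not part of the benchmark reported here.

```python
def drop_small_fragments(masks, image_area, min_area_fraction=0.005):
    """Keep masks covering at least min_area_fraction of the image, largest first."""
    kept = [m for m in masks if m["area"] / image_area >= min_area_fraction]
    return sorted(kept, key=lambda m: m["area"], reverse=True)


# Hypothetical records for a 1920x1080 frame: one object and two tiny fragments.
image_area = 1920 * 1080
example_masks = [
    {"area": 900},      # fragment, e.g. a specular highlight
    {"area": 150_000},  # plausible whole object
    {"area": 2_500},    # fragment, e.g. a logo on the object
]
filtered = drop_small_fragments(example_masks, image_area)
```

Even with such filtering the surviving masks carry no class label, so a separate classifier would still be needed before the output could be compared fairly with YOLOv8 or Mask R-CNN.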
YOLOv8's lead on mAP@50 is most plausibly explained by its aggressive augmentation pipeline (mosaic, copy-paste), which improves recall on occluded objects. Mask R-CNN's RoIAlign, in turn, gives it noticeably better mask boundary precision on thin objects like pens and cables.
Pipeline
The segmentation model is one component in a larger pipeline. The full system flow is: HoloLens 2 RGB stream → YOLO segmentation → depth-aligned bounding volume extraction → OccupancyNet 3D reconstruction → mesh export → HoloLens MRTK overlay.
RGB & Depth Capture
HoloLens 2 streams 1080p RGB at 30 FPS alongside a 512×512 depth map from the time-of-flight sensor.
Instance Segmentation (YOLOv8)
YOLOv8m-seg runs at 61 FPS on a companion GPU server. Each frame returns bounding boxes, class labels, and binary masks.
Depth Masking & Point Cloud
For each instance mask, we back-project the depth values to extract a partial 3D point cloud for that object.
3D Reconstruction (OccupancyNet)
The point cloud is fed to OccupancyNet which predicts a continuous occupancy field and extracts the mesh via Marching Cubes.
HoloLens MRTK Overlay
Reconstructed meshes are streamed back to the HoloLens 2 and rendered as holographic overlays using Unity + MRTK.
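The depth masking step can be sketched with the standard pinhole back-projection. The sketch assumes the depth map is already registered to the RGB frame and resampled to the mask resolution, that depth is in metres with 0 meaning "no measurement", and that the intrinsics `fx`, `fy`, `cx`, `cy` are known; the values used below are placeholders, not HoloLens 2 calibration data.

```python
def back_project(depth, mask, fx, fy, cx, cy):
    """Return camera-space (X, Y, Z) points for masked pixels with valid depth."""
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if inside and z > 0:
                # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny 2x2 example with placeholder intrinsics.
depth = [[1.0, 2.0], [0.0, 4.0]]
mask = [[1, 0], [1, 1]]
cloud = back_project(depth, mask, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Only pixels inside the instance mask with a valid depth reading contribute, so the result is a partial point cloud of the visible surface of that one object.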
⚠️ Latency challenge
The full round-trip latency (capture → GPU server → HoloLens display) is 180–240ms over Wi-Fi 6. We are investigating edge deployment using YOLO on the device's NPU to remove the network hop.
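A quick back-of-the-envelope calculation puts these numbers in perspective, using the 30 FPS capture rate and the 61 FPS YOLOv8m-seg throughput quoted earlier. The helper names are hypothetical.

```python
def frames_of_delay(latency_ms, capture_fps=30):
    """How many captured frames elapse during one round trip."""
    return latency_ms * capture_fps / 1000.0


def per_frame_ms(fps):
    """Average processing time per frame implied by a throughput figure."""
    return 1000.0 / fps


best_case = frames_of_delay(180)   # 5.4 frames behind the live view
worst_case = frames_of_delay(240)  # 7.2 frames behind the live view
inference_ms = per_frame_ms(61)    # about 16.4 ms per frame for YOLOv8m-seg
```

Segmentation inference accounts for roughly 16 ms of the 180 to 240 ms round trip, so most of the delay comes from the other stages and the network hop, which is the motivation for moving inference onto the device.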
Takeaways
“The most dangerous metric is validation accuracy on in-distribution data. Always hold out entire scenes, sessions, or domains as your test set.”
Conclusion
Instance segmentation on custom, domain-shifted data is a solvable problem — but it requires attention to every link in the chain. YOLOv8 is the right choice for real-time AR applications today; Mask R-CNN remains relevant when boundary precision matters more than speed; SAM's role is evolving toward fine-tuned specialisation.
Next steps for this research: (1) expand the dataset to 5,000+ annotated frames, (2) explore SAM 2 fine-tuning on our domain, (3) benchmark YOLOv10 and RT-DETR as drop-in replacements, and (4) push for on-device inference on the HoloLens 2's integrated NPU.