Computer Vision · Deep Learning · Research · 3D Reconstruction

Instance Segmentation at Scale: YOLO, Mask R-CNN & SAM on HoloLens 2 Data

A deep comparative study of three state-of-the-art segmentation architectures applied to a real mixed-reality dataset — with pipeline design, training tricks, and honest benchmark results.

Abdoul-Wahabou H. Tiambou
MSc Computer Science · AI/ML & Computer Vision

June 15, 2025 · Updated July 1, 2025 · 18 min read

Tags

YOLO · Mask R-CNN · SAM · Instance Segmentation · HoloLens 2 · PyTorch

Introduction

Why instance segmentation in mixed reality?

Mixed reality devices like the Microsoft HoloLens 2 open a fascinating frontier: they can perceive the physical world through depth sensors and RGB cameras, yet they need to understand that world — not just see it. For our research at LIARA lab (Laboratoire d'Intelligence Ambiante pour la Reconnaissance d'Activités, UQAC), the key question was: can we automatically segment every object in a room, reconstruct it in 3D, and overlay digital information on top in real time?

The first critical step is instance segmentation — distinguishing not just what is in the scene (semantic segmentation) but which individual instance of each class each pixel belongs to. This is harder, computationally heavier, and far more useful for AR overlays where each physical object needs its own digital twin.

Research context

This article is drawn from applied research conducted during my MSc at UQAC (2024–2025) within the LIARA laboratory. The work was submitted to UCAmI 2025 (17th International Conference on Ubiquitous Computing and Ambient Intelligence, Springer LNNS). All experiments were run on our custom HoloLens 2 dataset.

We evaluated three leading architectures: YOLOv8-seg (the fastest), Mask R-CNN with a ResNet-50-FPN backbone (the established baseline), and SAM (Segment Anything Model by Meta, the most recent zero/few-shot approach). Each has a radically different design philosophy, and the results — spoiler: SAM does not automatically win — were genuinely surprising.

Dataset

Building a custom HoloLens 2 segmentation dataset

No public dataset captured with a HoloLens 2 existed at the scale we needed. We collected 1,400+ RGB frames across 8 indoor scenes (office, kitchen, corridor, lab benches…), then annotated 200+ object instances using polygon masks in CVAT. The annotation pipeline was the most labour-intensive part — roughly 3 weeks of part-time work for two people.

Total frames: 1,400+ (RGB @ 1920×1080)
Annotated masks: 200+ (polygon instances)
Object classes: indoor objects
Scenes: 8 (unique environments)

The dataset exhibits three challenges not present in standard benchmarks like COCO: (1) near-infrared sensor noise from the depth camera bleeding into colour channels, (2) motion blur from users moving their heads naturally while wearing the device, and (3) extreme occlusion since the device captures from a first-person perspective where hands and arms frequently occlude objects.

Data imbalance

Eight classes (chair, table, laptop, mug, bottle, keyboard, monitor, book) account for 71% of all instances. Rare classes like 'microscope' or 'projector' had fewer than 40 training samples. We addressed this with targeted augmentation and class-weighted loss.
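The article does not say which weighting scheme was used for the class-weighted loss. As a minimal sketch, assuming inverse-frequency weights normalised to a mean of 1.0 (one common choice among several), the weights could be derived like this. The label counts below are toy values, not the real dataset statistics, and `inverse_frequency_weights` is a hypothetical helper name.

```python
from collections import Counter


def inverse_frequency_weights(labels, smoothing=1.0):
    """Return one weight per class, inversely proportional to its frequency.

    Weights are normalised so that their mean over classes is 1.0.
    """
    counts = Counter(labels)
    raw = {cls: 1.0 / (n + smoothing) for cls, n in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {cls: w / mean_raw for cls, w in raw.items()}


# Toy label list (hypothetical counts, not the real dataset statistics)
labels = ["chair"] * 60 + ["mug"] * 30 + ["microscope"] * 10
weights = inverse_frequency_weights(labels)
for cls, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls:12s} {w:.2f}")
```

Rare classes end up with the largest weights, so a misclassified 'microscope' costs more than a misclassified 'chair'.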

The train/val/test split was 70/15/15, made at the scene level rather than the frame level, so the test set contains only environments the models never saw during training. This makes the benchmark more realistic, and harder, than a random frame-level split would.
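A scene-level split can be sketched in a few lines. This is an illustration under assumptions, not the script we used: `split_by_scene` and the frame index are hypothetical, and with only a handful of scenes the resulting frame ratios can only approximate 70/15/15.

```python
import random


def split_by_scene(frames, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign whole scenes to train/val/test so no scene spans two splits.

    `frames` is a list of (frame_id, scene_id) pairs.
    """
    scenes = sorted({scene for _, scene in frames})
    random.Random(seed).shuffle(scenes)
    n_test = max(1, round(len(scenes) * test_frac))
    n_val = max(1, round(len(scenes) * val_frac))
    test_scenes = set(scenes[:n_test])
    val_scenes = set(scenes[n_test:n_test + n_val])
    split = {"train": [], "val": [], "test": []}
    for frame_id, scene in frames:
        if scene in test_scenes:
            split["test"].append(frame_id)
        elif scene in val_scenes:
            split["val"].append(frame_id)
        else:
            split["train"].append(frame_id)
    return split


# Hypothetical frame index: 8 scenes with 10 frames each
frames = [(f"s{s}_f{i}", f"scene{s}") for s in range(8) for i in range(10)]
split = split_by_scene(frames)
print({name: len(ids) for name, ids in split.items()})
```

The key property is that the sets of scenes behind the three splits are disjoint, which a random frame-level split cannot guarantee.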

Architectures

YOLOv8-seg vs Mask R-CNN vs SAM — a design perspective

Before comparing numbers, it is worth understanding why the three architectures are built so differently — because the design choices directly explain the benchmark results.

YOLOv8-seg: single-stage speed

YOLOv8 (Ultralytics, 2023) extends the YOLO detection family with a lightweight segmentation head. A CSPDarknet-style backbone extracts features, a Path Aggregation Network (PANet) neck fuses multi-scale representations, and a prototype-based mask head predicts instance masks in a single forward pass. The -n variant has roughly 3.4M parameters according to the Ultralytics model card, and it can run at 80+ FPS on a mid-range GPU.
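To make "prototype-based" concrete, here is a minimal pure-Python sketch of the idea: the network predicts a small set of shared prototype maps plus one coefficient vector per detection, and each instance mask is a sigmoid of their linear combination. The toy prototypes and the `assemble_mask` helper are illustrative assumptions. The real head works on tensors and also crops each mask to its predicted box, which is omitted here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def assemble_mask(prototypes, coeffs, threshold=0.5):
    """Combine K prototype maps (K x H x W) with K coefficients into a binary mask."""
    height, width = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            logit = sum(c * p[y][x] for c, p in zip(coeffs, prototypes))
            row.append(1 if sigmoid(logit) > threshold else 0)
        mask.append(row)
    return mask


# Two toy 2x3 prototypes: one responds on the left, one on the right
prototypes = [
    [[4.0, 4.0, -4.0], [4.0, 4.0, -4.0]],
    [[-4.0, -4.0, 4.0], [-4.0, -4.0, 4.0]],
]
left_instance = assemble_mask(prototypes, coeffs=[1.0, 0.0])
right_instance = assemble_mask(prototypes, coeffs=[0.0, 1.0])
print(left_instance)
print(right_instance)
```

Because the prototypes are shared across all detections, the per-instance cost is only a small dot product, which is a large part of why the single-stage design is fast.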

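train_yolo.py · python

from ultralytics import YOLO

# Load YOLOv8n-seg weights pretrained on COCO
model = YOLO("yolov8n-seg.pt")

# Fine-tune on the HoloLens 2 dataset described by hololens2.yaml
results = model.train(
    data="hololens2.yaml",  # dataset config: image paths and class names
    epochs=100,
    imgsz=640,
    batch=16,
    lr0=1e-3,        # initial learning rate
    lrf=1e-2,        # final learning rate as a fraction of lr0
    mosaic=0.8,      # probability of 4-image mosaic
    mixup=0.1,       # probability of MixUp
    copy_paste=0.3,  # probability of copy-paste (segmentation only)
    degrees=10.0,    # random rotation range, in degrees
    translate=0.1,   # random translation, as a fraction of image size
    scale=0.5,       # random scale gain (roughly 0.5x to 1.5x)
    cls=0.5,         # classification loss gain
    device="cuda:0",
    workers=8,
    project="runs/segment",
    name="yolov8n_holo",
)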

Mask R-CNN: two-stage robustness

Mask R-CNN (He et al., 2017) remains the gold standard for instance segmentation in constrained settings. It adds a small FCN mask branch in parallel to the box-regression head of Faster R-CNN. The Region Proposal Network generates candidate boxes, RoIAlign extracts fixed-size feature maps, and the mask head generates a 28×28 binary mask per proposal. This two-stage approach sacrifices speed but gains significant robustness on small and occluded objects.
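The core of RoIAlign is bilinear sampling at fractional coordinates, which avoids the quantisation of older RoI pooling. The sketch below is a deliberate simplification under stated assumptions: it takes one sample at the centre of each output bin, works on a single-channel feature map stored as nested lists, and ignores the pixel-offset conventions of real implementations. `bilinear_sample` and `roi_align` are illustrative names, not a library API.

```python
def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, box, out_size=2):
    """Pool a box (y1, x1, y2, x2 in feature-map coordinates) to out_size x out_size.

    Simplification: one sample at the centre of each output bin.
    """
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    return [
        [bilinear_sample(feature, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
         for j in range(out_size)]
        for i in range(out_size)
    ]


feature = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
pooled = roi_align(feature, box=(0.0, 0.0, 2.0, 2.0))
print(pooled)
```

Each pooled value blends the four neighbouring feature cells, so a box that moves by a fraction of a cell produces a smoothly changing output rather than a jump.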

SAM: zero-shot foundation model

Segment Anything Model (Kirillov et al., Meta AI, 2023) is a promptable foundation model trained on 1.1 billion masks. SAM does not learn class-specific detectors. Instead, it accepts a point, a box, or a coarse mask as a prompt and returns a mask (text prompts were explored in the paper, but the released model does not support them). For our use case, we used the automatic mask generator mode, which prompts the model with a regular grid of points. SAM ViT-H has 636M parameters, one to two orders of magnitude more than the other two models.

SAM fine-tuning with SAM 2

Meta released SAM 2 (2024), which supports video and can be fine-tuned. We experimented with it briefly, but the full fine-tuning pipeline was outside our compute budget. The numbers reported here use the original SAM ViT-H in automatic mode without fine-tuning.

Training

Training strategy and augmentation pipeline

Training on a small, domain-shifted dataset demands careful regularisation. We applied a multi-stage augmentation pipeline designed specifically for the HoloLens 2 failure modes.

Augmentation	YOLO	Mask R-CNN	SAM (auto)
Horizontal flip	✓	✓	N/A
Colour jitter (HSV)	✓	✓	N/A
Gaussian blur (σ≤2)	✓	✓	N/A
Mosaic (4-image)	✓	✗	N/A
Copy-paste	✓	✗	N/A
Random scale (0.5–1.5×)	✓	✓	N/A
Synthetic motion blur	✓	✓	N/A
MixUp (p=0.1)	✓	✗	N/A
Simulated IR noise	✓	✓	N/A

Augmentation strategies applied per model. SAM was used in zero-shot mode.
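As an illustration of the "synthetic motion blur" row, here is a minimal sketch assuming a purely horizontal blur with a 1 × k averaging kernel on a greyscale image. Our actual augmentation code is not shown in this article, so the kernel shape, the border handling, and the function name are assumptions.

```python
def horizontal_motion_blur(image, kernel_size=3):
    """Blur a greyscale image (rows of floats) with a 1 x k averaging kernel.

    kernel_size is assumed to be odd. Borders are handled by clamping
    indices to the valid range.
    """
    half = kernel_size // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            window = [row[min(max(x + d, 0), width - 1)]
                      for d in range(-half, half + 1)]
            new_row.append(sum(window) / len(window))
        blurred.append(new_row)
    return blurred


# A single bright pixel gets smeared along the row
image = [[0.0, 0.0, 9.0, 0.0, 0.0]]
blurred = horizontal_motion_blur(image)
print(blurred)
```

A more faithful simulation of head motion would also randomise the blur direction and length per frame.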

For Mask R-CNN, we used SGD (momentum 0.9, weight decay 1e-4, base learning rate 0.005) with a step schedule: warm-up over the first 1k iterations, then decay at epochs 60 and 80. Training for 100 epochs on a single NVIDIA T4 took approximately 4 hours.
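The schedule can be written as a single function of the iteration index. The base learning rate, warm-up length, and milestones come from the text above. The decay factor `gamma=0.1`, the linear warm-up shape, its starting factor, and `iters_per_epoch=100` are assumptions made for this sketch.

```python
def mask_rcnn_lr(iteration, iters_per_epoch, base_lr=0.005,
                 warmup_iters=1000, milestones=(60, 80), gamma=0.1,
                 warmup_factor=0.001):
    """Learning rate at a given iteration: linear warm-up, then step decay."""
    epoch = iteration // iters_per_epoch
    # Multiply by gamma once for every milestone epoch already reached
    lr = base_lr * gamma ** sum(epoch >= m for m in milestones)
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        lr *= warmup_factor * (1 - alpha) + alpha
    return lr


# Hypothetical run with 100 iterations per epoch
for it in (0, 500, 1000, 6000, 8000):
    print(f"iteration {it:5d}: lr = {mask_rcnn_lr(it, iters_per_epoch=100):.6f}")
```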

Results

Benchmark results and critical analysis

All three models were evaluated on the held-out test split using mask AP at IoU=0.50:0.95 (COCO standard), mask AP50, inference time, and GPU memory.
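For readers less familiar with the COCO protocol: a prediction counts as a true positive at a given threshold when its mask IoU with the matched ground truth reaches that threshold, and mAP@50:95 averages AP over ten thresholds from 0.50 to 0.95. The sketch below shows only the IoU and threshold part on toy masks. It is not a full AP implementation, and the helper names are illustrative.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as rows of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


# COCO thresholds: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(iou):
    """Thresholds at which a prediction with this IoU is a true positive."""
    return [t for t in COCO_IOU_THRESHOLDS if iou >= t]


ground_truth = [[1, 1, 1, 1], [1, 1, 1, 1]]
prediction = [[1, 1, 1, 0], [1, 1, 1, 0]]
iou = mask_iou(ground_truth, prediction)
print(iou, thresholds_passed(iou))
```

This toy prediction has an IoU of 0.75, so it counts at mAP@50 but only at six of the ten thresholds that make up mAP@50:95. That gap is why boundary quality shows up far more strongly in the second metric.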

YOLOv8n-seg mAP@50: 79.4% (best accuracy)
Mask R-CNN mAP@50: 72.1% (most stable)
SAM (auto) mAP@50: 41.8% (zero-shot baseline)
YOLOv8n-seg FPS: 94 (RTX 4060, 32GB RAM, 640px)

Model	mAP@50	mAP@50:95	FPS (GPU)	Params	GPU Mem
YOLOv8n-seg	79.4%	52.3%	94	3.4M	1.8 GB
YOLOv8m-seg	80.1%	54.7%	61	27M	3.2 GB
Mask R-CNN R50	72.1%	44.8%	22	44M	4.1 GB
SAM ViT-H	41.8%	29.3%	4.2	636M	16+ GB

Benchmark results on the HoloLens 2 test set. FPS measured with batch=1, RTX 4060, 32GB RAM.

Why does SAM underperform?

SAM was trained on natural web images and its automatic mode tends to over-segment: it creates dozens of tiny fragments per object rather than clean instance masks. It also has no class awareness — it cannot distinguish a 'mug' from a 'bowl'. In our use case where we need both precise instance boundaries AND class labels, SAM in zero-shot mode is simply not competitive.
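One simple way to put a number on over-segmentation is to count how many predicted masks fall mostly inside each ground-truth instance. This diagnostic is an illustration added for this article, not part of our evaluation protocol. The function name, the 0.5 containment threshold, and the toy masks are all assumptions.

```python
def fragments_per_instance(gt_masks, pred_masks, min_inside=0.5):
    """For each ground-truth mask, count predictions lying mostly inside it.

    Masks are sets of (row, col) pixel coordinates.
    """
    counts = []
    for gt in gt_masks:
        n = sum(1 for pred in pred_masks
                if pred and len(pred & gt) / len(pred) >= min_inside)
        counts.append(n)
    return counts


# One 4-pixel object and three predicted fragments covering parts of it
gt_masks = [{(0, 0), (0, 1), (0, 2), (0, 3)}]
pred_masks = [{(0, 0), (0, 1)}, {(0, 2)}, {(0, 3), (1, 3)}]
fragment_counts = fragments_per_instance(gt_masks, pred_masks)
print(fragment_counts)
```

A clean instance segmenter should give a count close to 1 per object. Counts well above that indicate the fragmenting behaviour described above.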

YOLOv8's lead on mAP@50 is most likely explained by its aggressive augmentation pipeline (mosaic, copy-paste), which dramatically improves recall on occluded objects. Mask R-CNN's RoIAlign, for its part, gives it noticeably better mask boundary precision on thin objects like pens and cables.

Pipeline

Integrating segmentation into the HoloLens 2 MR pipeline

The segmentation model is one component in a larger pipeline. The full system flow is: HoloLens 2 RGB stream → YOLO segmentation → depth-aligned bounding volume extraction → OccupancyNet 3D reconstruction → mesh export → HoloLens MRTK overlay.

RGB & Depth Capture

HoloLens 2 streams 1080p RGB at 30 FPS alongside a 512×512 depth map from the time-of-flight sensor.

Instance Segmentation (YOLOv8)

YOLOv8m-seg runs at 61 FPS on a companion GPU server. Each frame returns bounding boxes, class labels, and binary masks.

Depth Masking & Point Cloud

For each instance mask, we back-project the depth values to extract a partial 3D point cloud for that object.
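Back-projection follows the standard pinhole camera model: a pixel (u, v) with depth z maps to ((u - cx) · z / fx, (v - cy) · z / fy, z) in the camera frame. The sketch below assumes the depth map has already been aligned and resampled to the mask's resolution. The intrinsics in the example are toy values, not the HoloLens 2 calibration.

```python
def backproject_mask(depth, mask, fx, fy, cx, cy):
    """Turn masked depth pixels into 3D camera-frame points (pinhole model).

    depth: rows of metric depth values (0 means no measurement)
    mask:  rows of 0/1 values, same shape as depth
    """
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if keep and z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 2x2 example with hypothetical intrinsics
depth = [[2.0, 2.0], [2.0, 2.0]]
mask = [[1, 0], [0, 1]]
points = backproject_mask(depth, mask, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(points)
```

Pixels with no depth reading are skipped, which is one reason the resulting cloud is only partial.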

3D Reconstruction (OccupancyNet)

The point cloud is fed to OccupancyNet, which predicts a continuous occupancy field. The mesh is then extracted from that field with Marching Cubes.
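To show what "continuous occupancy field" means in practice, the sketch below replaces the learned network with a toy function (a sphere) and evaluates it on a regular grid. In a real pipeline the network would be queried at the same grid points, and a Marching Cubes routine would then trace the 0.5 iso-surface through the grid. Everything here, including the function names and the grid resolution, is an illustrative assumption.

```python
def toy_occupancy(x, y, z, radius=0.5):
    """Stand-in for a learned occupancy function: 1.0 inside a sphere, else 0.0."""
    return 1.0 if x * x + y * y + z * z <= radius * radius else 0.0


def sample_occupancy_grid(occupancy_fn, resolution=8, extent=1.0):
    """Evaluate occupancy at the cell centres of the cube [-extent, extent]^3."""
    step = 2 * extent / resolution
    coords = [-extent + (i + 0.5) * step for i in range(resolution)]
    return [[[occupancy_fn(x, y, z) for z in coords] for y in coords]
            for x in coords]


grid = sample_occupancy_grid(toy_occupancy)
occupied = sum(v > 0.5 for plane in grid for row in plane for v in row)
print(f"{occupied} of {8 ** 3} grid cells are inside the surface")
```

Because the field is continuous, the grid resolution is a free choice at inference time: a finer grid gives a smoother mesh at the cost of more network queries.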

HoloLens MRTK Overlay

Reconstructed meshes are streamed back to the HoloLens 2 and rendered as holographic overlays using Unity + MRTK.

Latency challenge

The full round-trip latency (capture → GPU server → HoloLens display) is 180–240 ms over Wi-Fi 6. We are investigating edge deployment, running YOLO on the device's NPU to remove the network hop.
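To put that latency in perspective, it helps to express it in camera frames. A quick calculation, using the 30 FPS capture rate mentioned earlier (`frames_behind` is just a helper name for this sketch):

```python
def frames_behind(latency_ms, fps=30):
    """Number of camera frames captured while one result is in flight."""
    return latency_ms * fps / 1000.0


for latency_ms in (180, 240):
    print(f"{latency_ms} ms -> {frames_behind(latency_ms):.1f} frames at 30 FPS")
```

At 30 FPS the overlay therefore trails the live view by roughly 5 to 7 frames, which is noticeable whenever the user turns their head quickly.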

Takeaways

What I learned — and what I'd do differently

Annotate more, train later. We underestimated annotation time. Our guess is that going from 1,400 to 3,000 frames would have pushed mAP@50 above 85%, although we have not tested this.
YOLOv8 copy-paste is transformative for scenes with heavy occlusion.
SAM is not a free lunch. It shines as a post-processing refinement tool rather than as a standalone detector.
Mixed precision training is non-optional on small GPU budgets — it halved memory usage with no accuracy penalty.
Validate on unseen scenes from day one. We initially validated on random frame splits and saw optimistically inflated numbers.

“The most dangerous metric is validation accuracy on in-distribution data. Always hold out entire scenes, sessions, or domains as your test set.”
Lesson learned · LIARA Lab, 2025

Conclusion

Summary and next steps

Instance segmentation on custom, domain-shifted data is a solvable problem — but it requires attention to every link in the chain. YOLOv8 is the right choice for real-time AR applications today; Mask R-CNN remains relevant when boundary precision matters more than speed; SAM's role is evolving toward fine-tuned specialisation.

Next steps for this research: (1) expand the dataset to 5,000+ annotated frames, (2) explore SAM 2 fine-tuning on our domain, (3) benchmark YOLOv10 and RT-DETR as drop-in replacements, and (4) push for on-device inference on the HoloLens 2's integrated NPU.

UCAmI 2025 Paper: Automatic 3D Object Segmentation and Reconstruction from HoloLens 2 Data (full paper)
Ultralytics YOLOv8 Docs: official segmentation documentation and training guide
Segment Anything (Meta AI): SAM model weights, demo, and research paper

