Computer Vision · 3D Reconstruction · Mixed Reality · Deep Learning · Research

From 2D Masks to 3D Meshes: Object Reconstruction with OccupancyNet on HoloLens 2

How implicit neural representations can turn a handful of depth-masked point clouds into watertight 3D meshes — and why this is a game-changer for mixed reality spatial understanding.

Abdoul-Wahabou H. Tiambou
MSc Computer Science · AI/ML & Computer Vision

July 10, 2025 · Updated July 20, 2025 · 22 min read


Introduction

The 3D reconstruction problem in augmented reality

Imagine wearing a HoloLens 2 and walking into any room. You want the device to not only know where walls and floors are but to understand each individual object: its precise 3D shape, orientation, and surface, so that you can attach digital annotations, simulate physical interactions, or replace it with a virtual counterpart.

This is the 3D object reconstruction problem in AR, and it is surprisingly hard. A HoloLens 2 sees each object from a limited viewing angle, with significant depth noise, and often only partially. Classical approaches — volumetric fusion, photogrammetry — require dense, calibrated multi-view capture incompatible with a freely-moving headset.

Key insight: implicit representations

Instead of representing a 3D shape as a mesh, voxel grid, or point cloud, implicit neural representations (INR) learn a function f(x,y,z) → occupancy probability. This function is differentiable, resolution-independent, and can generalise to unseen shapes by learning a latent shape space.
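To make the idea concrete, here is a minimal sketch of an occupancy function for an analytic shape (a sphere) rather than a learned one. The names `sphere_occupancy` and `make_grid`, the radius, and the cube bounds are illustrative assumptions. The only point is that one function can be queried at any location and at any grid density.

```python
import numpy as np

def sphere_occupancy(points: np.ndarray, radius: float = 0.5) -> np.ndarray:
    """Hypothetical analytic occupancy: 1.0 inside a sphere at the origin, else 0.0."""
    return (np.linalg.norm(points, axis=-1) <= radius).astype(np.float32)

def make_grid(resolution: int, bound: float = 1.0) -> np.ndarray:
    """Return (resolution**3, 3) query points covering the cube [-bound, bound]^3."""
    axis = np.linspace(-bound, bound, resolution)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

# The same function is queried at two densities. Nothing is re-stored or
# re-trained; only the number of query points changes.
coarse = sphere_occupancy(make_grid(16))
fine = sphere_occupancy(make_grid(64))
```

A learned occupancy network plays the role of `sphere_occupancy`, with the shape determined by a latent code instead of a hard-coded radius.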

In this post I walk through OccupancyNet (Mescheder et al., CVPR 2019), explain why it fits our constraints, show the full reconstruction pipeline, and share honest results including the cases where it fails.

Background

A brief taxonomy of 3D shape representations

To appreciate why occupancy networks are interesting, it helps to understand what came before them.

Representation	Memory	Resolution	Watertight?	Generative?
Point Cloud	O(N)	Discrete	✗	✓ (PointNet)
Voxel Grid	O(r³)	Discrete	✓	✓ (3D-GAN)
Mesh	Variable	Discrete	✓	Difficult
SDF / TSDF	O(r³)	Discrete	✓	✗
Occupancy Network	O(1)*	Continuous	✓	✓
NeRF	O(1)*	Continuous	✗	Partial

* Network weights are fixed size; only query time scales with desired resolution.

The key advantage of OccupancyNet over voxel grids is that it does not discretise space upfront: you can extract a mesh at any resolution by running Marching Cubes at whatever grid density you want.

Why not NeRF?

The original NeRF formulation typically needs dense multi-view capture (on the order of 100 calibrated images) and hours of optimisation per scene. Our use case provides at most 5–10 partial views of a single object from a moving headset. OccupancyNet, by contrast, can reconstruct from a single partial point cloud, because it draws on the shape prior it learned during training on ShapeNet.

Architecture

OccupancyNet architecture deep-dive

OccupancyNet defines a binary classifier f_θ(p, z) → [0,1], where p is a 3D query point and z is a latent code encoding the shape. Training minimises the binary cross-entropy between the predicted occupancy and the ground-truth inside/outside label at sampled query points, drawn both near the surface and in free space.
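As a rough illustration of that objective, the sketch below computes the cross-entropy from raw decoder outputs (logits). The function name `occupancy_loss` and the choice to sum over query points and average over the batch are assumptions for this example, not a statement of the exact reduction used in our training code.

```python
import torch
import torch.nn.functional as F

def occupancy_loss(logits: torch.Tensor, occ: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between occupancy logits and 0/1 labels.

    logits: (B, M) raw decoder outputs for M query points per shape
    occ:    (B, M) ground-truth occupancy (1 = inside, 0 = outside)
    """
    per_point = F.binary_cross_entropy_with_logits(logits, occ, reduction="none")
    # Assumed reduction: sum over the points of each shape, mean over the batch.
    return per_point.sum(dim=-1).mean()

# Toy check with made-up values: confident correct predictions give a low
# loss, confident wrong predictions a high one.
occ = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
good = occupancy_loss(torch.tensor([[8.0, -8.0, 8.0, -8.0]]), occ)
bad = occupancy_loss(torch.tensor([[-8.0, 8.0, -8.0, 8.0]]), occ)
```

Working on logits rather than probabilities avoids taking the log of a saturated sigmoid, which is why the sigmoid is only applied at inference time in this sketch.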

Encoder: PointNet for partial observations

The input is a partial point cloud from HoloLens 2 depth data. A PointNet encoder applies a shared MLP to each point independently, then max-pools over all points to produce a permutation-invariant global feature vector z ∈ ℝ^256.

encoder.py · python

```python
import torch
import torch.nn as nn


class PointNetEncoder(nn.Module):
    """Encode a partial point cloud into a global shape latent code."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # Shared per-point MLP: nn.Linear acts on the last dimension,
        # so the same weights are applied to every point independently.
        self.mlp = nn.Sequential(
            nn.Linear(3, 64),   nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: (B, N, 3) batch of point clouds with N points each
        features = self.mlp(pts)        # (B, N, latent_dim)
        z, _ = features.max(dim=1)      # max-pool over points -> (B, latent_dim)
        return z
```

Decoder: conditioned occupancy MLP

The decoder takes query point p ∈ ℝ³ and shape code z ∈ ℝ^256, outputting occupancy probability. We use Conditional Batch Normalisation (CBN) to inject the shape code at each layer — this improves reconstruction quality on thin structures.
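The following is a minimal sketch of what such a conditioned decoder can look like, assuming a small residual MLP implemented with 1×1 convolutions. The class names, hidden width, and layer count are illustrative choices, not the exact configuration behind the results in this post.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    """Batch norm whose scale and shift are predicted from the shape code z."""
    def __init__(self, latent_dim: int, num_features: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.gamma = nn.Linear(latent_dim, num_features)
        self.beta = nn.Linear(latent_dim, num_features)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (B, C, M) features for M query points; z: (B, latent_dim)
        gamma = self.gamma(z).unsqueeze(-1)   # (B, C, 1)
        beta = self.beta(z).unsqueeze(-1)     # (B, C, 1)
        return gamma * self.bn(x) + beta

class OccupancyDecoder(nn.Module):
    """Hypothetical minimal decoder: query points + shape code -> occupancy logits."""
    def __init__(self, latent_dim: int = 256, hidden: int = 128, n_layers: int = 3):
        super().__init__()
        self.fc_in = nn.Conv1d(3, hidden, kernel_size=1)
        self.norms = nn.ModuleList(
            [ConditionalBatchNorm1d(latent_dim, hidden) for _ in range(n_layers)]
        )
        self.layers = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, kernel_size=1) for _ in range(n_layers)]
        )
        self.fc_out = nn.Conv1d(hidden, 1, kernel_size=1)

    def forward(self, p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # p: (B, M, 3) query points; z: (B, latent_dim) shape code
        h = self.fc_in(p.transpose(1, 2))             # (B, hidden, M)
        for norm, layer in zip(self.norms, self.layers):
            # Residual update: CBN injects z before every hidden layer.
            h = h + layer(torch.relu(norm(h, z)))
        return self.fc_out(torch.relu(h)).squeeze(1)  # (B, M) logits

decoder = OccupancyDecoder()
logits = decoder(torch.rand(2, 1024, 3), torch.randn(2, 256))
probs = torch.sigmoid(logits)  # occupancy probabilities in [0, 1]
```

Because the scale and shift of every normalisation layer depend on z, the shape code modulates the whole network instead of being concatenated to the input once.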

Mesh extraction with Marching Cubes

Once trained, we extract meshes by evaluating the occupancy function on a dense 3D grid and running Marching Cubes at the 0.5 iso-surface. Resolution is a trade-off: 32³ is fast (0.05s) but blocky; 128³ gives smooth surfaces but takes ~2s per object.
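The extraction step can be sketched with scikit-image's `marching_cubes`. In this example the trained decoder is replaced by a stand-in, `soft_sphere`, so the snippet runs on its own; the function names, the grid bounds of ±0.5, and the sphere itself are assumptions made for illustration.

```python
import numpy as np
from skimage import measure

def extract_mesh(occupancy_fn, resolution: int = 64, bound: float = 0.5,
                 threshold: float = 0.5):
    """Evaluate occupancy on a resolution^3 grid and run Marching Cubes."""
    axis = np.linspace(-bound, bound, resolution)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

    volume = occupancy_fn(grid).reshape(resolution, resolution, resolution)

    # spacing converts voxel indices to world units. marching_cubes returns
    # vertices relative to the grid origin, so shift them back by -bound.
    step = 2.0 * bound / (resolution - 1)
    verts, faces, _normals, _values = measure.marching_cubes(
        volume, level=threshold, spacing=(step, step, step)
    )
    return verts - bound, faces

def soft_sphere(points: np.ndarray, radius: float = 0.3) -> np.ndarray:
    """Stand-in for a trained decoder: smooth occupancy of a sphere, in (0, 1)."""
    dist = np.linalg.norm(points, axis=-1)
    return 1.0 / (1.0 + np.exp((dist - radius) / 0.02))

verts, faces = extract_mesh(soft_sphere, resolution=64)
```

With a real model, `occupancy_fn` would batch the grid points through the decoder and return sigmoid probabilities. The number of decoder queries grows with the cube of the resolution, which is where the speed versus smoothness trade-off comes from.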

Training

Pre-training on ShapeNet and fine-tuning for indoor objects

We pre-trained OccupancyNet on ShapeNet Core55 — 51,300 3D models across 55 categories. Pre-training provides a rich shape prior: the encoder learns a latent space where similar shapes cluster, and the decoder learns smooth implicit surfaces.

Pre-training: ShapeNet shapes, ~18h on 2× A100

Fine-tuning: indoor objects, ~4h on 1× T4

Fine-tuning on our custom 214-object dataset reduced Chamfer Distance by 34% relative to zero-shot inference with the pre-trained model. The biggest gains were on non-ShapeNet categories like 'lab equipment' and 'wiring panels'.

Training tip: occupancy sampling matters enormously

Don't sample query points uniformly — oversample near the surface. We use 70% near-surface points (Gaussian noise σ=0.05) and 30% uniform. This dramatically improves boundary sharpness.
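A minimal sketch of that 70/30 mix is shown below. The function name, the total of 2048 query points, and the bounding cube of ±0.5 are assumptions for the example; the ground-truth occupancy label for each query point still has to be computed separately from a watertight mesh.

```python
import numpy as np

def sample_query_points(surface_pts: np.ndarray, n_total: int = 2048,
                        near_ratio: float = 0.7, sigma: float = 0.05,
                        bound: float = 0.5, rng=None) -> np.ndarray:
    """Mix near-surface and uniform query points (70/30 by default)."""
    rng = np.random.default_rng() if rng is None else rng
    n_near = int(round(n_total * near_ratio))
    n_uniform = n_total - n_near

    # Near-surface: pick random surface points and jitter them with Gaussian noise.
    idx = rng.integers(0, len(surface_pts), size=n_near)
    near = surface_pts[idx] + rng.normal(scale=sigma, size=(n_near, 3))

    # Uniform: cover the whole bounding cube so free space is also supervised.
    uniform = rng.uniform(-bound, bound, size=(n_uniform, 3))
    return np.concatenate([near, uniform], axis=0)

# Usage on a synthetic surface: points on a sphere of radius 0.3.
rng = np.random.default_rng(0)
directions = rng.normal(size=(5000, 3))
surface = 0.3 * directions / np.linalg.norm(directions, axis=1, keepdims=True)
queries = sample_query_points(surface, rng=rng)
```

The near-surface samples concentrate the loss where the decision boundary lives, while the uniform samples keep the network from hallucinating occupied regions far from the object.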

Results

Reconstruction results and failure modes

We evaluated on 42 held-out objects using Chamfer Distance (CD), F-Score at τ=0.01, and volumetric IoU.
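For readers unfamiliar with these metrics, here is a brute-force NumPy sketch. Definitions vary between papers (squared or unsquared distances, sum or mean), so the conventions below are assumptions for illustration and may differ in detail from the ones behind the numbers in the table.

```python
import numpy as np

def nearest_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """For each point in a (N, 3), the distance to its nearest neighbour in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]          # (N, M, 3)
    return np.sqrt((diff ** 2).sum(-1)).min(axis=1)

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer distance: average of the two directed mean distances."""
    forward = nearest_distances(pred, gt).mean()
    backward = nearest_distances(gt, pred).mean()
    return float(0.5 * (forward + backward))

def f_score(pred: np.ndarray, gt: np.ndarray, tau: float = 0.01) -> float:
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = (nearest_distances(pred, gt) < tau).mean()
    recall = (nearest_distances(gt, pred) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

def volumetric_iou(occ_pred: np.ndarray, occ_gt: np.ndarray) -> float:
    """IoU between two boolean occupancy arrays evaluated at the same points."""
    union = np.logical_or(occ_pred, occ_gt).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(occ_pred, occ_gt).sum() / union)

# Tiny hand-checkable example: one of two predicted points is off by 0.1.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.0]])
cd = chamfer_distance(pred, gt)
fs = f_score(pred, gt, tau=0.01)
```

The pairwise distance matrix uses O(N·M) memory, which is fine for a few thousand sampled points; larger clouds would call for a KD-tree.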

Method	Chamfer ↓	F-Score ↑	Vol. IoU ↑	Time/obj
3D-EPN (voxel, 32³)	0.142	0.58	0.51	0.08s
PCN (point comp.)	0.098	0.67	0.57	0.12s
OccNet (ours, 64³)	0.061	0.79	0.71	0.45s
OccNet (ours, 128³)	0.058	0.81	0.74	2.10s

Reconstruction quality on 42 held-out indoor objects.

Where the model fails

Very thin structures (cables, pens, chair legs under 5mm): Marching Cubes at 128³ cannot resolve sub-voxel details.
Symmetric objects with extreme partial occlusion: when less than 15% of the surface is observed, the latent code is ambiguous.
Transparent objects (glass, bottles): the depth sensor produces no valid returns for transparent surfaces.
Objects outside ShapeNet categories: near-total failure on fume hoods and wall-mounted projector screens.
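The thin-structure limit in the first item follows from simple arithmetic. The snippet below assumes, purely for illustration, that the reconstruction grid spans a 1 m bounding cube; the actual extent depends on how each object is normalised.

```python
def voxel_size_mm(extent_m: float, resolution: int) -> float:
    """Approximate edge length of one grid cell, in millimetres."""
    return extent_m / resolution * 1000.0

# Hypothetical 1 m bounding cube at several grid resolutions.
for res in (32, 64, 128, 256):
    print(f"{res}^3 grid over 1 m: {voxel_size_mm(1.0, res):.1f} mm per voxel")
```

Under that assumption a 128³ grid has cells of roughly 7.8 mm, so a 5 mm cable can fall entirely between grid samples and never cross the 0.5 iso-surface.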

Transparent objects: a hard limit

Time-of-flight depth sensors fundamentally cannot measure transparent surfaces — the IR light passes through. Until depth estimation from RGB becomes reliable, our system skips objects flagged as 'transparent'.

Integration

Deploying reconstructed meshes on the HoloLens 2 with MRTK

The output of the reconstruction pipeline (OccupancyNet followed by Marching Cubes) is a standard .obj mesh. Deploying it into a running HoloLens 2 mixed reality session requires several engineering steps beyond the ML pipeline.

Mesh post-processing: Laplacian smoothing, removal of small disconnected components, and decimation to ≤5,000 triangles for real-time rendering.
Coordinate frame alignment: PCA for orientation estimation, then Procrustes alignment to the HoloLens world coordinate frame.
Unity + MRTK streaming: meshes are serialised to a binary format, streamed via WebSocket, and rendered as holographic wireframe overlays.
Annotation attachment: digital annotations are parented to the mesh GameObject, and MRTK spatial anchoring keeps them locked to physical objects.
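The coordinate frame alignment step can be sketched in NumPy as follows. This is a generic PCA plus rigid Procrustes (Kabsch) example under the assumption that point correspondences between the mesh and the world frame are available. The function names and the synthetic transform are hypothetical, and the way correspondences are obtained on the device is not covered here.

```python
import numpy as np

def pca_axes(points: np.ndarray) -> np.ndarray:
    """Principal axes of a point set as rows, ordered by decreasing variance."""
    centred = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt

def rigid_procrustes(src: np.ndarray, dst: np.ndarray):
    """Least-squares rotation R and translation t with R @ src_i + t ~= dst_i.

    src and dst are (N, 3) arrays of corresponding points (Kabsch algorithm).
    """
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_mean).T @ (dst - dst_mean)        # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst_mean - r @ src_mean
    return r, t

# Synthetic check: rotate an elongated point set by 30 degrees about z,
# translate it, then recover the transform from the correspondences.
rng = np.random.default_rng(0)
mesh_pts = rng.normal(size=(200, 3)) * np.array([0.3, 0.1, 0.05])
angle = np.deg2rad(30.0)
r_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -0.5, 2.0])
world_pts = mesh_pts @ r_true.T + t_true

r_est, t_est = rigid_procrustes(mesh_pts, world_pts)
aligned = mesh_pts @ r_est.T + t_est
```

Note that PCA axes are only defined up to sign, so an orientation estimated this way may need a disambiguation step (for example, a gravity or viewing-direction check) before it is trusted.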

Conclusion

Key takeaways

OccupancyNet is the right tool for shape completion from sparse partial observations.
Pre-training on ShapeNet is essential. Training from scratch on 200 objects produces degenerate results.
128³ resolution is the practical sweet spot: good surface quality, 2s per object, 8GB VRAM.
Transparent and very thin objects are unresolved problems requiring different sensor modalities.
The engineering around the ML model — coordinate alignment, mesh post-processing, MRTK streaming — takes more time than the model itself.

OccupancyNet Paper (CVPR 2019), Mescheder et al.: the original paper.
ConvONet (ECCV 2020), Peng et al.: convolutional occupancy networks.
3D Gaussian Splatting (SIGGRAPH 2023), Kerbl et al.: real-time novel view synthesis.

Tags

OccupancyNet · 3D Reconstruction · Implicit Neural Representation · HoloLens 2 · Marching Cubes · PyTorch · LIARA · Point Cloud

