
How implicit neural representations can turn a handful of depth-masked point clouds into watertight 3D meshes — and why this is a game-changer for mixed reality spatial understanding.

Abdoul-Wahabou H. Tiambou
MSc Computer Science · AI/ML & Computer Vision
Introduction
Imagine wearing a HoloLens 2 and walking into any room. You want the device to not only know where walls and floors are but to understand each individual object: its precise 3D shape, orientation, and surface, so that you can attach digital annotations, simulate physical interactions, or replace it with a virtual counterpart.
This is the 3D object reconstruction problem in AR, and it is surprisingly hard. A HoloLens 2 sees each object from a limited range of viewing angles, with significant depth noise, and often only partially. Classical approaches such as volumetric fusion and photogrammetry require dense, calibrated multi-view capture, which a freely moving headset cannot provide.
🔬 Key insight: implicit representations
Instead of representing a 3D shape as a mesh, voxel grid, or point cloud, implicit neural representations (INRs) learn a function f(x,y,z) → occupancy probability. This function is differentiable, resolution-independent, and can generalise to unseen shapes by learning a latent shape space.
In this post I walk through OccupancyNet (Mescheder et al., CVPR 2019), explain why it fits our constraints, show the full reconstruction pipeline, and share honest results including the cases where it fails.
Background
To appreciate why occupancy networks are interesting, it helps to understand what came before them.
| Representation | Memory | Resolution | Watertight? | Generative? |
|---|---|---|---|---|
| Point Cloud | O(N) | Discrete | ✗ | ✓ (PointNet) |
| Voxel Grid | O(r³) | Discrete | ✓ | ✓ (3D-GAN) |
| Mesh | Variable | Discrete | ✓ | Difficult |
| SDF / TSDF | O(r³) | Discrete | ✓ | ✗ |
| Occupancy Network | O(1)* | Continuous | ✓ | ✓ |
| NeRF | O(1)* | Continuous | ✗ | Partial |
* Network weights are fixed size; only query time scales with desired resolution.
The key advantage of OccupancyNet over voxel grids is that it does not discretise space upfront: you can extract a mesh at any resolution by running Marching Cubes at whatever grid density you want.
NeRF requires dense multi-view capture (100+ calibrated images) and takes hours to train per scene. Our use case provides at most 5–10 partial views of a single object from a moving headset. OccupancyNet with a learned shape prior can reconstruct from a single partial point cloud by drawing on the latent space it learned during training on ShapeNet.
Architecture
OccupancyNet defines a binary classifier f_θ(p, z) → [0,1] where p is a 3D query point and z is a latent code encoding the shape. Training minimises binary cross-entropy over randomly sampled surface and free-space points.
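Written out, the training objective described above might look like the following. This is a sketch in assumed notation: the batch of shapes 𝓑, the number K of query points per shape, and the ground-truth label o_ij ∈ {0, 1} are symbols introduced here for illustration, and the normalisation (mean rather than sum) is a choice, not something the post specifies.

```latex
\mathcal{L}(\theta) = -\frac{1}{|\mathcal{B}|\,K}
  \sum_{i \in \mathcal{B}} \sum_{j=1}^{K}
  \Big[\, o_{ij} \log f_\theta(p_{ij}, z_i)
        + (1 - o_{ij}) \log\!\big(1 - f_\theta(p_{ij}, z_i)\big) \Big]
```

Here z_i is the latent code of shape i, p_ij is the j-th query point sampled for that shape, and o_ij is 1 if p_ij lies inside the shape and 0 otherwise.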
The input is a partial point cloud from HoloLens 2 depth data. A PointNet encoder applies a shared MLP to each point independently, then max-pools over all points to produce a permutation-invariant global feature vector z ∈ ℝ^256.
```python
import torch
import torch.nn as nn


class PointNetEncoder(nn.Module):
    """Encode a partial point cloud into a global shape latent code."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        features = self.mlp(pts)      # (B, N, latent_dim)
        z, _ = features.max(dim=1)    # global max-pool
        return z
```

The decoder takes query point p ∈ ℝ³ and shape code z ∈ ℝ^256, outputting occupancy probability. We use Conditional Batch Normalisation (CBN) to inject the shape code at each layer, which improves reconstruction quality on thin structures.
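To make the CBN idea concrete, here is a minimal sketch of a decoder conditioned on z. It is not the exact decoder used in this project: the class names `ConditionalBatchNorm1d` and `OccupancyDecoder`, the hidden width of 128, the three residual layers, and the use of 1×1 convolutions as per-point linear layers are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ConditionalBatchNorm1d(nn.Module):
    """Batch norm whose scale and shift are predicted from the shape code z."""

    def __init__(self, latent_dim: int, num_features: int):
        super().__init__()
        # affine=False: the learned affine parameters are replaced by
        # gamma(z) and beta(z) below.
        self.bn = nn.BatchNorm1d(num_features, affine=False)
        self.gamma = nn.Linear(latent_dim, num_features)
        self.beta = nn.Linear(latent_dim, num_features)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) per-point features, z: (B, latent_dim)
        gamma = self.gamma(z).unsqueeze(-1)  # (B, C, 1), broadcast over points
        beta = self.beta(z).unsqueeze(-1)
        return gamma * self.bn(x) + beta


class OccupancyDecoder(nn.Module):
    """Map query points and a shape code to occupancy probabilities (sketch)."""

    def __init__(self, latent_dim: int = 256, hidden: int = 128, n_layers: int = 3):
        super().__init__()
        self.fc_in = nn.Conv1d(3, hidden, 1)  # 1x1 conv acts as a per-point linear layer
        self.cbn = nn.ModuleList(
            [ConditionalBatchNorm1d(latent_dim, hidden) for _ in range(n_layers)]
        )
        self.fc = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, 1) for _ in range(n_layers)]
        )
        self.fc_out = nn.Conv1d(hidden, 1, 1)

    def forward(self, p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # p: (B, T, 3) query points, z: (B, latent_dim) shape code
        h = self.fc_in(p.transpose(1, 2))            # (B, hidden, T)
        for cbn, fc in zip(self.cbn, self.fc):
            h = h + fc(torch.relu(cbn(h, z)))        # residual block conditioned on z
        logits = self.fc_out(torch.relu(h)).squeeze(1)  # (B, T)
        return torch.sigmoid(logits)
```

Because z only enters through the per-channel scale and shift, the same decoder weights serve every shape, and the shape code modulates the features at every layer rather than only at the input.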
Once trained, we extract meshes by evaluating the occupancy function on a dense 3D grid and running Marching Cubes at the 0.5 iso-surface. Resolution is a trade-off: 32³ is fast (0.05s) but blocky; 128³ gives smooth surfaces but takes ~2s per object.
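The grid evaluation step can be sketched as follows. This is an illustration under stated assumptions: the shape is assumed to be normalised to the cube [-0.5, 0.5]³, `occupancy_fn` is a hypothetical callable standing in for the trained network, and an analytic sphere is used in its place so the snippet runs without model weights.

```python
import numpy as np


def make_query_grid(resolution: int, bound: float = 0.5) -> np.ndarray:
    """Return (resolution**3, 3) query points covering [-bound, bound]^3."""
    axis = np.linspace(-bound, bound, resolution)
    xx, yy, zz = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)


def evaluate_occupancy_grid(occupancy_fn, resolution: int,
                            bound: float = 0.5, chunk: int = 65536) -> np.ndarray:
    """Query occupancy_fn on a dense grid, in chunks to bound peak memory."""
    pts = make_query_grid(resolution, bound)
    out = np.empty(len(pts), dtype=np.float32)
    for start in range(0, len(pts), chunk):
        out[start:start + chunk] = occupancy_fn(pts[start:start + chunk])
    return out.reshape(resolution, resolution, resolution)


def sphere_occupancy(pts: np.ndarray, radius: float = 0.3) -> np.ndarray:
    """Stand-in for the trained network: 1 inside a sphere, 0 outside."""
    return (np.linalg.norm(pts, axis=1) < radius).astype(np.float32)


grid = evaluate_occupancy_grid(sphere_occupancy, resolution=32)
print(grid.shape, float((grid > 0.5).mean()))
```

The number of queries grows as r³, which is why 128³ costs so much more than 32³. The resulting array could then be passed to a Marching Cubes implementation, for example `skimage.measure.marching_cubes(grid, level=0.5)`; that call is left out here to keep the sketch dependency-free.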
Training
We pre-trained OccupancyNet on ShapeNet Core55 — 51,300 3D models across 55 categories. Pre-training provides a rich shape prior: the encoder learns a latent space where similar shapes cluster, and the decoder learns smooth implicit surfaces.
| Stage | Data | Compute |
|---|---|---|
| Pre-training | 51,300 ShapeNet shapes | ~18h on 2× A100 |
| Fine-tuning | 214 indoor objects | ~4h on 1× T4 |
Fine-tuning on our custom 214-object dataset reduced Chamfer Distance by 34% relative to zero-shot inference. The biggest gains were on non-ShapeNet categories like 'lab equipment' and 'wiring panels'.
💡 Training tip: occupancy sampling matters enormously
Don't sample query points uniformly — oversample near the surface. We use 70% near-surface points (Gaussian noise σ=0.05) and 30% uniform. This dramatically improves boundary sharpness.
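A minimal sketch of that 70/30 sampling scheme, assuming the shape is normalised to the cube [-0.5, 0.5]³ and that σ is expressed in those units. The function name `sample_query_points` and its parameters are hypothetical, and the ground-truth occupancy labels (which need an inside/outside test against a watertight mesh) are not computed here.

```python
import numpy as np


def sample_query_points(surface_pts: np.ndarray, n_total: int = 2048,
                        near_frac: float = 0.7, sigma: float = 0.05,
                        bound: float = 0.5, rng=None):
    """Mix near-surface and uniform query points.

    Returns the (n_total, 3) points and the number of near-surface points,
    which occupy the first rows of the array.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_near = int(round(n_total * near_frac))
    n_uniform = n_total - n_near

    # Near-surface: pick random surface points and perturb with Gaussian noise.
    idx = rng.integers(0, len(surface_pts), size=n_near)
    near = surface_pts[idx] + rng.normal(scale=sigma, size=(n_near, 3))

    # Uniform: cover the whole bounding cube so free space is also supervised.
    uniform = rng.uniform(-bound, bound, size=(n_uniform, 3))
    return np.concatenate([near, uniform], axis=0), n_near


# Demo on points sampled from a sphere of radius 0.3.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(500, 3))
surface = 0.3 * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
pts, n_near = sample_query_points(surface, rng=rng)
print(pts.shape, n_near)
```

Note that the perturbed points are not clipped, so a few near-surface samples may fall slightly outside the bounding cube when the surface touches its boundary.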
Results
We evaluated on 42 held-out objects using Chamfer Distance (CD), F-Score at τ=0.01, and volumetric IoU.
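For readers unfamiliar with these metrics, here is one common way to compute them. Definitions vary between papers (squared versus unsquared distances, sum versus mean), so treat this as an illustrative sketch rather than the exact evaluation code behind the numbers reported here. The function names are hypothetical, and the brute-force nearest-neighbour search is only suitable for a few thousand points.

```python
import numpy as np


def nearest_distances(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """For each point in a (N, 3), distance to its nearest neighbour in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]          # (N, M, 3), O(N*M) memory
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Chamfer Distance: mean of the two directed mean distances."""
    return float(0.5 * (nearest_distances(pred, gt).mean()
                        + nearest_distances(gt, pred).mean()))


def f_score(pred: np.ndarray, gt: np.ndarray, tau: float = 0.01) -> float:
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = (nearest_distances(pred, gt) < tau).mean()
    recall = (nearest_distances(gt, pred) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


def volumetric_iou(occ_a: np.ndarray, occ_b: np.ndarray) -> float:
    """Intersection over union of two occupancy grids of equal shape."""
    a, b = occ_a.astype(bool), occ_b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 1.0
```

Chamfer Distance and F-Score are computed on points sampled from the predicted and ground-truth surfaces, while volumetric IoU compares the occupied volumes, so the three can disagree on thin or hollow objects.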
| Method | Chamfer ↓ | F-Score ↑ | Vol. IoU ↑ | Time/obj |
|---|---|---|---|---|
| 3D-EPN (voxel, 32³) | 0.142 | 0.58 | 0.51 | 0.08s |
| PCN (point comp.) | 0.098 | 0.67 | 0.57 | 0.12s |
| OccNet (ours, 64³) | 0.061 | 0.79 | 0.71 | 0.45s |
| OccNet (ours, 128³) | 0.058 | 0.81 | 0.74 | 2.10s |
Reconstruction quality on 42 held-out indoor objects.
🚨 Transparent objects: a hard limit
Time-of-flight depth sensors fundamentally cannot measure transparent surfaces — the IR light passes through. Until depth estimation from RGB becomes reliable, our system skips objects flagged as 'transparent'.
Integration
The output of OccupancyNet is a standard .obj mesh. Deploying it into a running HoloLens 2 mixed reality session requires several engineering steps beyond the ML pipeline.
Mesh post-processing
Laplacian smoothing, removal of small disconnected components, and decimation to ≤5,000 triangles for real-time rendering.
Coordinate frame alignment
PCA for orientation estimation, then Procrustes alignment to the HoloLens world coordinate frame.
Unity + MRTK streaming
Meshes serialised to binary format, streamed via WebSocket, rendered as holographic wireframe overlays.
Annotation attachment
Digital annotations parented to the mesh GameObject. MRTK spatial anchoring keeps them locked to physical objects.
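The PCA part of the coordinate frame alignment step above can be sketched as follows. This is a minimal illustration, not the project's alignment code: the function name `pca_frame` is hypothetical, the subsequent Procrustes alignment to the HoloLens world frame is not shown, and PCA axes are only defined up to sign and become unstable for near-symmetric objects.

```python
import numpy as np


def pca_frame(points: np.ndarray):
    """Estimate an object-centred frame from a point set (N, 3).

    Returns the centroid and a 3x3 matrix whose columns are the principal
    axes, ordered from largest to smallest variance.
    """
    centroid = points.mean(axis=0)
    centred = points - centroid
    cov = centred.T @ centred / len(points)

    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    axes = eigvecs[:, order]                 # largest-variance axis first

    # Flip the last axis if needed so the frame is right-handed,
    # i.e. a proper rotation with determinant +1.
    if np.linalg.det(axes) < 0:
        axes[:, 2] *= -1
    return centroid, axes


# Demo: an elongated blob whose longest extent lies along the y axis.
rng = np.random.default_rng(0)
blob = rng.normal(size=(2000, 3)) * np.array([0.1, 2.0, 0.5]) + np.array([1.0, 2.0, 3.0])
centroid, axes = pca_frame(blob)
print(centroid, axes[:, 0])
```

The centroid and axes give an initial object pose; a rigid alignment step would then map that pose into the world coordinate frame.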
Conclusion