
How self-attention replaced convolutions, why patchifying images works, and how self-supervised learning with DINOv2 created universal visual features that don't need fine-tuning.

Abdoul-Wahabou H. Tiambou
MSc Computer Science · AI/ML & Computer Vision
Introduction
For a decade, the Convolutional Neural Network (CNN) was the undisputed king of computer vision. Convolutions are brilliant because they encode inductive biases perfectly suited for images: translation equivariance (the same filter detects a cat wherever it appears) and local neighbourhood connectivity (pixels close to each other are related).
But in 2020, Dosovitskiy et al. published 'An Image is Worth 16x16 Words'. They stripped away almost all of these inductive biases, treating an image much like a sequence of text tokens, and trained a standard Transformer encoder on it. Given enough pre-training data, it scaled better, learned more general representations, and matched or outperformed strong ResNets on ImageNet.
🔬 The paradigm shift
Vision Transformers (ViT) proved that with enough data, self-attention can learn the inductive biases of the visual world from scratch better than human engineers can hand-code them via convolutions.
We will cover four milestones in the ViT evolution: the original ViT (Dosovitskiy et al., 2020), DeiT (Touvron et al., 2021) which made ViTs data-efficient, Swin Transformer (Liu et al., 2021) which introduced hierarchical structure, and DINOv2 (Oquab et al., 2024) — Meta's self-supervised powerhouse that may be the most useful foundation model for CV today.
ViT Basics
The core insight of ViT is brutally simple: split an image into fixed-size patches, flatten them into vectors, and feed them to a standard transformer encoder — the same architecture used for text.
Patch Embedding
A 224×224 image is split into 196 patches of 16×16 pixels. Each patch is linearly projected into a D-dimensional embedding vector (typically D=768 for ViT-Base).
Position Encoding
Learnable position embeddings are added to each patch embedding so the model knows spatial arrangement. A special [CLS] token is prepended.
Transformer Encoder
The sequence of 197 tokens passes through L layers of multi-head self-attention + MLP blocks (L=12 for ViT-Base). Every token can attend to every other token — global context from layer 1.
Classification Head
The final [CLS] token representation is fed to a linear classifier for the downstream task. For dense tasks, all patch tokens are used.
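The four steps above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the reference implementation: the class name `TinyViT` and its `tokens` helper are made up for this example, the defaults simply mirror the ViT-Base numbers quoted above, and PyTorch's built-in `nn.TransformerEncoderLayer` stands in for the paper's encoder block.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Minimal ViT-style classifier (illustrative; defaults mirror ViT-Base sizes)."""

    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 for 224 / 16
        # Patch embedding: a strided convolution is equivalent to cutting
        # non-overlapping patches and applying one shared linear projection.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable [CLS] token and position embeddings (one per token, incl. [CLS]).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        # Pre-norm transformer encoder blocks: self-attention followed by an MLP.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            dropout=0.0, activation="gelu", batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def tokens(self, x):
        x = self.patch_embed(x)                # (B, dim, H / ps, W / ps)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)         # (B, num_patches + 1, dim)
        return x + self.pos_embed

    def forward(self, x):
        x = self.norm(self.encoder(self.tokens(x)))
        return self.head(x[:, 0])              # classify from the [CLS] token


# Small configuration so the example runs quickly on a CPU.
model = TinyViT(image_size=32, patch_size=16, dim=64, depth=2, heads=4, num_classes=10)
logits = model(torch.randn(2, 3, 32, 32))      # shape (2, 10)
```

For dense tasks you would keep `x[:, 1:]` (the patch tokens) instead of only the first token.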
The key difference from CNNs: a ViT has a global receptive field from the very first layer, whereas a CNN builds its receptive field gradually through stacked convolutions. This is why ViTs excel on tasks requiring long-range spatial reasoning (e.g., understanding that a cat's tail 200px away belongs to the same instance as its head).
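To put a number on "gradually", here is a small sketch of the standard receptive-field recurrence for stacked convolutions. The helper names are invented for this example, and the 200px figure is just the cat example from the paragraph above. Real CNNs use strides and pooling, so they grow their receptive field faster than the plain stride-1 stack counted here.

```python
def receptive_field(layers):
    """Receptive field (input pixels along one axis) of stacked conv layers.

    `layers` is a sequence of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


def layers_to_cover(distance_px, kernel_size=3):
    """Number of stride-1 convs needed before one output sees `distance_px` pixels."""
    n = 0
    while receptive_field([(kernel_size, 1)] * n) < distance_px:
        n += 1
    return n


plain_stack = layers_to_cover(200)                 # 3x3, stride-1 convs only
resnet_stem = receptive_field([(7, 2), (3, 2)])    # 7x7 stride-2 conv + 3x3 stride-2 pool
```

A single self-attention layer, by contrast, lets any of the 197 tokens attend to any other, regardless of how far apart the patches are.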
⚠️ The data hunger problem
The original ViT was pre-trained on JFT-300M (about 300 million images). Trained on ImageNet-1K alone (about 1.28M images), ViT models underperform ResNets of comparable size. Transformers lack the inductive biases of convolutions (translation equivariance, locality) and compensate with sheer data volume. DeiT tackles this with a better training recipe; DINOv2 sidesteps it by providing pre-trained features you can reuse as-is.
Landscape
The ViT ecosystem has exploded. Here are the four models every CV practitioner should know, ranked by practical impact.
ViT (Dosovitskiy et al., 2020)
Dosovitskiy et al. proved that a pure transformer, with zero convolutions, could match EfficientNet on ImageNet when pre-trained on enough data. Impact: enormous for research. Practical use: limited unless you have Google-scale data.
DeiT (Touvron et al., 2021)
Touvron et al. (Facebook AI) showed that with the right training recipe (aggressive augmentation, regularisation, and a knowledge distillation token) ViTs can be trained competitively on ImageNet-1K alone. DeiT-B achieves 81.8% top-1 accuracy on ImageNet without external data.
Swin Transformer (Liu et al., 2021)
Liu et al. (Microsoft Research) replaced global self-attention with shifted window (Swin) attention: attention is computed within local windows, and the windows are shifted between layers to enable cross-window information flow. Combined with patch merging between stages, this produces hierarchical feature maps, making Swin a drop-in replacement for ResNet in detection and segmentation frameworks like Mask R-CNN and UPerNet.
DINOv2 (Oquab et al., 2024)
Oquab et al. (Meta AI) trained ViT models on 142M curated images using a combination of self-supervised objectives (self-distillation + masked image modelling). The result: DINOv2 produces features that are useful for classification, segmentation, depth estimation, and retrieval without any task-specific fine-tuning of the backbone. Frozen DINOv2 features + a linear probe often beat fully fine-tuned supervised models.
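Why window attention matters for Swin is easiest to see from the cost formulas in the Swin paper: global attention over an h×w token map costs Ω(MSA) = 4hwC² + 2(hw)²C, while attention in M×M windows costs Ω(W-MSA) = 4hwC² + 2M²hwC. The sketch below just evaluates both. The function names are illustrative, and the 56×56 map with C=96 and M=7 is an assumed first-stage configuration chosen for the example.

```python
def global_msa_flops(h, w, c):
    """Cost of global self-attention: 4*h*w*C^2 + 2*(h*w)^2*C."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_msa_flops(h, w, c, m):
    """Cost of attention inside non-overlapping m x m windows: 4*h*w*C^2 + 2*m^2*h*w*C."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Assumed example: 224x224 input, 4x4 patches -> 56x56 tokens, 96 channels, 7x7 windows.
h = w = 56
global_cost = global_msa_flops(h, w, c=96)
window_cost = window_msa_flops(h, w, c=96, m=7)
speedup = global_cost / window_cost   # roughly 14x cheaper at this resolution
```

The second term is what changes: it is quadratic in the number of tokens for global attention but linear for windowed attention, which is what makes high-resolution feature maps affordable.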
| Model | Year | ImageNet Top-1 | Pre-training data | Key innovation |
|---|---|---|---|---|
| ViT-B/16 | 2020 | 77.9%* | ImageNet-1K | Pure transformer for vision |
| ViT-L/16 | 2020 | 87.8% | JFT-300M | Scale is all you need |
| DeiT-B | 2021 | 81.8% | ImageNet-1K | Training recipe + distillation |
| Swin-B | 2021 | 83.5% | ImageNet-1K | Shifted window attention |
| DINOv2 ViT-g | 2024 | 86.5%† | LVD-142M | Self-supervised foundation |
Progression of Vision Transformer milestones. * Without external data. † Linear probe accuracy.
Decision Guide
Despite the hype, CNNs are not dead. Here is an honest decision framework based on practical experience.
| Criterion | Use ViT/DINOv2 | Use CNN (ResNet/EfficientNet) |
|---|---|---|
| Dataset size | > 10K images (or use pre-trained) | < 5K images, no pre-training |
| Task type | Dense prediction, global context needed | Speed-critical, edge deployment |
| Compute budget | GPU available (V100+) | CPU or mobile NPU |
| Latency requirement | > 20ms acceptable | < 5ms required |
| Transfer learning | DINOv2 frozen features are best-in-class | EfficientNet is still very competitive |
💡 The DINOv2 shortcut
For most practical problems in 2025, the fastest path to a strong model is: extract DINOv2 features (frozen backbone) → train a lightweight head (linear layer or small MLP). This requires minimal compute, no augmentation tuning, and often matches fully supervised baselines.
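One way to make that shortcut even cheaper is to run the frozen backbone once, cache the features, and train the head on the cached tensors. The sketch below assumes this setup; `extract_features` and `train_linear_probe` are hypothetical helper names, and the tiny `nn.Sequential` backbone is only a stand-in so the example runs offline. In practice you would pass the DINOv2 model loaded from torch.hub instead.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def extract_features(backbone, images, batch_size=64):
    """Run the frozen backbone once and cache one feature vector per image."""
    backbone.eval()
    chunks = [backbone(images[i:i + batch_size])
              for i in range(0, len(images), batch_size)]
    return torch.cat(chunks)


def train_linear_probe(features, labels, num_classes, epochs=100, lr=1e-2):
    """Fit a single linear layer on cached features (full-batch, for brevity)."""
    probe = nn.Linear(features.shape[1], num_classes)
    optimizer = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(features), labels)
        loss.backward()
        optimizer.step()
    return probe


# Stand-in backbone and random data (assumptions, for illustration only).
torch.manual_seed(0)
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 32))
images = torch.randn(40, 3, 8, 8)
labels = torch.randint(0, 4, (40,))

features = extract_features(backbone, images)      # computed once, no gradients
probe = train_linear_probe(features, labels, num_classes=4)
```

Because the backbone never runs inside the training loop, trying a different head or learning rate costs seconds rather than a full pass over the images. The trade-off is that image augmentation is no longer applied per epoch.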
On edge devices (mobile phones, embedded systems, IoT), CNNs still dominate. MobileNetV3 and EfficientNet-Lite run at 30+ FPS on a smartphone CPU. ViTs require dedicated NPU/GPU acceleration and even then lag behind in latency-per-accuracy on small models.
Fine-tuning
Here is the recipe I use for adapting DINOv2 to custom datasets: freeze the backbone and train only a small head. It works for classification, segmentation, and retrieval tasks with minimal modification.
```python
import torch
import torch.nn as nn
from torchvision import transforms

# Load DINOv2 ViT-Base from torch.hub
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')

# Freeze the backbone: train only the head
for param in backbone.parameters():
    param.requires_grad = False

# Add classification head
class DINOv2Classifier(nn.Module):
    def __init__(self, backbone, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.LayerNorm(768),
            nn.Linear(768, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        with torch.no_grad():
            features = self.backbone(x)  # [CLS] token
        return self.head(features)

model = DINOv2Classifier(backbone, num_classes=10)

# Training config
optimizer = torch.optim.AdamW(
    model.head.parameters(), lr=1e-3, weight_decay=0.05
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=30
)
```
🔬 When to unfreeze
If your dataset is > 50K images and domain-shifted from natural images (medical imaging, satellite imagery, industrial inspection), unfreezing the last 4 transformer blocks and fine-tuning with lr=1e-5 can improve accuracy by 2–5%.
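A partial unfreeze along these lines could look like the sketch below. It assumes the backbone exposes its transformer layers as a `blocks` sequence, which is worth checking on the model you actually load. `unfreeze_last_blocks` and `StandInBackbone` are hypothetical names, and the stand-in exists only so the example runs without downloading weights. Note that the classifier earlier in this section wraps the backbone call in `torch.no_grad()`, which would have to be removed for the unfrozen blocks to receive gradients.

```python
import torch
import torch.nn as nn


def unfreeze_last_blocks(backbone, num_blocks=4):
    """Freeze everything, then re-enable gradients for the last `num_blocks` blocks.

    Assumes the transformer layers are available as `backbone.blocks`.
    """
    for param in backbone.parameters():
        param.requires_grad = False
    trainable = []
    for block in list(backbone.blocks)[-num_blocks:]:
        for param in block.parameters():
            param.requires_grad = True
            trainable.append(param)
    return trainable


class StandInBackbone(nn.Module):
    """Hypothetical 12-block stand-in for a ViT-Base backbone."""

    def __init__(self, dim=32, depth=12):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


backbone = StandInBackbone()
head = nn.Linear(32, 10)
backbone_params = unfreeze_last_blocks(backbone, num_blocks=4)

# Two parameter groups: a small LR for pre-trained blocks, a larger one for the head.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone_params, "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.05,
)
```

Keeping the two learning rates in separate parameter groups lets the new head learn quickly while the pre-trained blocks move only slightly.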
Benchmarks
I have used ViTs in several projects. Here are honest benchmarks from production-adjacent work — not cherry-picked paper results.
| Task | Model | Accuracy/mAP | Inference | Notes |
|---|---|---|---|---|
| Indoor object classification | DINOv2-B + linear | 91.2% top-1 | 12ms (T4) | 28 classes, 3,850 images |
| Indoor object classification | ResNet-50 (fine-tuned) | 87.4% top-1 | 4ms (T4) | Same dataset |
| Instance segmentation | YOLOv8m (ConvNet) | 80.1% mAP50 | 16ms (3070) | HoloLens 2 data |
| Depth estimation | DINOv2-B + DPT head | Rel. err 0.11 | 45ms (T4) | NYU Depth V2 |
| Image retrieval | DINOv2-B frozen | R@1: 89.3% | 8ms (T4) | Indoor scene retrieval |
Real-world ViT performance from my projects.
The pattern is clear: DINOv2 frozen features are extraordinarily good for classification and retrieval. For dense prediction (segmentation, depth), you need a proper decoder head, but the backbone quality still gives you a significant advantage over training from scratch.
“Foundation models like DINOv2 represent a genuine paradigm shift. The question is no longer 'how do I train a good backbone?' but 'how do I build the best head for my task?'”
Conclusion
Vision Transformers are no longer experimental — they are the default choice for most computer vision tasks in 2025. But knowing which ViT to use and how to deploy it efficiently is where the real skill lies.