Computer Vision · Deep Learning · Tutorial

Vision Transformers Explained: From ViT to DINOv2 — A Practitioner's Guide

How self-attention replaced convolutions, why patchifying images works, and how self-supervised learning with DINOv2 created universal visual features that don't need fine-tuning.

Abdoul-Wahabou H. Tiambou
MSc Computer Science · AI/ML & Computer Vision

August 5, 2025 · Updated August 15, 2025 · 25 min read

Introduction

The fall of the convolution

For a decade, the Convolutional Neural Network (CNN) was the undisputed king of computer vision. Convolutions are brilliant because they encode inductive biases well suited to images: translation equivariance (shift the cat and its features shift with it, so a cat is a cat anywhere) and local connectivity (pixels close to each other are related).

But in 2020, Dosovitskiy et al. published 'An Image is Worth 16x16 Words'. They stripped away almost all of these inductive biases, treating an image like a sequence of text tokens, and trained a standard Transformer encoder on it. Given enough pre-training data, it scaled better, learned more general representations, and outperformed strong ResNet baselines on ImageNet.

The paradigm shift

Vision Transformers (ViT) proved that with enough data, self-attention can learn the inductive biases of the visual world from scratch better than human engineers can hand-code them via convolutions.

We will cover four milestones in the ViT evolution: the original ViT (Dosovitskiy et al., 2020), DeiT (Touvron et al., 2021) which made ViTs data-efficient, Swin Transformer (Liu et al., 2021) which introduced hierarchical structure, and DINOv2 (Oquab et al., 2024) — Meta's self-supervised powerhouse that may be the most useful foundation model for CV today.

ViT Basics

How Vision Transformers work — the 5-minute version

The core insight of ViT is brutally simple: split an image into fixed-size patches, flatten them into vectors, and feed them to a standard transformer encoder — the same architecture used for text.

🖼️

Patch Embedding

A 224×224 image is split into 196 patches of 16×16 pixels. Each patch is linearly projected into a D-dimensional embedding vector (typically D=768 for ViT-Base).

📍

Position Encoding

Learnable position embeddings are added to each patch embedding so the model knows spatial arrangement. A special [CLS] token is prepended.

⚡

Transformer Encoder

The sequence of 197 tokens passes through L layers of multi-head self-attention + MLP blocks (L=12 for ViT-Base). Every token can attend to every other token — global context from layer 1.

🎯

Classification Head

The final [CLS] token representation is fed to a linear classifier for the downstream task. For dense tasks, all patch tokens are used.
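The four steps above can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the reference implementation: the class name `TinyViT` is hypothetical, the encoder is built from PyTorch's generic `nn.TransformerEncoderLayer` with pre-norm blocks, and the demo at the bottom shrinks the width and depth so it runs quickly on a CPU.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Illustrative ViT: patchify -> add [CLS] + positions -> encode -> classify."""

    def __init__(self, image_size=224, patch_size=16, dim=768, depth=12,
                 heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 by default
        # A strided convolution is equivalent to cutting non-overlapping patches,
        # flattening each one, and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        self.encoder = nn.Sequential(*[
            nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                activation="gelu", batch_first=True, norm_first=True,
            )
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):
        x = self.patch_embed(images)             # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)         # (B, 196, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)           # (B, 197, dim)
        x = x + self.pos_embed                   # learnable position embeddings
        x = self.encoder(x)                      # global self-attention in every layer
        return self.head(self.norm(x[:, 0]))     # classify from the [CLS] token


# Shrunk configuration, only to keep the demo fast.
model = TinyViT(dim=64, depth=2, heads=4, num_classes=10).eval()
with torch.no_grad():
    logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```

For dense tasks, the same sketch would return `x[:, 1:]` (the 196 patch tokens) instead of only the [CLS] token.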

The key difference from CNNs: a ViT has a global receptive field from the very first layer, whereas a CNN builds its receptive field gradually through stacked convolutions. This is why ViTs excel on tasks requiring long-range spatial reasoning (e.g., understanding that a cat's tail 200px away belongs to the same instance as its head).
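To put a number on "gradually": in a plain stack of convolutions, each layer widens the receptive field by (kernel size − 1) times the product of the strides before it. The helper below is a hypothetical illustration of that standard formula, not code from any library.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Ten stacked 3x3, stride-1 convolutions still only see a 21x21 window.
print(receptive_field([(3, 1)] * 10))  # 21

# By contrast, self-attention over 14x14 = 196 patch tokens lets every token
# attend to all 196 patches in the first layer, i.e. the full 224x224 image.
```

Strides and pooling make the CNN's window grow faster, but it still takes several layers to reach what a ViT sees in one.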

The data hunger problem

The original ViT trained on JFT-300M (300 million images). On ImageNet alone (1.2M images), ViT-Base performs worse than a ResNet-50. Transformers lack the inductive biases of convolutions (translation equivariance, locality) and compensate with sheer data volume. This is the problem DeiT and DINOv2 solve.

Landscape

The Vision Transformer family tree

The ViT ecosystem has exploded. Here are the four models every CV practitioner should know, ranked by practical impact.

1. ViT (2020): the proof of concept

Dosovitskiy et al. proved that a pure transformer — with zero convolutions — could match EfficientNet on ImageNet when pre-trained on enough data. Impact: enormous for research. Practical use: limited unless you have Google-scale data.

2. DeiT (2021): data-efficient training

Touvron et al. (Facebook AI) showed that with the right training recipe (aggressive augmentation, regularisation, and a knowledge distillation token), ViTs can be trained competitively on ImageNet-1K alone. DeiT-B achieves 81.8% top-1 accuracy on ImageNet without external data, and 83.4% when trained with the distillation token.
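The distillation token gets its own output head, trained against a teacher network's predictions while the class token is trained against the ground-truth labels. Here is a minimal sketch of the hard-label form of that objective, assuming the student returns two sets of logits (one per token). The function name is hypothetical and random tensors stand in for real models.

```python
import torch
import torch.nn.functional as F


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Average of the label loss (class token) and teacher loss (distillation token)."""
    teacher_labels = teacher_logits.argmax(dim=1)  # the teacher's hard decision
    loss_cls = F.cross_entropy(cls_logits, labels)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_dist


# Toy batch: 4 images, 10 classes, random logits standing in for real models.
torch.manual_seed(0)
cls_logits = torch.randn(4, 10, requires_grad=True)
dist_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))

loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
loss.backward()  # gradients reach both heads; the teacher receives none
```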

3. Swin Transformer (2021): hierarchical + efficient

Liu et al. (Microsoft Research) replaced global self-attention with shifted window (Swin) attention — computing attention within local windows and shifting them to enable cross-window information flow. This produces hierarchical feature maps, making Swin a drop-in replacement for ResNet in detection and segmentation frameworks like Mask R-CNN and UPerNet.
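The window mechanics are easier to see in code. The sketch below only shows how a feature map is cut into windows and how a cyclic shift makes the next layer's windows straddle the previous borders. It leaves out the attention itself and the mask that a full implementation needs so that tokens wrapped around by the shift do not attend to each other. The 56×56×96 feature map and window size 7 are illustrative values.

```python
import torch


def window_partition(x, window_size):
    """Split a (B, H, W, C) map into (B * num_windows, window_size**2, C) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size * window_size, C)


feature_map = torch.randn(1, 56, 56, 96)  # illustrative early-stage feature map
window_size = 7

# Regular windows: attention would run independently inside each 7x7 window.
windows = window_partition(feature_map, window_size)
print(windows.shape)  # torch.Size([64, 49, 96])

# Shifted windows: roll the map by half a window so that the next layer's
# windows mix tokens that sat in different windows in the previous layer.
shift = window_size // 2
shifted = torch.roll(feature_map, shifts=(-shift, -shift), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size)
```

Each attention call now covers 49 tokens instead of 3,136, which is where the efficiency comes from.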

4. DINOv2 (2024): self-supervised everything

Oquab et al. (Meta AI) trained ViT models on 142M curated images using a combination of self-supervised objectives (self-distillation + masked image modelling). The result: DINOv2 produces features that are useful for everything — classification, segmentation, depth estimation, retrieval — without any task-specific fine-tuning. Frozen DINOv2 features + a linear probe often beat fully fine-tuned supervised models.
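The self-distillation half of that recipe can be sketched briefly. A student network is trained to match the output distribution of a teacher, and the teacher's weights are an exponential moving average (EMA) of the student's. The code below is a simplified illustration under stated assumptions: the function names, momentum, and temperatures are illustrative, a linear layer stands in for the ViT, and the sketch omits both the normalisation applied to teacher outputs in practice and the masked-image-modelling term.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Linear(8, 4)           # stand-in for the ViT student
teacher = copy.deepcopy(student)    # the teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad = False         # the teacher is never trained by backprop


@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight a small step towards the student weight."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)


def self_distillation_loss(student_logits, teacher_logits,
                           student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the sharpened teacher distribution and the student."""
    targets = F.softmax(teacher_logits / teacher_temp, dim=-1)
    log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()


# Two augmented "views" of the same batch (random tensors here).
torch.manual_seed(0)
view_a, view_b = torch.randn(16, 8), torch.randn(16, 8)
loss = self_distillation_loss(student(view_a), teacher(view_b))
loss.backward()                     # gradients flow into the student only
ema_update(teacher, student)        # the teacher then follows the student
```

No labels appear anywhere in this loop, which is what lets the approach scale to 142M uncurated-by-label images.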

Model	Year	ImageNet Top-1	Pre-training data	Key innovation
ViT-B/16	2020	77.9%*	ImageNet-1K	Pure transformer for vision
ViT-L/16	2020	87.8%	JFT-300M	Scale is all you need
DeiT-B	2021	81.8%	ImageNet-1K	Training recipe + distillation
Swin-B	2021	83.5%	ImageNet-1K	Shifted window attention
DINOv2 ViT-g	2024	86.5%†	LVD-142M	Self-supervised foundation

Progression of Vision Transformer milestones. * Without external data. † Linear probe accuracy.

Decision Guide

ViTs vs CNNs: a practical decision framework

Despite the hype, CNNs are not dead. Here is an honest decision framework based on practical experience.

Criterion	Use ViT/DINOv2	Use CNN (ResNet/EfficientNet)
Dataset size	> 10K images (or use pre-trained)	< 5K images, no pre-training
Task type	Dense prediction, global context needed	Speed-critical, edge deployment
Compute budget	GPU available (V100+)	CPU or mobile NPU
Latency requirement	> 20ms acceptable	< 5ms required
Transfer learning	DINOv2 frozen features are best-in-class	EfficientNet is still very competitive

The DINOv2 shortcut

For most practical problems in 2025, the fastest path to a strong model is: extract DINOv2 features (frozen backbone) → train a lightweight head (linear layer or small MLP). This requires minimal compute, no augmentation tuning, and often matches fully supervised baselines.
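Because the backbone is frozen, its features can be computed once and cached, and the head then trains in seconds. The sketch below shows that workflow under stated assumptions: `extract_features` and `train_linear_head` are hypothetical helper names, and a tiny stand-in backbone with random data replaces DINOv2 and a real `DataLoader` so the example runs without downloading weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def extract_features(backbone, loader, device="cpu"):
    """Run the frozen backbone once over a dataset and cache its embeddings."""
    backbone.eval().to(device)
    feats, labels = [], []
    for images, targets in loader:
        feats.append(backbone(images.to(device)).cpu())
        labels.append(targets)
    return torch.cat(feats), torch.cat(labels)


def train_linear_head(features, labels, num_classes, epochs=20, lr=1e-3):
    """Fit a linear classifier on cached features (full-batch for brevity)."""
    head = nn.Linear(features.shape[1], num_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(head(features), labels)
        loss.backward()
        optimizer.step()
    return head


# Stand-ins: in practice `backbone` would be the DINOv2 model and `loader`
# a DataLoader over your images.
torch.manual_seed(0)
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 32))
loader = [(torch.randn(4, 3, 8, 8), torch.randint(0, 3, (4,))) for _ in range(5)]

features, labels = extract_features(backbone, loader)
head = train_linear_head(features, labels, num_classes=3)
print(features.shape)  # torch.Size([20, 32])
```

Caching only works if augmentation is fixed or absent. With random augmentation the features have to be recomputed every epoch.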

On edge devices (mobile phones, embedded systems, IoT), CNNs still dominate. MobileNetV3 and EfficientNet-Lite run at 30+ FPS on a smartphone CPU. ViTs require dedicated NPU/GPU acceleration and even then lag behind in latency-per-accuracy on small models.

Fine-tuning

Practical fine-tuning recipe for DINOv2

Here is the recipe I use for fine-tuning DINOv2 on custom datasets. It works for classification, segmentation, and retrieval tasks with minimal modification.

finetune_dinov2.py·python

import torch
import torch.nn as nn

# Load DINOv2 ViT-Base (patch size 14) from torch.hub
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')

# Freeze the backbone: train only the head
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()

# Add classification head
class DINOv2Classifier(nn.Module):
    def __init__(self, backbone, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.LayerNorm(768),  # 768 = ViT-Base embedding dimension
            nn.Linear(768, 256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        # No gradients are needed through the frozen backbone
        with torch.no_grad():
            features = self.backbone(x)  # [CLS] token embedding, shape (B, 768)
        return self.head(features)

model = DINOv2Classifier(backbone, num_classes=10)

# Training config: optimise the head parameters only
optimizer = torch.optim.AdamW(
    model.head.parameters(), lr=1e-3, weight_decay=0.05
)
# Step the scheduler once per epoch; T_max matches the 30-epoch schedule
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=30
)

Freeze the backbone for small datasets (< 10K images). Only train the head.
Use AdamW with weight_decay=0.05 and lr=1e-3 for the head.
Cosine annealing schedule for 30 epochs is usually sufficient.
Minimal augmentation needed — DINOv2 features are already robust to transformations.
Input resolution: DINOv2 ViT-B/14 expects 518×518 by default. You can use 224×224 (or any side length that is a multiple of the patch size, 14) with interpolated position embeddings for a roughly 2× speed-up at a cost of about 1% accuracy.
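The resolution trade-off comes down to sequence length. The arithmetic below counts the tokens the encoder processes at each input size, assuming one [CLS] token and no additional tokens. The helper name is hypothetical.

```python
def num_tokens(image_size, patch_size=14, extra_tokens=1):
    """Sequence length for a square image: patch tokens plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image side must be a multiple of the patch size")
    return (image_size // patch_size) ** 2 + extra_tokens


for size in (224, 518):
    print(size, num_tokens(size))
# 224 257
# 518 1370
```

The MLP cost grows linearly with the token count and the attention cost quadratically, so the smaller input does far less work per image. How much of that shows up as wall-clock speed-up depends on hardware and batch size.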

When to unfreeze

If your dataset is > 50K images and domain-shifted from natural images (medical imaging, satellite imagery, industrial inspection), unfreezing the last 4 transformer blocks and fine-tuning with lr=1e-5 can improve accuracy by 2–5%.
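Partial unfreezing can be sketched as follows. It assumes the backbone exposes its transformer blocks as a `.blocks` list, which is an assumption to verify on the model you load. The helper name is hypothetical, and a toy module stands in for DINOv2 so the example runs without weights.

```python
import torch
import torch.nn as nn


def unfreeze_last_blocks(backbone, num_blocks=4):
    """Freeze everything, then re-enable gradients for the last `num_blocks` blocks."""
    for p in backbone.parameters():
        p.requires_grad = False
    trainable = []
    for block in backbone.blocks[-num_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
            trainable.append(p)
    return trainable


class ToyBackbone(nn.Module):
    """Stand-in with a `.blocks` list, only so the sketch runs without weights."""

    def __init__(self, depth=12, dim=16):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])


backbone = ToyBackbone()
head = nn.Linear(16, 10)
backbone_params = unfreeze_last_blocks(backbone, num_blocks=4)

# Two parameter groups: a small LR for the backbone blocks, a larger one for the head.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone_params, "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.05,
)
```

If the forward pass wraps the backbone call in `torch.no_grad()`, that wrapper has to be removed, otherwise the unfrozen blocks never receive gradients.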

Benchmarks

Real-world benchmarks: ViTs in my projects

I have used ViTs in several projects. Here are honest benchmarks from production-adjacent work — not cherry-picked paper results.

Task	Model	Accuracy/mAP	Inference	Notes
Indoor object classification	DINOv2-B + linear	91.2% top-1	12ms (T4)	28 classes, 3,850 images
Indoor object classification	ResNet-50 (fine-tuned)	87.4% top-1	4ms (T4)	Same dataset
Instance segmentation	YOLOv8m (ConvNet)	80.1% mAP50	16ms (3070)	HoloLens 2 data
Depth estimation	DINOv2-B + DPT head	Rel. err 0.11	45ms (T4)	NYU Depth V2
Image retrieval	DINOv2-B frozen	R@1: 89.3%	8ms (T4)	Indoor scene retrieval

Real-world ViT performance from my projects.
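For readers unfamiliar with the retrieval metric, R@1 is the fraction of queries whose single nearest neighbour in the gallery has the correct label. The sketch below is a generic definition using cosine similarity on feature vectors. It is not the exact evaluation protocol behind the table, and the function name and toy data are illustrative.

```python
import torch
import torch.nn.functional as F


def recall_at_1(query_feats, query_labels, gallery_feats, gallery_labels):
    """Fraction of queries whose nearest gallery item (cosine) shares their label."""
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    nearest = (q @ g.T).argmax(dim=1)  # index of the most similar gallery item
    return (gallery_labels[nearest] == query_labels).float().mean().item()


# Toy data: three gallery items, and queries that are scaled, offset copies.
gallery = torch.eye(3)
gallery_labels = torch.tensor([0, 1, 2])
queries = 2.0 * torch.eye(3) + 0.1
query_labels = torch.tensor([0, 1, 2])

print(recall_at_1(queries, query_labels, gallery, gallery_labels))  # 1.0
```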

The pattern is clear: DINOv2 frozen features are extraordinarily good for classification and retrieval. For dense prediction (segmentation, depth), you need a proper decoder head, but the backbone quality still gives you a significant advantage over training from scratch.

“Foundation models like DINOv2 represent a genuine paradigm shift. The question is no longer 'how do I train a good backbone?' but 'how do I build the best head for my task?'”
Personal reflection · After 6 months of experiments, 2025

Conclusion

Where do we go from here?

Vision Transformers are no longer experimental — they are the default choice for most computer vision tasks in 2025. But knowing which ViT to use and how to deploy it efficiently is where the real skill lies.

DINOv2 is the best starting point for most CV tasks. Frozen features + lightweight head = fast iteration.
Swin Transformer is the best drop-in replacement for ResNet in detection/segmentation frameworks.
CNNs still win on edge devices, tiny datasets, and extreme latency requirements.
Don't train ViTs from scratch unless you have 10M+ images. Use pre-trained weights and fine-tune.
Self-supervised pre-training (DINO, MAE, I-JEPA) is the future — it removes the data labelling bottleneck.

DINOv2 Official Repo: Model weights, fine-tuning notebooks, and evaluation benchmarks.
ViT Paper (ICLR 2021): Dosovitskiy et al., An Image is Worth 16x16 Words.
Swin Transformer Paper: Liu et al., Hierarchical Vision Transformer using Shifted Windows.

Tags

Vision Transformer · ViT · DINOv2 · Swin Transformer · DeiT · Self-Supervised Learning · PyTorch · Transfer Learning

