
A Deep Dive into the Code of the Vision Transformer (ViT) Model | by Alexey Kravets | Aug, 2023


Breaking down the HuggingFace ViT Implementation

Vision Transformer (ViT) stands as a remarkable milestone in the evolution of computer vision. ViT challenges the conventional wisdom that images are best processed by convolutional layers, proving that sequence-based attention mechanisms can effectively capture the intricate patterns, context, and semantics present in images. By breaking down images into manageable patches and leveraging self-attention, ViT captures both local and global relationships, enabling it to excel in various vision tasks, from image classification to object detection and beyond. In this article, we are going to break down how ViT for classification works under the hood.

https://unsplash.com/photos/aVvZJC0ynBQ

The core idea of ViT is to treat an image as a sequence of fixed-size patches, which are then flattened and converted into 1D vectors. These patches are subsequently processed by a transformer encoder, which allows the model to capture global context and dependencies across the entire image. By dividing the image into patches, ViT effectively reduces the computational complexity of handling large images while retaining the ability to model complex spatial interactions.
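
To make the patching step concrete, here is a minimal sketch (using plain torch; the tensor names are ours, not the library's) of how a 224×224 image becomes a sequence of flattened patches:

import torch

# A dummy batch with a single 224x224 RGB image
image = torch.randn(1, 3, 224, 224)
patch_size = 16

# Cut the image into non-overlapping 16x16 patches: (1, 3, 14, 14, 16, 16)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Flatten each patch into a 1D vector: 196 patches of 3*16*16 values each
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768])

With a 224×224 input and 16×16 patches we get 14×14 = 196 patches, each flattened into a vector of 3·16·16 = 768 values.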

To start with, we import the ViT model for classification from the Hugging Face transformers library:

from transformers import ViTForImageClassification
import torch
import numpy as np

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

patch16-224 indicates that the model accepts images of size 224×224 and that each patch has a width and height of 16 pixels.
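
As a quick sanity check (a minimal sketch; the random tensor below simply stands in for a real preprocessed image), we can push a dummy 224×224 input through the model and look at the output shape:

import torch

# A random tensor standing in for one preprocessed 224x224 RGB image
pixel_values = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# One logit per ImageNet class for the single image in the batch
print(outputs.logits.shape)  # torch.Size([1, 1000])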

This is what the model architecture looks like:

ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): PatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0): ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key)…
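
The printout above already reveals how the patch embedding is implemented: rather than slicing the image explicitly, a Conv2d whose kernel size and stride both equal the patch size projects every 16×16 patch straight to a 768-dimensional embedding. Below is a rough sketch of the equivalent operation (using a freshly initialized layer, not the pretrained weights):

import torch
import torch.nn as nn

# A stand-in for the (projection) layer shown in the printout
projection = nn.Conv2d(3, 768, kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)

# Each 16x16 patch is mapped to a 768-dim vector on a 14x14 grid: (1, 768, 14, 14)
embeddings = projection(image)

# Flatten the grid into a sequence of 14 * 14 = 196 patch embeddings: (1, 196, 768)
embeddings = embeddings.flatten(2).transpose(1, 2)
print(embeddings.shape)  # torch.Size([1, 196, 768])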

