In this part, I’ll show how DINOv2 works in a real-world scenario by building a fine-grained image classification task.
Classification workflow:
- Download the Food101 dataset from PyTorch (torchvision) datasets.
- Extract features from the train and test datasets using small DINOv2.
- Train ML classifier models (SVM, XGBoost and KNN) on the features extracted from the training dataset.
- Make predictions on the features extracted from the test dataset.
- Evaluate each ML model’s accuracy and F1 score.
Data: Food 101 is a challenging dataset of 101 food categories with 101,000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images.
Model: small DINOv2 model (ViT-S/14 distilled)
ML models: SVM, XGBoost, KNN.
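The split sizes follow directly from the per-class counts above, and they should match the dataset lengths printed in Step 2. A quick sanity check in plain Python (the constant names are just illustrative):

```python
NUM_CLASSES = 101
TRAIN_PER_CLASS = 750  # training images per class
TEST_PER_CLASS = 250   # manually reviewed test images per class

train_size = NUM_CLASSES * TRAIN_PER_CLASS
test_size = NUM_CLASSES * TEST_PER_CLASS
print(train_size, test_size, train_size + test_size)  # 75750 25250 101000
```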
Step 1 — Set up (you can run the code on Google Colab with the GPU runtime turned on)
import torch
import numpy as np
import torchvision
from torchvision import transforms
from torch.utils.data import Subset, DataLoader
import matplotlib.pyplot as plt
import time
import os
import random
from tqdm import tqdm
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd
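def set_seed(no):
    # Seed every random number generator used in this notebook for reproducibility
    torch.manual_seed(no)
    random.seed(no)
    np.random.seed(no)
    # Only affects Python processes started after this point; the current interpreter's hash seed is fixed at startup
    os.environ['PYTHONHASHSEED'] = str(no)
    # Prefer deterministic cuDNN kernels over autotuned (faster, non-deterministic) ones
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_seed(100)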
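To see what seeding buys you, here is a minimal check using only the standard library and NumPy (the torch calls are left out so the snippet runs anywhere). Reseeding with the same value replays the same random stream, which is also what keeps the optional random subsets in Step 2 identical between runs. The helper name `seed_python_and_numpy` is hypothetical.

```python
import random
import numpy as np

def seed_python_and_numpy(no: int) -> None:
    # Same idea as set_seed above, minus the PyTorch-specific calls
    random.seed(no)
    np.random.seed(no)

seed_python_and_numpy(100)
first = (random.sample(range(75750), 5), np.random.rand(3).tolist())

seed_python_and_numpy(100)
second = (random.sample(range(75750), 5), np.random.rand(3).tolist())

print(first == second)  # True: same seed, same draws
```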
Step 2 — Create the transformation, download the Food101 PyTorch datasets, and create the train and test dataloader objects.
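batch_size = 8

# Resize, center-crop to 224x224 (a multiple of DINOv2's 14-pixel patch size) and normalize with ImageNet statistics
transformation = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])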
trainset = torchvision.datasets.Food101(root='./data', split='train',
                                        download=True, transform=transformation)
testset = torchvision.datasets.Food101(root='./data', split='test',
                                       download=True, transform=transformation)

# Read the class names before any subsetting (Subset objects have no .classes attribute)
classes = trainset.classes

# Optional: uncomment to work on random subsets for a faster run
# train_indices = random.sample(range(len(trainset)), 20000)
# test_indices = random.sample(range(len(testset)), 5000)
# trainset = Subset(trainset, train_indices)
# testset = Subset(testset, test_indices)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False)
print(len(trainset), len(testset))
print(len(trainloader), len(testloader))
[out] 75750 25250
[out] 9469 3157
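The batch counts follow from the dataset sizes: `DataLoader` keeps the last, smaller batch by default (`drop_last=False`), so the number of batches is the dataset size divided by `batch_size`, rounded up. A small sketch of that arithmetic (`batch_stats` is a hypothetical helper, not part of PyTorch):

```python
import math

def batch_stats(n, batch_size=8):
    # Number of batches when the final partial batch is kept, plus that batch's size
    num_batches = math.ceil(n / batch_size)
    last_batch = n - (num_batches - 1) * batch_size
    return num_batches, last_batch

print(batch_stats(75750))  # (9469, 6)
print(batch_stats(25250))  # (3157, 2)
```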
Step 3 (Optional) — Visualize a batch from the training dataloader
# Get one batch of images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# ImageNet statistics used by transforms.Normalize in Step 2
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Plot the images side by side
fig, axes = plt.subplots(1, len(images), figsize=(12, 12))
for i, ax in enumerate(axes):
    # Convert the tensor image to NumPy format
    image = images[i].numpy()
    image = image.transpose((1, 2, 0))  # transpose to (height, width, channels)
    # Undo the normalization and clip to [0, 1], the range imshow expects for float images
    denormalized_image = np.clip(image * std + mean, 0, 1)
    # Display the image
    ax.imshow(denormalized_image)
    ax.axis('off')
    ax.set_title(f'Label: {labels[i].item()}')

# Show the plot
plt.show()
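`transforms.Normalize` computes `(x - mean) / std` per channel on a `(C, H, W)` tensor, and the plotting code reverses it as `x * std + mean` after transposing to `(H, W, C)`. A NumPy-only sketch of that round trip on a synthetic random "image" (not a Food101 sample) shows the two operations cancel out:

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

rng = np.random.default_rng(0)
image_chw = rng.random((3, 4, 4))  # fake image in [0, 1], channels first like ToTensor output

# What transforms.Normalize does: per-channel (x - mean) / std on (C, H, W)
normalized = (image_chw - mean[:, None, None]) / std[:, None, None]

# What the plotting code does: transpose to (H, W, C), then x * std + mean
restored_hwc = np.clip(normalized.transpose(1, 2, 0) * std + mean, 0.0, 1.0)

print(np.allclose(restored_hwc, image_chw.transpose(1, 2, 0)))  # True
```

Note the broadcasting: on `(C, H, W)` the statistics need explicit `[:, None, None]` axes, while on `(H, W, C)` a plain length-3 array lines up with the last axis automatically.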
Step 4 — Load the small DINOv2 model and extract features from the training and test dataloaders.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').to(device)
dinov2_vits14.eval()  # inference mode; the backbone stays frozen

# training set
train_embeddings = []
train_labels = []
with torch.no_grad():
    for data, labels in tqdm(trainloader):
        # Each forward pass returns one 384-dim embedding per image
        image_embeddings_batch = dinov2_vits14(data.to(device))
        train_embeddings.append(image_embeddings_batch.cpu().numpy())
        train_labels.append(labels.numpy())

# test set
test_embeddings = []
test_labels = []
with torch.no_grad():
    for data, labels in tqdm(testloader):
        image_embeddings_batch = dinov2_vits14(data.to(device))
        test_embeddings.append(image_embeddings_batch.cpu().numpy())
        test_labels.append(labels.numpy())

# concatenate the per-batch results
train_embeddings_f = np.vstack(train_embeddings)
train_labels_f = np.concatenate(train_labels).flatten()
test_embeddings_f = np.vstack(test_embeddings)
test_labels_f = np.concatenate(test_labels).flatten()
train_embeddings_f.shape, train_labels_f.shape, test_embeddings_f.shape, test_labels_f.shape
[out] ((75750, 384), (75750,), (25250, 384), (25250,))
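Calling the hub model directly returns one embedding per image (the normalized class-token output), so each batch yields a `(batch_size, 384)` array for ViT-S/14, and `np.vstack` stitches the per-batch arrays together, including the final partial batch. A NumPy simulation with random arrays standing in for real embeddings, shrunk to 22 "images" so it runs instantly:

```python
import numpy as np

EMBED_DIM = 384   # ViT-S/14 output width, matching the shapes printed above
batch_size = 8
num_images = 22   # stand-in dataset size: batches of 8, 8 and 6

rng = np.random.default_rng(0)
embeddings, labels = [], []
for start in range(0, num_images, batch_size):
    n = min(batch_size, num_images - start)
    embeddings.append(rng.standard_normal((n, EMBED_DIM)).astype(np.float32))  # fake features
    labels.append(rng.integers(0, 101, size=n))  # fake class ids in [0, 101)

X = np.vstack(embeddings)             # (num_images, EMBED_DIM)
y = np.concatenate(labels).flatten()  # (num_images,)
print(X.shape, y.shape)  # (22, 384) (22,)

# Patch tokens per 224x224 input with 14-pixel patches (not used by the classifiers here)
print((224 // 14) ** 2)  # 256
```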
Step 5 — Build a function that trains and evaluates the SVM, XGBoost and KNN classifiers.
def evaluate_classifiers(X_train, y_train, X_test, y_test):
    # Support Vector Machine (SVM), RBF kernel by default
    svm_classifier = SVC()
    svm_classifier.fit(X_train, y_train)
    svm_predictions = svm_classifier.predict(X_test)

    # XGBoost classifier on the GPU (XGBoost >= 2.0 syntax; older versions use tree_method='gpu_hist')
    xgb_classifier = XGBClassifier(tree_method='hist', device='cuda')
    xgb_classifier.fit(X_train, y_train)
    xgb_predictions = xgb_classifier.predict(X_test)

    # K-Nearest Neighbors (KNN) classifier, k=5 by default
    knn_classifier = KNeighborsClassifier()
    knn_classifier.fit(X_train, y_train)
    knn_predictions = knn_classifier.predict(X_test)

    # Top-1 accuracy
    top1_svm = accuracy_score(y_test, svm_predictions)
    top1_xgb = accuracy_score(y_test, xgb_predictions)
    top1_knn = accuracy_score(y_test, knn_predictions)

    # F1 score, averaged over classes and weighted by class support
    f1_svm = f1_score(y_test, svm_predictions, average='weighted')
    f1_xgb = f1_score(y_test, xgb_predictions, average='weighted')
    f1_knn = f1_score(y_test, knn_predictions, average='weighted')

    return pd.DataFrame({
        'SVM': {'Top-1 Accuracy': top1_svm, 'F1 Score': f1_svm},
        'XGBoost': {'Top-1 Accuracy': top1_xgb, 'F1 Score': f1_xgb},
        'KNN': {'Top-1 Accuracy': top1_knn, 'F1 Score': f1_knn}
    })

X_train = train_embeddings_f  # training data features
y_train = train_labels_f      # training data labels
X_test = test_embeddings_f    # test data features
y_test = test_labels_f        # test data labels

results = evaluate_classifiers(X_train, y_train, X_test, y_test)
print(results)
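The two metrics answer different questions: `accuracy_score` is the fraction of exact matches, while `f1_score(..., average='weighted')` computes an F1 score per class and averages them weighted by each class's support (its number of true samples). A hand-rolled NumPy sketch on a toy label set makes this concrete; `accuracy` and `weighted_f1` are hypothetical helpers meant to mirror scikit-learn's definitions, not replace them.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the label
    return float(np.mean(y_true == y_pred))

def weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights = number of true samples per class.
    # Classes that appear only in y_pred have zero support, so skipping them changes nothing.
    classes, support = np.unique(y_true, return_counts=True)
    f1_per_class = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_per_class.append(f1)
    return float(np.average(f1_per_class, weights=support))

y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2])
print(accuracy(y_true, y_pred))     # ~0.667 (4 of 6 correct)
print(weighted_f1(y_true, y_pred))  # ~0.678 (per-class F1 0.8, 0.5, 0.667 with weights 3, 2, 1)
```

On Food 101's balanced test split every class has the same support (250), so the weighted average reduces to a plain mean of the per-class F1 scores.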
Results
Wow, the results are great! As the printed table shows, the SVM model trained on features extracted by small DINOv2 outperformed the other ML models and achieved almost 90% accuracy.
Even though we used the small DINOv2 model to extract features, the ML models (especially SVM) trained on those features performed very well on this fine-grained classification task: the SVM classifies images into 101 different classes with almost 90% accuracy.
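To put "almost 90%" in perspective, it helps to compare against trivial baselines. Because the Food 101 test split is balanced (250 images per class), both uniform random guessing and always predicting a single class land at about 1%:

```python
num_classes = 101
test_images_per_class = 250
test_size = num_classes * test_images_per_class  # 25250

# Uniform random guessing: right 1 time in 101 on average
chance_accuracy = 1 / num_classes
# Always predicting one class: correct only for that class's 250 test images
majority_accuracy = test_images_per_class / test_size

print(f"{chance_accuracy:.2%} {majority_accuracy:.2%}")  # 0.99% 0.99%
```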
Accuracy would likely improve further with the base, large or giant DINOv2 models. You just need to replace dinov2_vits14 in Step 4 with dinov2_vitb14, dinov2_vitl14 or dinov2_vitg14. Give it a try, and feel free to share your accuracy results in the comments section 🙂
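Swapping the backbone also changes the width of the extracted features, and therefore the memory footprint of the arrays from Step 4 and the cost of training the classifiers in Step 5. The Step 5 code itself needs no change, since the classifiers accept whatever feature dimension `X_train` has. A small sketch of the expected shapes, assuming float32 features (4 bytes per value); `feature_matrix_shape` and `feature_matrix_megabytes` are hypothetical helpers:

```python
# Output width (class-token embedding size) of each DINOv2 backbone on torch.hub
DINOV2_EMBED_DIM = {
    "dinov2_vits14": 384,   # small (used in this article)
    "dinov2_vitb14": 768,   # base
    "dinov2_vitl14": 1024,  # large
    "dinov2_vitg14": 1536,  # giant
}

def feature_matrix_shape(model_name, num_images):
    """Hypothetical helper: shape of the stacked features produced in Step 4."""
    return (num_images, DINOV2_EMBED_DIM[model_name])

def feature_matrix_megabytes(model_name, num_images, bytes_per_value=4):
    """Hypothetical helper: size of that matrix in MB, assuming float32 values."""
    rows, cols = feature_matrix_shape(model_name, num_images)
    return rows * cols * bytes_per_value / 1e6

for name in DINOV2_EMBED_DIM:
    print(name, feature_matrix_shape(name, 75750), f"{feature_matrix_megabytes(name, 75750):.1f} MB")
```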