
Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard | by Chaim Rand | Aug, 2023


PyTorch Model Performance Analysis and Optimization — Part 4

Photo by Alexander Grey on Unsplash

This is the fourth post in our series of posts on the topic of performance analysis and optimization of GPU-based PyTorch workloads. Our focus in this post will be on the training data input pipeline. In a typical training application, the host's CPUs load, pre-process, and collate data before feeding it into the GPU for training. Bottlenecks in the input pipeline occur when the host cannot keep pace with the GPU. The result is that the GPU, the most expensive resource in the training setup, sits idle for periods of time while it waits for data from the overly tasked host. In previous posts (e.g., here) we have discussed input pipeline bottlenecks in detail and reviewed different ways of addressing them, such as:

  1. Choosing a training instance with a CPU-to-GPU compute ratio that is better suited to your workload (e.g., see our previous post with tips for choosing the best instance type for your ML workload),
  2. Improving the workload balance between the CPU and GPU by moving some of the CPU pre-processing activity to the GPU, and
  3. Offloading some of the CPU computation to auxiliary CPU-worker devices (e.g., see here).

Of course, the first step in addressing a performance bottleneck in the data input pipeline is to identify and understand it. In this post we will demonstrate how this can be done using PyTorch Profiler and its associated TensorBoard plugin.

As in our previous posts, we will define a toy PyTorch model and then iteratively profile its performance, identify bottlenecks, and attempt to fix them. We will run our experiments on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) using the official AWS PyTorch 2.0 Docker image. Keep in mind that some of the behaviors we describe may vary between versions of PyTorch.

Many thanks to Yitzhak Levi for his contributions to this post.

In the following blocks we introduce the toy example we will use for our demonstration. We start by defining a simple image classification model. The input to the model is a batch of 256x256 YUV images and the output is the associated batch of semantic class predictions.

from math import log2
import torch
import torch.nn as nn
import torch.nn.functional as F

img_size = 256
num_classes = 10
hidden_size = 30

# toy CNN classification model
class Net(nn.Module):
    def __init__(self, img_size=img_size, num_classes=num_classes):
        super().__init__()
        self.conv_in = nn.Conv2d(3, hidden_size, 3, padding='same')
        num_hidden = int(log2(img_size))
        hidden = []
        for i in range(num_hidden):
            hidden.append(nn.Conv2d(hidden_size, hidden_size, 3, padding='same'))
            hidden.append(nn.ReLU())
            hidden.append(nn.MaxPool2d(2))
        self.hidden = nn.Sequential(*hidden)
        self.conv_out = nn.Conv2d(hidden_size, num_classes, 3, padding='same')

    def forward(self, x):
        x = F.relu(self.conv_in(x))
        x = self.hidden(x)
        x = self.conv_out(x)
        x = torch.flatten(x, 1)
        return x

The code block below contains our dataset definition. Our dataset contains ten thousand JPEG image file paths and their associated (randomly generated) semantic labels. To simplify our demonstration, we will assume that all of the JPEG file paths point to the same image: the picture of the colorful "bottlenecks" at the top of this post.

import numpy as np
from PIL import Image
from torchvision.datasets.vision import VisionDataset

input_img_size = [533, 800]

class FakeDataset(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        size = 10000
        self.img_files = [f'0.jpg' for i in range(size)]
        self.targets = np.random.randint(low=0, high=num_classes,
                                         size=(size), dtype=np.uint8).tolist()

    def __getitem__(self, index):
        img_file, target = self.img_files[index], self.targets[index]
        with torch.profiler.record_function('PIL open'):
            img = Image.open(img_file)
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self):
        return len(self.img_files)

Note that we have wrapped the file reader with a torch.profiler.record_function context manager.

Our input data pipeline includes the following transformations on the image:

  1. PILToTensor converts the PIL image to a PyTorch Tensor.
  2. RandomCrop returns a 256x256 crop at a random offset in the image.
  3. RandomMask is a custom transform that creates a random 256x256 boolean mask and applies it to the image. The transform includes a four-neighbor dilation operation on the mask.
  4. ConvertColor is a custom transformation that converts the image format from RGB to YUV.
  5. Scale is a custom transformation that scales the pixels to the range [0,1].
class RandomMask(torch.nn.Module):
    def __init__(self, ratio=0.25):
        super().__init__()
        self.ratio = ratio

    def dilate_mask(self, mask):
        # perform 4 neighbor dilation on mask
        with torch.profiler.record_function('dilation'):
            from scipy.signal import convolve2d
            dilated = convolve2d(mask, [[0, 1, 0],
                                        [1, 1, 1],
                                        [0, 1, 0]], mode='same').astype(bool)
        return dilated

    def forward(self, img):
        with torch.profiler.record_function('random'):
            mask = np.random.uniform(size=(img_size, img_size)) < self.ratio
        dilated_mask = torch.unsqueeze(torch.tensor(self.dilate_mask(mask)), 0)
        dilated_mask = dilated_mask.expand(3, -1, -1)
        img[dilated_mask] = 0.
        return img

    def __repr__(self):
        return f"{self.__class__.__name__}(ratio={self.ratio})"

class ConvertColor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.A = torch.tensor(
            [[0.299, 0.587, 0.114],
             [-0.16874, -0.33126, 0.5],
             [0.5, -0.41869, -0.08131]]
        )
        self.b = torch.tensor([0., 128., 128.])

    def forward(self, img):
        img = img.to(dtype=torch.get_default_dtype())
        img = torch.matmul(self.A, img.view([3, -1])).view(img.shape)
        img = img + self.b[:, None, None]
        return img

    def __repr__(self):
        return f"{self.__class__.__name__}()"

class Scale(object):
    def __call__(self, img):
        return img.to(dtype=torch.get_default_dtype()).div(255)

    def __repr__(self):
        return f"{self.__class__.__name__}()"

We chain the transformations using the Compose class, which we have modified slightly to include a torch.profiler.record_function context manager around each of the transform invocations.

import torchvision.transforms as T

class CustomCompose(T.Compose):
    def __call__(self, img):
        for t in self.transforms:
            with torch.profiler.record_function(t.__class__.__name__):
                img = t(img)
        return img

transform = CustomCompose(
    [T.PILToTensor(),
     T.RandomCrop(img_size),
     RandomMask(),
     ConvertColor(),
     Scale()])

In the code block below we define the dataset and the DataLoader. We configure the DataLoader to use a custom collate function in which we wrap the default collate function with a torch.profiler.record_function context manager.

train_set = FakeDataset(transform=transform)

def custom_collate(batch):
    from torch.utils.data._utils.collate import default_collate
    with torch.profiler.record_function('collate'):
        batch = default_collate(batch)
    image, label = batch
    return image, label

train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           collate_fn=custom_collate,
                                           num_workers=4, pin_memory=True)

Finally, we define the model, loss function, optimizer, and training loop, which we wrap with a profiler context manager.

from statistics import mean, variance
from time import time

device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

t0 = time()
times = []

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=10, warmup=2, active=10, repeat=1),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/prof'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, data in enumerate(train_loader):
        with torch.profiler.record_function('h2d copy'):
            inputs, labels = data[0].to(device=device, non_blocking=True), \
                             data[1].to(device=device, non_blocking=True)
        if step >= 40:
            break
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()
        times.append(time() - t0)
        t0 = time()

print(f'average time: {mean(times[1:])}, variance: {variance(times[1:])}')

In the following sections we will use PyTorch Profiler and its associated TensorBoard plugin in order to assess the performance of our model. Our focus will be on the Trace View of the profiler report. Please see the first post in our series for a demonstration of how to use the other sections of the report.

The average step time reported by the script we defined above is 1.3 seconds and the average GPU utilization is a very low 18.21%. In the image below we capture the performance results as displayed in the TensorBoard plugin Trace View:

Trace View of Baseline Model (Captured by Author)

We can see that every fourth training step includes a long (~5.5 second) period of data loading during which the GPU is completely idle. The reason this occurs on every fourth step is directly related to the number of DataLoader workers we chose: four. Every fourth step we find all of the workers busy producing the samples for the next batch while the GPU waits. This is a clear indication of a bottleneck in the data input pipeline. The question is how do we analyze it? Complicating matters is the fact that the many record_function markers we inserted into the code are nowhere to be found in the profile trace.

The use of multiple workers in the DataLoader is critical for optimizing performance. Unfortunately, it also makes the profiling process more difficult. Although there exist profilers that support multi-process analysis (e.g., check out VizTracer), the approach we will take in this post is to run, analyze, and optimize our model in single-process mode (i.e., with zero DataLoader workers) and then apply the optimizations to the multi-worker mode. Admittedly, optimizing the speed of a standalone function does not guarantee that multiple (parallel) invocations of the same function will also benefit. However, as we will see in this post, this strategy will enable us to identify and address some core issues that we were not able to identify otherwise, and, at least with regard to the issues discussed here, we will find a strong correlation between the performance impacts on the two modes. But just before we apply this strategy, let us tune our choice of the number of workers.

Determining the optimal number of threads or processes in a multi-process/multi-threaded application, such as ours, can be tricky. On the one hand, if we choose a number that is too low, we might end up under-utilizing the CPU resources. On the other hand, if we go too high, we run the risk of thrashing, an undesired situation in which the operating system spends most of its time managing the multiple threads/processes rather than running our code. In the case of a PyTorch training workload, it is recommended to try out different choices for the DataLoader num_workers setting. A good starting point is to base the number on the number of CPUs on the host (e.g., num_workers:=num_cpus/num_gpus). In our case, the Amazon EC2 g5.2xlarge has eight vCPUs and, indeed, increasing the number of DataLoader workers to eight results in a slightly better average step time of 1.17 seconds (an 11% improvement).
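As a rough illustration, such a sweep could look like the following sketch. It assumes the train_set and custom_collate defined above (the helper name and the list of candidate values are our own additions) and times only the data-loading loop, without the training step, which is enough to compare input-pipeline throughput across worker counts:

from time import perf_counter
import torch

def time_data_loading(num_workers, num_steps=40):
    # hypothetical helper: average time to produce a batch for a given
    # number of DataLoader workers, using the dataset defined above
    loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                         collate_fn=custom_collate,
                                         num_workers=num_workers,
                                         pin_memory=True)
    times = []
    t0 = perf_counter()
    for step, (image, label) in enumerate(loader):
        if step >= num_steps:
            break
        times.append(perf_counter() - t0)
        t0 = perf_counter()
    return sum(times[1:]) / len(times[1:])

for n in [0, 2, 4, 8]:
    print(f'num_workers={n}: {time_data_loading(n):.3f} seconds per batch')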

Importantly, look out for other, less obvious, configuration settings that might affect the number of threads or processes used by the data input pipeline. For example, opencv-python, a library commonly used for image pre-processing in computer vision workloads, includes the cv2.setNumThreads(int) function for controlling the number of threads.
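For example, if your transforms relied on OpenCV (our toy pipeline does not), one way to keep its internal threading from competing with the DataLoader worker processes might be to disable it in a worker_init_fn, as in the following sketch:

import cv2
import torch

def worker_init_fn(worker_id):
    # disable OpenCV's internal threading inside each DataLoader worker
    # so that the worker processes do not oversubscribe the host CPUs
    cv2.setNumThreads(0)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           collate_fn=custom_collate,
                                           num_workers=8, pin_memory=True,
                                           worker_init_fn=worker_init_fn)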

In the image below we capture a portion of the Trace View when running the script with num_workers set to zero.

Trace View of Baseline Model in Single-process Mode (Captured by Author)

Running the script in this manner enables us to see the record_function labels we set and to identify the RandomMask transform, or more specifically our dilation function, as the most time-consuming operation in the retrieval of each individual sample.

Our current implementation of the dilation function uses a 2D convolution, typically implemented using matrix multiplication and not known to run especially fast on CPU. One option would be to run the dilation on the GPU (as described in this post). However, the overhead involved in the host-device transaction would likely outweigh the potential performance gains of such a solution, not to mention that we would prefer not to increase the load on the GPU.
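For reference, such a GPU-based dilation could be expressed as a convolution with a fixed four-neighbor kernel, as in the sketch below. We include it only to illustrate the option we are passing on; it is not used in the pipeline that follows:

import torch
import torch.nn.functional as F

def dilate_mask_gpu(mask):
    # mask: boolean tensor of shape [H, W], assumed to already reside on the GPU
    kernel = torch.tensor([[0., 1., 0.],
                           [1., 1., 1.],
                           [0., 1., 0.]], device=mask.device)
    # convolve with the four-neighbor kernel and threshold back to a boolean mask
    out = F.conv2d(mask.float()[None, None], kernel[None, None], padding=1)
    return out[0, 0] > 0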

In the code block below we propose an alternative, more CPU-friendly, implementation of the dilation function that uses boolean operations instead of a convolution:

    def dilate_mask(self, mask):
        # perform 4 neighbor dilation on mask
        with torch.profiler.record_function('dilation'):
            padded = np.pad(mask, [(1, 1), (1, 1)])
            dilated = padded[0:-2, 1:-1] | padded[1:-1, 1:-1] | padded[2:, 1:-1] | \
                      padded[1:-1, 0:-2] | padded[1:-1, 2:]
        return dilated
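A quick standalone check (not part of the training script) can confirm that the boolean-operation version produces the same result as the original convolution-based dilation:

import numpy as np
from scipy.signal import convolve2d

mask = np.random.uniform(size=(256, 256)) < 0.25

# original convolution-based dilation
conv_dilated = convolve2d(mask, [[0, 1, 0],
                                 [1, 1, 1],
                                 [0, 1, 0]], mode='same').astype(bool)

# boolean-operation-based dilation
padded = np.pad(mask, [(1, 1), (1, 1)])
bool_dilated = padded[0:-2, 1:-1] | padded[1:-1, 1:-1] | padded[2:, 1:-1] | \
               padded[1:-1, 0:-2] | padded[1:-1, 2:]

assert np.array_equal(conv_dilated, bool_dilated)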

Following this change, our step time drops to 0.78 seconds, which amounts to an additional 50% improvement. The updated single-process Trace View is displayed below:

Trace View Following Dilation Optimization in Single-process Mode (Captured by Author)

We can see that the dilation operation has shrunk significantly and that the most time-consuming operation is now the PILToTensor transform.

A closer look at the PILToTensor function (see here) reveals three underlying operations:

  1. Loading the PIL image: due to the lazy loading property of Image.open, the image is actually loaded here.
  2. The PIL image is converted to a NumPy array.
  3. The NumPy array is converted to a PyTorch Tensor.
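Roughly speaking, the per-sample work amounts to something like the following sketch (a simplified approximation, not torchvision's exact implementation):

import numpy as np
import torch
from PIL import Image

img = Image.open('0.jpg')   # lazy: only the file header is read here
arr = np.array(img)         # decoding happens here, on the full 533x800 image
tensor = torch.from_numpy(arr).permute(2, 0, 1)  # HWC uint8 array -> CHW tensor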

Although the image loading takes the majority of the time, note the extreme wastefulness of applying the subsequent operations to the full-size image only to crop it immediately afterwards. This leads us to our next optimization.

Luckily, the RandomCrop transformation can be applied directly to the PIL image, enabling us to apply the image-size reduction as the first operation in our pipeline:

transform = CustomCompose(
    [T.RandomCrop(img_size),
     T.PILToTensor(),
     RandomMask(),
     ConvertColor(),
     Scale()])

Following this optimization our step time drops to 0.72 seconds, an additional 8% improvement. The Trace View capture below shows that the RandomCrop transformation is now the dominant operation:

Trace View Following Transformation Reordering in Single-process Mode (Captured by Author)

In reality, as before, it is actually the (lazy) loading of the PIL image that causes the bottleneck, not the random crop itself.

Ideally, we would be able to optimize this further by limiting the read operation to only the crop in which we are interested. Unfortunately, as of the time of this writing, torchvision does not support this option. In a future post we will demonstrate how we can overcome this shortcoming by implementing our own custom decode_and_crop PyTorch operator.

In our current implementation, each of the image transformations is applied to each image individually. However, some transformations may run more optimally when applied to the entire batch at once. In the code block below we modify our pipeline so that the ConvertColor and Scale transforms are applied to image batches inside our custom collate function:

def batch_transform(img):
    img = img.to(dtype=torch.get_default_dtype())
    A = torch.tensor(
        [[0.299, 0.587, 0.114],
         [-0.16874, -0.33126, 0.5],
         [0.5, -0.41869, -0.08131]]
    )
    b = torch.tensor([0., 128., 128.])

    A = torch.broadcast_to(A, ([img.shape[0], 3, 3]))
    t_img = torch.bmm(A, img.view(img.shape[0], 3, -1))
    t_img = t_img + b[None, :, None]
    return t_img.view(img.shape) / 255

def custom_collate(batch):
    from torch.utils.data._utils.collate import default_collate
    with torch.profiler.record_function('collate'):
        batch = default_collate(batch)

    image, label = batch
    with torch.profiler.record_function('batch_transform'):
        image = batch_transform(image)
    return image, label

The result of this change is actually a slight increase in the step time, to 0.75 seconds. Although unhelpful in the case of our toy model, the ability to apply certain operations as batch transforms rather than per-sample transforms carries the potential to optimize certain workloads.

The successive optimizations we have applied in this post resulted in an 80% improvement in runtime performance. However, although less severe, there still remains a bottleneck in the input pipeline, and the GPU remains highly under-utilized (~30%). Please revisit our previous posts (e.g., here) for additional methods of addressing such issues.

In this post we have focused on performance issues in the training data input pipeline. As in our previous posts in this series, we have chosen PyTorch Profiler and its associated TensorBoard plugin as our weapons of choice and demonstrated their use in accelerating the speed of training. In particular, we showed how running the DataLoader with zero workers increases our ability to identify, analyze, and optimize bottlenecks in the data input pipeline.

As in our previous posts, we emphasize that the path to successful optimization will vary greatly based on the details of the training project, including the model architecture and training environment. In practice, reaching your goals may be more difficult than in the example we presented here. Some of the techniques we described may have little impact on your performance or might even make it worse. We also note that the precise optimizations we chose, and the order in which we chose to apply them, were somewhat arbitrary. You are highly encouraged to develop your own tools and techniques for reaching your optimization goals based on the specific details of your project.

