PyTorch Model Performance Analysis and Optimization | by Chaim Rand | Jun, 2023

How to Use PyTorch Profiler and TensorBoard to Accelerate Training and Reduce Cost

Photo by Torsten Dederichs on Unsplash

Training deep learning models, especially large ones, can be a costly expenditure. One of the main methods we have at our disposal for managing these costs is performance optimization. Performance optimization is an iterative process in which we consistently search for opportunities to increase the performance of our application and then take advantage of those opportunities. In previous posts (e.g., here) we have stressed the importance of having appropriate tools for conducting this analysis. The tools of choice will likely depend on a number of factors including the type of training accelerator (e.g., GPU, HPU, or other) and the training framework.

Performance Optimization Flow (By Author)

The focus in this post will be on training in PyTorch on GPU. More specifically, we will focus on PyTorch’s built-in performance analyzer, PyTorch Profiler, and on one of the ways to view its results, the PyTorch Profiler TensorBoard plugin.

This post is not meant to be a replacement for the official PyTorch documentation on either PyTorch Profiler or the use of the TensorBoard plugin for analyzing the profiler results. Our intention is rather to demonstrate how these tools might be used in the course of one’s daily development. In fact, if you haven’t already, we recommend that you take a look over the official documentation before reading this post.

For a while, I have been intrigued by one portion in particular of the TensorBoard-plugin tutorial. The tutorial introduces a classification model (based on the ResNet architecture) that is trained on the popular CIFAR10 dataset. It proceeds to demonstrate how PyTorch Profiler and the TensorBoard plugin can be used to identify and fix a bottleneck in the data loader. Performance bottlenecks in the input data pipeline are not uncommon and we have discussed them at length in some of our previous posts (e.g., here). What is surprising about the tutorial is the final (post-optimization) results that are presented (as of the time of this writing), which we have pasted in below:

Performance Following Optimization (From PyTorch Website)

If you look closely, you will see that the post-optimization GPU utilization is 40.46%. Now there is no way to sugarcoat this: These results are absolutely abysmal and should keep you up at night. As we have expanded on in the past (e.g., here), the GPU is the most expensive resource in our training machine and our goal should be to maximize its utilization. A 40.46% utilization result usually represents a significant opportunity for training acceleration and cost savings. Surely, we can do better! In this blog post we will try to do better. We will start by attempting to reproduce the results presented in the official tutorial and see whether we can use the same tools to further improve the training performance.

The code block below contains the training loop defined by the TensorBoard-plugin tutorial, with two minor modifications:

  1. We use a fake dataset with the same properties and behaviors as the CIFAR10 dataset that was used in the tutorial. The motivation for this change can be found here.
  2. We initialize the torch.profiler.schedule with the warmup flag set to 4 and the repeat flag set to 1. We found that this slight increase in the number of warmup steps improves the stability of the profiling results.

import numpy as np
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T
from torchvision.datasets.vision import VisionDataset
from PIL import Image

# fake dataset with the same shapes and dtypes as CIFAR10
class FakeCIFAR(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.data = np.random.randint(low=0,high=256,size=(10000,32,32,3),dtype=np.uint8)
        self.targets = np.random.randint(low=0,high=10,size=(10000),dtype=np.uint8).tolist()

    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self) -> int:
        return len(self.data)

# resize, convert to float32 and normalize, as in the official tutorial
transform = T.Compose(
    [T.Resize(224),
     T.ToTensor(),
     T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_set = FakeCIFAR(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True)

device = torch.device("cuda:0")
model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# train step
def train(data):
    inputs, labels = data[0].to(device=device), data[1].to(device=device)
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# training loop wrapped with profiler object
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
        # write traces for the TensorBoard plugin (log directory is an arbitrary choice)
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet18'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, batch_data in enumerate(train_loader):
        if step >= (1 + 4 + 3) * 1:
            break
        train(batch_data)
        prof.step()  # Need to call this at the end of each step
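
To make the profiling schedule used above more concrete, the following minimal sketch prints the action that the schedule assigns to each training step. It relies only on the fact that the object returned by torch.profiler.schedule is a callable that maps a step index to a ProfilerAction; no GPU is needed, and the variable names are ours.

```python
import torch.profiler

# Same schedule as in the training script: skip 1 step, warm up for 4, record 3, one cycle.
schedule_fn = torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1)

# The schedule is a callable that maps a step index to a ProfilerAction.
actions = [schedule_fn(step).name for step in range(10)]
for step, name in enumerate(actions):
    print(f"step {step}: {name}")
```

Only the three active steps end up in the trace, which is why the training loop stops after (1 + 4 + 3) * 1 steps.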

The GPU that was used in the tutorial was a Tesla V100-DGXS-32GB. In this post we attempt to reproduce, and improve on, the performance results from the tutorial using an Amazon EC2 p3.2xlarge instance that contains a Tesla V100-SXM2-16GB GPU. Although they share the same architecture, there are some differences between the two GPUs which you can learn about here. We ran the training script using an AWS PyTorch 2.0 Docker image. The performance results of the training script, as displayed in the Overview page of the TensorBoard viewer, are captured in the image below:

Baseline Performance Results as Shown in the TensorBoard Profiler Overview Tab (Captured by Author)

We first note that, contrary to the tutorial, the Overview page (of torch-tb-profiler version 0.4.1) in our experiment combined the three profiling steps into one. Thus, the average overall step time is 80 milliseconds and not 240 milliseconds as reported. This can be seen clearly in the Trace tab (which, in our experience, almost always provides a more accurate report) where each step takes ~80 milliseconds.

Baseline Performance Results as Shown in the TensorBoard Profiler Trace View Tab (Captured by Author)

Note that our starting point of 31.65% GPU utilization and a step time of 80 milliseconds is different from the starting point presented in the tutorial of 23.54% and 132 milliseconds, respectively. This is likely a result of differences in the training environment including the GPU type and the PyTorch version. We also note that while the tutorial baseline results clearly diagnose the performance issue as a bottleneck in the DataLoader, our results do not. We have often found that data loading bottlenecks will disguise themselves as a high percentage of “CPU Exec” or “Other” in the Overview tab.

Optimization #1: Multi-process Data Loading

Let’s start by applying multi-process data loading as described in the tutorial. Since the Amazon EC2 p3.2xlarge instance has 8 vCPUs, we set the number of DataLoader workers to 8 for maximum performance:

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=8)
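
The worker count of 8 is specific to the p3.2xlarge instance. As a rough sketch of how the setting might be derived on other machines, the snippet below caps the worker count at the number of CPUs reported by the operating system. The helper name `make_loader` and the small stand-in dataset are hypothetical and are not part of the tutorial; the best worker count for a given workload still needs to be confirmed by profiling.

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=32, max_workers=8):
    """Hypothetical helper: one worker per available CPU, up to max_workers."""
    num_workers = min(max_workers, os.cpu_count() or 1)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      num_workers=num_workers)


# Small stand-in dataset (random uint8 images and integer labels).
images = torch.randint(0, 256, (64, 3, 32, 32), dtype=torch.uint8)
labels = torch.randint(0, 10, (64,))
loader = make_loader(TensorDataset(images, labels))
print(f"using {loader.num_workers} DataLoader workers")
```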

The results of this optimization are displayed below:

Results of Multi-proc Data Loading in the TensorBoard Profiler Overview Tab (Captured by Author)

The change to a single line of code more than doubled the GPU utilization (from 31.65% to 72.81%), and more than halved our training step time (from 80 milliseconds down to 37).

This is where the optimization process in the tutorial comes to an end. Although our GPU utilization (72.81%) is quite a bit higher than the results in the tutorial (40.46%), I have no doubt that, like us, you find these results to still be quite unsatisfactory.

Personal commentary that you should feel free to skip: Imagine how much global money could be saved if PyTorch applied multi-process data loading by default when training on GPU! True, there may be some unwanted side-effects to using multiprocessing. Nevertheless, there must be some form of auto-detection algorithm that could be run that would identify potentially problematic scenarios and apply this optimization accordingly.

Optimization #2: Memory Pinning

If we analyze the Trace view of our last experiment, we can see that a significant amount of time (10 out of 37 milliseconds) is still spent on loading the training data into the GPU.

Results of Multi-proc Data Loading in the Trace View Tab (Captured by Author)

To address this, we will apply another PyTorch-recommended optimization for streamlining the data input flow, memory pinning. Using pinned memory can increase the speed of host-to-GPU data copies and, more importantly, allows us to make them asynchronous. This means that we can prepare the next training batch in the GPU in parallel to running the training step on the current batch. For more details as well as the potential side effects of memory pinning, please see the PyTorch documentation.

This optimization requires changes to two lines of code. First, we set the pin_memory flag of the DataLoader to True.

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32,
                                           shuffle=True, num_workers=8, pin_memory=True)

Then we modify the host-to-device memory transfer (in the train function) to be non-blocking:

inputs, labels = data[0].to(device=device, non_blocking=True), \
                 data[1].to(device=device, non_blocking=True)
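
The two changes work together: a non-blocking copy can only overlap with GPU compute when the source tensor lives in pinned (page-locked) host memory. The sketch below is our own illustration rather than code from the tutorial. It applies both steps to a single tensor and falls back to a plain CPU run when no GPU is present.

```python
import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# A batch as it would come out of the DataLoader (on the host).
batch = torch.rand(32, 3, 224, 224)

if use_cuda:
    # DataLoader(pin_memory=True) does this for every batch it yields.
    batch = batch.pin_memory()

# With a pinned source, this call can return before the copy has finished.
on_device = batch.to(device=device, non_blocking=True)

if use_cuda:
    # Wait for the asynchronous copy before inspecting the result from the host.
    torch.cuda.synchronize()

pinned = batch.is_pinned() if use_cuda else False
print("pinned source:", pinned, "| device:", on_device.device)
```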

The results of the memory pinning optimization are displayed below:

Results of Memory Pinning in the TensorBoard Profiler Overview Tab (Captured by Author)

Our GPU utilization now stands at a respectable 92.37% and our step time has further decreased. But we can still do better. Note that despite this optimization, the performance report continues to indicate that we are spending a lot of time copying the data into the GPU. We will come back to this in step 4 below.

Optimization #3: Increase Batch Size

For our next optimization, we draw our attention to the Memory View of the last experiment:

Memory View in TensorBoard Profiler (Captured by Author)

The chart shows that out of 16 GB of GPU memory, we are peaking at less than 1 GB of utilization. This is an extreme example of resource under-utilization that often (though not always) indicates an opportunity to boost performance. One way to control the memory utilization is to increase the batch size. In the image below we display the performance results when we increase the batch size to 512 (and the memory utilization to 11.3 GB).
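
Peak memory use can also be read programmatically, which is handy when searching for the largest batch size that fits. The following is a minimal sketch, assuming a single CUDA device at index 0; the helper name `gpu_memory_report` is ours. Note that it reports memory allocated through PyTorch's allocator, which may differ somewhat from what the profiler's Memory View shows.

```python
import torch


def gpu_memory_report(device_index=0):
    """Return (peak_allocated_gb, total_gb) for a CUDA device, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    total = torch.cuda.get_device_properties(device_index).total_memory
    peak = torch.cuda.max_memory_allocated(device_index)
    return peak / 1024**3, total / 1024**3


report = gpu_memory_report()
if report is None:
    print("no CUDA device available")
else:
    peak_gb, total_gb = report
    print(f"peak allocated: {peak_gb:.2f} GB of {total_gb:.2f} GB")
```

Calling the helper after a few training steps at each candidate batch size gives a quick view of how much headroom remains.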

Results of Increasing Batch Size in the TensorBoard Profiler Overview Tab (Captured by Author)

Although the GPU utilization measure did not change much, our training speed has increased considerably, from 1200 samples per second (46 milliseconds for batch size 32) to 1584 samples per second (324 milliseconds for batch size 512).
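
Throughput here is simply the batch size divided by the step time. As a quick sanity check of the batch-size-512 figure (the function name is ours):

```python
def samples_per_second(batch_size, step_time_ms):
    """Training throughput given the batch size and the average step time."""
    return batch_size / (step_time_ms / 1000.0)


throughput_512 = samples_per_second(batch_size=512, step_time_ms=324)
print(f"{throughput_512:.0f} samples per second")
```

With the rounded step time of 324 milliseconds this comes to roughly 1580 samples per second, in line with the reported 1584.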

Caution: Contrary to our previous optimizations, increasing the batch size may have an impact on the behavior of your training application. Different models exhibit different levels of sensitivity to a change in batch size. Some may require nothing more than some tuning to the optimizer settings. For others, adjusting to a large batch size may be more difficult or even impossible. See this previous post for some of the challenges involved in training on large batches.

Optimization #4: Reduce Host to Device Copy

You probably noticed the big red eyesore representing the host-to-device data copy in the pie chart from our previous results. The most direct way of trying to address this kind of bottleneck is to see if we can reduce the amount of data in each batch. Notice that in the case of our image input, we convert the data type from an 8-bit unsigned integer to a 32-bit float and apply normalization before performing the data copy. In the code block below, we propose a change to the input data flow in which we delay the data type conversion and normalization until the data is on the GPU:

# keep the image input as an 8-bit uint8 tensor
transform = T.Compose(
    [T.Resize(224),
     T.PILToTensor()])
train_set = FakeCIFAR(transform=transform)
# note: batch_size=1024 and torch.compile belong to optimizations #6 and #7 below
train_loader = torch.utils.data.DataLoader(train_set, batch_size=1024, shuffle=True,
                                           num_workers=8, pin_memory=True)

device = torch.device("cuda:0")
model = torch.compile(torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device), fullgraph=True)
criterion = torch.nn.CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# train step
def train(data):
    inputs, labels = data[0].to(device=device, non_blocking=True), \
                     data[1].to(device=device, non_blocking=True)
    # convert to float32 and normalize (now on the GPU)
    inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

As a result of this change the amount of data being copied from the CPU to the GPU is reduced by 4x and the red eyesore virtually disappears:
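
The 4x figure follows directly from the element sizes: a uint8 value occupies one byte while a float32 value occupies four. The short sketch below checks this on a small batch and applies the same normalization arithmetic as the train step. It runs on the CPU, and the helper name `nbytes` is ours.

```python
import torch


def nbytes(tensor):
    """Size of the tensor's data in bytes."""
    return tensor.element_size() * tensor.nelement()


# A small batch of 224x224 RGB images, as produced by PILToTensor (uint8).
batch_uint8 = torch.randint(0, 256, (8, 3, 224, 224), dtype=torch.uint8)
batch_float = batch_uint8.to(torch.float32)

ratio = nbytes(batch_float) / nbytes(batch_uint8)
print(f"float32 batch is {ratio:.0f}x larger than the uint8 batch")

# Same arithmetic as ToTensor followed by Normalize(0.5, 0.5), up to floating-point rounding.
normalized = (batch_float / 255. - 0.5) / 0.5
```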

Results of Reducing CPU to GPU Copy in the TensorBoard Profiler Overview Tab (Captured by Author)

We now stand at a new high of 97.51%(!!) GPU utilization and a training speed of 1670 samples per second! Let’s see what else we can do.

Optimization #5: Set Gradients to None

At this stage we appear to be fully utilizing the GPU, but that does not mean that we can’t utilize it more effectively. One popular optimization that is said to reduce memory operations in the GPU is to set the model parameters’ gradients to None rather than zero in each training step. Please see the PyTorch documentation for more details on this optimization. All that is required to implement this optimization is to set the set_to_none argument of the optimizer.zero_grad call to True:

optimizer.zero_grad(set_to_none=True)

In our case this optimization did not boost our performance in any meaningful way. One likely reason is that, as of PyTorch 2.0, set_to_none already defaults to True.
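
The difference between the two settings is easy to observe on a toy model. In the sketch below (our own illustration, run on the CPU), zeroing keeps the gradient tensors and fills them with zeros, while setting to None releases them so that the next backward pass writes fresh gradients instead of accumulating into existing ones.

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def run_backward():
    loss = model(torch.randn(8, 4)).sum()
    loss.backward()


run_backward()
optimizer.zero_grad(set_to_none=False)  # keep the gradient tensors, fill them with zeros
all_zero = all(bool((p.grad == 0).all()) for p in model.parameters())

run_backward()
optimizer.zero_grad(set_to_none=True)   # drop the gradient tensors entirely
all_none = all(p.grad is None for p in model.parameters())

print("zero-filled:", all_zero, "| set to None:", all_none)
```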

Optimization #6: Automatic Mixed Precision

The GPU Kernel View displays the amount of time that the GPU kernels were active and can be a helpful resource for improving GPU utilization:

Kernel View in TensorBoard Profiler (Captured by Author)

One of the most glaring details in this report is the lack of use of the GPU Tensor Cores. Available on relatively newer GPU architectures, Tensor Cores are dedicated processing units for matrix multiplication that can boost AI application performance significantly. Their lack of use may represent a major opportunity for optimization.

Since Tensor Cores are specifically designed for mixed-precision computing, one straightforward way to increase their utilization is to modify our model to use Automatic Mixed Precision (AMP). In AMP mode, portions of the model are automatically cast to lower-precision 16-bit floats and run on the GPU Tensor Cores.

Importantly, note that a full implementation of AMP may require gradient scaling, which we do not include in our demonstration. Be sure to see the documentation on mixed precision training before adopting it.

The modification to the training step required to enable AMP is demonstrated in the code block below.

def train(data):
    inputs, labels = data[0].to(device=device, non_blocking=True), \
                     data[1].to(device=device, non_blocking=True)
    inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
    # run the forward pass and the loss computation in mixed precision
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    # Note - torch.cuda.amp.GradScaler() may be required
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
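
For completeness, here is a minimal sketch of how gradient scaling is typically wired into such a train step. It is not part of the experiments in this post: the toy linear model, tensor shapes, and learning rate are placeholders, and the scaler and autocast are disabled automatically when no GPU is available so that the same code also runs on the CPU. Recent PyTorch releases also expose the scaler as torch.amp.GradScaler.

```python
import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

model = torch.nn.Linear(16, 4).to(device)  # placeholder model
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# When disabled, the scaler passes values through unchanged.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)


def train(inputs, labels):
    with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_cuda):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()  # scale the loss to reduce float16 gradient underflow
    scaler.step(optimizer)         # unscale gradients; skip the step if they contain inf/NaN
    scaler.update()                # adjust the scale factor for the next iteration
    return loss.item()


loss_value = train(torch.randn(8, 16, device=device),
                   torch.randint(0, 4, (8,), device=device))
print(f"loss: {loss_value:.4f}")
```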

The impact on the Tensor Core utilization is displayed in the image below. Although it continues to indicate opportunity for further improvement, with just one line of code the utilization jumped from 0% to 26.3%.

Tensor Core Utilization with AMP optimization from Kernel View in TensorBoard Profiler (Captured by Author)

In addition to increasing Tensor Core utilization, using AMP lowers the GPU memory utilization, freeing up more space to increase the batch size. The image below captures the training performance results following the AMP optimization and with the batch size set to 1024:

Results of AMP Optimization in the TensorBoard Profiler Overview Tab (Captured by Author)

Although the GPU utilization has slightly decreased, our primary throughput metric has further increased by nearly 50%, from 1670 samples per second to 2477. We are on a roll!

Caution: Lowering the precision of portions of your model may have a meaningful effect on its convergence. As in the case of increasing the batch size (see above), the impact of using mixed precision will vary per model. In some cases, AMP will work with little to no effort. Other times you might need to work a bit harder to tune the autoscaler. Still other times you might need to set the precision types of different portions of the model explicitly (i.e., manual mixed precision).

For more details on using mixed precision as a method for memory optimization please see our previous blog post on the topic.

Optimization #7: Train in Graph Mode

The final optimization we will apply is model compilation. Contrary to the default PyTorch eager-execution mode in which each PyTorch operation is run “eagerly”, the compile API converts your model into an intermediate computation graph which it then compiles into low-level compute kernels in a manner that is optimal for the underlying training accelerator. For more on model compilation in PyTorch 2, check out our previous post on the topic.

The following code block demonstrates the change required to apply model compilation:

model = torchvision.models.resnet18(weights='IMAGENET1K_V1').cuda(device)
model = torch.compile(model)

The results of the model compilation optimization are displayed below:

Results of Graph Compilation in the TensorBoard Profiler Overview Tab (Captured by Author)

Model compilation further increases our throughput to 3268 samples per second compared to 2477 in the previous experiment, an additional 32% (!!) boost in performance.

The manner in which graph compilation changes the training step is very evident in the different views of the TensorBoard plugin. The Kernel View, for example, indicates the use of new (fused) GPU kernels, and the Trace View (shown below) displays a wholly different pattern than what we saw previously.

Results of Graph Compilation in the TensorBoard Profiler Trace View Tab (Captured by Author)

In the table below we summarize the results of the successive optimizations we have applied.

Performance Results Summary (By Author)

By applying our iterative approach of analysis and optimization using PyTorch Profiler and the TensorBoard plugin, we were able to increase throughput to 817% of the baseline, a speedup of more than 8x!!
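
The throughput figures quoted in the sections above can be put side by side to see how much each step contributed. The snippet below is a small sketch that uses only the numbers reported in the text (the dictionary labels are ours) and prints the gain of each step over the one before it.

```python
# Throughput in samples per second, as quoted in the sections above.
throughput = {
    "memory pinning (batch size 32)": 1200,
    "batch size 512": 1584,
    "uint8 host-to-device copy": 1670,
    "AMP (batch size 1024)": 2477,
    "torch.compile": 3268,
}

names = list(throughput)
gains = {}
for prev, curr in zip(names, names[1:]):
    # Relative gain of each step over the step before it.
    gains[curr] = throughput[curr] / throughput[prev] - 1.0
    print(f"{curr}: +{gains[curr]:.0%} over {prev}")
```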

Is our work complete? Absolutely not! Every optimization that we implement uncovers new potential opportunities for performance improvement. These opportunities are presented in the form of resources being freed up (e.g., the way in which moving to mixed precision enabled our increasing the batch size) or in the form of newly uncovered performance bottlenecks (e.g., the way in which our final optimization uncovered a bottleneck in host-to-device data transfer). Furthermore, there are many other well-known forms of optimization that we did not attempt in this post (e.g., see here and here). And lastly, new library optimizations (e.g., the model compilation feature that we demonstrated in step 7) are released all the time, further enabling our performance improvement goals. As we emphasized in the introduction, to fully leverage such opportunities, performance optimization must be an iterative and consistent part of your development workflow.

In this post we have demonstrated the significant potential of performance optimization on a toy classification model. Although there are other performance analyzers that you can use, each with their pros and cons, we chose PyTorch Profiler and the TensorBoard plugin due to their ease of integration.

We should emphasize that the path to successful optimization will vary greatly based on the details of the training project, including the model architecture and training environment. In practice, reaching your goals may be more difficult than in the example we presented here. Some of the techniques we described may have little impact on your performance or might even make it worse. We also note that the precise optimizations that we chose, and the order in which we chose to apply them, were somewhat arbitrary. You are highly encouraged to develop your own tools and techniques for reaching your optimization goals based on the specific details of your project.

Performance optimization of machine learning workloads is sometimes viewed as secondary, non-critical, and odious. I hope that we have succeeded in convincing you that the potential for savings in development time and cost warrants a meaningful investment in performance analysis and optimization. And, hey, you might even find it to be fun :).

What Next?

This was just the tip of the iceberg. There is much more to performance optimization than we have covered here. In a sequel to this post, we will dive into a performance issue that is quite common in PyTorch models in which an excessive amount of computation is run on the CPU rather than the GPU, often in a manner that is unbeknownst to the developer. We also encourage you to check out our other posts on Medium, many of which cover different elements of performance optimization of machine learning workloads.
