
PyTorch Model Performance Analysis and Optimization — Part 2 | by Chaim Rand | Jun, 2023


Identify and Reduce CPU Computation in Your Training Step with PyTorch Profiler and TensorBoard

Photo by Denise Chan on Unsplash

This is the second part of a series of posts on the topic of analyzing and optimizing a PyTorch model running on a GPU. In our first post we demonstrated the process — and the significant potential — of iteratively analyzing and optimizing a PyTorch model using PyTorch Profiler and TensorBoard. In this post we will focus on a specific type of performance issue that is particularly prevalent in PyTorch due to its use of eager execution: the dependency on the CPU for portions of the model execution. Identifying the presence and source of these kinds of issues can be quite difficult and often requires the use of a dedicated performance analyzer. In this post we will share some tips for identifying such performance issues when using PyTorch Profiler and the PyTorch Profiler TensorBoard plugin.

The Pros and Cons of Eager Execution

One of the main appeals of PyTorch is its eager execution mode. In eager mode, each PyTorch operation that forms the model is executed independently as soon as it is reached. This is in contrast to graph mode, in which the entire model is pre-compiled into a single graph in a manner that is optimal for running on the GPU and executed as a whole. Usually, this pre-compilation results in better performance (e.g., see here). In eager mode, the programming context returns to the application following each operation, thus allowing us to access and evaluate arbitrary tensors. This makes it easier to build, analyze, and debug ML models. On the other hand, it also makes our model more susceptible to the (sometimes accidental) insertion of suboptimal blocks of code. As we will demonstrate, knowing how to identify and fix such blocks of code can have a significant impact on the speed of your model.
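To make this concrete, here is a minimal sketch (not taken from the post's example code) of how eager execution lets us inspect intermediate results between operations:

import torch

# In eager mode each line executes immediately, so any intermediate tensor
# can be examined (or printed) before the next operation runs.
x = torch.randn(4, 3)
y = torch.relu(x)        # executed right away
print(y.min(), y.max())  # inspect an intermediate result on the spot
z = y.sum()              # and continue building on it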

In the following blocks we introduce the toy example we will use for our demonstration. The code is very loosely based on the example from our previous post and the loss function defined in this PyTorch tutorial.

We start by defining a simple classification model. Its architecture is not important for this post.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.models
import torchvision.transforms as T
from torchvision.datasets.vision import VisionDataset
import numpy as np
from PIL import Image

# sample model
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.conv2 = nn.Conv2d(8, 12, 3, padding=1)
        self.conv3 = nn.Conv2d(12, 16, 3, padding=1)
        self.conv4 = nn.Conv2d(16, 20, 3, padding=1)
        self.conv5 = nn.Conv2d(20, 24, 3, padding=1)
        self.conv6 = nn.Conv2d(24, 28, 3, padding=1)
        self.conv7 = nn.Conv2d(28, 32, 3, padding=1)
        self.conv8 = nn.Conv2d(32, 10, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = self.pool(F.relu(self.conv4(x)))
        x = self.pool(F.relu(self.conv5(x)))
        x = self.pool(F.relu(self.conv6(x)))
        x = self.pool(F.relu(self.conv7(x)))
        x = self.pool(F.relu(self.conv8(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        return x

Next, we define a fairly standard cross-entropy loss function. This loss function will be the main focus of our discussion.

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

def weighted_nll(pred, target, weight):
    assert target.max() < 10
    nll = -pred[range(target.shape[0]), target]
    nll = nll * weight[target]
    nll = nll / weight[target].sum()
    sum_nll = nll.sum()
    return sum_nll

# custom loss definition
class CrossEntropyLoss(nn.Module):
    def forward(self, input, target):
        pred = log_softmax(input)
        loss = weighted_nll(pred, target, torch.Tensor([0.1]*10).cuda())
        return loss

Lastly, we define the dataset and the training loop:

# dataset with random images that mimics the properties of CIFAR10
class FakeCIFAR(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.data = np.random.randint(low=0, high=256, size=(10000, 32, 32, 3), dtype=np.uint8)
        self.targets = np.random.randint(low=0, high=10, size=(10000), dtype=np.uint8).tolist()

    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self) -> int:
        return len(self.data)

transform = T.Compose(
    [T.Resize(256),
     T.PILToTensor()])

train_set = FakeCIFAR(transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=1024,
                                           shuffle=True, num_workers=8, pin_memory=True)

device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = CrossEntropyLoss().cuda(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# training loop wrapped with profiler object
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/example'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
) as prof:
    for step, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True)
        inputs = (inputs.to(torch.float32) / 255. - 0.5) / 0.5
        if step >= (1 + 4 + 3) * 1:
            break
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()

An experienced PyTorch developer may have already noticed that our example contains a number of inefficient lines of code in the loss function. At the same time, there is nothing obviously wrong with it, and these types of inefficiencies are not uncommon. If you would like to test your PyTorch proficiency, see if you can find three issues with our implementation of the cross-entropy loss before reading on. In the next sections we will assume that we were not able to find these issues on our own and show how we can use PyTorch Profiler and its associated TensorBoard plugin to identify them.

As in our previous post, we will iteratively run an experiment, identify performance issues, and attempt to fix them. We will run our experiments on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) and using the official AWS PyTorch 2.0 Docker image. Our choice of training environment was somewhat arbitrary and should not be viewed as an endorsement of any of its components.

Initial Performance Results

In the image below we show the Overview tab of the performance report of the script above.

Performance Overview of Baseline Model (Captured by Author)

As we can see, our GPU utilization is at a relatively high 92.04% and our step time is 216 milliseconds. (As in our previous post, the Overview in torch-tb-profiler version 0.4.1 sums the step time of all three training steps.) From this report alone you might not think that anything was wrong with our model. However, the Trace View of the performance report tells a completely different story:

Trace View of Baseline Model (Captured by Author)

As highlighted above, the forward pass of our cross-entropy loss alone takes up 211 of the 216 milliseconds of the training step! This is a clear indication that something is wrong. Our loss function contains a small number of calculations compared to the model and should certainly not account for 98% of the step time. Taking a closer look at the call stack, we can see a few function calls that strengthen our suspicions, including “to”, “copy_”, and “cudaStreamSynchronize”. This combination usually indicates that data is being copied from the CPU to the GPU — not something we want to be happening in the middle of our loss calculation. In this case, our performance issue also aligns with a brief dip in GPU utilization, as highlighted in the image. However, this is not always the case. Often, dips in GPU utilization will not be aligned with the performance issue, or they may not be visible at all.

We now know that we have a performance issue in our loss function and that it is likely related to copying tensors from the host to the GPU. However, this may not be enough to identify the precise line of code that is causing the issue. To facilitate our search we will wrap each line of code with a labeled torch.profiler.record_function context manager and rerun the profiling analysis.

# custom loss definition
class CrossEntropyLoss(nn.Module):
    def forward(self, input, target):
        with torch.profiler.record_function('log_softmax'):
            pred = log_softmax(input)
        with torch.profiler.record_function('define_weights'):
            weights = torch.Tensor([0.1]*10).cuda()
        with torch.profiler.record_function('weighted_nll'):
            loss = weighted_nll(pred, target, weights)
        return loss

The addition of the labels helps us identify the weight definition, or more precisely, the copying of the weights onto the GPU, as the problematic line of code.

Performance Issue of Weights Definition as Seen in Trace View (Captured by Author)

Optimization #1: Remove redundant host-to-GPU copies from the training step

Once we have identified our first issue, fixing it is rather trivial. In the code block below, we copy our weight vector to the GPU a single time, in the loss init function:

class CrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.Tensor([0.1]*10).cuda()

    def forward(self, input, target):
        with torch.profiler.record_function('log_softmax'):
            pred = log_softmax(input)
        with torch.profiler.record_function('weighted_nll'):
            loss = weighted_nll(pred, target, self.weight)
        return loss

The image below shows the results of the performance analysis following this fix:

Performance Overview Following Optimization #1 (Captured by Author)

Disappointingly, our first optimization had a very marginal impact on the step time. If we look at the Trace View report, we can see that we have a new severe performance issue that we need to address.

Trace View Following Optimization #1 (Captured by Author)

Our new report indicates an issue coming from our weighted_nll function. As before, we used torch.profiler.record_function to identify the problematic line of code. In this case it is the assert call.

def weighted_nll(pred, target, weight):
    with torch.profiler.record_function('assert'):
        assert target.max() < 10
    with torch.profiler.record_function('range'):
        r = range(target.shape[0])
    with torch.profiler.record_function('index'):
        nll = -pred[r, target]
    with torch.profiler.record_function('nll_calc'):
        nll = nll * weight[target]
        nll = nll / weight[target].sum()
        sum_nll = nll.sum()
    return sum_nll

Note that this issue existed in the base experiment as well, but was hidden by our previous performance issue. It is not uncommon in the course of performance optimization for severe issues, previously hidden by other issues, to suddenly appear in this manner.

A closer analysis of the call stack reveals calls to “item”, “_local_scalar_dense”, and “cudaMemcpyAsync”. This is usually an indication that data is being copied from the GPU to the host. Indeed, our assert call, which is performed on the CPU, requires access to the target tensor residing on the GPU, thus invoking the highly inefficient data copy.
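To see why, consider the following minimal sketch (not part of the post's code): the comparison produces a CUDA tensor, and evaluating it as a Python boolean inside the assert forces a device-to-host copy and a synchronization.

import torch

target = torch.randint(0, 10, (1024,), device="cuda")

# target.max() < 10 yields a CUDA boolean tensor; the assert statement calls its
# __bool__ method, which copies the scalar back to the host (the "item" /
# "_local_scalar_dense" / "cudaMemcpyAsync" calls seen in the trace) and blocks
# until the GPU has finished computing it.
assert target.max() < 10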

Optimization #2: Remove redundant GPU-to-host copies from the training step

While verifying the legality of the input labels may be warranted, it should be done in a way that does not impact our training performance so negatively. In our case, fixing the issue is a simple matter of moving the assert to the data input pipeline, before the labels are copied to the GPU.
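One possible way to do this (a sketch based on the FakeCIFAR dataset defined above, not code taken from the original post) is to perform the check in the dataset's __getitem__ method, where it runs on the CPU in the DataLoader workers:

class FakeCIFAR(VisionDataset):
    def __init__(self, transform):
        super().__init__(root=None, transform=transform)
        self.data = np.random.randint(low=0, high=256, size=(10000, 32, 32, 3), dtype=np.uint8)
        self.targets = np.random.randint(low=0, high=10, size=(10000), dtype=np.uint8).tolist()

    def __getitem__(self, index):
        img, target = self.data[index], self.targets[index]
        assert target < 10  # label check now runs on the CPU, before the copy to the GPU
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img, target

    def __len__(self) -> int:
        return len(self.data)

Following the removal of the assert from the loss function, our performance still remains largely unchanged: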

Performance Overview Following Optimization #2 (Captured by Author)

Important Note: Although our goal is usually to try to reduce copies between the host and the GPU in the forward pass, there are times when this is either not possible (e.g., if we require a kernel that is not supported by the GPU) or undesirable (e.g., if running a particular kernel on the CPU will increase performance).

Analyzing the Trace View introduces us to our next performance issue:

Trace View Following Optimization #2 (Captured by Author)

Once again, we see that our previous optimization has uncovered a new severe performance issue, this time when indexing our pred tensor. The indexes are defined by the r and target tensors. While the target tensor already resides on the GPU, the r tensor, which was defined on the previous line, does not. This, once again, triggers an inefficient host-to-GPU data copy.

Optimization #3: Replace range with torch.arange

Python’s range function outputs a list on the CPU. The presence of any list in your training step should be a red flag. In the code block below, we replace the use of range with torch.arange and configure it to create the output tensor directly on the GPU:

def weighted_nll(pred, target, weight):
    with torch.profiler.record_function('range'):
        r = torch.arange(target.shape[0], device="cuda:0")
    with torch.profiler.record_function('index'):
        nll = -pred[r, target]
    with torch.profiler.record_function('nll_calc'):
        nll = nll * weight[target]
        nll = nll / weight[target].sum()
        sum_nll = nll.sum()
    return sum_nll

The results of this optimization are shown below:

Performance Overview Following Optimization #3 (Captured by Author)

Now we’re talking!! Our step time has dropped to 5.8 milliseconds, a performance increase of a whopping 3700%.

The updated Trace View shows that the loss function has dropped to a very reasonable 0.5 milliseconds.

Trace View Following Optimization #3 (Captured by Author)

But there is still room for improvement. Let’s take a closer look at the Trace View of the weighted_nll function, which takes up the majority of the loss calculation.

Trace View of weighted_nll Function (Captured by Author)

We can see from the trace that the function is formed from multiple small blocks, each of which is ultimately mapped to an individual CUDA kernel that is loaded onto the GPU via the CudaLaunchKernel call. Ideally, we would like to reduce the total number of GPU kernels so as to reduce the amount of interaction between the CPU and GPU. One way to do this is to prefer, whenever possible, higher-level PyTorch operators, such as torch.nn.NLLLoss. Such functions are presumed to “fuse” together underlying operations, thus requiring a lower number of overall kernels.

Optimization #4: Replace custom NLL with torch.nn.NLLLoss

The code block below contains our updated loss definition, which now uses torch.nn.NLLLoss.

class CrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.Tensor([0.1]*10).cuda()

    def forward(self, input, target):
        pred = log_softmax(input)
        nll = torch.nn.NLLLoss(self.weight)
        loss = nll(pred, target)
        return loss

Here we have taken the liberty of introducing another common error, which we will proceed to demonstrate.

Using the higher-level function further reduces our step time to 5.3 milliseconds (down from 5.8).

Performance Overview Following Optimization #4 (Captured by Author)

However, if we take a closer look at the Trace View, we can see that a significant portion of the loss function is now spent on initializing the torch.nn.NLLLoss object!

Trace View Following Optimization #4 (Captured by Author)

Looking back at our loss function, we can see that we are initializing a new NLLLoss object in every iteration of the training step. Naturally, object initialization occurs on the CPU, and although (in our case) it is relatively fast, it is something we would like to avoid doing during our training step.

Optimization #5: Refrain from initializing objects in the training step

In the code block below we have modified our loss implementation so that a single instance of torch.nn.NLLLoss is created in the init function.

class CrossEntropyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.Tensor([0.1]*10).cuda()
        self.nll = torch.nn.NLLLoss(self.weight)

    def forward(self, input, target):
        pred = log_softmax(input)
        loss = self.nll(pred, target)
        return loss

The results show a further improvement in the step time, which now stands at 5.2 milliseconds.

Optimization #6: Replace the custom loss with torch.nn.CrossEntropyLoss

PyTorch includes a built-in torch.nn.CrossEntropyLoss, which we now evaluate and compare against our custom loss implementation.

criterion = torch.nn.CrossEntropyLoss().cuda(device)
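Note, in passing, that if we had wanted to preserve the class weighting from our custom implementation, the built-in loss accepts a weight tensor as well. A minimal sketch (the experiment above uses the default, unweighted configuration):

# optional: pass the same per-class weights used by the custom loss
weight = torch.tensor([0.1] * 10, device=device)
criterion = torch.nn.CrossEntropyLoss(weight=weight).cuda(device)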

The resultant step time is a new low of 5 milliseconds, for an overall performance boost of 4200% (compared to the 216 milliseconds we started with).

The performance improvement of the forward pass of the loss calculation is even more dramatic: from a starting point of 211 milliseconds, we have dropped all the way down to 79 microseconds(!!), as seen below:

Optimization #7: Compile the loss function

For our final optimization attempt, we will configure the loss function to run in graph mode using the torch.compile API. As we discussed at length in this post and demonstrated in the prequel to this post, torch.compile will use techniques such as kernel fusion and out-of-order execution to map the loss function into low-level compute kernels in a manner that is optimal for the underlying training accelerator.

criterion = torch.compile(torch.nn.CrossEntropyLoss().cuda(device))

The image below shows the Trace View result of this experiment.

The first thing we can see is the appearance of terms containing “OptimizedModule” and “dynamo”, which are indicative of the use of torch.compile. We can also see that, in practice, model compilation did not reduce the number of kernels loaded by the loss function, which means that it did not identify any opportunities for additional kernel fusion. In fact, in our case, the loss compilation actually caused the time of the forward pass of the loss function to increase from 79 to 154 microseconds. It appears that the CrossEntropyLoss is not meaty enough to benefit from this optimization.

You may be wondering why we can’t simply apply torch compilation to our initial loss function and rely on it to compile our code in an optimal manner. This could save us all the trouble of the step-by-step optimization described above. The problem with this approach is that although PyTorch 2.0 compilation (as of the time of this writing) does indeed optimize certain types of GPU-to-CPU crossovers, some types will crash the graph compilation, and others will result in the creation of multiple small graphs rather than a single large one. The latter category causes graph breaks, which essentially limits the torch.compile feature’s ability to boost performance. (One way to address this is to call torch.compile with the fullgraph flag set to True.) See our previous post for more details on using this option.
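For example, here is a minimal sketch of setting the flag on the compiled loss used above:

# Request a single graph; if the compiler encounters a graph break it will raise
# an error rather than silently splitting the computation into multiple graphs.
criterion = torch.compile(torch.nn.CrossEntropyLoss().cuda(device), fullgraph=True)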

In the table below we summarize the results of the experiments we have run:

Optimization Experiments Results (By Author)

Our successive optimizations have led to a mind-blowing 4143% performance boost!! Recall that we started with a fairly innocent-looking loss function. Without an in-depth analysis of our application’s behavior, we might never have known that anything was wrong and would have continued on with our lives while paying 41 times(!!) more than we needed to.

You may have noticed that the GPU utilization dropped significantly in our final trials. This indicates major potential for further performance optimization. Although our demonstration has neared its end, our work is not done. See our previous post for some ideas on how to proceed from here.

Let’s summarize some of the things we have learned. We divide the summary into two parts. In the first, we describe some coding habits that may impact training performance. In the second, we recommend some tips for performance profiling. Note that these conclusions are based on the example that we have shared in this post and may not apply to your own use case. Machine learning models vary greatly in their properties and behavior. Therefore, you are strongly advised to assess these conclusions based on the details of your own project.

Coding Tips

The way in which you implement the forward pass of your model can have a significant impact on its performance. Here we list just a few tips based on the example that we covered in this post.

  1. Avoid initializing constant tensors in the forward pass. Do it in the constructor instead.
  2. Avoid using asserts on tensors residing on the GPU in the forward pass. Either move them to the data input pipeline and/or check whether PyTorch has any built-in methods for performing the data verification that you need.
  3. Avoid the use of lists. Check whether using torch.arange to create a tensor directly on the device could be a better alternative.
  4. Use PyTorch operators such as torch.nn.NLLLoss and torch.nn.CrossEntropyLoss rather than creating your own loss implementations.
  5. Avoid initializing objects in the forward pass. Do it in the constructor instead.
  6. Consider using torch.compile when relevant.

Performance Analysis Tips

As we demonstrated, the Trace View of the TensorBoard PyTorch Profiler plugin was critical in identifying the performance issues in our model. Below we summarize some of the main takeaways from our example:

  1. High GPU utilization is NOT necessarily an indication that your code is running optimally.
  2. Look out for portions of the code that take longer than expected.
  3. Use torch.profiler.record_function to pinpoint performance issues.
  4. Dips in GPU utilization are not necessarily aligned with the source of the performance issue.
  5. Look out for unintended data copies from the host to the GPU. These are typically identified by calls to “to”, “copy_”, and “cudaStreamSynchronize”, which you can search for in the Trace View.
  6. Look out for unintended data copies from the GPU to the host. These are typically identified by calls to “item” and “cudaStreamSynchronize”, which you can search for in the Trace View.

In this post we have focused on performance issues in training applications resulting from redundant interaction between the CPU and GPU during the forward pass of the training step. We demonstrated how performance analyzers such as PyTorch Profiler and its associated TensorBoard plugin can be used to identify such issues and facilitate significant performance improvement.

As in our previous post, we emphasize that the path to successful optimization will vary greatly based on the details of the training project, including the model architecture and training environment. In practice, achieving your goals may be more difficult than in the example we presented here. Some of the techniques we described may have little impact on your performance or might even make it worse. We also note that the precise optimizations that we chose, and the order in which we chose to apply them, were somewhat arbitrary. You are highly encouraged to develop your own tools and techniques for reaching your optimization goals based on the specific details of your project.


Predicting Success of a Reward Program at Starbucks | by Erdem Isbilen | Jun, 2023