
Accelerate PyTorch with DeepSpeed to train large language models with Intel Habana Gaudi-based DL1 EC2 instances


Training large language models (LLMs) with billions of parameters can be challenging. In addition to designing the model architecture, researchers need to set up state-of-the-art training techniques for distributed training such as mixed precision support, gradient accumulation, and checkpointing. With large models, the training setup is even more challenging because the available memory in a single accelerator device bounds the size of models that can be trained using only data parallelism, and using model parallel training requires an additional level of modifications to the training code. Libraries such as DeepSpeed (an open-source deep learning optimization library for PyTorch) address some of these challenges and can help accelerate model development and training.
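To make the rest of the post concrete, the following is a minimal sketch of how a PyTorch training loop is typically handed over to DeepSpeed. The toy model, the dummy loss, and the ds_config.json file name are placeholders for illustration only and aren't taken from the reference repository discussed later.

```python
import torch
import torch.nn as nn
import deepspeed

# Toy model standing in for an LLM; the real runs use the BERT models from the reference repository.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# ds_config.json holds the DeepSpeed settings (batch sizes, ZeRO stage, BF16, and so on).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)

for step in range(10):
    x = torch.randn(16, 1024, device=model_engine.device)
    loss = model_engine(x).pow(2).mean()   # dummy loss for illustration
    model_engine.backward(loss)            # DeepSpeed handles loss scaling and gradient accumulation
    model_engine.step()                    # optimizer step, with ZeRO-partitioned optimizer states
```

The point of the wrapper is that the training loop stays close to plain PyTorch while DeepSpeed takes over the distributed and memory-optimization details behind `backward` and `step`.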

In this post, we set up training on the Intel Habana Gaudi-based Amazon Elastic Compute Cloud (Amazon EC2) DL1 instances and quantify the benefits of using a scaling framework such as DeepSpeed. We present scaling results for an encoder-type transformer model (BERT with 340 million to 1.5 billion parameters). For the 1.5-billion-parameter model, we achieved a scaling efficiency of 82.7% across 128 accelerators (16 dl1.24xlarge instances) using DeepSpeed ZeRO stage 1 optimizations. The optimizer states were partitioned by DeepSpeed to train large models using the data parallel paradigm. This approach has been extended to train a 5-billion-parameter model using data parallelism. We also used Gaudi's native support of the BF16 data type for reduced memory footprint and increased training performance compared to using the FP32 data type. As a result, we achieved pre-training (phase 1) model convergence within 16 hours (our target was to train a large model within a day) for the BERT 1.5-billion-parameter model using the wikicorpus-en dataset.

Training setup

We provisioned a managed compute cluster comprised of 16 dl1.24xlarge instances using AWS Batch. We developed an AWS Batch workshop that illustrates the steps to set up the distributed training cluster with AWS Batch. Each dl1.24xlarge instance has eight Habana Gaudi accelerators, each with 32 GB of memory and a full mesh RoCE network between cards with a total bi-directional interconnect bandwidth of 700 Gbps each (see Amazon EC2 DL1 instances Deep Dive for more information). The dl1.24xlarge cluster also used four AWS Elastic Fabric Adapters (EFA), with a total of 400 Gbps interconnect between nodes.

The distributed training workshop illustrates the steps to set up the distributed training cluster. The workshop shows the distributed training setup using AWS Batch, and in particular, the multi-node parallel jobs feature to launch large-scale containerized training jobs on fully managed clusters. More specifically, a fully managed AWS Batch compute environment is created with DL1 instances. The containers are pulled from Amazon Elastic Container Registry (Amazon ECR) and launched automatically into the instances in the cluster based on the multi-node parallel job definition. The workshop concludes by running a multi-node, multi-HPU data parallel training of a BERT (340 million to 1.5 billion parameters) model using PyTorch and DeepSpeed.
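The workshop contains the actual job definitions; the following boto3 sketch only illustrates the general shape of a multi-node parallel job definition for AWS Batch. The job definition name, image URI, resource sizes, and launch command are assumptions for illustration, not the values used in the workshop.

```python
import boto3

batch = boto3.client("batch")

# Illustrative values only; the workshop defines the real image, resources, and command.
NUM_NODES = 16
IMAGE = "<account-id>.dkr.ecr.<region>.amazonaws.com/bert-deepspeed:latest"

response = batch.register_job_definition(
    jobDefinitionName="bert-pretraining-mnp",   # hypothetical name
    type="multinode",
    nodeProperties={
        "numNodes": NUM_NODES,
        "mainNode": 0,                           # node 0 coordinates the multi-node job
        "nodeRangeProperties": [
            {
                "targetNodes": f"0:{NUM_NODES - 1}",
                "container": {
                    "image": IMAGE,              # pulled from Amazon ECR at launch
                    "resourceRequirements": [
                        {"type": "VCPU", "value": "96"},       # dl1.24xlarge has 96 vCPUs
                        {"type": "MEMORY", "value": "700000"}, # MiB, illustrative
                    ],
                    "command": ["/workspace/run_pretraining.sh"],  # hypothetical entry point
                },
            }
        ],
    },
)
print(response["jobDefinitionArn"])
```

When a job is submitted against a definition like this, AWS Batch launches one container per node on the DL1 compute environment and exposes the node index and main-node address to the containers so the training launcher can set up the process group.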

BERT 1.5B pre-training with DeepSpeed

Habana SynapseAI v1.5 and v1.6 support DeepSpeed ZeRO1 optimizations. The Habana fork of the DeepSpeed GitHub repository includes the modifications necessary to support the Gaudi accelerators. There is full support of distributed data parallel (multi-card, multi-instance), ZeRO1 optimizations, and BF16 data types.
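As a rough illustration of how these features are switched on, a DeepSpeed configuration along the following lines enables ZeRO stage 1 and BF16. The specific values are hypothetical and the accepted keys may differ slightly in the Habana fork; the reference repository ships its own configuration files.

```python
import json

# Hypothetical configuration for illustration; see the reference repository for the real settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 24,
    "zero_optimization": {
        "stage": 1               # ZeRO stage 1: partition optimizer states across data-parallel ranks
    },
    "bf16": {
        "enabled": True          # use Gaudi's native BF16 support instead of FP32
    },
    "gradient_clipping": 1.0,
    "steps_per_print": 100,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Because ZeRO stage 1 only partitions the optimizer states, the training script stays a plain data parallel script; the memory savings come entirely from the configuration.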

All these features are enabled in the BERT 1.5B model reference repository, which introduces a 48-layer, 1,600-hidden-dimension, 25-head bi-directional encoder model, derived from a BERT implementation. The repository also contains the baseline BERT Large model implementation: a 24-layer, 1,024-hidden, 16-head, 340-million-parameter neural network architecture. The pre-training modeling scripts are derived from the NVIDIA Deep Learning Examples repository to download the wikicorpus_en data, preprocess the raw data into tokens, and shard the data into smaller h5 datasets for distributed data parallel training. You can adopt this generic approach to train your custom PyTorch model architectures using your own datasets on DL1 instances.
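For a sense of scale, the following sketch uses the Hugging Face transformers BertConfig purely as a stand-in to show that the quoted dimensions (48 layers, 1,600 hidden size, 25 heads) land at roughly 1.5 billion parameters. The reference repository defines its model independently of this snippet, and the intermediate size is an assumption here.

```python
from transformers import BertConfig, BertForPreTraining

# Dimensions quoted in the post: 48 layers, hidden size 1,600, 25 attention heads.
# intermediate_size = 4 * hidden_size is an assumption, following the usual BERT convention.
config_1p5b = BertConfig(
    hidden_size=1600,
    num_hidden_layers=48,
    num_attention_heads=25,
    intermediate_size=4 * 1600,
)

# Instantiating the full model just to count parameters needs several GB of host memory.
model = BertForPreTraining(config_1p5b)
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f} B parameters")   # roughly 1.5B with these dimensions
```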

Pre-training (phase 1) scaling results

For pre-training large models at scale, we primarily focused on two aspects of the solution: training performance, as measured by the time to train, and the cost-effectiveness of arriving at a fully converged solution. Next, we dive deeper into these two metrics with BERT 1.5B pre-training as an example.

Scaling performance and time to train

We start by measuring the performance of the BERT Large implementation as a baseline for scalability. The following table lists the measured throughput in sequences per second from 1-8 dl1.24xlarge instances (with eight accelerator devices per instance). Using the single-instance throughput as the baseline, we measured the efficiency of scaling across multiple instances, which is an important lever to understand the price-performance training metric.

| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
| --- | --- | --- | --- | --- |
| 1 | 8 | 1,379.76 | 172.47 | 100.0% |
| 2 | 16 | 2,705.57 | 169.10 | 98.04% |
| 4 | 32 | 5,291.58 | 165.36 | 95.88% |
| 8 | 64 | 9,977.54 | 155.90 | 90.39% |

The following figure illustrates the scaling efficiency.
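The Scaling Efficiency column in the preceding table is simply the per-accelerator throughput divided by the single-instance baseline; the following snippet reproduces the column from the table's throughput numbers.

```python
# Scaling efficiency = (seq/s per accelerator at N instances) / (seq/s per accelerator at 1 instance)
# Throughput numbers are taken from the BERT Large table above.
baseline_per_accel = 1379.76 / 8     # single-instance baseline: 172.47 seq/s per accelerator

for instances, seq_per_sec in [(1, 1379.76), (2, 2705.57), (4, 5291.58), (8, 9977.54)]:
    accelerators = instances * 8
    per_accel = seq_per_sec / accelerators
    efficiency = per_accel / baseline_per_accel
    print(f"{instances:>2} instances: {per_accel:6.2f} seq/s per accelerator, "
          f"scaling efficiency {efficiency:.1%}")
```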

For BERT 1.5B, we modified the hyperparameters for the model in the reference repository to ensure convergence. The effective batch size per accelerator was set to 384 (for maximum memory utilization), with micro-batches of 16 per step and 24 steps of gradient accumulation. Learning rates of 0.0015 and 0.003 were used for 8 and 16 nodes, respectively. With these configurations, we achieved convergence of the phase 1 pre-training of BERT 1.5B across 8 dl1.24xlarge instances (64 accelerators) in approximately 25 hours, and in 15 hours across 16 dl1.24xlarge instances (128 accelerators). The following figure shows the average loss as a function of the number of training epochs, as we scale up the number of accelerators.
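The effective batch size quoted above follows directly from the micro-batch size and the gradient accumulation steps; the snippet below also shows the resulting global batch size per optimizer step as the accelerator count grows (the variable names are ours, for illustration).

```python
# Effective batch size per accelerator = micro-batch size x gradient accumulation steps
micro_batch_size = 16
grad_accum_steps = 24
per_accel_batch = micro_batch_size * grad_accum_steps
assert per_accel_batch == 384   # matches the configuration described above

# The global batch size grows with the number of data parallel ranks (accelerators).
for accelerators in (64, 128):
    print(f"{accelerators} accelerators -> global batch of {per_accel_batch * accelerators} "
          f"sequences per optimizer step")
```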

With the configuration described earlier, we obtained 85% strong scaling efficiency with 64 accelerators and 83% with 128 accelerators, from a baseline of 8 accelerators in a single instance. The following table summarizes the parameters.

| Number of Instances | Number of Accelerators | Sequences per Second | Sequences per Second per Accelerator | Scaling Efficiency |
| --- | --- | --- | --- | --- |
| 1 | 8 | 276.66 | 34.58 | 100.0% |
| 8 | 64 | 1,883.63 | 29.43 | 85.1% |
| 16 | 128 | 3,659.15 | 28.59 | 82.7% |

The following figure illustrates the scaling efficiency.

Conclusion

In this post, we evaluated support for DeepSpeed by Habana SynapseAI v1.5/v1.6 and how it helps scale LLM training on Habana Gaudi accelerators. Pre-training of a 1.5-billion-parameter BERT model took 16 hours to converge on a cluster of 128 Gaudi accelerators, with 85% strong scaling. We encourage you to check out the architecture demonstrated in the AWS workshop and consider adopting it to train custom PyTorch model architectures using DL1 instances.


About the authors

Mahadevan Balasubramaniam is a Principal Solutions Architect for Autonomous Computing with nearly 20 years of experience in the area of physics-infused deep learning and building and deploying digital twins for industrial systems at scale. Mahadevan obtained his PhD in Mechanical Engineering from the Massachusetts Institute of Technology and has over 25 patents and publications to his credit.

RJ is an engineer in the Search M5 team leading the efforts for building large-scale deep learning systems for training and inference. Outside of work he explores different cuisines and plays racquet sports.

Sundar Ranganathan is the Head of Business Development, ML Frameworks on the Amazon EC2 team. He focuses on large-scale ML workloads across AWS services like Amazon EKS, Amazon ECS, Elastic Fabric Adapter, AWS Batch, and Amazon SageMaker. His experience includes leadership roles in product management and product development at NetApp, Micron Technology, Qualcomm, and Mentor Graphics.

Abhinandan Patni is a Senior Software Engineer at Amazon Search. He focuses on building systems and tooling for scalable distributed deep learning training and real-time inference.

Pierre-Yves Aquilanti is Head of Frameworks ML Solutions at Amazon Web Services, where he helps develop the industry's best cloud-based ML Frameworks solutions. His background is in High Performance Computing, and prior to joining AWS, Pierre-Yves worked in the Oil & Gas industry. Pierre-Yves is originally from France and holds a Ph.D. in Computer Science from the University of Lille.

