To run these comparisons, we leveraged JAX, a high-performance numerical computing library that allows models to be compiled with XLA, a compiler designed specifically for machine learning workloads. With XLA, we can build a compiled representation of Conformer-2 that is conveniently portable across hardware, making it easy to run on the various accelerated instances on Google Cloud for a straightforward comparison.
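As an illustration, here is a minimal sketch, not the actual Conformer-2 code, of how `jax.jit` hands a model function to XLA, which compiles it for whichever backend (CPU, GPU, or TPU) is available at runtime; the toy `encoder_block` and its shapes are assumptions made purely for the example.

```python
# Minimal sketch: jax.jit traces a function once and hands the traced
# program to XLA, which compiles it for the backend JAX detects at runtime.
import jax
import jax.numpy as jnp

def encoder_block(params, x):
    # Toy stand-in for an encoder layer: dense projection + nonlinearity.
    return jax.nn.relu(x @ params["w"] + params["b"])

encoder_block_compiled = jax.jit(encoder_block)

params = {
    "w": jnp.ones((512, 512), dtype=jnp.bfloat16),
    "b": jnp.zeros((512,), dtype=jnp.bfloat16),
}
x = jnp.ones((8, 512), dtype=jnp.bfloat16)  # (batch, features)
y = encoder_block_compiled(params, x)
print(y.shape, jax.devices()[0].platform)  # e.g. (8, 512) tpu / gpu / cpu
```

Because the compiled program is produced from the same traced function on every backend, the model code itself does not change between instance types.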
Experimental setup
The Conformer-2 model we used for testing has 2 billion parameters, a hidden dimension of over 1.5k, 12 attention heads, and 24 encoder layers. We tested the model on three different accelerated instances on Google Cloud: TPU v5e, G2, and A2. Given the cloud’s pay-per-chip-hour pricing model, we maximized the batch size for each accelerator within the limits of the chip’s memory. This allowed us to accurately measure the cost per hour of audio transcribed for a production system.
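One common way to find that per-chip batch-size limit is simply to grow the batch until the device runs out of memory. The sketch below illustrates the idea under stated assumptions: `run_inference` is a hypothetical stand-in for the compiled forward pass, and the sequence length and feature dimension are placeholder values, not the real model's.

```python
# Rough sketch: find the largest power-of-two batch size that fits in
# device memory. run_inference is a hypothetical stand-in for a jitted
# forward pass over a batch of audio features.
import jax
import jax.numpy as jnp

@jax.jit
def run_inference(batch):
    # Placeholder computation, not the real model.
    return jnp.tanh(batch @ jnp.ones((batch.shape[-1], 512), jnp.bfloat16))

def max_feasible_batch_size(seq_len=3000, feat_dim=80, limit=4096):
    best, batch_size = None, 1
    while batch_size <= limit:
        try:
            batch = jnp.zeros((batch_size, seq_len, feat_dim), jnp.bfloat16)
            run_inference(batch).block_until_ready()
            best = batch_size
            batch_size *= 2
        except RuntimeError:
            # Running out of device memory surfaces as a runtime error;
            # stop and keep the last batch size that fit.
            break
    return best
```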
To evaluate each chip, we passed identical audio data through the model on each type of hardware and measured its inference speed. This approach let us compare the cost per chip of running inference on 100k hours of audio data with no confounding factors.
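Concretely, that cost figure reduces to the measured throughput (hours of audio transcribed per chip-hour) and the published price per chip-hour. The sketch below shows only the arithmetic; the throughput and price values are made up for illustration and are not our measured results.

```python
# Hypothetical arithmetic only; the values below are placeholders,
# not measured throughput or actual Google Cloud pricing.
def cost_to_transcribe(audio_hours, audio_hours_per_chip_hour, price_per_chip_hour):
    """Total cost = chip-hours required * price per chip-hour."""
    chip_hours = audio_hours / audio_hours_per_chip_hour
    return chip_hours * price_per_chip_hour

# Example: 100k hours of audio with illustrative throughput and pricing.
print(cost_to_transcribe(100_000, audio_hours_per_chip_hour=500, price_per_chip_hour=1.20))
```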
Results: Cloud TPU v5e leads in large-scale inference price-performance
Our experimental results show that Cloud TPU v5e is the most cost-efficient accelerator for running large-scale inference on our model: it delivers 2.7x greater performance per dollar than G2 instances and 4.2x greater performance per dollar than A2 instances.