
Introducing the new PyTorch Dataflux Dataset abstraction


Under the hood

To achieve such significant performance gains with Dataflux, we addressed the data-loading bottlenecks in ML training workflows. In a training run, data is loaded in batches from storage and, after some processing, is sent from the CPU to the GPU for ML training computations. If reading and constructing a batch takes longer than the GPU computation, the GPU is effectively stalled and underutilized, leading to longer training times.

When data is in a cloud-based object storage system (like Google Cloud Storage), it takes longer to fetch than from a local disk, especially if the data is stored in many small objects. This is due to time-to-first-byte latency; once an object is opened, however, the cloud storage platform provides high throughput. In Dataflux, we employ a Cloud Storage feature called Compose Objects that can dynamically combine many smaller objects into a larger one. Then, instead of fetching (say) 1,024 small objects for a batch, we fetch only around 30 larger objects and download those to memory. The larger objects are then decomposed back into their individual smaller objects and served as the dataset samples. Any temporary composed objects created in the process are also cleaned up.
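For illustration, here is a minimal sketch of that compose-download-decompose pattern using the google-cloud-storage Python library. The bucket and object names are made-up assumptions, and Dataflux's internal implementation handles batching, naming, and cleanup for you.

```python
from google.cloud import storage

# A minimal sketch of the Compose Objects pattern; bucket and object names are
# illustrative assumptions, not Dataflux internals.
client = storage.Client()
bucket = client.bucket("my-training-data")

# A batch of small sample objects to combine (Cloud Storage allows up to 32
# source objects per compose call).
sources = [bucket.blob(f"samples/sample_{i:04d}.bin") for i in range(32)]

# Record each source object's size so the composed blob can be split back apart.
sizes = []
for blob in sources:
    blob.reload()           # fetch metadata, including size
    sizes.append(blob.size)

# Compose the small objects into one temporary larger object, download it in a
# single request, then clean up the temporary object.
composed = bucket.blob("tmp/composed_batch_0000")
composed.compose(sources)
data = composed.download_as_bytes()
composed.delete()

# Decompose: slice the downloaded bytes back into per-sample payloads.
samples, offset = [], 0
for n in sizes:
    samples.append(data[offset:offset + n])
    offset += n
```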

Another optimization that Dataflux Datasets employs is high-throughput parallel listing, which speeds up gathering the initial metadata needed to construct the dataset. Dataflux uses a work-stealing algorithm to significantly accelerate listings; with it, even the first training epoch is faster than it would be without parallel listing, even on datasets with tens of millions of objects. A simplified sketch of the idea follows.
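The sketch below illustrates range-partitioned parallel listing with the google-cloud-storage library. It uses a static split of the namespace for simplicity, whereas Dataflux's work-stealing algorithm rebalances ranges dynamically so idle workers take work from busy ones. The bucket name and range boundaries are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

BUCKET = "my-training-data"  # illustrative bucket name

# Statically split the object namespace into lexicographic ranges. Dataflux's
# actual algorithm is dynamic: idle workers "steal" unlisted ranges from busy
# workers, so no single worker becomes the bottleneck on skewed name distributions.
boundaries = [None, "8", "g", "o", "w", None]
ranges = list(zip(boundaries[:-1], boundaries[1:]))

def list_range(bounds):
    start, end = bounds
    client = storage.Client()  # one client per worker thread
    # list_blobs accepts start_offset/end_offset to bound a listing lexicographically.
    return [
        (blob.name, blob.size)
        for blob in client.list_blobs(BUCKET, start_offset=start, end_offset=end)
    ]

with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
    chunks = pool.map(list_range, ranges)

listing = [entry for chunk in chunks for entry in chunk]
print(f"Listed {len(listing)} objects")
```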

Together, fast listing and dynamic composition help ensure that ML training with Dataflux incurs minimal GPU stalls, greatly reducing training time and increasing accelerator utilization.

Fast listing and dynamic composition are part of the Dataflux Client Libraries and are available on GitHub. Dataflux Dataset uses these client libraries under the hood.

Dataflux is available now

Give the Dataflux Dataset for PyTorch (or the Dataflux Python client library, if you're writing your own ML training dataset code) a try and let us know how it boosts your workflows!
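For context, the sketch below shows roughly the boilerplate you would otherwise write yourself for a map-style PyTorch dataset backed by Cloud Storage. The class, bucket, and prefix names are illustrative assumptions; the Dataflux Dataset replaces this pattern and adds fast parallel listing and dynamic composition under the hood.

```python
from google.cloud import storage
from torch.utils.data import DataLoader, Dataset

# A minimal hand-rolled Cloud Storage-backed dataset, shown only for comparison.
class GcsBytesDataset(Dataset):
    def __init__(self, bucket_name: str, prefix: str = ""):
        self._bucket_name = bucket_name
        # Sequential listing at construction time; Dataflux parallelizes this step.
        self._names = [
            blob.name for blob in storage.Client().list_blobs(bucket_name, prefix=prefix)
        ]
        self._client = None  # created lazily so each DataLoader worker gets its own

    def __len__(self):
        return len(self._names)

    def __getitem__(self, idx):
        if self._client is None:
            self._client = storage.Client()
        # One GET request per sample; Dataflux instead batches small objects
        # together via Compose Objects before downloading.
        blob = self._client.bucket(self._bucket_name).blob(self._names[idx])
        return blob.download_as_bytes()

# Each batch is a list of raw byte payloads; decode and transform as needed.
loader = DataLoader(GcsBytesDataset("my-training-data", "samples/"),
                    batch_size=64, num_workers=4)
```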

You can learn more about this and our other storage AI-related capabilities in our Google Cloud Next '24 recorded session, "How to define a storage infrastructure for AI and analytical workloads."

