Mastering GPUs: A Beginner’s Guide to GPU-Accelerated DataFrames in Python

Partnership Post


If you’re working in Python with large datasets, perhaps several gigabytes in size, you can likely relate to the frustration of waiting hours for your queries to finish as your CPU-based pandas DataFrame struggles to perform operations. This exact scenario is where a pandas user should consider leveraging the power of GPUs for data processing with RAPIDS cuDF.

RAPIDS cuDF, with its pandas-like API, enables data scientists and engineers to quickly tap into the immense potential of parallel computing on GPUs, with just a few changes to their code.

If you’re unfamiliar with GPU acceleration, this post is an easy introduction to the RAPIDS ecosystem and showcases the most common functionality of cuDF, the GPU-based counterpart of the pandas DataFrame.

Want a handy summary of these tips? Follow along with the downloadable cuDF cheat sheet.


Leveraging GPUs with the cuDF DataFrame

cuDF is a data science building block for the RAPIDS suite of GPU-accelerated libraries. It’s an EDA workhorse you can use to build data pipelines that process data and derive new features. As a fundamental component within the RAPIDS suite, cuDF underpins the other libraries, solidifying its role as a common building block. Like all components in the RAPIDS suite, cuDF employs the CUDA backend to power GPU computations.

However, with an easy and familiar Python interface, cuDF users don’t need to interact directly with that layer.

How cuDF Can Make Your Data Science Work Faster

Are you tired of watching the clock while your script runs? Whether you’re handling string data or working with time series, there are many ways you can use cuDF to drive your data work forward.

  • Time series analysis: Whether you’re resampling data, extracting features, or conducting complex computations, cuDF offers a substantial speed-up, potentially up to 880x faster than pandas for time-series analysis.
  • Real-time exploratory data analysis (EDA): Browsing through large datasets can be a chore with traditional tools, but cuDF’s GPU-accelerated processing power makes real-time exploration of even the biggest datasets possible.
  • Machine learning (ML) data preparation: Speed up data transformation tasks and prepare your data for commonly used ML algorithms, such as regression, classification, and clustering, with cuDF’s acceleration capabilities. Efficient processing means quicker model development and lets you move toward deployment faster.
  • Large-scale data visualization: Whether you’re creating heat maps for geographic data or visualizing complex financial trends, developers can deploy data visualization libraries with high-performance, high-FPS data visualization by using cuDF and cuxfilter. This integration allows real-time interactivity to become a vital component of your analytics cycle.
  • Large-scale data filtering and transformation: For large datasets exceeding several gigabytes, you can perform filtering and transformation tasks using cuDF in a fraction of the time it takes with pandas.
  • String data processing: Traditionally, string data processing has been a challenging and slow task due to the complex nature of textual data. These operations are made effortless with GPU acceleration.
  • GroupBy operations: GroupBy operations are a staple in data analysis but can be resource-intensive. cuDF speeds up these tasks significantly, allowing you to gain insights faster when splitting and aggregating your data.
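To illustrate the GroupBy point above, here is a minimal split-apply-combine aggregation written with pandas; since cuDF mirrors the pandas API, swapping the `pd` import for `cudf` would run the same operation on the GPU. The column names are invented for this sketch:

```python
import pandas as pd

# a tiny example frame; with cuDF installed, `cudf.DataFrame(...)`
# accepts the same constructor arguments
df = pd.DataFrame({
    'category': ['a', 'b', 'a', 'b'],
    'value': [10, 20, 30, 40],
})

# split rows by category, then aggregate each group's values
means = df.groupby('category')['value'].mean()
```

On a multi-gigabyte frame, this is the kind of operation where the GPU's parallelism pays off most.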

Familiar interface for GPU processing

The core premise of RAPIDS is to provide a familiar user experience to popular data science tools so that the power of NVIDIA GPUs is easily accessible to all practitioners. Whether you’re performing ETL, building ML models, or processing graphs, if you know pandas, NumPy, scikit-learn, or NetworkX, you’ll feel at home when using RAPIDS.

Switching from a CPU to a GPU data science stack has never been easier: with as little change as importing cuDF instead of pandas, you can harness the enormous power of NVIDIA GPUs, speeding up workloads 10-100x (on the low end), and enjoying more productivity, all while using your favorite tools.

Check the sample code below, which shows how familiar the cuDF API is to anyone using pandas.

import pandas as pd
import cudf
df_cpu = pd.read_csv('/data/sample.csv')
df_gpu = cudf.read_csv('/data/sample.csv')


Loading data from your favorite data sources

The reading and writing capabilities of cuDF have grown significantly since the first release of RAPIDS in October 2018. The data can be local to a machine, stored in an on-prem cluster, or in the cloud. cuDF uses the fsspec library to abstract most of the file-system related tasks so you can focus on what matters the most: creating features and building your model.

Thanks to fsspec, reading data from either a local or cloud file system requires only providing credentials to the latter. The example below reads the same file from two different locations,

import cudf
df_local = cudf.read_csv('/data/sample.csv')
df_remote = cudf.read_csv(
      's3://<bucket_name>/sample.csv'
    , storage_options = {'anon': True})


cuDF supports multiple file formats: text-based formats like CSV/TSV or JSON, columnar formats like Parquet or ORC, and row-oriented formats like Avro. In terms of file system support, cuDF can read files from a local file system, cloud providers like AWS S3, Google GS, or Azure Blob/Data Lake, on- or off-prem Hadoop File Systems, and also directly from HTTP or (S)FTP web servers, Dropbox or Google Drive, or the Jupyter File System.
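As a sketch of what these readers look like, the snippet below uses pandas `read_json` on an in-memory buffer standing in for a real file; cuDF ships readers under the same names (e.g. `cudf.read_json`, `cudf.read_parquet`, `cudf.read_orc`, `cudf.read_avro`), and the data here is illustrative:

```python
import io
import pandas as pd

# each supported format has a matching reader; here JSON is read
# from an in-memory buffer, but a local path, an s3:// URI, or an
# http:// URL would work the same way
buf = io.StringIO('[{"foo": 1, "bar": "a"}, {"foo": 2, "bar": "b"}]')
df = pd.read_json(buf)
```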


Creating and saving DataFrames with ease

Reading files is not the only way to create cuDF DataFrames. In fact, there are at least four ways to do so:

From a list of values, you can create a DataFrame with one column,

cudf.DataFrame([1,2,3,4], columns=['foo'])

Passing a dictionary if you want to create a DataFrame with multiple columns,

cudf.DataFrame({
      'foo': [1,2,3,4]
    , 'bar': ['a','b','c',None]
})


Creating an empty DataFrame and assigning columns,

df_sample = cudf.DataFrame()
df_sample['foo'] = [1,2,3,4]
df_sample['bar'] = ['a','b','c',None]


Passing a list of tuples,

cudf.DataFrame([
      (1, 'a')
    , (2, 'b')
    , (3, 'c')
    , (4, None)
], columns=['ints', 'strings'])


You can also convert to and from other memory representations:

  • From an internal GPU matrix represented as a DeviceNDArray,
  • Through DLPack memory objects used to share tensors between deep learning frameworks, and the Apache Arrow format that facilitates a much more convenient way of manipulating memory objects from various programming languages,
  • Converting to and from pandas DataFrames and Series.

In addition, cuDF supports saving the data stored in a DataFrame into multiple formats and file systems. In fact, cuDF can store data in all the formats it can read.

All of these capabilities make it possible to get up and running quickly no matter what your task is or where your data lives.


Extracting, transforming, and summarizing data

The fundamental data science task, and the one that all data scientists complain about, is cleaning, featurizing, and getting familiar with the dataset. We spend 80% of our time doing that. Why does it take so much time?

One of the reasons is that the questions we ask the dataset take too long to answer. Anyone who has tried to read and process a 2GB dataset on a CPU knows what we’re talking about.

Additionally, since we’re human and we make mistakes, rerunning a pipeline can quickly turn into a full-day exercise. This results in lost productivity and, likely, a coffee addiction if we take a look at the chart below.


Figure 1. Typical workday for a developer using a GPU- vs. CPU-powered workflow


RAPIDS with the GPU-powered workflow alleviates all these hurdles. The ETL stage is typically anywhere between 8-20x faster, so loading that 2GB dataset takes seconds compared to minutes on a CPU; cleaning and transforming the data are also orders of magnitude faster! All this with a familiar interface and minimal code changes.


Working with strings and dates on GPUs

No more than five years ago, working with strings and dates on GPUs was considered almost impossible and beyond the reach of low-level programming languages like CUDA. After all, GPUs were designed to process graphics, that is, to manipulate large arrays and matrices of ints and floats, not strings or dates.

RAPIDS lets you not only read strings into GPU memory, but also extract features, process, and manipulate them. If you’re familiar with regex, then extracting useful information from a document on a GPU is now a trivial task thanks to cuDF. For example, if you want to find and extract all the words in your document that match the [a-z]*flow pattern (like dataflow, workflow, or flow), all you need to do is,


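The original snippet is not reproduced here, so below is a minimal sketch using the pandas string API, which cuDF mirrors on the GPU; the column name `description` and the sample sentences are assumptions for the example, with the `[a-z]*flow` pattern taken from the text:

```python
import pandas as pd

df = pd.DataFrame({
    'description': ['the dataflow feeds a workflow', 'no match here'],
})

# findall returns, per row, every substring matching the pattern;
# with cuDF installed, replacing `pd` with `cudf` runs this on the GPU
matches = df['description'].str.findall('([a-z]*flow)')
```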
Extracting useful features from dates or querying the data for a specific period of time has also become easier and faster thanks to RAPIDS.

import datetime as dt

dt_to = dt.datetime.strptime("2020-10-03", "%Y-%m-%d")
df.query('dttm <= @dt_to')
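A self-contained version of the date filter above, written with pandas (cuDF supports the same `query` syntax, including `@variable` references to local variables); the `dttm` column contents are invented for the example:

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    'dttm': pd.to_datetime(['2020-09-30', '2020-10-02', '2020-10-05']),
    'value': [1, 2, 3],
})

# keep only the rows on or before the cut-off date
dt_to = dt.datetime.strptime("2020-10-03", "%Y-%m-%d")
filtered = df.query('dttm <= @dt_to')
```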


Empowering Pandas Users with GPU Acceleration

The transition from a CPU to a GPU data science stack is straightforward with RAPIDS. Importing cuDF instead of pandas is a small change that can deliver immense benefits. Whether you’re working on a local GPU box or scaling up to full-fledged data centers, the GPU-accelerated power of RAPIDS provides 10-100x speed improvements (on the low end). This not only leads to increased productivity but also allows for efficient utilization of your favorite tools, even in the most demanding, large-scale scenarios.

RAPIDS has truly transformed the landscape of data processing, enabling data scientists to complete tasks in minutes that once took hours or even days, leading to increased productivity and lower overall costs.

To get started applying these techniques to your dataset, read the accelerated data analytics series on the NVIDIA Technical Blog.

Editor’s Note: This post was updated with permission and originally adapted from the NVIDIA Technical Blog.
