The Good, The Bad, and the Ugly of pd.get_dummies | by Adam Ross Nelson | Jul, 2023

This is for the pd.get_dummies diehards

Howdy people 🤠

Okay, I get it. One of the best ways to convert a categorical to an array of dummies in Python is with Pandas’ pd.get_dummies(). Why would you take the time to import OneHotEncoder from sklearn, execute a .fit_transform(), etc., etc., etc.? Talk about tedious!

This article will first introduce a simple data set for demonstration purposes, consisting of a testing set that contains categoricals not found in the training set. Then, it will demonstrate how using pd.get_dummies() can lead to problems with the demonstration data. And, finally, it will show how to avoid that problem with sklearn’s OneHotEncoder.

Three panda bears that look like country western cowboys. Two bears have hats. They’re on a green field.
Image Credit: Author’s illustration using text to image in Canva. Prompted: “Three panda bears dressed as country western cowboys.”

Here we have a simple dataset that includes a categorical feature called OS. The OS column lists computer operating systems. We will use this fictional data for purposes of demonstration. In train_df we have fictional demonstration training data, while in test_df we have fictional demonstration testing data.

In our fictional demonstration case, the testing set contains categorical values not present in the training set. This mismatch will cause problems.

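import pandas as pd

# Fictional training data: three distinct operating systems
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                                'Linux', 'Windows', 'MacOS']})
# Fictional testing data: includes categories never seen in training
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                               'Android', 'Unix', 'iOS']})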

In our training data, we have three operating systems: Windows, MacOS, and Linux. But in our testing data, we have additional categories, including Android, Unix, and iOS.
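The mismatch described above can be checked directly before any encoding happens. The following is a minimal sketch, not part of the original walkthrough; the variable names unseen_in_training and missing_from_testing are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Same fictional data as in the article
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                                'Linux', 'Windows', 'MacOS']})
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                               'Android', 'Unix', 'iOS']})

# Collect the distinct categories in each set
train_categories = set(train_df['OS'].unique())
test_categories = set(test_df['OS'].unique())

# Categories the testing set has that training never saw, and vice versa
unseen_in_training = sorted(test_categories - train_categories)
missing_from_testing = sorted(train_categories - test_categories)

print(unseen_in_training)    # ['Android', 'Unix', 'iOS']
print(missing_from_testing)  # ['Linux']
```

A quick check like this can flag a category mismatch early, before it shows up as a column mismatch downstream.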

A model fit on pd.get_dummies(train_df) will not work with testing data from pd.get_dummies(test_df). The resulting columns do not match.

A wooden dummy model used in art, shown on a blue background.
Image Credit: Author’s illustration created in Canva using Canva stock images. An art supply dummy.

When applying the pd.get_dummies() function to both our training and testing datasets, here is what you’ll get.
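A minimal sketch of that step, assuming the train_df and test_df defined earlier (repeated here so the block runs on its own). The names train_dummies, test_dummies, only_in_train, and only_in_test are hypothetical and used only for illustration. Column names are sorted before printing so the comparison does not depend on the order pandas returns them in.

```python
import pandas as pd

# Same fictional data as in the article
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                                'Linux', 'Windows', 'MacOS']})
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                               'Android', 'Unix', 'iOS']})

# Encode each set independently, as a pd.get_dummies diehard might
train_dummies = pd.get_dummies(train_df)
test_dummies = pd.get_dummies(test_df)

print(sorted(train_dummies.columns))
# ['OS_Linux', 'OS_MacOS', 'OS_Windows']
print(sorted(test_dummies.columns))
# ['OS_Android', 'OS_MacOS', 'OS_Unix', 'OS_Windows', 'OS_iOS']

# Columns present in one encoded set but not the other
only_in_train = sorted(set(train_dummies.columns) - set(test_dummies.columns))
only_in_test = sorted(set(test_dummies.columns) - set(train_dummies.columns))
print(only_in_train)  # ['OS_Linux']
print(only_in_test)   # ['OS_Android', 'OS_Unix', 'OS_iOS']
```

The training frame ends up with three dummy columns and the testing frame with five, and only two of them (OS_MacOS and OS_Windows) are shared. A model fit on the three training columns has no matching input for the testing frame.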
