Howdy folks 🤠
Okay, I get it. One of the best ways to convert a categorical to an array of dummies in Python is with the Pandas pd.get_dummies(). Why would you take the time to import OneHotEncoder from sklearn, execute a .fit_transform(), etc., etc., etc.? Talk about tedious!
This article will first introduce a simple data set for demonstration purposes, with a testing set that contains categoricals not found in the training set. Then, it will demonstrate how using pd.get_dummies() can lead to problems with the demonstration data. And, finally, it will show how to avoid that problem with sklearn's OneHotEncoder.
Here we have a simple dataset that includes a categorical feature called OS. The OS column lists computer operating systems. We will use this fictional data for purposes of demonstration. train_df holds the fictional training data, while test_df holds the fictional testing data.
In our fictional demonstration case, the testing set contains categorical values not present in the training set. This mismatch will cause problems.
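import pandas as pd

# Training data: three distinct operating systems
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                                'Linux', 'Windows', 'MacOS']})
# Testing data: includes operating systems never seen in training
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                               'Android', 'Unix', 'iOS']})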
In our training data, we have three operating systems: Windows, MacOS, and Linux. But in our testing data, we have additional categories: Android, Unix, and iOS. The testing data also has no Linux rows at all.
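If you want to see the mismatch before encoding anything, a quick set difference does the job. This is only a minimal sketch: it re-creates the two data frames so it runs on its own, and the variable names unseen_in_training and missing_from_testing are made up for illustration.

```python
import pandas as pd

# Same fictional demonstration data as in the article
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS', 'Linux', 'Windows', 'MacOS']})
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS', 'Android', 'Unix', 'iOS']})

# Distinct category values in each set
train_categories = set(train_df['OS'].unique())
test_categories = set(test_df['OS'].unique())

# Hypothetical helper names, chosen only for this illustration
unseen_in_training = sorted(test_categories - train_categories)
missing_from_testing = sorted(train_categories - test_categories)

print(unseen_in_training)    # ['Android', 'Unix', 'iOS']
print(missing_from_testing)  # ['Linux']
```

Note that sorted() puts uppercase letters before lowercase ones, which is why iOS lands last.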
A model fit on pd.get_dummies(train_df) will not work with testing data from pd.get_dummies(test_df). The resulting columns don't match.
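To make the column mismatch concrete, here is a minimal sketch that encodes each data frame separately and compares the resulting column names. It re-creates the demonstration data so it runs on its own, and it assumes the default pd.get_dummies() behavior of prefixing each dummy column with the source column name (OS_).

```python
import pandas as pd

# Same fictional demonstration data as in the article
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS', 'Linux', 'Windows', 'MacOS']})
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS', 'Android', 'Unix', 'iOS']})

# Encode each set independently, which is exactly what causes the trouble
train_dummies = pd.get_dummies(train_df)
test_dummies = pd.get_dummies(test_df)

train_cols = sorted(train_dummies.columns)
test_cols = sorted(test_dummies.columns)

print(train_cols)
# ['OS_Linux', 'OS_MacOS', 'OS_Windows']
print(test_cols)
# ['OS_Android', 'OS_MacOS', 'OS_Unix', 'OS_Windows', 'OS_iOS']

# Columns only one side knows about
print(sorted(set(test_cols) - set(train_cols)))  # ['OS_Android', 'OS_Unix', 'OS_iOS']
print(sorted(set(train_cols) - set(test_cols)))  # ['OS_Linux']
```

Three columns on one side, five on the other, and only two in common. A model trained on the first layout has no sensible way to read the second.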
When applying the pd.get_dummies() function to both our training and testing datasets, here's what you'll get.