A guide to handling categorical variables in Python | by Andrea D’Agostino | Jun, 2023

A guide on how to approach categorical variables for machine learning and data science purposes

Image by Thomas Haas / Unsplash

Handling categorical variables in a data science or machine learning project is no easy task. This kind of work requires deep knowledge of the field of application and a broad understanding of the multiple methodologies available.

For this reason, this article will focus on explaining the following concepts:

  • what categorical variables are and how to divide them into their different types
  • how to convert them to numeric values based on their type
  • tools and techniques for managing them, primarily using Sklearn

Proper handling of categorical variables can greatly improve the results of our predictive model or analysis. In fact, much of the information relevant to learning and understanding the data may be contained in the available categorical variables.

Just think of tabular data split by the variable gender or by a certain color. These splits, based on the number of categories, can bring out significant differences between groups, which can inform the analyst or the learning algorithm.

Let’s start by defining what they are and how they can present themselves.

Categorical variables are a type of variable used in statistics and data science to represent qualitative or nominal data. These variables can be defined as a class or category of data that cannot be quantified continuously, but only discretely.

For example, a categorical variable might be a person’s eye color, which can be blue, green, or brown.

Most learning models don’t work with data in a categorical format. We must first convert them into numeric format so that the information is preserved.

Categorical variables can be classified into two types:

Nominal variables are variables that are not constrained by a precise order. Gender, color, or brands are examples of nominal variables since they are not sortable.

Ordinal variables are instead categorical variables divided into logically orderable levels. A column in a dataset that consists of levels such as First, Second, and Third can be considered an ordinal categorical variable.
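To make the distinction concrete, here is a minimal sketch using the pandas Categorical type. The column names and values are invented for illustration and are not part of the dataset used later in this guide.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],      # nominal: no natural order
    "rank": ["Second", "First", "Third", "First"],   # ordinal: First < Second < Third
})

# A nominal variable is stored as an unordered categorical
toy["color"] = pd.Categorical(toy["color"])

# An ordinal variable declares its levels explicitly, from lowest to highest
toy["rank"] = pd.Categorical(
    toy["rank"], categories=["First", "Second", "Third"], ordered=True
)

print(toy["color"].cat.ordered)         # whether the categories carry an order
print(toy["rank"].cat.ordered)
print(toy["rank"].cat.codes.tolist())   # integer codes follow the declared order
print(toy["rank"].min(), toy["rank"].max())
```

Only the ordered column supports comparisons such as min and max, which is exactly the property that separates ordinal from nominal variables.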

You can go deeper into the breakdown of categorical variables by considering binary and cyclic variables.

A binary variable is simple to understand: it is a categorical variable that can only take on two values.

A cyclic variable, on the other hand, is characterized by a repetition of its values. For example, the days of the week are cyclical, and so are the seasons.

Now that we’ve defined what categorical variables are and what they look like, let’s tackle the question of transforming them using a practical example: a Kaggle dataset called cat-in-the-dat.

The dataset

This is an open source dataset at the basis of an introductory competition on the management and modeling of categorical variables, called the Categorical Feature Encoding Challenge II. You can download the data directly from the competition page on Kaggle.

The peculiarity of this dataset is that it contains exclusively categorical data, so it is the perfect use case for this guide. It includes nominal, ordinal, cyclic, and binary variables.

We’ll see techniques for transforming each variable into a format usable by a learning model.

The dataset looks like this

Image by author.

Since the target variable can only take on two values, this is a binary classification task. We’ll use the AUC metric to evaluate our model.

Now we’re going to apply techniques for managing categorical variables using the mentioned dataset.

1. Label Encoding (mapping to an arbitrary number)

The simplest technique for converting a category into a usable format is to assign each category an arbitrary number.

Take for example the ord_2 column, which contains the categories

array(['Hot', 'Warm', 'Freezing', 'Lava Hot', 'Cold', 'Boiling Hot', nan],
      dtype=object)

The mapping could be done like this using Python and Pandas:

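df_train = train.copy()

# manual mapping from each category to an arbitrary integer
mapping = {
    "Cold": 0,
    "Hot": 1,
    "Lava Hot": 2,
    "Boiling Hot": 3,
    "Freezing": 4,
    "Warm": 5,
}

# apply the mapping to the column (missing values stay NaN)
df_train["ord_2"].map(mapping)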

0         1.0
1         5.0
2         4.0
3         2.0
4         0.0
         ...
599995    4.0
599996    3.0
599997    4.0
599998    5.0
599999    3.0
Name: ord_2, Length: 600000, dtype: float64
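The float64 dtype in the output above is worth a note: categories that are missing from the mapping, including NaN, stay NaN, and NaN cannot live in a plain integer column. A minimal sketch on a made-up five-row version of ord_2 (the data here is hypothetical) shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the ord_2 column, including a missing value
ord_2 = pd.Series(["Hot", "Warm", np.nan, "Cold", "Boiling Hot"])

mapping = {"Cold": 0, "Hot": 1, "Lava Hot": 2, "Boiling Hot": 3, "Freezing": 4, "Warm": 5}

encoded = ord_2.map(mapping)
print(encoded.tolist())   # the missing value is still NaN after mapping
print(encoded.dtype)      # float64, because NaN forces a floating point column
```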

However, this method has a problem: you have to manually declare the mapping. For a small number of categories this is not a problem, but for a large number it could be.

For this we will use Scikit-Learn and the LabelEncoder object to achieve the same result in a more flexible way.

from sklearn import preprocessing

# we handle missing values (assignment avoids chained inplace modification)
df_train["ord_2"] = df_train["ord_2"].fillna("NONE")
# init the sklearn encoder
le = preprocessing.LabelEncoder()
# fit + transform
df_train["ord_2"] = le.fit_transform(df_train["ord_2"])

0         3
1         6
2         2
3         4
4         1
         ..
599995    2
599996    0
599997    2
599998    6
599999    0
Name: ord_2, Length: 600000, dtype: int64

The mapping is managed by Sklearn. We can visualize it like this:

mapping = {label: index for index, label in enumerate(le.classes_)}

{'Boiling Hot': 0,
 'Cold': 1,
 'Freezing': 2,
 'Hot': 3,
 'Lava Hot': 4,
 'NONE': 5,
 'Warm': 6}

Note the .fillna("NONE") in the code snippet above. Sklearn’s label encoder does not handle empty values and will raise an error if it finds any.

One of the most important things to keep in mind for the correct handling of categorical variables is to always handle the empty values. In fact, most of the relevant techniques don’t work if these aren’t taken care of.
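As a minimal sketch of the fill-then-encode pattern, assuming a made-up five-row column rather than the real dataset, the snippet below shows that the encoder sorts its classes alphabetically and that the integers can be mapped back to the original labels:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini version of the ord_2 column
ord_2 = pd.Series(["Hot", "Warm", np.nan, "Cold", "Hot"])

# Replace missing values with an explicit placeholder category before encoding
filled = ord_2.fillna("NONE")

le = LabelEncoder()
codes = le.fit_transform(filled)

print(list(le.classes_))   # classes are sorted, so "NONE" lands between "Hot" and "Warm"
print(codes.tolist())
# The encoder can also map the integers back to the original labels
print(le.inverse_transform(codes).tolist())
```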

The label encoder maps arbitrary numbers to each category in the column, without an explicit declaration of the mapping. This is convenient, but it creates a problem for some predictive models: the data needs to be scaled if the column is not the target one.

In fact, machine learning beginners often ask what the difference is between the label encoder and the one hot encoder, which we will see shortly. The label encoder, by design, should be applied to the labels, i.e. the target variable we want to predict, and not to the other columns.

Having said that, some models that are very relevant in the field work well even with an encoding of this type. I’m talking about tree models, among which XGBoost and LightGBM stand out.

So feel free to use label encoders if you decide to use tree models, but otherwise, we have to use one hot encoding.

2. One Hot Encoding

As I already mentioned in my article about vector representations in machine learning, one hot encoding is a very common and well-known vectorization technique (i.e. converting a text into a number).

It works like this: for each category present, a square matrix is created whose only possible values are 0 and 1. This matrix informs the model that, among all possible categories, the observed row has the value denoted by 1.

An example:

            |   |   |   |   |   |
Freezing    | 0 | 0 | 0 | 0 | 0 | 1
Warm        | 0 | 0 | 0 | 0 | 1 | 0
Cold        | 0 | 0 | 0 | 1 | 0 | 0
Boiling Hot | 0 | 0 | 1 | 0 | 0 | 0
Hot         | 0 | 1 | 0 | 0 | 0 | 0
Lava Hot    | 1 | 0 | 0 | 0 | 0 | 0
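The table above can be reproduced by hand with NumPy. This is only a sketch of the idea, not how Sklearn implements it, and the column order chosen here (alphabetical) is an arbitrary assumption:

```python
import numpy as np

categories = ["Boiling Hot", "Cold", "Freezing", "Hot", "Lava Hot", "Warm"]
# Position of each category in the list above
index = {label: i for i, label in enumerate(categories)}

# Hypothetical observations to encode
observations = ["Hot", "Freezing", "Hot", "Warm"]
codes = [index[label] for label in observations]

# Row i of the identity matrix is the one hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

Each row contains exactly one 1, placed in the column of the observed category.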

The array is of size n_categories. This is very useful information, because one hot encoding often requires a sparse representation of the converted data.

What does this mean? It means that for large numbers of categories, the matrix could become equally large. Since it is populated only by 0s and 1s, and only one position per row can hold a 1, the one hot representation is very redundant and cumbersome.

A sparse matrix solves this problem: only the positions of the 1s are stored, while values equal to 0 are not stored. This allows us to store a huge array of information in exchange for very little memory usage.

Let’s see what such an array looks like in Python, applying the code from before again

from sklearn import preprocessing

# we handle missing values (assignment avoids chained inplace modification)
df_train["ord_2"] = df_train["ord_2"].fillna("NONE")
# init sklearn's encoder
ohe = preprocessing.OneHotEncoder()
# fit + transform (the encoder expects a 2D input, hence the reshape)
ohe.fit_transform(df_train["ord_2"].values.reshape(-1, 1))

<600000x7 sparse matrix of type '<class 'numpy.float64'>'
	with 600000 stored elements in Compressed Sparse Row format>

By default Sklearn returns a sparse matrix object, not a dense array of values. To get such an array, you need to use .toarray()

ohe.fit_transform(df_train["ord_2"].values.reshape(-1, 1)).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 1.],
[1., 0., 0., ..., 0., 0., 0.]])
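To get a feel for the memory savings, here is a rough sketch that compares the two representations on a made-up column of 7,000 labels (a stand-in for ord_2, not the real data). It also prints categories_, which tells us which category each column of the encoded matrix refers to.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = np.array(["Boiling Hot", "Cold", "Freezing", "Hot", "Lava Hot", "NONE", "Warm"])
# Hypothetical column: the seven labels repeated 1,000 times
column = np.tile(labels, 1_000).reshape(-1, 1)

ohe = OneHotEncoder()
encoded = ohe.fit_transform(column)   # SciPy sparse matrix by default
dense = encoded.toarray()             # regular NumPy array

# Bytes used by the sparse matrix: stored values plus their index arrays
sparse_bytes = encoded.data.nbytes + encoded.indices.nbytes + encoded.indptr.nbytes

print(encoded.shape, encoded.nnz)     # one stored element per row
print(ohe.categories_[0])             # column order of the encoded matrix
print(sparse_bytes, dense.nbytes)
```

With only seven categories the gap is modest, but it widens as the number of categories grows, because the dense array grows with every new column while the sparse one still stores a single value per row.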

Don’t worry if you don’t fully understand the concept: we will soon see how to apply the label and one hot encoders to the dataset to train a predictive model.

Label encoding and one hot encoding are the most important techniques for handling categorical variables. Knowing these two techniques will allow you to handle most cases that involve categorical variables.

3. Transformations and aggregations

Another method of converting from categorical to numeric format is to perform a transformation or aggregation on the variable.

By grouping with .groupby() it is possible to use the count of the values present in the column as the output of the transformation.


Boiling Hot     84790
Cold            97822
Freezing       142726
Hot             67508
Lava Hot        64840
Warm           124239
Name: id, dtype: int64

Using .transform() we can substitute these numbers into the corresponding cells


0          67508.0
1         124239.0
2         142726.0
3          64840.0
4          97822.0
            ...
599995    142726.0
599996     84790.0
599997    142726.0
599998    124239.0
599999     84790.0
Name: id, Length: 600000, dtype: float64

It is possible to apply this logic with other mathematical operations as well. Which one most improves the performance of our model has to be tested.
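The two steps described in this section can be sketched as follows. The toy frame is hypothetical and only borrows the id and ord_2 column names from the dataset; it contains no missing values, so the counts stay integers here.

```python
import pandas as pd

# Hypothetical toy frame; "id" and "ord_2" mimic the dataset's column names
toy = pd.DataFrame({
    "id": range(6),
    "ord_2": ["Hot", "Cold", "Hot", "Warm", "Hot", "Cold"],
})

# Aggregation: one row per category, holding the number of occurrences
counts = toy.groupby("ord_2")["id"].count()
print(counts)

# Transformation: same length as the input, each row gets its category's count
toy["ord_2_count"] = toy.groupby("ord_2")["id"].transform("count")
print(toy)
```

Swapping "count" for another aggregation name, such as "mean" on a numeric column, follows the same pattern.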

4. Create new categorical features from categorical variables

Let’s look at the ord_1 column together with ord_2

Image by author.

We can create new categorical variables by merging existing variables. For example, we can merge ord_1 with ord_2 to create a new feature

df_train["new_1"] = df_train["ord_1"].astype(str) + "_" + df_train["ord_2"].astype(str)

0                 Contributor_Hot
1                Grandmaster_Warm
2                    nan_Freezing
3                 Novice_Lava Hot
4                Grandmaster_Cold
                   ...
599995            Novice_Freezing
599996         Novice_Boiling Hot
599997       Contributor_Freezing
599998                Master_Warm
599999    Contributor_Boiling Hot
Name: new_1, Length: 600000, dtype: object

This technique can be applied in almost any case. The idea that must guide the analyst is to improve the performance of the model by adding information that was originally hard for the learning model to see.

5. Use NaN as a categorical variable

Very often null values are removed. This is usually not a move I recommend, since the NaNs contain potentially useful information for our model.

One solution is to treat NaNs as a category in their own right.

Let’s look at the ord_2 column again


Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
Name: ord_2, dtype: int64

Now let’s try applying .fillna("NONE") to see how many empty cells exist


Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
NONE            18075

As a percentage, NONE represents about 3% of the entire column (18,075 out of 600,000 rows). It’s a fairly noticeable amount. Exploiting the NaN makes even more sense and can be done with the One Hot Encoder mentioned earlier.
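The counts above can be obtained with value_counts(). Here is a minimal sketch on a made-up eight-row column (hypothetical data, not the real ord_2):

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the ord_2 column
ord_2 = pd.Series(["Hot", "Cold", np.nan, "Hot", "Warm", np.nan, "Hot", "Cold"])

# By default value_counts() ignores missing values
print(ord_2.value_counts())

# After filling, the missing values show up as their own category
filled = ord_2.fillna("NONE")
print(filled.value_counts())

# Share of each category, including the placeholder
shares = filled.value_counts(normalize=True)
print(shares["NONE"])
```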

Let’s remember what the OneHotEncoder does: it creates a sparse matrix with one row per observation and one column per unique category in the referenced column. This means we must also take into account the categories that could be present in the test set and absent from the train set.

The situation is similar for the LabelEncoder: there may be categories in the test set which are not present in the training set, and this could create problems during the transformation.

We solve this problem by concatenating the datasets. This will allow us to apply the encoders to all the data and not just the training data.

import pandas as pd
from sklearn import preprocessing

# mark the test rows with a sentinel target so we can split them back later
test["target"] = -1
data = pd.concat([train, test]).reset_index(drop=True)
features = [f for f in train.columns if f not in ["id", "target"]]
for feature in features:
    le = preprocessing.LabelEncoder()
    # fill missing values and cast to string before encoding
    temp_col = data[feature].fillna("NONE").astype(str).values
    data[feature] = le.fit_transform(temp_col)

# split the encoded data back into train and test
train = data[data["target"] != -1].reset_index(drop=True)
test = data[data["target"] == -1].reset_index(drop=True)

Image by author.

This method helps us if we have the test set. If we don’t have the test set, we can reserve a value like NONE for any new category that was not part of our training set.
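One way to sketch this fallback, assuming we reuse "NONE" as the bucket for anything unknown (the data and the "Lukewarm" category are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and incoming data
train_col = pd.Series(["Hot", "Cold", None, "Warm"]).fillna("NONE")
new_col = pd.Series(["Cold", "Lukewarm", None])   # "Lukewarm" was never seen in training

le = LabelEncoder()
le.fit(train_col)

# Send missing values and unseen categories to the "NONE" bucket before transforming
new_filled = new_col.fillna("NONE")
new_safe = new_filled.where(new_filled.isin(le.classes_), "NONE")

print(new_safe.tolist())
print(le.transform(new_safe).tolist())
```

This only works if "NONE" was present when the encoder was fitted. For one hot encoding, Sklearn’s OneHotEncoder also accepts handle_unknown="ignore", which encodes unseen categories as a row of zeros.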

Now let’s move on to the training of a simple model. We’ll follow the steps from my article on how to design and implement cross-validation.

We start from scratch, importing our data and creating our folds with Sklearn’s StratifiedKFold.

import pandas as pd
from sklearn import model_selection

train = pd.read_csv("/kaggle/input/cat-in-the-dat-ii/train.csv")
test = pd.read_csv("/kaggle/input/cat-in-the-dat-ii/test.csv")

df = train.copy()

df["kfold"] = -1
# shuffle the rows before assigning folds
df = df.sample(frac=1).reset_index(drop=True)
y = df.target.values

kf = model_selection.StratifiedKFold(n_splits=5)

# store the fold number of each validation row in the kfold column
for f, (t_, v_) in enumerate(kf.split(X=df, y=y)):
    df.loc[v_, "kfold"] = f

This little snippet of code will create a Pandas dataframe with 5 groups to test our model against.
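To see what the stratification buys us, here is a small self-contained sketch on synthetic data (100 made-up rows with 20% positive targets). It follows the same steps as the snippet above and then checks the composition of each fold.

```python
import pandas as pd
from sklearn import model_selection

# Hypothetical data: 100 rows, 20% positive targets
toy = pd.DataFrame({"id": range(100), "target": [1] * 20 + [0] * 80})

toy["kfold"] = -1
toy = toy.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle the rows

kf = model_selection.StratifiedKFold(n_splits=5)
for fold, (_, valid_idx) in enumerate(kf.split(X=toy, y=toy["target"].values)):
    toy.loc[valid_idx, "kfold"] = fold

# Size of each fold and share of positive targets inside it
summary = toy.groupby("kfold")["target"].agg(["count", "mean"])
print(summary)
```

Every fold ends up with the same number of rows and the same share of positive targets as the full dataset, which keeps the AUC scores comparable across folds.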

Image by author.

Now let’s define a function that will test a logistic regression model on each group.

import pandas as pd
from sklearn import linear_model, metrics, preprocessing


def run(fold: int) -> None:
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]

    # fill missing values first, then cast to string
    # (casting first would turn NaN into the string "nan")
    for feature in features:
        df[feature] = df[feature].fillna("NONE").astype(str)

    df_train = df[df["kfold"] != fold].reset_index(drop=True)
    df_valid = df[df["kfold"] == fold].reset_index(drop=True)

    ohe = preprocessing.OneHotEncoder()

    # fit on train + validation so every category is known to the encoder
    full_data = pd.concat([df_train[features], df_valid[features]], axis=0)
    print("Fitting OHE on full data...")
    ohe.fit(full_data[features])

    x_train = ohe.transform(df_train[features])
    x_valid = ohe.transform(df_valid[features])
    print("Training the classifier...")
    model = linear_model.LogisticRegression()
    model.fit(x_train, df_train.target.values)

    # probability of the positive class
    valid_preds = model.predict_proba(x_valid)[:, 1]

    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)

    print(f"FOLD: {fold} | AUC = {auc:.3f}")


# run on the first fold
run(0)


Fitting OHE on full data...
Training the classifier...
FOLD: 0 | AUC = 0.785

I invite the reader to read the article on cross-validation to understand in more detail how the code shown works.

Now let’s see how to use a tree model like XGBoost instead, which also works well with a LabelEncoder.

import xgboost
from sklearn import metrics, preprocessing


def run(fold: int) -> None:
    features = [
        f for f in df.columns if f not in ("id", "target", "kfold")
    ]

    # fill missing values first, then cast to string
    for feature in features:
        df[feature] = df[feature].fillna("NONE").astype(str)

    print("Fitting the LabelEncoder on the features...")
    for feature in features:
        le = preprocessing.LabelEncoder()
        # the encoder has to be fitted before it can transform
        df[feature] = le.fit_transform(df[feature])

    df_train = df[df["kfold"] != fold].reset_index(drop=True)
    df_valid = df[df["kfold"] == fold].reset_index(drop=True)

    x_train = df_train[features].values
    x_valid = df_valid[features].values

    print("Training the classifier...")
    model = xgboost.XGBClassifier(n_jobs=-1, n_estimators=300)
    model.fit(x_train, df_train.target.values)

    # probability of the positive class
    valid_preds = model.predict_proba(x_valid)[:, 1]

    auc = metrics.roc_auc_score(df_valid.target.values, valid_preds)

    print(f"FOLD: {fold} | AUC = {auc:.3f}")


# execute on 2 folds
for fold in range(2):
    run(fold)

Fitting the LabelEncoder on the features...
Training the classifier...
FOLD: 0 | AUC = 0.768
Fitting the LabelEncoder on the features...
Training the classifier...
FOLD: 1 | AUC = 0.765

In conclusion, there are also other techniques worth mentioning for handling categorical variables:

  • Target-based encoding, where the category is converted into the average value assumed by the target variable for that category
  • The embeddings of a neural network, which can be used to represent the textual entity
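As a rough idea of what the first of these looks like, here is a minimal sketch of target-based encoding on made-up data. It is a simplified illustration, not a production recipe.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "ord_2": ["Hot", "Cold", "Hot", "Warm", "Cold", "Hot"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Average target value observed for each category
means = toy.groupby("ord_2")["target"].mean()

# Replace each category with its average target
toy["ord_2_target_enc"] = toy["ord_2"].map(means)
print(toy)
```

In a real project the averages should be computed on the training folds only and then applied to the validation fold, otherwise the target leaks into the features.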

In summary, here are the essential steps for correctly managing categorical variables

  • always handle null values
  • apply LabelEncoder or OneHotEncoder based on the type of variable and the model we want to use
  • reason in terms of variable enrichment, considering NaN or NONE as categorical variables that can inform the model
  • model the data!

Thanks for your time!
