Most educational and real-world datasets contain categorical features. Today we will cover gradient boosted decision trees from the CatBoost library, which provides native support for categorical data. We will use a dataset of mushrooms that are either edible or poisonous. The mushrooms are described by categorical features such as their color, odor, and shape, and the question we want to answer is:
Is this mushroom safe to eat, based on its categorical features?
As you can see, the stakes are high. We want to make sure that we get the machine learning model right so that our mushroom omelet doesn’t end in disaster. As a bonus, at the end we will present a feature importance ranking that tells you which categorical feature is the strongest predictor of mushroom safety.
Introducing the mushroom dataset
The mushroom dataset is available here: https://archive.ics.uci.edu/dataset/73/mushroom . For clarity of presentation, we create a pandas DataFrame from the original cryptic short-form variables and annotate it with proper column names and long-form variables. We use pandas’ replace function with long-form variables taken from the dataset description. The target variable can only take True and False values: the dataset creators played it safe and classified questionable mushrooms as inedible.
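The short-form to long-form conversion described above might look roughly like the sketch below. It is not the exact preprocessing code: the sample rows are made up for illustration, only three columns and a subset of codes are shown, the `edible` column name is a hypothetical choice, and encoding edible as True is an assumption. The letter codes themselves follow the dataset description.

```python
import pandas as pd

# Tiny hand-made sample in the dataset's short-form encoding
# (illustrative rows, not copied from the real file).
raw = pd.DataFrame(
    {
        "class": ["p", "e", "e", "p"],
        "odor": ["p", "a", "l", "f"],
        "cap-shape": ["x", "x", "b", "f"],
    }
)

# Long-form labels per column, taken from the dataset description.
# Only a subset of the codes is listed here to keep the sketch short.
long_form = {
    "odor": {"a": "almond", "l": "anise", "f": "foul", "p": "pungent", "n": "none"},
    "cap-shape": {"b": "bell", "x": "convex", "f": "flat"},
}

# A nested dict tells DataFrame.replace which mapping applies to which column,
# so the code "f" becomes "foul" in odor but "flat" in cap-shape.
df = raw.replace(long_form)

# Boolean target: True for edible ("e"), False for everything else.
# The column name "edible" is an assumption made for this sketch.
df["edible"] = df.pop("class") == "e"

print(df)
```

Using a per-column mapping matters here because the same letter means different things in different columns.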