
Turn and Face the Strange: How you can leverage anomaly detection… | by Evie Fowler | Aug, 2023


Let's imagine that I'm planning a trip out of my home city of Pittsburgh, Pennsylvania. I'm not picky about where I go, but I'd really like to avoid travel hiccups like a canceled, diverted, or even severely delayed flight. A classification model could help me identify which flights are likely to experience problems, and Kaggle has some data that could help me build one.

I begin by reading in my data and creating my own definition of a bad flight: anything canceled, diverted, or arriving 30 or more minutes late.

import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# read in data
airlines2022 = pd.read_csv('myPath/Combined_Flights_2022.csv')
print(airlines2022.shape)
# (4078318, 61)

# subset by my target departure city
airlines2022PIT = airlines2022[airlines2022.Origin == 'PIT']
print(airlines2022PIT.shape)
# (24078, 61)

# combine cancellations, diversions, and 30+ minute delays into one Bad Flight outcome
airlines2022PIT = airlines2022PIT.assign(arrDel30 = airlines2022PIT['ArrDelayMinutes'] >= 30)
airlines2022PIT = (airlines2022PIT
                   .assign(badFlight = 1 * (airlines2022PIT.Cancelled
                                            + airlines2022PIT.Diverted
                                            + airlines2022PIT.arrDel30)))
print(airlines2022PIT.badFlight.mean())
# 0.15873411412908048

About 15% of flights fall into my "bad flight" category. That's not low enough to traditionally frame this as an anomaly detection problem, but it is low enough that supervised methods might not perform as well as I'd hope. Still, I'll start by building a simple gradient boosted tree model to predict whether a flight will experience the kind of problem I'd like to avoid.

To begin, I need to decide which features to use in my model. For the sake of this example I'll pick just a few promising-looking features; in reality, feature selection is an important part of any data science project. Most of the features available here are categorical and need to be encoded as part of this data prep stage; the distance between cities needs to be scaled.

# categorize columns by feature type
toFactor = ['Airline', 'Dest', 'Month', 'DayOfWeek'
            , 'Marketing_Airline_Network', 'Operating_Airline']
toScale = ['Distance']

# drop fields that don't look useful for prediction
airlines2022PIT = airlines2022PIT[toFactor + toScale + ['badFlight']]
print(airlines2022PIT.shape)
# (24078, 8)

# split the original training data into training and validation sets
train, test = train_test_split(airlines2022PIT
                               , test_size = 0.2
                               , random_state = 412)
print(train.shape)
# (19262, 8)
print(test.shape)
# (4816, 8)

# manually scale the distance feature
mn = train.Distance.min()
rng = train.Distance.max() - train.Distance.min()
train = train.assign(Distance_sc = (train.Distance - mn) / rng)
test = test.assign(Distance_sc = (test.Distance - mn) / rng)
train.drop('Distance', axis = 1, inplace = True)
test.drop('Distance', axis = 1, inplace = True)

# make an encoder
enc = make_column_transformer(
    (OneHotEncoder(min_frequency = 0.025, handle_unknown = 'ignore'), toFactor)
    , remainder = 'passthrough'
    , sparse_threshold = 0)

# apply it to the training dataset
train_enc = enc.fit_transform(train)

# convert it back to a Pandas dataframe for ease of use
train_enc_pd = pd.DataFrame(train_enc, columns = enc.get_feature_names_out())

# encode the test set in the same way
test_enc = enc.transform(test)
test_enc_pd = pd.DataFrame(test_enc, columns = enc.get_feature_names_out())

The development and tuning of a tree-based model could easily be its own post, so I won't get into it here. I've used the feature importance rankings of an initial model to do some reverse feature selection, and tuned the model from there. The resulting model performs decently at identifying late, canceled, or diverted flights.
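To illustrate the idea of importance-based reverse feature selection, here is a minimal sketch on synthetic data; the dataset, variable names, and the 0.05 cutoff are all assumptions for the example, not the article's actual values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# synthetic stand-in for the flight data (hypothetical)
X, y = make_classification(n_samples = 500, n_features = 8,
                           n_informative = 3, random_state = 0)

# fit an initial model just to get importance rankings
gbt0 = GradientBoostingClassifier(random_state = 0).fit(X, y)

# keep only features whose importance clears an (assumed) 0.05 cutoff
keep = np.where(gbt0.feature_importances_ > 0.05)[0]
X_reduced = X[:, keep]
print(X_reduced.shape)
```

In practice one would refit and re-evaluate after dropping features, since importances shift as correlated features are removed.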

# feature selection - drop low importance terms
lowimp = ['onehotencoder__Airline_Delta Air Lines Inc.'
, 'onehotencoder__Dest_IAD'
, 'onehotencoder__Operating_Airline_AA'
, 'onehotencoder__Airline_American Airlines Inc.'
, 'onehotencoder__Airline_Comair Inc.'
, 'onehotencoder__Airline_Southwest Airlines Co.'
, 'onehotencoder__Airline_Spirit Air Lines'
, 'onehotencoder__Airline_United Air Lines Inc.'
, 'onehotencoder__Airline_infrequent_sklearn'
, 'onehotencoder__Dest_ATL'
, 'onehotencoder__Dest_BOS'
, 'onehotencoder__Dest_BWI'
, 'onehotencoder__Dest_CLT'
, 'onehotencoder__Dest_DCA'
, 'onehotencoder__Dest_DEN'
, 'onehotencoder__Dest_DFW'
, 'onehotencoder__Dest_DTW'
, 'onehotencoder__Dest_JFK'
, 'onehotencoder__Dest_MDW'
, 'onehotencoder__Dest_MSP'
, 'onehotencoder__Dest_ORD'
, 'onehotencoder__Dest_PHL'
, 'onehotencoder__Dest_infrequent_sklearn'
, 'onehotencoder__Marketing_Airline_Network_AA'
, 'onehotencoder__Marketing_Airline_Network_DL'
, 'onehotencoder__Marketing_Airline_Network_G4'
, 'onehotencoder__Marketing_Airline_Network_NK'
, 'onehotencoder__Marketing_Airline_Network_WN'
, 'onehotencoder__Marketing_Airline_Network_infrequent_sklearn'
, 'onehotencoder__Operating_Airline_9E'
, 'onehotencoder__Operating_Airline_DL'
, 'onehotencoder__Operating_Airline_NK'
, 'onehotencoder__Operating_Airline_OH'
, 'onehotencoder__Operating_Airline_OO'
, 'onehotencoder__Operating_Airline_UA'
, 'onehotencoder__Operating_Airline_WN'
, 'onehotencoder__Operating_Airline_infrequent_sklearn']
lowimp = [x for x in lowimp if x in train_enc_pd.columns]
train_enc_pd = train_enc_pd.drop(lowimp, axis = 1)
test_enc_pd = test_enc_pd.drop(lowimp, axis = 1)

# separate potential predictors from the outcome
train_x = train_enc_pd.drop('remainder__badFlight', axis = 1); train_y = train_enc_pd['remainder__badFlight']
test_x = test_enc_pd.drop('remainder__badFlight', axis = 1); test_y = test_enc_pd['remainder__badFlight']
print(train_x.shape)
print(test_x.shape)

# (19262, 25)
# (4816, 25)

# build model
gbt = GradientBoostingClassifier(learning_rate = 0.1
                                 , n_estimators = 100
                                 , subsample = 0.7
                                 , max_depth = 5
                                 , random_state = 412)

# fit it to the training data
gbt.fit(train_x, train_y)

# calculate the probability scores for each test observation
gbtPreds1Test = gbt.predict_proba(test_x)[:,1]

# use a custom threshold to convert these to binary scores
obsRate = train_y.mean()  # observed bad flight rate (not defined in the excerpt; assumed here)
gbtThresh = np.percentile(gbtPreds1Test, 100 * (1 - obsRate))
gbtPredsCTest = 1 * (gbtPreds1Test > gbtThresh)

# check accuracy of model
acc = accuracy_score(gbtPredsCTest, test_y)
print(acc)
# 0.7742940199335548

# check lift
topDecile = test_y[gbtPreds1Test > np.percentile(gbtPreds1Test, 90)]
lift = sum(topDecile) / len(topDecile) / test_y.mean()
print(lift)
# 1.8591454794381614

# view confusion matrix
cm = (confusion_matrix(gbtPredsCTest, test_y) / len(test_y)).round(2)
print(cm)
# [[0.73 0.11]
#  [0.12 0.04]]

But could it be better? Perhaps there's more to be learned about flight patterns using other methods. An isolation forest is a tree-based anomaly detection method. It works by iteratively selecting a random feature from the input dataset, and a random split point along the range of that feature. It continues building a tree this way until each observation in the input dataset has been split into its own leaf. The idea is that anomalies, or data outliers, are different from other observations, so it's easier to isolate them with this pick-and-split process. Thus, observations that are isolated with just a few rounds of pick-and-split are considered anomalous, and those that can't be quickly separated from their neighbors are not.
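As a quick sanity check of that intuition, a toy example (invented data, not the flight dataset) shows an obvious outlier receiving the lowest score_samples value, since it is the easiest point to isolate:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# a tight cluster of 100 points plus one far-away outlier (hypothetical data)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size = (100, 2)), [[10.0, 10.0]]])

isf = IsolationForest(n_estimators = 100, random_state = 0).fit(X)
scores = isf.score_samples(X)  # lower = more anomalous

# the outlier (row 100) takes the fewest splits to isolate, so it scores lowest
print(scores.argmin())
# 100
```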

The isolation forest is an unsupervised method, so it can't be used to identify a particular kind of anomaly of the data scientist's own choosing (e.g. canceled, diverted, or very late flights). However, it can be useful for identifying observations that are different from the others in some unspecified way (e.g. flights on which something is different).

# build an isolation forest
isf = IsolationForest(n_estimators = 800
                      , max_samples = 0.15
                      , max_features = 0.1
                      , random_state = 412)

# fit it to the same training data
isf.fit(train_x)

# calculate the anomaly score of each test observation (lower values are more anomalous)
isfPreds1Test = isf.score_samples(test_x)

# use a custom threshold to convert these to binary labels
isfThresh = np.percentile(isfPreds1Test, 100 * (obsRate / 2))
isfPredsCTest = 1 * (isfPreds1Test < isfThresh)

Combining the anomaly scores with the supervised model's scores provides more insight.

# combine predictions, anomaly labels, and actual outcomes
comb = pd.concat([pd.Series(gbtPredsCTest), pd.Series(isfPredsCTest), pd.Series(test_y)]
                 , keys = ['Prediction', 'Outlier', 'badFlight']
                 , axis = 1)
comb = comb.assign(Correct = 1 * (comb.badFlight == comb.Prediction))

print(comb.mean())
# Prediction    0.159676
# Outlier       0.079942
# badFlight     0.153239
# Correct       0.774294
# dtype: float64

# better accuracy in the majority class
print(comb.groupby('badFlight').agg(accuracy = ('Correct', 'mean')))
#            accuracy
# badFlight
# 0.0        0.862923
# 1.0        0.284553

# more bad flights among outliers
print(comb.groupby('Outlier').agg(badFlightRate = ('badFlight', 'mean')))
#          badFlightRate
# Outlier
# 0             0.148951
# 1             0.202597

There are some things to notice right here. One is that the supervised mannequin is healthier at predicting “good” flights than “unhealthy” flights — it is a frequent dynamic in uncommon occasion prediction, and why it’s essential to take a look at metrics like precision and recall on prime of straightforward accuracy. Extra attention-grabbing is the truth that the “unhealthy flight” fee is sort of 1.5 instances increased amongst flights the isolation forest has categorised as anomalous. That’s despite the truth that the isolation forest is an unsupervised methodology and is figuring out atypical flights generally, reasonably than flights which might be atypical within the specific manner I’d wish to keep away from. This looks as if it should be priceless data for the supervised mannequin. The binary outlier flag is already in a great format to make use of as a predictor in my supervised mannequin, so I’ll feed it in and see if it improves mannequin efficiency.
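To make the precision/recall point concrete, here is a tiny hand-made example (the labels are invented, not drawn from the flight data):

```python
from sklearn.metrics import precision_score, recall_score

# invented labels: 4 actual bad flights, 3 predicted bad, 2 overlapping
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# precision: of the flights flagged bad, how many really were (2 of 3)
print(round(precision_score(y_true, y_pred), 3))
# 0.667

# recall: of the actual bad flights, how many were caught (2 of 4)
print(round(recall_score(y_true, y_pred), 3))
# 0.5
```

A model can post high accuracy on imbalanced data while its recall on the rare class stays poor, which is exactly the pattern in the per-class accuracy table above.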

# build a second model with the outlier labels as an input feature
isfPreds1Train = isf.score_samples(train_x)
isfPredsCTrain = 1 * (isfPreds1Train < isfThresh)

mn = isfPreds1Train.min(); rng = isfPreds1Train.max() - isfPreds1Train.min()
isfPreds1SCTrain = (isfPreds1Train - mn) / rng
isfPreds1SCTest = (isfPreds1Test - mn) / rng

train_2_x = (pd.concat([train_x, pd.Series(isfPredsCTrain)]
                       , axis = 1)
             .rename(columns = {0:'isfPreds1'}))
test_2_x = (pd.concat([test_x, pd.Series(isfPredsCTest)]
                      , axis = 1)
            .rename(columns = {0:'isfPreds1'}))

# build model
gbt2 = GradientBoostingClassifier(learning_rate = 0.1
                                  , n_estimators = 100
                                  , subsample = 0.7
                                  , max_depth = 5
                                  , random_state = 412)

# fit it to the training data
gbt2.fit(train_2_x, train_y)

# calculate the probability scores for each test observation
gbt2Preds1Test = gbt2.predict_proba(test_2_x)[:,1]

# use a custom threshold to convert these to binary scores
gbtThresh = np.percentile(gbt2Preds1Test, 100 * (1 - obsRate))
gbt2PredsCTest = 1 * (gbt2Preds1Test > gbtThresh)

# check accuracy of model
acc = accuracy_score(gbt2PredsCTest, test_y)
print(acc)
# 0.7796926910299004

# check lift
topDecile = test_y[gbt2Preds1Test > np.percentile(gbt2Preds1Test, 90)]
lift = sum(topDecile) / len(topDecile) / test_y.mean()
print(lift)
# 1.9138477764819217

# view confusion matrix
cm = (confusion_matrix(gbt2PredsCTest, test_y) / len(test_y)).round(2)
print(cm)
# [[0.73 0.11]
#  [0.11 0.05]]

Including outlier status as a predictor in the supervised model does in fact improve its top-decile lift by a few points. It seems that being "strange" in an undefined way is sufficiently correlated with my desired outcome to provide predictive power.

