The GPBoost algorithm extends linear mixed effects and Gaussian process models by replacing the linear fixed effects function with a non-parametric non-linear function modeled using tree-boosting. This article shows how the GPBoost algorithm implemented in the GPBoost library can be used for modeling data with a spatial and grouped structure. We demonstrate the functionality of the GPBoost library using European GDP data, which is an example of areal spatial econometric data.
Table of contents
∘ Introduction
· Table of contents
· Basic workflow of GPBoost
· Data description
· Data loading and short visualization
∘ Training a GPBoost model
∘ Choosing tuning parameters
∘ Model interpretation
· Estimated random effects model
· Spatial effect map
· Understanding the fixed effects function
∘ Extensions
· Separate random effects for different time periods
· Interaction between space and fixed effects predictor variables
· Large data
· Other spatial random effects models
· (Generalized) linear mixed effects and Gaussian process models
∘ References
Basic workflow of GPBoost
Applying a GPBoost model (= combined tree-boosting and random effects / GP models) involves the following main steps (a minimal end-to-end sketch follows this list):
- Define a GPModel in which you specify the following:
 – A random effects model (e.g., spatial random effects, grouped random effects, combined spatial and grouped, etc.)
 – The likelihood (= distribution of the response variable conditional on fixed and random effects)
- Create a gpb.Dataset with the response variable and fixed effects predictor variables
- Choose tuning parameters, e.g., using the function gpb.grid.search.tune.parameters
- Train the model with the gpboost / gpb.train function
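The following minimal sketch illustrates these steps end-to-end. It is a schematic only: coords, group, X, and y are placeholders for your own coordinates, grouping variable, fixed effects predictor matrix, and response vector.
library(gpboost)
# 1. Define the random effects model (spatial GP + grouped random effects)
#    and the likelihood; 'coords' and 'group' are placeholder objects
gp_model <- GPModel(gp_coords = coords, group_data = group,
                    cov_function = "exponential", likelihood = "gaussian")
# 2. Create a gpb.Dataset with the response and fixed effects predictors
boost_data <- gpb.Dataset(data = X, label = y)
# 3. Choose tuning parameters (fixed here for brevity; see the tuning section below)
params <- list(learning_rate = 0.01, max_depth = 2)
# 4. Train the combined model
model <- gpboost(data = boost_data, gp_model = gp_model,
                 nrounds = 100, params = params, verbose = 0)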
In this article, we demonstrate these steps using a real-world data set. Besides, we also show how to interpret fitted models, and we look at several extensions and additional features of the GPBoost library. All results are obtained using GPBoost version 1.2.1. This demo uses the R package, but the corresponding Python package provides the same functionality.
Data description
The data used in this demo is European gross domestic product (GDP) data. It can be downloaded from https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv. The data was collected by Massimo Giannini, University of Rome Tor Vergata, from Eurostat and kindly provided to Fabio Sigrist for a talk at the University of Rome Tor Vergata on June 16, 2023.
Data was collected for 242 European regions for the two years 2000 and 2021. I.e., the total number of data points is 484. The response variable is (log) GDP / capita. There are four predictor variables:
- L: (log) share of employment (empl/pop)
- K: (log) fixed capital / population
- Pop: log(population)
- Edu: share of tertiary education
Further, there are centroid spatial coordinates of every region (longitude and latitude), a spatial region ID for the 242 different European regions, and an additional spatial cluster ID which identifies the cluster the region belongs to (there are two clusters).
Data loading and short visualization
We first load the data and create a map illustrating the (log) GDP / capita over space. We create two maps: one with all data and another one excluding some remote islands. In the commented code below, we also show how to create a spatial plot when no shape file for spatial polygons is available.
library(gpboost)
library(ggplot2)
library(gridExtra)
library(viridis)
library(sf)
# Load data
data <- read.csv("https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv")
FID <- data$FID
data <- as.matrix(data[, names(data) != "FID"]) # convert to matrix since the boosting part currently does not support data.frames
covars <- c("L", "K", "pop", "edu")
# Load shape file for spatial plots
cur_tempfile <- tempfile()
download.file(url = "https://raw.githubusercontent.com/fabsig/GPBoost/master/data/shape_European_regions.zip", destfile = cur_tempfile)
out_directory <- tempfile()
unzip(cur_tempfile, exdir = out_directory)
shape <- st_read(dsn = out_directory)
# Create spatial plot of GDP
data_plot <- merge(shape, data.frame(FID = FID, y = data[,"y"]), by="FID")
p1 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="GDP / capita (log)", option = "B") +
  ggtitle("GDP / capita (log)")
# Sample plot excluding islands
p2 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="GDP / capita (log)", option = "B") +
  xlim(2700000,6400000) + ylim(1500000,5200000) +
  ggtitle("GDP / capita (log) -- excluding islands")
grid.arrange(p1, p2, ncol=2)
# # Spatial plot without a shape file
# p1 <- ggplot(data = data.frame(Lat=data[,"Lat"], Long=data[,"Long"],
#                                GDP=data[,"y"]), aes(x=Long,y=Lat,color=GDP)) +
#   geom_point(size=2, alpha=0.5) + scale_color_viridis(option = "B") +
#   ggtitle("GDP / capita (log)")
# p2 <- ggplot(data = data.frame(Lat=data[,"Lat"], Long=data[,"Long"],
#                                GDP=data[,"y"]), aes(x=Long,y=Lat,color=GDP)) +
#   geom_point(size=3, alpha=0.5) + scale_color_viridis(option = "B") +
#   ggtitle("GDP / capita (log) -- Europe excluding islands") + xlim(-10,28) + ylim(35,67)
# grid.arrange(p1, p2, ncol=2)
Training a GPBoost model
In the following, we use a Gaussian process model with an exponential covariance function to model spatial random effects. Additionally, we include grouped random effects for the cluster variable cl. In the GPBoost library, Gaussian process random effects are defined via the gp_coords argument and grouped random effects via the group_data argument of the GPModel constructor. The above-mentioned predictor variables are used in the fixed effects tree-ensemble function. We fit the model using the gpboost, or equivalently the gpb.train, function. Note that we use tuning parameters that are chosen below using cross-validation.
gp_model <- GPModel(group_data = data[, c("cl")],
                    gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 0)
nrounds <- 37
# gp_model$set_optim_params(params = list(trace=TRUE)) # To monitor hyperparameter estimation
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model,
                         nrounds = nrounds, params = params,
                         verbose = 1) # same as gpb.train
Choosing tuning parameters
It is important that tuning parameters are appropriately chosen for boosting. There are no universal default values and every data set will likely need different tuning parameters. Note that this can take some time. Instead of a fixed grid search as below, one can also do a random grid search to speed things up (see num_try_random). Below we show how tuning parameters can be chosen using the gpb.grid.search.tune.parameters function. We use the mean squared error (mse) as prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (test_neg_log_likelihood = default value if nothing is specified) which also takes prediction uncertainty into account.
gp_model <- GPModel(group_data = data[, "cl"],
                    gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
param_grid = list("learning_rate" = c(1,0.1,0.01),
                  "min_data_in_leaf" = c(10,100,1000),
                  "max_depth" = c(1,2,3,5,10),
                  "lambda_l2" = c(0,1,10))
other_params <- list(num_leaves = 2^10)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = NULL, nfold = 4,
                                              data = boost_data, gp_model = gp_model,
                                              nrounds = 1000, early_stopping_rounds = 10,
                                              verbose_eval = 1, metric = "mse") # metric = "test_neg_log_likelihood"
opt_params
# ***** New best test score (l2 = 0.0255393919591794) found for the following parameter combination: learning_rate: 0.01, min_data_in_leaf: 10, max_depth: 2, lambda_l2: 0, nrounds: 37
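The selected parameters can then be plugged back into training. A small sketch, assuming the list returned by gpb.grid.search.tune.parameters contains best_params and best_iter entries (an assumption about recent GPBoost versions, not shown in the output above):
# Re-train with the selected tuning parameters; the 'best_params' and
# 'best_iter' entry names are assumed here
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model,
                         nrounds = opt_params$best_iter,
                         params = opt_params$best_params, verbose = 0)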
Model interpretation
Estimated random effects model
Information on the estimated random effects model can be obtained by calling the summary function on the gp_model. For Gaussian processes, GP_var is the marginal variance, i.e., the amount of spatial correlation or structured spatial variation, and GP_range is the range, or scale, parameter that measures how fast correlation decays over space. For an exponential covariance function, three times this value (approx. 17 here) is the distance at which the (residual) spatial correlation is essentially zero (below 5%). As the results below show, the amount of spatial correlation is relatively small since the marginal variance of 0.0089 is small compared to the total variance of the response variable, which is approx. 0.29. Besides, the error term and the cl grouped random effects also have small variances (0.013 and 0.022). We thus conclude that most of the variance in the response variable is explained by the fixed effects predictor variables.
summary(gp_model)
## =====================================================
## Covariance parameters (random effects):
##             Param.
## Error_term  0.0131
## Group_1     0.0221
## GP_var      0.0089
## GP_range    5.7502
## =====================================================
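These conclusions can be verified with a quick back-of-the-envelope calculation; a small sketch using the numbers from the summary above:
# Effective range for an exponential covariance: spatial correlation
# drops below ~5% at roughly three times the range parameter
3 * 5.7502  # approx. 17.25
# Share of the total response variance attributed to the random effects
# (error term + cluster + spatial GP); the remainder is fixed effects
(0.0131 + 0.0221 + 0.0089) / var(data[, "y"])  # total variance is approx. 0.29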
Spatial effect map
We can plot the estimated (“predicted”) spatial random effects at the training data locations by calling the predict function on the training data; see the code below. Such a plot shows the spatial effect when factoring out the effect of the fixed effects predictor variables. Note that these spatial effects take into account both the spatial Gaussian process and the grouped region cluster random effects. If one wants to obtain only the spatial random effects from the Gaussian process part, one can use the predict_training_data_random_effects function (see the commented code below). Alternatively, one can also do spatial extrapolation (= “Kriging”), but this does not make a lot of sense for areal data; a minimal sketch is nonetheless shown after the plotting code below.
pred <- predict(gpboost_model, group_data_pred = data[1:242, c("cl")],
                gp_coords_pred = data[1:242, c("Long", "Lat")],
                data = data[1:242, covars], predict_var = TRUE, pred_latent = TRUE)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean), by="FID")
plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="Spatial effect", option = "B") +
  ggtitle("Spatial effect (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_cov), by="FID")
plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="Std. dev.", option = "B") +
  ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)
grid.arrange(plot_mu, plot_sd, ncol=2)
# Only spatial effects from the Gaussian process
# rand_effs <- predict_training_data_random_effects(gp_model, predict_var = TRUE)[1:242, c("GP", "GP_var")]
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[,1]), by="FID")
# plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
#   scale_fill_viridis(name="Spatial effect (mean)", option = "B") +
#   ggtitle("Spatial effect from Gaussian process (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[,2]), by="FID")
# plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
#   scale_fill_viridis(name="Uncertainty (std. dev.)", option = "B") +
#   ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)
# grid.arrange(plot_mu, plot_sd, ncol=2)
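For completeness, the following sketch shows what such Kriging at new locations could look like. The two prediction points and the reuse of the first two training rows as fixed effects inputs and cluster labels are purely hypothetical:
# Hypothetical new prediction locations (longitude / latitude)
coords_new <- cbind(Long = c(8, 16), Lat = c(47, 52))
# Placeholder fixed effects predictors and cluster labels for illustration
pred_new <- predict(gpboost_model, gp_coords_pred = coords_new,
                    group_data_pred = data[1:2, c("cl")],
                    data = data[1:2, covars],
                    predict_var = TRUE, pred_latent = TRUE)
pred_new$random_effect_mean # predicted latent spatial + grouped effects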
Understanding the fixed effects function
There are several tools for understanding the form of the fixed effects function. Below we consider variable importance measures, interaction measures, and dependence plots. Specifically, we look at
- SHAP values
- SHAP dependence plots
- Split-based variable importance
- Friedman’s H-statistic
- Partial dependence plots (one- and two-dimensional)
As the results below show, the information obtained from SHAP values and SHAP dependence plots is the same as that from traditional variable importance measures and partial dependence plots. The most important variables are ‘K’ and ‘edu’. From the dependence plots, we conclude that there are non-linearities. For instance, the effect of K is almost flat for large and small values of K and increasing in-between. Further, the effect of edu is steeper for small values of edu and flattens out for larger values of edu. Friedman’s H-statistic indicates that there are some interactions. For the two variables with the largest amount of interaction, L and pop, we create a two-dimensional partial dependence plot below.
# SHAP values
library(SHAPforxgboost)
shap.plot.summary.wrap1(gpboost_model, X = data[,covars]) + ggtitle("SHAP values")
# SHAP dependence plots
shap_long <- shap.prep(gpboost_model, X_train = data[,covars])
shap.plot.dependence(data_long = shap_long, x = covars[2],
                     color_feature = covars[4], smooth = FALSE, size = 2) +
  ggtitle("SHAP dependence plot for K")
shap.plot.dependence(data_long = shap_long, x = covars[4],
                     color_feature = covars[2], smooth = FALSE, size = 2) +
  ggtitle("SHAP dependence plot for edu")
# Split-based feature importances
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
                    main = "Split-based variable importances")
# H-statistic for interactions. Note: there are no interactions if max_depth = 1
library(flashlight)
fl <- flashlight(model = gpboost_model, data = data.frame(y = data[,"y"], data[,covars]),
                 y = "y", label = "gpb",
                 predict_fun = function(m, X) predict(m, data.matrix(X[,covars]),
                                                      gp_coords_pred = matrix(-100, ncol = 2, nrow = dim(X)[1]),
                                                      group_data_pred = matrix(-1, ncol = 1, nrow = dim(X)[1]),
                                                      pred_latent = TRUE)$fixed_effect)
plot(imp <- light_interaction(fl, v = covars, pairwise = TRUE)) +
  ggtitle("H interaction statistic") # takes a few seconds
# Partial dependence plots
gpb.plot.partial.dependence(gpboost_model, data[,covars], variable = 2, xlab = covars[2],
                            ylab = "gdp", main = "Partial dependence plot")
gpb.plot.partial.dependence(gpboost_model, data[,covars], variable = 4, xlab = covars[4],
                            ylab = "gdp", main = "Partial dependence plot")
# Two-dimensional partial dependence plot (to visualize interactions)
i = 1; j = 3 # i vs j
gpb.plot.part.dep.interact(gpboost_model, data[,covars], variables = c(i,j), xlab = covars[i],
                           ylab = covars[j], main = "Pairwise partial dependence plot")
Extensions
Separate random effects for different time periods
In the above model, we have used the same random effects for both years 2000 and 2021. Alternatively, one can also use independent spatial and grouped random effects for different time periods (independent under the prior; conditional on the data, there can be dependence ...). In the GPBoost library, this can be done via the cluster_ids argument. cluster_ids needs to be a vector of length equal to the sample size, and every entry indicates the “cluster” to which an observation belongs. Different clusters then have independent spatial and grouped random effects, but the hyperparameters (e.g., marginal variance, variance components, etc.) and the fixed effects function are the same across clusters.
Below we show how we can fit such a model and create two separate spatial maps. As the results below show, the spatial effects are quite different for the two years 2000 and 2021. In particular, there is less (residual) spatial variation (or heterogeneity) for the year 2021 compared to 2000. This is confirmed by the standard deviations of the predicted spatial random effects, which are almost twice as large for 2000 compared to the year 2021 (see below). A further conclusion is that in 2000, there were more regions with “low” GDP numbers (spatial effects below 0), and this is no longer the case for 2021.
gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential",
                    cluster_ids = c(rep(1,242), rep(2,242)))
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they should be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
                         params = params, verbose = 0)
# Separate spatial maps for the years 2000 and 2021
pred <- predict(gpboost_model, group_data_pred = data[, c("cl")],
                gp_coords_pred = data[, c("Long", "Lat")],
                data = data[, covars], predict_var = TRUE, pred_latent = TRUE,
                cluster_ids_pred = c(rep(1,242), rep(2,242)))
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[1:242]), by="FID")
plot_mu_2000 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="Spatial effect", option = "B") +
  ggtitle("Spatial effect for 2000 (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[243:484]), by="FID")
plot_mu_2021 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +
  scale_fill_viridis(name="Spatial effect", option = "B") +
  ggtitle("Spatial effect for 2021 (mean)") + xlim(2700000,6400000) + ylim(1500000,5200000)
grid.arrange(plot_mu_2000, plot_mu_2021, ncol=2)
sd(pred$random_effect_mean[1:242]) # 0.2321267
sd(pred$random_effect_mean[243:484]) # 0.1286398
Interaction between space and fixed effects predictor variables
In the above model, there is no interaction between the random effects and the fixed effects predictor variables. I.e., there is no interaction between the spatial coordinates and the fixed effects predictor variables. Such interaction can be modeled by additionally including the random effects input variables (= the spatial coordinates or the categorical grouping variable) in the fixed effects function. The code below shows how this can be done. As the variable importance plot below shows, the coordinates are not used in the tree-ensemble, and there is thus no such interaction detectable for this data set.
gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
covars_interact <- c(c("Long", "Lat"), covars) ## add spatial coordinates to fixed effects predictor variables
boost_data <- gpb.Dataset(data = data[, covars_interact], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they should be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
                         params = params, verbose = 0)
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
                    main = "Variable importances when including coordinates in the fixed effects")
Large data
For large data sets, calculations with Gaussian processes are slow, and one has to use an approximation. The GPBoost library implements several ones. For instance, setting gp_approx = "vecchia" in the GPModel constructor will use a Vecchia approximation. The data set used in this article is relatively small, and we can do all calculations exactly; a brief sketch of how such an approximation could be enabled is nonetheless shown below.
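This sketch assumes the num_neighbors argument for controlling the accuracy of the Vecchia approximation; check the GPModel documentation for the exact options in your version:
# Sketch: Vecchia approximation for large spatial data sets
# ('num_neighbors' trades accuracy against computational cost; assumed argument)
gp_model <- GPModel(gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential",
                    gp_approx = "vecchia", num_neighbors = 20)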
Other spatial random effects models
Above, we have used a Gaussian process to model spatial random effects. Since the data is areal data, another option is to use models that rely on neighborhood information such as CAR and SAR models. These models are currently not yet implemented in the GPBoost library (they might be added in the future -> contact the author).
Another option is to use a grouped random effects model defined by the categorical region ID variable for modeling spatial effects. The code below shows how this can be done in GPBoost. However, this model essentially ignores the spatial structure.
gp_model <- GPModel(group_data = data[, c("group", "cl")], likelihood = "gaussian")
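Training then proceeds exactly as before; a brief sketch reusing the boosting data and the tuning parameters from above (ideally, they would be re-selected for this model):
# Sketch: train the tree ensemble with grouped (region + cluster) random effects
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
gpboost_model_grouped <- gpboost(data = boost_data, gp_model = gp_model,
                                 nrounds = nrounds, params = params, verbose = 0)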
(Generalized) linear mixed effects and Gaussian process models
(Generalized) linear mixed effects and Gaussian process models can also be run in the GPBoost library. The code below shows how this can be done with the fitGPModel function. Note that one needs to manually add a column of 1’s to include an intercept.
X_incl_1 = cbind(Intercept = rep(1, dim(data)[1]), data[, covars])
gp_model <- fitGPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                       likelihood = "gaussian", cov_function = "exponential",
                       y = data[, "y"], X = X_incl_1, params = list(std_dev = TRUE))
summary(gp_model)
## =====================================================
## Model summary:
##  Log-lik      AIC      BIC
##   195.97  -373.94  -336.30
## Nb. observations: 484
## Nb. groups: 2 (Group_1)
## -----------------------------------------------------
## Covariance parameters (random effects):
##             Param.  Std. dev.
## Error_term  0.0215     0.0017
## Group_1     0.0003     0.0008
## GP_var      0.0126     0.0047
## GP_range    7.2823     3.6585
## -----------------------------------------------------
## Linear regression coefficients (fixed effects):
##             Param.  Std. dev.  z value  P(>|z|)
## Intercept  16.2816     0.4559  35.7128   0.0000
## L           0.4243     0.0565   7.5042   0.0000
## K           0.6493     0.0212  30.6755   0.0000
## pop         0.0134     0.0097   1.3812   0.1672
## edu         0.0100     0.0009  10.9645   0.0000
## =====================================================