The GPBoost algorithm extends linear blended results and Gaussian course of fashions by changing the linear fastened results perform with a non-parametric non-linear perform modeled utilizing tree-boosting. This text exhibits how the GPBoost algorithm applied within the `GPBoost`

library can be utilized for modeling information with a spatial and grouped construction. We show the performance of the `GPBoost`

library utilizing European GDP information which is an instance of areal spatial econometric information.

## Desk of contents

**∘ ****Introduction**

· Table of contents

· Basic workflow of GPBoost

· Data description

· Data loading and short visualization**∘ ****Training a GPBoost model**

∘ **Choosing tuning parameters**

∘ **Model interpretation**

· Estimated random effects model

· Spatial effect map

· Understanding the fixed effects function**∘ ****Extensions**

· Separate random effects for different time periods

· Interaction between space and fixed effects predictor variables

· Large data

· Other spatial random effects models

· (Generalized) linear mixed effects and Gaussian process models

∘** ****References**

## Primary workflow of GPBoost

Making use of a GPBoost mannequin (= mixed tree-boosting and random results / GP fashions) includes the next principal steps:

- Outline a
`GPModel`

through which you specify the next:

– A random results mannequin (e.g., spatial random results, grouped random results, mixed spatial and grouped, and so on.)

– The chance (= distribution of the response variable conditional on fastened and random results) - Create a
`gpb.Dataset`

with the response variable and stuck results predictor variables - Select tuning parameters, e.g., utilizing the perform
`gpb.grid.search.tune.parameters`

- Practice the mannequin with the
`gpboost / gpb.prepare`

perform

On this article, we show these steps utilizing an actual world information set. Apart from, we additionally present the best way to interpret fitted fashions, and we take a look at a number of extensions and extra options of the `GPBoost`

library. All outcomes are obtained utilizing `GPBoost`

model 1.2.1. This demo makes use of the R package deal, however the corresponding Python package deal supplies the identical performance.

## Knowledge description

The info used on this demo is European gross home product (GDP) information. It could downloaded from https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv. The info was collected by Massimo Giannini, College of Rome Tor Vergata, from Eurostat and kindly supplied to Fabio Sigrist for a chat on the College of Rome Tor Vergata on June 16, 2023.

Knowledge was collected for 242 European areas for the 2 years 2000 and 2021. I.e., the full variety of information factors is 484. The response variable is (log) GDP / capita. There are 4 predictor variables:

- L: (log) share of employment (empl/pop)
- Okay: (log) fastened capital/inhabitants
- Pop: log(inhabitants)
- Edu: share of tertiary schooling

Additional, there are centroid spatial coordinates of each area (longitude and latitude), a spatial area ID for the 242 totally different European areas, and an extra spatial cluster ID which identifies the cluster the area belongs (there are two clusters).

## Knowledge loading and quick visualization

We first load the information and create a map illustrating the (log) GDP / capita over house. We create two maps: one with all information and one other one when excluding some distant islands. Within the commented code beneath, we additionally present the best way to create a spatial plot when no form file for spatial polygons is obtainable.

`library(gpboost)`

library(ggplot2)

library(gridExtra)

library(viridis)

library(sf)# Load information

information <- learn.csv("https://uncooked.githubusercontent.com/fabsig/GPBoost/grasp/information/gdp_european_regions.csv")

FID <- information$FID

information <- as.matrix(information[,names(data)!="FID"]) # convert to matrix because the boosting half at the moment doesn't assist information.frames

covars <- c("L", "Okay", "pop", "edu")

# Load form file for spatial plots

cur_tempfile <- tempfile()

obtain.file(url = "https://uncooked.githubusercontent.com/fabsig/GPBoost/grasp/information/shape_European_regions.zip", destfile = cur_tempfile)

out_directory <- tempfile()

unzip(cur_tempfile, exdir = out_directory)

form <- st_read(dsn = out_directory)

# Create spatial plot of GDP

data_plot <- merge(form, information.body(FID = FID, y = information[,"y"]), by="FID")

p1 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +

scale_fill_viridis(identify="GDP / capita (log)", possibility = "B") +

ggtitle("GDP / capita (log)")

# Pattern plot excluding islands

p2 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +

scale_fill_viridis(identify="GDP / capita (log)", possibility = "B") +

xlim(2700000,6400000) + ylim(1500000,5200000) +

ggtitle("GDP / capita (log) -- excluding islands")

grid.prepare(p1, p2, ncol=2)

# # Spatial plot with out a form file

# p1 <- ggplot(information = information.body(Lat=information[,"Lat"], Lengthy=information[,"Long"],

# GDP=information[,"y"]), aes(x=Lengthy,y=Lat,colour=GDP)) +

# geom_point(measurement=2, alpha=0.5) + scale_color_viridis(possibility = "B") +

# ggtitle("GDP / capita (log)")

# p2 <- ggplot(information = information.body(Lat=information[,"Lat"], Lengthy=information[,"Long"],

# GDP=information[,"y"]), aes(x=Lengthy,y=Lat,colour=GDP)) +

# geom_point(measurement=3, alpha=0.5) + scale_color_viridis(possibility = "B") +

# ggtitle("GDP / capita (log) -- Europe excluding islands") + xlim(-10,28) + ylim(35,67)

# grid.prepare(p1, p2, ncol=2)

Within the following, we use a Gaussian course of mannequin with an exponential covariance perform to mannequin spatial random results. Moreover, we embrace grouped random results for the cluster variable cl. Within the `GPBoost`

library, Gaussian course of random results are outlined by the `gp_coords`

argument and grouped random results by way of the `group_data`

argument of the `GPModel`

constructor. The above-mentioned predictor variables are used within the fastened results tree-ensemble perform. We match the mannequin utilizing the `gpboost`

, or equivalently the `gpb.prepare`

, perform. Observe that we use tuning parameters which can be chosen beneath utilizing cross-validation.

`gp_model <- GPModel(group_data = information[, c("cl")], `

gp_coords = information[, c("Long", "Lat")],

chance = "gaussian", cov_function = "exponential")

params <- listing(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,

min_data_in_leaf = 10, lambda_l2 = 0)

nrounds <- 37

# gp_model$set_optim_params(params = listing(hint=TRUE)) # To observe hzperparameter estimation

boost_data <- gpb.Dataset(information = information[, covars], label = information[, "y"])

gpboost_model <- gpboost(information = boost_data, gp_model = gp_model,

nrounds = nrounds, params = params,

verbose = 1) # similar as gpb.prepare# similar as gpb.prepare gpb.prepare

It is vital that tuning parameters are appropriately chosen for enhancing. There aren’t any common default values and each information set will possible want totally different tuning parameters. Observe that this will take a while. As a substitute of a hard and fast grid search as beneath, one may also do a random grid search to hurry issues up (see `num_try_random`

). Under we present how tuning parameters may be chosen utilizing the `gpb.grid.search.tune.parameters`

perform. We use the imply sq. error (`mse`

) as prediction accuracy measure on the validation information. Alternatively, one may also use, e.g., the take a look at detrimental log-likelihood (`test_neg_log_likelihood`

= default worth if nothing is specified) which additionally takes prediction uncertainty under consideration.

`gp_model <- GPModel(group_data = information[, "cl"], `

gp_coords = information[, c("Long", "Lat")],

chance = "gaussian", cov_function = "exponential")

boost_data <- gpb.Dataset(information = information[, covars], label = information[, "y"])

param_grid = listing("learning_rate" = c(1,0.1,0.01),

"min_data_in_leaf" = c(10,100,1000),

"max_depth" = c(1,2,3,5,10),

"lambda_l2" = c(0,1,10))

other_params <- listing(num_leaves = 2^10)

set.seed(1)

opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,

num_try_random = NULL, nfold = 4,

information = boost_data, gp_model = gp_model,

nrounds = 1000, early_stopping_rounds = 10,

verbose_eval = 1, metric = "mse") # metric = "test_neg_log_likelihood"

opt_params

# ***** New greatest take a look at rating (l2 = 0.0255393919591794) discovered for the next parameter mixture: learning_rate: 0.01, min_data_in_leaf: 10, max_depth: 2, lambda_l2: 0, nrounds: 37

## Estimated random results mannequin

Data on the estimated random results mannequin may be obtained by calling the `abstract`

perform on the `gp_model`

. For Gaussian processes, `GP_var`

is the marginal variance, i.e., the quantity of spatial correlation or construction spatial variation, and `GP_range`

is the vary, or scale, parameter that measures how briskly correlation decays over house. For an exponential covariance perform, thrice this worth (approx. 17 right here) is the gap the place the (residual) spatial correlation is actually zero (beneath 5%). Because the outcomes beneath present, the quantity of spatial correlation is comparatively small because the marginal variance of 0.0089 is small in comparison with the full variance of the response variable which is approx. 0.29. Moreover the error time period and the cl grouped random results even have small variances (0.013 and 0.022). We thus conclude that many of the variance within the response variable is defined by the fastened results predictor variables.

`abstract(gp_model)`

`## =====================================================`

## Covariance parameters (random results):

## Param.

## Error_term 0.0131

## Group_1 0.0221

## GP_var 0.0089

## GP_range 5.7502

## =====================================================

## Spatial impact map

We are able to plot the estimated (“predicted”) spatial random results on the coaching information places by calling the `predict`

perform on the coaching information; see the code beneath. Such a plot exhibits the spatial impact when factoring out the impact of the fastened results predictor variables. Observe that these spatial results bear in mind each the spatial Gaussian course of and the grouped area cluster random results. If one desires to acquire solely spatial random results from the Gaussian course of half, one can use the `predict_training_data_random_effects`

perform (see the commented code beneath). Alternatively, one may also do spatial extrapolation (=“Krigging”), however this makes not plenty of sense for areal information.

`pred <- predict(gpboost_model, group_data_pred = information[1:242, c("cl")], `

gp_coords_pred = information[1:242, c("Long", "Lat")],

information = information[1:242, covars], predict_var = TRUE, pred_latent = TRUE)

data_plot <- merge(form, information.body(FID = FID, y = pred$random_effect_mean), by="FID")

plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +

scale_fill_viridis(identify="Spatial impact", possibility = "B") +

ggtitle("Spatial impact (imply)") + xlim(2700000,6400000) + ylim(1500000,5200000)

data_plot <- merge(form, information.body(FID = FID, y = pred$random_effect_cov), by="FID")

plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +

scale_fill_viridis(identify="Std. dev.", possibility = "B") +

ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)

grid.prepare(plot_mu, plot_sd, ncol=2)# Solely spatial effecst from the Gaussian course of

# rand_effs <- predict_training_data_random_effects(gp_model, predict_var = TRUE)[1:242, c("GP", "GP_var")]

# data_plot <- merge(form, information.body(FID = FID, y = rand_effs[,1]), by="FID")

# plot_mu <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +

# scale_fill_viridis(identify="Spatial impact (imply)", possibility = "B") +

# ggtitle("Spatial impact from Gausian course of (imply)") + xlim(2700000,6400000) + ylim(1500000,5200000)

# data_plot <- merge(form, information.body(FID = FID, y = rand_effs[,2]), by="FID")

# plot_sd <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +

# scale_fill_viridis(identify="Uncertainty (std. dev.)", possibility = "B") +

# ggtitle("Uncertainty (std. dev.)") + xlim(2700000,6400000) + ylim(1500000,5200000)

# grid.prepare(plot_mu, plot_sd, ncol=2)

## Understanding the fastened results perform

There are a number of instruments for understanding the type of the fastened results perform. Under we take into account variable significance measures, interplay measures, and dependence plots. Particularly, we take a look at

- SHAP values
- SHAP dependence plots
- Break up-based variable significance
- Friedman’s H-statistic
- Partial dependence plots (one and two-dimensional)

Because the outcomes beneath present, the knowledge obtained from SHAP values and SHAP dependence plots is similar as when conventional variable significance measures and partial dependence plots. A very powerful variables are ‘Okay’ and ‘edu’. From the dependence plots, we conclude that there are non-linearities. As an illustration, the impact of Okay is sort of flat for giant and small values of Okay and growing in-between. Additional, the impact of edu is steeper for small values of edu and flattens for bigger values of edu. Friedman’s H-statistic signifies that there are some interactions. For the 2 variables with the biggest quantity of interplay, L and pop, we create a two-dimensional partial dependence plot beneath.

`# SHAP values`

library(SHAPforxgboost)

shap.plot.abstract.wrap1(gpboost_model, X = information[,covars]) + ggtitle("SHAP values")

`# SHAP dependence plots`

shap_long <- shap.prep(gpboost_model, X_train = information[,covars])

shap.plot.dependence(data_long = shap_long, x = covars[2],

color_feature = covars[4], easy = FALSE, measurement = 2) +

ggtitle("SHAP dependence plot for Okay")

shap.plot.dependence(data_long = shap_long, x = covars[4],

color_feature = covars[2], easy = FALSE, measurement = 2) +

ggtitle("SHAP dependence plot for edu")

`# Break up-based function importances`

feature_importances <- gpb.significance(gpboost_model, share = TRUE)

gpb.plot.significance(feature_importances, top_n = 25, measure = "Acquire",

principal = "Break up-based variable importances")

`# H-statistic for interactions. Observe: there aren't any interactions if max_depth = 1`

library(flashlight)

fl <- flashlight(mannequin = gpboost_model, information = information.body(y = information[,"y"], information[,covars]),

y = "y", label = "gpb",

predict_fun = perform(m, X) predict(m, information.matrix(X[,covars]),

gp_coords_pred = matrix(-100, ncol = 2, nrow = dim(X)[1]),

group_data_pred = matrix(-1, ncol = 1, nrow = dim(X)[1]),

pred_latent = TRUE)$fixed_effect)

plot(imp <- light_interaction(fl, v = covars, pairwise = TRUE)) +

ggtitle("H interplay statistic") # takes just a few seconds

`# Partial dependence plots`

gpb.plot.partial.dependence(gpboost_model, information[,covars], variable = 2, xlab = covars[2],

ylab = "gdp", principal = "Partial dependence plot" )

gpb.plot.partial.dependence(gpboost_model, information[,covars], variable = 4, xlab = covars[4],

ylab = "gdp", principal = "Partial dependence plot" )

`# Two-dimensional partial dependence plot (to visualise interactions)`

i = 1; j = 3;# i vs j

gpb.plot.half.dep.work together(gpboost_model, information[,covars], variables = c(i,j), xlab = covars[i],

ylab = covars[j], principal = "Pairwise partial dependence plot")

## Separate random results for various time durations

Within the above mannequin, now we have used the identical random results for each years 2000 and 2021. Alternatively, one may also use impartial spatial and grouped random results for various time durations (*impartial underneath the prior, conditional on the information, there may be dependence …*). Within the `GPBoost`

library, this may be completed by way of the `cluster_ids`

argument. `cluster_ids`

must be a vector of size the pattern measurement, and each entry signifies the “cluster” to which an remark belongs to. Completely different clusters then have impartial spatial and grouped random results, however the hyperparameters (e.g., marginal variance, variance parts, and so on.) and the fastened results perform are the identical throughout clusters.

Under we present how we will match such a mannequin and create two separate spatial maps. Because the outcomes beneath present, the spatial results are fairly totally different for the 2 years 2000 and 2021. Specifically, there may be much less (residual) spatial variation (or heterogeneity) for the yr 2021 in comparison with 2000. That is confirmed by normal deviations of the expected spatial random results which is sort of twice as massive for 2000 in comparison with the yr 2021 (see beneath). An additional conclusion is that in 2000, there have been extra areas with “low” GDP numbers (spatial results beneath 0), and that is not the case for 2021.

`gp_model <- GPModel(group_data = information[, c("cl")], gp_coords = information[, c("Long", "Lat")],`

chance = "gaussian", cov_function = "exponential",

cluster_ids = c(rep(1,242), rep(2,242)))

boost_data <- gpb.Dataset(information = information[, covars], label = information[, "y"])

params <- listing(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,

min_data_in_leaf = 10, lambda_l2 = 1)

# Observe: we use the identical tuning parameters as above. Ideally, the must be chosen once more

gpboost_model <- gpboost(information = boost_data, gp_model = gp_model, nrounds = nrounds,

params = params, verbose = 0)

# Separate spatial maps for the years 2000 and 2021

pred <- predict(gpboost_model, group_data_pred = information[, c("cl")],

gp_coords_pred = information[, c("Long", "Lat")],

information = information[, covars], predict_var = TRUE, pred_latent = TRUE,

cluster_ids_pred = c(rep(1,242), rep(2,242)))

data_plot <- merge(form, information.body(FID = FID, y = pred$random_effect_mean[1:242]), by="FID")

plot_mu_2000 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +

scale_fill_viridis(identify="Spatial impact", possibility = "B") +

ggtitle("Spatial impact for 2000 (imply)") + xlim(2700000,6400000) + ylim(1500000,5200000)

data_plot <- merge(form, information.body(FID = FID, y = pred$random_effect_mean[243:484]), by="FID")

plot_mu_2021 <- ggplot(data_plot) + geom_sf(aes(group=FID, fill=y)) +

scale_fill_viridis(identify="Spatial impact", possibility = "B") +

ggtitle("Spatial impact for 2021 (imply)") + xlim(2700000,6400000) + ylim(1500000,5200000)

grid.prepare(plot_mu_2000, plot_mu_2021, ncol=2)

sd(pred$random_effect_mean[1:242]) # 0.2321267

sd(pred$random_effect_mean[243:484]) # 0.1286398

## Interplay between house and stuck results predictor variables

Within the above mannequin, there isn’t a interplay between the random results and the fastened results predictor variables. I.e., there isn’t a interplay between the spatial coordinates and the fastened results predictor variables. Such interplay may be modeled by moreover together with the random results enter variables (= the spatial coordinates or the specific grouping variable) within the fastened results perform. The code beneath exhibits how this may be completed. Because the variable significance plot beneath exhibits, the coordinates usually are not used within the tree-ensemble, and there may be thus no such interplay detectable for this information set.

`gp_model <- GPModel(group_data = information[, c("cl")], gp_coords = information[, c("Long", "Lat")],`

chance = "gaussian", cov_function = "exponential")

covars_interact <- c(c("Lengthy", "Lat"), covars) ## add spatial coordinates to fastened results predictor variables

boost_data <- gpb.Dataset(information = information[, covars_interact], label = information[, "y"])

params <- listing(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,

min_data_in_leaf = 10, lambda_l2 = 1)

# Observe: we use the identical tuning parameters as above. Ideally, the must be chosen once more

gpboost_model <- gpboost(information = boost_data, gp_model = gp_model, nrounds = nrounds,

params = params, verbose = 0)

feature_importances <- gpb.significance(gpboost_model, share = TRUE)

gpb.plot.significance(feature_importances, top_n = 25, measure = "Acquire",

principal = "Variable importances when together with coordinates within the fastened results")

## Giant information

For giant information units, calculations with Gaussian processes are sluggish, and one has to make use of an approximation. The `GPBoost`

library implements a number of ones. As an illustration, setting `gp_approx="vecchia"`

within the `GPModel`

constructor will use a Vecchia approximation. The info set used on this article is comparatively small, and we will do all calculations precisely.

## Different spatial random results fashions

Above, now we have used a Gaussian course of to mannequin spatial random results. For the reason that information is areal information, an alternative choice is to make use of fashions that depend on neighborhood data reminiscent of CAR and SAR fashions. These fashions are at the moment not but applied within the `GPBoost`

library (*is perhaps added sooner or later -> contact the writer*).

An alternative choice is to make use of a grouped random results mannequin outlined by the specific area ID variable for modeling spatial results. The code beneath exhibits how this may be completed in `GPBoost`

. Nonetheless, this mannequin primarily ignores the spatial construction.

`gp_model <- GPModel(group_data = information[, c("group", "cl")], chance = "gaussian")`

## (Generalized) linear blended results and Gaussian course of fashions

(Generalized) linear blended results and Gaussian course of fashions can be run within the `GPBoost`

library. The code beneath exhibits how this may be completed with the `fitGPModel`

perform. Observe that one must manually add a column of 1’s to incorporate an intercept.

`X_incl_1 = cbind(Intercept = rep(1,dim(information)[1]), information[, covars])`

gp_model <- fitGPModel(group_data = information[, c("cl")], gp_coords = information[, c("Long", "Lat")],

chance = "gaussian", cov_function = "exponential",

y = information[, "y"], X = X_incl_1, params = listing(std_dev=TRUE))

abstract(gp_model)

`## =====================================================`

## Mannequin abstract:

## Log-lik AIC BIC

## 195.97 -373.94 -336.30

## Nb. observations: 484

## Nb. teams: 2 (Group_1)

## -----------------------------------------------------

## Covariance parameters (random results):

## Param. Std. dev.

## Error_term 0.0215 0.0017

## Group_1 0.0003 0.0008

## GP_var 0.0126 0.0047

## GP_range 7.2823 3.6585

## -----------------------------------------------------

## Linear regression coefficients (fastened results):

## Param. Std. dev. z worth P(>|z|)

## Intercept 16.2816 0.4559 35.7128 0.0000

## L 0.4243 0.0565 7.5042 0.0000

## Okay 0.6493 0.0212 30.6755 0.0000

## pop 0.0134 0.0097 1.3812 0.1672

## edu 0.0100 0.0009 10.9645 0.0000

## =====================================================