
Mixed effects machine learning for spatial econometric data


A demo using European GDP data

European GDP data: spatial effect and SHAP dependence plot for the predictor variable ‘edu’ — Image by Author

The GPBoost algorithm extends linear mixed effects and Gaussian process models by replacing the linear fixed effects function with a non-parametric non-linear function modeled using tree-boosting. This article shows how the GPBoost algorithm implemented in the GPBoost library can be used for modeling data with a spatial and grouped structure. We demonstrate the functionality of the GPBoost library using European GDP data, which is an example of areal spatial econometric data.

Table of contents

Introduction
· Table of contents
· Basic workflow of GPBoost
· Data description
· Data loading and short visualization
Training a GPBoost model
Choosing tuning parameters
Model interpretation
· Estimated random effects model
· Spatial effect map
· Understanding the fixed effects function
Extensions
· Separate random effects for different time periods
· Interaction between space and fixed effects predictor variables
· Large data
· Other spatial random effects models
· (Generalized) linear mixed effects and Gaussian process models
References

Basic workflow of GPBoost

Applying a GPBoost model (= combined tree-boosting and random effects / GP models) involves the following main steps (a minimal skeleton is sketched after the list):

  1. Define a GPModel in which you specify the following:
    – A random effects model (e.g., spatial random effects, grouped random effects, combined spatial and grouped, etc.)
    – The likelihood (= distribution of the response variable conditional on the fixed and random effects)
  2. Create a gpb.Dataset with the response variable and fixed effects predictor variables
  3. Choose tuning parameters, e.g., using the function gpb.grid.search.tune.parameters
  4. Train the model with the gpboost / gpb.train function
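
A minimal skeleton of these four steps might look as follows; the objects X, y, and group are placeholders for the fixed effects predictors, the response variable, and a grouping variable:

library(gpboost)
# 1. Define a GPModel: here grouped random effects and a Gaussian likelihood
gp_model <- GPModel(group_data = group, likelihood = "gaussian")
# 2. Create a gpb.Dataset with the response and fixed effects predictors
boost_data <- gpb.Dataset(data = X, label = y)
# 3. Choose tuning parameters (see gpb.grid.search.tune.parameters below)
params <- list(learning_rate = 0.01, max_depth = 2)
# 4. Train the model
model <- gpboost(data = boost_data, gp_model = gp_model,
                 nrounds = 100, params = params)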

In this article, we demonstrate these steps using a real-world data set. Besides this, we also show how to interpret fitted models, and we look at several extensions and additional features of the GPBoost library. All results were obtained using GPBoost version 1.2.1. This demo uses the R package, but the corresponding Python package provides the same functionality.
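
The gpboost R package is available from CRAN, so, if needed, it can be installed with:

install.packages("gpboost")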

Data description

The data used in this demo is European gross domestic product (GDP) data. It can be downloaded from https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv. The data was collected by Massimo Giannini, University of Rome Tor Vergata, from Eurostat and kindly provided to Fabio Sigrist for a talk at the University of Rome Tor Vergata on June 16, 2023.

Data was collected for 242 European regions for the two years 2000 and 2021, i.e., the total number of data points is 484. The response variable is (log) GDP / capita. There are four predictor variables:

  • L: (log) share of employment (empl / pop)
  • K: (log) fixed capital / population
  • pop: log(population)
  • edu: share of tertiary education

Further, there are centroid spatial coordinates of every region (longitude and latitude), a spatial region ID for the 242 different European regions, and an additional spatial cluster ID which identifies the cluster a region belongs to (there are two clusters).
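
Once the data is loaded (see the next section), a quick structural check might look as follows; the column names shown are the ones used throughout this demo:

dim(data)  # 484 rows = 242 regions x 2 years
head(data[, c("y", "L", "K", "pop", "edu", "Long", "Lat", "cl")])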

Data loading and short visualization

We first load the data and create a map illustrating (log) GDP / capita over space. We create two maps: one with all data and another one excluding some remote islands. In the commented code below, we also show how to create a spatial plot when no shape file with spatial polygons is available.

library(gpboost)
library(ggplot2)
library(gridExtra)
library(viridis)
library(sf)

# Load data
data <- read.csv("https://raw.githubusercontent.com/fabsig/GPBoost/master/data/gdp_european_regions.csv")
FID <- data$FID
data <- as.matrix(data[, names(data) != "FID"]) # convert to matrix since the boosting part currently does not support data.frames
covars <- c("L", "K", "pop", "edu")
# Load shape file for spatial plots
cur_tempfile <- tempfile()
download.file(url = "https://raw.githubusercontent.com/fabsig/GPBoost/master/data/shape_European_regions.zip", destfile = cur_tempfile)
out_directory <- tempfile()
unzip(cur_tempfile, exdir = out_directory)
shape <- st_read(dsn = out_directory)

# Create spatial plot of GDP
data_plot <- merge(shape, data.frame(FID = FID, y = data[, "y"]), by = "FID")
p1 <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "GDP / capita (log)", option = "B") +
  ggtitle("GDP / capita (log)")
# Same plot excluding islands
p2 <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "GDP / capita (log)", option = "B") +
  xlim(2700000, 6400000) + ylim(1500000, 5200000) +
  ggtitle("GDP / capita (log) -- excluding islands")
grid.arrange(p1, p2, ncol = 2)

# # Spatial plot without a shape file
# p1 <- ggplot(data = data.frame(Lat = data[, "Lat"], Long = data[, "Long"],
#                                GDP = data[, "y"]), aes(x = Long, y = Lat, color = GDP)) +
#   geom_point(size = 2, alpha = 0.5) + scale_color_viridis(option = "B") +
#   ggtitle("GDP / capita (log)")
# p2 <- ggplot(data = data.frame(Lat = data[, "Lat"], Long = data[, "Long"],
#                                GDP = data[, "y"]), aes(x = Long, y = Lat, color = GDP)) +
#   geom_point(size = 3, alpha = 0.5) + scale_color_viridis(option = "B") +
#   ggtitle("GDP / capita (log) -- Europe excluding islands") + xlim(-10, 28) + ylim(35, 67)
# grid.arrange(p1, p2, ncol = 2)

log GDP / capita for 242 European regions — Image by Author

Training a GPBoost model

In the following, we use a Gaussian process model with an exponential covariance function to model spatial random effects. Additionally, we include grouped random effects for the cluster variable cl. In the GPBoost library, Gaussian process random effects are defined via the gp_coords argument and grouped random effects via the group_data argument of the GPModel constructor. The above-mentioned predictor variables are used in the fixed effects tree-ensemble function. We fit the model using the gpboost, or equivalently the gpb.train, function. Note that we use tuning parameters that were chosen below using cross-validation.

gp_model <- GPModel(group_data = data[, c("cl")],
                    gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
params <- list(learning_rate = 0.01, max_depth = 2, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 0)
nrounds <- 37
# gp_model$set_optim_params(params = list(trace = TRUE)) # To monitor hyperparameter estimation
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model,
                         nrounds = nrounds, params = params,
                         verbose = 1) # same as gpb.train

Choosing tuning parameters

It is important that tuning parameters are appropriately chosen for boosting. There are no universal default values, and every data set will likely need different tuning parameters. Note that this can take some time. Instead of a full grid search as below, one can also do a random grid search to speed things up (see num_try_random; a sketch follows the output below). Below we show how tuning parameters can be chosen using the gpb.grid.search.tune.parameters function. We use the mean squared error (mse) as the prediction accuracy measure on the validation data. Alternatively, one can also use, e.g., the test negative log-likelihood (test_neg_log_likelihood = default value if nothing is specified), which also takes prediction uncertainty into account.

gp_model <- GPModel(group_data = data[, "cl"],
                    gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
param_grid = list("learning_rate" = c(1, 0.1, 0.01),
                  "min_data_in_leaf" = c(10, 100, 1000),
                  "max_depth" = c(1, 2, 3, 5, 10),
                  "lambda_l2" = c(0, 1, 10))
other_params <- list(num_leaves = 2^10)
set.seed(1)
opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = NULL, nfold = 4,
                                              data = boost_data, gp_model = gp_model,
                                              nrounds = 1000, early_stopping_rounds = 10,
                                              verbose_eval = 1, metric = "mse") # metric = "test_neg_log_likelihood"
opt_params
# ***** New best test score (l2 = 0.0255393919591794) found for the following parameter combination: learning_rate: 0.01, min_data_in_leaf: 10, max_depth: 2, lambda_l2: 0, nrounds: 37
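
To speed things up, the same search can be run as a random grid search; a sketch with, e.g., 10 randomly chosen parameter combinations over the same grid:

opt_params <- gpb.grid.search.tune.parameters(param_grid = param_grid, params = other_params,
                                              num_try_random = 10, nfold = 4,
                                              data = boost_data, gp_model = gp_model,
                                              nrounds = 1000, early_stopping_rounds = 10,
                                              verbose_eval = 1, metric = "mse")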

Model interpretation

Estimated random effects model

Information on the estimated random effects model can be obtained by calling the summary function on the gp_model. For Gaussian processes, GP_var is the marginal variance, i.e., the amount of spatial correlation or structured spatial variation, and GP_range is the range, or scale, parameter that measures how fast correlation decays over space. For an exponential covariance function, three times this value (approx. 17 here) is the distance at which the (residual) spatial correlation is essentially zero (below 5%). As the results below show, the amount of spatial correlation is relatively small since the marginal variance of 0.0089 is small compared to the total variance of the response variable, which is approx. 0.29. Furthermore, the error term and the cl grouped random effects also have small variances (0.013 and 0.022). We thus conclude that most of the variance in the response variable is explained by the fixed effects predictor variables.

summary(gp_model)
## =====================================================
## Covariance parameters (random effects):
##             Param.
## Error_term  0.0131
## Group_1     0.0221
## GP_var      0.0089
## GP_range    5.7502
## =====================================================
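
The effective-range rule of thumb stated above can be checked with a one-line computation:

exp(-3)     # ~0.0498: for an exponential covariance, correlation at 3 * range is ~5%
3 * 5.7502  # ~17.3: distance beyond which residual spatial correlation is essentially zero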

Spatial effect map

We can plot the estimated (“predicted”) spatial random effects at the training data locations by calling the predict function on the training data; see the code below. Such a plot shows the spatial effect when factoring out the effect of the fixed effects predictor variables. Note that these spatial effects take into account both the spatial Gaussian process and the grouped region cluster random effects. If one wants to obtain only the spatial random effects from the Gaussian process part, one can use the predict_training_data_random_effects function (see the commented code below). Alternatively, one can also do spatial extrapolation (= “Kriging”), but this does not make a lot of sense for areal data.

pred <- predict(gpboost_model, group_data_pred = data[1:242, c("cl")],
                gp_coords_pred = data[1:242, c("Long", "Lat")],
                data = data[1:242, covars], predict_var = TRUE, pred_latent = TRUE)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean), by = "FID")
plot_mu <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "Spatial effect", option = "B") +
  ggtitle("Spatial effect (mean)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_cov), by = "FID")
plot_sd <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "Std. dev.", option = "B") +
  ggtitle("Uncertainty (std. dev.)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
grid.arrange(plot_mu, plot_sd, ncol = 2)

# Only spatial effects from the Gaussian process
# rand_effs <- predict_training_data_random_effects(gp_model, predict_var = TRUE)[1:242, c("GP", "GP_var")]
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[, 1]), by = "FID")
# plot_mu <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
#   scale_fill_viridis(name = "Spatial effect (mean)", option = "B") +
#   ggtitle("Spatial effect from Gaussian process (mean)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
# data_plot <- merge(shape, data.frame(FID = FID, y = rand_effs[, 2]), by = "FID")
# plot_sd <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
#   scale_fill_viridis(name = "Uncertainty (std. dev.)", option = "B") +
#   ggtitle("Uncertainty (std. dev.)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
# grid.arrange(plot_mu, plot_sd, ncol = 2)

Spatial effect (after factoring out the predictor variables) — Image by Author

Understanding the fixed effects function

There are several tools for understanding the form of the fixed effects function. Below we consider variable importance measures, interaction measures, and dependence plots. Specifically, we look at

  • SHAP values
  • SHAP dependence plots
  • Split-based variable importance
  • Friedman’s H-statistic
  • Partial dependence plots (one- and two-dimensional)

As the results below show, the information obtained from SHAP values and SHAP dependence plots is the same as that from traditional variable importance measures and partial dependence plots. The most important variables are ‘K’ and ‘edu’. From the dependence plots, we conclude that there are non-linearities. For instance, the effect of K is almost flat for large and small values of K and increasing in-between. Further, the effect of edu is steeper for small values of edu and flattens for larger values of edu. Friedman’s H-statistic indicates that there are some interactions. For the two variables with the largest amount of interaction, L and pop, we create a two-dimensional partial dependence plot below.

# SHAP values
library(SHAPforxgboost)
shap.plot.summary.wrap1(gpboost_model, X = data[, covars]) + ggtitle("SHAP values")

SHAP values — Image by Author

# SHAP dependence plots
shap_long <- shap.prep(gpboost_model, X_train = data[, covars])
shap.plot.dependence(data_long = shap_long, x = covars[2],
                     color_feature = covars[4], smooth = FALSE, size = 2) +
  ggtitle("SHAP dependence plot for K")
shap.plot.dependence(data_long = shap_long, x = covars[4],
                     color_feature = covars[2], smooth = FALSE, size = 2) +
  ggtitle("SHAP dependence plot for edu")

SHAP dependence plot for K — Image by Author
SHAP dependence plot for edu — Image by Author

# Split-based feature importances
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
                    main = "Split-based variable importances")

Variable importances — Image by Author

# H-statistic for interactions. Note: there are no interactions if max_depth = 1
library(flashlight)
fl <- flashlight(model = gpboost_model, data = data.frame(y = data[, "y"], data[, covars]),
                 y = "y", label = "gpb",
                 predict_fun = function(m, X) predict(m, data.matrix(X[, covars]),
                                                      gp_coords_pred = matrix(-100, ncol = 2, nrow = dim(X)[1]),
                                                      group_data_pred = matrix(-1, ncol = 1, nrow = dim(X)[1]),
                                                      pred_latent = TRUE)$fixed_effect)
plot(imp <- light_interaction(fl, v = covars, pairwise = TRUE)) +
  ggtitle("H interaction statistic") # takes a few seconds

H interaction statistic — Image by Author

# Partial dependence plots
gpb.plot.partial.dependence(gpboost_model, data[, covars], variable = 2, xlab = covars[2],
                            ylab = "gdp", main = "Partial dependence plot")
gpb.plot.partial.dependence(gpboost_model, data[, covars], variable = 4, xlab = covars[4],
                            ylab = "gdp", main = "Partial dependence plot")

Partial dependence plot for K — Image by Author
Partial dependence plot for edu — Image by Author

# Two-dimensional partial dependence plot (to visualize interactions)
i = 1; j = 3 # i vs j
gpb.plot.part.dep.interact(gpboost_model, data[, covars], variables = c(i, j), xlab = covars[i],
                           ylab = covars[j], main = "Pairwise partial dependence plot")

Two-dimensional partial dependence plot for pop and L — Image by Author

Extensions

Separate random effects for different time periods

In the above model, we have used the same random effects for both years 2000 and 2021. Alternatively, one can also use independent spatial and grouped random effects for different time periods (independent under the prior; conditional on the data, there can be dependence …). In the GPBoost library, this can be done via the cluster_ids argument. cluster_ids needs to be a vector of length the sample size, and every entry indicates the “cluster” to which an observation belongs. Different clusters then have independent spatial and grouped random effects, but the hyperparameters (e.g., marginal variance, variance components, etc.) and the fixed effects function are the same across clusters.

Below we show how to fit such a model and create two separate spatial maps. As the results below show, the spatial effects are quite different for the two years 2000 and 2021. In particular, there is less (residual) spatial variation (or heterogeneity) for the year 2021 compared to 2000. This is confirmed by the standard deviations of the predicted spatial random effects, which are almost twice as large for 2000 compared to the year 2021 (see below). A further conclusion is that in 2000, there were more regions with “low” GDP numbers (spatial effects below 0), and this is no longer the case in 2021.

gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential",
                    cluster_ids = c(rep(1, 242), rep(2, 242)))
boost_data <- gpb.Dataset(data = data[, covars], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they should be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
                         params = params, verbose = 0)
# Separate spatial maps for the years 2000 and 2021
pred <- predict(gpboost_model, group_data_pred = data[, c("cl")],
                gp_coords_pred = data[, c("Long", "Lat")],
                data = data[, covars], predict_var = TRUE, pred_latent = TRUE,
                cluster_ids_pred = c(rep(1, 242), rep(2, 242)))
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[1:242]), by = "FID")
plot_mu_2000 <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "Spatial effect", option = "B") +
  ggtitle("Spatial effect for 2000 (mean)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
data_plot <- merge(shape, data.frame(FID = FID, y = pred$random_effect_mean[243:484]), by = "FID")
plot_mu_2021 <- ggplot(data_plot) + geom_sf(aes(group = FID, fill = y)) +
  scale_fill_viridis(name = "Spatial effect", option = "B") +
  ggtitle("Spatial effect for 2021 (mean)") + xlim(2700000, 6400000) + ylim(1500000, 5200000)
grid.arrange(plot_mu_2000, plot_mu_2021, ncol = 2)
sd(pred$random_effect_mean[1:242])   # 0.2321267
sd(pred$random_effect_mean[243:484]) # 0.1286398

Separate spatial effects for the years 2000 and 2021 — Image by Author

Interaction between space and fixed effects predictor variables

In the above model, there is no interaction between the random effects and the fixed effects predictor variables. I.e., there is no interaction between the spatial coordinates and the fixed effects predictor variables. Such interaction can be modeled by additionally including the random effects input variables (= the spatial coordinates or the categorical grouping variable) in the fixed effects function. The code below shows how this can be done. As the variable importance plot below shows, the coordinates are not used in the tree-ensemble, and there is thus no such interaction detectable for this data set.

gp_model <- GPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                    likelihood = "gaussian", cov_function = "exponential")
covars_interact <- c(c("Long", "Lat"), covars) ## add spatial coordinates to fixed effects predictor variables
boost_data <- gpb.Dataset(data = data[, covars_interact], label = data[, "y"])
params <- list(learning_rate = 0.01, max_depth = 1, num_leaves = 2^10,
               min_data_in_leaf = 10, lambda_l2 = 1)
# Note: we use the same tuning parameters as above. Ideally, they should be chosen again
gpboost_model <- gpboost(data = boost_data, gp_model = gp_model, nrounds = nrounds,
                         params = params, verbose = 0)
feature_importances <- gpb.importance(gpboost_model, percentage = TRUE)
gpb.plot.importance(feature_importances, top_n = 25, measure = "Gain",
                    main = "Variable importances when including coordinates in the fixed effects")

Variable importances when including coordinates in the tree-ensemble — Image by Author

Large data

For large data sets, calculations with Gaussian processes are slow, and one has to use an approximation. The GPBoost library implements several of them. For instance, setting gp_approx = "vecchia" in the GPModel constructor will use a Vecchia approximation. The data set used in this article is relatively small, and we can do all calculations exactly.
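
As a sketch, a Vecchia approximation would be enabled in the model definition as follows (num_neighbors is optional and trades off accuracy against speed):

gp_model <- GPModel(gp_coords = data[, c("Long", "Lat")],
                    cov_function = "exponential", likelihood = "gaussian",
                    gp_approx = "vecchia", num_neighbors = 20)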

Other spatial random effects models

Above, we have used a Gaussian process to model spatial random effects. Since the data is areal data, another option is to use models that rely on neighborhood information, such as CAR and SAR models. These models are currently not yet implemented in the GPBoost library (they might be added in the future -> contact the author).

Another option is to use a grouped random effects model defined by the categorical region ID variable for modeling spatial effects. The code below shows how this can be done in GPBoost. However, this model essentially ignores the spatial structure.

gp_model <- GPModel(group_data = data[, c("group", "cl")], likelihood = "gaussian")

(Generalized) linear mixed effects and Gaussian process models

(Generalized) linear mixed effects and Gaussian process models can also be run in the GPBoost library. The code below shows how this can be done with the fitGPModel function. Note that one needs to manually add a column of 1’s to include an intercept.

X_incl_1 = cbind(Intercept = rep(1, dim(data)[1]), data[, covars])
gp_model <- fitGPModel(group_data = data[, c("cl")], gp_coords = data[, c("Long", "Lat")],
                       likelihood = "gaussian", cov_function = "exponential",
                       y = data[, "y"], X = X_incl_1, params = list(std_dev = TRUE))
summary(gp_model)
## =====================================================
## Model summary:
##  Log-lik     AIC     BIC
##   195.97 -373.94 -336.30
## Nb. observations: 484
## Nb. groups: 2 (Group_1)
## -----------------------------------------------------
## Covariance parameters (random effects):
##             Param.  Std. dev.
## Error_term  0.0131 -> 0.0215  0.0017
## Group_1     0.0003  0.0008
## GP_var      0.0126  0.0047
## GP_range    7.2823  3.6585
## -----------------------------------------------------
## Linear regression coefficients (fixed effects):
##             Param.  Std. dev.  z value  P(>|z|)
## Intercept  16.2816     0.4559  35.7128   0.0000
## L           0.4243     0.0565   7.5042   0.0000
## K           0.6493     0.0212  30.6755   0.0000
## pop         0.0134     0.0097   1.3812   0.1672
## edu         0.0100     0.0009  10.9645   0.0000
## =====================================================

