Gradient Boosting for Spatial Regression Models with Autoregressive Disturbances

Balzer, Michael

doi:10.1007/s11067-025-09717-8

Gradient Boosting for Spatial Regression Models with Autoregressive Disturbances

Research
Open access
Published: 03 December 2025

(2025)
Cite this article

You have full access to this open access article

Download PDF

Networks and Spatial Economics Aims and scope Submit manuscript

Michael Balzer¹

192 Accesses
Explore all metrics

Abstract

Researchers in urban and regional studies increasingly work with high-dimensional spatial data that captures spatial patterns and spatial dependencies between observations. To address the unique characteristics of spatial data, various spatial regression models have been developed. In this article, a novel model-based gradient boosting algorithm tailored for spatial regression models with autoregressive disturbances is proposed. Due to its modular nature, the approach offers an alternative estimation procedure with interpretable results that remains feasible even in high-dimensional settings where traditional quasi-maximum likelihood or generalized method of moments estimators may fail to yield unique solutions. The approach also enables data-driven variable and model selection in both low- and high-dimensional settings. Since the bias-variance trade-off is additionally controlled for within the algorithm, it imposes implicit regularization which enhances predictive accuracy on out-of-sample spatial data. Detailed simulation studies regarding the performance of estimation, prediction and variable selection in low- and high-dimensional settings support proper functionality of the proposed methodology. To illustrate the applicability of the model-based gradient boosting algorithm, a case study is presented where the life expectancy in German districts is modeled, incorporating a potential spatial dependence structure.

High-Order Data-Driven Spatial Simulation of Categorical Variables

Article Open access 01 July 2021

A semiparametric dynamic higher-order spatial autoregressive model

Article 20 September 2023

Higher-order spatial autoregressive varying coefficient model: estimation and specification test

Article 26 August 2024

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

With the development of novel technologies such as global positioning systems, the availability of spatial data has increased across a wide range of real-world applications. A spatial dependence structure in the data can arise due to spatial autocorrelation which describes similarities between geographical locations in space. In principle, such spatial data can be modeled by spatial regression models which are extensions of the standard linear regression model (LeSage and Pace 2009). Under the assumption that the spatial dependence arises solely through the error terms, Cliff and Ord (1973) proposed based on Whittle (1954) to model the spatial autocorrelation in the disturbance process of a linear regression model. Then, for any given variable of interest, the error term for each location depends on a weighted average of the disturbances in connected locations (Anselin 1988).

However, the recent availability of large-scale novel information leads increasingly to high-dimensional data settings. Therefore, applied researchers are faced with unique challenges since in high-dimensional data sets the number of variables exceeds the number of observations making estimation via the quasi-maximum likelihood (QML) or generalized method of moments (GMM) principle infeasible. Furthermore, out of the available variables only a small fraction might actually be informative in explaining a dependent variable of interest. Moreover, the complexity of the linear predictor may rapidly increase depending on modeling choices of the practitioner such as the inclusion of higher order spatial lags of independent variables leading to a rich set of potential candidate models (LeSage and Pace 2009; Fahrmeir et al. 2013).

Thus, approaches to model choice, variable selection and estimation in high-dimensional settings have become increasingly more important. If model estimation is based on the QML, a quantifiable value for model comparison in low-dimensional settings can be obtained based on the Akaike information criterion. Instead of utilizing information criteria, regularization techniques are popular alternatives for model choice and variable selection (Fahrmeir et al. 2013). Such regularization techniques include, for example, the least absolute shrinkage and selection operator which has been recently extended to spatial regression model with autoregressive disturbances (Tibshirani 1996; Cai et al. 2019; Cai and Maiti 2020).

Another option in linear regression models is the model-based gradient boosting algorithm. Although originally proposed in the domain of machine learning for classification problems, gradient boosting has been extended for statistical regression models where it is known as component-wise, model-based or statistical gradient boosting. In principle, the algorithm is iterative in nature where the estimation problem reduces to fitting base-learners to the negative gradient of a prespecified loss function. The base-learners then describe the functional forms of the effect of the independent variables and the loss function is related to the statistical model of interest. In each iteration of the algorithm only the best performing base-learner is chosen from which a small fraction is added to the current linear predictor. Stopping the algorithm early allows for data-driven model and variable selection and yields interpretable results at each iteration (Mayr et al. 2014). Due to the modular and iterative nature of model-based gradient boosting, it remains a feasible approach even in high-dimensional settings (Bühlmann 2006; Bühlmann and Hothorn 2007).

Although a variant of boosting for semi-parametric additive spatial autoregressive models has been recently explored (Yue and Xi 2025), no prior work appears to have addressed a potential extension of model-based gradient boosting for parametric spatial regression models with autoregressive disturbances. To this end, the model-based gradient boosting algorithm in the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner et al. 2014, 2015) for fitting generalized linear, additive and interaction models of potential high-dimensionality in the programming language R (R Core Team 2025) is extended to accommodate spatial regression models with autoregressive disturbances. To investigate proper functionality of the proposed model-based gradient boosting algorithm, in-depth simulation studies in low- and high-dimensional settings are conducted. The focus lies primarily on the evaluation of estimation, variable selection and prediction. To illustrate the potential real-world application of model-based gradient boosting, a case study concerned with modeling the life expectancy in German districts is presented which draws on the “Indicators and Maps on Spatial and Urban Development in Germany and Europe” (INKAR) data base that includes a rich variety of variables (Bundesinstitut für Bau-, Stadt- und Raumforschung (BBSR) 2024).

The structure in this article is as follows: In Section 2, the mathematical, theoretical framework of spatial regression models with autoregressive disturbances and model-based gradient boosting is introduced where both concepts are combined thereafter. Afterward, outcomes for the simulation studies are discussed in Section 3. A description of the context, data situation and variables for the case study as well as the results are presented in Section 4. The article finishes with a conclusion and a discussion in Section 5.

2 Methodology

2.1 Spatial Regression Models with Autoregressive Disturbances

Let $n \in \mathbb {N}$ denote the number of observations in a spatial data set. For each $i \in \{1,\dots ,n\}$, consider following spatial regression model with autoregressive disturbances

$$\begin{aligned} \begin{aligned} \varvec{y}&= \varvec{X}\varvec{\beta } + \varvec{W}\varvec{X}\varvec{\theta } + \varvec{u} \\ \varvec{u}&= \lambda \varvec{W}\varvec{u} + \varvec{\epsilon } \end{aligned} \end{aligned}$$

(1)

where $\varvec{y}$ is a $n \times 1$ vector of observations, $\varvec{X}$ is the $n \times p$ design matrix of $p \in \mathbb {N}$ exogenous variables, $\varvec{\beta }$ are corresponding $p \times 1$ coefficients and $\varvec{u}$ is the $n \times 1$ vector of disturbances. The spatial autocorrelation in the data is assumed to enter the model in two ways. First, the disturbances are modeled as an autoregressive process which depend on a spatial autoregressive parameter $\lambda \in (-1,1)$, a spatial weight matrix $\varvec{W}$ of size $n \times n$ that captures spatial connections between observations and $n \times 1$ idiosyncratic random innovations $\varvec{\epsilon }$. Second, spatial lags of exogenous variables $\varvec{W}\varvec{X}$ of size $n \times p$ and corresponding $p \times 1$ coefficients $\varvec{\theta }$ are also included in modeling the spatially dependent variable of interest. A list of important notation utilized throughout the article can be found in Appendix A. Thus, the model in Eq. 1 is a generalization of special cases of spatial regression models and is labeled the spatial Durbin error model (SDEM). In the SDEM, spatial autocorrelation is accounted for in the explanatory variables as well as the error term. However, removing the spatially lagged independent variables results in the simpler spatial error model (SEM). In contrast, retaining the spatially lagged independent variables but removing the autoregressive nature of the disturbances results in the simpler spatial cross-regressive model (SLX) (Anselin 1988; LeSage and Pace 2009; Halleck Vega and Elhorst 2015).

Assumption 1

The innovations $\varvec{\epsilon }$ are independently and identically distributed with expectation $\mathbb {E}(\varvec{\epsilon }) = 0$ and variance ${\text {Var}}(\varvec{\epsilon }) = \sigma ^2$. Additionally, for any $\xi> 0$, the moment $\mathbb {E}|\varvec{\epsilon }|^{4+\xi }$ exists.

Assumption 2

The spatial weight matrix $\varvec{W}$ has no self-loops, that is, the diagonal entries satisfy $w_{ii} = 0$. The off-diagonal entries satisfy $w_{ij} = O\left( \frac{1}{h}\right)$ where $\frac{h}{n} \rightarrow 0$.

Assumption 3

For any $|\lambda | < 1$, the matrix $\varvec{I} - \lambda \varvec{W}$ exists and is non-singular.

Assumption 4

The row and column sums of $\varvec{W}$ and $\left( \varvec{I} - \lambda \varvec{W}\right) ^{-1}$ are uniformly bounded in absolute value.

For a proper estimation procedure of the autoregressive parameter as well as the coefficients of the exogenous variables and the corresponding spatial lags, regularity conditions have to be imposed. Assumption 1 imposes homoskedasticity of the innovations, although normality is not formally required. Assumption 2 links the number of observations to the spatial weight matrix. The condition is satisfied if $\varvec{W}$ is row-normalized which is an assumption maintained throughout this article. The stability condition $|\lambda | < 1$ in Assumption 3 ensures the invertibility of $\varvec{I} - \lambda \varvec{W}$ and thus the uniqueness of the autoregressive disturbances in terms of the innovations. Similarly, Assumption 4 ensures that the degree of spatial autocorrelation remains within a manageable range (Lee 2004).

The model in Eq. 1 can be written more compactly by combining all exogenous variables and the corresponding spatial lags into one design matrix $\varvec{Z} =[\varvec{X}, \varvec{W} \varvec{X}]$ and by stacking the corresponding coefficients vertically $\varvec{\delta } = (\varvec{\varvec{\beta }^{\prime }, \varvec{\theta }}^{\prime })^{\prime }$ as

$$\begin{aligned} \begin{aligned} \varvec{y}&= \varvec{Z} \varvec{\delta } + \varvec{u}, \quad \mathbb {E}(\varvec{u}) = \varvec{0}, \\ \varvec{u}&= (\varvec{I} - \lambda \varvec{W})^{-1} \varvec{\epsilon },\quad \text {Var}(\varvec{u}) = \varvec{\Omega }(\lambda , \sigma ^2). \end{aligned} \end{aligned}$$

Although not the focus in this article, it is worth to note that the general model formulation also allows for the inclusion of additional spatial lags of exogenous variables such as $\varvec{W}\varvec{W}\varvec{X}$ or even $\varvec{W}\varvec{W}\varvec{W}\varvec{X}$. Let $\varvec{\eta } = \varvec{Z} \varvec{\delta }$ be the so-called linear predictor. To estimate the coefficients of the exogenous variables and the corresponding spatial lags of exogenous variables, it is convenient to transform the model into a single equation form as

$$\begin{aligned} (\varvec{y} - \varvec{\eta }) = (\varvec{I} - \lambda \varvec{W})^{-1} \varvec{\epsilon } \quad \mathbb {E}(\varvec{\epsilon }) = \varvec{0}, \quad \text {Var}(\varvec{\epsilon }) = \sigma ^2 \varvec{I}. \end{aligned}$$

(2)

The following Assumption 5 is additionally imposed on the design matrix $\varvec{Z}$.

Assumption 5

The matrix $\varvec{Z}$ has full column rank. Specifically, the limit $\lim _{n \rightarrow \infty } \varvec{Z}^{\prime } \varvec{Z}$ exists, is non-singular, and the elements of $\varvec{Z}$ are uniformly bounded in absolute value.

If the autoregressive parameter $\lambda$ is known, then the loss function corresponding to Eq. 2 is the squared Mahalanobis distance of the residual vector which arises from the generalized least squares objective function

$$\begin{aligned} \rho (\varvec{y}, \varvec{\eta }, \varvec{\Omega }(\lambda , \sigma ^2)) = (\varvec{y} - \varvec{\eta })^{\prime } \varvec{\Omega }(\lambda , \sigma ^2)^{-1} (\varvec{y} - \varvec{\eta }) \end{aligned}$$

(3)

where the variance-covariance matrix is induced by spatial autocorrelation and

$$\begin{aligned} \varvec{\Omega }(\lambda , \sigma ^2) = \sigma ^2 \left[ (\varvec{I} - \lambda \varvec{W})^{\prime } (\varvec{I} - \lambda \varvec{W})\right] ^{-1}. \end{aligned}$$

(4)

The negative gradient of the loss function with respect to the linear predictor $\varvec{\eta }$ is given by

$$\begin{aligned} -\frac{\partial }{\partial \varvec{\eta }} \rho (\varvec{y}, \varvec{\eta }, \varvec{\Omega }(\lambda , \sigma ^2)) = 2 \varvec{\Omega }(\lambda , \sigma ^2)^{-1} (\varvec{y} - \varvec{\eta }). \end{aligned}$$

(5)

which yields all necessary ingredients for the model-based gradient boosting algorithm (Kelejian and Prucha 1999; Cai et al. 2019; Cai and Maiti 2020).

2.2 Three-step Feasible Model-based Gradient Boosting

Given the relevant expressions for the ingredients, model-based gradient boosting for the SDEM can be implemented. In principle, the ingredients play an important role in each iteration of the algorithm. In general, the standard interpretation of boosting as steepest descent in function space implies that the algorithm reduces the empirical risk iteratively through the use of base-learners (Friedman 2001). The base-learners represent the functional forms associated with the exogenous input variables. The algorithm begins with an empty model and sequentially fits the specified base-learners to the negative gradient of the chosen loss function. Thus, proper functionality of the algorithm requires a pre-specified loss function which can be quite general. Subsequently, the residual sum of squares is computed for each base-learner separately and the linear predictor is updated by adding a small fraction of the best performing base-learner. The algorithm then reevaluates the negative gradient and updates the linear predictor in an iterative fashion until the specified number of boosting iterations is reached (Friedman 2001; Bühlmann and Hothorn 2007; Mayr et al. 2014).

Algorithm 1 adapts model-based gradient boosting for the SDEM and does not impose any limitations on the number of potential independent variables $q \in \mathbb {N}$. Indeed, the great advantage of model-based gradient boosting is the feasibility in high-dimensional settings where the number of variables is larger than the number of available observations (Hepp et al. 2016). As the main tuning parameter in the algorithm, the number of boosting iterations $m_{\text {stop}}$ controls the so-called bias-variance trade-off. Thus, the accuracy of prediction can be improved and overfitting behavior mitigated. Due to the modular nature, the algorithm yields an interpretable solution at each iteration such that sparser models can be obtained by stopping the algorithm early instead of convergence. Since the algorithm only updates the linear predictor by one component at each iteration, variable selection and shrinkage estimation is also accounted for Mayr et al. (2012).

The feasibility of Algorithm 1 strongly relies on the assumption that the spatial autoregressive parameter $\lambda$ and the variance of the innovations $\sigma ^2$ are both simultaneously apriori known. However, in real-world application settings, $\lambda$ and $\sigma ^2$ are unknown which implies that the variance-covariance matrix $\varvec{\Omega }(\lambda , \sigma ^2)$ occurring in the squared Mahalanobis distance and the negative gradient cannot be evaluated. Therefore, a three-step model-based gradient boosting procedure is proposed to enable the feasibility of Algorithm 1 which relies on replacing the unknown quantities $\lambda$, $\sigma ^2$ and $\varvec{\Omega }(\lambda , \sigma ^2)$ by the estimators $\hat{\lambda }$, $\hat{\sigma }^2$ and $\varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)$. In the first step, the model in Eq. 1 is written as

$$\begin{aligned} \varvec{y} = \varvec{Z}\varvec{\delta }+ \varvec{u} \end{aligned}$$

(6)

temporarily ignoring the potential autoregressive structure of the disturbances. The model in Eq. 6 can then be estimated using a variety of methods as long as the resulting estimator $\varvec{\tilde{\delta }}$ is consistent. For low-dimensional linear settings, a natural choice for the estimator is ordinary least squares (OLS). In high-dimensional settings, OLS may not yield unique solutions, so model-based gradient boosting can be utilized instead. As noted in Zhang and Yu (2005) and Bühlmann (2006), model-based gradient boosting yields a consistent estimator in both low- and high-dimensional settings if the squared error is employed as the loss function. Additionally, Bühlmann and Hothorn (2007) formally show that model-based gradient boosting with the squared error loss function converges to the OLS solution if the number of boosting iterations $m_{\text {stop}}$ is chosen sufficiently large.

Based on Kelejian and Prucha (1999), let in the second step $\varvec{\tilde{u}} = \varvec{y} -\varvec{Z}\varvec{\tilde{\delta }}$ denote the predictors of $\varvec{u}$ based on a consistent estimator $\varvec{\tilde{\delta }}$. Define $\varvec{\bar{u}} = \varvec{W}\varvec{u}$, $\varvec{\bar{\bar{u}}} = \varvec{W}\varvec{W}\varvec{u}$ and the corresponding expressions based on the predictors as $\varvec{\tilde{\bar{u}}} = \varvec{W}\varvec{\tilde{u}}$ and $\varvec{\tilde{\bar{\bar{u}}}} = \varvec{W}\varvec{W}\varvec{\tilde{u}}$. Adapt an identical notation pattern to the innovations. Then, if Assumptions 1 to 3 hold, following three moments can be obtained

$$\begin{aligned} \mathbb {E}\left( \frac{1}{n} \varvec{\epsilon }^{\prime }\varvec{\epsilon } \right) = \sigma ^2 \quad \mathbb {E}\left( \frac{1}{n} \varvec{\bar{\epsilon }}^{\prime }\varvec{\bar{\epsilon }} \right) = \sigma ^2 \frac{1}{n} \text {tr}\left( \varvec{W}^{\prime }\varvec{W}\right) \quad \mathbb {E}\left( \frac{1}{n} \varvec{\bar{\epsilon }}^{\prime }\varvec{\epsilon } \right) = 0 \end{aligned}$$

(7)

where $\text {tr}(\cdot )$ denotes the trace of any matrix. Since the innovations can be written in terms of $\varvec{\bar{u}}$ and $\varvec{\bar{\bar{u}}}$ as $\varvec{\epsilon } = \varvec{u} - \lambda \varvec{\bar{u}}$ and $\varvec{\bar{\epsilon }} = \varvec{\bar{u}} - \lambda \varvec{\bar{\bar{u}}}$, a system of three equations can be obtained based on Eqs. 1 and 7

$$\begin{aligned} \varvec{\Gamma }[\lambda , \lambda ^2, \sigma ^2]^{\prime } - \varvec{\gamma } = 0. \end{aligned}$$

(8)

The expressions for $\varvec{\Gamma }$ and $\varvec{\gamma }$ are given as

$$\begin{aligned} \varvec{\Gamma } = \begin{bmatrix} \frac{2}{n} \varvec{u}^\prime \varvec{\bar{u}} & -\frac{1}{n} \varvec{\bar{u}}^\prime \varvec{\bar{u}} & 1 \\ \frac{2}{n} \varvec{\bar{\bar{u}}}^\prime \varvec{\bar{u}} & -\frac{1}{n} \varvec{\bar{\bar{u}}}^\prime \varvec{\bar{\bar{u}}} & \frac{1}{n} {\text {tr}}(\varvec{W}^\prime \varvec{W}) \\ \frac{1}{n} \left( \varvec{u}^\prime \varvec{\bar{\bar{u}}} + \varvec{\bar{u}}^\prime \varvec{\bar{u}} \right) & -\frac{1}{n} \varvec{\bar{u}}^\prime \varvec{\bar{\bar{u}}} & 0 \end{bmatrix} \quad \varvec{\gamma } = \begin{bmatrix} \frac{1}{n} \varvec{u}^\prime \varvec{u} \\ \frac{1}{n} \varvec{\bar{u}}^\prime \varvec{\bar{u}} \\ \frac{1}{n} \varvec{u}^\prime \varvec{\bar{u}} \end{bmatrix}. \end{aligned}$$

Replacing the moments in Eq. 8 by the corresponding sample moments yields

$$\begin{aligned} \varvec{G}[\lambda , \lambda ^2, \sigma ^2]^{\prime } - \varvec{g} = \varvec{\nu }(\lambda , \sigma ^2) \end{aligned}$$

(9)

where

$$\begin{aligned} \varvec{G} = \begin{bmatrix} \frac{2}{n} \varvec{\tilde{u}}^\prime \varvec{\tilde{\bar{u}}} & -\frac{1}{N} \varvec{\tilde{\bar{u}}}^\prime \varvec{\tilde{\bar{u}}} & 1 \\ \frac{2}{n} \varvec{\tilde{\bar{\bar{u}}}}^\prime \varvec{\tilde{\bar{u}}} & -\frac{1}{n} \varvec{\tilde{\bar{\bar{u}}}}^\prime \varvec{\tilde{\bar{\bar{u}}}} & \frac{1}{n} {\text {tr}}(\varvec{W}^\prime \varvec{W}) \\ \frac{1}{n} \left( \varvec{\tilde{u}}^\prime \varvec{\tilde{\bar{\bar{u}}}} + \varvec{\tilde{\bar{u}}}^\prime \varvec{\tilde{\bar{u}}} \right) & -\frac{1}{n} \varvec{\tilde{\bar{u}}}^\prime \varvec{\tilde{\bar{\bar{u}}}} & 0 \end{bmatrix} \quad \varvec{g} = \begin{bmatrix} \frac{1}{n} \varvec{\tilde{u}}^\prime \varvec{\tilde{u}} \\ \frac{1}{n} \varvec{\tilde{\bar{u}}}^\prime \varvec{\tilde{\bar{u}}} \\ \frac{1}{n} \varvec{\tilde{u}}^\prime \varvec{\tilde{\bar{u}}} \end{bmatrix} \end{aligned}$$

and $\varvec{\nu }(\lambda , \sigma ^2)$ is interpreted as a $3 \times 1$ residual vector. The estimators for $\lambda$ and $\sigma ^2$ are obtained using non-linear least squares and are denoted by $\hat{\lambda }$ and $\hat{\sigma }^2$. Based on Eq. 9, the non-linear least squares estimators are defined as

$$\begin{aligned} (\hat{\lambda }, \hat{\sigma }^2) = \underset{\lambda , \sigma ^2}{\text {arg min}} \left[ {\varvec{G}}[\lambda , \lambda ^2, \sigma ^2]^{\prime } - {\varvec{g}} \right] ^{\prime } \left[ {\varvec{G}}[\lambda , \lambda ^2, \sigma ^2]^{\prime } - {\varvec{g}} \right] . \end{aligned}$$

(10)

Let the Assumptions 1 to 7 hold. Then the non-linear least squares estimators $\hat{\lambda }$ and $\hat{\sigma }^2$ are consistent estimators of $\lambda$ and $\sigma ^2$ in the sense that $\hat{\lambda } \rightarrow _p \lambda$ and $\hat{\sigma }^2 \rightarrow _p \sigma ^2$ for $n \rightarrow \infty$ sufficiently large.

Assumption 6

Denote by $\tilde{u}_{i}$ denote the i-th element of the vector $\varvec{\tilde{u}}$. Assume that there exist finite-dimensional random vectors $\varvec{d}_{i,n}$ and $\varvec{\Delta }_n$ such that $\left| \tilde{u}_{i} - u_{i} \right| \le \left\| \varvec{d}_{i,n} \right\| \left\| \varvec{\Delta }_n \right\|$ with the following conditions holding $\frac{1}{n} \sum _{i=1}^n \left\| \varvec{d}_{i,n} \right\| ^{2 + \zeta } = O_p(1) \quad \text {for some } \zeta> 0, \quad \text {and} \quad \sqrt{n} \left\| \varvec{\Delta }_n \right\| = O_p(1)$.

Assumption 7

Assume that the matrix $\varvec{\Gamma }^{\prime } \varvec{\Gamma }$ is well-conditioned in the sense that its smallest eigenvalue is bounded away from zero. Specifically, $\phi _{\min } \left( \varvec{\Gamma }^{\prime } \varvec{\Gamma } \right) \ge \phi ^{*}> 0$ where the constant $\phi ^{*}$ may depend on $\lambda$ and $\sigma ^2$.

In the third step, $\lambda$ and $\sigma ^2$ in $\varvec{\Omega }(\lambda , \sigma ^2)$ are replaced by the estimators $\hat{\lambda }$ and $\hat{\sigma }^2$ yielding $\varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)$. Thus, the squared Mahalanobis distance and negative gradient become

$$\begin{aligned} \rho (\varvec{y}, \varvec{\eta }, \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2))&= (\varvec{y} - \varvec{\eta })^{\prime } \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)^{-1} (\varvec{y} - \varvec{\eta }) \end{aligned}$$

(11)

$$\begin{aligned} -\frac{\partial }{\partial \varvec{\eta }}\rho (\varvec{y}, \varvec{\eta }, \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2))&= 2\varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)^{-1} (\varvec{y} - \varvec{\eta }) \end{aligned}$$

(12)

$$\begin{aligned} \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)&= \hat{\sigma }^2 \left[ (\varvec{I} - \hat{\lambda }\varvec{W})^{\prime } (\varvec{I} - \hat{\lambda } \varvec{W})\right] ^{-1}. \end{aligned}$$

(13)

Finally replace the expressions in Algorithm 1 by the corresponding expressions based on the estimators in Eqs. 11 and 12 which yields a feasible model-based gradient boosting algorithm for the SDEM (Kelejian and Prucha 1999; Kapoor et al. 2007).

While convergence has been formally proven for model-based gradient boosting with the squared error loss function, the result does not directly translate to model-based gradient boosting with the squared Mahalanobis distance (Zhang and Yu 2005; Bühlmann 2006; Bühlmann and Hothorn 2007). Although there does not exist a proof for more general loss functions, empirical results confirm a similar convergence behavior with regards to the maximum likelihood estimator if the negative log likelihood of a generalized linear model is utilized as the loss function (Hepp et al. 2016). A similar behavior can be observed in model-based gradient boosting for the SDEM which is demonstrated in Appendix C.1.

2.3 Post-hoc Deselection

In model-based gradient boosting, the standard approach for model and variable selection is through early stopping via the stopping criterion $m_{\text {stop}}$. Generally, $m_{\text {stop}}$ is chosen by means of random cross-validation, subsampling or bootstrapping. However, random cross-validation techniques cannot be applied effectively to spatially dependent data because random partitions yield test data in which the observations are spatially close to the observations in the train data. This can lead to underestimation of the prediction error due to the negligence of the underlying spatial autocorrelation and thereby to unreliable variable as well as model selection (Schratz et al. 2019).

To address spatial autocorrelation, spatial cross-validation techniques such as those proposed by Brenning (2012) partition data into spatially disjoint subsets using clustering algorithms based on geodesic distances. Alternatively, spatial blocking techniques divide the geographical region into spatial blocks or buffer zones to create independent folds (Valavi et al. 2019). However, standard and spatial cross-validation techniques often select too many variables, resulting in less parsimonious models (Mayr et al. 2012). Furthermore, the quality of the non-linear least squares estimators $\hat{\lambda }$ and $\hat{\sigma }^2$ depends on the quality of the predictors of $\varvec{u}$ where the inclusion of non-informative variables severely impacts the final estimators. In fact, the additional regularization from early stopping can induce severe biases into non-linear least squares estimators which can result in misleading model as well as variable selection.

To mitigate the consequences and obtain sparser final models, the deselection algorithm proposed by Strömer et al. (2022) is applied to model-based gradient boosting for the SDEM. The basic idea of deselection is to run model-based gradient boosting once, determine the optimal number of boosting iterations $m_{\text {opt}}$ via spatial cross-validation techniques and then quantify the contribution of each variable to the overall risk reduction achieved up to $m_{\text {opt}}$. Variables that contribute little to reducing the empirical risk are subsequently removed and model-based gradient boosting is reapplied using only the remaining variables and the previously determined $m_{\text {opt}}$. Formally, let $\mathbbm {1}(\cdot )$ be the indicator function, $r^{[m]}$ the empirical risk of the squared Mahalanobis distance, $\left( r^{[m-1]} - r^{[m]}\right)$ the risk reduction and $j^{{*}^{[m]}}$ the best performing base-learner in iteration m of the model-based gradient boosting algorithm. The attributable risk reduction for base-learner j after $m_{\text {opt}}$ boosting iterations is defined as

$$\begin{aligned} R_j = \sum _{m = 1}^{m_{\text {opt}}} \mathbbm {1}\left\{ j = j^{{*}^{[m]}}\right\} \left( r^{[m-1]} - r^{[m]}\right) , \quad j = 1,\dots , q. \end{aligned}$$

(14)

In principle, the attributable risk reduction $R_j$ measures how much of the total reduction in empirical risk can be attributed to base-learner and hence variable j across the optimal number of boosting iterations $m_{\text {opt}}$. The total risk reduction is given by $\left( r^{[0]} - r^{[m_{\text {opt}}]}\right)$ and a base-learner j is deselected if its relative contribution is below a pre-specified threshold $\tau \in (0,1)$ given as

$$\begin{aligned} R_j < \tau \left( r^{[0]} - r^{[m_{\text {opt}}]}\right) . \end{aligned}$$

(15)

Since $R_j$ captures the risk reduction which can be attributed to each base-learner j, Eq. 15 removes base-learners for which $R_j$ is smaller than a fraction of the total risk reduction. Thus, variables remain only in the model if the relative risk contribution is equal to or larger than the threshold $\tau$. Therefore, the choice of the threshold $\tau$ is crucial and usually depends on the specific research context. Particularly, threshold $\tau$ controls the sparsity of the final model where smaller values of $\tau$ retain more variables while larger values yield more aggressive deselection. Simulation results reported in Appendix C.3 suggest that only relatively small thresholds $\tau \le 0.025$ are appropriate for the SDEM (Strömer et al. 2022). A summary of the deselection approach for model-based gradient boosting for the SDEM is given in Algorithm 2.

3 Simulation Study

3.1 Study Design

To evaluate the performance of the proposed three-step model-based gradient boosting algorithm, simulation studies are conducted with the ingredients derived in Section 2. Particularly, the performance of estimation, variable selection and prediction in low- as well as high-dimensional settings are evaluated. Additionally, an evaluation of the performance of the deselection algorithm is provided. Specifically, the number of observations is fixed at $n = 400$. In contrast, the number of independent variables is varied between $q = 20$ and $q = 800$, indicating a low- $(n> q)$ and high-dimensional $(n < q)$ setting. The true data generating process is given by

$$\begin{aligned} \begin{aligned} \varvec{y}&= 1 + 3.5\varvec{X}_1 -2.5 \varvec{X}_2 -4 \varvec{W} \varvec{X}_1 + 3 \varvec{W} \varvec{X}_2 + \varvec{u} \\ \varvec{u}&= \lambda \varvec{W}\varvec{u} + \varvec{\epsilon } \end{aligned} \end{aligned}$$

where the variables are independently and identically drawn from the uniform distribution $\varvec{X} \sim U(-2,2)$. The spatial autoregressive parameter is varied throughout the simulation study by $\lambda \in \{-0.8,-0.6,-0.4,-0.2,0.2,0.4,0.6,0.8\}$ and the innovations are normally distributed according to $\varvec{\epsilon } \sim N(0,\sigma ^2)$ with $\sigma ^2 = 1$. The spatial weight matrix $\varvec{W}$ is exactly identical to the application in Section 4 and has a 10-nearest neighbor structure where each location is connected to its ten geographically closest neighbors based on constant centroids of the spatial polygons available in the shape file. Afterward, $\varvec{W}$ is row-normalized such that each row sums up to one. Additionally, an artificial spatial weight matrix can be generated based on a circular world vis-à-vis Kelejian and Prucha (1999) in which each location is directly related to the five locations before and after, that is, $k = 5$. Simulation studies for varying spatial autoregressive parameters in a circular world are given in Appendix C.4 and for varying number of related locations $k \in \{1,2,3,5,10,20\}$ in a circular world are given in Appendix C.5.

In the model-based gradient boosting algorithm, the corresponding base-learners are specified as simple linear regression models due to the nature of the data generating process. The learning rate is set to $s = 0.1$ since that is the usual practice (see, for example, Schmid and Hothorn 2008; Mayr et al. 2012; Hofner et al. 2014). The optimal stopping criterion $m_{\text {opt}}$ is found by minimizing the empirical risk via 10-fold spatial cross-validation as proposed in Brenning (2012). In each simulation setting, a total of $n_{\text {sim}} = 100$ repetitions are conducted. Additionally, different approaches for the consistent estimator $\varvec{\tilde{\delta }}$ in the first step are considered, along with their impact on the final results. Specifically, the reported methods are first step OLS (LS-GB), gradient boosting (GB-GB), and gradient boosting with deselection (DS-GB) where applicable. Furthermore, model-based gradient boosting with first step gradient boosting with deselection and additional deselection (DS-DS) is considered. In this notation, the component before the hyphen indicates the method utilized in the first step and the component after the hyphen indicates the method applied for the SDEM. A practitioner’s note summarizing the presented model-based gradient boosting algorithm as well as deselection with different first step methods can be found in Appendix B.

Regarding the performance of variable selection and deselection, the criteria are chosen based on the confusion matrix. In particular, the reported variable selection criteria are the true positive rate (TPR), which is the proportion of correctly selected variables out of all true informative variables, the true negative rate (TNR), which is the proportion of correctly non-selected variables out of all true non-informative variables and the false discovery rate (FDR), which is the proportion of non-informative variables in the set of all selected variables (Stehman 1997).

The performance of estimation is evaluated by reporting the bias, the mean squared error (MSE) and the empirical standard error (ESE) for $\lambda$ defined as

$$\begin{aligned} \text {Bias}&= \frac{1}{n_{\text {sim}}} \sum _{i = 1}^{n_{\text {sim}}} \hat{\lambda }_i - \lambda \\ \text {MSE}&= \frac{1}{n_{\text {sim}}} \sum _{i = 1}^{n_{\text {sim}}} (\hat{\lambda }_i - \lambda )^2 \\ \text {ESE}&= \sqrt{\frac{1}{n_{\text {sim}} - 1} \sum _{i = 1}^{n_{\text {sim}}} (\hat{\lambda }_i - \bar{\lambda })^2}. \end{aligned}$$

For all proposed performance criteria, lower values are always preferred. Additionally, the effects shrinkage estimation for the independent variables is evaluated via visualization by boxplots to highlight the median, quartiles and outliers over 100 repetitions (Morris et al. 2019).

Furthermore, the prediction accuracy is evaluated based on an additional test data set. The test data $\varvec{y_{\text {test}}}$ is generated according to the same data generating process as the train data with an identical number of observations $n_{\text {test}} = 400$. The chosen criteria are the quasi negative log-likelihood (NLL), the root mean squared error of prediction (RMSEP) and mean absolute error of prediction (MAEP) defined as

$$\begin{aligned} \text {RMSEP}&= \sqrt{ \frac{1}{n_{\text {test}}} \sum _{i=1}^{n_{\text {test}}} (y_{\text {test},i} - \hat{y}_{\text {test},i})^2} \\ \text {MAEP}&= \frac{1}{n_{\text {test}}} \sum _{i=1}^{n_{\text {test}}} |y_{\text {test},i} - \hat{y}_{\text {test},i}| \\ \text {NLL}&= \frac{n_{\text {test}}}{2} \left( \log (2\pi \hat{\sigma }) + 1 \right) - \log \left| I - \hat{\lambda } W \right| \\&+ \frac{1}{2\hat{\sigma }} (\varvec{y_{\text {test}}} - \varvec{\hat{\eta }_{\text {test}}})^{\prime } (I - \hat{\lambda } W)^{\prime } (I - \hat{\lambda } W)(\varvec{y_{\text {test}}} - \varvec{\hat{\eta }_{\text {test}}}) \end{aligned}$$

For all proposed performance criteria of prediction accuracy, lower values are always preferred.

The simulation study is conducted in the programming language R (R Core Team 2025). The QML and GMM estimation of the SDEM in the low-dimensional setting is performed via the spatialreg package (Gay 1990; Bivand et al. 2021; Pebesma and Bivand 2023). The presented graphics are created with the tidyverse packages (Wickham et al. 2019). Model-based gradient boosting for generalized, additive and interaction models can be found in the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner et al. 2014, 2015). Spatial cross-validation is performed via the sperrorest package (Brenning 2012). An implementation for model-based gradient boosting for the SDEM via the novel spatial error family incorporating the deselection algorithm and the R code for reproducibility of all simulation studies can be found in the GitHub repository https://github.com/micbalz/SpatRegBoost.

3.2 Results

3.2.1 Low-dimensional Linear Setting (n = 400, p = 20)

In Table 1, the average selection rates over 100 repetitions for the low-dimensional setting are presented. The results show a consistent TPR of $100\%$ across all spatial autoregressive parameters $\lambda$, indicating that all informative variables are selected on average. In contrast, the TNR varies substantially with $\lambda$. For strongly negative spatial dependence $\lambda = -0.8$, the TNR is relatively low at approximately $46\%$, meaning that only about eight out of 16 non-informative variables are correctly excluded on average. As $\lambda$ increases toward positive values, the TNR steadily improves reaching over $90\%$ at $\lambda = 0.8$, indicating that nearly all non-informative variables are correctly not selected. Despite this improvement in TNR, the FDR remains considerable for small or negative values of $\lambda$. At $\lambda = -0.8$, the FDR is around $66\%$, indicating that two-thirds of the selected variables are non-informative on average. As $\lambda$ increases, the FDR gradually decreases, reaching approximately $21\%$ for $\lambda = 0.8$. This suggests that even though the model correctly identifies all informative variables, it includes a non-negligible number of non-informative variables especially under weak or negative spatial dependence although this issue becomes substantially less pronounced for large positive $\lambda$.

Table 1 Average selection rates in the low-dimensional setting with 100 repetitions, the spatial error family, model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters $\lambda$

Gradient Boosting for Spatial Regression Models with Autoregressive Disturbances

Abstract

Similar content being viewed by others

High-Order Data-Driven Spatial Simulation of Categorical Variables

A semiparametric dynamic higher-order spatial autoregressive model

Higher-order spatial autoregressive varying coefficient model: estimation and specification test

Explore related subjects

1 Introduction

2 Methodology

2.1 Spatial Regression Models with Autoregressive Disturbances

Assumption 1

Assumption 2

Assumption 3

Assumption 4

Assumption 5

2.2 Three-step Feasible Model-based Gradient Boosting

Assumption 6

Assumption 7

2.3 Post-hoc Deselection

3 Simulation Study

3.1 Study Design

3.2 Results

3.2.1 Low-dimensional Linear Setting (n = 400, p = 20)

3.2.2 High-dimensional Linear Setting (n = 400, p = 800)

3.2.3 Deselection

4 Case Study: Modeling Life Expectancy in German Districts

5 Conclusion

6 Supplementary Information

Data Availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing Interests

Additional information

Publisher's Note

Appendices

A Notation

B Practitioner’s Note

C Further Simulation Results

1.1 C.1 Convergence of Model-based Gradient Boosting to Generalized Least Squares Estimator

1.2 C.2 Small Sample Behavior of Non-linear Least Squares Estimator

1.3 C.3 Threshold Choice in Deselection

1.4 C.4 Varying Spatial Autoregressive Parameters in Circular World

1.4.1 C.4.1 Low-dimension

1.4.2 C.4.2 High-dimension

1.5 C.5 Varying Spatial Weight Matrices in Circular World

1.5.1 C.5.1 Low-dimension

1.5.2 C.5.2 High-dimension

D Further Case Studies

1.1 D.1 Boston Housing Prices

1.2 D.2 Columbus Crime Rate

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Profiles