1 Introduction

With the development of novel technologies such as global positioning systems, the availability of spatial data has increased across a wide range of real-world applications. A spatial dependence structure in the data can arise due to spatial autocorrelation, which describes similarities between geographical locations in space. In principle, such spatial data can be modeled by spatial regression models, which are extensions of the standard linear regression model (LeSage and Pace 2009). Under the assumption that the spatial dependence arises solely through the error terms, Cliff and Ord (1973), building on Whittle (1954), proposed modeling the spatial autocorrelation in the disturbance process of a linear regression model. Then, for any given variable of interest, the error term for each location depends on a weighted average of the disturbances in connected locations (Anselin 1988).

However, the recent availability of large-scale novel information increasingly leads to high-dimensional data settings. Applied researchers therefore face unique challenges: in high-dimensional data sets, the number of variables exceeds the number of observations, making estimation via the quasi-maximum likelihood (QML) or generalized method of moments (GMM) principle infeasible. Furthermore, out of the available variables, only a small fraction might actually be informative in explaining a dependent variable of interest. Moreover, the complexity of the linear predictor may increase rapidly depending on modeling choices of the practitioner, such as the inclusion of higher-order spatial lags of independent variables, leading to a rich set of potential candidate models (LeSage and Pace 2009; Fahrmeir et al. 2013).

Thus, approaches to model choice, variable selection and estimation in high-dimensional settings have become increasingly important. If model estimation is based on the QML principle, models can be compared in low-dimensional settings via the Akaike information criterion. Instead of utilizing information criteria, regularization techniques are popular alternatives for model choice and variable selection (Fahrmeir et al. 2013). Such regularization techniques include, for example, the least absolute shrinkage and selection operator, which has recently been extended to spatial regression models with autoregressive disturbances (Tibshirani 1996; Cai et al. 2019; Cai and Maiti 2020).

Another option in linear regression models is the model-based gradient boosting algorithm. Although originally proposed in the domain of machine learning for classification problems, gradient boosting has been extended to statistical regression models, where it is known as component-wise, model-based or statistical gradient boosting. In principle, the algorithm is iterative in nature, reducing the estimation problem to fitting base-learners to the negative gradient of a prespecified loss function. The base-learners describe the functional forms of the effects of the independent variables, and the loss function is related to the statistical model of interest. In each iteration of the algorithm, only the best performing base-learner is chosen, a small fraction of which is added to the current linear predictor. Stopping the algorithm early allows for data-driven model and variable selection and yields interpretable results at each iteration (Mayr et al. 2014). Due to its modular and iterative nature, model-based gradient boosting remains a feasible approach even in high-dimensional settings (Bühlmann 2006; Bühlmann and Hothorn 2007).

Although a variant of boosting for semi-parametric additive spatial autoregressive models has been recently explored (Yue and Xi 2025), no prior work appears to have addressed a potential extension of model-based gradient boosting for parametric spatial regression models with autoregressive disturbances. To this end, the model-based gradient boosting algorithm in the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner et al. 2014, 2015) for fitting generalized linear, additive and interaction models of potential high-dimensionality in the programming language R (R Core Team 2025) is extended to accommodate spatial regression models with autoregressive disturbances. To investigate proper functionality of the proposed model-based gradient boosting algorithm, in-depth simulation studies in low- and high-dimensional settings are conducted. The focus lies primarily on the evaluation of estimation, variable selection and prediction. To illustrate the potential real-world application of model-based gradient boosting, a case study concerned with modeling the life expectancy in German districts is presented which draws on the “Indicators and Maps on Spatial and Urban Development in Germany and Europe” (INKAR) data base that includes a rich variety of variables (Bundesinstitut für Bau-, Stadt- und Raumforschung (BBSR) 2024).

The structure of this article is as follows: In Section 2, the mathematical framework of spatial regression models with autoregressive disturbances and model-based gradient boosting is introduced, and both concepts are combined thereafter. Afterward, outcomes of the simulation studies are discussed in Section 3. A description of the context, data situation and variables for the case study as well as the results are presented in Section 4. The article finishes with a conclusion and a discussion in Section 5.

2 Methodology

2.1 Spatial Regression Models with Autoregressive Disturbances

Let \(n \in \mathbb {N}\) denote the number of observations in a spatial data set and consider the following spatial regression model with autoregressive disturbances

$$\begin{aligned} \begin{aligned} \varvec{y}&= \varvec{X}\varvec{\beta } + \varvec{W}\varvec{X}\varvec{\theta } + \varvec{u} \\ \varvec{u}&= \lambda \varvec{W}\varvec{u} + \varvec{\epsilon } \end{aligned} \end{aligned}$$
(1)

where \(\varvec{y}\) is an \(n \times 1\) vector of observations, \(\varvec{X}\) is the \(n \times p\) design matrix of \(p \in \mathbb {N}\) exogenous variables, \(\varvec{\beta }\) is the corresponding \(p \times 1\) coefficient vector and \(\varvec{u}\) is the \(n \times 1\) vector of disturbances. The spatial autocorrelation in the data is assumed to enter the model in two ways. First, the disturbances are modeled as an autoregressive process which depends on a spatial autoregressive parameter \(\lambda \in (-1,1)\), a spatial weight matrix \(\varvec{W}\) of size \(n \times n\) that captures spatial connections between observations and the \(n \times 1\) vector of idiosyncratic random innovations \(\varvec{\epsilon }\). Second, spatial lags of the exogenous variables \(\varvec{W}\varvec{X}\) of size \(n \times p\) with corresponding \(p \times 1\) coefficient vector \(\varvec{\theta }\) are also included in modeling the spatially dependent variable of interest. A list of important notation utilized throughout the article can be found in Appendix A. The model in Eq. 1 thus nests several special cases of spatial regression models and is labeled the spatial Durbin error model (SDEM). In the SDEM, spatial autocorrelation is accounted for in the explanatory variables as well as the error term. Removing the spatially lagged independent variables results in the simpler spatial error model (SEM). In contrast, retaining the spatially lagged independent variables but removing the autoregressive nature of the disturbances results in the simpler spatial cross-regressive model (SLX) (Anselin 1988; LeSage and Pace 2009; Halleck Vega and Elhorst 2015).
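To make the model structure concrete, the following minimal R sketch simulates data from Eq. 1 and constructs the lagged regressors. The ring-shaped weight matrix and the sample size are purely illustrative choices, while the coefficient values mirror the data generating process used later in Section 3.

```r
## Minimal sketch of the SDEM in Eq. 1; the ring-shaped neighbor
## structure is illustrative, any row-normalized W works.
set.seed(1)
n <- 100; p <- 2
W <- matrix(0, n, n)
W[cbind(1:n, c(2:n, 1))] <- 1          # each location has one neighbor
W <- W / rowSums(W)                    # row-normalization

X      <- matrix(runif(n * p, -2, 2), n, p)
beta   <- c(3.5, -2.5); theta <- c(-4, 3); lambda <- 0.4
eps    <- rnorm(n)                               # innovations
u      <- solve(diag(n) - lambda * W, eps)       # u = (I - lambda W)^{-1} eps
y      <- X %*% beta + W %*% X %*% theta + u     # SDEM response
Z      <- cbind(X, W %*% X)            # combined design matrix used below
```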

Assumption 1

The innovations \(\epsilon _i\) are independently and identically distributed with expectation \(\mathbb {E}(\epsilon _i) = 0\) and variance \({\text {Var}}(\epsilon _i) = \sigma ^2\). Additionally, for some \(\xi > 0\), the moment \(\mathbb {E}|\epsilon _i|^{4+\xi }\) exists.

Assumption 2

The spatial weight matrix \(\varvec{W}\) has no self-loops, that is, the diagonal entries satisfy \(w_{ii} = 0\). The off-diagonal entries satisfy \(w_{ij} = O\left( \frac{1}{h}\right)\) where \(\frac{h}{n} \rightarrow 0\).

Assumption 3

For any \(|\lambda | < 1\), the matrix \(\varvec{I} - \lambda \varvec{W}\) is non-singular and its inverse \((\varvec{I} - \lambda \varvec{W})^{-1}\) exists.

Assumption 4

The row and column sums of \(\varvec{W}\) and \(\left( \varvec{I} - \lambda \varvec{W}\right) ^{-1}\) are uniformly bounded in absolute value.

For a proper estimation procedure for the autoregressive parameter as well as the coefficients of the exogenous variables and their spatial lags, regularity conditions have to be imposed. Assumption 1 imposes homoskedasticity of the innovations, although normality is not formally required. Assumption 2 links the number of observations to the spatial weight matrix. The condition is satisfied if \(\varvec{W}\) is row-normalized, which is an assumption maintained throughout this article. The stability condition \(|\lambda | < 1\) in Assumption 3 ensures the invertibility of \(\varvec{I} - \lambda \varvec{W}\) and thus the uniqueness of the autoregressive disturbances in terms of the innovations. Similarly, Assumption 4 ensures that the degree of spatial autocorrelation remains within a manageable range (Lee 2004).

The model in Eq. 1 can be written more compactly by combining all exogenous variables and the corresponding spatial lags into one design matrix \(\varvec{Z} = [\varvec{X}, \varvec{W} \varvec{X}]\) and by stacking the corresponding coefficients vertically into \(\varvec{\delta } = (\varvec{\beta }^{\prime }, \varvec{\theta }^{\prime })^{\prime }\) as

$$\begin{aligned} \begin{aligned} \varvec{y}&= \varvec{Z} \varvec{\delta } + \varvec{u}, \quad \mathbb {E}(\varvec{u}) = \varvec{0}, \\ \varvec{u}&= (\varvec{I} - \lambda \varvec{W})^{-1} \varvec{\epsilon },\quad \text {Var}(\varvec{u}) = \varvec{\Omega }(\lambda , \sigma ^2). \end{aligned} \end{aligned}$$

Although not the focus of this article, it is worth noting that the general model formulation also allows for the inclusion of additional spatial lags of exogenous variables such as \(\varvec{W}\varvec{W}\varvec{X}\) or even \(\varvec{W}\varvec{W}\varvec{W}\varvec{X}\). Let \(\varvec{\eta } = \varvec{Z} \varvec{\delta }\) denote the so-called linear predictor. To estimate the coefficients of the exogenous variables and their spatial lags, it is convenient to transform the model into a single equation form as

$$\begin{aligned} (\varvec{y} - \varvec{\eta }) = (\varvec{I} - \lambda \varvec{W})^{-1} \varvec{\epsilon }, \quad \mathbb {E}(\varvec{\epsilon }) = \varvec{0}, \quad \text {Var}(\varvec{\epsilon }) = \sigma ^2 \varvec{I}. \end{aligned}$$
(2)

The following Assumption 5 is additionally imposed on the design matrix \(\varvec{Z}\).

Assumption 5

The matrix \(\varvec{Z}\) has full column rank. Specifically, the limit \(\lim _{n \rightarrow \infty } \frac{1}{n} \varvec{Z}^{\prime } \varvec{Z}\) exists and is non-singular, and the elements of \(\varvec{Z}\) are uniformly bounded in absolute value.

If the autoregressive parameter \(\lambda\) and the innovation variance \(\sigma ^2\) are known, the loss function corresponding to Eq. 2 is the squared Mahalanobis distance of the residual vector, which arises from the generalized least squares objective function

$$\begin{aligned} \rho (\varvec{y}, \varvec{\eta }, \varvec{\Omega }(\lambda , \sigma ^2)) = (\varvec{y} - \varvec{\eta })^{\prime } \varvec{\Omega }(\lambda , \sigma ^2)^{-1} (\varvec{y} - \varvec{\eta }) \end{aligned}$$
(3)

where the variance-covariance matrix is induced by spatial autocorrelation and

$$\begin{aligned} \varvec{\Omega }(\lambda , \sigma ^2) = \sigma ^2 \left[ (\varvec{I} - \lambda \varvec{W})^{\prime } (\varvec{I} - \lambda \varvec{W})\right] ^{-1}. \end{aligned}$$
(4)

The negative gradient of the loss function with respect to the linear predictor \(\varvec{\eta }\) is given by

$$\begin{aligned} -\frac{\partial }{\partial \varvec{\eta }} \rho (\varvec{y}, \varvec{\eta }, \varvec{\Omega }(\lambda , \sigma ^2)) = 2 \varvec{\Omega }(\lambda , \sigma ^2)^{-1} (\varvec{y} - \varvec{\eta }). \end{aligned}$$
(5)

This yields all necessary ingredients for the model-based gradient boosting algorithm (Kelejian and Prucha 1999; Cai et al. 2019; Cai and Maiti 2020).
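As a concrete illustration, a minimal R sketch of Eqs. 3 to 5 is given below, assuming \(\varvec{W}\), \(\lambda\) and \(\sigma ^2\) are known. Note that, by Eq. 4, the inverse of the variance-covariance matrix has the closed form \(\varvec{\Omega }(\lambda , \sigma ^2)^{-1} = \sigma ^{-2}(\varvec{I} - \lambda \varvec{W})^{\prime }(\varvec{I} - \lambda \varvec{W})\), so neither the loss nor the gradient requires a matrix inversion.

```r
## Loss (Eq. 3) and negative gradient (Eq. 5) for known lambda and sigma2.
omega_inv <- function(lambda, sigma2, W) {
  A <- diag(nrow(W)) - lambda * W
  crossprod(A) / sigma2                # (I - lambda W)'(I - lambda W) / sigma2
}
loss <- function(y, eta, Oinv) {
  r <- y - eta
  drop(crossprod(r, Oinv %*% r))       # squared Mahalanobis distance
}
neg_gradient <- function(y, eta, Oinv) {
  2 * Oinv %*% (y - eta)               # steepest descent direction
}
```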

2.2 Three-step Feasible Model-based Gradient Boosting

Given the relevant expressions for these ingredients, model-based gradient boosting for the SDEM can be implemented. The standard interpretation of boosting as steepest descent in function space implies that the algorithm reduces the empirical risk iteratively through the use of base-learners (Friedman 2001). The base-learners represent the functional forms associated with the exogenous input variables. The algorithm begins with an empty model and sequentially fits the specified base-learners to the negative gradient of the chosen loss function, which must be pre-specified but can be quite general. Subsequently, the residual sum of squares is computed for each base-learner separately and the linear predictor is updated by adding a small fraction of the best performing base-learner. The algorithm then reevaluates the negative gradient and updates the linear predictor in an iterative fashion until the specified number of boosting iterations is reached (Friedman 2001; Bühlmann and Hothorn 2007; Mayr et al. 2014).

Algorithm 1 adapts model-based gradient boosting to the SDEM and does not impose any limitation on the number of potential independent variables \(q \in \mathbb {N}\). Indeed, the great advantage of model-based gradient boosting is its feasibility in high-dimensional settings where the number of variables is larger than the number of available observations (Hepp et al. 2016). As the main tuning parameter of the algorithm, the number of boosting iterations \(m_{\text {stop}}\) controls the so-called bias-variance trade-off. Thus, the accuracy of prediction can be improved and overfitting behavior mitigated. Due to its modular nature, the algorithm yields an interpretable solution at each iteration such that sparser models can be obtained by stopping the algorithm early instead of running it until convergence. Since the algorithm updates the linear predictor by only one component per iteration, variable selection and shrinkage estimation are also accounted for (Mayr et al. 2012).

Algorithm 1 Model-based gradient boosting for the spatial Durbin error model (SDEM)
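To illustrate the loop structure of Algorithm 1, a didactic R sketch with simple linear base-learners is given below. It is a simplified re-implementation under the stated notation, not the mboost-based implementation from the accompanying repository; centered data are assumed and an intercept base-learner is omitted for brevity.

```r
## Sketch of Algorithm 1: component-wise boosting under the GLS loss.
boost_sdem <- function(y, Z, Oinv, mstop = 500, nu = 0.1) {
  q <- ncol(Z)
  delta <- rep(0, q)                          # start from the empty model
  eta <- rep(0, length(y))
  for (m in seq_len(mstop)) {
    g <- 2 * Oinv %*% (y - eta)               # negative gradient (Eq. 5)
    b <- sapply(seq_len(q), function(j)
      sum(Z[, j] * g) / sum(Z[, j]^2))        # least squares fit per base-learner
    rss <- sapply(seq_len(q), function(j)
      sum((g - Z[, j] * b[j])^2))             # residual sum of squares
    j_star <- which.min(rss)                  # best performing base-learner
    delta[j_star] <- delta[j_star] + nu * b[j_star]   # weak update
    eta <- Z %*% delta                        # updated linear predictor
  }
  delta
}
```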

The feasibility of Algorithm 1 strongly relies on the assumption that the spatial autoregressive parameter \(\lambda\) and the variance of the innovations \(\sigma ^2\) are both a priori known. However, in real-world applications, \(\lambda\) and \(\sigma ^2\) are unknown, which implies that the variance-covariance matrix \(\varvec{\Omega }(\lambda , \sigma ^2)\) occurring in the squared Mahalanobis distance and the negative gradient cannot be evaluated. Therefore, a three-step model-based gradient boosting procedure is proposed that makes Algorithm 1 feasible by replacing the unknown quantities \(\lambda\), \(\sigma ^2\) and \(\varvec{\Omega }(\lambda , \sigma ^2)\) with the estimators \(\hat{\lambda }\), \(\hat{\sigma }^2\) and \(\varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)\). In the first step, the model in Eq. 1 is written as

$$\begin{aligned} \varvec{y} = \varvec{Z}\varvec{\delta }+ \varvec{u} \end{aligned}$$
(6)

temporarily ignoring the potential autoregressive structure of the disturbances. The model in Eq. 6 can then be estimated using a variety of methods as long as the resulting estimator \(\varvec{\tilde{\delta }}\) is consistent. For low-dimensional linear settings, a natural choice for the estimator is ordinary least squares (OLS). In high-dimensional settings, OLS may not yield unique solutions, so model-based gradient boosting can be utilized instead. As noted in Zhang and Yu (2005) and Bühlmann (2006), model-based gradient boosting yields a consistent estimator in both low- and high-dimensional settings if the squared error is employed as the loss function. Additionally, Bühlmann and Hothorn (2007) formally show that model-based gradient boosting with the squared error loss function converges to the OLS solution if the number of boosting iterations \(m_{\text {stop}}\) is chosen sufficiently large.
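For the first step, a minimal sketch using the glmboost() interface of the mboost package could look as follows; the data frame construction and the fixed number of iterations are illustrative, and in practice mstop would be tuned.

```r
## First step via mboost's glmboost() with squared error loss.
library(mboost)
dat <- data.frame(y = as.numeric(y), Z = Z)
fit1 <- glmboost(y ~ ., data = dat, family = Gaussian(),
                 control = boost_control(mstop = 500, nu = 0.1))
u_tilde <- as.numeric(resid(fit1))     # predicted disturbances for step two
```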

Based on Kelejian and Prucha (1999), in the second step let \(\varvec{\tilde{u}} = \varvec{y} -\varvec{Z}\varvec{\tilde{\delta }}\) denote the predictor of \(\varvec{u}\) based on a consistent estimator \(\varvec{\tilde{\delta }}\). Define \(\varvec{\bar{u}} = \varvec{W}\varvec{u}\) and \(\varvec{\bar{\bar{u}}} = \varvec{W}\varvec{W}\varvec{u}\) as well as the corresponding expressions based on the predictors, \(\varvec{\tilde{\bar{u}}} = \varvec{W}\varvec{\tilde{u}}\) and \(\varvec{\tilde{\bar{\bar{u}}}} = \varvec{W}\varvec{W}\varvec{\tilde{u}}\), and adopt the identical notation pattern for the innovations. Then, if Assumptions 1 to 3 hold, the following three moments can be obtained

$$\begin{aligned} \mathbb {E}\left( \frac{1}{n} \varvec{\epsilon }^{\prime }\varvec{\epsilon } \right) = \sigma ^2 \quad \mathbb {E}\left( \frac{1}{n} \varvec{\bar{\epsilon }}^{\prime }\varvec{\bar{\epsilon }} \right) = \sigma ^2 \frac{1}{n} \text {tr}\left( \varvec{W}^{\prime }\varvec{W}\right) \quad \mathbb {E}\left( \frac{1}{n} \varvec{\bar{\epsilon }}^{\prime }\varvec{\epsilon } \right) = 0 \end{aligned}$$
(7)

where \(\text {tr}(\cdot )\) denotes the trace of a matrix. Since the innovations can be written in terms of \(\varvec{u}\), \(\varvec{\bar{u}}\) and \(\varvec{\bar{\bar{u}}}\) as \(\varvec{\epsilon } = \varvec{u} - \lambda \varvec{\bar{u}}\) and \(\varvec{\bar{\epsilon }} = \varvec{\bar{u}} - \lambda \varvec{\bar{\bar{u}}}\), a system of three equations can be obtained based on Eqs. 1 and 7

$$\begin{aligned} \varvec{\Gamma }[\lambda , \lambda ^2, \sigma ^2]^{\prime } - \varvec{\gamma } = 0. \end{aligned}$$
(8)

The expressions for \(\varvec{\Gamma }\) and \(\varvec{\gamma }\) are given as

$$\begin{aligned} \varvec{\Gamma } = \begin{bmatrix} \frac{2}{n} \varvec{u}^\prime \varvec{\bar{u}} & -\frac{1}{n} \varvec{\bar{u}}^\prime \varvec{\bar{u}} & 1 \\ \frac{2}{n} \varvec{\bar{\bar{u}}}^\prime \varvec{\bar{u}} & -\frac{1}{n} \varvec{\bar{\bar{u}}}^\prime \varvec{\bar{\bar{u}}} & \frac{1}{n} {\text {tr}}(\varvec{W}^\prime \varvec{W}) \\ \frac{1}{n} \left( \varvec{u}^\prime \varvec{\bar{\bar{u}}} + \varvec{\bar{u}}^\prime \varvec{\bar{u}} \right) & -\frac{1}{n} \varvec{\bar{u}}^\prime \varvec{\bar{\bar{u}}} & 0 \end{bmatrix} \quad \varvec{\gamma } = \begin{bmatrix} \frac{1}{n} \varvec{u}^\prime \varvec{u} \\ \frac{1}{n} \varvec{\bar{u}}^\prime \varvec{\bar{u}} \\ \frac{1}{n} \varvec{u}^\prime \varvec{\bar{u}} \end{bmatrix}. \end{aligned}$$

Replacing the moments in Eq. 8 by the corresponding sample moments yields

$$\begin{aligned} \varvec{G}[\lambda , \lambda ^2, \sigma ^2]^{\prime } - \varvec{g} = \varvec{\nu }(\lambda , \sigma ^2) \end{aligned}$$
(9)

where

$$\begin{aligned} \varvec{G} = \begin{bmatrix} \frac{2}{n} \varvec{\tilde{u}}^\prime \varvec{\tilde{\bar{u}}} & -\frac{1}{n} \varvec{\tilde{\bar{u}}}^\prime \varvec{\tilde{\bar{u}}} & 1 \\ \frac{2}{n} \varvec{\tilde{\bar{\bar{u}}}}^\prime \varvec{\tilde{\bar{u}}} & -\frac{1}{n} \varvec{\tilde{\bar{\bar{u}}}}^\prime \varvec{\tilde{\bar{\bar{u}}}} & \frac{1}{n} {\text {tr}}(\varvec{W}^\prime \varvec{W}) \\ \frac{1}{n} \left( \varvec{\tilde{u}}^\prime \varvec{\tilde{\bar{\bar{u}}}} + \varvec{\tilde{\bar{u}}}^\prime \varvec{\tilde{\bar{u}}} \right) & -\frac{1}{n} \varvec{\tilde{\bar{u}}}^\prime \varvec{\tilde{\bar{\bar{u}}}} & 0 \end{bmatrix} \quad \varvec{g} = \begin{bmatrix} \frac{1}{n} \varvec{\tilde{u}}^\prime \varvec{\tilde{u}} \\ \frac{1}{n} \varvec{\tilde{\bar{u}}}^\prime \varvec{\tilde{\bar{u}}} \\ \frac{1}{n} \varvec{\tilde{u}}^\prime \varvec{\tilde{\bar{u}}} \end{bmatrix} \end{aligned}$$

and \(\varvec{\nu }(\lambda , \sigma ^2)\) is interpreted as a \(3 \times 1\) vector of residuals. Based on Eq. 9, the non-linear least squares estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\) of \(\lambda\) and \(\sigma ^2\) are defined as

$$\begin{aligned} (\hat{\lambda }, \hat{\sigma }^2) = \underset{\lambda , \sigma ^2}{\text {arg min}} \left[ {\varvec{G}}[\lambda , \lambda ^2, \sigma ^2]^{\prime } - {\varvec{g}} \right] ^{\prime } \left[ {\varvec{G}}[\lambda , \lambda ^2, \sigma ^2]^{\prime } - {\varvec{g}} \right] . \end{aligned}$$
(10)
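A minimal sketch of this second step, assuming the first-step residuals u_tilde and the weight matrix W from above; the optimization bounds and starting values are illustrative choices.

```r
## Second step (Eqs. 9 and 10): sample moment matrices from the
## first-step residuals and non-linear least squares via optim().
kp_moments <- function(u, W) {
  n <- length(u)
  ub <- W %*% u; ubb <- W %*% ub       # Wu and WWu
  G <- rbind(
    c(2 / n * crossprod(u, ub), -1 / n * crossprod(ub, ub), 1),
    c(2 / n * crossprod(ubb, ub), -1 / n * crossprod(ubb, ubb),
      sum(diag(crossprod(W))) / n),    # tr(W'W) / n
    c(1 / n * (crossprod(u, ubb) + crossprod(ub, ub)),
      -1 / n * crossprod(ub, ubb), 0))
  g <- c(crossprod(u, u), crossprod(ub, ub), crossprod(u, ub)) / n
  list(G = G, g = g)
}
mom <- kp_moments(u_tilde, W)
obj <- function(par)                   # par = (lambda, sigma2)
  sum((mom$G %*% c(par[1], par[1]^2, par[2]) - mom$g)^2)
est <- optim(c(0, 1), obj, method = "L-BFGS-B",
             lower = c(-0.99, 1e-6), upper = c(0.99, Inf))
lambda_hat <- est$par[1]; sigma2_hat <- est$par[2]
```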

Let Assumptions 1 to 7 hold, where the additional Assumptions 6 and 7 are stated below. Then the non-linear least squares estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\) are consistent estimators of \(\lambda\) and \(\sigma ^2\) in the sense that \(\hat{\lambda } \rightarrow _p \lambda\) and \(\hat{\sigma }^2 \rightarrow _p \sigma ^2\) as \(n \rightarrow \infty\).

Assumption 6

Denote by \(\tilde{u}_{i}\) the i-th element of the vector \(\varvec{\tilde{u}}\). Assume that there exist finite-dimensional random vectors \(\varvec{d}_{i,n}\) and \(\varvec{\Delta }_n\) such that \(\left| \tilde{u}_{i} - u_{i} \right| \le \left\| \varvec{d}_{i,n} \right\| \left\| \varvec{\Delta }_n \right\|\), where \(\frac{1}{n} \sum _{i=1}^n \left\| \varvec{d}_{i,n} \right\| ^{2 + \zeta } = O_p(1)\) for some \(\zeta > 0\) and \(\sqrt{n} \left\| \varvec{\Delta }_n \right\| = O_p(1)\).

Assumption 7

Assume that the matrix \(\varvec{\Gamma }^{\prime } \varvec{\Gamma }\) is well-conditioned in the sense that its smallest eigenvalue is bounded away from zero. Specifically, \(\phi _{\min } \left( \varvec{\Gamma }^{\prime } \varvec{\Gamma } \right) \ge \phi ^{*} > 0\) where the constant \(\phi ^{*}\) may depend on \(\lambda\) and \(\sigma ^2\).

In the third step, \(\lambda\) and \(\sigma ^2\) in \(\varvec{\Omega }(\lambda , \sigma ^2)\) are replaced by the estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\) yielding \(\varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)\). Thus, the squared Mahalanobis distance and negative gradient become

$$\begin{aligned} \rho (\varvec{y}, \varvec{\eta }, \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2))&= (\varvec{y} - \varvec{\eta })^{\prime } \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)^{-1} (\varvec{y} - \varvec{\eta }) \end{aligned}$$
(11)
$$\begin{aligned} -\frac{\partial }{\partial \varvec{\eta }}\rho (\varvec{y}, \varvec{\eta }, \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2))&= 2\varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)^{-1} (\varvec{y} - \varvec{\eta }) \end{aligned}$$
(12)
$$\begin{aligned} \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)&= \hat{\sigma }^2 \left[ (\varvec{I} - \hat{\lambda }\varvec{W})^{\prime } (\varvec{I} - \hat{\lambda } \varvec{W})\right] ^{-1}. \end{aligned}$$
(13)

Finally, replacing the expressions in Algorithm 1 by the corresponding plug-in expressions in Eqs. 11 to 13 yields a feasible model-based gradient boosting algorithm for the SDEM (Kelejian and Prucha 1999; Kapoor et al. 2007).
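Chaining the three steps together then amounts to a plug-in step, reusing the helper functions sketched above:

```r
## Third step: plug the estimates into Eq. 13 and rerun the boosting
## loop with the estimated weighting matrix.
Oinv_hat  <- omega_inv(lambda_hat, sigma2_hat, W)
delta_hat <- boost_sdem(y, Z, Oinv_hat, mstop = 500, nu = 0.1)
```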

While convergence has been formally proven for model-based gradient boosting with the squared error loss function, the result does not directly translate to model-based gradient boosting with the squared Mahalanobis distance (Zhang and Yu 2005; Bühlmann 2006; Bühlmann and Hothorn 2007). Although no proof exists for more general loss functions, empirical results confirm a similar convergence behavior with regard to the maximum likelihood estimator if the negative log-likelihood of a generalized linear model is utilized as the loss function (Hepp et al. 2016). A similar behavior can be observed in model-based gradient boosting for the SDEM, as demonstrated in Appendix C.1.

2.3 Post-hoc Deselection

In model-based gradient boosting, the standard approach to model and variable selection is early stopping via the stopping criterion \(m_{\text {stop}}\). Generally, \(m_{\text {stop}}\) is chosen by means of random cross-validation, subsampling or bootstrapping. However, random cross-validation techniques cannot be applied effectively to spatially dependent data because random partitions yield test data in which the observations are spatially close to the observations in the training data. This can lead to underestimation of the prediction error because the underlying spatial autocorrelation is neglected, and thereby to unreliable variable as well as model selection (Schratz et al. 2019).

To address spatial autocorrelation, spatial cross-validation techniques such as those proposed by Brenning (2012) partition the data into spatially disjoint subsets using clustering algorithms based on geodesic distances. Alternatively, spatial blocking techniques divide the geographical region into spatial blocks or buffer zones to create independent folds (Valavi et al. 2019). However, standard and spatial cross-validation techniques often select too many variables, resulting in less parsimonious models (Mayr et al. 2012). Furthermore, the quality of the non-linear least squares estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\) depends on the quality of the predictors of \(\varvec{u}\), where the inclusion of non-informative variables severely impacts the final estimators. In fact, the additional regularization from early stopping can induce severe biases into the non-linear least squares estimators, which can result in misleading model as well as variable selection.

To mitigate these consequences and obtain sparser final models, the deselection algorithm proposed by Strömer et al. (2022) is applied to model-based gradient boosting for the SDEM. The basic idea of deselection is to run model-based gradient boosting once, determine the optimal number of boosting iterations \(m_{\text {opt}}\) via spatial cross-validation techniques and then quantify the contribution of each variable to the overall risk reduction achieved up to \(m_{\text {opt}}\). Variables that contribute little to reducing the empirical risk are subsequently removed, and model-based gradient boosting is reapplied using only the remaining variables and the previously determined \(m_{\text {opt}}\). Formally, let \(\mathbbm {1}(\cdot )\) be the indicator function, \(r^{[m]}\) the empirical risk under the squared Mahalanobis distance after iteration m, \(\left( r^{[m-1]} - r^{[m]}\right)\) the corresponding risk reduction and \(j^{{*}^{[m]}}\) the best performing base-learner in iteration m of the model-based gradient boosting algorithm. The attributable risk reduction for base-learner j after \(m_{\text {opt}}\) boosting iterations is defined as

$$\begin{aligned} R_j = \sum _{m = 1}^{m_{\text {opt}}} \mathbbm {1}\left\{ j = j^{{*}^{[m]}}\right\} \left( r^{[m-1]} - r^{[m]}\right) , \quad j = 1,\dots , q. \end{aligned}$$
(14)

In principle, the attributable risk reduction \(R_j\) measures how much of the total reduction in empirical risk can be attributed to base-learner, and hence variable, j across the \(m_{\text {opt}}\) boosting iterations. The total risk reduction is given by \(\left( r^{[0]} - r^{[m_{\text {opt}}]}\right)\), and base-learner j is deselected if its contribution falls below a pre-specified threshold \(\tau \in (0,1)\), that is, if

$$\begin{aligned} R_j < \tau \left( r^{[0]} - r^{[m_{\text {opt}}]}\right) . \end{aligned}$$
(15)

Since \(R_j\) captures the risk reduction which can be attributed to each base-learner j, Eq. 15 removes base-learners for which \(R_j\) is smaller than a fraction of the total risk reduction. Thus, variables remain in the model only if their relative risk contribution is equal to or larger than the threshold \(\tau\). The choice of the threshold \(\tau\) is therefore crucial and usually depends on the specific research context. In particular, \(\tau\) controls the sparsity of the final model, where smaller values of \(\tau\) retain more variables while larger values yield more aggressive deselection. Simulation results reported in Appendix C.3 suggest that only relatively small thresholds \(\tau \le 0.025\) are appropriate for the SDEM (Strömer et al. 2022). A summary of the deselection approach for model-based gradient boosting for the SDEM is given in Algorithm 2.
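A minimal sketch of this deselection rule, assuming the boosting run records the selected base-learner index per iteration in `selected` and the empirical risk per iteration in `risk`, with `risk[1]` corresponding to \(r^{[0]}\):

```r
## Deselection rule of Eqs. 14 and 15.
deselect <- function(selected, risk, m_opt, q, tau = 0.01) {
  reduction <- risk[1:m_opt] - risk[2:(m_opt + 1)]   # r^[m-1] - r^[m]
  R_j <- sapply(seq_len(q), function(j)
    sum(reduction[selected[1:m_opt] == j]))          # attributable reduction R_j
  total <- risk[1] - risk[m_opt + 1]                 # r^[0] - r^[m_opt]
  which(R_j >= tau * total)                          # base-learners to keep
}
```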

Algorithm 2 Model-based gradient boosting for the spatial Durbin error model (SDEM) with additional deselection of variables

3 Simulation Study

3.1 Study Design

To evaluate the performance of the proposed three-step model-based gradient boosting algorithm, simulation studies are conducted with the ingredients derived in Section 2. In particular, the performance of estimation, variable selection and prediction in low- as well as high-dimensional settings is evaluated. Additionally, an evaluation of the performance of the deselection algorithm is provided. Specifically, the number of observations is fixed at \(n = 400\), while the number of independent variables is varied between \(q = 20\) and \(q = 800\), yielding a low-dimensional \((n > q)\) and a high-dimensional \((n < q)\) setting. The true data generating process is given by

$$\begin{aligned} \begin{aligned} \varvec{y}&= 1 + 3.5\varvec{X}_1 -2.5 \varvec{X}_2 -4 \varvec{W} \varvec{X}_1 + 3 \varvec{W} \varvec{X}_2 + \varvec{u} \\ \varvec{u}&= \lambda \varvec{W}\varvec{u} + \varvec{\epsilon } \end{aligned} \end{aligned}$$

where the variables are independently and identically drawn from the uniform distribution \(\varvec{X} \sim U(-2,2)\). The spatial autoregressive parameter is varied throughout the simulation study over \(\lambda \in \{-0.8,-0.6,-0.4,-0.2,0.2,0.4,0.6,0.8\}\) and the innovations are normally distributed according to \(\varvec{\epsilon } \sim N(0,\sigma ^2)\) with \(\sigma ^2 = 1\). The spatial weight matrix \(\varvec{W}\) is identical to the one in the application in Section 4 and has a 10-nearest-neighbor structure where each location is connected to its ten geographically closest neighbors based on the centroids of the spatial polygons available in the shape file. Afterward, \(\varvec{W}\) is row-normalized such that each row sums to one. Additionally, an artificial spatial weight matrix can be generated based on a circular world following Kelejian and Prucha (1999) in which each location is directly related to the five locations before and after it, that is, \(k = 5\). Simulation studies for varying spatial autoregressive parameters in a circular world are given in Appendix C.4 and for a varying number of related locations \(k \in \{1,2,3,5,10,20\}\) in a circular world in Appendix C.5.
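The following sketch shows how both weight matrices could be constructed in R. The 10-nearest-neighbor matrix relies on the spdep package and assumes that `coords` holds the polygon centroids, while `circular_W()` implements the circular world.

```r
## Row-normalized 10-nearest-neighbor matrix via spdep.
library(spdep)
W10 <- nb2mat(knn2nb(knearneigh(coords, k = 10)), style = "W")

## Circular world of Kelejian and Prucha (1999): each of the n
## locations is connected to the k locations before and after it.
circular_W <- function(n, k = 5) {
  W <- matrix(0, n, n)
  for (i in 1:n) {
    nbrs <- ((i - 1 + c(-(k:1), 1:k)) %% n) + 1      # wrap around the circle
    W[i, nbrs] <- 1
  }
  W / rowSums(W)                                     # row-normalization
}
```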

In the model-based gradient boosting algorithm, the base-learners are specified as simple linear regression models due to the nature of the data generating process. The learning rate is set to \(s = 0.1\), as is common practice (see, for example, Schmid and Hothorn 2008; Mayr et al. 2012; Hofner et al. 2014). The optimal stopping criterion \(m_{\text {opt}}\) is found by minimizing the empirical risk via 10-fold spatial cross-validation as proposed in Brenning (2012). In each simulation setting, a total of \(n_{\text {sim}} = 100\) repetitions is conducted. Additionally, different approaches for the consistent estimator \(\varvec{\tilde{\delta }}\) in the first step are considered, along with their impact on the final results. Specifically, the reported methods are first step OLS (LS-GB), first step gradient boosting (GB-GB), and first step gradient boosting with deselection (DS-GB) where applicable. Furthermore, model-based gradient boosting with first step gradient boosting with deselection and additional deselection (DS-DS) is considered. In this notation, the component before the hyphen indicates the method utilized in the first step and the component after the hyphen indicates the method applied for the SDEM. A practitioner's note summarizing the presented model-based gradient boosting algorithm as well as deselection with different first step methods can be found in Appendix B.
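A hedged sketch of this tuning step is given below, using partition_kmeans() from the sperrorest package to form the ten spatially clustered folds. Here, fit_sdem() and risk_sdem() are hypothetical wrappers around the routines sketched in Section 2, the fold objects are assumed to expose train and test indices, and the naive per-fold subsetting of W is a simplification.

```r
## Tuning m_stop by 10-fold spatial cross-validation (Brenning 2012).
library(sperrorest)
folds <- partition_kmeans(data.frame(x = coords[, 1], y = coords[, 2]),
                          coords = c("x", "y"), nfold = 10)[[1]]
grid <- seq(10, 1000, by = 10)
cv_risk <- sapply(grid, function(m)
  mean(sapply(folds, function(f) {
    # fit_sdem() / risk_sdem() are hypothetical wrappers, not package functions
    fit <- fit_sdem(y[f$train], Z[f$train, ], W[f$train, f$train], mstop = m)
    risk_sdem(y[f$test], Z[f$test, ], W[f$test, f$test], fit)
  })))
m_opt <- grid[which.min(cv_risk)]
```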

Regarding the performance of variable selection and deselection, the criteria are chosen based on the confusion matrix. In particular, the reported variable selection criteria are the true positive rate (TPR), which is the proportion of correctly selected variables out of all true informative variables, the true negative rate (TNR), which is the proportion of correctly non-selected variables out of all true non-informative variables and the false discovery rate (FDR), which is the proportion of non-informative variables in the set of all selected variables (Stehman 1997).

The performance of estimation is evaluated by reporting the bias, the mean squared error (MSE) and the empirical standard error (ESE) for \(\lambda\) defined as

$$\begin{aligned} \text {Bias}&= \frac{1}{n_{\text {sim}}} \sum _{i = 1}^{n_{\text {sim}}} \left( \hat{\lambda }_i - \lambda \right) \\ \text {MSE}&= \frac{1}{n_{\text {sim}}} \sum _{i = 1}^{n_{\text {sim}}} (\hat{\lambda }_i - \lambda )^2 \\ \text {ESE}&= \sqrt{\frac{1}{n_{\text {sim}} - 1} \sum _{i = 1}^{n_{\text {sim}}} (\hat{\lambda }_i - \bar{\lambda })^2} \end{aligned}$$

where \(\bar{\lambda }\) denotes the average of the estimates \(\hat{\lambda }_i\) over the repetitions. Values close to zero are preferred for the bias, and lower values for the MSE and ESE. Additionally, the effects of shrinkage estimation on the independent variables are evaluated visually via boxplots, which highlight the median, quartiles and outliers over the 100 repetitions (Morris et al. 2019).

Furthermore, the prediction accuracy is evaluated based on an additional test data set. The test data \(\varvec{y}_{\text {test}}\) is generated according to the same data generating process as the training data with an identical number of observations \(n_{\text {test}} = 400\). The chosen criteria are the quasi negative log-likelihood (NLL), the root mean squared error of prediction (RMSEP) and the mean absolute error of prediction (MAEP) defined as

$$\begin{aligned} \text {RMSEP}&= \sqrt{ \frac{1}{n_{\text {test}}} \sum _{i=1}^{n_{\text {test}}} (y_{\text {test},i} - \hat{y}_{\text {test},i})^2} \\ \text {MAEP}&= \frac{1}{n_{\text {test}}} \sum _{i=1}^{n_{\text {test}}} |y_{\text {test},i} - \hat{y}_{\text {test},i}| \\ \text {NLL}&= \frac{n_{\text {test}}}{2} \left( \log (2\pi \hat{\sigma }^2) + 1 \right) - \log \left| \varvec{I} - \hat{\lambda } \varvec{W} \right| \\&\quad + \frac{1}{2\hat{\sigma }^2} (\varvec{y}_{\text {test}} - \varvec{\hat{\eta }}_{\text {test}})^{\prime } (\varvec{I} - \hat{\lambda } \varvec{W})^{\prime } (\varvec{I} - \hat{\lambda } \varvec{W})(\varvec{y}_{\text {test}} - \varvec{\hat{\eta }}_{\text {test}}). \end{aligned}$$

For all proposed performance criteria of prediction accuracy, lower values are always preferred.
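As an illustration, the prediction criteria could be computed as follows, assuming y_test, the fitted test predictor eta_hat, the test weight matrix W_test and the second-step estimates are available; all object names are illustrative.

```r
## Prediction criteria on test data (RMSEP, MAEP and quasi NLL).
rmsep <- sqrt(mean((y_test - eta_hat)^2))
maep  <- mean(abs(y_test - eta_hat))
A <- diag(length(y_test)) - lambda_hat * W_test      # I - lambda_hat W
r <- y_test - eta_hat
nll <- length(y_test) / 2 * (log(2 * pi * sigma2_hat) + 1) -
  as.numeric(determinant(A)$modulus) +               # log|I - lambda_hat W|
  drop(crossprod(A %*% r)) / (2 * sigma2_hat)        # quadratic form
```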

The simulation study is conducted in the programming language R (R Core Team 2025). The QML and GMM estimation of the SDEM in the low-dimensional setting is performed via the spatialreg package (Gay 1990; Bivand et al. 2021; Pebesma and Bivand 2023). The presented graphics are created with the tidyverse packages (Wickham et al. 2019). Model-based gradient boosting for generalized, additive and interaction models can be found in the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner et al. 2014, 2015). Spatial cross-validation is performed via the sperrorest package (Brenning 2012). An implementation for model-based gradient boosting for the SDEM via the novel spatial error family incorporating the deselection algorithm and the R code for reproducibility of all simulation studies can be found in the GitHub repository https://github.com/micbalz/SpatRegBoost.

3.2 Results

3.2.1 Low-dimensional Linear Setting (n = 400, q = 20)

In Table 1, the average selection rates over 100 repetitions for the low-dimensional setting are presented. The results show a consistent TPR of \(100\%\) across all spatial autoregressive parameters \(\lambda\), indicating that all informative variables are selected on average. In contrast, the TNR varies substantially with \(\lambda\). For strongly negative spatial dependence \(\lambda = -0.8\), the TNR is relatively low at approximately \(46\%\), meaning that only about eight out of 16 non-informative variables are correctly excluded on average. As \(\lambda\) increases toward positive values, the TNR steadily improves, reaching over \(90\%\) at \(\lambda = 0.8\) and indicating that nearly all non-informative variables are correctly not selected. Despite this improvement in TNR, the FDR remains considerable for small or negative values of \(\lambda\). At \(\lambda = -0.8\), the FDR is around \(66\%\), indicating that two-thirds of the selected variables are non-informative on average. As \(\lambda\) increases, the FDR gradually decreases, reaching approximately \(21\%\) for \(\lambda = 0.8\). This suggests that even though the model correctly identifies all informative variables, it includes a non-negligible number of non-informative variables, especially under weak or negative spatial dependence, although this issue becomes substantially less pronounced for large positive \(\lambda\).

Table 1 Average selection rates in the low-dimensional setting with 100 repetitions, the spatial error family, model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters \(\lambda\)

The performance of estimation for the linear effects in the low-dimensional setting can be seen in Fig. 1. The estimated coefficients are presented as boxplots where the red horizontal lines represent the true coefficient values according to the data generating process. Across all spatial autoregressive parameters \(\lambda\), DS-GB manages to recover all informative variables. However, the coefficient estimates are slightly biased across all values of \(\lambda\). In general, the magnitude of the bias is stronger for the spatial lags of the exogenous variables and increases further as \(\lambda\) increases. However, a bias in the estimates for DS-GB is unsurprising and indeed expected since the presented estimates for the linear effects are based on regularization via early stopping. In contrast, non-informative variables were rarely chosen, with coefficient estimates clustered around zero.

Fig. 1 Estimated linear effects for the low-dimensional setting with 100 repetitions, the spatial error family, model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters \(\lambda\). Horizontal red lines represent the true values

Additionally, the impact of the choice of the first step estimation method on the estimation accuracy of the spatial autoregressive parameter \(\lambda\) is evaluated. The results are presented in Table 2. Across almost all values of \(\lambda\), the established QML and GMM estimators struggle to estimate \(\lambda\) reliably, as indicated by substantial downward biases. However, the magnitude of this bias tends to decrease as \(\lambda\) becomes more positive, with the best performance observed at \(\lambda = 0.8\). Notably, QML generally outperforms GMM for positive values of \(\lambda\) in terms of lower bias, MSE and ESE. The results for LS-GB, which utilizes OLS in the first step, are identical to those for GMM as both rely on the same initial estimator of \(\varvec{\delta }\). Substantial improvements are observed when model-based gradient boosting is employed in the first step. Specifically, the GB-GB approach leads to notable reductions in bias, MSE and ESE. These improvements are further enhanced when model-based gradient boosting is combined with deselection via DS-GB. Overall, GB-GB and DS-GB outperform or perform as well as the QML and GMM estimators across the entire range of \(\lambda\).

Table 2 Estimation performance for the spatial autoregressive parameter \(\lambda\) in the low-dimensional setting with 100 repetitions and the spatial error family

Finally, the prediction performance on an independent test data set for the low-dimensional setting is evaluated. The results are presented in Table 3. Across all values of the spatial autoregressive parameter \(\lambda\), the model-based gradient boosting algorithms outperform the classical approaches in terms of predictive accuracy. Specifically, the RMSEP, MAEP and NLL are higher for QML and GMM compared to the model-based gradient boosting algorithms. DS-GB typically achieves the lowest NLL and slightly better RMSEP and MAEP, although the differences between LS-GB, GB-GB, and DS-GB are minor and occur mostly at the decimal level. This indicates that the predictive performance of the model-based gradient boosting algorithms is stable and robust across different values of \(\lambda\), whereas QML and GMM show clear performance degradation, especially as the spatial dependence increases.

Table 3 Prediction performance on independent test data for the low-dimensional linear setting with 100 repetitions and the spatial error family across different spatial autoregressive parameters \(\lambda\)

In general, the presented simulation results show proper functionality of the model-based gradient boosting algorithms from the perspective of specificity and sensitivity in the low-dimensional setting across all considered spatial autoregressive parameters \(\lambda\). This conclusion can be drawn from the average selection rates, where all true informative variables are selected across all 100 repetitions on average. Furthermore, the estimated linear effects show the correct direction and algebraic sign. Nevertheless, biases are introduced due to the inherent regularization of model-based gradient boosting via early stopping. Although the average selection rates show a reduced TNR and a considerable FDR, the actual impact in terms of estimated coefficients for non-informative variables remains very low, meaning that the included non-informative variables enter the final model only with small values. To provide intuition regarding the consistency and convergence of the proposed model-based gradient boosting algorithm, results are presented through a comparison with GMM coefficient estimates. These can be recovered by utilizing a sufficiently large stopping criterion, as demonstrated in Appendix C.1. Additionally, model-based gradient boosting either outperforms or performs as well as the established QML and GMM estimators in estimating \(\lambda\) when substantial noise is introduced through non-informative variables. As a result, the QML and GMM estimators exhibit a downward bias in the presence of such noise. This outcome is not unexpected as spatial autocorrelation typically decreases when informative variables are added to the model. However, the simulation study also indicates that introducing non-informative noise can falsely reduce the estimated spatial autocorrelation, which has serious implications for model interpretation as it may lead to highly misleading conclusions. Utilizing model-based gradient boosting can effectively mitigate these consequences by not including non-informative variables, thereby improving the quality of the residuals in the first step and decreasing the bias across all values of \(\lambda\). While model-based gradient boosting outperforms the QML and GMM estimation strategies in terms of predictive accuracy on unobserved test data, a notable drawback is its relatively high FDR, indicating that the final model may still include many non-informative variables. However, its ability to exclude such variables improves drastically as \(\lambda\) increases. Despite this improvement, the persistently high FDR suggests that model complexity remains an issue even under stronger spatial dependence.

3.2.2 High-dimensional Linear Setting (n = 400, q = 800)

Similar to the low-dimensional setting, results are also evaluated for the high-dimensional setting. In this case, only the GB-GB and DS-GB approaches are reported as model-based gradient boosting remains the only viable option among the considered methods when the number of variables exceeds the number of observations. The average selection rates across 100 repetitions are shown in Table 4. Across all values of the spatial autoregressive parameter \(\lambda\), the TPR remains at \(100\%\), indicating that all informative variables are reliably selected on average. As in the low-dimensional setting, the TNR is high, suggesting that most non-informative variables are successfully excluded. Notably, the TNR improves with increasing \(\lambda\), reaching approximately \(99.67\%\) at \(\lambda = 0.8\). This reflects a strong ability of the method to discriminate between informative and non-informative variables in high-dimensional settings, particularly under strong positive spatial dependence. Despite these improvements, the FDR remains high for lower values of \(\lambda\), though it steadily decreases as \(\lambda\) increases. At \(\lambda = -0.8\), nearly all selected variables are non-informative on average, whereas this proportion drops substantially to about \(28\%\) at \(\lambda = 0.8\). While these FDR values may seem high, they are not unexpected given the small number of informative variables relative to the large number of non-informative variables.

Table 4 Average selection rates in the high-dimensional setting with 100 repetitions, the spatial error family, model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters \(\lambda\)

The estimation performance of the coefficients in the high-dimensional setting is illustrated in Fig. 2. Compared to the low-dimensional setting, the coefficient estimates exhibit a more pronounced shrinkage effect. Furthermore, the shrinkage effect intensifies as the spatial autoregressive parameter \(\lambda\) increases and becomes positive. Nevertheless, the algebraic signs and general direction of the coefficients remain correct even in the presence of many non-informative variables. In contrast, non-informative variables were again rarely selected, with coefficient estimates clustered around zero.

Fig. 2 Estimated linear effects for the high-dimensional setting with 100 repetitions, the spatial error family, model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters \(\lambda\). Horizontal red lines represent the true values

Furthermore, the estimation performance for the spatial autoregressive parameter \(\lambda\) in the high-dimensional setting is shown in Table 5. In this scenario, classical methods such as QML, GMM, and LS-GB are no longer applicable due to the high dimensionality of the predictor space. Consequently, only the model-based gradient boosting algorithms are considered as feasible estimation strategies. The results reveal a clear divergence in estimation behavior between the two algorithms. For GB-GB, the bias in the estimation of \(\hat{\lambda }\) increases steadily with the magnitude of \(\lambda\). This trend likely stems from the compounded effect of shrinkage on informative variables and the influence of a large number of non-informative variables. Since GB-GB does not actively remove non-informative variables, the residuals on which the estimation of \(\lambda\) heavily depends are more likely to be contaminated by noise leading to underestimation of \(\lambda\). In contrast, DS-GB demonstrates substantially lower bias across all values of \(\lambda\) with particularly notable improvements at higher spatial dependence. This improved performance is attributable to the deselection algorithm in the first step which effectively removes non-informative variables, thereby mitigating the adverse impact of noise induced by shrinkage. As a result, DS-GB achieves not only lower bias but also smaller MSE and ESE, enabling more accurate recovery of the true spatial autoregressive parameter in high-dimensional settings.

Table 5 Estimation performance for the spatial autoregressive parameter \(\lambda\) in the high-dimensional linear setting with 100 repetitions and the spatial error family

Finally, the predictive performance on independent test data in the high-dimensional setting is reported in Table 6. While the observed performance is slightly more variable than in the low-dimensional setting, the overall magnitude of the RMSEP, MAEP and NLL remains comparable. Notably, the predictive accuracy of DS-GB improves as the spatial autoregressive parameter \(\lambda\) increases which mirrors the pattern of reduced bias in the estimation of \(\hat{\lambda }\) discussed previously. This improvement is particularly evident in the NLL, where DS-GB achieves considerably better performance than GB-GB for higher values of \(\lambda\). However, a direct comparison to QML or GMM remains infeasible in this setting, reinforcing the advantage of model-based gradient boosting approaches.

Table 6 Prediction performance on independent test data for the high-dimensional linear setting with 100 repetitions and the spatial error family across different spatial autoregressive parameters \(\lambda\)

The results for the high-dimensional setting support those of the low-dimensional setting and indicate proper functionality of the model-based gradient boosting algorithm from the perspective of specificity and sensitivity across all spatial autoregressive parameters \(\lambda\). The TPR remains at \(100\%\) across all 100 repetitions on average. Although the estimated coefficients are more strongly affected by shrinkage, the algebraic sign and general direction remain correct. GB-GB exhibits a strong bias as \(\lambda\) increases, indicating that deselection in the first step is necessary to reliably estimate \(\lambda\) in high-dimensional settings. Nevertheless, the great advantage is the feasibility of model-based gradient boosting in scenarios where the number of variables exceeds the number of observations. In such situations, established estimation strategies like QML and GMM fail entirely, as evidenced in the simulation study by missing values. Moreover, the predictive performance as measured by the evaluation criteria remains comparable in absolute terms to that of the low-dimensional setting. However, the results also reveal a high FDR in the evaluation of the variable selection performance. Notably, no distinction is made between variables that are frequently selected and those that are selected only once, meaning all variables are weighted equally in the computation of the FDR. To address this limitation, the post-hoc deselection algorithm proposed by Strömer et al. (2022) is utilized to improve the FDR of the model-based gradient boosting algorithm by removing variables that are likely selected merely by chance.

3.2.3 Deselection

As evidenced by the results for the low- as well as the high-dimensional setting, model-based gradient boosting for the SDEM suffers from a “greedy” selection behavior which contributes to the inclusion of too many non-informative variables, thereby decreasing the TNR and increasing the FDR. The reason for the “greedy” nature lies in the fact that the solutions of model-based gradient boosting are optimal with respect to the \(L_1\)-arc-length. In fact, for any convex loss function, model-based gradient boosting approximates the solution path of a strictly monotone \(L_1\)-regularized regression model. Therefore, model-based gradient boosting is unable to automatically deselect a variable once it has been added to the model (Hastie et al. 2007; Hepp et al. 2016). To mitigate these consequences, the deselection algorithm for generalized linear, additive and interaction models (Strömer et al. 2022) is adapted to model-based gradient boosting for the SDEM and evaluated in the following simulations.

Table 7 Average selection rates for model-based gradient boosting with first step gradient boosting with deselection and additional deselection (DS-DS) in the low- and high-dimensional setting with 100 repetitions, spatial error family across different spatial autoregressive parameters \(\lambda\)

The average selection rates after utilizing model-based gradient boosting with first step gradient boosting with deselection and an additional deselection step (DS-DS) are presented in Table 7. Indeed, the results clearly show that the TPR remains at \(100\%\) while the FDR decreases to \(0\%\) on average. Thus, DS-DS successfully avoids selecting non-informative variables over all 100 repetitions, ensuring that only informative variables are included in the final model. This behavior is observed in the low- as well as the high-dimensional setting. Therefore, the deselection algorithm is able to mitigate the consequences of the inclusion of additional non-informative variables in the boosted SDEM. In particular, the “greedy” nature of model-based gradient boosting can be effectively counteracted through the use of deselection. Thus, Algorithm 2 provides a systematic approach for post-hoc deselection of variables that would otherwise remain in the model due to the “greedy” selection of model-based gradient boosting. Moreover, deselection helps to address the limitations of regularization via early stopping, as it allows model-based gradient boosting to be reapplied after drastically reducing the number of variables. Utilizing model-based gradient boosting in combination with deselection is therefore strongly encouraged to ensure proper model and variable selection properties.

4 Case Study: Modeling Life Expectancy in German Districts

Life expectancy, defined as the average number of years a newborn in a given population is expected to live under current mortality conditions, is a statistical measure widely used to assess a variety of health, social, and economic outcomes (Marmot 2005; Cutler et al. 2006). Over the past several decades, life expectancy has been steadily increasing across Europe, indicating progress in healthcare, socioeconomic development, and public health interventions. Notably, countries in the European Union report an increase in longevity due to improved living standards, better medical care, and lower mortality from major diseases such as cardiovascular conditions and cancer. Nevertheless, large regional disparities remain between and within countries. For instance, Germany, one of the largest and economically wealthiest nations in the European Union, has seen signs of stagnation in life expectancy, especially in comparison to Northern European countries. This stagnation gives rise to important concerns about regional inequalities, healthcare system efficiency, lifestyle risks, and demographic changes (OECD and European Commission 2024). In this case study, the goal is to model the life expectancy in German districts by utilizing spatial regression models with autoregressive disturbances to answer questions regarding underlying socio-economic and other determinants of life expectancy while incorporating district-level geographical disparities. In principle, the case study builds on recent previous research such as Lampert et al. (2019), Rau and Schmertmann (2020), Siegel et al. (2022), Jasilionis et al. (2023), Marinetti et al. (2023), and Hoebel et al. (2025), which investigates current trends, socio-economic drivers of life expectancy, the impact of the COVID-19 pandemic and patterns of mortality from both district-level and whole-country perspectives.

To this end, a large-scale real-world data base, namely INKAR, is utilized. Managed by the Bundesinstitut für Bau-, Stadt- und Raumforschung, INKAR is an interactive online atlas of living conditions in Germany. It contains over 600 unique socio-demographic, socio-economic, and environmental indicators for distinct geographical locations, allowing for evaluations of urban and rural disparities. In principle, INKAR is a panel data set relying on information starting from the year 1995. In this case study, however, the focus is on the cross-sectional life expectancy in German districts in the year 2019. Therefore, the 2021 version of INKAR is utilized (Bundesinstitut für Bau-, Stadt- und Raumforschung (BBSR) 2024). The INKAR data set in the current as well as the 2021 version, together with the codebook for all indicators, is freely available at https://www.inkar.de/.

To construct a data set suitable for this case study, the following preprocessing steps are necessary. First, the original number of districts available in INKAR is 401. However, the data is merged with the German map data from the year 2024 available at https://www.bkg.bund.de/. The map data in the most recent version includes only 400 districts because the district “Eisenach” was merged with a neighboring district in 2022 (Bundesamt für Kartographie und Geodäsie (BKG) 2024). To ensure comparability between both data sets, “Eisenach” is thus removed from the INKAR 2021 data, resulting in 400 districts of interest. Second, a spatial weight matrix has to be generated based on the specific spatial configuration of the German districts. Therefore, the centroids of the spatial polygons based on the shape file layer from the German map data are computed. To keep a close connection to the simulation study in Section 3, a 10-nearest-neighbor structure is created, meaning that each district is assigned its ten geographically closest neighbors. Afterward, the resulting spatial weight matrix is row-normalized. Third, since the focus is primarily on the evaluation of variable selection, the indicators are allowed to be quite general. Thus, no domain knowledge is imposed in the selection of variables. However, dubious indicators such as the identifier number of a district as well as indicators describing similar phenomena are removed before model estimation. Furthermore, continuous variables are centered and scaled to ensure comparability across variables in terms of scale.
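A sketch of these preprocessing steps in R is given below, using the sf and spdep packages; the file name, join key and column layout are illustrative placeholders rather than the actual INKAR structure, and all remaining columns are assumed to be numeric indicators.

```r
## Preprocessing sketch: merge map and indicators, build W, scale.
library(sf)
library(spdep)
map   <- st_read("VG250_KRS.shp")                # district polygons (BKG)
inkar <- subset(inkar, district != "Eisenach")   # align with the 400-district map
dat   <- merge(map, inkar, by = "district_id")   # hypothetical join key
coords <- st_coordinates(st_centroid(st_geometry(dat)))
W <- nb2mat(knn2nb(knearneigh(coords, k = 10)), style = "W")
ind <- st_drop_geometry(dat)                     # indicator table
ind[-1] <- scale(ind[-1])                        # center and scale covariates
```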

The final data set is composed of 400 observations, where each location corresponds to a district in Germany. These districts are divided into 294 rural districts and 106 urban cities. A map of the life expectancy in German districts in the year 2019 is shown in Fig. 3. The map reveals distinct disparities in life expectancy across German districts with clear spatial clustering among neighboring regions. Notably, the highest life expectancy is observed in southern Bavaria near the borders with Austria and Switzerland. Additional areas of high life expectancy can be found in Lower Saxony close to the Dutch border. In contrast, life expectancy in the new federal states (former East Germany) is generally lower than in the old federal states (former West Germany), indicating a clear east-west divide. Details on the data types and descriptions of the dependent and independent variables of interest are provided in Table 8.

Fig. 3 Life expectancy in German districts in the year 2019

Table 8 Data type and description of dependent and independent variables for German district data

As evidenced by Fig. 3, ignoring spatial autocorrelation in the estimation process may lead to a misrepresentation of the underlying determinants influencing life expectancy. Therefore, the objective is to model life expectancy in German districts for the year 2019 using spatial regression models with autoregressive disturbances. In particular, the SDEM is employed, as spatial autocorrelation may arise not only in the disturbances but also through spatial lags of the exogenous variables. For example, the number of medical doctors in neighboring districts may influence life expectancy given the relatively short distances between locations. The results are presented with a clear focus on model-based gradient boosting. Since this application constitutes a low-dimensional setting, results are reported for the QML, GMM, LS-GB, GB-GB, DS-GB and DS-DS, defined exactly as in the simulation study in Section 3. All exogenous variables presented in Table 8 as well as the corresponding first-order spatial lags and an intercept are included in the estimation, yielding 65 coefficients to estimate in a non-parsimonious model. Regarding the setup of the model-based gradient boosting algorithm, the simulation study is followed and the learning rate is set to \(s = 0.1\). Furthermore, the stopping criterion \(m_{\text {stop}}\) is optimized by minimizing the empirical risk via 10-fold spatial cross-validation as discussed in Brenning (2012). For the deselection algorithm, \(\tau\) is set to 0.01. The estimated coefficients for all estimation strategies can be seen in Table 9.
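A minimal sketch of the feasible two-step estimation is given below. The spatial error family SpatialError() is a hypothetical stand-in for the family provided in the accompanying SpatRegBoost repository, the data frame d is a placeholder, and the Kelejian and Prucha (1999) first step is obtained here via GMerrorsar() from the spatialreg package. Plain 10-fold cross-validation is shown for simplicity, whereas the actual analysis uses spatially blocked folds in the sense of Brenning (2012).

```r
library(mboost)
library(spatialreg)  # GM estimator of Kelejian and Prucha (1999)

lw <- spdep::nb2listw(knn10, style = "W")

# First step: estimate lambda and sigma^2 from a pilot spatial error model
pilot      <- GMerrorsar(life_expectancy ~ ., data = d, listw = lw)
lambda_hat <- pilot$lambda
sigma2_hat <- pilot$s2

# Second step: model-based gradient boosting with the estimated nuisance
# parameters plugged into the loss (SpatialError() is hypothetical)
mod <- glmboost(life_expectancy ~ ., data = d,
                family  = SpatialError(W = W, lambda = lambda_hat,
                                       sigma2 = sigma2_hat),
                control = boost_control(mstop = 1000, nu = 0.1))

# Optimize m_stop by minimizing the empirical cross-validated risk;
# spatially blocked folds can be supplied via a custom fold matrix
cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "kfold", B = 10))
mod <- mod[mstop(cvr)]
```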

Table 9 Coefficient estimates in German district data for quasi-maximum likelihood (QML), generalized method of moments (GMM), model-based gradient boosting with first step OLS (LS-GB), gradient boosting (GB-GB), gradient boosting with deselection (DS-GB) and gradient boosting with deselection and additional deselection (DS-DS)

For brevity, the coefficients of the spatial lags of exogenous variables are not reported individually. Instead, the number of spatial lag variables included in each model (No. \(\varvec{WX}\)) is presented. Regarding variable selection, the LS-GB approach reduces the total number of coefficients from 65 to 36, out of which 23 can be attributed to the spatial lags of exogenous variables. A similar but improved outcome is observed for GB-GB, where the number of included variables is reduced to 29, out of which 15 correspond to the coefficients of the spatial lags of exogenous variables. The best results in terms of variable selection are achieved when model-based gradient boosting is combined with deselection. For DS-GB, the number of selected variables is reduced to 11, with no coefficients of the spatial lags of exogenous variables included. DS-DS further improves variable selection through post-hoc deselection, resulting in a final parsimonious model with only 5 variables. These results suggest that many of the variables listed in Table 8 are non-informative. Notably, model-based gradient boosting combined with deselection effectively reduces the initially non-parsimonious SDEM to a simpler and more parsimonious SEM in a data-driven manner, without the need for formal likelihood-based model selection criteria or individual domain knowledge.
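The post-hoc deselection can be illustrated as follows. The sketch below, assuming a fitted mboost object mod as above, attributes the total risk reduction along the boosting path to the individual base-learners and deselects those contributing less than the fraction \(\tau = 0.01\); it conveys the idea rather than the exact implementation used in the article.

```r
# Empirical risk along the boosting path (iterations 0, ..., m_stop)
r     <- risk(mod)
delta <- r[-length(r)] - r[-1]  # risk reduction achieved in each iteration
sel   <- selected(mod)          # base-learner chosen in each iteration

# Attribute the total risk reduction to the individual base-learners
contrib <- tapply(delta, sel, sum)

# Keep only base-learners contributing at least tau of the total reduction
tau  <- 0.01
keep <- as.integer(names(contrib)[contrib >= tau * sum(delta)])

# The model is then refitted using only the base-learners indexed by `keep`
keep
```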

Regarding the spatial autoregressive parameter, all estimation strategies indicate positive spatial dependence between neighboring districts, although the strength varies. In line with the downward bias observed in the simulation study, both QML and GMM estimators tend to underestimate the spatial autoregressive parameter \(\lambda\). This underestimation can be attributed to the inclusion of non-informative variables, which are largely eliminated by model-based gradient boosting. Consequently, the estimate of \(\lambda\) obtained from model-based gradient boosting is generally higher, with the exact value depending on the first step estimation method. Moreover, the signs and thus the overall direction of the coefficients remain identical across all estimation strategies. The primary difference lies in the magnitude of the coefficients, which varies due to the shrinkage introduced by early stopping in model-based gradient boosting. Based on the results for DS-DS, the most important variables explaining life expectancy in German districts in 2019 are the unemployment rate, the debt quota, the proportion of employees with an academic degree, rent prices and the number of care-dependent individuals. Since the independent variables are transformed by simple centering and scaling, the coefficients have an intuitive and simple interpretation. For instance, an increase in rent prices by one euro ceteris paribus increases life expectancy by 0.2185 years on average. Conversely, holding all other variables constant, a 1 percentage point increase in the share of private debtors is associated with a 0.4268 year decrease in average life expectancy.

To summarize the findings, the positive spatial autoregressive parameter \(\lambda\) indicates strong spatial dependence between districts, that is, the disturbance in one district is strongly influenced by the disturbances in neighboring districts. Thus, health outcomes cluster spatially as suggested by Fig. 3. Furthermore, the number of care-dependent individuals negatively influences average life expectancy, meaning that an underlying poor health status decreases the average life expectancy. Districts with higher unemployment and debt quotas have lower average life expectancies, indicating that financial hardship, fiscal stress or economic precarity is negatively associated with longevity. In contrast, higher rent prices and a higher share of employees with an academic degree are associated with a higher average life expectancy, which suggests that inhabitants of wealthier districts with better socioeconomic status live longer on average.

5 Conclusion

The key findings and main contributions of this article are: (a) Model-based gradient boosting is extended to spatial regression models with autoregressive disturbances. (b) The algorithm is implemented in the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner et al. 2014, 2015) by providing a novel spatial error family. (c) Model-based gradient boosting for the SDEM strongly relies on knowledge of the spatial autoregressive parameter \(\lambda\). However, in real-world application settings \(\lambda\) is generally unknown, making model-based gradient boosting infeasible. Therefore, a feasible model-based gradient boosting algorithm for the SDEM is proposed, which replaces the unknown quantities \(\lambda\) and \(\sigma ^2\) in the ingredients of the algorithm by the corresponding estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\) proposed in Kelejian and Prucha (1999). (d) Extensive simulation studies in a low- as well as a high-dimensional setting demonstrate the proper functioning of the proposed feasible model-based gradient boosting algorithm. In particular, the results are accompanied by a high TPR, and coefficients are estimated with high accuracy, although biases are introduced by the regularization via early stopping. Model-based gradient boosting also outperforms or performs as well as the QML and GMM estimators in terms of bias, MSE and ESE when estimating the autoregressive parameter \(\lambda\) in the presence of non-informative variables. Additionally, the predictive performance on an independent test data set is always better for model-based gradient boosting than for QML and GMM. (e) The great advantage of model-based gradient boosting is its feasibility in high-dimensional settings where the number of variables exceeds the number of observations. Due to its modular nature, model-based gradient boosting does not require refitting and allows for a direct interpretation of the effects of the coefficients on the dependent variable of interest, enabling applications in a wide range of real-world settings. (f) The feasible model-based gradient boosting algorithm is applied in a real-world setting where the life expectancy in German districts is modeled. Additional case studies revisiting classical spatial econometric data sets, namely Boston housing prices (Harrison Jr and Rubinfeld 1978) and the Columbus crime rate (Anselin 1988), are provided in Appendix D.

Naturally, limitations, improvements and extensions beyond the scope of this article have to be acknowledged. Although simulation studies have been provided, the scope of the simulation settings is by no means exhaustive. Further simulation studies, such as those under varying spatial weight matrices, partially address these limitations, but much more complex scenarios can easily be considered. For instance, additional correlated noise variables or higher-order spatial lags of exogenous variables may be included in the simulation settings. A major limitation of the proposed model-based gradient boosting algorithm is its restricted generalizability. Specifically, the algorithm can only be applied in settings where the dependent variable does not appear as a spatial lag in the spatial regression model. Thus, the implemented model-based gradient boosting algorithm cannot readily be extended to spatial autoregressive models, which are promising candidates for a wide range of applications. Additionally, the algorithm assumes homoskedastic innovations, which is a potentially unrealistic assumption in many real-world application settings. In fact, heteroskedasticity is a common feature of spatial data sets (LeSage and Pace 2009). Furthermore, as discussed in Section 2, a formal proof of the convergence of model-based gradient boosting using the squared Mahalanobis distance is still lacking. While the numerical results in Appendix C.1 indicate that the coefficients of model-based gradient boosting for the SDEM converge toward those of the GMM, these findings do not replace a formal theoretical justification of the method. Revisiting the average selection rates reveals high FDR values, indicating the inclusion of many non-informative variables of minor importance in the final model. To mitigate these consequences, the utilization of a deselection algorithm is proposed. Alternatively, stability selection could be applied, which enables control over the number of false positives (Meinshausen and Bühlmann 2010; Shah and Samworth 2012; Hofner et al. 2015). Furthermore, probing, which is currently available for univariate location models, can be adapted to model-based gradient boosting for the SDEM. The general idea is to obtain sparser models by stopping the algorithm as soon as the first randomly permuted version of a variable is added (Thomas et al. 2017). Finally, p-values for individual base-learners could be obtained by utilizing permutation techniques (Hepp et al. 2019).

Therefore, the next goal is to extend model-based gradient boosting for the SDEM by incorporating heteroskedastic innovations and adapting it to panel data models with random and fixed effects (Anselin 1988; Kapoor et al. 2007; Baltagi et al. 2013). Such an extension would enable a more nuanced analysis of life expectancy across German districts. For example, the INKAR data set provides panel data, making it possible to study the underlying determinants of life expectancy over time. Furthermore, the FDR, TNR and bias in the simulation study are influenced by the magnitude of \(\lambda\). This behavior is possibly related to the spectrum of the spatial weight matrix. In particular, if the eigenvalue moduli are bounded by about one, positive values of \(\lambda\) approach the upper bound of the admissible range, leading to stronger regularization and more pronounced variable selection; a small numerical check of this admissible range is sketched below. However, a complete theoretical explanation remains elusive, representing an interesting direction for further research. Beyond the spatial econometrics context, the novel model-based gradient boosting algorithm could be extended and applied to network econometric models (Lee et al. 2010). Therefore, practitioners and applied statisticians working with spatial data are encouraged to utilize model-based gradient boosting for the SDEM as a valuable alternative for estimation, regularization, model and variable selection in future research.
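As a small illustration of this point, the admissible range of \(\lambda\) can be inspected numerically from the spectrum of the row-normalized weight matrix W constructed earlier. Since a row-stochastic matrix has a largest eigenvalue of exactly one, the upper bound of the admissible interval equals one; a common convention bounds \(\lambda\) by the reciprocals of the extreme real eigenvalues.

```r
# Eigenvalues of the row-normalized 10-nearest neighbor weight matrix;
# a k-nearest neighbor matrix is asymmetric, so eigenvalues may be complex
ev <- eigen(W, only.values = TRUE)$values

# Admissible interval for lambda based on the extreme real eigenvalues:
# (1 / min(Re(ev)), 1 / max(Re(ev))), with upper bound 1 for row-stochastic W
c(lower = 1 / min(Re(ev)), upper = 1 / max(Re(ev)))
```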

6 Supplementary Information

The data as well as the accompanying codebook on the life expectancy in German districts are publicly available via https://www.inkar.de/ (DL-DE BY 2.0).

All R code for the implemented model-based gradient boosting algorithm along with the simulation studies is publicly available in the following GitHub repository: https://github.com/micbalz/SpatRegBoost.