Abstract
Researchers in urban and regional studies increasingly work with high-dimensional spatial data that capture patterns and dependencies between observations. To address the unique characteristics of spatial data, various spatial regression models have been developed. In this article, a novel model-based gradient boosting algorithm tailored for spatial regression models with autoregressive disturbances is proposed. Due to its modular nature, the approach offers an alternative estimation procedure with interpretable results that remains feasible even in high-dimensional settings where traditional quasi-maximum likelihood or generalized method of moments estimators may fail to yield unique solutions. The approach also enables data-driven variable and model selection in both low- and high-dimensional settings. Since the bias-variance trade-off is additionally controlled within the algorithm, implicit regularization is imposed which enhances predictive accuracy on out-of-sample spatial data. Detailed simulation studies regarding the performance of estimation, prediction and variable selection in low- and high-dimensional settings support proper functionality of the proposed methodology. To illustrate the applicability of the model-based gradient boosting algorithm, a case study is presented in which life expectancy in German districts is modeled, incorporating a potential spatial dependence structure.
1 Introduction
With the development of novel technologies such as global positioning systems, the availability of spatial data has increased across a wide range of real-world applications. A spatial dependence structure in the data can arise due to spatial autocorrelation, which describes similarities between geographical locations in space. In principle, such spatial data can be modeled by spatial regression models, which are extensions of the standard linear regression model (LeSage and Pace 2009). Under the assumption that the spatial dependence arises solely through the error terms, Cliff and Ord (1973), building on Whittle (1954), proposed modeling the spatial autocorrelation in the disturbance process of a linear regression model. Then, for any given variable of interest, the error term for each location depends on a weighted average of the disturbances in connected locations (Anselin 1988).
However, the recent availability of large-scale novel information increasingly leads to high-dimensional data settings. Applied researchers are therefore faced with unique challenges: in high-dimensional data sets the number of variables exceeds the number of observations, making estimation via the quasi-maximum likelihood (QML) or generalized method of moments (GMM) principle infeasible. Furthermore, out of the available variables, only a small fraction might actually be informative in explaining a dependent variable of interest. Moreover, the complexity of the linear predictor may rapidly increase depending on modeling choices of the practitioner, such as the inclusion of higher-order spatial lags of independent variables, leading to a rich set of potential candidate models (LeSage and Pace 2009; Fahrmeir et al. 2013).
Thus, approaches to model choice, variable selection and estimation in high-dimensional settings have become increasingly important. If model estimation is based on QML, a quantifiable value for model comparison in low-dimensional settings can be obtained from the Akaike information criterion. Instead of utilizing information criteria, regularization techniques are popular alternatives for model choice and variable selection (Fahrmeir et al. 2013). Such regularization techniques include, for example, the least absolute shrinkage and selection operator, which has recently been extended to spatial regression models with autoregressive disturbances (Tibshirani 1996; Cai et al. 2019; Cai and Maiti 2020).
Another option in linear regression models is the model-based gradient boosting algorithm. Although originally proposed in the domain of machine learning for classification problems, gradient boosting has been extended to statistical regression models where it is known as component-wise, model-based or statistical gradient boosting. The algorithm is iterative in nature: the estimation problem reduces to fitting base-learners to the negative gradient of a prespecified loss function. The base-learners describe the functional forms of the effects of the independent variables and the loss function is related to the statistical model of interest. In each iteration of the algorithm, only the best performing base-learner is chosen, of which a small fraction is added to the current linear predictor. Stopping the algorithm early allows for data-driven model and variable selection and yields interpretable results at each iteration (Mayr et al. 2014). Due to the modular and iterative nature of model-based gradient boosting, it remains a feasible approach even in high-dimensional settings (Bühlmann 2006; Bühlmann and Hothorn 2007). To fix ideas, a minimal sketch of component-wise boosting with the squared error loss and simple linear base-learners is given below; the function is purely illustrative and not the mboost implementation.
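```r
# Minimal component-wise L2 boosting sketch: each column of X acts as a simple
# linear base-learner fitted to the current negative gradient; only the best
# performing base-learner is updated by a small fraction nu.
l2boost <- function(y, X, mstop = 100, nu = 0.1) {
  beta <- numeric(ncol(X))
  eta  <- rep(0, length(y))                  # start from the empty model
  for (m in seq_len(mstop)) {
    u   <- y - eta                           # negative gradient of the squared error loss
    b   <- colSums(X * u) / colSums(X^2)     # component-wise least squares fits
    rss <- colSums((u - t(t(X) * b))^2)      # residual sum of squares per base-learner
    j   <- which.min(rss)                    # best performing base-learner
    beta[j] <- beta[j] + nu * b[j]           # update only one component
    eta     <- eta + nu * X[, j] * b[j]
  }
  beta
}
```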
Although a variant of boosting for semi-parametric additive spatial autoregressive models has been recently explored (Yue and Xi 2025), no prior work appears to have addressed a potential extension of model-based gradient boosting for parametric spatial regression models with autoregressive disturbances. To this end, the model-based gradient boosting algorithm in the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner et al. 2014, 2015) for fitting generalized linear, additive and interaction models of potential high-dimensionality in the programming language R (R Core Team 2025) is extended to accommodate spatial regression models with autoregressive disturbances. To investigate proper functionality of the proposed model-based gradient boosting algorithm, in-depth simulation studies in low- and high-dimensional settings are conducted. The focus lies primarily on the evaluation of estimation, variable selection and prediction. To illustrate the potential real-world application of model-based gradient boosting, a case study concerned with modeling the life expectancy in German districts is presented which draws on the “Indicators and Maps on Spatial and Urban Development in Germany and Europe” (INKAR) data base that includes a rich variety of variables (Bundesinstitut für Bau-, Stadt- und Raumforschung (BBSR) 2024).
The structure of this article is as follows: In Section 2, the mathematical and theoretical framework of spatial regression models with autoregressive disturbances and model-based gradient boosting is introduced, and both concepts are subsequently combined. Afterward, outcomes of the simulation studies are discussed in Section 3. A description of the context, data situation and variables for the case study as well as the results are presented in Section 4. The article finishes with a conclusion and a discussion in Section 5.
2 Methodology
2.1 Spatial Regression Models with Autoregressive Disturbances
Let \(n \in \mathbb {N}\) denote the number of observations in a spatial data set. For each \(i \in \{1,\dots ,n\}\), consider the following spatial regression model with autoregressive disturbances
\[\varvec{y} = \varvec{X}\varvec{\beta } + \varvec{W}\varvec{X}\varvec{\theta } + \varvec{u}, \qquad \varvec{u} = \lambda \varvec{W}\varvec{u} + \varvec{\epsilon }, \tag{1}\]
where \(\varvec{y}\) is an \(n \times 1\) vector of observations, \(\varvec{X}\) is the \(n \times p\) design matrix of \(p \in \mathbb {N}\) exogenous variables, \(\varvec{\beta }\) is the corresponding \(p \times 1\) coefficient vector and \(\varvec{u}\) is the \(n \times 1\) vector of disturbances. The spatial autocorrelation in the data is assumed to enter the model in two ways. First, the disturbances are modeled as an autoregressive process which depends on a spatial autoregressive parameter \(\lambda \in (-1,1)\), a spatial weight matrix \(\varvec{W}\) of size \(n \times n\) that captures spatial connections between observations and \(n \times 1\) idiosyncratic random innovations \(\varvec{\epsilon }\). Second, spatial lags of exogenous variables \(\varvec{W}\varvec{X}\) of size \(n \times p\) and the corresponding \(p \times 1\) coefficient vector \(\varvec{\theta }\) are also included in modeling the spatially dependent variable of interest. A list of important notation utilized throughout the article can be found in Appendix A. Thus, the model in Eq. 1 nests several spatial regression models as special cases and is labeled the spatial Durbin error model (SDEM). In the SDEM, spatial autocorrelation is accounted for in the explanatory variables as well as the error term. Removing the spatially lagged independent variables results in the simpler spatial error model (SEM). In contrast, retaining the spatially lagged independent variables but removing the autoregressive nature of the disturbances results in the simpler spatial cross-regressive model (SLX) (Anselin 1988; LeSage and Pace 2009; Halleck Vega and Elhorst 2015).
Assumption 1
The innovations \(\varvec{\epsilon }\) are independently and identically distributed with expectation \(\mathbb {E}(\varvec{\epsilon }) = 0\) and variance \({\text {Var}}(\varvec{\epsilon }) = \sigma ^2\). Additionally, for some \(\xi > 0\), the moment \(\mathbb {E}|\varvec{\epsilon }|^{4+\xi }\) exists.
Assumption 2
The spatial weight matrix \(\varvec{W}\) has no self-loops, that is, the diagonal entries satisfy \(w_{ii} = 0\). The off-diagonal entries satisfy \(w_{ij} = O\left( \frac{1}{h}\right)\) where \(\frac{h}{n} \rightarrow 0\) as \(n \rightarrow \infty\).
Assumption 3
For any \(|\lambda | < 1\), the matrix \(\varvec{I} - \lambda \varvec{W}\) is non-singular.
Assumption 4
The row and column sums of \(\varvec{W}\) and \(\left( \varvec{I} - \lambda \varvec{W}\right) ^{-1}\) are uniformly bounded in absolute value.
For a proper estimation procedure of the autoregressive parameter as well as the coefficients of the exogenous variables and the corresponding spatial lags, regularity conditions have to be imposed. Assumption 1 imposes homoskedasticity of the innovations, although normality is not formally required. Assumption 2 links the number of observations to the spatial weight matrix. The condition is satisfied if \(\varvec{W}\) is row-normalized which is an assumption maintained throughout this article. The stability condition \(|\lambda | < 1\) in Assumption 3 ensures the invertibility of \(\varvec{I} - \lambda \varvec{W}\) and thus the uniqueness of the autoregressive disturbances in terms of the innovations. Similarly, Assumption 4 ensures that the degree of spatial autocorrelation remains within a manageable range (Lee 2004).
The model in Eq. 1 can be written more compactly by combining all exogenous variables and the corresponding spatial lags into one design matrix \(\varvec{Z} = [\varvec{X}, \varvec{W} \varvec{X}]\) and by stacking the corresponding coefficient vectors as \(\varvec{\delta } = (\varvec{\beta }^{\prime }, \varvec{\theta }^{\prime })^{\prime }\), yielding
\[\varvec{y} = \varvec{Z}\varvec{\delta } + \varvec{u}, \qquad \varvec{u} = \lambda \varvec{W}\varvec{u} + \varvec{\epsilon }. \tag{2}\]
Although not the focus in this article, it is worth noting that the general model formulation also allows for the inclusion of additional spatial lags of exogenous variables such as \(\varvec{W}\varvec{W}\varvec{X}\) or even \(\varvec{W}\varvec{W}\varvec{W}\varvec{X}\). Let \(\varvec{\eta } = \varvec{Z} \varvec{\delta }\) be the so-called linear predictor. To estimate the coefficients of the exogenous variables and the corresponding spatial lags of exogenous variables, it is convenient to transform the model into a single equation form as
\[\varvec{y} = \varvec{Z}\varvec{\delta } + \left( \varvec{I} - \lambda \varvec{W} \right) ^{-1} \varvec{\epsilon }. \tag{3}\]
The following Assumption 5 is additionally imposed on the design matrix \(\varvec{Z}\).
Assumption 5
The matrix \(\varvec{Z}\) has full column rank. Specifically, the limit \(\lim _{n \rightarrow \infty } \frac{1}{n} \varvec{Z}^{\prime } \varvec{Z}\) exists, is non-singular, and the elements of \(\varvec{Z}\) are uniformly bounded in absolute value.
If the autoregressive parameter \(\lambda\) is known, then the loss function corresponding to Eq. 2 is the squared Mahalanobis distance of the residual vector, which arises from the generalized least squares objective function
\[\rho \left( \varvec{y}, \varvec{\eta } \right) = \left( \varvec{y} - \varvec{\eta } \right) ^{\prime } \varvec{\Omega }(\lambda , \sigma ^2)^{-1} \left( \varvec{y} - \varvec{\eta } \right) , \tag{4}\]
where the variance-covariance matrix is induced by the spatial autocorrelation and given by
\[\varvec{\Omega }(\lambda , \sigma ^2) = \sigma ^2 \left[ \left( \varvec{I} - \lambda \varvec{W} \right) ^{\prime } \left( \varvec{I} - \lambda \varvec{W} \right) \right] ^{-1}.\]
The negative gradient of the loss function with respect to the linear predictor \(\varvec{\eta }\) is given by
\[- \frac{\partial \rho \left( \varvec{y}, \varvec{\eta } \right) }{\partial \varvec{\eta }} = 2\, \varvec{\Omega }(\lambda , \sigma ^2)^{-1} \left( \varvec{y} - \varvec{\eta } \right) , \tag{5}\]
which yields all necessary ingredients for the model-based gradient boosting algorithm (Kelejian and Prucha 1999; Cai et al. 2019; Cai and Maiti 2020).
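As a minimal illustration, the loss in Eq. 4 and the negative gradient in Eq. 5 can be evaluated in R as follows for known \(\lambda\) and \(\sigma ^2\); the helper name is illustrative and not part of mboost.

```r
# Sketch of the boosting ingredients for known lambda and sigma^2:
# Omega^{-1} = (I - lambda W)'(I - lambda W) / sigma^2, so no matrix
# inversion is required.
make_ingredients <- function(lambda, sigma2, W) {
  A    <- diag(nrow(W)) - lambda * W       # I - lambda * W
  Oinv <- crossprod(A) / sigma2            # inverse variance-covariance matrix
  list(
    loss      = function(y, eta) drop(crossprod(y - eta, Oinv %*% (y - eta))),  # Eq. 4
    ngradient = function(y, eta) drop(2 * Oinv %*% (y - eta))                   # Eq. 5
  )
}
```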
2.2 Three-step Feasible Model-based Gradient Boosting
Given the relevant expressions for the ingredients, model-based gradient boosting for the SDEM can be implemented. In principle, the ingredients play an important role in each iteration of the algorithm. In general, the standard interpretation of boosting as steepest descent in function space implies that the algorithm reduces the empirical risk iteratively through the use of base-learners (Friedman 2001). The base-learners represent the functional forms associated with the exogenous input variables. The algorithm begins with an empty model and sequentially fits the specified base-learners to the negative gradient of the chosen loss function. Thus, proper functionality of the algorithm requires a pre-specified loss function which can be quite general. Subsequently, the residual sum of squares is computed for each base-learner separately and the linear predictor is updated by adding a small fraction of the best performing base-learner. The algorithm then reevaluates the negative gradient and updates the linear predictor in an iterative fashion until the specified number of boosting iterations is reached (Friedman 2001; Bühlmann and Hothorn 2007; Mayr et al. 2014).
Algorithm 1 adapts model-based gradient boosting for the SDEM and does not impose any limitations on the number of potential independent variables \(q \in \mathbb {N}\). Indeed, the great advantage of model-based gradient boosting is its feasibility in high-dimensional settings where the number of variables is larger than the number of available observations (Hepp et al. 2016). As the main tuning parameter of the algorithm, the number of boosting iterations \(m_{\text {stop}}\) controls the so-called bias-variance trade-off. Thus, the accuracy of prediction can be improved and overfitting behavior mitigated. Due to the modular nature, the algorithm yields an interpretable solution at each iteration such that sparser models can be obtained by stopping the algorithm early instead of running it to convergence. Since the algorithm only updates the linear predictor by one component at each iteration, variable selection and shrinkage estimation are also accounted for (Mayr et al. 2012).
The feasibility of Algorithm 1 strongly relies on the assumption that the spatial autoregressive parameter \(\lambda\) and the variance of the innovations \(\sigma ^2\) are both known a priori. However, in real-world application settings, \(\lambda\) and \(\sigma ^2\) are unknown, which implies that the variance-covariance matrix \(\varvec{\Omega }(\lambda , \sigma ^2)\) occurring in the squared Mahalanobis distance and the negative gradient cannot be evaluated. Therefore, a three-step model-based gradient boosting procedure is proposed to make Algorithm 1 feasible, which relies on replacing the unknown quantities \(\lambda\), \(\sigma ^2\) and \(\varvec{\Omega }(\lambda , \sigma ^2)\) by the estimators \(\hat{\lambda }\), \(\hat{\sigma }^2\) and \(\varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)\). In the first step, the model in Eq. 1 is written as
\[\varvec{y} = \varvec{Z}\varvec{\delta } + \varvec{u}, \tag{6}\]
temporarily ignoring the potential autoregressive structure of the disturbances. The model in Eq. 6 can then be estimated using a variety of methods as long as the resulting estimator \(\varvec{\tilde{\delta }}\) is consistent. For low-dimensional linear settings, a natural choice for the estimator is ordinary least squares (OLS). In high-dimensional settings, OLS may not yield unique solutions, so model-based gradient boosting can be utilized instead. As noted in Zhang and Yu (2005) and Bühlmann (2006), model-based gradient boosting yields a consistent estimator in both low- and high-dimensional settings if the squared error is employed as the loss function. Additionally, Bühlmann and Hothorn (2007) formally show that model-based gradient boosting with the squared error loss function converges to the OLS solution if the number of boosting iterations \(m_{\text {stop}}\) is chosen sufficiently large.
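A sketch of this first step, assuming the response y and the combined design matrix Z are available, might look as follows; the helper name and defaults are illustrative.

```r
# First step (sketch): OLS if n > q, otherwise model-based gradient boosting
# with the squared error loss; both yield a consistent estimator delta_tilde.
library(mboost)
first_step <- function(y, Z, mstop = 500, nu = 0.1) {
  if (nrow(Z) > ncol(Z)) {
    coef(lm.fit(Z, y))                           # OLS in the low-dimensional case
  } else {
    fit <- glmboost(x = Z, y = y, family = Gaussian(),
                    control = boost_control(mstop = mstop, nu = nu))
    coef(fit, off2int = TRUE)                    # boosted (selected) coefficients
  }
}
```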
Based on Kelejian and Prucha (1999), let in the second step \(\varvec{\tilde{u}} = \varvec{y} -\varvec{Z}\varvec{\tilde{\delta }}\) denote the predictors of \(\varvec{u}\) based on a consistent estimator \(\varvec{\tilde{\delta }}\). Define \(\varvec{\bar{u}} = \varvec{W}\varvec{u}\), \(\varvec{\bar{\bar{u}}} = \varvec{W}\varvec{W}\varvec{u}\) and the corresponding expressions based on the predictors as \(\varvec{\tilde{\bar{u}}} = \varvec{W}\varvec{\tilde{u}}\) and \(\varvec{\tilde{\bar{\bar{u}}}} = \varvec{W}\varvec{W}\varvec{\tilde{u}}\). An identical notation pattern is adopted for the innovations. Then, if Assumptions 1 to 3 hold, the following three moments can be obtained
\[\mathbb {E}\left[ \frac{1}{n} \varvec{\epsilon }^{\prime } \varvec{\epsilon } \right] = \sigma ^2, \qquad \mathbb {E}\left[ \frac{1}{n} \varvec{\bar{\epsilon }}^{\prime } \varvec{\bar{\epsilon }} \right] = \frac{\sigma ^2}{n} \text {tr}\left( \varvec{W}^{\prime } \varvec{W} \right) , \qquad \mathbb {E}\left[ \frac{1}{n} \varvec{\bar{\epsilon }}^{\prime } \varvec{\epsilon } \right] = 0, \tag{7}\]
where \(\text {tr}(\cdot )\) denotes the trace of any matrix. Since the innovations can be written in terms of \(\varvec{\bar{u}}\) and \(\varvec{\bar{\bar{u}}}\) as \(\varvec{\epsilon } = \varvec{u} - \lambda \varvec{\bar{u}}\) and \(\varvec{\bar{\epsilon }} = \varvec{\bar{u}} - \lambda \varvec{\bar{\bar{u}}}\), a system of three equations can be obtained based on Eqs. 1 and 7
\[\varvec{\Gamma } \left( \lambda , \lambda ^2, \sigma ^2 \right) ^{\prime } - \varvec{\gamma } = \varvec{0}. \tag{8}\]
The expressions for \(\varvec{\Gamma }\) and \(\varvec{\gamma }\) are given as
\[\varvec{\Gamma } = \frac{1}{n} \begin{pmatrix} 2\, \mathbb {E}\left[ \varvec{u}^{\prime } \varvec{\bar{u}} \right] & -\mathbb {E}\left[ \varvec{\bar{u}}^{\prime } \varvec{\bar{u}} \right] & n \\ 2\, \mathbb {E}\left[ \varvec{\bar{u}}^{\prime } \varvec{\bar{\bar{u}}} \right] & -\mathbb {E}\left[ \varvec{\bar{\bar{u}}}^{\prime } \varvec{\bar{\bar{u}}} \right] & \text {tr}\left( \varvec{W}^{\prime } \varvec{W} \right) \\ \mathbb {E}\left[ \varvec{u}^{\prime } \varvec{\bar{\bar{u}}} + \varvec{\bar{u}}^{\prime } \varvec{\bar{u}} \right] & -\mathbb {E}\left[ \varvec{\bar{u}}^{\prime } \varvec{\bar{\bar{u}}} \right] & 0 \end{pmatrix}, \qquad \varvec{\gamma } = \frac{1}{n} \begin{pmatrix} \mathbb {E}\left[ \varvec{u}^{\prime } \varvec{u} \right] \\ \mathbb {E}\left[ \varvec{\bar{u}}^{\prime } \varvec{\bar{u}} \right] \\ \mathbb {E}\left[ \varvec{u}^{\prime } \varvec{\bar{u}} \right] \end{pmatrix}.\]
Replacing the moments in Eq. 8 by the corresponding sample moments yields
\[\varvec{\tilde{\Gamma }} \left( \lambda , \lambda ^2, \sigma ^2 \right) ^{\prime } - \varvec{\tilde{\gamma }} = \varvec{\nu }(\lambda , \sigma ^2), \tag{9}\]
where
\[\varvec{\tilde{\Gamma }} = \frac{1}{n} \begin{pmatrix} 2\, \varvec{\tilde{u}}^{\prime } \varvec{\tilde{\bar{u}}} & -\varvec{\tilde{\bar{u}}}^{\prime } \varvec{\tilde{\bar{u}}} & n \\ 2\, \varvec{\tilde{\bar{u}}}^{\prime } \varvec{\tilde{\bar{\bar{u}}}} & -\varvec{\tilde{\bar{\bar{u}}}}^{\prime } \varvec{\tilde{\bar{\bar{u}}}} & \text {tr}\left( \varvec{W}^{\prime } \varvec{W} \right) \\ \varvec{\tilde{u}}^{\prime } \varvec{\tilde{\bar{\bar{u}}}} + \varvec{\tilde{\bar{u}}}^{\prime } \varvec{\tilde{\bar{u}}} & -\varvec{\tilde{\bar{u}}}^{\prime } \varvec{\tilde{\bar{\bar{u}}}} & 0 \end{pmatrix}, \qquad \varvec{\tilde{\gamma }} = \frac{1}{n} \begin{pmatrix} \varvec{\tilde{u}}^{\prime } \varvec{\tilde{u}} \\ \varvec{\tilde{\bar{u}}}^{\prime } \varvec{\tilde{\bar{u}}} \\ \varvec{\tilde{u}}^{\prime } \varvec{\tilde{\bar{u}}} \end{pmatrix},\]
and \(\varvec{\nu }(\lambda , \sigma ^2)\) is interpreted as a \(3 \times 1\) residual vector. The estimators for \(\lambda\) and \(\sigma ^2\) are obtained using non-linear least squares and are denoted by \(\hat{\lambda }\) and \(\hat{\sigma }^2\). Based on Eq. 9, the non-linear least squares estimators are defined as
\[\left( \hat{\lambda }, \hat{\sigma }^2 \right) = \mathop {\mathrm {arg\,min}}\limits _{\lambda , \sigma ^2} \; \varvec{\nu }(\lambda , \sigma ^2)^{\prime } \varvec{\nu }(\lambda , \sigma ^2). \tag{10}\]
Let Assumptions 1 to 7 hold, where Assumptions 6 and 7 are stated below. Then the non-linear least squares estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\) are consistent estimators of \(\lambda\) and \(\sigma ^2\) in the sense that \(\hat{\lambda } \rightarrow _p \lambda\) and \(\hat{\sigma }^2 \rightarrow _p \sigma ^2\) as \(n \rightarrow \infty\).
Assumption 6
Denote by \(\tilde{u}_{i}\) the i-th element of the vector \(\varvec{\tilde{u}}\). Assume that there exist finite-dimensional random vectors \(\varvec{d}_{i,n}\) and \(\varvec{\Delta }_n\) such that \(\left| \tilde{u}_{i} - u_{i} \right| \le \left\| \varvec{d}_{i,n} \right\| \left\| \varvec{\Delta }_n \right\|\), with the following conditions holding: \(\frac{1}{n} \sum _{i=1}^n \left\| \varvec{d}_{i,n} \right\| ^{2 + \zeta } = O_p(1)\) for some \(\zeta > 0\), and \(\sqrt{n} \left\| \varvec{\Delta }_n \right\| = O_p(1)\).
Assumption 7
Assume that the matrix \(\varvec{\Gamma }^{\prime } \varvec{\Gamma }\) is well-conditioned in the sense that its smallest eigenvalue is bounded away from zero. Specifically, \(\phi _{\min } \left( \varvec{\Gamma }^{\prime } \varvec{\Gamma } \right) \ge \phi ^{*}> 0\) where the constant \(\phi ^{*}\) may depend on \(\lambda\) and \(\sigma ^2\).
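To make the second step concrete, the following sketch implements the sample moment system in Eq. 9 and the non-linear least squares problem in Eq. 10 with optim(); the function name is illustrative, and the box constraints on \(\lambda\) are a pragmatic choice to respect the stability condition.

```r
# Second step (sketch): Kelejian-Prucha moment estimation of lambda and
# sigma^2 from the first-step residuals u_tilde via non-linear least squares.
kp_nls <- function(u_tilde, W) {
  n   <- length(u_tilde)
  ub  <- as.vector(W %*% u_tilde)               # first spatial lag of the residuals
  ubb <- as.vector(W %*% ub)                    # second spatial lag
  G <- rbind(c(2 * sum(u_tilde * ub),  -sum(ub^2),  n),
             c(2 * sum(ub * ubb),      -sum(ubb^2), sum(W^2)),   # tr(W'W) = sum of squared weights
             c(sum(u_tilde * ubb) + sum(ub^2), -sum(ub * ubb), 0)) / n
  g <- c(sum(u_tilde^2), sum(ub^2), sum(u_tilde * ub)) / n
  obj <- function(par) {                        # par = c(lambda, sigma2)
    nu <- G %*% c(par[1], par[1]^2, par[2]) - g # residual vector nu(lambda, sigma2)
    sum(nu^2)                                   # squared norm to be minimized (Eq. 10)
  }
  est <- optim(c(0, 1), obj, method = "L-BFGS-B",
               lower = c(-0.99, 1e-8), upper = c(0.99, Inf))
  c(lambda_hat = est$par[1], sigma2_hat = est$par[2])
}
```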
In the third step, \(\lambda\) and \(\sigma ^2\) in \(\varvec{\Omega }(\lambda , \sigma ^2)\) are replaced by the estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\), yielding \(\varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)\). Thus, the squared Mahalanobis distance and negative gradient become
\[\rho \left( \varvec{y}, \varvec{\eta } \right) = \left( \varvec{y} - \varvec{\eta } \right) ^{\prime } \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)^{-1} \left( \varvec{y} - \varvec{\eta } \right) \tag{11}\]
and
\[- \frac{\partial \rho \left( \varvec{y}, \varvec{\eta } \right) }{\partial \varvec{\eta }} = 2\, \varvec{\Omega }(\hat{\lambda }, \hat{\sigma }^2)^{-1} \left( \varvec{y} - \varvec{\eta } \right) . \tag{12}\]
Finally replace the expressions in Algorithm 1 by the corresponding expressions based on the estimators in Eqs. 11 and 12 which yields a feasible model-based gradient boosting algorithm for the SDEM (Kelejian and Prucha 1999; Kapoor et al. 2007).
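In the implementation, the feasible ingredients can be passed to mboost as a custom family; the following sketch illustrates the idea, although the constructor shown here is illustrative and not necessarily identical to the spatial error family in the accompanying repository.

```r
# Sketch of wrapping the feasible ingredients (Eqs. 11 and 12) into a custom
# mboost family so that glmboost() runs the boosting updates of Algorithm 1.
library(mboost)
SpatialErrorFamily <- function(lambda_hat, sigma2_hat, W) {
  A    <- diag(nrow(W)) - lambda_hat * W         # I - lambda_hat * W
  Oinv <- crossprod(A) / sigma2_hat              # Omega(lambda_hat, sigma2_hat)^{-1}
  Family(
    ngradient = function(y, f, w = 1) drop(2 * Oinv %*% (y - f)),                 # Eq. 12
    risk      = function(y, f, w = 1) drop(crossprod(y - f, Oinv %*% (y - f))),   # Eq. 11
    offset    = function(y, w = 1) 0,
    name      = "Feasible squared Mahalanobis distance (sketch)"
  )
}
```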
While convergence has been formally proven for model-based gradient boosting with the squared error loss function, the result does not directly translate to model-based gradient boosting with the squared Mahalanobis distance (Zhang and Yu 2005; Bühlmann 2006; Bühlmann and Hothorn 2007). Although no proof exists for more general loss functions, empirical results confirm a similar convergence behavior with regard to the maximum likelihood estimator if the negative log-likelihood of a generalized linear model is utilized as the loss function (Hepp et al. 2016). A similar behavior can be observed in model-based gradient boosting for the SDEM, which is demonstrated in Appendix C.1.
2.3 Post-hoc Deselection
In model-based gradient boosting, the standard approach to model and variable selection is early stopping via the stopping criterion \(m_{\text {stop}}\). Generally, \(m_{\text {stop}}\) is chosen by means of random cross-validation, subsampling or bootstrapping. However, random cross-validation techniques cannot be applied effectively to spatially dependent data because random partitions yield test data in which the observations are spatially close to the observations in the training data. This can lead to an underestimation of the prediction error due to neglect of the underlying spatial autocorrelation and thereby to unreliable variable as well as model selection (Schratz et al. 2019).
To address spatial autocorrelation, spatial cross-validation techniques such as those proposed by Brenning (2012) partition data into spatially disjoint subsets using clustering algorithms based on geodesic distances. Alternatively, spatial blocking techniques divide the geographical region into spatial blocks or buffer zones to create independent folds (Valavi et al. 2019). However, standard and spatial cross-validation techniques often select too many variables, resulting in less parsimonious models (Mayr et al. 2012). Furthermore, the quality of the non-linear least squares estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\) depends on the quality of the predictors of \(\varvec{u}\) where the inclusion of non-informative variables severely impacts the final estimators. In fact, the additional regularization from early stopping can induce severe biases into non-linear least squares estimators which can result in misleading model as well as variable selection.
To mitigate the consequences and obtain sparser final models, the deselection algorithm proposed by Strömer et al. (2022) is applied to model-based gradient boosting for the SDEM. The basic idea of deselection is to run model-based gradient boosting once, determine the optimal number of boosting iterations \(m_{\text {opt}}\) via spatial cross-validation techniques and then quantify the contribution of each variable to the overall risk reduction achieved up to \(m_{\text {opt}}\). Variables that contribute little to reducing the empirical risk are subsequently removed and model-based gradient boosting is reapplied using only the remaining variables and the previously determined \(m_{\text {opt}}\). Formally, let \(\mathbbm {1}(\cdot )\) be the indicator function, \(r^{[m]}\) the empirical risk of the squared Mahalanobis distance, \(\left( r^{[m-1]} - r^{[m]}\right)\) the risk reduction and \(j^{{*}^{[m]}}\) the best performing base-learner in iteration m of the model-based gradient boosting algorithm. The attributable risk reduction for base-learner j after \(m_{\text {opt}}\) boosting iterations is defined as
\[R_j = \sum _{m=1}^{m_{\text {opt}}} \mathbbm {1}\left( j^{{*}^{[m]}} = j \right) \left( r^{[m-1]} - r^{[m]} \right) . \tag{14}\]
In principle, the attributable risk reduction \(R_j\) measures how much of the total reduction in empirical risk can be attributed to base-learner and hence variable j across the optimal number of boosting iterations \(m_{\text {opt}}\). The total risk reduction is given by \(\left( r^{[0]} - r^{[m_{\text {opt}}]}\right)\) and a base-learner j is deselected if its relative contribution is below a pre-specified threshold \(\tau \in (0,1)\), that is, if
\[R_j < \tau \left( r^{[0]} - r^{[m_{\text {opt}}]} \right) . \tag{15}\]
Since \(R_j\) captures the risk reduction which can be attributed to each base-learner j, Eq. 15 removes base-learners for which \(R_j\) is smaller than a fraction of the total risk reduction. Thus, variables remain in the final model only if their relative risk contribution is equal to or larger than the threshold \(\tau\). Therefore, the choice of the threshold \(\tau\) is crucial and usually depends on the specific research context. In particular, the threshold \(\tau\) controls the sparsity of the final model, where smaller values of \(\tau\) retain more variables while larger values yield more aggressive deselection. Simulation results reported in Appendix C.3 suggest that only relatively small thresholds \(\tau \le 0.025\) are appropriate for the SDEM (Strömer et al. 2022). A summary of the deselection approach for model-based gradient boosting for the SDEM is given in Algorithm 2.
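The following sketch illustrates how the attributable risk reductions in Eq. 14 and the deselection rule in Eq. 15 can be computed post hoc from a fitted mboost model; the helper name is illustrative, and the retained base-learners would subsequently be refit with the previously determined \(m_{\text {opt}}\) as in Algorithm 2.

```r
# Post-hoc deselection sketch: risk() returns the empirical risk path and
# selected() the best performing base-learner per iteration of a fitted
# mboost model (here assumed to be stopped at m_opt).
attributable_risk <- function(fit, tau = 0.01) {
  r   <- risk(fit)                       # r^[0], ..., r^[m_opt]
  sel <- selected(fit)                   # j*^[m] for m = 1, ..., m_opt
  red <- -diff(r)                        # per-iteration reductions r^[m-1] - r^[m]
  R_j <- tapply(red, factor(sel), sum)   # Eq. 14: reduction attributable to each j
  total <- r[1] - r[length(r)]           # total risk reduction
  as.integer(names(R_j)[R_j >= tau * total])  # Eq. 15: indices of retained base-learners
}
```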
3 Simulation Study
3.1 Study Design
To evaluate the performance of the proposed three-step model-based gradient boosting algorithm, simulation studies are conducted with the ingredients derived in Section 2. Particularly, the performance of estimation, variable selection and prediction in low- as well as high-dimensional settings is evaluated. Additionally, an evaluation of the performance of the deselection algorithm is provided. Specifically, the number of observations is fixed at \(n = 400\). In contrast, the number of independent variables is varied between \(q = 20\) and \(q = 800\), corresponding to a low-dimensional \((n > q)\) and a high-dimensional \((n < q)\) setting. The true data generating process follows the SDEM
\[\varvec{y} = \varvec{X}\varvec{\beta } + \varvec{W}\varvec{X}\varvec{\theta } + \varvec{u}, \qquad \varvec{u} = \lambda \varvec{W}\varvec{u} + \varvec{\epsilon },\]
with sparse coefficient vectors \(\varvec{\beta }\) and \(\varvec{\theta }\) in which only a small number of the q candidate variables are informative,
where the variables are independently and identically drawn from the uniform distribution \(\varvec{X} \sim U(-2,2)\). The spatial autoregressive parameter is varied throughout the simulation study by \(\lambda \in \{-0.8,-0.6,-0.4,-0.2,0.2,0.4,0.6,0.8\}\) and the innovations are normally distributed according to \(\varvec{\epsilon } \sim N(0,\sigma ^2)\) with \(\sigma ^2 = 1\). The spatial weight matrix \(\varvec{W}\) is identical to the one used in the application in Section 4 and has a 10-nearest neighbor structure where each location is connected to its ten geographically closest neighbors based on constant centroids of the spatial polygons available in the shape file. Afterward, \(\varvec{W}\) is row-normalized such that each row sums up to one. Additionally, an artificial spatial weight matrix can be generated based on a circular world following Kelejian and Prucha (1999), in which each location is directly related to the five locations before and after, that is, \(k = 5\). Simulation studies for varying spatial autoregressive parameters in a circular world are given in Appendix C.4 and for a varying number of related locations \(k \in \{1,2,3,5,10,20\}\) in a circular world in Appendix C.5.
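For illustration, a minimal R sketch of such a data generating process with a circular-world weight matrix is given below; the non-zero coefficient values are illustrative assumptions and not the exact values used in the reported simulations.

```r
# Sketch of the simulation DGP with a circular-world weight matrix
# ("k ahead and k behind" as in Kelejian and Prucha 1999).
set.seed(1)
n <- 400; q <- 20; k <- 5; lambda <- 0.6; sigma2 <- 1
W <- matrix(0, n, n)
for (i in 1:n) {
  nb <- (c((i - k):(i - 1), (i + 1):(i + k)) - 1) %% n + 1  # circular neighbors
  W[i, nb] <- 1 / (2 * k)                                   # row-normalized weights
}
X     <- matrix(runif(n * q, -2, 2), n, q)                  # X ~ U(-2, 2)
beta  <- c(1, -1, rep(0, q - 2))                            # sparse beta (illustrative values)
theta <- c(0.5, -0.5, rep(0, q - 2))                        # sparse theta (illustrative values)
eps   <- rnorm(n, 0, sqrt(sigma2))
u     <- solve(diag(n) - lambda * W, eps)                   # u = (I - lambda W)^{-1} eps
y     <- X %*% beta + W %*% X %*% theta + u
```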
In the model-based gradient boosting algorithm, the corresponding base-learners are specified as simple linear regression models due to the nature of the data generating process. The learning rate is set to \(s = 0.1\), as is common practice (see, for example, Schmid and Hothorn 2008; Mayr et al. 2012; Hofner et al. 2014). The optimal stopping criterion \(m_{\text {opt}}\) is found by minimizing the empirical risk via 10-fold spatial cross-validation as proposed in Brenning (2012). In each simulation setting, a total of \(n_{\text {sim}} = 100\) repetitions are conducted. Additionally, different approaches for the consistent estimator \(\varvec{\tilde{\delta }}\) in the first step are considered, along with their impact on the final results. Specifically, the reported methods are first step OLS (LS-GB), gradient boosting (GB-GB), and gradient boosting with deselection (DS-GB) where applicable. Furthermore, model-based gradient boosting with first step gradient boosting with deselection and additional deselection (DS-DS) is considered. In this notation, the component before the hyphen indicates the method utilized in the first step and the component after the hyphen indicates the method applied for the SDEM. A practitioner’s note summarizing the presented model-based gradient boosting algorithm as well as deselection with different first step methods can be found in Appendix B.
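A sketch of how \(m_{\text {opt}}\) can be tuned by 10-fold spatial cross-validation with mboost is given below; the k-means clustering of the coordinates follows the spirit of Brenning (2012), and the object names (a fitted glmboost model `fit` and an n x 2 coordinate matrix `coords`) are assumptions.

```r
# Spatially clustered folds via k-means on the coordinates, passed to
# mboost::cvrisk() as a weight matrix (weight 0 marks the held-out fold).
cl    <- kmeans(coords, centers = 10)$cluster
folds <- sapply(1:10, function(k) as.numeric(cl != k))
cvr   <- cvrisk(fit, folds = folds)     # out-of-sample empirical risk path
m_opt <- mstop(cvr)                     # risk-minimizing number of iterations
fit[m_opt]                              # set the model to m_opt
```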
Regarding the performance of variable selection and deselection, the criteria are chosen based on the confusion matrix. In particular, the reported variable selection criteria are the true positive rate (TPR), which is the proportion of correctly selected variables out of all true informative variables, the true negative rate (TNR), which is the proportion of correctly non-selected variables out of all true non-informative variables and the false discovery rate (FDR), which is the proportion of non-informative variables in the set of all selected variables (Stehman 1997).
The performance of estimation is evaluated by reporting the bias, the mean squared error (MSE) and the empirical standard error (ESE) for \(\lambda\) defined as
\[\text {Bias} = \frac{1}{n_{\text {sim}}} \sum _{s=1}^{n_{\text {sim}}} \left( \hat{\lambda }_s - \lambda \right) , \qquad \text {MSE} = \frac{1}{n_{\text {sim}}} \sum _{s=1}^{n_{\text {sim}}} \left( \hat{\lambda }_s - \lambda \right) ^2, \qquad \text {ESE} = \sqrt{\frac{1}{n_{\text {sim}} - 1} \sum _{s=1}^{n_{\text {sim}}} \left( \hat{\lambda }_s - \bar{\hat{\lambda }} \right) ^2},\]
where \(\hat{\lambda }_s\) denotes the estimate in repetition s and \(\bar{\hat{\lambda }}\) the average of the estimates over all repetitions.
For the bias, values closer to zero are preferred, while lower values are preferred for the MSE and ESE. Additionally, the effect of shrinkage estimation on the coefficients of the independent variables is evaluated via boxplots which visualize the median, quartiles and outliers over 100 repetitions (Morris et al. 2019).
Furthermore, the prediction accuracy is evaluated based on an additional test data set. The test data \(\varvec{y_{\text {test}}}\) is generated according to the same data generating process as the training data with an identical number of observations \(n_{\text {test}} = 400\). The chosen criteria are the quasi negative log-likelihood (NLL), the root mean squared error of prediction (RMSEP) and the mean absolute error of prediction (MAEP), where
\[\text {RMSEP} = \sqrt{\frac{1}{n_{\text {test}}} \sum _{i=1}^{n_{\text {test}}} \left( y_{\text {test},i} - \hat{y}_{\text {test},i} \right) ^2}, \qquad \text {MAEP} = \frac{1}{n_{\text {test}}} \sum _{i=1}^{n_{\text {test}}} \left| y_{\text {test},i} - \hat{y}_{\text {test},i} \right| ,\]
with \(\hat{y}_{\text {test},i}\) denoting the prediction for test observation i, and the NLL is the Gaussian quasi negative log-likelihood of the SDEM evaluated at the estimated parameters on the test data.
For all proposed performance criteria of prediction accuracy, lower values are always preferred.
The simulation study is conducted in the programming language R (R Core Team 2025). The QML and GMM estimation of the SDEM in the low-dimensional setting is performed via the spatialreg package (Gay 1990; Bivand et al. 2021; Pebesma and Bivand 2023). The presented graphics are created with the tidyverse packages (Wickham et al. 2019). Model-based gradient boosting for generalized, additive and interaction models can be found in the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner et al. 2014, 2015). Spatial cross-validation is performed via the sperrorest package (Brenning 2012). An implementation for model-based gradient boosting for the SDEM via the novel spatial error family incorporating the deselection algorithm and the R code for reproducibility of all simulation studies can be found in the GitHub repository https://github.com/micbalz/SpatRegBoost.
3.2 Results
3.2.1 Low-dimensional Linear Setting (n = 400, q = 20)
In Table 1, the average selection rates over 100 repetitions for the low-dimensional setting are presented. The results show a consistent TPR of \(100\%\) across all spatial autoregressive parameters \(\lambda\), indicating that all informative variables are selected on average. In contrast, the TNR varies substantially with \(\lambda\). For strongly negative spatial dependence \(\lambda = -0.8\), the TNR is relatively low at approximately \(46\%\), meaning that only about eight out of 16 non-informative variables are correctly excluded on average. As \(\lambda\) increases toward positive values, the TNR steadily improves reaching over \(90\%\) at \(\lambda = 0.8\), indicating that nearly all non-informative variables are correctly not selected. Despite this improvement in TNR, the FDR remains considerable for small or negative values of \(\lambda\). At \(\lambda = -0.8\), the FDR is around \(66\%\), indicating that two-thirds of the selected variables are non-informative on average. As \(\lambda\) increases, the FDR gradually decreases, reaching approximately \(21\%\) for \(\lambda = 0.8\). This suggests that even though the model correctly identifies all informative variables, it includes a non-negligible number of non-informative variables especially under weak or negative spatial dependence although this issue becomes substantially less pronounced for large positive \(\lambda\).
The performance of estimation for the linear effects in the low-dimensional setting can be seen in Fig. 1. The estimated coefficients are presented by boxplots where the red horizontal line represents the true coefficient values according to the data generating process. Across all spatial autoregressive parameters \(\lambda\), DS-GB manages to recover all informative variables. However, the coefficient estimates are slightly biased across all values of \(\lambda\). In general, the magnitude of the bias is stronger for spatial lags of the exogenous variables and additionally increases with \(\lambda\). However, a bias in the estimates for DS-GB is unsurprising and actually expected since the presented estimates for the linear effects are based on regularization via early stopping. In contrast, non-informative variables were rarely chosen, with coefficient estimates clustered around zero.
Fig. 1: Estimated linear effects for the low-dimensional setting with 100 repetitions, the spatial error family, model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters \(\lambda\). Horizontal red lines represent the true values
Additionally, the impact of the choice of the estimation method in the first step on the estimation accuracy of the spatial autoregressive parameter \(\lambda\) is evaluated. The results are presented in Table 2. Across almost all values of \(\lambda\), the established QML and GMM estimators struggle to estimate \(\lambda\) reliably, as indicated by substantial downward biases. However, the magnitude of this bias tends to decrease as \(\lambda\) becomes more positive where the best performance is observed at \(\lambda = 0.8\). Notably, QML generally outperforms GMM for positive values of \(\lambda\) in terms of lower bias, MSE and ESE. The results for LS-GB which utilizes OLS in the first step are identical to those for GMM as both rely on the same initial estimator of \(\varvec{\delta }\). Substantial improvements are observed when model-based gradient boosting is employed in the first step. Specifically, the GB-GB approach leads to notable reductions in bias, MSE, and ESE. These improvements are further enhanced when model-based gradient boosting is combined with deselection via DS-GB. Overall, GB-GB and DS-GB outperform or perform as well as the QML and GMM estimators across the entire range of \(\lambda\).
Finally, the prediction performance on an independent test data set for the low-dimensional setting is evaluated. The results are presented in Table 3. Across all values of the spatial autoregressive parameter \(\lambda\), the model-based gradient boosting algorithms outperform the classical approaches in terms of predictive accuracy. Specifically, the RMSEP, MAEP and the NLL are higher for QML and GMM compared to the model-based gradient boosting algorithms. DS-GB typically achieves the lowest NLL and slightly better RMSEP and MAEP although the differences between LS-GB, GB-GB, and DS-GB are minor and occur mostly at the decimal level. This indicates that the predictive performance of the model-based gradient boosting algorithms is stable and robust across different values of \(\lambda\), whereas QML and GMM show clear performance degradation especially as the spatial dependence increases.
In general, the presented simulation results show proper functionality of the model-based gradient boosting algorithms from the perspective of specificity and sensitivity in the low-dimensional setting across all considered spatial autoregressive parameters \(\lambda\). This conclusion can be drawn from the average selection rates, where all true informative variables are selected across all 100 repetitions on average. Furthermore, the estimated linear effects show the correct direction and algebraic sign. Nevertheless, biases are introduced due to the inherent regularization of model-based gradient boosting via early stopping. Although the average selection rates show a considerable FDR, the actual impact in terms of estimated coefficients for non-informative variables remains very low, meaning the included non-informative variables enter the final model only with small values. To provide intuition regarding the consistency and convergence of the proposed model-based gradient boosting algorithm, results are presented through a comparison with GMM coefficient estimates. These can be recovered by utilizing a sufficiently large stopping criterion, as demonstrated in Appendix C.1. Additionally, model-based gradient boosting either outperforms or performs as well as the established QML and GMM estimators in estimating \(\lambda\) when substantial noise is introduced through non-informative variables. As a result, the QML and GMM estimators exhibit a downward bias in the presence of such noise. This outcome is not unexpected as spatial autocorrelation typically decreases when informative variables are added to the model. However, the simulation study also indicates that introducing non-informative noise can falsely reduce spatial autocorrelation, which has serious implications for model interpretation as it may lead to highly misleading conclusions. Utilizing model-based gradient boosting can effectively mitigate these consequences by not including non-informative variables, thereby improving the quality of the residuals in the first step and decreasing the bias across all values of \(\lambda\). While model-based gradient boosting outperforms QML and GMM estimation strategies in terms of predictive accuracy on unobserved test data, a notable drawback is its relatively high FDR, indicating that the final model may still include many non-informative variables. However, its ability to exclude such variables improves drastically as \(\lambda\) increases. Despite this improvement, the persistently high FDR suggests that model complexity remains an issue even under stronger spatial dependence.
3.2.2 High-dimensional Linear Setting (n = 400, q = 800)
Similar to the low-dimensional setting, results are also evaluated for the high-dimensional setting. In this case, only the GB-GB and DS-GB approaches are reported as model-based gradient boosting remains the only viable option among the considered methods when the number of variables exceeds the number of observations. The average selection rates across 100 repetitions are shown in Table 4. Across all values of the spatial autoregressive parameter \(\lambda\), the TPR remains at \(100\%\), indicating that all informative variables are reliably selected on average. In contrast to the low-dimensional setting, the TNR is also high, suggesting that most non-informative variables are successfully excluded. Notably, the TNR improves with increasing \(\lambda\), reaching approximately \(99.67\%\) at \(\lambda = 0.8\). This reflects a strong ability of the method to discriminate between informative and non-informative variables in high-dimensional settings, particularly under strong positive spatial dependence. Despite these improvements, the FDR remains high for lower values of \(\lambda\), though it steadily decreases as \(\lambda\) increases. At \(\lambda = -0.8\), nearly all selected variables are non-informative on average, whereas this proportion drops substantially to about \(28\%\) at \(\lambda = 0.8\). While these FDR values may seem high, they are not unexpected given the small number of informative variables relative to the large number of non-informative ones.
The estimation performance of the coefficients in the high-dimensional setting is illustrated in Fig. 2. Compared to the low-dimensional setting, the coefficient estimates exhibit a more pronounced shrinkage effect. Furthermore, the shrinkage effect additionally intensifies as the spatial autoregressive parameter \(\lambda\) increases and becomes positive. Nevertheless, the algebraic signs and general direction of the coefficients remain correct even in the presence of many non-informative variables. In contrast, non-informative variables were again rarely selected, with coefficient estimates clustered around zero.
Fig. 2: Estimated linear effects for the high-dimensional setting with 100 repetitions, the spatial error family, model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters \(\lambda\). Horizontal red lines represent the true values
Furthermore, the estimation performance for the spatial autoregressive parameter \(\lambda\) in the high-dimensional setting is shown in Table 5. In this scenario, classical methods such as QML, GMM, and LS-GB are no longer applicable due to the high dimensionality of the predictor space. Consequently, only the model-based gradient boosting algorithms are considered as feasible estimation strategies. The results reveal a clear divergence in estimation behavior between the two algorithms. For GB-GB, the bias in the estimation of \(\hat{\lambda }\) increases steadily with the magnitude of \(\lambda\). This trend likely stems from the compounded effect of shrinkage on informative variables and the influence of a large number of non-informative variables. Since GB-GB does not actively remove non-informative variables, the residuals on which the estimation of \(\lambda\) heavily depends are more likely to be contaminated by noise leading to underestimation of \(\lambda\). In contrast, DS-GB demonstrates substantially lower bias across all values of \(\lambda\) with particularly notable improvements at higher spatial dependence. This improved performance is attributable to the deselection algorithm in the first step which effectively removes non-informative variables, thereby mitigating the adverse impact of noise induced by shrinkage. As a result, DS-GB achieves not only lower bias but also smaller MSE and ESE, enabling more accurate recovery of the true spatial autoregressive parameter in high-dimensional settings.
Finally, the predictive performance on independent test data in the high-dimensional setting is reported in Table 6. While the observed performance is slightly more variable than in the low-dimensional setting, the overall magnitude of the RMSEP, MAEP and NLL remains comparable. Notably, the predictive accuracy of DS-GB improves as the spatial autoregressive parameter \(\lambda\) increases which mirrors the pattern of reduced bias in the estimation of \(\hat{\lambda }\) discussed previously. This improvement is particularly evident in the NLL, where DS-GB achieves considerably better performance than GB-GB for higher values of \(\lambda\). However, a direct comparison to QML or GMM remains infeasible in this setting, reinforcing the advantage of model-based gradient boosting approaches.
The results for the high-dimensional setting support those of the low-dimensional setting and indicate proper functionality of the model-based gradient boosting algorithm from the perspective of specificity and sensitivity across all spatial autoregressive parameters \(\lambda\). The TPR remains at \(100\%\) across all 100 repetitions on average. Although the estimated coefficients are more strongly affected by shrinkage, the algebraic sign and general direction remain correct. GB-GB exhibits a strong bias as \(\lambda\) increases, indicating that deselection in the first step is necessary to reliably estimate \(\lambda\) in high-dimensional settings. Nevertheless, the great advantage is the feasibility of model-based gradient boosting in scenarios where the number of variables exceeds the number of observations. In such situations, established estimation strategies like QML and GMM fail entirely, as evidenced in the simulation study by missing values. Moreover, the predictive performance as measured by the evaluation criteria remains comparable in absolute terms to that of the low-dimensional setting. However, the results also reveal a high FDR in the evaluation of the variable selection performance. Notably, no distinction is made between variables that are frequently selected and those that are selected only once, meaning all variables are equally weighted in the computation of the FDR. To address this limitation, the post-hoc deselection algorithm proposed by Strömer et al. (2022) is utilized to improve the FDR of the model-based gradient boosting algorithm by removing variables that are likely selected merely by chance.
3.2.3 Deselection
As evidenced by the results for the low- as well as the high-dimensional setting, model-based gradient boosting for the SDEM suffers from a “greedy” selection behavior which contributes to the inclusion of too many non-informative variables, thereby decreasing the TNR and increasing the FDR. The reason for the “greedy” nature lies in the fact that the solutions of model-based gradient boosting are optimal with respect to the \(L_1\)-arc-length. In fact, for any convex loss function, model-based gradient boosting approximates the solution path of a strictly monotone \(L_1\)-regularized regression model. Therefore, model-based gradient boosting is unable to automatically deselect a variable once it has been added to the model (Hastie et al. 2007; Hepp et al. 2016). To mitigate the consequences, the deselection algorithm for generalized linear, additive and interaction models is adapted to model-based gradient boosting for the SDEM and evaluated in the following simulations (Strömer et al. 2022).
The average selection rates after utilizing model-based gradient boosting with first step gradient boosting with deselection and an additional deselection step (DS-DS) are presented in Table 7. Indeed, the results clearly show that the TNR increases to \(100\%\) while the FDR decreases to \(0\%\) on average. Thus, DS-DS successfully avoids selecting non-informative variables over all 100 repetitions, ensuring that only informative variables are included in the final model. This behavior is observed in the low- as well as the high-dimensional setting. Therefore, the deselection algorithm is able to mitigate the consequences of including additional non-informative variables in the boosted SDEM. In particular, the “greedy” nature of model-based gradient boosting can be effectively mitigated through the use of deselection. Thus, Algorithm 2 provides a systematic approach for post-hoc deselection of variables that would otherwise remain in the model due to the “greedy” selection of model-based gradient boosting. Moreover, deselection helps to address the limitations of regularization via early stopping, as it allows model-based gradient boosting to be reapplied after drastically reducing the number of variables. Thus, utilizing model-based gradient boosting in combination with deselection is strongly encouraged to ensure proper model and variable selection properties.
4 Case Study: Modeling Life Expectancy in German Districts
Life expectancy, defined as the average number of years a newborn in a given population is expected to live under current mortality conditions, is a statistical measure widely used to assess a variety of health, social, and economic outcomes (Marmot 2005; Cutler et al. 2006). Over the past several decades, life expectancy has been steadily increasing across Europe, indicating progress in healthcare, socioeconomic development, and public health interventions. Notably, countries in the European Union report an increase in longevity due to improved living standards, better medical care, and lower mortality from major diseases such as cardiovascular conditions and cancer. Nevertheless, large regional disparities remain between and within countries. For instance, Germany, one of the largest and economically wealthiest nations in the European Union, has seen signs of stagnation in life expectancy, especially in comparison to Northern European countries. This stagnation gives rise to important concerns about regional inequalities, healthcare system efficiency, lifestyle risks, and demographic changes (OECD and European Commission 2024). In this case study, the goal is to model life expectancy in German districts by utilizing spatial regression models with autoregressive disturbances to answer questions regarding underlying socio-economic and other determinants of life expectancy while incorporating district-level geographical disparities. In principle, the case study builds on recent previous research such as Lampert et al. (2019), Rau and Schmertmann (2020), Siegel et al. (2022), Jasilionis et al. (2023), Marinetti et al. (2023), Hoebel et al. (2025), which investigates current trends, socio-economic drivers of life expectancy, the impact of the COVID-19 pandemic and patterns of mortality from both district-level and complete country perspectives.
To this end, a large-scale real-world data base, namely INKAR, is utilized. Managed by the Bundesinstitut für Bau-, Stadt- und Raumforschung, INKAR is an interactive (online) atlas about the living situation in Germany. It contains over 600 unique socio-demographic, socio-economic, and environmental indicators for distinct geographical locations allowing for evaluations of urban and rural disparities. In principle, INKAR is a panel data set relying on information starting from the year 1995. However, in this case study, the focus is on the cross-sectional life expectancy in German districts in the year 2019. Therefore, INKAR in the version of 2021 is utilized (Bundesinstitut für Bau-, Stadt- und Raumforschung (BBSR) 2024). The INKAR data set in the current as well as the 2021 version with the codebook for all indicators is freely available at https://www.inkar.de/.
To construct a data set suitable for this case study, the following preprocessing steps are necessary. First, the original number of districts available in INKAR is 401. However, the data is merged with the German map data from the year 2024 available at https://www.bkg.bund.de/. The map data in the most recent version includes only 400 districts because the district “Eisenach” has been merged with a neighboring district in 2022 (Bundesamt für Kartographie und Geodäsie (BKG) 2024). To ensure comparability between both data sets, “Eisenach” is thus removed from the INKAR 2021 data, resulting in 400 districts of interest. Second, a spatial weight matrix has to be generated based on the specific spatial configuration of the German districts. Therefore, the centroids of the spatial polygons based on the shape file layer from the German map data are computed. To keep a close connection to the simulation study in Section 3, a 10-nearest neighbor structure is created, indicating that each district is assigned its ten geographically closest neighbors. Afterward, the resulting spatial weight matrix is row-normalized. Third, since the focus is primarily on the evaluation of variable selection, the indicators are allowed to be quite general. Thus, no domain knowledge is imposed in the selection of variables. However, dubious indicators such as the identification number of a district as well as indicators describing similar phenomena are removed before model estimation. Furthermore, transformations for continuous variables, that is, centering and scaling, are applied to ensure comparability across variables in terms of scale.
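The described preprocessing can be sketched as follows, assuming an sf object `districts` read from the BKG shape file and a data frame `inkar` of numeric INKAR indicators; all object names are illustrative.

```r
# Sketch of the weight-matrix construction and variable scaling for the case study.
library(sf)
library(spdep)
coords <- st_coordinates(st_centroid(st_geometry(districts)))  # polygon centroids
nb     <- knn2nb(knearneigh(coords, k = 10))   # 10-nearest-neighbor structure
W      <- nb2mat(nb, style = "W")              # row-normalized spatial weight matrix
X      <- scale(as.matrix(inkar))              # center and scale the indicators
```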
The final data set is composed of 400 observations where each location corresponds to a district in Germany. These districts are divided into 294 rural districts and 106 urban cities. A map of the life expectancy in German districts in the year 2019 can be seen in Fig. 3. The map reveals distinct disparities in life expectancy across German districts with clear spatial clustering among neighboring regions. Notably, the highest life expectancy is observed in southern Bavaria near the borders with Austria and Switzerland. Additional areas of high life expectancy can be found in Lower Saxony close to the Dutch border. In contrast, life expectancy in the new federal states (former East Germany) is generally lower than in the old federal states (former West Germany), indicating a clear east-west divide. Details on the data types and descriptions of the dependent and independent variables of interest are provided in Table 8.
As evidenced by Fig. 3, ignoring spatial autocorrelation in the estimation process may lead to a misrepresentation of the underlying determinants influencing life expectancy. Therefore, the objective is to model life expectancy in German districts for the year 2019 using spatial regression models with autoregressive disturbances. In particular, the SDEM is employed as spatial autocorrelation may arise not only in the disturbances but also through spatial lags of the exogenous variables. For example, the number of medical doctors in neighboring districts may influence life expectancy given the relatively short distances between locations. Generally, the results are presented with a clear focus on model-based gradient boosting. Since the application is a low-dimensional setting, results are reported for the QML, GMM, LS-GB, GB-GB, DS-GB and DS-DS defined exactly as in the simulation study in Section 3. All exogenous variables presented in Table 8 as well as the corresponding first-order spatial lags and an intercept are included in the estimation yielding 65 coefficients to estimate in a non-parsimonious model. Regarding the setup for the model-based gradient boosting algorithm, the narrative from the simulation study is followed and the learning rate is thus set to \(s = 0.1\). Furthermore, for the optimization of the stopping criterion \(m_{\text {stop}}\), the search is conducted by minimizing the empirical risk via 10-fold spatial cross-validation as discussed in Brenning (2012). For the deselection algorithm, \(\tau\) is set to 0.01. The estimated coefficients for all estimation strategies can be seen in Table 9.
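For the low-dimensional benchmarks, a sketch of the QML and GMM fits via spatialreg could look as follows, assuming a model formula `f` (life expectancy on the scaled indicators), a data frame `dat` and the weight matrix W from above; all object names are illustrative.

```r
# Benchmark fits (sketch): Durbin = TRUE adds the spatially lagged covariates,
# yielding the SDEM under QML estimation.
library(spatialreg)
lw <- spdep::mat2listw(W, style = "W")
fit_qml <- errorsarlm(f, data = dat, listw = lw, Durbin = TRUE)  # QML SDEM
fit_gmm <- GMerrorsar(f, data = dat, listw = lw)  # GMM; WX terms are added to f for the SDEM
```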
For brevity, the coefficients of the spatial lags of exogenous variables are not reported individually. Instead, the number of spatial lag variables included in each model (No. \(\varvec{WX}\)) is presented. Regarding variable selection, the LS-GB approach reduces the total number of coefficients from 65 to 36, out of which 23 can be attributed to the spatial lags of exogenous variables. A similar but improved outcome is observed for GB-GB, where the number of included variables is reduced to 29, out of which 15 correspond to the coefficients of the spatial lags of exogenous variables. The best results in terms of variable selection are achieved when model-based gradient boosting is combined with deselection. For DS-GB, the number of selected variables is reduced to 11 with no coefficients of the spatial lags of exogenous variables included. DS-DS further improves the performance of variable selection through post-hoc deselection, resulting in a final parsimonious model with only 5 variables. These results suggest that many of the variables listed in Table 8 are non-informative. Notably, model-based gradient boosting combined with deselection effectively reduces the initially non-parsimonious SDEM to a simpler and more parsimonious SEM in a data-driven manner without the need for formal likelihood-based model selection criteria or individual domain knowledge.
Regarding the spatial autoregressive parameter, all estimation strategies indicate positive spatial dependence between neighboring districts, although the strength varies. In line with the downward bias observed in the simulation study, both QML and GMM estimators tend to underestimate the spatial autoregressive parameter \(\lambda\). This underestimation can be attributed to the inclusion of non-informative variables which are largely eliminated through model-based gradient boosting. Consequently, the estimator of \(\lambda\) obtained from model-based gradient boosting is generally higher, depending on the first step estimation method. Moreover, the algebraic sign and overall direction of the coefficients remain identical across all estimation strategies. The primary difference lies in the magnitude of the coefficients, which varies due to the shrinkage effects introduced by early stopping in model-based gradient boosting. Based on the results for DS-DS, the most important variables explaining life expectancy in German districts in 2019 are the unemployment rate, the debt quota, the proportion of employees with an academic degree, the rent prices and the number of care-dependent individuals. Since the independent variables are transformed by simple scaling and centering, the coefficients have an intuitive and simple interpretation. For instance, an increase in rent prices by one euro ceteris paribus increases the average life expectancy by 0.2185 years on average. Conversely, holding all other variables constant, a 1 percentage point increase in the share of private debtors is associated with an average 0.4268 year decrease in average life expectancy.
To summarize the findings, the positive spatial autoregressive parameter \(\lambda\) indicates strong spatial dependence between districts, that is, the disturbance in one district is strongly influenced by the disturbances in neighboring districts. Thus, health outcomes cluster spatially as suggested in Fig. 3. Furthermore, the number of care-dependent individuals negatively influences average life expectancy, suggesting that an underlying poor health status reduces longevity. Districts with higher unemployment rates and debt quotas exhibit lower average life expectancies, indicating that financial hardship, fiscal stress or economic precarity is negatively associated with longevity. In contrast, higher rent prices and a higher share of employees with an academic degree are associated with a higher average life expectancy, which suggests that inhabitants of wealthier districts with better socioeconomic status live longer on average.
5 Conclusion
The key findings and main contributions of this article are: (a) Model-based gradient boosting is extended to spatial regression models with autoregressive disturbances. (b) The algorithm is implemented in the mboost package (Bühlmann and Hothorn 2007; Hothorn et al. 2010; Hofner et al. 2014, 2015) by providing a novel spatial error family. (c) Model-based gradient boosting for the SDEM strongly relies on knowledge of the spatial autoregressive parameter \(\lambda\). However, in real-world application settings \(\lambda\) is generally unknown, rendering plain model-based gradient boosting infeasible. Therefore, a feasible model-based gradient boosting algorithm for the SDEM is proposed which relies on replacing the unknown quantities \(\lambda\) and \(\sigma ^2\) in the ingredients by the corresponding estimators \(\hat{\lambda }\) and \(\hat{\sigma }^2\) proposed in Kelejian and Prucha (1999). (d) In extensive simulation studies for low- as well as high-dimensional settings, the results show proper functionality of the proposed feasible model-based gradient boosting algorithm. In particular, the results are accompanied by a high TPR and the coefficients are estimated with high accuracy, although biases are introduced due to regularization via early stopping. Model-based gradient boosting also outperforms or performs as well as the QML and GMM estimators in terms of bias, MSE and ESE when estimating the autoregressive parameter \(\lambda\) in the presence of non-informative variables. Additionally, the predictive performance on an independent test data set is consistently better for model-based gradient boosting in comparison to QML and GMM. (e) A great advantage of model-based gradient boosting is its feasibility in high-dimensional settings where the number of variables exceeds the number of observations. Due to its modular nature, model-based gradient boosting does not require refitting and allows for a direct interpretation of the effects of coefficients on the dependent variable of interest, enabling application to a wide range of real-world settings. (f) Additionally, the feasible model-based gradient boosting algorithm is applied in a real-world setting where the life expectancy in German districts is modeled. Additional case studies revisiting classical spatial econometric data sets, namely Boston housing prices (Harrison Jr and Rubinfeld 1978) and Columbus crime rates (Anselin 1988), are provided in Appendix D.
Naturally, limitations, improvements and extensions beyond the scope of this article have to be acknowledged. Although simulation studies have been provided, the scope of the simulation settings is by no means exhaustive. Further simulation studies, such as those with varying spatial weight matrices, partially address these limitations, but much more complex scenarios could easily be considered. For instance, additional correlated noise variables or higher-order spatial lags of exogenous variables may be included in the simulation settings. A major limitation of the proposed model-based gradient boosting algorithm is its restricted generalizability. Specifically, the algorithm can only be applied in settings where the dependent variable does not appear as a spatial lag in the spatial regression model. Thus, the implemented model-based gradient boosting algorithm cannot be readily extended to spatial autoregressive models which are promising candidates for a wide range of applications. Additionally, the algorithm assumes homoskedastic innovations, a potentially unrealistic assumption in many real-world application settings. In fact, heteroskedasticity is a common feature of spatial data sets (LeSage and Pace 2009). Furthermore, as discussed in Section 2, a formal proof of the convergence of model-based gradient boosting using the squared Mahalanobis distance is still lacking. While the numerical results in Appendix C.1 indicate that the coefficients of model-based gradient boosting for the SDEM converge toward those of the GMM, these findings do not replace a formal theoretical justification of the method. Revisiting the average selection rates reveals high FDR values, indicating the inclusion of many non-informative variables of minor importance in the final model. To mitigate these consequences, the utilization of a deselection algorithm is proposed. Alternatively, so-called stability selection could be utilized, which enables control over the number of false positives (Meinshausen and Bühlmann 2010; Shah and Samworth 2012; Hofner et al. 2015). Furthermore, probing, which is currently available for univariate location models, could be adapted to model-based gradient boosting for the SDEM. The general idea is to obtain sparser models by stopping the algorithm as soon as the first randomly permuted version of a variable is added (Thomas et al. 2017). Finally, p-values for individual base-learners could be obtained by utilizing permutation techniques (Hepp et al. 2019).
Therefore, the next goal is to extend model-based gradient boosting for the SDEM by incorporating heteroskedastic innovations and adapting it to panel data models with random and fixed effects (Anselin 1988; Kapoor et al. 2007; Baltagi et al. 2013). Such an extension would enable a more nuanced analysis of life expectancy across German districts. For example, the INKAR dataset provides panel data, making it possible to study the underlying determinants of life expectancy over time. Furthermore, the FDR, TNR and bias in the simulation study are influenced by the magnitude of \(\lambda\). This behavior is possibly related to the spectrum of the spatial weight matrix. In particular, if the eigenvalue moduli are bounded by about one, positive values of \(\lambda\) approach the upper bound of the admissible range, leading to stronger regularization and more pronounced variable selection. However, a complete theoretical explanation remains elusive, representing an interesting direction for further research. Beyond the spatial econometrics context, the novel model-based gradient boosting algorithm could be extended and applied to network econometric models (Lee et al. 2010). In conclusion, practitioners and applied statisticians working with spatial data are encouraged to utilize model-based gradient boosting for the SDEM as a valuable alternative for estimation, regularization, model and variable selection in future research.
6 Supplementary Information
The data as well as the accompanying codebook on life expectancy in German districts are publicly available via https://www.inkar.de/ (DL-DE BY 2.0).
All R-code for the implemented model-based gradient boosting algorithm along with the simulation studies is publicly available in the following GitHub repository https://github.com/micbalz/SpatRegBoost.
Data Availability
Data is provided within the manuscript or supplementary information files.
References
Anselin L (1988) Spatial econometrics: methods and models. Springer Science & Business Media
Baltagi BH, Egger P, Pfaffermayr M (2013) A generalized spatial panel data model with random effects. Economet Rev 32(5–6):650–685. https://doi.org/10.1080/07474938.2012.742342
Bivand R, Millo G, Piras G (2021) A review of software for spatial econometrics in R. Mathematics 9(11). https://doi.org/10.3390/math9111276
Bivand R, Nowosad J, Lovelace R (2025) spData: datasets for spatial analysis. https://github.com/nowosad/spdata, R package version 2.3.4
Brenning A (2012) Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: the R package sperrorest. In: 2012 IEEE international geoscience and remote sensing symposium, IEEE, pp 5372–5375. https://doi.org/10.1109/IGARSS.2012.6352393
Bühlmann P (2006) Boosting for high-dimensional linear models. Ann Stat 34(2):559–583. https://doi.org/10.1214/009053606000000092
Bühlmann P, Hothorn T (2007) Boosting algorithms: regularization, prediction and model fitting. Stat Sci 22(4):477–505. https://doi.org/10.1214/07-STS242
Bundesamt für Kartographie und Geodäsie (BKG) (2024) Geodaten des BKG. https://www.bkg.bund.de
Bundesinstitut für Bau-, Stadt- und Raumforschung (BBSR) (2024) Laufende Raumbeobachtung des BBSR - INKAR, Ausgabe 03/2024. https://www.inkar.de/
Cai L, Maiti T (2020) Variable selection and estimation for high-dimensional spatial autoregressive models. Scand J Stat 47(2):587–607. https://doi.org/10.1111/sjos.12452
Cai L, Bhattacharjee A, Calantone R et al (2019) Variable selection with spatially autoregressive errors: a generalized moments lasso estimator. Sankhya B 81(Suppl 1):146–200. https://doi.org/10.1007/s13571-018-0176-z
Cliff A, Ord J (1973) Spatial autocorrelation. Pion, London
Cutler D, Deaton A, Lleras-Muney A (2006) The determinants of mortality. J Econ Perspect 20(3):97–120. https://doi.org/10.1257/jep.20.3.97
Fahrmeir L, Kneib T, Lang S et al (2013) Regression: models, methods and applications. Springer-Verlag, Berlin
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232. https://doi.org/10.1214/aos/1013203451
Gay DM (1990) Usage summary for selected optimization routines. Computing Science Technical Report 153, AT&T Bell Laboratories, Murray Hill, NJ
Halleck Vega S, Elhorst JP (2015) The SLX model. J Reg Sci 55(3):339–363. https://doi.org/10.1111/jors.12188
Harrison D Jr, Rubinfeld DL (1978) Hedonic housing prices and the demand for clean air. J Environ Econ Manag 5(1):81–102. https://doi.org/10.1016/0095-0696(78)90006-2
Hastie T, Taylor J, Tibshirani R et al (2007) Forward stagewise regression and the monotone lasso. Electron J Stat 1:1–29. https://doi.org/10.1214/07-EJS004
Hepp T, Schmid M, Gefeller O et al (2016) Approaches to regularized regression - a comparison between gradient boosting and the lasso. Methods Inf Med 55(5):422–430. https://doi.org/10.3414/ME16-01-0033
Hepp T, Schmid M, Mayr A (2019) Significance tests for boosted location and scale models with linear base-learners. Int J Biostat 15(1). https://doi.org/10.1515/ijb-2018-0110
Hoebel J, Michalski N, Baumert J et al (2025) The life expectancy gap: socioeconomic differences in life expectancy between areas in Germany. J Health Monit 1–8. https://doi.org/10.25646/13026
Hofner B, Mayr A, Robinzonov N et al (2014) Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat 29:3–35. https://doi.org/10.1007/s00180-012-0382-5
Hofner B, Boccuto L, Goeker M (2015) Controlling false discoveries in high-dimensional situations: boosting with stability selection. BMC Bioinform 16(144). https://doi.org/10.1186/s12859-015-0575-3
Hothorn T, Bühlmann P, Kneib T et al (2010) Model-based boosting 2.0. J Mach Learn Res 11:2109–2113. https://doi.org/10.5555/1756006.1859922
Jasilionis D, van Raalte AA, Klüsener S et al (2023) The underwhelming German life expectancy. Eur J Epidemiol 38(8):839–850. https://doi.org/10.1007/s10654-023-00995-5
Kapoor M, Kelejian HH, Prucha IR (2007) Panel data models with spatially correlated error components. J Econom 140(1):97–130. https://doi.org/10.1016/j.jeconom.2006.09.004
Kelejian HH, Prucha IR (1999) A generalized moments estimator for the autoregressive parameter in a spatial model. Int Econ Rev 40(2):509–533. https://doi.org/10.1111/1468-2354.00027
Lampert T, Hoebel J, Kroll LE (2019) Social differences in mortality and life expectancy in Germany current situation and trends. J Health Monit 4(1):3. https://doi.org/10.25646/5872
Lee LF (2004) Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72(6):1899–1925. https://doi.org/10.1111/j.1468-0262.2004.00558.x
Lee LF, Liu X, Lin X (2010) Specification and estimation of social interaction models with network structures. Economet J 13(2):145–176. https://doi.org/10.1111/j.1368-423X.2010.00310.x
LeSage J, Pace RK (2009) Introduction to spatial econometrics. Chapman and Hall/CRC
Marinetti I, Jdanov D, Grigoriev P et al (2023) Effects of the COVID-19 pandemic on life expectancy and premature mortality in the German federal states in 2020 and 2021. PLoS ONE 18(12):e0295763. https://doi.org/10.1371/journal.pone.0295763
Marmot M (2005) Social determinants of health inequalities. Lancet 365(9464):1099–1104. https://doi.org/10.1016/S0140-6736(05)71146-6
Mayr A, Hofner B, Schmid M (2012) The importance of knowing when to stop. Methods Inf Med 51(02):178–186. https://doi.org/10.3414/ME11-02-0030
Mayr A, Binder H, Gefeller O et al (2014) The evolution of boosting algorithms from machine learning to statistical modelling. Methods Inform Med 53(6):419–427. https://doi.org/10.3414/ME13-01-0122
Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Ser B Stat Methodol 72(4):417–473. https://doi.org/10.1111/j.1467-9868.2010.00740.x
Morris TP, White IR, Crowther MJ (2019) Using simulation studies to evaluate statistical methods. Stat Med 38(11):2074–2102. https://doi.org/10.1002/sim.8086
OECD, European Commission (2024) Health at a glance: Europe 2024: state of health in the EU cycle. OECD Publishing, Paris
Pebesma E, Bivand RS (2023) Spatial data science with applications in R. Chapman & Hall. https://r-spatial.org/book/
R Core Team (2025) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rau R, Schmertmann CP (2020) District-level life expectancy in Germany. Dtsch Arztebl Int 117(29–30):493. https://doi.org/10.3238/arztebl.2020.0493
Schmid M, Hothorn T (2008) Boosting additive models using component-wise P-Splines. Comput Stat Data Anal 53(2):298–311. https://doi.org/10.1016/j.csda.2008.09.009
Schratz P, Muenchow J, Iturritxa E et al (2019) Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data. Ecol Model 406:109–120. https://doi.org/10.1016/j.ecolmodel.2019.06.002
Shah RD, Samworth RJ (2012) Variable selection with error control: another look at stability selection. J R Stat Soc Ser B Stat Methodol 75(1):55–80. https://doi.org/10.1111/j.1467-9868.2011.01034.x
Siegel A, Schug JF, Rieger MA (2022) Social determinants of remaining life expectancy at age 60: a district-level analysis in Germany. Int J Environ Res Public Health 19(3):1530. https://doi.org/10.3390/ijerph19031530
Stehman SV (1997) Selecting and interpreting measures of thematic classification accuracy. Remote Sens Environ 62(1):77–89. https://doi.org/10.1016/S0034-4257(97)00083-7
Strömer A, Staerk C, Klein N et al (2022) Deselection of base-learners for statistical boosting—with an application to distributional regression. Stat Methods Med Res 31(2):207–224. https://doi.org/10.1177/0962280221105108
Thomas J, Hepp T, Mayr A et al (2017) Probing for sparse and fast variable selection with model-based boosting. Comput Math Methods Med 2017. https://doi.org/10.1155/2017/1421409
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Stat Methodol 58(1):267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Valavi R, Elith J, Lahoz-Monfort JJ et al (2019) blockcv: an R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol Evol 10(2):225–232. https://doi.org/10.1111/2041-210X.13107
Whittle P (1954) On stationary processes in the plane. Biometrika 41(3/4):434–449. https://doi.org/10.2307/2332724
Wickham H, Averick M, Bryan J et al (2019) Welcome to the tidyverse. J Open Source Softw 4(43):1686. https://doi.org/10.21105/joss.01686
Yue M, Xi J (2025) Sparse boosting for additive spatial autoregressive model with high dimensionality. Mathematics 13(5):757. https://doi.org/10.3390/math13050757
Zhang T, Yu B (2005) Boosting with early stopping: convergence and consistency. Ann Stat 33(4):1538–1579. https://doi.org/10.1214/009053605000000255
Funding
Open Access funding enabled and organized by Projekt DEAL. The work on this article was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within project 492988838.
Author information
Contributions
All work was carried out solely by M.B.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
A Notation
Table 10 summarizes important notation utilized throughout the article.
B Practitioner’s Note
Algorithm 3 provides a summary of the three-step feasible model-based gradient boosting algorithm for the SDEM and serves as a guide for practitioners. The initialization primarily involves setting up the required data and estimating the spatial autoregressive parameter as well as the variance of the idiosyncratic random innovations. The choice of the first-step consistent estimator depends on the considered scenario. Based on the simulation studies, the general recommendation is to employ model-based gradient boosting with deselection according to Algorithm 2 in both low- and high-dimensional settings when many non-informative variables are present. Otherwise, QML and GMM are the preferred options in low-dimensional settings (see, for example, Table 12).
Subsequently, the following optimization problem must be solved to compute the non-linear least squares estimators \((\hat{\lambda }, \hat{\sigma }^2)\):
$$\begin{aligned} (\hat{\lambda }, \hat{\sigma }^2) = \underset{(\lambda , \, \sigma ^2)}{\arg \min } \; \left[ \varvec{g} - \varvec{G} \, (\lambda , \lambda ^2, \sigma ^2)' \right] ' \left[ \varvec{g} - \varvec{G} \, (\lambda , \lambda ^2, \sigma ^2)' \right] . \end{aligned}$$
The optimization involves the sample moments matrix \(\varvec{G}\) and vector \(\varvec{g}\), which are computed from the predictors of the disturbances \(\hat{\varvec{u}}\) and their corresponding spatial lags as proposed by Kelejian and Prucha (1999). The optimization routine adopted in this article closely follows the spatialreg package in R and is based on unconstrained optimization via PORT routines. Furthermore, the optimization function requires the specification of suitable starting values. For the spatial autoregressive parameter, the starting value is given by
$$\begin{aligned} \lambda ^{(0)} = \frac{\hat{\varvec{u}}' \varvec{W} \hat{\varvec{u}} \, / \, \hat{\varvec{u}}' \hat{\varvec{u}}}{n^{-1} \sum _{i=1}^{n} \sum _{j=1}^{n} w_{ij}}, \end{aligned}$$
which represents the scaled spatial correlation between the predictors of the disturbances and their spatial lags, normalized by the average number of neighbors. For the variance, the starting value is defined as
$$\begin{aligned} \hat{\sigma }^{2,(0)} = \frac{1}{n} \, \hat{\varvec{u}}' \hat{\varvec{u}}, \end{aligned}$$
that is, the sample variance of the predicted disturbances. The convergence criterion for the optimization in the nlminb function in R is based on changes in the objective function and parameter values. Specifically, convergence is declared when these changes become sufficiently small. In this article, the R default settings are utilized (Gay 1990; Bivand et al. 2021; Pebesma and Bivand 2023).
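For illustration, a minimal self-contained sketch of this moment-matching step is given below. It assumes first-step residuals u and a dense spatial weight matrix W; the moment expressions follow the verbal description above and Kelejian and Prucha (1999), while the actual implementation in this article follows the spatialreg package.

```r
## Hedged sketch of the Kelejian-Prucha (1999) moment matching, assuming
## first-step residuals u and a dense spatial weight matrix W.
kp_nls <- function(u, W) {
  n   <- length(u)
  ub  <- as.numeric(W %*% u)    # spatial lag of the residuals, W u
  ubb <- as.numeric(W %*% ub)   # second-order lag, W^2 u
  # sample moments matrix G and vector g implied by the three moment conditions
  G <- rbind(c(2 * sum(u * ub),          -sum(ub^2),     n),
             c(2 * sum(ub * ubb),        -sum(ubb^2),    sum(W^2)),  # tr(W'W)
             c(sum(u * ubb) + sum(ub^2), -sum(ub * ubb), 0)) / n
  g <- c(sum(u^2), sum(ub^2), sum(u * ub)) / n
  obj <- function(p) {          # p = (lambda, sigma2)
    nu <- g - G %*% c(p[1], p[1]^2, p[2])
    sum(nu^2)
  }
  # starting values as described above
  lam0 <- (sum(u * ub) / sum(u^2)) / (sum(W) / n)  # scaled spatial correlation
  sig0 <- sum(u^2) / n                             # sample variance of residuals
  nlminb(start = c(lam0, sig0), objective = obj)$par
}
```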
Algorithm 3 then proceeds with model-based gradient boosting according to Algorithm 1. If deselection was applied in the first step, the resulting method is denoted as DS-GB in this article. Although the stopping criterion can in principle be chosen freely, doing so is rarely advisable as an arbitrary choice can result in a suboptimal bias–variance trade-off. Therefore, spatial cross-validation according to Brenning (2012) should be employed to avoid such issues, as sketched below. The learning rate is set to \(s = 0.1\), following common practice (see, for example, Schmid and Hothorn 2008; Mayr et al. 2012; Hofner et al. 2014).
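A minimal sketch of this tuning step, assuming the boosting model fit from above and a matrix of location coordinates coords (both hypothetical inputs), could use k-means-based spatial partitioning in the spirit of Brenning (2012) to build the fold weights required by cvrisk:

```r
## Hedged sketch of 10-fold spatial cross-validation for early stopping;
## `fit` and `coords` are assumed inputs.
set.seed(1)
block <- kmeans(coords, centers = 10)$cluster              # spatially compact blocks
folds <- sapply(1:10, function(b) as.integer(block != b))  # n x 10 matrix of 0/1 weights
cvr   <- cvrisk(fit, folds = folds)                        # out-of-block empirical risk
fit   <- fit[mstop(cvr)]                                   # early stopping at the minimizer
```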
In principle, post-hoc deselection according to Algorithm 2 can additionally be applied to reduce the high FDR arising from model-based gradient boosting. This step leads to more parsimonious models and less regularized coefficients. The corresponding method is referred to as DS-DS. Thus, the high FDR observed in model-based gradient boosting for the SDEM can be effectively mitigated by applying post-hoc deselection (see, for example, Section 3). Hence, for reliable model and variable selection, post-hoc deselection is highly recommended. In general, the threshold should be set to values \(\tau \le 0.025\), as evidenced in Figs. 4 and 5.
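For intuition, the deselection criterion can be sketched in a few lines: each base-learner's share of the total in-bag risk reduction is computed, and base-learners falling below the threshold \(\tau\) are dropped. This is a sketch of the criterion only; the full algorithm of Strömer et al. (2022) subsequently refits the model on the retained variables.

```r
## Hedged sketch of the deselection criterion, assuming `fit` is an mboost
## model stopped at m_stop (the full algorithm refits on the kept variables).
tau <- 0.01
r   <- risk(fit)                          # in-bag empirical risk path
dr  <- -diff(r)                           # risk reduction per boosting iteration
sel <- selected(fit)                      # base-learner picked in each iteration
contrib <- tapply(dr, sel, sum) / sum(dr) # share of total risk reduction
keep <- as.integer(names(contrib)[contrib >= tau])  # base-learners to retain
```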
C Further Simulation Results
In this section, additional simulation results are provided. An intuition about the convergence of model-based gradient boosting for the SDEM is given in Table 11. An evaluation of the choice of the threshold in deselection can be seen in Figs. 4 and 5. Moreover, further simulations are based on the artificial spatial weight matrix in a circular world discussed in Kelejian and Prucha (1999), in which each location is connected to the k locations before and after it; a construction sketch is given below. The small sample behavior of the non-linear least squares estimators is shown in Table 12. The results for varying autoregressive parameters are given in Figs. 6, 7 and Tables 13, 14, 15, 16, 17, 18. The corresponding results for varying spatial weight matrices are given in Figs. 8, 9 and Tables 19, 20, 21, 22, 23, 24.
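The following minimal sketch constructs such a circular-world weight matrix; row standardization is assumed here for illustration.

```r
## Hedged sketch of the circular-world weight matrix of Kelejian and Prucha
## (1999): each of n locations is linked to the k locations before and after
## it on a circle; rows are standardized to sum to one (an assumption here).
circular_W <- function(n, k) {
  W <- matrix(0, n, n)
  for (i in seq_len(n)) {
    nb <- ((i - 1 + c(-(k:1), 1:k)) %% n) + 1  # neighbor indices, wrapping around
    W[i, nb] <- 1
  }
  W / rowSums(W)                               # row standardization
}
W <- circular_W(n = 100, k = 5)
```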
C.1 Convergence of Model-based Gradient Boosting to Generalized Least Squares Estimator
C.2 Small Sample Behavior of Non-linear Least Squares Estimator
C.3 Threshold Choice in Deselection
Fig. 4 Boxplots of root mean squared error of prediction (RMSEP), true negative rate (TNR) and true positive rate (TPR) across different threshold values \(\tau \in \{0.005, 0.0075, 0.01, 0.025, 0.05, 0.075, 0.1, 0.125\}\) for model-based gradient boosting with first step gradient boosting with deselection and additional deselection (DS-DS) in the low-dimensional linear setting with 100 repetitions and spatial error family
Fig. 5 Boxplots of root mean squared error of prediction (RMSEP), true negative rate (TNR) and true positive rate (TPR) across different threshold values \(\tau \in \{0.005, 0.0075, 0.01, 0.025, 0.05, 0.075, 0.1, 0.125\}\) for model-based gradient boosting with first step gradient boosting with deselection and additional deselection (DS-DS) in the high-dimensional linear setting with 100 repetitions and spatial error family
C.4 Varying Spatial Autoregressive Parameters in Circular World
C.4.1 Low-dimension
Fig. 6 Estimated linear effects for the low-dimensional linear setting in the circular world with 100 repetitions, the spatial error family, and model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters \(\lambda\). Horizontal red lines represent the true values
C.4.2 High-dimension
Fig. 7 Estimated linear effects for the high-dimensional linear setting in the circular world with 100 repetitions, the spatial error family, and model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different spatial autoregressive parameters \(\lambda\). Horizontal red lines represent the true values
C.5 Varying Spatial Weight Matrices in Circular World
C.5.1 Low-dimension
Fig. 8 Estimated linear effects for the low-dimensional linear setting in the circular world with 100 repetitions, the spatial error family, and model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different numbers of connected locations k. Horizontal red lines represent the true values
C.5.2 High-dimension
Fig. 9 Estimated linear effects for the high-dimensional linear setting in the circular world with 100 repetitions, the spatial error family, and model-based gradient boosting with first step gradient boosting with deselection (DS-GB) across different numbers of connected locations k. Horizontal red lines represent the true values
D Further Case Studies
D.1 Boston Housing Prices
In this section, the Boston housing prices data set (Harrison Jr and Rubinfeld 1978), which is accessible through the spData package in R (Bivand et al. 2025), is revisited. The data set contains 506 observations and 20 explanatory variables. Descriptions and summary statistics for the key variables are provided in Tables 25 and 26. The target variable is the corrected median value of owner-occupied homes in thousands of USD (CMEDV). To model CMEDV, a spatial regression model with autoregressive disturbances, namely the SDEM, is employed. Figure 10 illustrates notable spatial clustering in CMEDV, justifying the inclusion of spatial autocorrelation. For comparability with the simulation study in Section 3, estimation results are reported for QML, GMM, LS-GB, GB-GB, DS-GB and DS-DS. The optimal stopping iteration \(m_{\text {opt}}\) for the model-based gradient boosting algorithms is chosen via 10-fold spatial cross-validation according to Brenning (2012) while the learning rate is set to \(s = 0.1\). Table 27 reports the coefficient estimates for all considered approaches. The spatial autoregressive parameter \(\lambda\) is estimated to be lower by GMM than by QML but tends to be larger under the model-based gradient boosting algorithms, depending on the first-step estimation strategy. All methods confirm positive spatial autocorrelation, as evidenced by \(\lambda\) values near 0.6. Interestingly, the intercept is positive under QML and GMM but turns negative under the model-based gradient boosting algorithms. Regarding variable selection, LS-GB excludes the INDUS, NOX and RAD variables along with most spatial lags of exogenous variables. GB-GB and DS-GB yield nearly identical coefficient estimates, while most spatial lags of exogenous variables besides \(\varvec{W}RM\) are not selected. DS-DS further excludes the ZN, INDUS and AGE variables, resulting in a parsimonious model with only seven variables compared to the original full model with 24 variables. The algebraic signs and general direction of the coefficients remain stable across methods, while effect sizes vary, reflecting different degrees of shrinkage in the model-based gradient boosting algorithms. These findings are in line with prior studies analyzing the Boston housing data set (Harrison Jr and Rubinfeld 1978).
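For readers who wish to reproduce the setup, the data can be prepared roughly as follows. The sphere-of-influence neighbour structure boston.soi shipped with spData is assumed here purely for illustration, as this summary does not fix a particular neighbour definition.

```r
## Hedged sketch of the data preparation for the Boston case study;
## the boston.soi neighbour structure is an assumed choice.
library(spData)   # Boston housing data (Bivand et al. 2025)
library(spdep)    # neighbour list handling
data(boston)                           # loads boston.c (data) and boston.soi (neighbours)
W <- nb2mat(boston.soi, style = "W")   # row-standardized spatial weight matrix
y <- boston.c$CMEDV                    # corrected median home values in 1000s of USD
```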
D.2 Columbus Crime Rate
In this section, the Columbus crime rate data set (Anselin 1988), which is available via the spData package in R (Bivand et al. 2025), is revisited. The data set consists of 49 neighborhoods and includes 22 explanatory variables. Summary statistics and variable descriptions are provided in Tables 28 and 29. The dependent variable of interest is the number of residential burglaries and vehicle thefts per thousand households (CRIME). Motivated by the clear spatial clustering of crime rates observed in Fig. 11, the SDEM is employed to account for spatial dependencies. To ensure comparability with the simulation study in Section 3, the estimation results are reported for QML, GMM, LS-GB, GB-GB, DS-GB and DS-DS. For model-based gradient boosting, the optimal stopping iteration \(m_{\text {opt}}\) is selected via 10-fold spatial cross-validation as suggested in Brenning (2012) while the learning rate is set to \(s = 0.1\). Table 30 displays the coefficient estimates. QML and GMM yield near-zero estimates of the spatial autoregressive parameter \(\lambda\), suggesting weak spatial dependence in the disturbances. In contrast, the model-based gradient boosting approaches yield larger values of \(\hat{\lambda }\). This difference again highlights a key characteristic of model-based gradient boosting which is also observed in the simulation study: if non-informative variables are not properly excluded, they can distort the estimation of the spatial autoregressive parameter. Model-based gradient boosting improves estimation by iteratively selecting only informative variables, leading to more accurate modeling of the disturbances and higher values of \(\hat{\lambda }\). In terms of variable selection, all model-based gradient boosting algorithms yield a sparse model including only five variables: household income (INC), housing value (HOVAL), distance to the central business district (DISCBD), plumbing (PLUMB) and plumbing in neighboring locations (\(\varvec{W}\text {PLUMB}\)). The signs and directional effects of the coefficients remain identical across all considered methods, although their magnitudes vary due to shrinkage or exclusion. Notably, most model-based gradient boosting approaches exclude almost all spatial lags of exogenous variables. Overall, the results illustrate the strength of model-based gradient boosting in identifying a parsimonious model in a data-driven fashion without reliance on manual variable selection. The findings align with prior analyses of the data set (Anselin 1988) and reinforce the practical usability of model-based gradient boosting for spatial regression models with autoregressive disturbances.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Balzer, M. Gradient Boosting for Spatial Regression Models with Autoregressive Disturbances. Netw Spat Econ (2025). https://doi.org/10.1007/s11067-025-09717-8