
# Structural equation modelling for digital soil mapping

Angelini, Marcos E.

## Summary

Climate change and land degradation are of increasing societal and governmental concern. For this reason, several international programmes have been initiated in the last decade, such as the 4 per 1000 initiative and the Sustainable Development Goals of the United Nations. The soil science community is actively working under different national and international organizations to provide regional and global soil information to support these programmes. Digital Soil Mapping (DSM), a relatively new methodology to create soil maps based on (geo)statistical methods, has become operational during the last fifteen years and has now been adopted by several organizations. It is defined as the computer-assisted production of digital maps of soil type and soil properties, by use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables.

Most studies in DSM spatially predict soil properties or classes from either new or legacy laboratory data and spatially exhaustive environmental covariates (GIS layers of biophysical land surface properties), typically using empirical statistical methods. These methods have been shown to produce accurate maps at different scales, but they do not provide knowledge about the interrelationships between the soil properties and the functioning of the soil and soil-landscape system. We not only need to properly describe or map soil spatial variation, but also to understand soil behaviour. This is needed to answer questions such as: which are the dominant soil processes in a certain region? How will the soil react under increased productivity pressure? How vulnerable is the soil to erosion or pollution? How much organic carbon can we store in the soil at a given location?

Mechanistic soil-landscape models do include process knowledge but cannot be applied easily for soil mapping because of their high complexity and large uncertainty. A solution could be to use structural equation modelling (SEM), which is a hybrid approach that combines elements of empirical and mechanistic models. SEM can model continuous soil properties while taking soil property interrelationships into account. In SEM, we first create a conceptual model, similar to the mental model of soil surveyors, which is converted into a graphical model that represents the system interrelationships. This is the mechanistic side of SEM. The empirical side takes place after we translate the graphical model into a mathematical model, which is calibrated with observational data to estimate the model coefficients. Next, the calibrated model can be used to predict target variables, such as soil properties. These characteristics of SEM indicate that it could be a very useful technique to bridge the gap between empirical and mechanistic approaches for DSM. Thus, the objective of this thesis is to extend DSM with soil process information through the development, calibration, application and validation of a structural equation model.
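
The step from graphical model to prediction can be illustrated with a minimal sketch. It assumes a recursive (acyclic) path model with standardised variables and coefficients that have already been calibrated. The variable names (`elevation`, `ndvi`, `clay`, `oc`, `cec`) and all coefficient values are hypothetical and are not taken from the thesis, which used the lavaan package in R.

```python
# Graphical model written as a mapping: each target variable lists its
# direct causes and the corresponding path coefficients.
# All names and numbers are hypothetical, for illustration only.
PATHS = {
    "clay": {"elevation": -0.4},
    "oc": {"ndvi": 0.6, "clay": 0.3},
    "cec": {"clay": 0.5, "oc": 0.4},
}


def predict(paths, covariates):
    """Propagate covariate values through a recursive (acyclic) path model.

    A target is computed as soon as all of its causes are known, so soil
    properties that depend on other soil properties are resolved in order.
    """
    values = dict(covariates)
    pending = dict(paths)
    while pending:
        ready = [t for t, causes in pending.items()
                 if all(c in values for c in causes)]
        if not ready:
            raise ValueError("model has a cycle or a missing covariate")
        for target in ready:
            causes = pending.pop(target)
            values[target] = sum(coef * values[c] for c, coef in causes.items())
    return {t: values[t] for t in paths}


if __name__ == "__main__":
    # One prediction location with standardised covariate values.
    print(predict(PATHS, {"elevation": 1.0, "ndvi": 1.0}))
```

Because `cec` is computed from the predicted `clay` and `oc`, the predicted properties stay tied together by the same links that appear in the graphical model, which is the behaviour the hybrid approach is meant to provide.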

After a general introduction to this thesis in Chapter 1, Chapter 2 describes how SEM can be implemented for DSM. In this chapter I argue that current DSM methods have limitations. For instance, it is difficult to predict a large number of soil properties simultaneously while preserving the relationships between them. Furthermore, the prediction models that are currently widely applied use pedological knowledge only in a very crude way. To address these issues in DSM, I investigated the use of SEM. I introduced SEM theory and presented a case study in a 23 000 km² region in the Argentinian Pampas, where I applied SEM to map seven key soil properties for the A horizon. I started by identifying the main soil-forming processes in the study area and determined for each process the main soil properties affected. Based on this analysis I defined a conceptual soil-landscape model, which was subsequently converted to a SEM graphical model. The graphical model was translated to a mathematical model in the statistical software R using the latent variable analysis (lavaan) package. The prediction accuracy was poor, which was caused by a large measurement error in combination with a homogeneous study area. Nevertheless, the outcomes demonstrated that SEM can be used to explicitly include pedological knowledge in the prediction of soil properties and the modelling of their interrelationships.

In Chapter 3 I explored the capabilities of SEM for three-dimensional soil mapping. Since many soil processes operate within the soil profile, SEM might be suitable for simultaneous prediction of soil properties for multiple soil layers. The objectives of this chapter therefore were to i) apply SEM to multi-layer and multivariate soil mapping, ii) test SEM functionality for improving model performance by using model suggestions and iii) assess whether SEM reproduced the soil property covariation better than a multiple linear regression (MLR) model. I applied SEM to model and predict the lateral and vertical distribution of the cation exchange capacity (CEC), organic carbon (OC) and clay content for the A, B and C horizons for the study area of Chapter 2. I found that SEM reproduces the interrelationships between soil properties more accurately than MLR and that the model suggestions helped to improve the fit of the model. I concluded that SEM can be used to predict several soil properties for multiple layers simultaneously while retaining soil property interrelationships.
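
One way to see why a structural equation model can retain covariation among soil properties is through its model-implied covariance. The following is a sketch in conventional path-model notation for SEM without latent variables. The symbols are an assumption for illustration and are not necessarily those used in the thesis. Let $\mathbf{y}$ hold the soil properties (for example clay, OC and CEC per horizon), $\mathbf{x}$ the covariates, $\mathbf{B}$ the paths among soil properties, $\boldsymbol{\Gamma}$ the paths from covariates to soil properties, $\boldsymbol{\Phi} = \operatorname{Cov}(\mathbf{x})$ and $\boldsymbol{\Psi} = \operatorname{Cov}(\boldsymbol{\zeta})$, with $\mathbf{x}$ and $\boldsymbol{\zeta}$ uncorrelated:

$$
\mathbf{y} = \mathbf{B}\mathbf{y} + \boldsymbol{\Gamma}\mathbf{x} + \boldsymbol{\zeta},
\qquad
\operatorname{Cov}(\mathbf{y}) = (\mathbf{I}-\mathbf{B})^{-1}\left(\boldsymbol{\Gamma}\boldsymbol{\Phi}\boldsymbol{\Gamma}^{\mathsf{T}} + \boldsymbol{\Psi}\right)(\mathbf{I}-\mathbf{B})^{-\mathsf{T}}
$$

When each property is instead fitted with its own regression on the covariates, there is no $\mathbf{B}$, so covariation among the predicted properties can only arise through shared covariates.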

Given that SEM is a hybrid between mechanistic and empirical models, I hypothesised in Chapter 4 that SEM should have better extrapolation properties than a purely empirical model. I therefore investigated the extrapolation of an SE model from one region to another region with similar conditions. Empirical models have been used in DSM for extrapolation with varying success. The objective of this chapter was to investigate the extrapolation capability of SEM by testing and comparing six different model settings for extrapolation. I applied the structural equation model from Chapter 3 to a similar soil-landscape in the Great Plains of the United States to predict clay, OC and CEC for the same three major horizons A, B and C.

I evaluated the performance of the SE mathematical model extrapolation, as well as the graphical model extrapolation (without coefficients). I defined four SE models that differed in the degree to which they were tailored to the US case study. I started with extrapolating the Argentinian SE mathematical model and ended with an extrapolated graphical model that was fitted and adapted on the basis of model suggestions using the US data. I also evaluated two more models using MLR to assess whether SEM performed better than purely empirical models.

The Argentinian SE mathematical model gave the worst results in the US, while the extrapolated graphical model that was fitted with US data and adapted based on model suggestions performed best. Interestingly, I found that an SE graphical model based only on the conceptual model performed better than a more precise Argentinian SE graphical model. For this reason, I concluded that the adaptation of the conceptual model to a specific study area can improve local prediction but harm the potential predictive power for extrapolation. The prediction performance of the SE mathematical model was not substantially better than MLR. However, system relationships that were well supported by pedological knowledge showed consistent and equal behaviour in both study areas. In contrast, differences between the two areas in the sign and strength of the relationships between covariates and soil properties reduced the performance of the mathematical model extrapolation. Thus, I concluded that knowledge-based links between system variables are more effective than data-driven links for model extrapolation. In addition, a deeper understanding of indicators of soil-forming factors could strengthen conceptual models for DSM.

Spatial correlation is an important feature in spatial analysis, especially in DSM. Current implementations of SEM do not yet take spatial correlation in the data into account. The objective of Chapter 5 therefore was to extend SEM by accounting for residual spatial correlation using a geostatistical approach. I presented how the SE model definition and parameter estimation can be generalised to the spatially correlated case. The spatial SE model was applied to map the same soil properties as in Chapter 4 in the Great Plains study area. The SE model residuals showed substantial spatial correlation, which suggests that including spatial correlation yields more accurate predictions. I also compared spatial SEM with standard SEM in terms of the model coefficients. There were significant differences, although none of the coefficients changed sign. The presence of residual spatial correlation suggests that some of the causal factors that explain soil variation were not captured by the set of covariates. In such a case it is worthwhile to search for and include additional covariates, leaving only unstructured residual noise, but as long as this is not achieved, it pays off to include residual spatial correlation in soil mapping using SEM.
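
Residual spatial correlation of this kind is commonly diagnosed with an empirical semivariogram of the model residuals. The sketch below is a generic geostatistical diagnostic, not the estimation procedure developed in the thesis. The function name, the coordinates and the residual values are hypothetical. It requires Python 3.8 or later for `math.dist`.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, residuals, lag_width):
    """Return {lag_index: semivariance}, pooling all point pairs by distance.

    Pairs whose separation falls in [k * lag_width, (k + 1) * lag_width)
    contribute half their squared residual difference to lag bin k.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(coords[i], coords[j])
            lag = int(distance // lag_width)
            sums[lag] += 0.5 * (residuals[i] - residuals[j]) ** 2
            counts[lag] += 1
    return {lag: sums[lag] / counts[lag] for lag in sorted(counts)}


if __name__ == "__main__":
    # Hypothetical sampling locations (km) and SE model residuals.
    coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    residuals = [1.0, 2.0, 4.0]
    print(empirical_semivariogram(coords, residuals, lag_width=1.5))
```

A semivariance that increases with lag distance before levelling off indicates spatially structured residuals. A flat semivariogram would suggest that the covariates have left only unstructured noise.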

Chapter 6 presents a synthesis of the main findings of this thesis and reflects on the possible role of SEM in DSM. It discusses some limitations of SEM, such as the assumed linearity of relationships (while in reality relationships between soil properties and environmental covariates are often non-linear) and its static nature, as well as several challenges. One of the main challenges I encountered in the case studies was the lack of good-quality proxies (covariates) to adequately represent the soil-forming processes. Since we cannot include hundreds of covariates to predict soil properties in SEM, as is done in machine-learning techniques, we need to develop proper covariates to achieve a good SE model. Nevertheless, I concluded that SEM is an appropriate technique to include pedological knowledge in DSM, and could potentially fill an important niche. Developing an SE model requires thorough work to translate a conceptual model to a graphical model. By doing so, one is able to test pedological hypotheses and learn from the data, because the model suggestions prompt new considerations and research. SEM brings added value to DSM, because the graphical model unites pedological knowledge and statistical modelling in a single framework. What is more, every model coefficient can be analysed in terms of its sign and magnitude with respect to the other coefficients, and in terms of its pedological meaning. For that reason, I think that SEM can be a method to do conscious DSM, since it helps one to become more aware of the processes behind the system interrelationships.

This thesis only introduced SEM for DSM. There are several features and possibilities of SEM that I did not research. The use of latent variables to represent conceptual variables, such as soil fertility, soil quality, etc., should be the next step in its adaptation for DSM purposes. Implementing categorical variables and non-linear relations will also bring more flexibility to the model, and can provide solutions for areas with different environmental conditions. In the face of the large demand for answers on global issues that are often addressed at national or regional scale, I am convinced that SEM is a suitable framework to meet these demands. Also, I am confident that SEM fills a vacant niche in DSM, since it does not compete with machine learning techniques and mechanistic modelling approaches, and provides a framework for conscious DSM.