# Sampling design optimization for geostatistical modelling and prediction

Wadoux, Alexandre M.J.C.

## Summary

Space-time monitoring and prediction of environmental variables require measurements of the environment. But environmental variables cannot be measured everywhere and all the time. Scientists can only collect a fragment, a sample of the property of interest in space and time, with the objective of using this sample to infer the property at unvisited locations and times. Sampling can be costly and time-consuming. Consequently, we need efficient strategies to select an optimal design for mapping.

Most studies on sampling design optimization consider the case of predictive mapping using geostatistics. In recent years geostatistical models and associated mapping techniques have advanced, which calls for adaptation of the associated sampling designs. The main objective of this thesis is to address optimal sampling design for four recent advances in mapping.

Chapter 3 explores sampling design optimization for the non-stationary variance geostatistical model defined in Chapter 2. Accounting for non-stationarity in the variance of environmental properties in complex landscapes leads to better quantification of the mapping uncertainty. This is applied in a case study mapping daily rainfall in the north of England and optimizing the rain gauge network for mapping. Rainfall prediction benefits from a model that includes non-stationarity in the mean and variance, as shown by the likelihood and Akaike Information Criterion statistics. The optimization of the rain gauge network is achieved by spatial simulated annealing. The optimized rain gauge network slightly improves the rainfall mapping accuracy. The accuracy gain is limited because I used a static design for all time steps, while the areas with larger prediction uncertainty vary from day to day. The optimized design also shows a specific spatial pattern, with a fairly uniform spatial distribution but an increased density in areas where the residual variance is large. I further test an optimized design in which the total number of rain gauges is reduced by 10%. The optimized design shows a significant improvement over the original design using all rain gauges. I conclude that 10% of the rain gauges may be removed (e.g. to save costs) without loss of mapping accuracy, provided that the remaining rain gauges are placed optimally.
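To make the optimization step concrete, here is a minimal sketch of spatial simulated annealing. It is an illustration under stated assumptions, not the implementation used in the thesis: it assumes a stationary exponential covariance with made-up sill and range values (the thesis uses a non-stationary variance model), a unit-square study area, the mean simple-kriging variance as the criterion, and a geometric cooling schedule. All function and parameter names are hypothetical.

```python
import numpy as np

def exp_cov(d, sill=1.0, range_par=0.3):
    """Exponential covariance; sill and range_par are assumed values."""
    return sill * np.exp(-d / range_par)

def pairwise_dist(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def mean_kriging_variance(design, grid, nugget=1e-6):
    """Criterion: mean simple-kriging variance over the prediction grid."""
    C = exp_cov(pairwise_dist(design, design)) + nugget * np.eye(len(design))
    c0 = exp_cov(pairwise_dist(design, grid))   # shape (n_design, n_grid)
    weights = np.linalg.solve(C, c0)            # kriging weights per grid node
    variance = exp_cov(0.0) - np.sum(c0 * weights, axis=0)
    return float(variance.mean())

def spatial_simulated_annealing(design, grid, n_iter=500, t0=0.01,
                                cooling=0.99, max_shift=0.3, seed=0):
    """Perturb one location at a time; accept worse designs with a
    probability that shrinks as the temperature cools."""
    rng = np.random.default_rng(seed)
    current = design.copy()
    f_cur = mean_kriging_variance(current, grid)
    best, f_best = current.copy(), f_cur
    temp = t0
    for k in range(n_iter):
        shift = max_shift * (1.0 - k / n_iter)  # shrink the search radius
        cand = current.copy()
        i = rng.integers(len(cand))
        move = rng.uniform(-shift, shift, size=2)
        cand[i] = np.clip(cand[i] + move, 0.0, 1.0)  # stay in the unit square
        f_cand = mean_kriging_variance(cand, grid)
        # Metropolis acceptance rule
        if f_cand < f_cur or rng.random() < np.exp((f_cur - f_cand) / temp):
            current, f_cur = cand, f_cand
            if f_cur < f_best:
                best, f_best = current.copy(), f_cur
        temp *= cooling
    return best, f_best

# Example: 8 gauges that start clustered in one corner of the area.
gx, gy = np.meshgrid(np.linspace(0, 1, 10), np.linspace(0, 1, 10))
grid = np.column_stack([gx.ravel(), gy.ravel()])
start = np.random.default_rng(1).uniform(0.0, 0.3, size=(8, 2))
best, f_best = spatial_simulated_annealing(start, grid)
print(mean_kriging_variance(start, grid), f_best)
```

With a non-stationary variance model, the covariance function would depend on location, which is what would pull the optimized locations towards areas of large residual variance.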

Chapter 4 investigates the use of simple sampling strategies to account for a criterion that encompasses both prediction error variance and variogram parameter uncertainty in geostatistical mapping of soil properties. I test two sampling designs, spatial coverage and spatial coverage supplemented by a subset of close-pair units, and compare these to a design optimized for this criterion. I show that a spatial coverage design performs poorly for mapping using ordinary kriging because of the lack of information at short distances to estimate the variogram parameters. This holds for a range of estimated variogram parameters of a Matérn function. An optimized design always performs slightly better, but has several disadvantages. For example, it requires the variogram parameters to be known. It also involves defining an objective function characterizing the total error, and minimizing this error using optimization algorithms. In contrast, a spatial coverage design supplemented by a subset of close-pair units offers accurate results for most variograms tested. I therefore recommend using the latter design for a geostatistical survey, unless prior knowledge of the variogram is available (e.g. an average variogram). If an average variogram is available for the property of interest, it can be used to optimize the design. I further test the minimum number of units required to estimate the variogram of a geostatistical survey, and show that it strongly depends on the degree of spatial correlation of the target variable. For large values of the variogram effective range and small nugget-to-sill ratios, only 15 units are enough to make geostatistical analysis worthwhile, i.e. more accurate than a design-based estimate.
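The recommended design can be sketched in a few lines. The example below is a hedged illustration, not the procedure of the thesis: it assumes that spatial coverage is approximated by k-means clustering of a fine grid (Lloyd's algorithm, with cluster centroids as sampling units), and that each close-pair unit is placed at a fixed, user-chosen separation in a random direction from a randomly selected coverage unit. The function names, the separation distance and the number of pairs are hypothetical.

```python
import numpy as np

def spatial_coverage(grid, n_units, n_iter=50, seed=0):
    """Approximate a spatial coverage design with k-means on grid nodes."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(grid), size=n_units, replace=False)
    centres = grid[start].astype(float)
    for _ in range(n_iter):
        d = np.linalg.norm(grid[:, None, :] - centres[None, :, :], axis=2)
        label = d.argmin(axis=1)             # nearest centre for each node
        for k in range(n_units):
            members = grid[label == k]
            if len(members) > 0:             # keep the old centre if empty
                centres[k] = members.mean(axis=0)
    return centres

def add_close_pairs(units, n_pairs, separation, seed=0):
    """Add n_pairs units, each at distance `separation` from a randomly
    chosen existing unit, to inform the variogram at short lags."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(units), size=n_pairs, replace=False)
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n_pairs)
    offset = separation * np.column_stack([np.cos(angle), np.sin(angle)])
    return np.vstack([units, units[idx] + offset])

# Example on the unit square: 40 coverage units plus 10 close-pair units.
gx, gy = np.meshgrid(np.linspace(0, 1, 30), np.linspace(0, 1, 30))
grid = np.column_stack([gx.ravel(), gy.ravel()])
coverage = spatial_coverage(grid, n_units=40)
design = add_close_pairs(coverage, n_pairs=10, separation=0.02)
print(design.shape)
```

This sketch does not check whether a close-pair unit falls outside the study area; a real survey design would need to handle that, for example by redrawing the direction.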

Mapping is not always performed using geostatistical methods. There is growing interest in mapping using data-driven, non-linear machine learning techniques. The objective of Chapter 5 is to extend our knowledge on sampling optimization for mapping using random forest, and to compare it to conventional sampling designs. I tested the methodology in a potential application scenario, mapping topsoil organic carbon at European scale using measurements of the LUCAS dataset as population of interest. I demonstrate that an optimized design is always more accurate than other common designs, but that it can only be obtained when subsampling an existing dataset with known values of the soil property at all locations. By comparing the mean squared error (MSE) of the maps obtained by an optimized design with those obtained by common designs, it is shown that optimizing a design in terms of MSE is not always worthwhile. When the sample size increases, the maps produced by the different designs converge to similar accuracy values. In a case study on large-scale soil organic carbon mapping, a sampling density greater than 1 sampling unit per 4000 km² markedly decreases the difference in terms of average MSE between designs. A design optimized for the mean squared shortest standardized distance in the feature space has the closest match with the optimized design in terms of MSE. By analysing the distribution of the sampling locations in both geographic and feature space, I further show that the optimized design is not spread in the geographic space, but seems to be spread somewhat uniformly in the feature space, and especially in the most important covariates of the machine learning model. It is however difficult to draw further conclusions because of the complex spread of the units in feature space. Further research is needed in this direction.
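The feature-space criterion mentioned above can be written down compactly. The sketch below assumes that "standardized" means scaling each covariate to zero mean and unit standard deviation over the population, and that the criterion averages, over all population units, the squared distance to the nearest sampled unit in that standardized covariate space. The function name `msssd` and the random covariates are hypothetical; this is not the code used in the thesis.

```python
import numpy as np

def msssd(covariates, sample_idx):
    """Mean squared shortest standardized distance in feature space.

    covariates: (N, p) array of covariate values for all population units.
    sample_idx: indices of the units selected for the sample.
    """
    # Assumed standardization: zero mean, unit standard deviation.
    z = (covariates - covariates.mean(axis=0)) / covariates.std(axis=0)
    zs = z[sample_idx]
    # Squared distance from every population unit to every sampled unit.
    d2 = ((z[:, None, :] - zs[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).mean())

# Example: compare two random samples of 20 units from 500 units.
rng = np.random.default_rng(0)
covariates = rng.normal(size=(500, 4))
sample_a = rng.choice(500, size=20, replace=False)
sample_b = rng.choice(500, size=20, replace=False)
print(msssd(covariates, sample_a), msssd(covariates, sample_b))
```

A smaller value means the sample covers the feature space more evenly. The full distance array has N × n × p entries, so for a population the size of a continental dataset a nearest-neighbour search structure would be a more practical choice. A covariate with zero variance would need to be dropped before standardizing.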

Sampling design optimization becomes more complex when the ultimate goal is to provide a map used as input for a model whose output is the main interest. Chapter 6 addresses this case by integrating geostatistics for mapping rainfall with Bayesian calibration of a hydrological model for predicting discharge. The Bayesian calibration makes it possible to capture model input, initial state, parameter and structural uncertainty, while also taking uncertainties in the output measurements into account. In a case study predicting river discharge using a rainfall-runoff model and maps of rainfall as input, a single rain gauge is sufficient to obtain accurate model parameter calibration and discharge predictions. Adding up to five rain gauges improves the model prediction. Adding even more only produces a marginal improvement of the prediction accuracy. Calibrating the rainfall time series as additional parameters leads to more accurate model performance compared to the case where rainfall uncertainty is not updated using discharge measurements. Furthermore, it is demonstrated for the case study that model parameter uncertainty is the main contributor to the posterior discharge uncertainty and that input uncertainty has a relatively small contribution. However, the study also shows that Bayesian calibration of rainfall has serious computational disadvantages. In particular, calibrating a large number of rainfall input parameters remains a serious challenge.
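The idea of calibrating rainfall together with the model parameters can be illustrated with a toy example. Everything in the sketch below is an assumption made for illustration and not the model or sampler of the thesis: a one-parameter linear reservoir stands in for the rainfall-runoff model, a single multiplicative rainfall factor stands in for the full rainfall time series, discharge errors are Gaussian with known standard deviation, and the posterior is sampled with a plain random-walk Metropolis algorithm.

```python
import numpy as np

def simulate(rain, k, s0=0.0):
    """Toy linear reservoir: each step, a fraction k of storage runs off."""
    storage, discharge = s0, []
    for p in rain:
        storage += p
        out = k * storage
        storage -= out
        discharge.append(out)
    return np.array(discharge)

def log_posterior(theta, rain_obs, q_obs, sigma_q=0.2, sd_log_m=0.2):
    """theta = (k, log_m): reservoir constant and log rainfall multiplier."""
    k, log_m = theta
    if not 0.0 < k < 1.0:                   # uniform prior on k in (0, 1)
        return -np.inf
    q_sim = simulate(np.exp(log_m) * rain_obs, k)
    log_lik = -0.5 * np.sum((q_obs - q_sim) ** 2) / sigma_q ** 2
    log_prior = -0.5 * (log_m / sd_log_m) ** 2   # normal prior on log_m
    return log_lik + log_prior

def metropolis(log_post, theta0, n_iter=5000, step=0.01, seed=0):
    """Random-walk Metropolis sampler with Gaussian proposals."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_iter, len(theta)))
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal(len(theta))
        lp_prop = log_post(prop)
        if np.log(rng.random()) < lp_prop - lp:  # accept or reject
            theta, lp = prop, lp_prop
        chain[i] = theta
    return chain

# Synthetic example: true k = 0.3, rainfall observed without bias.
rng = np.random.default_rng(42)
rain = rng.gamma(shape=0.5, scale=4.0, size=60)
q_obs = simulate(rain, k=0.3) + rng.normal(0.0, 0.2, size=60)
chain = metropolis(lambda th: log_posterior(th, rain, q_obs), [0.5, 0.0])
posterior = chain[1000:]                         # discard burn-in
print(posterior.mean(axis=0), posterior.std(axis=0))
```

Here the rainfall input adds only one parameter. Treating every time step of the rainfall series as its own parameter, as described above, makes the parameter space grow with the length of the series, which is where the computational difficulty arises.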

The thesis synthesis is given in Chapter 7. It discusses the findings of this thesis, compares these with existing literature, gives directions for future research and provides a personal reflection on sampling design optimization practices. On the basis of this thesis, I conclude that *there is no single best optimal design*. The best design is very much case-dependent. It depends, among other things, on: (i) the assumed model of spatial variation, (ii) whether or not the model parameters need to be estimated from the data, and (iii) the criterion that is used to optimize the sampling configuration. This thesis shows that the choice of the criterion has a serious impact on the optimized design. In practice we may not know the three elements listed above. This is typically the case at the start of a project, when no previous data or expertise are available but a survey still needs to be designed. In this case, it is sensible to use some rules of thumb to design a survey for mapping. Chapter 7 provides some basis for this.

This thesis takes a step towards the derivation of optimal designs for novel mapping techniques, with case studies on mapping soil and hydrological variables. But it also shows that we are just at the beginning of this specific field of science. In recent years, there has been a large increase in the complexity of techniques and models used for mapping. We make more use of spatially explicit covariate information, such as remote sensing imagery, and measurements are increasingly inferred rather than measured. Mapping techniques have become more data-driven and non-linear, *de facto* increasing the complexity of the sampling designs that should accompany such developments. Because sampling is the basis of mapping and has a large impact on cost and accuracy, this research field will remain as important as ever in geostatistics and spatial modelling.