By Inge van Manen (the Netherlands)
Conventional soil maps have long been in use. They are usually created from a field survey by a soil surveyor, who has to rely on expert knowledge to turn the observations into a map. Over the past decades, the growth in computing power and computer usage has made many new research and calculation methods possible, such as machine learning. In parallel, the availability of environmental covariate data has greatly increased. From these developments, digital soil mapping has emerged, and machine learning is now regularly used to create digital soil maps. Although it has proven its value in recent years, little work has been done on finding the optimal sampling design for it. Much work exists on sampling design optimization for kriging and other traditional methods of soil property prediction, but optimal sampling designs for machine learning methods have not yet been investigated. Such research is needed because optimizing the sampling design can reduce the costs of taking and analysing (too many) samples. It also reduces model computation time, because the optimal number of points for calibrating the random forest model can be selected.

The objective of this thesis research was to assess and optimise sampling designs for mapping soil properties with machine learning methods, and specifically with random forest. Three sampling designs were chosen and compared: a simple random sample, a conditional Latin hypercube sample (cLHS) and a spatial coverage sample. In addition, it was tested whether spatial simulated annealing improves the predictions relative to these designs. For each sampling design, three samples were taken at each of four sample sizes. To compare the different sampling designs, criteria are needed on which to judge them.
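Two of these designs can be sketched in a few lines. The sketch below is illustrative only: it assumes a regular grid of candidate locations, and it approximates spatial coverage sampling by k-means clustering of the grid coordinates (one common way such samples are constructed); the grid size and sample size are arbitrary choices, not the ones used in this thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Candidate locations: centres of a 100 x 100 grid of cells.
xx, yy = np.meshgrid(np.arange(100), np.arange(100))
grid = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

n = 200  # sample size (the smallest size mentioned above)

# Simple random sample: draw n grid cells without replacement.
srs = grid[rng.choice(len(grid), size=n, replace=False)]

# Spatial coverage sample (approximation): partition the area into
# n compact strata with k-means and use the stratum centroids.
km = KMeans(n_clusters=n, n_init=1, random_state=42).fit(grid)
spatial_coverage = km.cluster_centers_

print(srs.shape, spatial_coverage.shape)  # (200, 2) (200, 2)
```

A cLHS sample would additionally require the covariate values at each grid cell, since it stratifies on the marginal distributions of the covariates rather than on geographic space.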
In this research, the designs were compared using quantile regression forest to create a prediction interval, and by external validation with the RMSE, ME and CCC. In addition, all samples were checked with a spatial plot and a plot of Ripley's K, and the cLHS samples were checked against the distribution of the covariates. It can be concluded that for a small sample size (200 points), a spatial coverage sample is the best option. As the sample size increases, spatial coverage sampling still performs as well as at the small sample size, but of the other designs simple random sampling then performs best, so it may be the better choice, especially when computation time is taken into account. Spatial simulated annealing did improve the predictions considerably, but it is time consuming, computationally intensive and only possible if a dataset is already available.
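The three external validation statistics can be computed as follows. This is a minimal sketch with made-up toy data; note that the sign convention for the ME (predicted minus observed, or the reverse) varies between studies, so the one used here is an assumption.

```python
import numpy as np

def rmse(obs, pred):
    # Root mean squared error: overall prediction accuracy.
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def me(obs, pred):
    # Mean error: systematic bias, here defined as the mean of
    # predicted minus observed (ideally close to zero).
    return float(np.mean(pred - obs))

def ccc(obs, pred):
    # Lin's concordance correlation coefficient: agreement of the
    # predictions with the 1:1 line (1 = perfect concordance).
    mo, mp = obs.mean(), pred.mean()
    vo, vp = obs.var(), pred.var()
    cov = np.mean((obs - mo) * (pred - mp))
    return float(2 * cov / (vo + vp + (mo - mp) ** 2))

# Toy example: four validation points.
obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 1.9, 3.2, 3.8])
print(rmse(obs, pred), me(obs, pred), ccc(obs, pred))
```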