By Kees Baake (the Netherlands)
Random Forests are a class of machine learning algorithms known for making predictions with low error, and Random Forest (RF) has been applied successfully in a soil modelling context. Because of their black-box nature, however, Random Forest models are difficult to interpret, and the inherent modeling and input uncertainties are difficult to quantify. Within the last ten years statisticians have discovered desirable properties of Random Forest that make the models more transparent, especially with regard to the quantification of prediction uncertainty. A literature review was carried out on the mathematical foundations of four uncertainty quantification techniques for Random Forest predictions, after which the techniques were assessed qualitatively on three main criteria: scalability, usability and statistical rigor. Two techniques, Quantile Regression Forest (QRF) and Regression Kriging (RK), were chosen as the most viable candidates, mainly because they quantify the complete uncertainty, meaning they can be used to create prediction intervals (PIs). The other reason is that both are widely available and easy to implement.
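To make the notion of a prediction interval concrete, the following Python sketch shows two generic ways of turning per-location uncertainty information into a PI at probability level p. It is an illustration under stated assumptions, not the implementation used in this study: the function names `gaussian_pi` and `quantile_pi` are hypothetical, the first assumes normally distributed prediction errors with a known variance (a common choice when a kriging variance is available), and the second takes plain empirical quantiles of a set of values per location as a simplified stand-in for the conditional distribution that QRF estimates.

```python
from statistics import NormalDist

import numpy as np


def gaussian_pi(pred, variance, p=0.9):
    """Symmetric PI around `pred`, ASSUMING normally distributed errors."""
    # Two-sided standard normal quantile, e.g. about 1.645 for p = 0.9.
    z = NormalDist().inv_cdf(0.5 + p / 2.0)
    pred = np.asarray(pred, dtype=float)
    half_width = z * np.sqrt(np.asarray(variance, dtype=float))
    return pred - half_width, pred + half_width


def quantile_pi(samples, p=0.9):
    """PI from empirical quantiles; `samples` has one row per location.

    Simplified stand-in (assumption): unweighted quantiles of the values
    associated with each prediction location.
    """
    alpha = (1.0 - p) / 2.0
    samples = np.asarray(samples, dtype=float)
    lower = np.quantile(samples, alpha, axis=1)
    upper = np.quantile(samples, 1.0 - alpha, axis=1)
    return lower, upper
```

The Gaussian interval is always symmetric around the prediction, whereas the quantile-based interval can be asymmetric because it follows the shape of the values it is computed from.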
QRF and RK were both evaluated by (1) an overall assessment in the form of accuracy plots with derived summary statistics; (2) a local assessment of the spatial dispersal of outliers that consistently fall outside the PI; and (3) the scalability of computation time. The evaluation results were averaged over 100 runs of 10-fold cross validation. A case study in eastern Australia (Edgeroi), characterized by a sampling design that mixes systematic and clustered sampling, was selected for evaluating Random Forest prediction interval estimation by QRF and RK. After preprocessing, pH and soil organic carbon content (SOC) were each modeled with a 14-covariate model (RF14) and a 4-covariate model (RF4), with covariates falling within the soil forming factor categories of location, relief, vegetation, climate and parent material. For the overall uncertainty assessment PIs at multiple probability levels were validated; for the local assessment the 0.9 probability level was investigated.
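The overall assessment can be illustrated with a short Python sketch. An accuracy plot compares, for each probability level, the expected proportion of validation observations inside the PI with the proportion actually observed; a perfectly calibrated method lies on the 1 : 1 line. The code below is a minimal sketch under assumptions, not the procedure used in this study: the names `coverage`, `accuracy_plot_points` and `mean_abs_deviation` are hypothetical, and `interval_fn` stands for any callable that returns lower and upper PI limits for a given probability level.

```python
import numpy as np


def coverage(y_obs, lower, upper):
    """Proportion of observations that fall inside [lower, upper]."""
    y_obs = np.asarray(y_obs, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean((y_obs >= lower) & (y_obs <= upper)))


def accuracy_plot_points(y_obs, interval_fn, levels):
    """Return (expected, observed) proportions, one pair per level.

    `interval_fn(p)` is a hypothetical callable returning the
    (lower, upper) PI limits for probability level p.
    """
    expected = np.asarray(levels, dtype=float)
    observed = np.array([coverage(y_obs, *interval_fn(p)) for p in levels])
    return expected, observed


def mean_abs_deviation(expected, observed):
    """Average absolute distance of the points from the 1 : 1 line."""
    return float(np.mean(np.abs(np.asarray(observed) - np.asarray(expected))))
```

In a repeated cross-validation setting, the observed proportions would be computed on the held-out folds and then averaged over folds and runs before being compared with the 1 : 1 line.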
In the overall uncertainty assessment both RK and QRF performed well with both the 4- and 14-covariate models, showing low absolute deviations (<5%) from the 1 : 1 line of the accuracy plot (observed vs expected proportion in PI). QRF was often too optimistic: most of its observed proportions lay above the 1 : 1 line (>0.90). RK was too pessimistic and lay mostly below the 1 : 1 line (>0.90). No major differences in uncertainty quantification performance were observed between the modeling of pH and SOC, although the predictive R² of the underlying Random Forest model differed greatly between the two soil response variables (e.g. 0.41 vs 0.08 for RF4). However, the local uncertainty assessment did reveal substantial differences between pH and SOC for QRF and RK: for pH, spatial outliers appeared to be clustered into regions under RK, whereas they were dispersed under QRF. For SOC, no major differences in spatial outlier dispersal were found between RK and QRF. In terms of scalability, the computation time of QRF doubled when the number of points to predict increased tenfold. In general, the width maps of the 0.9-PI showed more detail and clearer boundaries for QRF, indicating that conditioned geographical data has a large effect on the magnitude of uncertainty. Other literature on QRF in a soil science context has also shown promising results under a sparser sampling design. Thus, there are strong indications that QRF can be used as a new, flexible tool for uncertainty modeling in a spatial context.