
Colloquium
Garbage in garbage out: Comparison of training data characteristics for different landscapes and machine learning algorithms in land cover mapping
By Joran de Lange
Abstract
Land cover maps are primarily produced with machine learning algorithms, which are trained on training data. Training data can differ in characteristics such as its size or the design used to select the sampling locations. This thesis compared the impact of various training data characteristics on land cover classification. The characteristics studied were the sample size, the sampling design, mislabelling of the training data, and homogeneity of the underlying area. Two machine learning models were used for the classification: random forest and a single-layer perceptron. The classified regions were located in western Africa and the south of Madagascar. They were chosen for their differing levels of landscape homogeneity and because accurate land cover maps were available for both, which were used as reference and validation data. The research showed that all four training data characteristics (sample size, sampling design, mislabelling of the training data, and homogeneity of the training area) affected the prediction accuracy. This effect was observed for both areas and both classification models. The sampling design had the largest impact on classification accuracy, while increasing the sample size increased the overall prediction accuracy. The prediction accuracy of individual classes depended largely on the sample size and the continuity of the class. This thesis emphasizes the importance of taking training data characteristics into account when producing land cover maps with supervised classification.