Analyzing the effectiveness of semi-supervised learning approaches for opinion spam classification

Ligthart, Alexander; Catal, Cagatay; Tekinerdogan, Bedir


Opinion spam detection is concerned with identifying fake reviews that are deliberately placed to either promote or discredit a product. Opinionated social media content, such as product reviews, is an increasingly important resource for both people and businesses in decision-making, yet it can be easily manipulated by opportunistic individuals. To reduce the growing impact of opinion spam, detection approaches have been proposed, most of which adopt supervised classification methods. In practice, however, the available data is largely unlabeled, and semi-supervised learning approaches are required instead. To this end, this study analyzes the effectiveness of several semi-supervised learning approaches for opinion spam classification. Four semi-supervised methods are evaluated on a dataset of genuine and deceptive hotel reviews. The results are compared with several traditional classification methods trained on the same amount of labeled data. In this study, the self-training algorithm with Naive Bayes as the base classifier yields 93% accuracy. The results show that self-training is the only one of the four tested semi-supervised models that outperforms traditional supervised classification models when limited data is available. The study further shows that self-training can reduce labeling effort while retaining high model performance, which is useful in scenarios where limited data is available or obtaining labeled data is costly.
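To make the self-training idea concrete, the following is a minimal Python sketch under stated assumptions, not the implementation used in the study. The abstract does not specify the feature representation, confidence threshold, or stopping rule, so the bag-of-words tokenizer, the `threshold` and `max_rounds` parameters, the names `NaiveBayes` and `self_train`, and the toy reviews are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokens as bag-of-words features.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.n_docs = len(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, doc):
        log_scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / self.n_docs)
            for tok in tokenize(doc):
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((self.counts[c][tok] + 1) / total)
            log_scores[c] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=10):
    """Repeatedly pseudo-label confident unlabeled documents and refit."""
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    model = NaiveBayes().fit(docs, labels)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for doc in pool:
            proba = model.predict_proba(doc)
            label = max(proba, key=proba.get)
            if proba[label] >= threshold:
                # Confident prediction: treat it as a training label.
                docs.append(doc)
                labels.append(label)
                added += 1
            else:
                remaining.append(doc)
        if added == 0:
            break  # nothing confident enough; stop early
        pool = remaining
        model = NaiveBayes().fit(docs, labels)
    return model


# Hypothetical toy reviews, not taken from the study's dataset.
labeled = [
    ("room was small but clean", "genuine"),
    ("best hotel ever amazing luxury", "deceptive"),
]
unlabeled = [
    "clean room small bathroom",
    "amazing luxury hotel best stay ever",
]
model = self_train(labeled, unlabeled, threshold=0.8)
```

The threshold controls the trade-off at the heart of self-training: a high value adds few but reliable pseudo-labels, while a low value grows the training set faster at the risk of reinforcing early mistakes.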