Published online by Cambridge University Press: 04 March 2020
Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances received from home-grown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting on their own, due to potential associations with drug-related crime. The utility of two different vector text representations is examined: the simple bag of words representation and a more refined Global Vectors (GloVe) representation, which is an example of the increasingly popular word embedding approach. They are both combined with two unsupervised anomaly detection methods, based on one-class support vector machines (SVM) and based on dissimilarity to k-medoids clusters. The GloVe representation is found definitely more useful for anomaly detection, permitting better detection quality and ameliorating the curse of dimensionality issues with text clustering. The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears a more promising approach to anomaly detection in text data.