Applying Machine Learning to Detect Outliers in Alternative Data Sources. A universal methodology framework for scanner and web-scraped data sources (ESCoE TR-12)

By Xuxin Mao, Janine Boshoff, Garry Young, Hande Küçük

9 November 2021

This research explores new ways of applying machine learning to detect outliers in alternative price data resources such as web-scraped data and scanner data sources. Based on text vectorisation and clustering methods, we build a universal methodology framework which identifies outliers in both data sources. We provide a unique way of conducting goods classification and outlier detection. Using Density based spatial clustering of applications with noise (DBSCAN), we can provide two layers of outlier detection for both scanner data and web-scraped data. For web-scraped data we provide a method to classify text information and identify clusters of products. The framework allows us to efficiently detect outliers and explore abnormal price changes that may be omitted by the current practices in line with the 2019 Consumer Prices Indices Manual 2019. Our methodology also provides a good foundation for building better measurement of consumer prices with standard time series data transformed from alternative data sources.

Download as PDF