Outlier detection methodologies for alternative data sources

Summary

Consumer Price Indices (CPI) have typically been compiled using manually collected data. Web-scraped and point-of-sale scanner data are becoming more accessible and provide an excellent source of price data from which inflation can be calculated. These sources are known as “alternative” data because the information was collected for purposes outside the remit of statistical agencies and the production of official statistics. Before these data are incorporated, their nature and quality must be assessed to avoid biases in measures of inflation. One crucial issue is the detection of outliers: data points that differ significantly from other observations and can cause price indices to over- or underestimate true inflation. This project develops methods for detecting and addressing outliers in these new data sources.
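To illustrate how a single outlier can distort a measure of inflation, the following is a minimal Python sketch. It is not the project's method (the outputs below describe a machine learning approach); it assumes an unweighted geometric mean (Jevons) index and a simple median absolute deviation rule on log price relatives, and the function names, threshold and price relatives are hypothetical and chosen only for illustration.

```python
import math
from statistics import median


def jevons_index(price_relatives):
    """Unweighted geometric mean of price relatives (current price / base price)."""
    logs = [math.log(r) for r in price_relatives]
    return math.exp(sum(logs) / len(logs))


def flag_outliers(price_relatives, threshold=3.5):
    """Flag relatives whose log lies far from the median, measured in scaled MADs.

    The threshold of 3.5 is an assumed, commonly used cut-off, not a value
    taken from the project.
    """
    logs = [math.log(r) for r in price_relatives]
    centre = median(logs)
    mad = median(abs(x - centre) for x in logs)
    if mad == 0:
        # No spread to measure against, so nothing is flagged.
        return [False] * len(logs)
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normal.
    return [abs(0.6745 * (x - centre) / mad) > threshold for x in logs]


# Hypothetical price relatives; the last one mimics a collection error
# such as a misplaced decimal point in a scraped price.
relatives = [1.02, 0.99, 1.01, 1.03, 1.00, 0.98, 10.0]

flags = flag_outliers(relatives)
cleaned = [r for r, is_outlier in zip(relatives, flags) if not is_outlier]

index_all = jevons_index(relatives)
index_cleaned = jevons_index(cleaned)

print(f"Jevons index, all observations: {index_all:.3f}")
print(f"Jevons index, outliers removed: {index_cleaned:.3f}")
```

In this toy example one erroneous observation out of seven is enough to move the index from roughly no price change to a large apparent increase. Dropping flagged observations is only one possible treatment; whether to delete, correct or down-weight them is a separate methodological choice.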

Overview

Point-of-sale (POS) scanner data record, for each good that can be uniquely identified by a barcode, the quantity, value and price of the product sold at retailer tills. Web-scraping software collects a product description and price directly from retailer websites. As these new data sources have become more readily available, national statistical institutions have increasingly used them to supplement their manually and centrally collected prices when constructing consumer price statistics.

While outlier detection methodologies are well established for current data, the introduction of these alternative data sources represents a new challenge. In the case of scanner data, it can be difficult to isolate the impact of changes in pricing, features or packaging on sales quantity, while the complicated and changeable nature of web page structures can affect the reliability of web-scraped data. Furthermore, issues such as seasonality, new item codes and missing item codes are more difficult to solve with these alternative data sources than with standard price data.

In this project we take stock of outlier detection methodologies developed and implemented by national statistical institutions that have incorporated scanner or web-scraped data in their production of consumer price indices, and we develop methods for incorporating these data into the UK CPI.

Outputs

CODE Mao, X. (2021) ‘Python Algorithm for Alternative Data Outlier Detection’, November 2021.  Download Code.

Mao, X., Boshoff, J., Young, G. and Küçük, H. (2021) ‘Applying Machine Learning to Detect Outliers in Alternative Data Sources. A universal methodology framework for scanner and web-scraped data sources’, ESCoE Technical Report, ESCoE TR-12

Mao, X. ‘Improving Price and Inflation Measurement with Machine Learning, Outlier Detection and Alternative Data’, ESCoE Blog, 22nd November 2021

Boshoff, J., Young, G. and Mao, X. ‘Outlier Detection Methodologies for Alternative Data Sources: International Review of Current Practice’, ESCoE Conference on Economic Measurement 2021, Poster Exhibition, 11-13 May 2021. Poster Video Presentation

Mao, X. (2021) ‘Applying Machine Learning for Alternative Data Outlier Detection and Deletion’, ESCoE Conference on Economic Measurement 2021, Poster Exhibition, 11-13 May 2021. Poster Video Presentation.

Boshoff, J., Mao, X. and Young, G. ‘Outlier detection methodologies for alternative data sources: International review of current practices’, NIESR Discussion Paper No 523, 2nd Dec 2020

Boshoff, J., Mao, X. and Young, G. (2020) ‘Outlier detection methodologies for alternative data sources: International review of current practices’, ESCoE Technical Report, ESCoE TR-07

Boshoff, J. ‘The importance of alternative data sources’, NIESR Blog, 24th July 2020

Boshoff, J. ‘The importance of alternative data sources’, ESCoE Blog, 24th July 2020

People

Janine Boshoff

Xuxin Mao

Partners

Related Publications