Point of sale (POS) scanner data record, for each good that can be uniquely identified by a barcode, the quantity, value and price of the product sold at retailer tills. Web-scraping software collects product descriptions and prices directly from retailer websites. As these new data sources have become readily available, national statistical institutions have increasingly used them to supplement their manually and centrally collected prices in the construction of consumer price statistics.
While outlier detection methodologies are well established for conventionally collected price data, the introduction of these alternative data sources represents a new challenge. With scanner data, it can be difficult to isolate the impact of changes in pricing, features or packaging on sales quantity; with web-scraped data, the complicated and changeable structure of web pages can affect reliability. Furthermore, issues such as seasonality, new item codes and missing item codes are more difficult to solve with these alternative data sources than with standard price data.
In this project, we take stock of the outlier detection methodologies developed and implemented by national statistical institutions that have incorporated scanner or web-scraped data into their production of consumer price indices. We then develop methods for incorporating these data into the UK CPI.