By Janine Boshoff
To calculate inflation, statistical agencies such as the Office for National Statistics (ONS) collect data on the prices of the goods and services that consumers purchase. The ONS creates a representative ‘basket of goods’ that includes the goods and services that households buy and consume most often. Historically, the prices for these goods and services would then be collected by dedicated price collectors each month, allowing the ONS to calculate the average price change between consecutive months. Naturally, manual price collection sometimes introduces errors into the data, e.g. a pot of strawberry jam is listed as costing £25 instead of £2.50. Such an error is considered an ‘outlier’, a price observation that is significantly different from the average price for that particular good. It is important to remove these errors from the dataset, because their inclusion could result in calculated inflation appearing much higher than it really is. For this reason, the ONS developed methods to identify these errors and remove them from the dataset, ensuring that inflation is calculated correctly each month.
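To make the idea of an outlier concrete, here is a minimal Python sketch of how a mis-keyed price like the jam example might be flagged. It is not the ONS's method: the function name `flag_outliers`, the threshold `max_ratio` and the price quotes are all hypothetical, and the rule compares each quote with the median rather than the mean, because a single extreme value pulls the mean towards itself.

```python
from statistics import median


def flag_outliers(prices, max_ratio=3.0):
    """Return one boolean per quote, True where the quote looks like an outlier.

    A quote is flagged when it is more than `max_ratio` times the median
    quote, or less than the median divided by `max_ratio`. Both the rule
    and the default threshold are illustrative assumptions.
    """
    typical = median(prices)
    return [p > typical * max_ratio or p < typical / max_ratio for p in prices]


# Hypothetical quotes (in pounds) collected for one pot of strawberry jam.
jam_quotes = [2.40, 2.50, 2.55, 25.00, 2.45]

flags = flag_outliers(jam_quotes)

# Keep only the quotes that were not flagged.
cleaned = [p for p, is_outlier in zip(jam_quotes, flags) if not is_outlier]
```

With these made-up quotes, only the £25.00 entry is flagged, and the average of the remaining quotes stays close to £2.50 instead of being dragged up to almost £7.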
Point-of-sale (POS) scanner data record, for each good that can be uniquely identified by a barcode, the quantity, value and price of the product sold at retailer tills. Web-scraping software collects a product description and price directly from retailer websites. As scanner data and web-scraped data become more accessible, statistical agencies have started to explore the use of these alternative data sources in their production of inflation statistics. In fact, alternative data sources have already proved informative during the Covid-19 pandemic, when price collectors could not visit stores or retailers due to social distancing measures. Data collected from retailers and their websites have provided valuable information on the changing spending patterns of households during the lockdown.
While the use of alternative data sources has clear benefits for statistical agencies, the large volume and unique nature of alternative data require further research to understand how to detect and correct for outliers before the data can be used to calculate inflation. For example, in the case of scanner data it can be difficult to isolate the impact of changes in pricing, features or packaging on sales quantity, while the complicated and changeable nature of web page structures can affect the reliability of web-scraped data. The Technical Report reviews published articles and working papers from statistical agencies and academia to provide an overview of the current methods for outlier detection.
The ONS hopes to incorporate alternative data sources into its calculation of consumer price inflation by Quarter 1 (January to March) 2023 but will first undertake research on the available data and methods to detect outliers. This literature review is the first step to understanding the structure of web-scraped data and the potential methods that can be used to correct for any outliers.
ESCoE blogs are published to further debate. Any views expressed are solely those of the author(s) and so cannot be taken to represent those of the ESCoE, its partner institutions or the Office for National Statistics.