From energy bills to online shopping carts, it is not hard to notice that many prices are going up. As the global economy slowly recovers from the COVID-19 pandemic, the debate now turns to inflation and its impact on macroeconomic policy, financial markets and the wellbeing of society.
Being able to measure inflation accurately and in a timely manner is crucial to good policy. Current practices in measuring monthly price changes need to be kept under review and where possible improved, particularly as shopping habits change and new methods become available. Data are compiled by price collectors, either remotely or in-store, and provide a backward-looking picture of inflation status with a three- or four-week lag. With hundreds of thousands of individual prices being collected it is important to detect erroneous information in the dataset because it could result in calculated inflation appearing much higher or lower than it really is.
In an effort to get a timelier read on what’s happening to prices and to improve the measurement of key indices like the consumer price index (CPI), the Office for National Statistics and its international counterparts are adopting so-called alternative data sources such as scanner data and web-scraped data (ONS, 2019, 2021, and Boshoff, Mao and Young, 2020). Scanner data are collected by retailers at the point of sale, providing statistical offices with enhanced product, geographic and temporal coverage as well as significantly more information on the number and type of products sold. Web-scraped data are prices and other related product information collected automatically from websites.
These alternative data sources provide a wealth of additional information about online prices, such as product descriptions, and do so in a timelier way. Before the data are incorporated, however, accurate outlier detection is extremely important to identify and invalidate price movements that differ significantly from the norm for a particular item.
Outliers are data points that differ significantly from other observations. A price observation that is significantly different from the average price for that particular good or service is considered an outlier. Currently, most statistical agencies use parametric outlier detection methods such as statistical profiling, i.e., creating upper and lower cut-off points by adding and subtracting a fixed number of standard deviations from a mean or median. However, because data from alternative sources are large in volume and unusual in nature, we do not always know their distribution, so non-parametric approaches are required to detect outliers.
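To make statistical profiling concrete, here is a minimal Python sketch of the cut-off rule described above. The function names, the choice of `k` and the prices are all made up for illustration; they are not taken from any agency's production code.

```python
from statistics import mean, median, stdev


def profile_bounds(prices, k=3.0, centre="mean"):
    """Return (lower, upper) cut-offs: centre -/+ k sample standard deviations."""
    mid = mean(prices) if centre == "mean" else median(prices)
    spread = stdev(prices)
    return mid - k * spread, mid + k * spread


def flag_outliers(prices, k=3.0, centre="mean"):
    """Return the prices that fall outside the profiling cut-offs."""
    lower, upper = profile_bounds(prices, k, centre)
    return [p for p in prices if p < lower or p > upper]


# Hypothetical prices for one item; the last value looks like a keying error.
prices = [1.99, 2.05, 1.95, 2.10, 2.00, 2.02, 1.98, 20.00]

flagged_k2 = flag_outliers(prices, k=2.0)  # flags the 20.00 observation
flagged_k3 = flag_outliers(prices, k=3.0)  # flags nothing
```

With these toy numbers the extreme price inflates the standard deviation so much that it escapes a three-standard-deviation rule. That sensitivity to the assumed spread of the data is one reason a fixed parametric cut-off can be hard to tune.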
Machine learning can be used for anomaly detection when current practices like statistical profiling are not suited to deal with the text information and complex product identification that characterises alternative data sources. The machine learning tools and methods that we use for outlier detection cover natural language processing and clustering algorithms. But, for web-scraped data documents with descriptive information, applying machine learning to natural language faces one major hurdle: algorithms usually deal with numbers while natural language is text.
In our research, we propose using the Doc2Vec method to transform that descriptive text into numerical vectors, a step known as text vectorization, ahead of outlier detection. With the derived vectors, we can then apply a single clustering method to detect outliers in both scanner and web-scraped data. We use a well-known clustering algorithm that is common in data mining and machine learning: density-based spatial clustering of applications with noise (DBSCAN). DBSCAN is a flexible non-parametric method that deals well with clusters of different densities, such as those in web-scraped data, and adapts to data features such as trend changes or breaks. It is easy and fast to calculate, and is available in popular computing packages such as Python’s scikit-learn. Although we apply it to alternative data sources, DBSCAN can also be used to detect outliers in standard data such as cross-sectional or time series data.
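For readers who want to see how DBSCAN separates clusters from noise, the following is a small from-scratch sketch in plain Python. It is an illustration of the general algorithm, not our implementation: the toy two-dimensional points merely stand in for document vectors, and the `eps` and `min_samples` values are arbitrary assumptions. In practice one would more likely use scikit-learn's `DBSCAN` class, which takes parameters of the same names and likewise labels noise points as -1.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # A point's neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is core: keep expanding
    return labels


# Toy vectors: two tight groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (2.5, 9.0),
]
labels = dbscan(points, eps=0.2, min_samples=3)
outliers = [p for p, label in zip(points, labels) if label == NOISE]
```

The isolated point has too few neighbours within `eps` to belong to any cluster, so it is returned as noise. No assumption about the distribution of the data is needed, only a notion of distance and density.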
The architecture in Figure 1 shows how we propose to apply machine learning to alternative data sources, using a web-scraped data sample.
We first use Doc2Vec and text density-based clustering methods to vectorise the text information on goods characteristics and identify goods groups for further outlier detection analyses. After converting the text information to document vectors, we use a DBSCAN algorithm to detect outliers and cluster the data into specific goods. We then construct monthly observations from the date information and order the generated data by ID, date or price, so that the goods observations are listed and referenceable over time and across goods groups. Ordering the goods within each group by date gives monthly subsets of price data, each containing all observations that occur in a specific calendar month. Finally, we conduct two rounds of outlier detection: one on price observations within the same month and one over the whole observation period.
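The grouping and the two detection rounds can be sketched as follows. This is one possible reading of the steps above, under several assumptions: observations are `(goods_group, date, price)` tuples that have already been assigned to goods groups, the detector is a simplified one-dimensional DBSCAN noise rule applied to price levels, and the names, parameter values and prices are hypothetical.

```python
from collections import defaultdict
from datetime import date


def noise_1d(values, eps, min_samples):
    """Indices that DBSCAN would label as noise for 1-D values."""
    n = len(values)
    near = [[j for j in range(n) if abs(values[i] - values[j]) <= eps]
            for i in range(n)]
    core = [len(near[i]) >= min_samples for i in range(n)]
    # Noise = not a core point and not within eps of any core point.
    return [i for i in range(n)
            if not core[i] and not any(core[j] for j in near[i])]


def two_round_detection(observations, eps, min_samples):
    """Flag outliers within each calendar month, then over the whole period."""
    monthly = defaultdict(list)  # (group, year, month) -> rows
    whole = defaultdict(list)    # group -> rows over the full period
    for group, day, price in sorted(observations):  # order by group, date
        monthly[(group, day.year, day.month)].append((group, day, price))
        whole[group].append((group, day, price))

    flagged = set()
    for subsets in (monthly, whole):  # round 1, then round 2
        for rows in subsets.values():
            prices = [price for _, _, price in rows]
            for i in noise_1d(prices, eps, min_samples):
                flagged.add(rows[i])
    return sorted(flagged)


# Made-up weekly prices for one goods group over two months.
observations = [
    ("lager", date(2021, 1, 4), 4.00), ("lager", date(2021, 1, 11), 4.05),
    ("lager", date(2021, 1, 18), 3.95), ("lager", date(2021, 1, 25), 9.99),
    ("lager", date(2021, 2, 1), 4.10), ("lager", date(2021, 2, 8), 4.00),
    ("lager", date(2021, 2, 15), 4.05), ("lager", date(2021, 2, 22), 4.02),
]
flagged = two_round_detection(observations, eps=0.25, min_samples=3)
```

Here the 9.99 observation is flagged because no other price lies within `eps` of it, either within January or over the full two months.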
Among its many advantages, DBSCAN can detect significant abnormal changes in price levels that current practices miss, which improves price measurement and CPI construction. We take an example from the scanner price data from Dominick’s Finer Foods (DFF), a large Midwestern US supermarket chain which operates around 130 stores in the Chicago Metro area. Its beer category contains 3,846,701 observations of 790 forms of beer from 14 September 1989 (week 1) to 05 January 1997 (week 399). As shown in Figure 2, for a specific beer that was on sale over weeks 285-323, DBSCAN identifies 6 outliers with a sharp weekly price change of more than half a dollar. By comparison, the Tukey method currently employed for ONS outlier detection does not flag any outliers for further examination (the right-hand chart of Figure 2), which may bias the resulting CPI.
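To illustrate how a rule of the Tukey type can miss sharp movements, the sketch below applies the textbook Tukey fences (quartiles plus or minus 1.5 times the interquartile range) to price levels. This is an assumption made for illustration: the variant used in production by a statistical office may differ, and the weekly prices are made up rather than taken from the DFF data or Figure 2.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Return the values lying outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical weekly prices that switch between two price points.
prices = [5.99, 5.99, 5.29, 5.29, 5.99, 5.99, 5.29, 5.99, 5.99]

outside_fences = tukey_outliers(prices)

# Week-on-week changes larger than half a dollar.
jumps = [
    (week, round(after - before, 2))
    for week, (before, after) in enumerate(zip(prices, prices[1:]), start=2)
    if abs(after - before) > 0.5
]
```

In this toy series every price sits inside the fences, so nothing is flagged, yet there are four week-on-week moves of 70 cents. Because both price points occur often enough to shape the quartiles, a level-based fence treats them all as normal, which is one way sharp changes can go undetected.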
In summary, our methodological framework detects outliers with a combination of natural language processing and clustering methods. We can use the methodology to generate improved price indices by providing standard price time series with outliers addressed at the product level. The framework provides flexible outlier detection methods that could benefit from experts’ suggestions (the Delphi method) or from a per-good threshold recommendation engine for DBSCAN setups. The current algorithm can detect outliers in price changes for more real-time price measurement. It could also be developed further to explain abnormal changes in the economy and financial markets, which may provide much-needed information to guide us through the uncertain post-COVID period.
If you are interested in the technical details of our methodology, please refer to our full ESCoE Technical Report (Mao et al., 2021).
Download the code here.
Our proposed methodological framework also applies to scanner data.
Boshoff, J., Mao, X. and G. Young (2020) Outlier Detection Methodologies for Alternative Data Sources: International Review of Current Practices, ESCoE Technical Report 07. Link: escoe.ac.uk/publications/outlier-detection-methodologies-for-alternativedata-sources-international-review-of-current-practices/
Mao, X., Boshoff, J., Kucuk, H., and G. Young (2021) Applying Machine Learning to Detect Outliers in Alternative Data Sources. A universal methodology framework for scanner and web-scraped data sources, ESCoE Technical Report 12. Link: escoe.ac.uk/publications/applying-machine-learning-alternative-data-sources/
Office for National Statistics (ONS) (2019) Consumer Prices Indices Technical Manual, 2019. Link: https://www.ons.gov.uk/economy/inflationandpriceindices/methodologies/consumerpricesindicestechnicalmanual2019
ONS (2021) Transformation of Consumer Price Statistics: November 2021. Link: https://www.ons.gov.uk/economy/inflationandpriceindices/articles/introducingalternativedatasourcesintoconsumerpricestatistics/november2021
Xuxin Mao is a Principal Economist at the National Institute of Economic and Social Research.
ESCoE blogs are published to further debate. Any views expressed are solely those of the author(s) and so cannot be taken to represent those of the ESCoE, its partner institutions or the Office for National Statistics.