We are exploring how new data sources – such as business website data – and machine learning methods could help create bottom-up industrial taxonomies providing a more up-to-date and detailed view of the economy. This can inform targeted policies to level up local economies, support innovation and accelerate the transition to the green economy.
Official industrial taxonomies are an important tool for understanding the composition of the economy and the situation in different sectors. These taxonomies are developed through an internationally coordinated process and put into practice when businesses become incorporated or respond to a business survey. This top-down approach has the advantage of avoiding double counting and ensuring historical consistency in economic statistics. Its downside is that it fails to capture industries that appeared after a taxonomy was established, and that it contains noise because businesses have few incentives to choose the right industry code or to update it when they shift their focus. For example, the industrial taxonomy currently in use in the UK (the Standard Industrial Classification, or SIC) was last updated in 2007, and a large number of businesses are registered under uninformative “not elsewhere classified” codes, suggesting pent-up demand for richer and more accurate descriptions of their economic activities.
New data sources, in particular text descriptions of what businesses do, together with machine learning algorithms to analyse and cluster that text, could help generate bottom-up industrial taxonomies that complement, expand and improve official taxonomies. These taxonomies could be applied to businesses automatically, reducing the scope for misclassification due to human error.
In this programme of research we have analysed business website data provided by Glass, a business data startup, using a host of machine learning methods in order to assess the limitations of the SIC-2007 taxonomy and create a bottom-up alternative.
To do this, we have built a complex data pipeline that merges our business website corpus with Companies House (the business register), with the goal of creating a labelled dataset that matches text descriptions against SIC codes. We then use a supervised machine learning strategy to assess the extent to which text is informative about a firm’s chosen code. After this evaluation step, we use topic modelling, a text mining method that identifies “themes” in a body of text (in this case business descriptions), and use those themes to cluster companies into groups of similar firms. Having done this inside 4-digit SIC codes, we reassign companies based on their text similarity across the corpus of all companies, following an iterative process that eventually yields robust “text sectors”.
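The supervised evaluation step can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes classifier over bag-of-words counts, which is a stand-in chosen for brevity and not necessarily the model used in this research. The function names (`train`, `predict`, `accuracy`), the sector labels standing in for SIC codes, and the toy descriptions are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case a description and split it on whitespace."""
    return text.lower().split()


def train(descriptions, codes):
    """Count words per industry code (hypothetical stand-ins for SIC codes)."""
    word_counts = defaultdict(Counter)
    code_counts = Counter(codes)
    vocab = set()
    for text, code in zip(descriptions, codes):
        tokens = tokenize(text)
        word_counts[code].update(tokens)
        vocab.update(tokens)
    return word_counts, code_counts, vocab


def predict(model, text):
    """Return the code with the highest naive Bayes log-probability."""
    word_counts, code_counts, vocab = model
    total_docs = sum(code_counts.values())
    best_code, best_score = None, -math.inf
    for code in sorted(code_counts):
        total_words = sum(word_counts[code].values())
        score = math.log(code_counts[code] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids log(0) for words unseen in this code.
                smoothed = (word_counts[code][token] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


def accuracy(model, descriptions, codes):
    """Share of firms whose predicted code matches their registered code."""
    hits = sum(predict(model, t) == c for t, c in zip(descriptions, codes))
    return hits / len(codes)


# Toy labelled dataset: website text matched to a registered code (illustrative only).
train_texts = [
    "we build custom software and mobile apps",
    "cloud software development and it consulting",
    "family restaurant serving fresh pasta and pizza",
    "restaurant and bar with seasonal menu",
]
train_codes = ["software", "software", "restaurant", "restaurant"]
model = train(train_texts, train_codes)

held_out_texts = ["bespoke software for retailers", "pizza restaurant"]
held_out_codes = ["software", "restaurant"]
held_out_accuracy = accuracy(model, held_out_texts, held_out_codes)
```

Under these assumptions, held-out accuracy is one rough measure of how informative text is about a firm's code: codes that are predicted poorly are candidates for being noisy or too heterogeneous.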
We have analysed the extent to which these text sectors can be used to decompose uninformative “not elsewhere classified” sectors into their constituent parts, to study policy-relevant industries in the green economy, and to characterise local economies more accurately. We have also used them to segment UK local authorities into clusters with different economic and innovation outcomes.
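As a rough illustration of what decomposing a “not elsewhere classified” sector means in practice, the sketch below computes the share of each text sector among firms registered under one code. It assumes each firm already carries a text sector label. The `decompose` function, the field names, the code label `"nec"` and the sector names are all hypothetical, and the records are toy data.

```python
from collections import Counter


def decompose(firms, sic_code):
    """Return the share of each text sector among firms registered under sic_code."""
    sectors = Counter(f["text_sector"] for f in firms if f["sic"] == sic_code)
    total = sum(sectors.values())
    if total == 0:
        return {}
    # most_common() orders sectors from largest to smallest share.
    return {sector: count / total for sector, count in sectors.most_common()}


# Toy firm records: "nec" stands in for a "not elsewhere classified" code.
firms = [
    {"name": "Firm A", "sic": "nec", "text_sector": "drone services"},
    {"name": "Firm B", "sic": "nec", "text_sector": "drone services"},
    {"name": "Firm C", "sic": "nec", "text_sector": "coworking spaces"},
    {"name": "Firm D", "sic": "nec", "text_sector": "event ticketing"},
    {"name": "Firm E", "sic": "other", "text_sector": "coworking spaces"},
]

shares = decompose(firms, "nec")
```

A code dominated by a single text sector is already fairly homogeneous, whereas one spread across many text sectors is a candidate for being split up.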
Although this programme of work is still in its early stages, we have presented emerging findings to leading scholars in economics and network science, and have started discussing with policymakers how it could support key policy agendas in the UK (particularly around levelling up) and improve economic statistics by informing future revisions of the SIC taxonomy and helping to identify misclassified companies.
Bishop, A., Mateos-Garcia, J. and Richardson, G. (2022) ‘Using Text Data to Improve Industrial Statistics in the UK’, ESCoE Discussion Paper Series, DP 2022-01.
Bishop, A., Mateos-Garcia, J. and Richardson, G. (2022) ‘Discovering Industries in Networks of Words’, ESCoE Blog, 17 January 2022.
Mateos-Garcia, J., Bishop, A. and Richardson, G. (2021) ‘Discovering industries in networks of words’, Complex Networks in Economics and Innovation, Contributed II, 30 June 2021.