An industrial taxonomy using web-data

Summary

We are exploring how new data sources – such as business website data – and machine learning methods could help create bottom-up industrial taxonomies providing a more up-to-date and detailed view of the economy. This can inform targeted policies to level up local economies, support innovation and accelerate the transition to the green economy.

Overview

Official industrial taxonomies are an important tool for understanding the composition of the economy and the situation in different sectors. These taxonomies are developed through an internationally coordinated process and put into practice when businesses become incorporated or respond to a business survey. This top-down approach has the advantage of avoiding double counting and ensuring historical consistency in economic statistics. Its downside is that it fails to capture industries that appeared after a taxonomy was established, and that it contains noise because businesses have few incentives to choose the right industry code or to update it when they shift their focus. For example, the industrial taxonomy currently in use in the UK (the Standard Industrial Classification, SIC) was last updated in 2007, and a large number of businesses are registered under uninformative “not elsewhere classified” codes, suggesting pent-up demand for richer and more accurate descriptions of their economic activities.

New data sources, in particular text descriptions of what businesses do, together with machine learning algorithms to analyse and cluster that text, could help generate bottom-up industrial taxonomies that complement, expand and improve official taxonomies. These taxonomies could be applied to businesses automatically, reducing the scope for misclassification due to human error.

Methods

In this programme of research we have analysed business website data provided by Glass, a business data startup, using a range of machine learning methods to assess the limitations of the SIC-2007 taxonomy and to create a bottom-up alternative.

To do this, we have built a data pipeline that merges our business website corpus with Companies House (the business register) to create a labelled dataset matching text descriptions against SIC codes. We then use a supervised machine learning strategy to assess the extent to which text is informative about a firm’s chosen code. After this evaluation step, we apply topic modelling, a text mining method that identifies “themes” in a body of text (in this case business descriptions), and use those themes to cluster companies into groups of similar firms. Having done this inside 4-digit SIC codes, we reassign companies based on their text similarity across the corpus of all companies, following an iterative process that eventually yields robust “text sectors”.
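The iterative reassignment step described above can be illustrated with a minimal sketch. It is not the project's pipeline: it swaps topic modelling for plain bag-of-words cosine similarity, uses only the Python standard library, and runs on a handful of invented descriptions. The function names (`vectorise`, `cosine`, `sector_profiles`, `reassign`), the labels `"a"` and `"b"`, and the `max_iter` parameter are all hypothetical.

```python
import math
from collections import Counter


def vectorise(text):
    """Turn a description into a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sector_profiles(vectors, labels):
    """Sum the word counts of all companies currently in each sector."""
    profiles = {}
    for vec, label in zip(vectors, labels):
        profiles.setdefault(label, Counter()).update(vec)
    return profiles


def reassign(descriptions, initial_labels, max_iter=10):
    """Move each company to its most similar sector until labels stop changing."""
    vectors = [vectorise(d) for d in descriptions]
    labels = list(initial_labels)
    for _ in range(max_iter):
        profiles = sector_profiles(vectors, labels)
        new_labels = [
            max(sorted(profiles), key=lambda lab: cosine(vec, profiles[lab]))
            for vec in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Invented toy descriptions; the last one starts in the "wrong" group.
descriptions = [
    "solar panel installation and renewable energy consulting",
    "wind and solar renewable energy engineering",
    "meditation classes and mindfulness retreats",
    "mindfulness coaching and meditation workshops",
    "renewable energy storage consulting",
]
initial_labels = ["a", "a", "b", "b", "b"]

final_labels = reassign(descriptions, initial_labels)
print(final_labels)
```

In this toy run the fifth company moves from group `"b"` to group `"a"`, because its vocabulary overlaps more with the renewable energy descriptions than with the meditation ones. A real implementation would need proper tokenisation, term weighting and a more robust similarity model than raw word counts.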

We have analysed the extent to which these text sectors can be used to decompose uninformative “not elsewhere classified” sectors into their constituent parts, to study policy-relevant industries in the green economy, and to characterise local economies more accurately. We have also used them to segment UK local authorities into clusters with different economic and innovation outcomes.

Findings

  • State-of-the-art machine learning models struggle to classify firms into their SIC codes using text data. This is particularly difficult for companies in uninformative “not elsewhere classified” codes, suggesting that their composition is heterogeneous and overlaps with other codes.
  • Measures of economic complexity based on our bottom-up taxonomy are more strongly linked to proxies for local productivity and knowledge intensity than those based on the SIC-2007 taxonomy, suggesting that the text-based approach is better at capturing the industrial features of a location that make it competitive and productive.
  • When we decompose the 4-digit SIC code 7490 (“Other professional, scientific and technical activities not elsewhere classified”) into its constituent text sectors, we find a vast range of activities inside it, from meditation to renewable energy, suggesting substantial opportunities to create more finely grained classifications of the economy using text data.
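The decomposition in the last finding amounts to counting which text sectors the companies registered under one SIC code fall into. The sketch below shows that bookkeeping on invented records, assuming each company already carries a text sector label. The company ids, the counts, the `"software"` sector, the code `6201` and the `decompose` function are hypothetical; only the code 7490 and the sector names "meditation" and "renewable energy" come from the text above, and the resulting shares are not findings.

```python
from collections import Counter

# Invented toy records: (company id, 4-digit SIC code, text sector).
companies = [
    ("c1", "7490", "renewable energy"),
    ("c2", "7490", "meditation"),
    ("c3", "7490", "renewable energy"),
    ("c4", "7490", "meditation"),
    ("c5", "6201", "software"),
    ("c6", "7490", "renewable energy"),
]


def decompose(records, sic_code):
    """Return the share of each text sector among companies in one SIC code."""
    counts = Counter(sector for _, sic, sector in records if sic == sic_code)
    total = sum(counts.values())
    # most_common() orders sectors from largest to smallest share.
    return {sector: n / total for sector, n in counts.most_common()}


shares = decompose(companies, "7490")
print(shares)
```

A SIC code whose companies are spread thinly over many unrelated text sectors would be a candidate for finer-grained classification.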

Impact

Although this programme of work is still in its early stages, we have presented emerging findings to leading scholars in economics and network science. We have also started discussing with policymakers how it could support key policy agendas in the UK (particularly around levelling up) and improve economic statistics by informing future revisions of the SIC taxonomy and helping to identify misclassified companies.

Outputs

Mateos-Garcia, J. and Richardson, G. “A Bottom Up Industrial Taxonomy for the UK. Refinements and an Application” ESCoE Discussion Paper Series, DP 2022-29

Mateos-Garcia, J. “Modelling an Evolving Economy: Summary and Reflections” ESCoE Blog, 22 November 2022

Mateos-Garcia, J. “A Bottom-up Industrial Taxonomy Using Web-data” Modelling an Evolving Economy, Digital Catapult, 7 October 2022

Bishop, A. Mateos-Garcia, J. and Richardson, G. (2022) “Using Text Data to Improve Industrial Statistics in the UK” ESCoE Discussion Paper Series, DP 2022-01

Bishop, A. Mateos-Garcia, J. and Richardson, G. “Discovering Industries in Networks of Words” ESCoE Blog, 17 January 2022

Mateos-Garcia, J. Bishop, A. and Richardson, G. “Discovering industries in networks of words” Complex Networks in Economics and Innovation, Contributed II, 30 Jun 2021

People

Juan Mateos-Garcia

George Richardson
