By Alex Bishop, Juan Mateos-Garcia and George Richardson
New data and new ways to analyse them with machine learning algorithms and Artificial Intelligence (AI) systems are transforming our economy and everyday lives.
This revolution is also impacting economic analysis, enhancing our ability to study phenomena as varied as the evolution of labour markets, the operation of gig economy platforms or the economic effects of Covid-19.
Our understanding of the industrial composition of the economy and how it evolves also stands to benefit from richer data about what businesses do, the technologies that they adopt and the consequences that this has for innovation and economic growth. This is becoming increasingly important as governments look for evidence to inform policies to support emerging industries, level up local economies and accelerate the transition to a green economy.
In our new ESCoE Discussion Paper, we explore one of these opportunities: to use business website data to develop a bottom-up industrial taxonomy that captures the composition of the UK economy in a more timely and accurate way than is possible with the Standard Industrial Classification (SIC) taxonomy currently in use.
The rationale behind this is clear: the SIC codes were last updated in 2007 and therefore exclude a plethora of new and important industries that have appeared since. They contain many uninformative “not elsewhere classified” codes that are difficult to interpret, and they force each company into a single code, making it difficult to study (in some cases innovative and high growth) companies that straddle sectors.
In contrast to the industrial codes, which businesses select when they become incorporated, the text in business websites captures what businesses say they do in order to attract customers, investors, partners and employees. If a business adopts a new technology that makes it more innovative and productive, it is likely to mention it on its website. If it operates in multiple sectors, this should also appear.
Business websites also have some key limitations as a source of data for industrial analysis. For instance, businesses with a website are probably unrepresentative of the wider population and might exaggerate their capabilities, and of course there is the challenge of collecting this noisy and unstructured data, cleaning it and structuring it to develop statistics that can be used to inform policy.
We analyse a corpus of 1.8 million business website descriptions collected by Glass, a big business data analysis startup, in 2020.
This involves matching this corpus with Companies House, the UK business registry, in order to identify the SIC code associated with each business website.
Having done this, we assess potential limitations of the SIC taxonomy by trying to predict the 4-digit SIC code that a business belongs to using the text in its description, and by analysing the topic composition of the business descriptions in different SIC codes. We then cluster businesses into “text sectors” based on their text descriptions, and explore the results.
Our predictive and topic analyses suggest important misalignments between what businesses say they do and the SIC codes where they are classified, particularly in “not elsewhere classified” codes.
If there were a strong association between business SIC codes and their descriptions, it should be easy to predict the former with the latter. We train a state-of-the-art transformer model on a subset of the data and evaluate its performance on a held-out set. We find that while the model performs quite well for relatively tight and well-specified SIC codes such as Dentistry, Funeral organisers, Roofing activities and Hairdressers, it struggles with more ambiguous and vaguely defined sectors such as “Other personal activities not elsewhere classified” or “Other professional, scientific or technical activities”.
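To make the logic of this test concrete, here is a minimal sketch of the train, predict and evaluate loop. It is not the transformer model described above: it swaps in a tiny bag-of-words Naive Bayes classifier so that it runs with the Python standard library alone, and the descriptions, labels and function names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case whitespace tokenisation (a deliberately crude assumption)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count words per label from (description, sic_label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}


def predict(model, text):
    """Return the label with the highest Laplace-smoothed log-probability."""
    total = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["label_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total)
        for token in tokenize(text):
            if token in model["vocab"]:  # words never seen in training are ignored
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def per_label_recall(model, held_out):
    """Share of held-out descriptions in each label that are predicted correctly."""
    hits, totals = Counter(), Counter()
    for text, label in held_out:
        totals[label] += 1
        hits[label] += predict(model, text) == label
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy data: two tight sectors and one catch-all code.
train = [
    ("family dental practice offering check-ups and whitening", "Dentistry"),
    ("dentist providing implants and dental hygiene", "Dentistry"),
    ("roofing contractor for slate and flat roof repairs", "Roofing"),
    ("roof repairs guttering and roofing installation", "Roofing"),
    ("life coaching and mediation services", "Other n.e.c."),
    ("copywriting and clinical trials consultancy", "Other n.e.c."),
]
held_out = [
    ("dental implants and whitening", "Dentistry"),
    ("flat roof repairs", "Roofing"),
    ("renewable energy consultancy for roofing firms", "Other n.e.c."),
]

model = train_naive_bayes(train)
recall = per_label_recall(model, held_out)
for label, value in recall.items():
    print(f"{label}: {value:.2f}")
```

Even in this toy setting the pattern is the one reported above: the tight sectors share a distinctive vocabulary and are recovered, while the catch-all description has little in common with other members of its code and is pulled towards a neighbouring sector.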
The figure below, where we project company descriptions for some of those sectors into a two-dimensional semantic space, explains why: the descriptions for these companies are widely scattered across the space, suggesting that they are heterogeneous and overlap with other industries.
We also use a hierarchical topic model to decompose company descriptions into their constituent topics. We would expect companies in the same industries to have similar themes, and sectors in the same parts of the SIC taxonomy to be thematically similar to each other. In Figure 2, we visualise these similarities and differences at the two-digit SIC level. While the diagonal shows some similarities in various parts of the taxonomy, particularly in manufacturing SIC codes, we also see horizontal and vertical lines cutting across the heatmap where sectors overlap with industries in far away parts of the taxonomy, suggesting ambiguity and potential misclassification.
Our prototype text-based industrial taxonomy could help address some of the limitations of the SIC taxonomy.
In order to build this prototype, we use a topic modelling algorithm that builds networks connecting companies based on their propensity to use the same words in their descriptions and decomposes these networks into communities of companies that we refer to as “text sectors”. We undertake this clustering inside each 4-digit SIC code, yielding 1,800 text sectors, and then iteratively reassign companies to the text sectors they are most similar to across the whole taxonomy. We name sectors with salient terms in the descriptions of the companies that they contain.
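The reassignment step can be sketched as follows. This is an illustration under strong assumptions rather than the procedure used in the paper: similarity is taken to be the cosine between a company's word counts and the summed word counts of a sector's other members, the starting assignment is given rather than produced by a topic model, and the company descriptions and sector names are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_sector(company, vectors, assignment, centroids):
    """Pick the sector whose other members look most like this company."""
    def score(sector):
        centroid = centroids[sector]
        if assignment[company] == sector:
            # Leave the company itself out so it cannot vote for its own sector.
            centroid = centroid - vectors[company]
        return cosine(vectors[company], centroid)
    # sorted() makes tie-breaking deterministic (alphabetical by sector name).
    return max(sorted(centroids), key=score)


def reassign(descriptions, initial, max_rounds=10):
    """Repeat the reassignment until no company changes sector."""
    vectors = {c: Counter(text.lower().split()) for c, text in descriptions.items()}
    assignment = dict(initial)
    for _ in range(max_rounds):
        centroids = {}
        for company, sector in assignment.items():
            centroids.setdefault(sector, Counter()).update(vectors[company])
        updated = {c: best_sector(c, vectors, assignment, centroids) for c in vectors}
        if updated == assignment:
            break
        assignment = updated
    return assignment


# Hypothetical companies; "e" starts in a sector that does not fit its description.
descriptions = {
    "a": "solar panel installation renewable energy",
    "b": "renewable energy wind turbine maintenance",
    "c": "executive coaching leadership training",
    "d": "career coaching training workshops",
    "e": "solar energy wind energy consultancy",
}
initial = {"a": "energy", "b": "energy", "c": "coaching", "d": "coaching", "e": "coaching"}

result = reassign(descriptions, initial)
print(result)
```

In this toy example company "e" moves from the coaching sector to the energy sector after one round, and the second round confirms that no further moves are needed.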
We find that uninformative “not elsewhere classified” SICs tend to be broken up into more text sectors, suggesting that they contain many heterogeneous activities. For example, the 4-digit SIC code 7490 (“Other Professional, Technical and Scientific Activities N.E.C.”) is decomposed into more than a hundred text sectors, including industries as diverse as coaching, mediation, renewable energies, copywriting and clinical trials.
In addition to helping us decompose SIC codes into more granular and informative text sectors, we can slice these 1,800 sectors to identify new groups of interrelated, potentially policy-relevant industries. Figure 3 shows various sectors that mention terms related to sustainability and the environment in their titles, as well as, through their colour, their original position in the SIC taxonomy.
We also explore whether the prototype industrial taxonomy we have developed is more informative about the economic performance of UK local economies than the SIC codes. In order to do this, we build indices of economic complexity (ECI) based on both taxonomies. ECIs capture the sophistication of the economic capabilities in a location and correlate strongly with economic growth and productivity. Our regression analysis shows that the ECI based on our new taxonomy is more strongly associated with a location’s GDP per capita, Annual Gross Pay and share of workforce with degrees, suggesting that it captures a location’s industrial capabilities more accurately than the SIC taxonomy.
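For readers unfamiliar with economic complexity indices, the sketch below shows one common way to compute them from a location-by-sector table of company counts, whichever taxonomy defines the sectors. The post does not say which ECI variant is used, so this assumes the "method of reflections" formulation with a location quotient threshold of 1; the function names and the toy numbers are illustrative only, and the code assumes every location and sector has at least one company and one specialisation.

```python
import statistics


def location_quotients(counts):
    """counts[l][s] = number of companies in location l and sector s."""
    total = sum(map(sum, counts))
    loc_totals = [sum(row) for row in counts]
    sec_totals = [sum(col) for col in zip(*counts)]
    return [
        [(counts[l][s] / loc_totals[l]) / (sec_totals[s] / total)
         for s in range(len(sec_totals))]
        for l in range(len(counts))
    ]


def specialisation_matrix(counts, threshold=1.0):
    """1 where a location is over-represented in a sector (LQ above threshold)."""
    return [[1 if lq > threshold else 0 for lq in row]
            for row in location_quotients(counts)]


def eci_by_reflections(m, n_iter=18):
    """Standardised complexity scores for locations from a binary matrix m."""
    if n_iter % 2:
        raise ValueError("use an even number of iterations for location scores")
    diversity = [sum(row) for row in m]          # sectors per location
    ubiquity = [sum(col) for col in zip(*m)]     # locations per sector
    k_loc, k_sec = diversity[:], ubiquity[:]
    n_loc, n_sec = len(m), len(ubiquity)
    for _ in range(n_iter):
        # Each side is updated with the average score of its neighbours.
        new_loc = [sum(m[l][s] * k_sec[s] for s in range(n_sec)) / diversity[l]
                   for l in range(n_loc)]
        new_sec = [sum(m[l][s] * k_loc[l] for l in range(n_loc)) / ubiquity[s]
                   for s in range(n_sec)]
        k_loc, k_sec = new_loc, new_sec
    mean, sd = statistics.mean(k_loc), statistics.pstdev(k_loc)
    return [(k - mean) / sd for k in k_loc]


# Hypothetical specialisation pattern: location 0 hosts rare sectors,
# location 2 only hosts the most common one.
toy_matrix = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
scores = eci_by_reflections(toy_matrix)
print([round(s, 3) for s in scores])
```

The resulting scores rank the location with diverse, rare specialisations highest. Building the same index once from SIC codes and once from text sectors gives the two measures that can then be compared in a regression.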
We conclude our analysis by exploiting similarities between text sectors to reconstruct a hierarchical taxonomy that can be explored at different levels of analysis. Our key idea is to measure the similarity between text sectors based on their tendency to be present in the same companies, and then to cluster them hierarchically based on those similarities. We visualise some preliminary results in Figure 4. In this figure, the smaller points in the periphery represent highly detailed text sectors that are combined into less detailed sectors as we ascend levels of the taxonomy (get closer to the centre). At the top level, the taxonomy contains six sectors.
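The idea of building a hierarchy from co-occurrence can be illustrated with a small agglomerative clustering sketch. The similarity measure here (Jaccard overlap between the sets of companies tagged with each sector) and the merging rule are assumptions chosen for simplicity, not necessarily those used in the paper, and the companies and sector names are made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of companies that two sectors (or groups of sectors) have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_hierarchy(company_sectors):
    """Merge the two most similar groups repeatedly; return the list of merges."""
    members = {}
    for company, sectors in company_sectors.items():
        for sector in sectors:
            members.setdefault(sector, set()).add(company)
    # Each cluster is keyed by the tuple of text sectors it contains.
    clusters = {(sector,): companies for sector, companies in members.items()}
    merges = []
    while len(clusters) > 1:
        a, b = max(
            combinations(sorted(clusters), 2),
            key=lambda pair: jaccard(clusters[pair[0]], clusters[pair[1]]),
        )
        similarity = jaccard(clusters[a], clusters[b])
        merges.append((a, b, similarity))
        clusters[a + b] = clusters.pop(a) | clusters.pop(b)
    return merges


# Hypothetical multi-sector tags for six companies.
company_sectors = {
    "c1": {"solar", "wind"},
    "c2": {"solar", "wind"},
    "c3": {"wind", "recycling"},
    "c4": {"coaching", "mediation"},
    "c5": {"coaching", "mediation"},
    "c6": {"recycling"},
}

merges = build_hierarchy(company_sectors)
for left, right, similarity in merges:
    print(left, "+", right, f"(similarity {similarity:.2f})")
```

Reading the merges from first to last corresponds to moving from the periphery of the taxonomy towards its centre: the most closely related sectors are joined first, and the final merges combine broad, largely unrelated groups.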
Notwithstanding its limitations, our analysis suggests that business website data has much to contribute to our understanding of industrial composition, its evolution, geography and impacts.
Our analysis and outputs are not without their weaknesses. We focus on the subset of the data that we have been able to match with Companies House and that falls in the 4-digit SIC codes containing more companies, and our 1,800 text sectors include duplicates as well as noisy and nonsensical categories. We are already exploring various strategies to address these issues and improve the coverage, robustness and interpretability of this prototype taxonomy.
One important question going forward will be how to harness its potential: how can a bottom-up taxonomy allowing multi-sector tagging and with unequal sector coverage be combined with a top-down, mutually exclusive and completely exhaustive official taxonomy? What is the right schedule to refresh the data-driven taxonomy managing the trade-off between timeliness and temporal consistency, and what infrastructure is required to do this? How can the taxonomy be distributed in a way that maximises its usefulness and usability for national and local policymakers and other stakeholders?
We look forward to tackling some of these questions in future work. Get in touch with us if you are interested in discussing any of this.
Read the full ESCoE Discussion Paper here.
Alex Bishop is Principal Researcher, Data Analytics Practice, at Nesta.
Juan Mateos-Garcia is Director, Data Analytics Practice, at Nesta and a Topic Lead at ESCoE.
George Richardson is Head of Data Science, Data Analytics Practice, at Nesta.
ESCoE blogs are published to further debate. Any views expressed are solely those of the author(s) and so cannot be taken to represent those of the ESCoE, its partner institutions or the Office for National Statistics.