In this programme of research we have analysed business website data provided by Glass, a business data startup, using a host of machine learning methods in order to assess the limitations of the SIC-2007 taxonomy and to create a bottom-up alternative.
In order to do this, we have built a complex data pipeline that merges our business website corpus with Companies House (the business register), with the goal of creating a labelled dataset that matches text descriptions against SIC codes. We then used a supervised machine learning strategy to assess the extent to which a firm’s text is informative about its chosen code. After this evaluation step, we used topic modelling, a text mining method that identifies “themes” in a body of text (in this case business descriptions), and analysed those themes to cluster companies into groups of similar firms. Having done this inside 4-digit SIC codes, we reassigned companies based on their text similarity across the corpus of all companies, following an iterative process that eventually yields robust “text sectors”.
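The iterative reassignment step can be illustrated with a minimal sketch. It is not the pipeline used in this research: it assumes plain bag-of-words vectors, cosine similarity and sector centroids in place of the topic-model representation, and the function names, toy descriptions and labels (`reassign`, `energy`, `food`) are hypothetical. Each company starts in its registered sector and is moved to the sector whose aggregate text it most resembles, and the loop repeats until no company changes sector.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokeniser; a real pipeline would clean and filter terms.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sector_centroids(vectors, labels):
    # Sum the term counts of every company currently assigned to each sector.
    # Summing rather than averaging is fine here: cosine ignores vector length.
    centroids = {}
    for vector, label in zip(vectors, labels):
        centroids.setdefault(label, Counter()).update(vector)
    return centroids


def reassign(descriptions, initial_labels, max_iter=20):
    # Move each company to its most similar sector until assignments are stable.
    vectors = [Counter(tokenize(d)) for d in descriptions]
    labels = list(initial_labels)
    for _ in range(max_iter):
        centroids = sector_centroids(vectors, labels)
        # sorted() makes tie-breaking between sectors deterministic.
        new_labels = [
            max(sorted(centroids), key=lambda s: cosine(vector, centroids[s]))
            for vector in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Toy data (hypothetical): the fourth company is registered under "food"
# although its description is about renewable energy.
descriptions = [
    "solar panel installation and renewable energy",
    "wind turbine renewable energy maintenance",
    "bakery bread cakes pastry",
    "solar energy renewable consultancy",
    "artisan bread bakery cakes",
]
registered_sectors = ["energy", "energy", "food", "food", "food"]

text_sectors = reassign(descriptions, registered_sectors)
print(text_sectors)
```

In this toy example the fourth company moves from its registered sector to the one whose text it shares, while the other assignments are unchanged. Seeding the loop with the registered codes mirrors the idea of starting from the existing taxonomy and letting the text correct it.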
We have analysed the extent to which these text sectors can be used to decompose uninformative “not elsewhere classified” sectors into their constituent parts, to study policy-relevant industries in the green economy, and to characterise local economies more accurately. We have also used them to segment UK local authorities into clusters with different economic and innovation outcomes.