Thinking about data linkage?


By Josh Martin

An increasing proportion of economic research is empirical, and an increasing fraction of that makes use of record-level data (or “microdata”). These data are at the level of the business or person, rather than aggregates (from the National Accounts or other sources). Using microdata allows researchers to examine distributions and heterogeneous effects across observation units with different characteristics, and generally allows for a far richer analysis.

This opportunity is enhanced even further when datasets can be combined or linked, often known as “data linkage”. For example, imagine that one household survey collects financial data such as earnings, and another collects data on opinions on various subjects. In isolation, each source provides information only on earnings or opinions, but not both. Assuming there are common observation units (i.e. households or people in this case) across both surveys, then linking them enables analysis of the relationship between opinions and earnings – creating a richer dataset which can provide greater insight for policymakers.

Anyone who has attempted data linkage knows that it can be hard and often messy. Respondents usually overlap between surveys only by chance, and the overlap represents only a fraction of either survey’s total size. Data can relate to ostensibly the same topic or variable across surveys, and yet be recorded differently or inconsistently. And the act of linking the data is made difficult by the limited availability, and quality, of common information across surveys that allows for such linking.

Record-level data linkage usually requires some form of “linking key” – information common across both sources, and unique to each record in each source. To take a simple example, linking records for Josh Martin across the aforementioned earnings and opinions surveys requires that a record exists for me in both surveys, that identifiable information (such as my name, address, National Insurance number, or similar) exists in each, and that no other record in either source shares all of that information. For instance, if we matched only on name, it’s quite possible another Josh Martin exists in one source, making the linkage unreliable. Data linkage keys are therefore best when they are unique.
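To make the uniqueness point concrete, here is a minimal sketch in Python of linking on a key while discarding any key value that is not unique within its own source. The records, the field names (`name`, `id`, `earnings`, `opinion`) and the `link_on_key` helper are all invented for illustration; they are not drawn from any real survey.

```python
from collections import Counter


def link_on_key(left, right, key):
    """Link two lists of record dicts on `key`.

    Only key values that appear exactly once in each source are linked,
    because a duplicated key cannot identify a single record.
    """
    left_counts = Counter(rec[key] for rec in left)
    right_counts = Counter(rec[key] for rec in right)
    # Index the right-hand source by its unique key values only.
    right_by_key = {
        rec[key]: rec for rec in right if right_counts[rec[key]] == 1
    }
    linked = []
    for rec in left:
        value = rec[key]
        if left_counts[value] == 1 and value in right_by_key:
            # Merge the two records into one combined record.
            linked.append({**rec, **right_by_key[value]})
    return linked


# Hypothetical survey extracts: two different people share a name.
earnings = [
    {"name": "Josh Martin", "id": "A1", "earnings": 30000},
    {"name": "Josh Martin", "id": "B2", "earnings": 45000},
    {"name": "Ann Lee", "id": "C3", "earnings": 38000},
]
opinions = [
    {"name": "Josh Martin", "id": "A1", "opinion": "agree"},
    {"name": "Ann Lee", "id": "C3", "opinion": "disagree"},
    {"name": "Sam Roy", "id": "D4", "opinion": "agree"},
]

by_name = link_on_key(earnings, opinions, "name")  # ambiguous name is dropped
by_id = link_on_key(earnings, opinions, "id")      # unique identifier links both
```

Linking on name alone has to drop both Josh Martin records, since there is no way to tell which one belongs to the opinions record, whereas the unique identifier links every overlapping respondent.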

In terms of linking business survey data, businesses have an array of identifiers: tax numbers, company registration numbers, statistical numbers, legal names, and so forth. Linking microdata from ONS business surveys together is reasonably straightforward – in almost all cases, business ‘reporting units’ are identified by a ‘reporting unit reference’, which is unique to each business, common across surveys, and constant over time. These reference numbers are stored and maintained in the Inter-Departmental Business Register (IDBR) – a near exhaustive database of businesses in the UK, managed by ONS. The existence of such an identifier makes linking ONS business survey microdata to other ONS business survey microdata straightforward.

A challenge arises, however, when linking ONS business survey data with non-ONS sources. Business surveys conducted by other government departments or private enterprises, and administrative data, usually do not have the same business identifiers as ONS sources. This is because the respondent businesses are recruited or sampled not using the IDBR as a sampling frame, but via some other means – such as using commercial data, email addresses, phone numbers, or some other identifier. Without a linking key across sources, data linkage is far more difficult.

In our new ESCoE Technical Report, ‘Matching UK Business Microdata – A Study Using ONS and CBI Business Surveys’, we provide a case study of such a linkage exercise, using a novel data source: the business surveys conducted by the Confederation of British Industry (CBI). These have run fairly consistently for many years, and collect a range of qualitative and quantitative data, including business forecast and outturn data – providing a wealth of information not available from ONS sources. However, to check the quality and veracity of the CBI data, we want to examine how well they agree with ONS survey sources at the business level. Doing so requires us to link the CBI survey data to ONS survey data, which in turn requires a linking key. The CBI surveys are not sampled using the IDBR, and hence do not contain the reporting unit references needed for linking to ONS survey sources. It is this challenge that we explore.

As we document in the paper, the ONS provides a service to link non-ONS data sources to the IDBR, which enables linkage to ONS business surveys. In our case study, we rely on business names, addresses and postcodes, which are imperfect but together provide a reasonable linking key. The ONS service we chose used propensity score matching, which can result in ‘multiple matches’ – that is, multiple records on the IDBR could be the appropriate match for the observation in the CBI data. To choose which one, we developed a rules-based algorithm, based on the presence of the IDBR units in the ONS business surveys of interest, and the employment size of the IDBR units. This was informed by our research objectives, and a good knowledge of the data. This approach allowed us to overcome a common challenge in microdata linkage.
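As a rough illustration of how a rules-based choice among multiple candidate matches might look, the Python sketch below prefers candidates that appear in the survey of interest and then breaks ties by employment size. The ordering of the rules, the preference for larger units, the field names (`ru_ref`, `employment`) and the `choose_match` helper are all assumptions made for this example; the actual rules are set out in the paper.

```python
def choose_match(candidates, surveyed_refs):
    """Pick one candidate unit from several possible matches.

    Hypothetical rules, applied in order:
      1. prefer units present in the survey of interest;
      2. among those, prefer the unit with the largest employment.
    Returns None when there are no candidates.
    """
    if not candidates:
        return None
    return max(
        candidates,
        # Tuples compare element by element, so survey presence
        # (True > False) dominates and employment breaks ties.
        key=lambda unit: (unit["ru_ref"] in surveyed_refs, unit["employment"]),
    )


# Invented candidate units returned for a single external record.
candidates = [
    {"ru_ref": "1001", "employment": 500},
    {"ru_ref": "1002", "employment": 40},
    {"ru_ref": "1003", "employment": 120},
]
# Invented set of reporting unit references present in the target survey.
surveyed_refs = {"1002", "1003"}

chosen = choose_match(candidates, surveyed_refs)
```

Here the largest unit is passed over because it does not appear in the survey, which reflects the idea that the choice of rules should follow from the research objective rather than from size alone.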

We also linked the CBI data to a commercial dataset, the FAME dataset from Bureau van Dijk. This used a different linking approach called “trigram decomposition”, which we also explain in the paper. The paper then presents matching rates of the exercise, including breakdowns by some business characteristics. Matching to FAME was more successful amongst large businesses, since these are more likely to appear in the FAME dataset. Overall, however, match rates were far higher with the IDBR than with FAME.
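For readers unfamiliar with trigrams, the sketch below shows one common way to compare names by breaking them into overlapping three-character sequences and scoring the overlap. It is a generic illustration, not the procedure used in the paper: the padding convention, the Jaccard score, and the `trigrams` and `trigram_similarity` helpers are all assumptions, and the business names are invented.

```python
def trigrams(text):
    """Return the set of three-character sequences in a normalised string."""
    # Lower-case and collapse runs of whitespace so formatting differences
    # do not affect the comparison.
    cleaned = " ".join(text.lower().split())
    if not cleaned:
        return set()
    # Pad so that the start and end of the name produce their own trigrams.
    padded = f"  {cleaned} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of two strings' trigram sets, between 0 and 1."""
    set_a, set_b = trigrams(a), trigrams(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Invented names: an abbreviated and a spelled-out form of the same
# business, and an unrelated business.
close = trigram_similarity("Acme Ltd", "Acme Limited")
far = trigram_similarity("Acme Ltd", "Zenith plc")
```

A score like this tolerates small spelling and formatting differences that would defeat an exact match, but in practice a threshold is needed to decide what counts as a match, and choosing it involves a trade-off between missed and false matches.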

We hope the case study of business data linkage presented by our paper is useful, and provides some insights into the processes and challenges involved. We summarise a range of linkage methods, offer practical advice to practitioners, and report matching rates that might be a useful benchmark for future researchers.

Read the full ESCoE Technical Report here.

Josh Martin is Head of Productivity at the Office for National Statistics and an ESCoE Topic lead.

ESCoE blogs are published to further debate. Any views expressed are solely those of the author(s) and so cannot be taken to represent those of the ESCoE, its partner institutions or the Office for National Statistics.
