Measuring job quality using ojd_daps models: Technical blog

By Jack Vines

Online job adverts are one of the most valuable data sources for tracking labour market information. However, extracting meaningful insights from these adverts isn’t straightforward. Job postings are often inconsistently formatted, can be difficult to standardise for comparative analysis, and frequently hold valuable information only within free-text descriptions. To address these challenges, our team has developed a suite of Python libraries that streamline and standardise the extraction of key information from job adverts. This is part of ESCoE work on job quality, in collaboration with Nesta.

This blog will walk you through some of the libraries we’ve created, each designed to solve a specific problem in job advert analysis:

  • A library that extracts skills mentioned in job adverts and maps them to standardised hierarchies, making it easier to identify and compare skills across industries.
  • A library that converts various salary formats in job adverts into standardised annual figures, allowing for consistent comparison across roles and regions.
  • A library that automatically extracts relevant sentences describing what a company does, helping to summarise the nature of businesses without manual reading.

Each of these libraries is accompanied by its own repository, allowing users to interact with the inference processes, as well as comprehensive documentation to guide their use.

Additionally, we have developed a unified repository that houses the training code for each approach, providing a deep dive into how our models are built. Keeping each training method in this centralised repository makes it easier to retrain our models when we gain access to more data.

In this blog, we’ll walk you through the functionality of each library using some example job adverts. By the end, you’ll see how these tools can be used to standardise and streamline job advert analysis, making it easier to uncover actionable insights. If you’d like to follow along with working examples and code, check out the Google Colab notebook version of this post, which lets you try each model out and get a more in-depth feel for the inputs and outputs of each approach.

Section 1: Extracting skills from job adverts with ojd_daps_skills

Identifying the skills required for a job is often a critical aspect of analysing job adverts, but extracting these skills from unstructured text poses significant challenges. Companies frequently describe the same skill in different ways, and with varying levels of detail. For instance, one company might simply list “Python” as a desired skill, while another may require “proficiency in Python and Java for back-end development.” This variability makes it difficult to consistently capture and categorise skills across job postings.

To address this challenge, we developed the ojd_daps_skills library. This tool applies natural language processing (NLP) techniques to the full job advert description, identifying skill-related terms embedded within the text and then mapping them to a predefined, standardised hierarchy of skills. By working with the complete text of the job description, the model ensures that both explicit and implied skills are captured.

This standardised mapping allows for consistent aggregation and comparison of skills across diverse job postings, regardless of how each company formulates its requirements. Such consistency is crucial for large-scale labour market analyses, where accurate skill trends are vital for insights. The ojd_daps_skills library thus enables more effective tracking of skill demand across industries, geographies, and time periods.

Let’s walk through a simple example to demonstrate how ojd_daps_skills works.

  1. "The job involves communication skills and maths skills"
  2. "The job involves Excel skills. You will also need good presentation skills"
  3. "You will need experience in the IT sector"
  1. Communications skills, maths skills
  2. Excel, presentation skills
  3. No skills found
  1. "Communication", "communication, collaboration and creativity", "maths skills"
  2. "Excel", "use spreadsheet software", "presentation skills"
  3. No skills mapped

By standardising these skills, the ojd_daps_skills library makes it easier to analyse and compare similar roles across different job postings, regardless of how the skills are described.

The model has been trained exclusively on online job advert descriptions, so it expects raw advert text as input; no cleaning of the source text is required before passing it to the model.
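To make the mapping step more concrete, here is a minimal, self-contained sketch of the general idea behind matching an extracted skill phrase to its closest taxonomy entry using sentence embeddings. The embedding model, toy taxonomy, and matching logic below are illustrative assumptions for this post, not the library’s actual pipeline; the repository documents how ojd_daps_skills really performs this step.

from sentence_transformers import SentenceTransformer, util

# Illustrative sketch only: map extracted skill phrases to a toy taxonomy by
# embedding similarity. The model choice and taxonomy here are assumptions,
# not the ojd_daps_skills implementation.
taxonomy = [
    "communication, collaboration and creativity",
    "use spreadsheet software",
    "presentation skills",
    "maths skills",
]
extracted = ["communication skills", "Excel"]  # phrases as they might appear in an advert

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
taxonomy_emb = model.encode(taxonomy, convert_to_tensor=True)
extracted_emb = model.encode(extracted, convert_to_tensor=True)

# For each extracted phrase, pick the taxonomy entry with the highest cosine similarity
similarities = util.cos_sim(extracted_emb, taxonomy_emb)
for phrase, sims in zip(extracted, similarities):
    best = int(sims.argmax())
    print(f"{phrase!r} -> {taxonomy[best]!r} (similarity {sims[best].item():.2f})")

In a full pipeline, a minimum similarity threshold is one way to decide when an extracted phrase should be left unmapped rather than forced onto a poor taxonomy match.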

You can explore the ojd_daps_skills library further through our GitHub repository and detailed documentation here. The repo includes examples and instructions on how to integrate this library into your own job advert analysis workflows.

We have also developed a demo tool for this library – so you can try it out yourself without any coding knowledge required. See here for more details.

Section 2: Standardising job advert salaries with ojd_daps_salaries

Salary information is a critical component in job adverts, but it is often presented in highly inconsistent ways, complicating meaningful comparisons across postings. Some adverts list annual salaries, while others might offer daily or hourly rates. Additionally, job postings may present salary ranges, or in some cases, only single figures. This inconsistency in format poses a challenge for comparing salaries across different job adverts.

The ojd_daps_salaries library addresses this challenge by standardising the various salary formats into annualised figures. To handle this, the ojd_daps_salaries library operates based on the assumption that the raw salary value and its rate are provided in separate fields as part of the job advert data. The library then applies logic to convert these figures into an annualised salary. It accounts for variables such as the number of hours in a typical workweek, the number of weeks worked per year, and whether the salary is presented as a range. This ensures that salaries presented in different formats—whether hourly, daily, or weekly—are standardised for consistent comparison.

To better understand how ojd_daps_salaries works, let’s look at a fake job advert:

Software Engineer
InnovateTech
£12.50 per hour
We are looking for an experienced Software Engineer to join our team. The successful candidate must be experienced in Java and JavaScript.
Passing this advert through the ojd_daps_salaries library gives:

  • Minimum Annualised Salary: 23475.0
  • Maximum Annualised Salary: 24375.0

In this example, the salary is listed as an hourly rate, which makes it difficult to directly compare it with other jobs that might list an annual salary. Using the ojd_daps_salaries library, we can convert this hourly rate into an annual figure.
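The arithmetic behind that conversion is simple. Below is a brief sketch of the general shape of the logic, assuming a 37.5-hour week and 52 paid weeks per year; these working-time assumptions are illustrative, and the library’s own defaults (which produce the minimum and maximum figures above) may differ.

# Illustrative annualisation logic; the hours/weeks assumptions are examples,
# not necessarily the defaults used by ojd_daps_salaries.
HOURS_PER_WEEK = 37.5
WEEKS_PER_YEAR = 52

def annualise(amount: float, rate: str) -> float:
    """Convert a salary quoted per hour, day, week, or year into an annual figure."""
    multipliers = {
        "per hour": HOURS_PER_WEEK * WEEKS_PER_YEAR,
        "per day": 5 * WEEKS_PER_YEAR,  # assumes a five-day working week
        "per week": WEEKS_PER_YEAR,
        "per annum": 1,
    }
    return amount * multipliers[rate]

print(annualise(12.50, "per hour"))  # 24375.0, matching the maximum figure above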

Online job advert salaries come in many different formats, but in most of the data we have worked with there is a set of standardised fields containing either a single salary or a salary range (a minimum and maximum value), along with the cadence those figures relate to, which is what our approach relies on. A future improvement could be to pick up salaries that aren’t given in standard fields, such as those mentioned within the job description text, as the current approach cannot account for these instances.

By automating this conversion process, ojd_daps_salaries makes it easy to aggregate salary data across job ads and generate meaningful insights from it.

If you’re interested in exploring the ojd_daps_salaries library in more detail, check out the GitHub repository and accompanying documentation here. The repo provides examples, installation instructions, and guidance on how to apply this tool to your own job advert data.

Section 3: Extracting company descriptions with ojd_daps_company_descriptions

Job adverts typically also include information about the company itself—what the company does, its mission, and its industry focus. However, this information is often scattered throughout the advert or embedded within lengthy paragraphs, making it difficult to extract succinctly. To solve this, we developed the ojd_daps_company_descriptions library, which helps to automatically extract key sentences that describe the nature of the company’s work from job adverts.

The ojd_daps_company_descriptions library utilises natural language processing (NLP) techniques to scan job adverts and identify the sentences most relevant to describing the company. This is particularly useful for large-scale data processing, where you want to capture a summary of what companies do across a wide range of job postings, and can be used downstream for things such as industry classification.

The tool is trained to differentiate between sentences that describe the company’s business focus and those that detail the job role itself. This helps create a cleaner dataset of company descriptions, which can be used for tasks such as industry analysis, company profiling, or even building datasets for machine learning models. The model works on a sentence-by-sentence basis, so each sentence of an online job advert description can be passed to the model individually for extraction.

To illustrate how ojd_daps_company_descriptions works, let’s look at a sample job advert:

Marketing Manager
GreenTech Innovations
We are seeking a highly motivated Marketing Manager to join GreenTech Innovations, a leader in developing sustainable energy solutions. At GreenTech, we specialise in creating innovative solar power technology for residential and commercial markets. Our mission is to revolutionise the energy industry by making green energy accessible to everyone. In this role, you will develop and execute marketing strategies to promote our groundbreaking products.

In this job advert, the relevant sentences describing the company are:

  • “GreenTech Innovations, a leader in developing sustainable energy solutions.”
  • “At GreenTech, we specialise in creating innovative solar power technology for residential and commercial markets.”
  • “Our mission is to revolutionise the energy industry by making green energy accessible to everyone.”

Our library takes in sentences and returns each one with a probability between 0 and 1 that it relates to a company description, allowing users to choose their own threshold for which sentences they take forward into downstream analysis.
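The downstream pattern is then straightforward: score every sentence and keep those above whatever threshold suits your analysis. The snippet below sketches that filtering step; the probability values are made up for illustration and are not real model output.

# Illustrative filtering step: keep sentences whose company-description
# probability exceeds a chosen threshold. The scores are invented examples.
scored_sentences = [
    ("At GreenTech, we specialise in creating innovative solar power technology "
     "for residential and commercial markets.", 0.93),
    ("Our mission is to revolutionise the energy industry by making green energy "
     "accessible to everyone.", 0.88),
    ("In this role, you will develop and execute marketing strategies to promote "
     "our groundbreaking products.", 0.06),
]

THRESHOLD = 0.8  # users choose the cut-off that suits their downstream analysis
company_description = [sentence for sentence, prob in scored_sentences if prob >= THRESHOLD]
print(company_description)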

This extraction process helps reduce the noise in the job advert and focuses solely on the sentences that provide insight into the company’s work, enabling more targeted company analysis, such as Industry Mapping.

To explore how the ojd_daps_company_descriptions library works and to start extracting company descriptions from your own job adverts, check out the GitHub repository and the detailed documentation here. The repo contains examples and instructions for using the tool on your datasets.

Section 4: ojd_daps_language_models, The Centralised Training Code Repository

In addition to the individual libraries for extracting specific types of information from job adverts, we also maintain a unified repository that houses the training code for all of our models. This repository centralises the training pipelines behind each approach, ensuring the models driving each library remain accurate, up-to-date, and adaptable to various types of job advert data. It provides a foundation for training new models and supports fine-tuning and improving the existing ones as new data becomes available. By centralising the code in a single repository, our team can ensure consistency across all extraction processes, maintain efficient workflows, and make it easier for others to get involved and try out our approaches.

The repository contains everything needed to retrain or refine the models that underpin our extraction libraries. It includes:

  • Scripts to clean and format job advert data in preparation for model training.
  • Code for training the models that extract skills, standardise salaries, and detect company descriptions, allowing you to adapt the models to new data sources or specific needs.
  • Metrics and evaluation scripts to assess the performance of the models and ensure they meet the desired accuracy levels (a small illustration of this kind of evaluation is sketched below).
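As a flavour of what the evaluation scripts do, here is a brief sketch of computing precision, recall, and F1 for a sentence-level classifier such as the company-description model, using scikit-learn. The labels are made-up placeholders, not results from our models.

# Illustrative evaluation sketch using scikit-learn; the labels below are
# placeholder values, not outputs from the ojd_daps models.
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # gold labels: 1 = company-description sentence
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]  # model predictions on the same sentences

print(classification_report(y_true, y_pred, target_names=["other", "company description"]))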

Interested in exploring how these models are trained? The ojd_daps_language_models code repository is available on GitHub, and the documentation provides the code we used to train our models and outlines of the approaches we used.

Why does this matter?

These libraries can help facilitate large-scale job advert analysis, and also provide users with tools they can adapt to their specific needs. Whether you’re conducting labour market research, benchmarking salaries, or building datasets for further analysis, our tools offer a reliable, standardised solution for extracting key insights from the wealth of job advert data available, or for use as a starting point for researchers aiming to develop their own models.

Our work with online job adverts over the past few years has produced a wide range of open-source outputs, but a recurring challenge has been maintaining those approaches, iterating on them, and supporting other users who want to build on them, particularly around the code we have open sourced.

Our aim with this project has been to make this process significantly easier and to make more of our work accessible to others in a maintainable way. Our efforts have been guided by the experience of the initial release of our Skills Extraction library, which currently has over 100 stars on GitHub and users around the world.

ESCoE blogs are published to further debate.  Any views expressed are solely those of the author(s) and so cannot be taken to represent those of the ESCoE, its partner institutions or the Office for National Statistics.
