DBT: Find Exporters
Find Exporters data uses the Export Propensity Scores to aid users in establishing the export potential of companies on the Companies House register. The algorithm predicts the probability that a company exports goods. This can then be used to identify companies to work with.
Tier 1 Information
Name
Find Exporters
Description
Find Exporters data uses the Export Propensity Scores, produced by the Export Propensity Algorithm, to aid users in establishing the export potential of companies on the Companies House register. The algorithm predicts the probability that a company exports goods. This can then be used by staff within DBT to identify companies to work with.
Website URL
N/A
Contact email
Tier 2 - Owner and Responsibility
1.1 - Organisation or department
Department for Business and Trade (DBT)
1.2 - Team
Digital, Data and Technology Data Science Team
1.3 - Senior responsible owner
Chief Data Officer
1.4 - External supplier involvement
No
2.1 - Detailed description
This algorithm uses supervised classification, where the binary target variable (marking a “company that exports goods”) is associated with a probability which indicates the confidence level in the predicted variable.
Framing this as a binary problem simplifies the implementation and interpretation of the algorithm, as we want to compute an associated propensity score between 0 and 1. This was agreed with the Export and Investment portfolio when the model was first designed and deployed.
The target variable is whether a company has exported in the roughly 6 months (180 days) following a prediction cut-off (defined below).
All of these variables were agreed by the Export and Investment portfolio when the algorithm was first designed and deployed. Other choices could be sensible; however, this is linked to the current expected semantics of the model and should not be changed without proper communication with our users. We will continue to review this in the future.
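As an illustration of the target definition above, the following is a minimal sketch of how a binary label could be derived from a cut-off date and a company's export dates. The function name, the input shape and the exact window boundaries (exports strictly after the cut-off, up to and including day 180) are assumptions made for this example and are not taken from the production pipeline.

```python
from datetime import date, timedelta
from typing import Iterable

HORIZON_DAYS = 180  # roughly 6 months, as stated in the target definition


def export_target(cutoff: date, export_dates: Iterable[date],
                  horizon_days: int = HORIZON_DAYS) -> int:
    """Return 1 if any export falls in (cutoff, cutoff + horizon_days], else 0.

    The open/closed window boundaries are an assumption for illustration.
    """
    window_end = cutoff + timedelta(days=horizon_days)
    return int(any(cutoff < d <= window_end for d in export_dates))
```

With a cut-off of 1 January 2024, a company whose only export is dated before the cut-off would be labelled 0, while one exporting at any point in the following 180 days would be labelled 1.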
2.2 - Scope
The scope of the tool is to review all UK companies so that the DBT team can make targeted decisions on which organisations to engage first. The tool does not link the propensity to export with the government support given to companies (if any) and cannot be used alone to measure the success of government support. The tool does not account for company export strategies (i.e. a low propensity score may reflect a business strategy rather than obstacles to exporting, and should not be used alone for that purpose).
2.3 - Benefit
The benefit of this tool is to help inform operational decisions and identify leads for DBT staff. This tool can help staff to decide which organisation to engage next but does not restrict the user from choosing themselves.
2.4 - Previous process
Prior to this tool, the propensity of companies to export was estimated using human expertise, by manually inspecting and sifting through data. The tool provides an accurate, well-defined, data-driven probability to replace the manual process and improve its outcome.
2.5 - Alternatives considered
The initial version of the model was implemented using XGBoost. However, we now use LightGBM.
LightGBM is a very efficient and flexible implementation of gradient boosting (faster than, for example, XGBoost). Gradient boosting models typically lead to accurate models because they use ensemble techniques. In particular, they use a combination of decision trees, which can capture non-linear relationships between variables and accommodate both categorical and numerical variables well. LightGBM makes no assumption about the distribution or processes determining the input data.
Tier 2 - Decision making Process
3.1 - Process integration
The tool provides users with data-driven insights on a company’s propensity to export, which provides the user with future prospects to engage. This information complements and integrates with many further insights from other sources available to the users to make the best-informed decisions on which companies to engage next.
3.2 - Provided information
The output of the tool (export predictions) is accessible both as raw data and through a dashboard. The dashboard has quick filtering functionality that allows the user to retrieve the desired output for any specific use case. The tool calculates an Export Propensity score, which tries to estimate the export potential of a company, as a real number between 0 and 1. This score is then used to assign an Export Propensity Label:
- Very high: the top 7% of companies with the highest propensity scores
- High: the next 8% of companies
- Medium: the next 15% of companies
- Low: the next 20% of companies
- Very low: the 50% of companies with the lowest propensity scores
For example, companies in the top 7% bracket in terms of their Export Propensity Score will be assigned the label “Very High”. However, their score might be low, for instance 0.1, corresponding to a 10% estimated probability of exporting within the next 6 months. This is because, in general, not many companies are exporters, and there is a lot of inherent variability among companies even when they appear similar on paper, which makes it difficult to obtain scores close to 1.
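The banding described above can be sketched as a rank-based assignment. This is an illustrative example only: the function name, the input format (a mapping from a hypothetical company identifier to its score) and the handling of ties and band boundaries are assumptions, not the tool's actual implementation.

```python
# Cumulative upper bounds (in percent of companies, ranked by descending score).
BANDS = [
    (7, "Very high"),   # top 7%
    (15, "High"),       # next 8%
    (30, "Medium"),     # next 15%
    (50, "Low"),        # next 20%
    (100, "Very low"),  # remaining 50%
]


def assign_labels(scores: dict) -> dict:
    """Map each company to a label based on its rank, not its raw score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    labels = {}
    for position, company in enumerate(ranked, start=1):
        # Integer comparison avoids floating-point issues at band edges.
        labels[company] = next(
            label for upper, label in BANDS if position * 100 <= upper * n
        )
    return labels
```

Because labels depend only on rank, a population in which every score is at most 0.1 still yields a full set of labels, with the highest-scoring companies labelled "Very high".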
The score is updated daily.
3.3 - Frequency and scale of usage
There are on average 0.143 daily users of the Find Exporters tool. Every time a user engages with the tool, it is understood that they use this information to make an informed decision as to which company they will reach out to next.
3.4 - Human decisions and review
The tool provides information that may help humans to make decisions. The tool does not provide decisions or hints for decisions, only additional data-driven information. Humans may or may not consider it in their decision process. Users do not question, check or review the accuracy of the output of the tool (the predictions), as that accuracy is in fact measured with data.
3.5 - Required training
Users of the tool do not require any specific training; they are required to read the user documentation that is provided with the tool (webpage). For the development, maintenance and operation of the tool, Data Scientists would be able to operate any part of the tool after a few days of hand-over sessions and documentation.
Tier 2 - Tool Specification
4.1.1 - System architecture
4.1.2 - Phase
Production
4.1.3 - Maintenance
Company data is refreshed on a continual basis via a daily upload feed. The algorithm is retrained every six months, or more often if it becomes apparent that the model needs to be retrained.
4.1.4 - Models
LightGBM classification model
Tier 2 - Model Specification
4.2.1 - Model name
LightGBM
4.2.2 - Model version
4.1.0
4.2.3 - Model task
Classification Model
4.2.4 - Model input
Company related features, accounting information, export information (target variable)
4.2.5 - Model output
Export propensity probability
4.2.6 - Model architecture
Gradient Boosting classification model.
Hyper-parameters of LightGBM are chosen using a randomised search and cross-validation (5 folds) based on the negative log loss metric. Chosen parameters are hard-coded in the training pipeline and also logged upon training.
4.2.7 - Model performance
The model gives a range of different outputs for different geographies and categories. There are extensive notebooks containing a range of breakdowns.
At a high level the model is evaluated on a test split, using a Brier score. The Brier score can be slightly biased against the rare positive class. Other evaluation tools are used for quality assurance:
- The calibration curve.
- The Brier skill, defined as 1 minus the ratio between the Brier score of the trained predictor and that of the no-skill predictor that outputs for all samples the overall positive class probability.
The justification for this method is that the Brier score is a strictly proper score function that measures the accuracy of probabilistic predictions. As such, it is able to give us an indication of how good the probability score output by the model is, not just the ranking of predictions or other metrics that only consider the output label for some confidence threshold. The calibration curve gives us greater insight into how good the scoring is for different strata. The Brier skill allows us to contextualise the Brier score in terms of a baseline.
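To make the two scores concrete, here is a minimal sketch of the Brier score and the Brier skill as defined above. The function names are illustrative, and the no-skill baseline is computed here from the positive rate of the same sample being scored, which is an assumption; the evaluation notebooks may derive the base rate differently.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def brier_skill(y_true, y_prob):
    """1 - (Brier score of the model / Brier score of the no-skill predictor).

    The no-skill predictor outputs the overall positive class rate for every
    sample. A skill of 0 means no better than that baseline; 1 is perfect.
    """
    base_rate = sum(y_true) / len(y_true)
    reference = brier_score(y_true, [base_rate] * len(y_true))
    if reference == 0:
        raise ValueError("Brier skill is undefined when all outcomes are identical")
    return 1.0 - brier_score(y_true, y_prob) / reference
```

For outcomes `[1, 0, 0, 0]` and predictions `[0.75, 0.25, 0.25, 0.25]`, the Brier score is 0.0625 against a no-skill score of 0.1875, giving a skill of about 0.67.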
4.2.8 - Datasets
- Companies House data.
- DBT data: Data Hub, Export Wins.
- Export Data from HMRC.
Each dataset has an entry on our departmental data platform which includes a data dictionary and code snippets. The data science team do not maintain or own these datasets.
4.2.9 - Dataset purposes
The Companies House dataset is used as a canonical list of companies for training and prediction, and to extract company metadata. It is also used for extracting all accounting information. It is not loaded via dataflow, but using the Data Store service, which existed before dataflow was created.
An extract of HMRC export data is used to obtain the date of last export for a company. The last date for which we have export information for any company is assumed to be the earlier of the current date and the last day of the last month-year available in this table.
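The assumption about the last date with export information can be sketched as follows. The function name and the representation of the latest available period as a year and month are hypothetical choices for this example.

```python
import calendar
from datetime import date


def last_export_info_date(latest_year: int, latest_month: int, today: date) -> date:
    """Earlier of `today` and the last day of the latest month-year in the table."""
    last_day = calendar.monthrange(latest_year, latest_month)[1]
    return min(today, date(latest_year, latest_month, last_day))
```

If the latest month-year in the table is February 2024 and the current date is 10 March 2024, this returns 29 February 2024; if the current date is 10 February 2024, it returns the current date.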
Tier 2 - Data Specification
4.3.1 - Source data name
- Companies House data snapshot of information for live companies on the public register.
- HMRC exports data, which comprises data on the export of goods from the UK, combining the previously separate non-EU exports and EU dispatches.
4.3.2 - Data modality
Tabular
4.3.3 - Data description
Administrative data which gives information in relation to companies.
4.3.4 - Data quantities
The final model output is for all companies in the UK (i.e. a few million rows). The dataset for model development has about 20 columns.
The test split corresponds to the data for 20% of the companies fetched at training time. The training split is the remaining 80% of companies. No separate validation set is used, in favour of cross-validation.
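A company-level 80/20 split followed by cross-validation folds on the training portion could look like the sketch below. How the production pipeline actually draws the split is not described here, so the seeded random shuffle, the function names and the round-robin fold assignment are all assumptions for illustration.

```python
import random


def split_companies(company_ids, test_fraction=0.2, seed=0):
    """Return (train_ids, test_ids), split at company level."""
    ids = sorted(set(company_ids))      # de-duplicate and fix the order
    random.Random(seed).shuffle(ids)    # reproducible shuffle (assumed)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def make_folds(train_ids, k=5):
    """Partition the training companies into k disjoint folds for cross-validation."""
    return [train_ids[i::k] for i in range(k)]
```

Each fold then serves once as the held-out part while the other folds are used for fitting, which is why no separate validation set is needed.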
4.3.5 - Sensitive attributes
The data all relates to companies, so should not contain personal data other than where this has been used, for example, as a company name. International trade advisers (ITAs), the users of the tool, have their own separate data set with contacts and means to reach companies.
4.3.6 - Data completeness and representativeness
This is administrative data which has been enriched with additional DBT data, such as Company ID matching, to improve the quality of the data set. There are some issues with the underlying accuracy of some of the Companies House data (e.g. some dormant companies, incorrect industrial classifications).
4.3.7 - Source data URL
4.3.8 - Data collection
We receive the full raw administrative data as an API feed; it is not changed from how it was entered into HMRC and Companies House systems.
4.3.9 - Data cleaning
Not applicable
4.3.10 - Data sharing agreements
The HMRC and Companies House data sets are open data sets.
4.3.11 - Data access and storage
The outputs of Find Exporters are only available to DBT staff who have a login and credentials for this system, which is managed and monitored. Data is loaded via the DBT Data Workspace platform, which handles security controls.
Tier 2 - Risks, Mitigations and Impact Assessments
5.1 - Impact assessment
There has been no need for a Data Protection Impact Assessment (DPIA) to be conducted, as the tool only contains non-personal company data. There have been no external impact assessments of the model. Internal model evaluation in terms of fairness has been undertaken, using standard data science metrics used for model building.
5.2 - Risks and mitigations
The key risk here is that use of the tool could reinforce existing DBT operational bias towards certain types of companies if data are not objectively interpreted. However, operational staff are not required to use Find Exporters; it is simply a tool they can use in order to identify potential leads. The fact that operational staff are responsible for identifying companies and that Find Exporters is not mandatory will mitigate any risk of biasing the direction of operational work. The tool team will continue to explore risk mitigations, particularly as some teams are keen to automate aspects of lead generation and casework.