You do not plan to use real data, or you would like to explore alternatives to using real data, such as synthetic data.
There are datasets available, called synthetic datasets, that provide artificially generated information representing real-world occurrences.
These datasets are often designed to mirror real scenarios, so they offer a valuable opportunity to develop, initially train and validate models. This can speed up proof-of-concept work and help with early model selection.
Does the use of synthetic datasets call for approval?
If the synthetic dataset has been ‘made up’ or ‘seeded’ from real data, it is highly likely to be safe. However, this is not guaranteed, and its use may still require approval.
Data are considered ‘safe’ when they are rendered anonymous (otherwise known as ‘effectively anonymised’), i.e., when the risk of re-identifying individuals is low enough in the eyes of the law. There have been cases of synthetic datasets containing patterns that made it possible to infer information about the individual records used to generate them. If you are using an existing synthetic dataset, please check with your data provider that the dataset has been assessed and is considered ‘effectively anonymised’. You can also find more information about data anonymisation and the risks of re-identification on the Information Commissioner’s Office (ICO) website.
If the dataset is considered ‘effectively anonymised’, your use of the data does not require approval.
A note on using existing datasets
Data providers each have their own protocols for granting access to their data. If you plan on using an existing dataset, please factor in appropriate time and resources to go through these data access protocols.
What if you want to create your own synthetic dataset?
Again, if it is entirely made up (and not generated from real data), you are safe to proceed without further approval.
However, if you are generating it from real data:
- You will need to assess the risks and approval needs related to that use of real data.
- You should assess how likely it is that individuals could be re-identified through further analysis of the synthesised data. If necessary, put additional safeguards in place to ensure that this likelihood is sufficiently remote.
- You must ensure that the data owner supplying the initial data is fully satisfied that the synthesising process is conducted in full compliance with approval regulations and that the resulting data are ‘effectively anonymised’.
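As an illustration of the kind of check that can feed into a re-identification assessment, the sketch below counts how many synthetic records reproduce a combination of indirect identifiers that also appears in the real data, and flags those that match a combination shared by only a few real individuals. It is a minimal screening example, not a legal test of whether data are ‘effectively anonymised’. The function names, the column names and the threshold `k` are all hypothetical, and the records are invented for the example.

```python
from collections import Counter


def _key(row, quasi_identifiers):
    """Build a comparable tuple from the chosen indirect-identifier columns."""
    return tuple(row[column] for column in quasi_identifiers)


def exact_match_rate(real_rows, synthetic_rows, quasi_identifiers):
    """Fraction of synthetic rows whose identifier combination occurs in the real data."""
    if not synthetic_rows:
        return 0.0
    real_keys = {_key(row, quasi_identifiers) for row in real_rows}
    hits = sum(1 for row in synthetic_rows if _key(row, quasi_identifiers) in real_keys)
    return hits / len(synthetic_rows)


def rare_matches(real_rows, synthetic_rows, quasi_identifiers, k=5):
    """Synthetic rows matching a combination held by fewer than k real individuals."""
    counts = Counter(_key(row, quasi_identifiers) for row in real_rows)
    return [
        row
        for row in synthetic_rows
        if 0 < counts.get(_key(row, quasi_identifiers), 0) < k
    ]


# Invented example records (assumption: these columns act as indirect identifiers).
QUASI_IDENTIFIERS = ["age_band", "postcode_area", "sex"]

real = [
    {"age_band": "30-39", "postcode_area": "LS", "sex": "F"},
    {"age_band": "30-39", "postcode_area": "LS", "sex": "F"},
    {"age_band": "80-89", "postcode_area": "HG", "sex": "M"},
]
synthetic = [
    {"age_band": "30-39", "postcode_area": "LS", "sex": "F"},
    {"age_band": "80-89", "postcode_area": "HG", "sex": "M"},
    {"age_band": "50-59", "postcode_area": "YO", "sex": "F"},
    {"age_band": "40-49", "postcode_area": "LS", "sex": "M"},
]

rate = exact_match_rate(real, synthetic, QUASI_IDENTIFIERS)
flagged = rare_matches(real, synthetic, QUASI_IDENTIFIERS, k=2)
print(f"Share of synthetic rows matching a real combination: {rate:.2f}")
print(f"Synthetic rows matching a rare real combination: {len(flagged)}")
```

A high match rate, or any match on a rare combination, does not prove that someone can be re-identified, and a low rate does not prove that the data are safe. Results like these are best treated as prompts for a fuller assessment with your data provider.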
To assess the risks and approval needs related to your use of real data to generate a synthesised dataset, click here to continue.
Further information on synthetic data
- Synthetic Data for Machine Learning in Medicine and Healthcare
- Synthesizing single-case studies: A Monte Carlo examination of a three-level meta-analytic model
- Guide-to-synthesising-case-studies
- Data Anonymisation & Risk Assessment – Process Map and Automation Efforts
Example of existing synthetic datasets
Understanding the data requirements for your project is the first step in your research journey. This tool should have helped you think through the essential considerations on the use of data for health and social care research. Please think carefully about the most appropriate data for your study and whether your data access needs would meet the statutory and legal governance requirements in the UK. It is imperative that data used to develop AI and data-driven interventions are accessed in line with the highest privacy and ethical standards. Spending time identifying the data you need, understanding what data are available and consulting the relevant people and organisations can help your project get started as soon as possible, with minimal delays.