Navigating The Potential And Perils Of Synthetic Data In Healthcare

As the global healthcare industry continues to crumble from staffing shortages, artificial intelligence is being touted as the saving grace for both public and private sectors. The technology, with its ability to learn and handle tasks like detecting tumors from scans, has the potential to save healthcare workers from being overexerted as well as give them time to focus on delivering the highest quality of care.

But AI needs data to work well. If the models aren’t trained on complete, unbiased and high-quality data, the outputs will not be up to the mark. For most healthcare organizations looking to leverage AI in some capacity, this particular aspect has been incredibly draining. The sensitivity of patient data alone makes it extremely difficult for them to collect and use information while complying with privacy and confidentiality requirements at the same time.

This is where a shiny new alternative called ‘synthetic data’ can come into play.

According to the U.S. Census Bureau, synthetic data is artificial microdata generated using statistical models or computer algorithms to mimic the statistical properties of real-world data. It can augment or replace real data in healthcare research, public health and health information technology, thereby saving organizations from the entire hassle of gathering and using actual patient information.
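
To make the definition concrete, here is a minimal sketch of the idea: fit a simple statistical model to a real column, then sample artificial values from it. It assumes a single numeric field that is roughly normally distributed; the readings and the helper names `fit_gaussian` and `sample_synthetic` are made up for illustration, and real generators model many fields and their relationships at once.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n artificial values from the fitted normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical systolic blood pressure readings (invented, not real patients).
real_bp = [118, 125, 131, 122, 140, 135, 128, 119, 133, 127]

mean, stdev = fit_gaussian(real_bp)
synthetic_bp = sample_synthetic(mean, stdev, n=1000)

# The synthetic column should track the real column's summary statistics
# without containing any of the original readings as records.
print(f"real:      mean={mean:.1f}, stdev={stdev:.1f}")
print(f"synthetic: mean={statistics.mean(synthetic_bp):.1f}, "
      f"stdev={statistics.stdev(synthetic_bp):.1f}")
```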

When created and deployed accurately, synthetic data can not only improve healthcare AI models but also inform new treatments, promote evidence-based policymaking, boost patient adherence and improve outbreak responses. It can be generated in whatever form the use case calls for, ranging from electronic health records and medical claim datasets to patient condition reports.

Why Healthcare Needs Synthetic Data

One of the biggest reasons why many prefer synthetic data over real-world information is the privacy advantage.

Synthetic data is generated in such a way that the analytic value contained in the dataset is preserved but all the personally identifiable information is replaced with non-identifiable values. This allows easy utilization and sharing of data for internal use while ensuring that identities cannot be linked back to specific records or used for re-identification purposes.

The replacement of PII with fake data also helps the organization remain compliant with regulations such as GDPR and HIPAA throughout the process.

Beyond privacy, synthetic datasets can also help save the time and resources organizations usually spend on accessing and maintaining real-world data through traditional methods. They can represent the original data accurately without requiring companies to negotiate complex data-sharing agreements or work around privacy regulations and data access restrictions.

When the restrictions of real-world data are removed with synthetic data, organizations also get the flexibility to mobilize their datasets beyond primary applications. This could include educational purposes, where students can learn and practice on realistic cases with synthetic clinical data, as well as the public release of data for broader collaboration and knowledge sharing in the industry. The latter holds particular importance here as it allows researchers, data scientists and innovators to use the data for advancements in healthcare research and development.

Some commonly available synthetic datasets in healthcare right now are DE-SynPUF files published by CMS, SyntheticMass and the US Synthetic Household Population database.

But, there’s more.

Synthetic data can also help augment and generalize existing healthcare datasets. Organizations can combine their real and synthetic data resources, allowing researchers to expand the scope and diversity of the available data, leading to more robust analyses and insights. This process can easily overcome the challenges of data scarcity and heterogeneity, allowing for more comprehensive research and a better understanding of population health trends.

Notably, artificial datasets even help with the testing of data linkage methods and algorithms that are often used to gain a more comprehensive understanding of health outcomes, population characteristics, and disease patterns.

Caution Is A Must At All Stages

While synthetic data promises significant benefits over real-world data, it is not something to be taken casually at any stage.

For instance, if the statistical models and algorithms being used to generate the data are flawed or biased in any way, the output could be less reliable and accurate than expected, and that would affect the downstream applications. Similarly, if the information is only partially protected, it might be re-identified by a bad actor.

One such case arises when the synthetic data includes outliers or unique data points, like a rare condition that is present in only a few records. Such records could easily be linked back to the original dataset. Adversarial machine learning techniques can also be used to re-identify records in the synthetic data, especially if the attacker has access to both the synthetic data and the generative model.
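
One simple way to spot such risky records before release is to count how often each combination of attributes appears and flag the ones that are too rare. The sketch below is only an illustration of that idea, in the spirit of a k-anonymity check; the field names, the records and the helper `rare_combinations` are hypothetical, and the threshold `k` is an assumption a team would have to choose for itself.

```python
from collections import Counter

def rare_combinations(records, keys, k):
    """Return attribute combinations that appear in fewer than k records."""
    counts = Counter(tuple(r[key] for key in keys) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical synthetic records (invented for illustration).
records = (
    [{"age_band": "40-49", "condition": "hypertension"}] * 3
    + [{"age_band": "50-59", "condition": "type 2 diabetes"}] * 3
    + [{"age_band": "30-39", "condition": "Fabry disease"}]
)

# Any combination seen fewer than 3 times is flagged for review.
rare = rare_combinations(records, keys=("age_band", "condition"), k=3)
for combo, n in rare.items():
    print(f"{combo} appears {n} time(s): review before release")
```

A flagged combination is not proof of a leak, but it marks the records an attacker would find easiest to single out.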

These problems, however, can be mitigated by including techniques such as differential privacy and disclosure control in the generation process. The former adds calibrated random noise to the data, while the latter involves alteration and perturbation of the information.
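
As a rough illustration of the noise-adding idea, the sketch below applies the Laplace mechanism, one common differential privacy technique, to a simple count. It is a simplification: the noise is added to a query result instead of being built into a full synthetic data generator, and the patient list, the `has_diabetes` predicate and the epsilon value are all assumptions made up for this example.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of matching records.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical patient flags (invented for illustration).
patients = [{"diabetes": flag} for flag in
            (True, False, True, False, False, True, False, False)]

def has_diabetes(record):
    return record["diabetes"]

rng = random.Random(7)
noisy = private_count(patients, has_diabetes, epsilon=1.0, rng=rng)
print(f"noisy count of diabetic patients: {noisy:.2f}")
```

A smaller epsilon means more noise and stronger privacy, at the cost of a less accurate answer.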

The entire process of generating synthetic data can be very complex and opaque, which can limit transparency and reproducibility. Teams should always document and share the methods used to generate synthetic data so that fellow teammates and researchers can follow the same approach without introducing new risks.

In addition, they should evaluate the correctness and dependability of the synthetic data with thorough validation and verification methods. The better the data, the better the downstream applications.
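
A basic validation step could look something like the sketch below, which compares the mean and spread of a real column against its synthetic counterpart. The values, the tolerance and the helpers `compare_numeric` and `passes` are assumptions chosen for illustration; thorough validation would also look at distributions, correlations between fields and performance on downstream tasks.

```python
import statistics

def compare_numeric(real, synthetic):
    """Return absolute differences in mean and standard deviation."""
    return {
        "mean_diff": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev_diff": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    }

def passes(report, tolerance):
    """True when every reported difference is within the tolerance."""
    return all(diff <= tolerance for diff in report.values())

# Hypothetical patient ages (invented for illustration).
real_ages = [54, 61, 47, 70, 58, 66]
good_synthetic = [55, 60, 49, 69, 57, 65]
bad_synthetic = [20, 22, 21, 23, 19, 24]

for name, candidate in (("good", good_synthetic), ("bad", bad_synthetic)):
    report = compare_numeric(real_ages, candidate)
    verdict = "ok" if passes(report, tolerance=2.0) else "needs review"
    print(f"{name}: {report} -> {verdict}")
```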
