Generative AI

Synthetic Data

With the recent advent of artificial intelligence (AI) and machine learning (ML) methods, there is a tremendous need for data to train those models, that is, to develop them and make them accurate. However, getting access to reliable, real data can turn out to be much harder than expected, and that can stifle innovation. The problem is particularly urgent in healthcare, where innovations translate into saving lives, improving quality of life, and preventing the spread of the next pandemic.
Developments and innovations in healthcare rely on access to sensitive patient data protected by strict privacy laws [1]. These privacy restrictions make the sharing of vital data, which can influence technology development and policy decisions, ineffective and slow. Simple protocols for releasing data, such as masking some of the information, do not comply with privacy requirements because they leave enough information for a patient to be tracked and linked to the data set. Other methods remove so much information from the patient records that not enough remains for proper analysis, rendering the data sets useless. Furthermore, the implementation of such data release protocols has been slow (typically because of the approvals required), which in turn slows healthcare innovation. The ongoing Covid-19 pandemic has made this need evident: information on patients' prior health, pre-existing conditions, medication, age, etc. in the hands of researchers could have sped up the development of therapies and the implementation of effective policies.
There is a need to give researchers and policymakers access to “accurate” data sets that comply with privacy laws. Synthetic data might be the answer. They preserve the statistical characteristics of the real data while maintaining privacy protection. Synthetic records cannot be traced back to a patient in the real data set, and because the data are synthetic they can be shared with researchers for rapid healthcare innovation [2,3].
Synthetic data technology was first introduced in the ’90s, but it gained widespread use from 2010 onwards with the extensive adoption of AI techniques. Industries that can benefit from this technology include automotive and robotics, insurance and financial services, manufacturing, and security, which have already used synthetic data for data analytics applications (e.g., test data for new products, model validation and AI model training).
During the Covid-19 pandemic, the healthcare sector has also utilized synthetic data to help plan the treatment of Covid-19 patients and to speed up innovations without compromising privacy [4].
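To make the earlier point about masking concrete, here is a minimal sketch in Python of why hiding names alone may not protect patients. The records, field names and the helper `unique_fraction` are all hypothetical and invented for illustration; the idea is simply to count how many masked records are still unique on a few remaining attributes, since such records could in principle be linked to an outside data source.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose combination of the given attributes
    is unique in the data set (and therefore easier to re-identify)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Hypothetical masked records: names are hidden, other fields are not.
masked = [
    {"name": "***", "zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"name": "***", "zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "diabetes"},
    {"name": "***", "zip": "10002", "birth_year": 1975, "sex": "M", "diagnosis": "hypertension"},
    {"name": "***", "zip": "10003", "birth_year": 1990, "sex": "F", "diagnosis": "asthma"},
]

risk = unique_fraction(masked, ["zip", "birth_year", "sex"])
print(f"{risk:.0%} of records are unique on zip + birth year + sex")
```

In this toy data set two of the four records remain unique even though every name is masked, which is the kind of leftover information the text warns about.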
Synthetic data are non-reversible, artificially manufactured data that replicate the statistical characteristics and correlations of real raw data. Analysis of synthetic data yields conclusions similar to those obtained from the real data. In the healthcare sector, synthetic data are created using different algorithms that mirror the statistical properties of the original patient data without revealing information about the real patients. The data are synthesized according to specific requirements and data usage purposes.
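As a rough illustration of "replicating statistical characteristics and correlations", the sketch below fits a simple two-variable Gaussian model to a toy data set and then samples brand-new records from that model. This is only one very basic approach, chosen here for brevity; it is not the method used in the cited works, and a Gaussian fit by itself gives no formal privacy guarantee. The "real" data (age and systolic blood pressure) are simulated stand-ins, and the function names `fit_gaussian` and `sample_synthetic` are hypothetical.

```python
import math
import random

def fit_gaussian(rows):
    """Estimate the means and the 2x2 covariance of two numeric columns."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def sample_synthetic(params, n, seed=0):
    """Draw n new (x, y) records from the fitted Gaussian model."""
    (mx, my), (sxx, sxy, syy) = params
    rng = random.Random(seed)
    # Cholesky factor of the covariance matrix [[sxx, sxy], [sxy, syy]]
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(max(syy - l21 ** 2, 0.0))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((mx + l11 * z1, my + l21 * z1 + l22 * z2))
    return out

# Toy stand-in for real patient data: (age, systolic blood pressure).
rng = random.Random(1)
real = []
for _ in range(2000):
    age = rng.gauss(50, 12)
    real.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

params = fit_gaussian(real)
synthetic = sample_synthetic(params, 5000, seed=2)
print("real means:     ", params[0])
print("synthetic means:", fit_gaussian(synthetic)[0])
```

None of the synthetic rows is a copy of a real row, yet the means and the age/blood-pressure correlation of the two data sets should be close. Real synthetic-data tools use much richer models than this.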
References
[1] Jiri et al., “Multipurpose synthetic population for policy applications,” JRC, 2022.
[2] K. El Emam, L. Mosquera, and R. Hoptroff, Practical Synthetic Data Generation: Balancing Privacy and the Broad Availability of Data. O’Reilly Media, 2020.
[3] A. Goncalves et al., “Generation and evaluation of synthetic patient data,” BMC Med. Res. Methodol., vol. 20, no. 1, p. 108, Dec. 2020.
[4] R. J. Chen et al., “Synthetic data in machine learning for medicine and healthcare,” Nat. Biomed. Eng., vol. 5, no. 6, pp. 493–497, 2021.
We've all heard about generative AI, but what exactly is it? Let's start with an example.
The image above was downloaded from a popular website aptly called This Person Does Not Exist. You select some variables such as age group, ethnicity, and male or female, hit refresh, and voilà: in seconds you get a new image. The person you see above does not exist. Every time you hit refresh, a new fake (or synthetic) person emerges. How is that person created? The image was created using artificial intelligence (AI), specifically, in this case, a GAN (Generative Adversarial Network).
To break this down, Generative means what you would expect: you use AI to generate data. In this case you generate the image of a person, but you can also generate computer code, sound, music, etc. Adversarial means that the model that generates the image was trained using two different networks that are pitted against each other and get better after each iteration.
For example, let's say we want to create a GAN that can generate fake (or synthetic) Middle Eastern females between the ages of 35 and 40. First you need to supply a large data set (let's say 1000 images) of real persons with those characteristics. Then the iterative process starts. The generator network produces a candidate image, typically starting from random noise, and sends it to the discriminator network. The discriminator is also shown the real images, learns which attributes characterize those faces, and must determine whether each image it sees is real or generated. If the discriminator is fooled, its parameters are adjusted. If it figures out that the image is generated, the generator is adjusted. This iterative process continues until the generator can produce images that are indistinguishable from real data. Hence the image shown above: this person does not exist, yet she looks very realistic.
Two things should now be apparent. First, the more real data are available, the more accurate the GAN models are. That is why these models are data hungry. Second, the models are also computationally hungry. In fact, neural networks were already around in the 1980s, but they have only become prevalent in the last 10 years, thanks to tremendous improvements in computational power and the advent of very powerful GPUs (from NVIDIA, for example).
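The generator/discriminator loop described above can be sketched in a few lines of plain Python. This is a deliberately tiny, one-dimensional toy and not how the face-generating website works: the "real data" are just numbers drawn around 4.0, the generator is a straight line `a * z + b` applied to random noise, and the discriminator is a single logistic unit. All names (`train_toy_gan`, `a`, `b`, `w`, `c`) and settings are assumptions made for illustration, and, as with real GANs, training may oscillate rather than settle neatly.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(steps=2000, batch=32, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator: fake = a * noise + b
    w, c = 0.0, 0.0   # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]   # toy "real" data
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw += (d - 1.0) * x      # gradient of -log D(x)
            gc += (d - 1.0)
        for x in fake:
            d = sigmoid(w * x + c)
            gw += d * x              # gradient of -log(1 - D(x))
            gc += d
        w -= lr * gw / batch
        c -= lr * gc / batch

        # Generator step: adjust a, b so the discriminator rates fakes as real.
        ga = gb = 0.0
        for z in noise:
            d = sigmoid(w * (a * z + b) + c)
            gx = (d - 1.0) * w       # gradient of -log D(fake) w.r.t. fake
            ga += gx * z
            gb += gx
        a -= lr * ga / batch
        b -= lr * gb / batch
    return (a, b), (w, c)

(gen_a, gen_b), (disc_w, disc_c) = train_toy_gan()
print("generator parameters:    ", gen_a, gen_b)
print("discriminator parameters:", disc_w, disc_c)
```

Even in this toy, every training step loops over batches of real and generated samples for both networks, which hints at why image-scale GANs need so much data and so much computing power.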
