Will Synthetic Data Introduce Ethical Challenges for ML Engineers?

Photo by Julia Koblitz on Unsplash

What Is Synthetic Data?

Machine learning frameworks are becoming widespread and easier to use, and there are ready-made machine learning models for most common tasks. As the model aspect of machine learning becomes commoditized, the focus of many machine learning initiatives shifts to the data.

Some in the industry estimate that over 70% of a data scientist’s time is spent collecting and handling data. Some algorithms require large amounts of data, and if a standard dataset is not available, researchers might have to collect and hand-label data. This is a slow, expensive, and error-prone process, which complicates machine learning projects and delays time to market.

A synthetic dataset is generated by a computer using an algorithmic method. Data points are similar to, but do not actually represent, real-world events. At least on paper, synthetic datasets can provide unlimited, high-quality, low-cost data for training machine learning models. In reality, things are a bit more complex.

For synthetic data to be effective as an input to machine learning models, it needs to have three somewhat contradictory properties:

  • Synthetic data must have a statistical distribution similar to the distribution of the real dataset
  • Synthetic data points should ideally be indistinguishable from real data points
  • Synthetic data points should be sufficiently different from each other
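The first and third properties can be checked numerically. Below is a minimal sketch using NumPy; the "real" heights data and all distribution parameters are illustrative assumptions, not values from this article:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# "Real" data: hypothetical heights in cm, approximately normal.
real = rng.normal(loc=170.0, scale=8.0, size=5_000)

# Synthetic data drawn from a distribution fitted to the real sample.
synthetic = rng.normal(loc=real.mean(), scale=real.std(), size=5_000)

# Property 1: similar statistical distribution (compare moments).
mean_gap = abs(real.mean() - synthetic.mean())
std_gap = abs(real.std() - synthetic.std())

# Property 3: sufficient diversity among synthetic points
# (a degenerate generator would emit near-identical samples).
spread = synthetic.std()

print(f"mean gap: {mean_gap:.2f} cm, std gap: {std_gap:.2f} cm, spread: {spread:.2f} cm")
```

In practice you would use stronger distribution tests (for example a Kolmogorov–Smirnov test) rather than comparing moments, but the idea is the same: measure how close the synthetic distribution is to the real one, and how varied the synthetic points are among themselves.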

The random processes or algorithms used to generate the data do not always give researchers fine-grained control. Many synthetic data sampling techniques are based on randomization, and some start from random noise and gradually form a meaningful artifact. This makes it difficult to tune the algorithm to provide exactly the data needed for the model.

What Is ML Engineering?

Machine learning engineers (ML engineers) are IT professionals who focus on researching, designing, and building artificial intelligence (AI) systems to run predictive models.

Machine learning engineers serve as the bridge between data scientists, who focus on statistics and model building, and operational AI systems that can effectively train models and deploy them to production. Their main role is to evaluate and arrange large amounts of data, optimize it for ML algorithms and models, and build the systems that will be used to create and run those models.

ML engineers typically work as part of data science teams, in collaboration with data scientists, data engineers, data analysts, data architects, and business managers. Depending on the size of the organization, they may also have interaction with IT operations, software developers, and sales teams.

Synthetic Data Generation Methods

Generating Data Based on a Known Distribution

You can create complex synthetic datasets or simple tabular data without starting from actual data. The process begins with a solid understanding of the real data set’s distribution and the desired data’s characteristics. The better you understand the structure of your data, the closer to the real world the synthetic data can be.
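As a minimal sketch, here is a small synthetic customer table generated purely from assumed distributions; all column names, parameters, and the visits-to-spend relationship are hypothetical assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 1_000

# Assumed marginal distributions for a hypothetical customer table.
age = rng.normal(loc=40, scale=12, size=n).clip(18, 90)
visits = rng.poisson(lam=3.0, size=n)

# Encode a known relationship: spend grows with visits, plus noise.
spend = 20.0 * visits + rng.normal(0, 10, size=n)

table = np.column_stack([age, visits, spend])
print(table.shape)
```

The key design choice is that the relationships between columns are declared explicitly (here, spend depends on visits), which is exactly the kind of domain knowledge this approach requires you to have up front.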

Fitting Real-World Data to a Distribution

If you have a real-world data set available, you can generate synthetic data by identifying a best-fitting distribution. You can then create synthetic data points according to the distribution parameters. There are two common ways to estimate the best-fit distribution:

  • Monte Carlo method — uses an iterative approach of randomly sampling and statistically analyzing the results. This method can generate variants of the original data set that are random enough to be realistic. It involves a simple mathematical structure but requires high computational power. However, it generates less accurate data compared to other methods of generating synthetic data.
  • Non-neural machine learning techniques — for example, the distribution of a real data set can be estimated using a decision tree. However, such models can overfit, resulting in predicted distributions that generalize poorly beyond the data points in the original data set.
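The Monte Carlo idea above can be sketched in a few lines: repeatedly resample the real data and perturb it slightly, producing variants that are realistic but not identical to any original point. The "real" lognormal sample and the 5% jitter scale below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# A small "real" sample whose underlying distribution is unknown.
real = rng.lognormal(mean=3.0, sigma=0.5, size=500)

def synth_variant(data, size, noise_scale=0.05):
    """Monte Carlo style generation: resample with replacement,
    then add small multiplicative jitter to avoid exact duplicates."""
    base = rng.choice(data, size=size, replace=True)
    return base * (1.0 + rng.normal(0, noise_scale, size=size))

synthetic = synth_variant(real, size=500)
print(round(float(np.median(real)), 1), round(float(np.median(synthetic)), 1))
```

Because the approach only resamples and jitters, it preserves the empirical distribution cheaply, but it cannot invent structure the original sample lacks, which is one reason the article notes its accuracy is limited compared to other methods.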

Neural Network Techniques

A neural network is a sophisticated way to create synthetic data. It can handle richer data distributions than traditional algorithms like decision trees and can synthesize unstructured data like videos and images.

There are three common neural techniques for synthetic data generation:

  • Variational autoencoder (VAE) — an unsupervised algorithm that learns the distribution of the initial data set and produces synthetic data using an encoder and decoder. This model produces reconstruction errors that you can minimize with iterative training.
  • Generative adversarial network (GAN) — an algorithm that uses two neural networks to generate synthetic data points. The first neural network (the generator) generates fake samples while the second (the discriminator) learns to distinguish between real and fake samples. Although GAN models are complex and expensive to train, they can generate highly realistic and detailed data points.
  • Diffusion model — an algorithm that corrupts the training data by progressively introducing Gaussian noise until, eventually, the sample is pure noise. The neural network is then trained to remove the noise gradually, reversing this process until it produces a new sample. Diffusion models are very stable during training and can produce high-quality results for images and audio.
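The forward (noising) half of a diffusion model has a simple closed form and can be demonstrated without a neural network. The toy 1-D "image" and the noise schedule below are illustrative assumptions; real diffusion models pair this forward process with a trained denoising network:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Toy "image": a 1-D signal standing in for pixel data.
x0 = np.sin(np.linspace(0, 2 * np.pi, 64))

# Forward diffusion: a linear noise schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def diffuse(x0, t):
    """Closed-form forward step: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Early steps stay close to the data; by the last step the signal
# is almost entirely replaced by Gaussian noise.
early = diffuse(x0, t=10)
late = diffuse(x0, t=T - 1)
```

Training the reverse process (the denoiser) is what makes the technique generative: the network learns to undo these noising steps, so sampling starts from pure noise and walks back to a clean sample.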

Ethical Challenges of Synthetic Data

Any dataset suffers from biases, because people make unconscious decisions when selecting data that appears most relevant for a dataset. For example, many image datasets were found to have a disproportionate number of images showing white or male individuals.

Synthetic data can make this problem worse. The real world is extremely complex and nuanced. Synthetic data will not create a “fair sample” of the real data it represents — it is more likely to focus on specific patterns and biases in the real world and amplify them. Another aspect is that synthetic data, even if it perfectly reflects the real data distribution, does not take into account the dynamic nature of the real world. Real data will constantly shift and evolve, while a synthetic dataset will remain a “snapshot in time”, which will eventually grow stale.
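The "snapshot in time" problem can be made concrete with a toy illustration; the drift values below are assumed for demonstration, not measured from any real system:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Synthetic "snapshot": fitted once to historical data and never updated.
snapshot = rng.normal(loc=100.0, scale=10.0, size=2_000)

# Real data drifts over time; the snapshot's error grows with the drift.
for months, drift in [(0, 0.0), (6, 5.0), (12, 12.0)]:
    real_now = rng.normal(loc=100.0 + drift, scale=10.0, size=2_000)
    gap = abs(real_now.mean() - snapshot.mean())
    print(f"month {months:2d}: mean gap ~ {gap:.1f}")
```

A model trained only on the frozen snapshot would see its inputs move steadily out of distribution, which is why synthetic datasets need a refresh strategy tied to the real data they imitate.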

The biggest concern is that eventually AI models, fed by synthetic data, will become a closed system. They will train based on biased, repetitive datasets, and generate a closed, limited set of predictions. Those predictions will diverge further and further from reality, and may do so in ways detrimental to their users.

ML engineers have the tools and skills to reduce this “reality gap”. But they have to be conscious of the problem. In many cases, synthetic datasets will not be able to accurately model reality, and organizations will need to undertake the cost and complexity of collecting real data. In many cases it will be an ML engineer who makes the call: does a given problem need real data, or can we settle for synthetic data?

It is a complex challenge, and one that has technical, intellectual, and ethical dimensions. At the end of the day, the socially responsible ML engineer will need to weigh the importance of the problem, the impact of bias on the customer or end user, and the cost of the data, and determine an optimal solution for the organization and its customers. This is a new responsibility resting on the shoulders of ML engineering teams, which will impact the lives and well-being of millions.

Conclusion

In this article, I explained the basics of synthetic data, and introduced several ethical challenges it can generate for ML engineers:

  • Synthetic data is much less expensive to generate, yet may be more biased and less accurate than real-world data
  • Synthetic data can amplify biases already present in real-world data
  • Synthetic data might introduce inaccuracies that will lead to wrong decisions
  • Synthetic data is not sensitive to temporal changes in real-world phenomena

Many machine learning engineers will be faced with a dilemma — go down the easier path of synthetic data while compromising on quality and bias, or invest in real datasets that provide higher fidelity.

Real datasets are not free of bias, and synthetic datasets can be constructed in a way that minimizes bias and inaccuracy. In some cases, synthetic datasets can even be more accurate than real-world data, which suffers from errors in annotation and interpretation. At the end of the day, it will be up to the socially responsible machine learning engineer to make the call — synthetic or real data — which will affect many aspects of our future lives.


Will Synthetic Data Introduce Ethical Challenges for ML Engineers? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


