AI, automated decision-making, and data-driven insights are the buzzwords of the day. The one fuel powering all of them is "Data". You have probably heard the saying "garbage in, garbage out" many times; that garbage is data. Strip away all the physical parts of this new realm and the only thing left is "Data".
So I do not need to explain how important data is to AI/ML model building, but the problem we face today is the lack of enough clean data. Data generation, collection, cleaning, pre-processing, and formatting take up nearly 80% of an AI/ML model building pipeline.
Real-world Applications Using Synthetic Data
Then what would be the solution? Let's dig deep...
About the Hardware
The Minisforum Venus Series UM790 Pro is a relatively compact mini PC fitted with the newest APU from AMD. Its AMD Ryzen 9 7940HS is an 8-core processor belonging to the Phoenix series. The PC doesn't feature a dedicated graphics card, but that is no problem thanks to its capable iGPU, the AMD Radeon 780M.
Next to its brand-new APU, Minisforum has also fitted this test device with 2 x 16 GB of DDR5 RAM and a total of 512 GB of storage. Minisforum has taken a slightly different approach to storage with this device: it is fitted with two 256 GB SSDs in a RAID 0 array (Redundant Array of Independent Disks). This configuration always comes with a Windows 11 Pro license.
For more information, follow along here and here.
Setting up Hardware
Setting up the hardware is very straightforward. I suggest you follow this Hackster.io project to get a good grasp of the entire process, because here I only give a quick overview. If you want to follow the official installation guide, please click here.
Let's go through it step by step...
- The AI PC comes pre-loaded with licensed Windows 11; you have to complete the Windows 11 setup at first boot. Do not worry, it is very straightforward: you just provide a computer name, keyboard layout, location, Wi-Fi password, and so on.
- Enable the NPU (sometimes referred to as the IPU). To do this, boot your PC into "Recovery" mode, then select Troubleshoot -> Advanced options -> UEFI Firmware Settings. The PC reboots into the BIOS menu, where you need to select Advanced -> CPU Configuration -> Enable IPU Control. If you run into any trouble, I suggest you have a look at the "Appendix" section here.
- Download the NPU driver and install it by running the .bat file inside the .zip file.
- Install Visual Studio 2019
- Download and install Python
- Download and install CMake
- Download and install Miniconda
- Download the Ryzen AI Software installation package, extract it, and then run install.bat.
That's it, you are ready to move forward...
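Before moving on, it can help to confirm the toolchain is reachable. Below is a quick sanity check of my own (not part of the official guide), assuming the installers added everything to your PATH:
# Quick sanity check: confirm Python, pip, CMake and conda are on the PATH.
import shutil
import sys

print("Python:", sys.version.split()[0])
for tool in ("pip", "cmake", "conda"):
    path = shutil.which(tool)
    print(tool + ":", path if path else "NOT found - check your installation")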
Synthetic Data Generation - Forecasting
The simplest, non-AI/ML way to generate more (synthetic) data is to build on the small set of existing data you already have. For example, if you have sales data for a specific product but only a few readings, you can use a forecasting method to create more data points.
A well-known example in AI/ML is the Shampoo Sales dataset, and the method behind this approach is the ARIMA statistical model. Fire up a Jupyter notebook on your AMD Ryzen 9 7940HS PC. If you do not have Jupyter installed on your system, run the command below in the terminal:
python -m pip install jupyter
from pandas import read_csv
from matplotlib import pyplot

# Load the shampoo sales dataset, peek at the first rows and plot the series
series = read_csv('shampoo.csv', header=0, index_col=0)
print(series.head())
series.plot()
pyplot.show()
The above code snippet loads the dataset and plots it.
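One optional tweak of my own: if your copy of shampoo.csv stores the Month column as strings such as '1-01' (year-month), you can convert the index into real dates so the x-axis plots as a proper time line. This assumes the common three-year layout of the tutorial dataset:
from datetime import datetime
from pandas import read_csv

# Optional: turn index strings like '1-01' into dates ('190' + '1-01' -> 1901-01).
# The 1900s base year is just the convention used by this tutorial dataset.
series = read_csv('shampoo.csv', header=0, index_col=0)
series.index = [datetime.strptime('190' + month, '%Y-%m') for month in series.index]
print(series.head())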
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
import warnings
warnings.filterwarnings('ignore')
# split into train and test sets
X = series.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
# walk-forward validation
for t in range(len(test)):
    model = ARIMA(history, order=(5,2,1))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# plot forecasts against actual outcomes
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
This is not a comprehensive guide to the working principles of ARIMA. By adjusting the line below, you can find the order that best fits the original data:
model = ARIMA(history, order=(5,2,1))
So, without much sweat, you can generate synthetic data using the ARIMA statistical method.
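If you would rather automate that tuning, the sketch below brute-forces a small grid of (p, d, q) orders and keeps the one with the lowest walk-forward RMSE. The evaluate_order helper, the candidate ranges, and the choice of RMSE as the selection metric are my own assumptions, not part of the original example, and the search can take a few minutes since ARIMA is refit many times:
import itertools
from math import sqrt
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA

def evaluate_order(data, order, split=0.66):
    # Walk-forward validation RMSE for a single (p, d, q) order.
    size = int(len(data) * split)
    train, test = data[:size], data[size:]
    history, predictions = list(train), []
    for obs in test:
        model_fit = ARIMA(history, order=order).fit()
        predictions.append(model_fit.forecast()[0])
        history.append(obs)
    return sqrt(mean_squared_error(test, predictions))

# Candidate ranges below are illustrative; widen them if you have the time.
best_order, best_rmse = None, float('inf')
for order in itertools.product(range(6), range(3), range(3)):
    try:
        rmse = evaluate_order(series.values.ravel(), order)
    except Exception:
        continue  # some orders fail to converge; skip them
    if rmse < best_rmse:
        best_order, best_rmse = order, rmse
print('Best order: %s with RMSE %.3f' % (best_order, best_rmse))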
Pros:
- Less compute intensive
- Less time consuming
- Easy to implement
Cons:
- The prediction model is only fairly accurate
- Need to tune the model manually (adjusting the p, d, q values)
For the complete code example, please refer to my GitHub repository.
Synthetic Data Generation - GAN
Generative Adversarial Networks (GANs) are a powerful machine learning method for producing synthetic data that closely resembles real data. GANs have been utilized to generate synthetic images, text, audio, and video, with applications spanning numerous fields such as healthcare, finance, and security.
GANs operate by having two neural networks compete against each other: a generator and a discriminator. The generator aims to produce synthetic data that appears as realistic as possible, while the discriminator's goal is to differentiate between real and synthetic data. Both networks are trained simultaneously, allowing the generator to progressively create more lifelike synthetic data.
It's time to get your hands dirty. Power up your AI PC, the Minisforum Venus Series UM790 Pro powered by the AMD Ryzen 9 7940HS and AMD Radeon 780M. And do not forget the workhorse NPU core.
Fire up a Jupyter notebook and import the necessary libraries. The TensorFlow library is the heart of this example.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers
Define your limited real data. This example tries to generate synthetic temperature data from a few original data points.
# Original temperature data points
original_data = np.array([30, 32, 35, 33, 31, 30, 29, 28, 30, 31, 32, 34])
original_data = original_data.reshape(-1, 1)
Define the two main components of the GAN, the generator and the discriminator:
# Generator model
def build_generator():
    model = tf.keras.Sequential()
    model.add(layers.Dense(16, activation='relu', input_dim=100))
    model.add(layers.Dense(1, activation='linear'))
    return model

# Discriminator model
def build_discriminator():
    model = tf.keras.Sequential()
    model.add(layers.Dense(16, activation='relu', input_dim=1))
    model.add(layers.Dense(1, activation='sigmoid'))
    return model
Compile the model
def compile_gan(generator, discriminator):
    discriminator.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    discriminator.trainable = False
    gan_input = layers.Input(shape=(100,))
    generated_data = generator(gan_input)
    gan_output = discriminator(generated_data)
    gan = tf.keras.models.Model(gan_input, gan_output)
    gan.compile(optimizer='adam', loss='binary_crossentropy')
    return gan
Train the model
def train_gan(gan, generator, discriminator, original_data, epochs=1000, batch_size=8):
    for epoch in range(epochs):
        # Train discriminator
        noise = np.random.normal(0, 1, (batch_size, 100))
        generated_data = generator.predict(noise)
        real_data = original_data[np.random.randint(0, original_data.shape[0], batch_size)]
        combined_data = np.vstack((real_data, generated_data))
        labels = np.ones((2 * batch_size, 1))
        labels[batch_size:] = 0
        discriminator.trainable = True
        discriminator.train_on_batch(combined_data, labels)
        # Train generator
        noise = np.random.normal(0, 1, (batch_size, 100))
        labels = np.ones((batch_size, 1))
        discriminator.trainable = False
        gan.train_on_batch(noise, labels)
        if epoch % 100 == 0:
            print(f"Epoch: {epoch}")
Finally, do the inference (synthetic data generation)
def generate_synthetic_data(generator, num_samples):
    noise = np.random.normal(0, 1, (num_samples, 100))
    synthetic_data = generator.predict(noise)
    return synthetic_data
# Build and compile GAN
generator = build_generator()
discriminator = build_discriminator()
gan = compile_gan(generator, discriminator)
# Train GAN
train_gan(gan, generator, discriminator, original_data)
# Generate synthetic data
num_samples = 100
synthetic_data = generate_synthetic_data(generator, num_samples)
# Plot the original and synthetic data
plt.figure(figsize=(10, 5))
plt.plot(original_data, 'o', label='Original Data')
plt.plot(synthetic_data, 'x', label='Synthetic Data')
plt.legend()
plt.show()
This code sets up a simple GAN to generate synthetic temperature data. The generator learns to produce temperature data similar to the original data, while the discriminator learns to distinguish between real and fake data. After training, the generator can create new synthetic data samples.
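One optional refinement, which is my own suggestion rather than part of the example above: GANs usually train more stably when the real data is scaled to a small range, so you can normalize the temperatures before training and map the generated samples back afterwards. A minimal sketch:
# Optional tweak (my own assumption, not part of the original example):
# scale the temperatures to [-1, 1] before training, then invert the scaling
# on the generated samples. Pairing this with a 'tanh' output activation in
# the generator is a common companion choice.
data_min, data_max = original_data.min(), original_data.max()

def scale(x):
    return 2.0 * (x - data_min) / (data_max - data_min) - 1.0

def unscale(x):
    return (x + 1.0) / 2.0 * (data_max - data_min) + data_min

# Usage sketch:
# train_gan(gan, generator, discriminator, scale(original_data))
# synthetic_data = unscale(generate_synthetic_data(generator, num_samples))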
This is far better than the ARIMA model: a GAN can generate synthetic data that is very close to the original data.
Pros:
- Generates data that closely resembles the original data
Cons:
- Needs a considerable amount of computational power
For the complete code example, please refer to my GitHub repository.
Synthetic Data Generation - LLM
Large Language Models (LLMs) can be leveraged to generate synthetic data by training them on vast datasets and using their ability to understand and mimic human language patterns. These models, such as GPT-4, can be fine-tuned on specific types of data to produce realistic text-based data. For instance, in healthcare, an LLM can be trained on anonymized patient records to generate synthetic medical reports that can be used for training healthcare professionals or testing new software without risking patient privacy. The key advantage here is that the generated data retains the statistical properties of the original dataset, making it valuable for various analytical tasks.
In addition to text, LLMs can be used to create synthetic data in other formats, such as structured data for databases or code snippets for software testing. By training an LLM on a large corpus of database entries or code repositories, the model can generate new, plausible entries or code snippets that mimic the patterns and structures found in the training data. This can be particularly useful in fields like finance, where synthetic transaction data can be generated for developing and testing fraud detection algorithms without exposing real financial information.
Moreover, LLMs can enhance the diversity and quality of synthetic data by introducing controlled variations. By tweaking the input prompts or adjusting the model's parameters, data scientists can generate a wide range of synthetic data scenarios, helping to train robust machine learning models that perform well in diverse real-world conditions. This capability is crucial in fields like autonomous driving, where synthetic data representing rare but critical driving scenarios can be created to improve the safety and reliability of autonomous systems. Overall, LLMs provide a versatile and powerful tool for generating high-quality synthetic data across various domains.
Now we come to the most expensive and compute-hungry model to run, yet one that can be tamed to run off the grid, without the Internet, thanks to the AMD AI PC Minisforum Venus Series UM790 Pro powered by the AMD Ryzen 9 7940HS and AMD Radeon 780M.
Fire up your AI PC and open the Jupyter notebook IDE.
We use the same use case here: generating synthetic temperature data, this time with the Mistral 7B model running locally, so you need to have the Mistral 7B model downloaded and set up. Since Mistral 7B is a large language model, we'll use it to predict sequences of temperature data based on the given original data points. For this example, we'll use the transformers library from Hugging Face. Before you begin, install the transformers library:
pip install transformers
Load the Mistral 7B model. Assuming you have the Mistral 7B model downloaded and available locally, you'll load it using the transformers library.
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the Mistral 7B model and tokenizer
# Adjust the model_name to point to the correct location of your model files
model_name = "path_to_local_mistral_7b_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Function to format temperature data into a text sequence
def format_temperature_data(data):
    return ' '.join(map(str, data))

# Function to generate synthetic data using the model
def generate_synthetic_data(model, tokenizer, original_data, num_predictions=10):
    input_text = format_temperature_data(original_data)
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    synthetic_data = original_data.copy()
    for _ in range(num_predictions):
        output = model.generate(input_ids, max_length=len(input_ids[0]) + 1, num_return_sequences=1)
        generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
        new_data_point = float(generated_text.split()[-1])
        synthetic_data.append(new_data_point)
        input_ids = tokenizer.encode(format_temperature_data(synthetic_data), return_tensors='pt')
    return synthetic_data
# Original temperature data points
original_data = [30, 32, 35, 33, 31, 30, 29, 28, 30, 31, 32, 34]
# Generate synthetic data
num_predictions = 10 # Number of synthetic data points to generate
synthetic_data = generate_synthetic_data(model, tokenizer, original_data, num_predictions)
# Print the synthetic data
print("Original Data:", original_data)
print("Synthetic Data:", synthetic_data)
This code demonstrates a basic approach to using a large language model to generate synthetic sequences based on input data. The example code is very straightforward, but it needs a lot of compute power and memory to run; thanks to the AMD AI PC Minisforum Venus Series UM790 Pro powered by the AMD Ryzen 9 7940HS and AMD Radeon 780M, we can keep the process easy and simple.
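As mentioned earlier, you can introduce controlled variation into the synthetic data by adjusting the sampling parameters of the model. A small sketch, assuming the model and tokenizer loaded above; the temperature and top_p values are arbitrary starting points, not tuned recommendations:
# Sample one extra token with stochastic decoding to vary the generated value
# (parameter values here are illustrative assumptions).
input_ids = tokenizer.encode(format_temperature_data(original_data), return_tensors='pt')
output = model.generate(
    input_ids,
    max_length=len(input_ids[0]) + 1,  # ask for one extra token
    do_sample=True,                    # enable stochastic sampling
    temperature=0.8,                   # higher values -> more diverse outputs
    top_p=0.9,                         # nucleus sampling cutoff
    num_return_sequences=1,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))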
In terms of output quality and flexibility, this is far better than the other methods described above.
Pros:
- Generates high-quality data
- Supports multiple formats (text, image, video, audio, 3D)
Cons:
- Needs a lot of computational power
- Expensive
For the complete code example, please refer to my GitHub repository.
Data Cleaning
LLMs can significantly enhance data pre-processing and cleaning by automating tasks that traditionally require substantial human effort. One of the key advantages of LLMs is their ability to understand and generate human-like text, which can be leveraged to identify and correct errors in textual data. For instance, LLMs can detect and correct spelling and grammatical errors, standardize formats, and even infer missing information based on the context. This capability is particularly valuable when dealing with large datasets where manual correction would be time-consuming and prone to inconsistencies.
In addition to text correction, LLMs can also facilitate data normalization and standardization. Different data sources often use varied terminologies and formats, making it challenging to integrate them into a cohesive dataset. LLMs can be trained to recognize synonyms and standardize terms across datasets, ensuring consistency and improving the quality of the data. Furthermore, they can automate the process of data transformation by converting data into required formats, such as converting dates into a standardized format or normalizing units of measurement. This automated transformation helps in creating a uniform dataset that is ready for analysis.
LLMs also excel in detecting and handling outliers and anomalies in data. By analyzing the context and patterns within the dataset, LLMs can identify data points that deviate significantly from the norm. Once identified, these anomalies can be flagged for further investigation or automatically corrected if the LLM is confident in its prediction. This ability to pre-process and clean data efficiently not only saves time but also enhances the overall quality and reliability of the data, leading to more accurate and insightful analysis downstream.
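To make this concrete, here is a minimal data-cleaning sketch that prompts a locally loaded instruction-tuned model to standardize a messy record. The prompt wording, the example record, and the reuse of the Mistral 7B model and tokenizer from the previous section are all my own assumptions; a production pipeline would also validate the model's output before accepting it.
# Minimal sketch: ask a local LLM to standardize a messy record.
# Reuses the tokenizer and model loaded in the previous section (an assumption).
messy_record = "temp: thirtyTwo C, date: 3rd Jan 24, city: colombo"

prompt = (
    "Standardize the following record. Return the temperature as a number in Celsius, "
    "the date as YYYY-MM-DD, and the city name capitalized.\n"
    "Record: " + messy_record + "\nStandardized:"
)

input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(input_ids, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))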
Conclusion
This is not the end; it is just the beginning. With powerful AI-enabled PCs such as Minisforum Venus Series UM790 Pro powered by AMD Ryzen 9 7940HS and AMD Radeon 780M, we can achieve more while significantly reducing costs, conserving time, and saving energy. Today, AI models are being optimized to use less power and memory, driving efficiencies that were previously unimaginable. This optimization not only reduces operational expenses but also accelerates processing times, leading to faster outcomes and quicker decision-making.
The financial benefits of these advancements are substantial. Traditional computing systems require significant investment in hardware, electricity, and cooling infrastructure to support extensive data processing. Powerful AI PCs like Minisforum Venus Series UM790 Pro powered by AMD Ryzen 9 7940HS and AMD Radeon 780M, designed to operate more efficiently, lower these costs dramatically. By consuming less power, organizations can reduce their electricity bills and the associated costs of cooling data centers. Furthermore, these optimized models require less hardware, which translates to savings on equipment purchases and maintenance. This is particularly important for businesses that rely heavily on data processing, as the cumulative savings over time can be immense.
Time conservation is another critical advantage of these advanced AI systems. Optimized models that use less memory and power also tend to process data faster. This reduction in processing time means that businesses can achieve their goals more quickly, enhancing productivity and enabling them to stay competitive in a fast-paced market. Faster data processing also means that employees spend less time waiting for results and more time on value-added activities, thereby increasing overall efficiency.
Energy conservation is perhaps the most compelling benefit of these innovations. As AI models become more power-efficient, they significantly reduce the carbon footprint of computing operations. Data centers are notorious for their high energy consumption, contributing to substantial greenhouse gas emissions. By optimizing AI models to use less power, the demand for energy decreases, leading to fewer emissions and a more sustainable approach to computing. This not only benefits the environment but also aligns with the growing emphasis on corporate social responsibility and sustainability goals.
The shift towards edge computing represents the next stage in the evolution of AI and machine learning paradigms. Edge computing involves processing data closer to where it is generated, rather than relying on centralized data centers. This approach reduces latency, enhances data privacy, and further conserves energy by minimizing the need to transmit data over long distances. Edge computing, combined with optimized AI models, allows for real-time data analysis and decision-making at the source, which is crucial for applications such as autonomous vehicles, smart cities, and industrial automation.
Moreover, edge computing can significantly reduce operational costs. By processing data locally, businesses can decrease their reliance on expensive cloud services and reduce the amount of data that needs to be stored and managed in central servers. This local processing also means that devices can operate independently, without continuous internet connectivity, which can lead to additional cost savings and increased reliability.
In conclusion, the advancements in AI technology, particularly in the optimization of models to use less power and memory, are transforming the landscape of computing. These innovations offer substantial cost savings, time efficiencies, and energy conservation, contributing to a more sustainable and economically viable future. As we embrace edge computing and continue to refine AI capabilities, the potential for further improvements in efficiency and sustainability is vast. This is indeed just the beginning, with the promise of even greater advancements on the horizon.