- This project was prepared for the AMD Pervasive AI Developer Contest, with ongoing work aimed at enhancing the genome-sequencing-based cancer signal detection process through partnerships with industry leaders.
- Four of the participants come from FRC (FIRST Robotics Competition) team 10015, Bubbles. As high school students, we developed this project as one of our team training projects for AI data processing.
- We are participating in Category 3: PC AI with AMD Ryzen™ AI.
- To enable the NPU/IPU, install the drivers, learn about ONNX, and install Ryzen AI, follow the official guide: https://ryzenai.docs.amd.com/en/latest/
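After installing the Ryzen AI software, one quick way to confirm that the NPU is visible to ONNX Runtime is to list the available execution providers. This is a minimal check; the provider name follows the Ryzen AI documentation linked above.

import onnxruntime as ort
# After a successful Ryzen AI installation, 'VitisAIExecutionProvider' should
# appear in this list alongside the default CPU provider.
print(ort.get_available_providers())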
Many genetic screening companies face great challenges in finding the characteristic features of target cancers in large databases because of different sources of noise. These noise sources include disparities in the data collection process, variations in the pathological conditions of patients, and technical artifacts generated during sequencing and data processing.
- Biological Variability: Cancer-related genetic mutations are heterogeneous across individuals, and this heterogeneity results in variations in the genomic profiles of the data.
- Technical Artifacts: Errors introduced during sample collection, DNA extraction, sequencing, and data processing can also lead to noise and signal distortion, which can in turn affect the accuracy of analysis.
- Environmental Factors: External factors, like environmental exposures, lifestyle choices, and comorbidities, may shift the expression patterns of genes and cause noise in the data.
- Sampling Bias: Biases in sample selection or population demographics can skew the representation of different genetic variants or disease subtypes, leading to erroneous conclusions.
- Data Integration Challenges: Integrating datasets from different sources, including genomic, clinical, and imaging data, can introduce extra noise and make the process of data analysis more complicated.
Many firms have used AI algorithms, including deep learning models, to solve such problems and identify the deterministic features of cancer. However, most of these conventional methods rely on computationally intensive, time-consuming, and expensive GPU architectures. While these methods can work, they sometimes yield poor results due to data complexity and the limitations imposed by GPU architectures.
1.3 NPU and AI
1.3.1 What is an NPU?
An NPU, or Neural Processing Unit, is a specialized piece of hardware designed to accelerate artificial intelligence and machine learning tasks. Unlike general-purpose CPUs (Central Processing Units) or GPUs (Graphics Processing Units), NPUs are optimized for demanding AI workloads, such as the computations involved in neural networks. These workloads involve large amounts of matrix math and other mathematical operations that are central to AI algorithms.
1.3.2 Key Features of NPUs:
- Parallel Processing: NPUs are designed to execute many operations simultaneously; this is highly important for many AI computations.
- Efficiency: They are efficient not only in speed but also in power consumption, which makes them ideal for mobile devices.
- Speed: NPUs accelerate the inference phase of AI models, i.e., applying a trained model to data to make predictions.
1.3.3 How is this related to AI?
- Inference in AI: NPUs are designed to accelerate the inference process of AI models. Inference refers to making predictions on new data using a trained model. This is extremely important for applications requiring fast response or real-time processing, such as autonomous vehicles, facial recognition, or voice assistants.
- Energy Efficiency: Energy efficiency is critical for most AI applications, particularly for emerging edge devices such as smartphones, drones, and IoT devices. NPUs deliver the required computation with high energy efficiency, avoiding excessive power consumption.
- Scalability: NPUs allow AI applications to scale by effectively managing large volumes of data and complex models. This scalability is favorable for deploying AI solutions in health, as well as in automobile and consumer electronics areas.
1.3.4 AMD's Ryzen AI and NPU:
AMD's Ryzen AI-equipped PCs feature NPUs embedded in the AMD XDNA architecture. These NPUs are designed to handle AI inference tasks efficiently, offloading workloads from the CPU and GPU. This architecture provides several benefits:
- Optimized AI Performance: The integration of NPU with AI-specific software tools and runtimes, such as ONNX Runtime and Vitis AI EP, ensures optimal performance for AI models.
- Support for AI Frameworks: Ryzen AI NPUs support popular AI frameworks like PyTorch and TensorFlow, making it easier for developers to train and deploy models.
For this contest, we wanted to identify a project that addresses a critical need in AI data processing with significant social and economic impact. Genetic screening is a perfect fit for our proposal, as it enhances public health by enabling early detection and management of hereditary conditions, thereby reducing healthcare costs. The complexity of genetic data makes it ideal for AI processing, which can analyze large datasets efficiently, leading to more accurate results. It also opens a door toward personalized medicine through treatment based on individual genetic profiling. We aim to leverage AMD NPU technology in the AI genetic screening process to improve health outcomes and demonstrate the transformative potential of AI in addressing real-world challenges.
2.2 Winning Hardware:
Realizing the potential of NPUs in genetic data screening, particularly in the data noise removal process, we submitted our idea to the Hackster competition. We were thrilled that our innovative approach earned us the UM790 Pro mini PC. This win provided us with the hardware needed to advance our project, thanks to AMD and Hackster.
2.3 Collaborating with Geneseeq:
During our research, we discovered a cutting-edge genetic screening company called Geneseeq in Toronto, ON. Their focus on precision medicine and comprehensive genetic testing aligns perfectly with our project goals. Dr. Xue Wu, their CEO, has been highly supportive of our initiative and recognized the importance of our work. They provided us with invaluable real-world insights and access to the relevant datasets needed for our project.
We started the project development as soon as we received the UM790 Pro mini PC. Our performance analysis went through two phases:
- Phase One: Focused on code development, testing, and verification using a small genetic dataset (2,500 datasets made of 2 x 1 signals).
- Phase Two: Utilized a much larger dataset commonly used in the industry, which is more representative of the genetic characteristics of cancers and diseases (161,200 datasets made of 250 x 20 signals).
2.4.1 Phase I:
2.4.1.1 Data Processing:
In this initial phase, we aimed to ensure adequate code development for both the CPU and NPU platforms. We used a small dataset from a recognized public DNA database, FinaleDB (FragmentatIoN AnaLysis of cEll-free), for our benchmarking tests. The dataset was further simplified to only contain deterministic disease information. The focus was on code development and testing for data cleaning, specifically the elimination of technical artifacts and normalization of data to provide high-quality input for our AI models. These steps are crucial for accurate and efficient model training.
We split the data into different parts, which we then trained and validated using the Python package pandas, a Python Data Analysis Library, as well as the scikit-learn package, which is specialized in machine learning. This ensured that our models were built on clean, normalized data, setting a solid foundation for further development and testing.
import pandas as pd  # import pandas
from sklearn.model_selection import train_test_split  # import train_test_split from scikit-learn
data = pd.read_csv('your_dataset.csv')  # load the dataset
# split the dataset into training and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
This process allowed us to evaluate the performance and accuracy of our models, ensuring they were effectively learning from the data and making accurate predictions. The use of pandas and scikit-learn facilitated efficient data handling and model training, helping us refine our approach before moving on to larger datasets.
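For completeness, here is a minimal sketch of the normalization step mentioned above. It assumes a hypothetical 'label' column, with every remaining column being a numeric feature, and uses scikit-learn's StandardScaler as one common choice; the actual cleaning pipeline was more involved.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('your_dataset.csv')  # placeholder file name, as above
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
train_data, val_data = train_data.copy(), val_data.copy()

# 'label' is a hypothetical column name; all other columns are assumed numeric features
feature_cols = [c for c in data.columns if c != 'label']

scaler = StandardScaler().fit(train_data[feature_cols])  # fit on training data only
train_data[feature_cols] = scaler.transform(train_data[feature_cols])
val_data[feature_cols] = scaler.transform(val_data[feature_cols])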
2.4.1.2 AI Model Training:
This codebase was implemented on the available preprocessed data using the PyTorch and Hugging Face Transformers Python packages. We opted for PyTorch due to its flexibility and efficiency in performing deep learning tasks, while the Transformers package helped us develop advanced AI models. This combination allowed us to effectively manage the complexities of genetic data and build robust, high-performance models for our project.
The code was executed in an Ubuntu (24.04) environment via Windows Subsystem for Linux (WSL) on the UM790 Pro PC. This setup provided a robust and flexible development environment, leveraging the strengths of both Linux and Windows operating systems.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertModel, BertTokenizer
# Load and preprocess the data
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_texts = ["example sentence 1", "example sentence 2"]
train_labels = [0, 1]
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
train_inputs = torch.tensor(train_encodings['input_ids'])
train_labels = torch.tensor(train_labels)
# Create a simple dataset and dataloader
train_dataset = TensorDataset(train_inputs, train_labels)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = nn.Linear(768, 2)
    def forward(self, input_ids):
        outputs = self.bert(input_ids)
        pooled_output = outputs[1]
        return self.classifier(pooled_output)
model = SimpleModel()
# Training loop
optimizer = optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()
model.train()
for epoch in range(3):  # number of epochs
    for batch in train_loader:
        inputs, labels = batch
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')
# Save the trained model
torch.save(model.state_dict(), 'simple_model.pth')
2.4.1.3 Leveraging Ryzen AI and ONNX:
Our AI models were trained with the help of Ryzen AI capabilities that come as part of the UM790 Pro mini PC. Utilizing ONNX Runtime and Vitis AI Execution Provider (EP), we could fine-tune our models for highly efficient NPU-based inference. This configuration maximizes hardware capabilities, ensuring that the training processes were carried out rapidly and with precision, thereby enhancing both performance and accuracy.
import torch
import onnx
import onnxruntime
import numpy as np
# Load the trained model
model = SimpleModel()
model.load_state_dict(torch.load('simple_model.pth'))
model.eval()
# Dummy input for ONNX export
dummy_input = torch.tensor([[101, 1037, 2742, 6251, 102]])
# Export the model to ONNX
torch.onnx.export(model, dummy_input, "simple_model.onnx", input_names=['input_ids'], output_names=['output'], opset_version=11)
# Load the ONNX model
onnx_model = onnx.load("simple_model.onnx")
onnx.checker.check_model(onnx_model)
# Run inference using ONNX Runtime
ort_session = onnxruntime.InferenceSession("simple_model.onnx")
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()
# Prepare input
ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(dummy_input)}
# Run inference
ort_outs = ort_session.run(None, ort_inputs)
print(f'ONNX Runtime output: {ort_outs}')
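To run the same session on the NPU, the ONNX Runtime session can be pointed at the Vitis AI Execution Provider instead of the default CPU provider. The following is a minimal sketch, assuming the 'VitisAIExecutionProvider' name and the 'vaip_config.json' configuration file that ships with the Ryzen AI installation (the exact path may differ on your system).

import onnxruntime
# Requires the Ryzen AI software stack; 'vaip_config.json' comes from the Ryzen AI install.
npu_session = onnxruntime.InferenceSession(
    "simple_model.onnx",
    providers=["VitisAIExecutionProvider"],
    provider_options=[{"config_file": "vaip_config.json"}],
)
npu_outs = npu_session.run(None, ort_inputs)
print(f'NPU (Vitis AI EP) output: {npu_outs}')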
After completing the above steps, we began testing and optimizing the code.
2.4.1.4 Testing and Optimization:
To ensure the robustness of our solution, we conducted rigorous testing and optimization:
Quantization: We used the AMD Vitis AI Quantizer tool to convert our models into INT8 format to accommodate the Ryzen AI hardware and increase inference speed. The quantized model was run, and some promising results were obtained. The AI model successfully loaded the database, although it could only identify a few of the cancer cells. Since we were using just a small database, we could not achieve the best results. Nevertheless, we still carried out a performance evaluation in Phase I using the small dataset.
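To illustrate the INT8 conversion step, the sketch below uses ONNX Runtime's built-in static quantization API as a stand-in for the AMD Vitis AI Quantizer, whose interface is similar but adds NPU-specific settings; the calibration reader simply replays a tokenized sample.

import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class TokenCalibrationReader(CalibrationDataReader):
    """Feeds a handful of tokenized samples for calibration (illustrative only)."""
    def __init__(self):
        self.samples = iter([
            {"input_ids": np.array([[101, 1037, 2742, 6251, 102]], dtype=np.int64)},
        ])
    def get_next(self):
        return next(self.samples, None)

# Produce an INT8 QDQ model; the Vitis AI Quantizer performs an analogous conversion
# with settings tuned for the Ryzen AI NPU.
quantize_static(
    "simple_model.onnx",
    "simple_model_int8.onnx",
    TokenCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)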
2.4.1.5 Performance Evaluation:
Comparing NPU and CPU:
As part of our development process, it was very clear to us from the beginning that the NPU would significantly outperform the CPU in terms of performance and power consumption. Nevertheless, we carried out this performance evaluation to demonstrate the difference. However, when it comes to the performance comparison between the NPU and GPU, the small dataset was not adequate to demonstrate the difference in performance. The high-speed processing capabilities of both the NPU and GPU limited our ability to get clear results with this dataset size. This led us to understand that we must have a larger dataset to effectively benchmark these hardware configurations.
As can be seen from the result, the IPU/NPU was about 7 times faster than the CPU for the same task. Upon reviewing the literature, we found that NPUs can be up to 10-100 times faster than CPUs for specific AI workloads. Our result is slightly less than the published performance tests but still falls within a reasonable speedup range. This again indicates that a much larger dataset is required to demonstrate the full potential of NPUs.
Our findings align with common industry observations, where the extent of performance gains with NPUs can vary based on the specific characteristics of the workload and the dataset size. For a more detailed comparison, further testing with a larger and more complex dataset would likely provide a clearer picture of the full capabilities of the NPU.
2.4.1.6 Phase I Conclusion:
Using the small dataset, we developed and verified a working AI model for the genetic screening process. We successfully identified characteristic features of cancers out of the noise, demonstrating the potential of our approach in real-world applications. This highlights the effectiveness of our methodology and the capabilities of the NPU in handling complex AI tasks.
Our benchmarking results show that the NPU outperforms the CPU for the given specs of the UM790 Pro mini PC. A 7 times speedup performance was achieved with the optimized model using Ryzen AI packages. However, the small dataset was not adequate to show the difference in performance between the NPU and GPU. This is because high-speed processing by both the NPU and GPU limits getting clear results with this dataset size. This led us to understand that a larger dataset is necessary to benchmark these hardware configurations effectively.
2.4.2 Phase II:
Recognizing that a larger pool of samples was required for a more comprehensive analysis, we contacted Geneseeq to express our need for a larger DNA dataset. They generously provided a database containing 161,200 lines of genetic data, with 250 by 20 feature signals per line. This allowed us to train the AI to higher accuracy and efficacy. This larger dataset enabled us to benchmark the hardware configurations more effectively and demonstrate the full potential of the NPUs in handling complex AI tasks.
2.4.2.1 Database Snippet
The large database contains a total of 161,200 DNA datasets, with each dataset consisting of 250 sections and 20 characteristic feature signals. Here is an example of what the database format looks like:
text,output,patient
"### Instruction:
Annotate the following sequence.
### Input:
AAAAACCAGGCCAAATGGAC AAACATGTCTCAAGATATTG AAAGAAATCTTCCACCATCC AAATAAATTTCAGGTATTAA AAATTACAATATAATATTGC AACATGGCAGAGTGGATAGA AACGATTTTTGTATGTTTAC AACGGAATCAAACGGAATTA AACTACCTACCAGGTATGAG AAGCATGTATTATTTTTACA AATACAAAGGCAGGTGTGTG AATCGAATCCATTCAAAGAT AATTGTTTTGCCTATCCACC ACAACACCCTTGAAGCCCTC ACAGGAAGGAAGCCAGTCTA ACATTTCAGTTTTGTTTGAG ACCAACACCTACAGTTAGTC ACCTTTCTCTCTGGCTGCCC ACTGTCACTGGTCAATCGTC AGAAAAAGGAATAATCACAA AGAAAATATGTGCATGCATA AGCAATTCTCCTGCCTCAGC AGCAGATTTAGGGAACAGTT AGCTCTTAGCAGAGCACTGT AGGAACTCTCTAAACTCTGT AGGAGCTGGCTTAGAAATTA AGGCACCCACCAGCACACCC AGGGATGAAGCCCACTTGAT AGTCTGGGTAGACAGCAGGA AGTGCAAAGTAAAGTGTGCT AGTGGCACGACATAGCTCAC ATTGTTAGCTTATTTAACAT ATTTTAAAACGGATGAGATA ATTTTCATCTGGCAACCTGA CAAAAAACAAGCAACAATTA CAAATATAATCATCTGGGAG CAAATTAATGGAGACTGTTT CAACAAAGAAAGTATTTCAC CAACGACTGGTGGCCGTGAA CAAGCGATTTTCCTGCCTCA CAAGGAATAAGAGAGCTGAA CAAGTGAAAGCGAAAAGAGC CAATAAATCTTATGTCACAT CAATCAGCCCAGCAGAGCCT CAATCCCAGTACTTTGGGAA CAATGATTTGGGACTTTGGG CACAAAACAGTGCTCTACTC CACCATTGCATACCTCTTGT CAGACCTGGGGGCACCTCGC CAGAGCACTGTTATGAGAAA CAGATCCTTCGAGGGATGGC CAGCTCCTCCAGTGTGCGAG CAGGAAAAGAGCAAAGGAAG CAGGAGTTTGAGACCAGCCT CATAAATCTTGGTTTTGCTT CATCGATCATTTCAAAGAAC CATGCAGAAAAGTGTAAAAC CATGGACATCTACCCCCTTC CATGTCATAATGATACACTT CATGTTTGAGCTCACTCCCC CATTCCATTCCATTCTAGTT CCAAACTAAAAGCAGCCCCT CCAAATACAGTGCCAAGATT CCAAATTCAAAACATGTATA CCAAGTTCACTATGTTCCCC CCAATCAAAGATGTCTGGAG CCACAAATCACAATCGAAAG CCACAATCAAGTAGGCTTCA CCACCACGCCTGGCTAATTA CCACGCTGCCTGCTCCTAGG CCACTGAATTGTACACTGTA CCAGAGCCTCCTTGACGGCA CCAGGAAGCATGTGTTGGCA CCAGTGACAGCCCTTCCTAA CCATCATTGTGTCTTGTCTA CCATGCCCAGCTAATTTTTG CCCAAAACGCTGACTAATAA CCCACCTCAGCCTCATGAAT CCCACTTATAAGAAAAAGTT CCCAGCATCAGCAGGGCCCA CCCCAAGGACCTTTGTGAGA CCCCACCCACCTCCGAACAA CCCGGGAGTCAAAGTTTCTG CCCTATTCATCATTCATAAG CCCTCCCTGATACATTACCT CCCTGACTCCCGCTTCATTG CCCTGGCCTGCCATGCCCCC CCCTGGTTTCTGCTGGTCAG CCCTTAATGCTGGAGAGGTC CCCTTAATTGGGACAGTGGG CCCTTACTTTTGAAGTGTAG CCCTTCAGGAGGGTTTGTAT CCTAAAGAGTTGTTGCAGCA CCTACAAAAATGCGGTGATG CCTAGGTTCAGCCTACAGGA CCTATACTGGTGGTGTTATA CCTCCCCGCTGCTCTTCGTC CCTCTAGCTTTCCCTCTCTC CCTCTCTCCAGTGTTCTACT CCTCTTTTTTTTTTTGAGAC CCTGAGAGTGGTGGGATTAG CCTGCCACAGTGACCTTACT CCTGGCCCTTCCCTGGACAA CCTGTAATCTCAGCATTTTG CCTGTCTCTACAGTCAATCA CCTTCACCTGAATCTAAGCT CCTTCCTGATAAATGGCATT CCTTGGAAACTAAACAGTCT CCTTGGTTGGCTGTCTCTTC CCTTTCGATTGCATTCGATT CGAAGAAGTAGATGAAATCT CGACACTGGATGCTCAGATG CGGGCGGGGGTGCTCCTCAC CGGGGACTGGCCTGCTGAAG CGTGATCCACCTGCCTCGGC CTAAGGAGAGGAGCTACCCA CTACCAAATCTCTCTCAGAT CTAGAAAGTTTATTCGTGTT CTAGTCTCATGGCTTTAACT CTATACTTCCTGACTCCTCT CTATATCCTTCAAAGCACAG CTATCTTAAGGAGAGCATTT CTCAAATTTGTCTCCCTGGC CTGAGACAGGTACTATTATT CTGATGGCCAGTGATGATGA CTGGATTCTCAAGTTCCATG CTGGGATTACAGGCACCCAC CTGTGATAAGAATGTCGCTT CTTAAAATGTCTCCCATGAC CTTCAACAATATCTCATTAT CTTCCAGCCCCGTGAGTCCC GAAGAGTAAACTATGGACAG GAAGGACTCCTGTCCTTCTG GAATAGGTAGCTCAGCACAG GACATAAAATTGTCTTCGTG GACCGCATCTCTAGATCCAT GCAAAAACATACCAAATTAT GCAAGTACATTCAGATGCAT GCACGCCTGTGTCCTCACAT GCATCAATGTTCATCAAGGA GCATGAAAAAACTGTCGGTA GCCAAGATCACACCACTACA GCCAGAAATGCAATGGAAAA GCCCATCTCAGCCTCCCAAA GCCCCCATCCTGTCCTTAGC GCCTCTGACTTTCTGGGTTG GCCTGTCCCAGGAGAACTTG GCTAAAAGAGATAATGTCAG GCTGAGGAGGGAGGATCTCT GCTGATCTAGACCATGCTCA GCTGCATCCCTGTGTGTACA GCTGGGATTTCAGGCATGAG GCTTAGATACACAAACACTT GGAAAATAATCTTCAATGAT GGAATCCTTTCCCCATTGCT GGACTCAGACTGGGATCAGG GGACTGTGGCTTTATCCCTG GGAGCTGGAGTTAATGAGGT GGATCACATGAGGCCAGGAG GGCAGAAGGCATGCAGCTGG GGCCAGGCTGGTCTCGAACT GGCCAGGTGCCGGTGGCTCA GGCGGGCGCAGAGGGACAAG GGCTGCATAGTATTCCATGG GGTAGAATTCGGTTGTGAGT GGTCATAACATTGTAACTGT GGTCGAAGGCACTGGGCATT GGTCTCCCCAGTGGAAAGAG GGTGATACCCAGGCAAACAG 
GGTGGAGCAAAAAGATAGTG GGTTAGTAGGTCCCTTTATA GGTTCATTTCACATGTATAA GGTTGCACTGGCATATAGAA GTAGTCACTTTTGAAGATAC GTATCCTTTGCTGTGCAGAA GTCACATTTCCTGACTGTGT GTCTACTCCGGCATGCATCA GTTCAGTTTCCATGTAGTTG TAAAAATGAAAATAGTATTT TAAAGAAAAAAGGAAATTCT TAAATAATCAGTATAAGCAG TAAGGATCCTAGAAGAAAAC TAAGGGTGATGTGGGGAGAA TAATAAAAATGCTTCTCAAG TAATCTCAGCATTTTGGGAG TACAACGGAGGAGTGAGTGT TACAGACTATAGTATGCATG TACATACACAGGGGGCAGTC TACATCATTTCTGAATCTAG TACTAATCTTGTTACTGCTG TACTCTTGCTTTAAAAGTTT TAGAATTCTCAGAGCCCAGG TAGATACATAGATATGATAG TAGATACCAAAAGGATTATA TATAAAGCAGGGTAAGACTC TATACCTAAGGCTAAATGAC TATGTGCTGCTGTAGGGATG TATTTGGGGGAATAATCTAA TATTTTATATTTATATACAT TCAAGTGTTTTTCCACCTTC TCACACAAAGTGTCTAAAAC TCACAGCTTTCACCATGACA TCACCTTTTGGACTCTTGGA TCAGGAGGCTGAGGCAGGAG TCATCAAATCATTAAGGTCC TCCCAAGTAGCTGGGACTGC TCGGGTTGATTCCATTCCAT TCTATTAAATGAAAATCGTC TCTCTCTCCCAGGTTAGAAT TCTGTGGAGTGGCTGTGAAA TCTGTGTTTTAGTTTTGTTG TCTTCATATCTTACTCAGCT TCTTTCTGGGGTTCTCTGGA TGAGAAAATAAAGAACAACA TGAGAGAAATTAAAGATCTA TGAGGAGTCAGTGGAGAAGA TGAGGGGCAATTGTGCAGAT TGAGTTTAGCGTATGCCATT TGATGGGTAGACAGGTTGGA TGATTAATGTTTCCTTGGTC TGATTATGAAACTTTTTGGA TGCATTGACTGGTAGGGATG TGCTAAGTTGCCCAGGCTGC TGCTCTCTCCAGCAACGCTA TGCTTAGGATGCTGAAAGAA TGCTTCTTCACTCTTCATCT TGGCTTGAAGTCAGGAATTT TGTATTTTTTATGTCTTAGC TGTCACATGCTTTTTACAAC TGTCACCTAGGTAATCAGCA TGTCAGAGGAGGATTAGTTT TGTCCATCTGACTGTAGTGG TGTGCTAATTGCCATTTGTG TGTGTTGTGCATTTGAAACA TGTTCTGTTAAGGAAGTGAA TGTTGAATTCCCACTGCATA TTCAATAGGTTATTGGGGAA TTGAACTGGGACATTTCAGC TTGCAAGATTTTATGGCACC TTGCTATGGCTCCTGAAAAC TTGGGATGCTGACTACTACT TTGTGTTTTTAGTAGAGATG TTGTTTTACCACTTGTGCAT TTTGTTTTTTATTTGTTTTT TTTTATTACCGAATTGTTTA TTTTATTTTTCCCTCGCTCC TTTTGAGGTGAACCATCATA TTTTTATGGAACCACAAAAG TTTTTGTCATTTAAAAGGTA TTTTTTTGTATTTTTAGTAG
### Response:
Cancer.",Cancer,EE87786
2.4.2.2 Training Process:
We developed an improved model to handle the larger dataset, using a popular language model together with the Transformers library (version 4.39.3).
train.py:
from dataclasses import dataclass, field
from typing import Dict, Optional
import torch
import transformers
from datasets import load_from_disk
from transformers import DataCollatorForLanguageModeling, Trainer, TrainerCallback
class MemoryUsageCallback(TrainerCallback):
    """A callback to log memory usage at each training step and epoch."""

    def on_step_end(self, args, state, control, **kwargs):
        # Report memory after each batch has been processed
        if torch.cuda.is_available():
            print(f"Step {state.global_step}:")
            self.report_gpu_memory_usage()

    def on_epoch_end(self, args, state, control, **kwargs):
        # Report memory at the end of each epoch
        print(f"Epoch {state.epoch}: Memory usage stats")
        self.report_gpu_memory_usage()

    @staticmethod
    def report_gpu_memory_usage():
        """Reports current GPU memory usage."""
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            allocated = torch.cuda.memory_allocated(0) / (1024**3)
            cached = torch.cuda.memory_reserved(0) / (1024**3)
            print(f"  CUDA Memory Allocated: {allocated:.2f} GB")
            print(f"  CUDA Memory Cached: {cached:.2f} GB")
        else:
            print("  CUDA is not available. Cannot report GPU memory usage.")
@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
    tokenizer_name_or_path: Optional[str] = field(default="facebook/opt-125m")

@dataclass
class DataArguments:
    data_path: str = field(
        default=None, metadata={"help": "Path to the training data."}
    )

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(
        default=1024,
        metadata={
            "help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."
        },
    )
def make_data_module(tokenizer: transformers.PreTrainedTokenizer, data_args) -> Dict:
    """Make dataset and collator"""
    tokenized_datasets = load_from_disk(data_args.data_path)
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False, mlm_probability=0.0
    )
    return dict(
        train_dataset=tokenized_datasets["train"],
        eval_dataset=None,
        data_collator=data_collator,
    )
def train():
    parser = transformers.HfArgumentParser(
        (ModelArguments, DataArguments, TrainingArguments)
    )
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    print(model_args, data_args, training_args)
    model = transformers.AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        cache_dir=training_args.cache_dir,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_args.tokenizer_name_or_path,
        cache_dir=training_args.cache_dir,
        model_max_length=training_args.model_max_length,
        padding_side="right",
    )
    data_module = make_data_module(tokenizer=tokenizer, data_args=data_args)
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        **data_module,
        callbacks=[
            MemoryUsageCallback()
        ],  # Add MemoryUsageCallback to the callbacks list
    )
    trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
    trainer.save_state()
    trainer.save_model(output_dir=training_args.output_dir)

if __name__ == "__main__":
    train()
The ‘train.py’ file trains the AI model with PyTorch and the Hugging Face Transformers library. It defines a custom memory-tracking class, ‘MemoryUsageCallback’, which logs GPU memory usage after every training step and epoch. The configuration of the model, data, and training is set via Python dataclasses, providing considerable flexibility in handling different settings. The ‘make_data_module’ function loads the dataset and sets it up for training.
Specifically, the ‘train’ function is responsible for:
- Parsing the configurations
- Loading the model and tokenizer
- Initializing the training environment
- Running the training process with memory usage monitoring
- Saving the trained model and the training state
The ‘train_finaledb.sh’ script is a shell script that sets the environment for executing the training script (‘train.py’). It is optimized for running with GPU and includes memory usage reporting.
The following is the updated script of the training process and the shell script to run it, executed on a Windows system using WSL with Ubuntu:
train_finaledb.sh:
#!/bin/bash
SHELL_SCRIPT=$(readlink -f "$0")
RUN_PATH=$(dirname "$SHELL_SCRIPT")
SRC_PATH=""
echo "RUN_PATH: ${RUN_PATH}"
echo "SRC_PATH: ${SRC_PATH}"
# lower per_device_train_batch_size to reduce memory usage
torchrun --nproc_per_node=1 ${SRC_PATH}/train.py \
--model_name_or_path "facebook/opt-125m" \
--tokenizer_name_or_path ${SRC_PATH}/opt-seq-pubmed-tokenizer \
--data_path ${RUN_PATH}/data_Cristiano/tokenized_data \
--bf16 True \
--output_dir ${RUN_PATH}/train_output_Cristiano \
--num_train_epochs 30 \
--per_device_train_batch_size 6 \
--per_device_eval_batch_size 6 \
--gradient_accumulation_steps 6 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 1e-4 \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 32 \
--full_determinism \
--tf32 True \
--model_max_length 8192 \
--report_to tensorboard \
--dataloader_num_workers 8 \
--fsdp "full_shard auto_wrap" \
--fsdp_config ${SRC_PATH}/fsdp_config_opt.json
The script sets up the paths to the training script and the source directory, echoing them out for confirmation. It uses the ‘torchrun’ command to initiate the training process with various settings such as the model path, tokenizer path, data path, batch sizes, number of epochs, learning rate, and other parameters. The script also enables features like full determinism and TensorFloat-32 (TF32) for better performance. Running this shell script will start the training process with configurations that specify GPU use for efficient execution and log memory usage to monitor resource consumption. Note that TF32 is only supported on NVIDIA GPUs (Ampere architecture or newer); if you do not have a compatible GPU, set the ‘--tf32’ flag to ‘False’.
2.4.2.3 Prediction:
Before testing in Phase II, we predicted that the NPU would train the AI model faster than the GPU, due to its specialized design for handling neural network operations. We also anticipated that the NPU would consume less power during training, making it more efficient overall.
2.4.2.4 Testing and Benchmarking:
We started our testing on a GeForce RTX 4060Ti GPU to establish a benchmark result. The training process was estimated to take approximately 300+ hours to complete. The data was split into epochs with a total of 134,310 steps. Each step took ~8 seconds with the GPU, as shown in the screenshot below (a quick arithmetic check of this estimate is sketched after the summary list).
GeForce RTX 4060Ti GPU
- Estimated Training Time: ~300 hours
- Total Steps: 134,310
- Time per Step: ~8 seconds
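A quick back-of-the-envelope check of this estimate from the reported step count and per-step time:

# 134,310 steps at ~8 s/step comes out to roughly 300 hours of training.
total_steps = 134_310
seconds_per_step = 8
print(f"~{total_steps * seconds_per_step / 3600:.0f} hours")  # ~298 hours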
We then moved on to the NPU training. To ensure the same baseline, we kept the same dataset configuration, i.e., the data was split into epochs with a total of 134,310 steps. We first ran the training on the NPU without optimization, and the result is shown below:
The training time on the non-optimized NPU was estimated at approximately 1500 hours, with each step taking about 53 seconds. After considerable optimization using the AMD Vitis AI Quantizer tool, the estimated total training time was reduced to about 620 hours, with each step taking approximately 17 seconds.
This optimization resulted in the NPU being about 3 times faster. This significant improvement demonstrates the effectiveness of the optimizations applied using the AMD Vitis AI Quantizer tool.
Non-Optimized NPU:
- Estimated Total Time: ~1500 hours
- Time per Step: ~53 seconds
Optimized NPU:
- Estimated Total Time: ~620 hours
- Time per Step: ~17 seconds
Comparing the optimized NPU result with the GPU shows that the RTX 4060Ti runs at roughly twice the speed of the NPU in our setup.
Lastly, we conducted the model training test on the UM790 Pro CPU and obtained the following results.
For the same dataset configuration, the UM790 Pro CPU took approximately 153 seconds to complete one step, resulting in a total estimated runtime of around 5700 hours.
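The results are summarized below (figures as reported above):

| Hardware | Time per Step | Estimated Total Runtime |
| --- | --- | --- |
| GeForce RTX 4060Ti GPU | ~8 seconds | ~300 hours |
| Non-optimized NPU | ~53 seconds | ~1500 hours |
| Optimized NPU | ~17 seconds | ~620 hours |
| UM790 Pro CPU | ~153 seconds | ~5700 hours |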
This table provides a comprehensive comparison of the training performance for the GeForce RTX 4060Ti GPU, a non-optimized NPU, an optimized NPU, and the UM790 Pro CPU. These results highlight the significant performance gains achieved through hardware optimization, particularly with the NPU. The optimized NPU demonstrates substantial improvements in both time per step and total runtime compared to the non-optimized version. While the GPU remains the fastest option for this specific dataset configuration, the NPU's improvements bring it closer to GPU performance, making it a viable option for specialized AI tasks, especially when considering energy efficiency and other factors. The CPU, although the slowest, provides a baseline for understanding the performance enhancements offered by both NPUs and GPUs.
It is worth noting that with the larger dataset, the optimized NPU outperforms the CPU by about 9 times, compared to the earlier 7 times speedup with a much smaller dataset. The Phase II testing supports our hypothesis and demonstrates the potential of the NPU in large-dataset analysis.
2.4.3 Understanding NPU Performance:
Further literature review indicates that the NPU accelerates the operation of neural networks, mainly for real-time computations with low power consumption. NPUs are optimized for certain types of calculations, especially parallel computations, which enables them to handle such workloads more efficiently, as our tests confirmed.
This image shows the overall power consumption of the GPU while running the training code: 74-165 W. In comparison, the NPU consumed only ~26 W while running the model training. This demonstrates that the NPU significantly outperforms the GPU in terms of power efficiency. One possible reason the NPU did not surpass the GPU in speed could be its emphasis on energy efficiency over raw processing power. This focus allows NPUs to perform neural network computations with lower energy consumption, although it may impact their speed compared to the high-performance, power-intensive GPUs.
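For reference, GPU board power can be polled from the command line during a training run; the snippet below is one minimal way to log it (it assumes an NVIDIA GPU with the nvidia-smi utility on the PATH), not the exact monitoring setup behind the screenshot.

import subprocess
import time

# Poll instantaneous GPU board power draw once per second for ten samples
# (requires an NVIDIA GPU and the nvidia-smi utility).
for _ in range(10):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    print(f"GPU power draw: {result.stdout.strip()} W")
    time.sleep(1)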
2.5 Conclusion:
Our results confirm that while GPUs like the GeForce RTX 4060Ti excel in high-performance scenarios, NPUs offer a compelling alternative with their power efficiency and specialization in AI tasks. The optimized NPU demonstrated considerable speed and efficiency improvements, making it a strong candidate for real-time AI applications where power consumption is a critical factor. The UM790 Pro CPU, though the slowest, provided a useful benchmark for evaluating the enhancements achieved by both NPUs and GPUs.
One of the highlights from the results indicates that using the NPU for this model training provides approximately one-third of the power consumption compared to the GPU, even though the computational time is doubled. These findings emphasize the importance of selecting the appropriate hardware based on the specific needs of the application, whether it be raw performance, power efficiency, or specialized AI capabilities.
In the context of cancer screening and genetic data analysis, NPUs present a valuable tool for enhancing accuracy, efficiency, and scalability. Their ability to process large datasets efficiently, combined with lower power consumption, makes them suitable for continuous monitoring and real-time analysis. This ensures that critical healthcare processes can be carried out more effectively and sustainably.
These results underline the need to consider specific application requirements when choosing hardware for AI tasks. The right choice of hardware can significantly impact the success of the project, whether the priority is maximizing performance, minimizing power consumption, or leveraging specialized capabilities.
3. Lessons Learned:
As a team composed mainly of high school students, we are proud to have come this far. Through this project, we learned a great deal about NPU structure and enhanced our AI model development and testing skills. The dedication of our team, with hundreds of hours of work, made this achievement possible. We hope to document our lessons learned here to help others explore AI modeling in the medical industry or big data applications in general.
When it came to making the system work on the NPU without any backup, we faced numerous challenges. It was particularly difficult to allocate tasks between the CPU and NPU in a controlled format. Another significant hurdle was integrating our code with Ryzen AI for seamless operation. This type of code is not supported out of the box by Ryzen AI. We had to read extensive technical documentation on our own to put everything in order. This process involved numerous tests and adjustments to ensure our AI models could utilize all the features offered by the NPU.
Our approach involved a lot of planning and testing. Data preprocessing was one of the most important parts and needed a lot of attention. We had to handle and normalize huge datasets to make them usable for training. We set up strong data cleaning practices to eliminate any noise or artifacts that could affect model accuracy. This was time-consuming and needed a sharp eye to make sure the data going into our models was top quality.
This added a whole new set of challenges to training the AIs. Getting everything ready for the best performance meant fine-tuning hyperparameters and optimizing the training process. We also added custom callbacks for memory management during training. These steps really helped us make the most out of our hardware to meet our needs.
Another critical aspect we focused on was performance evaluation. We needed to design comprehensive tests for the NPU, GPU, and CPU configurations to compare their performance. We analyzed the results to see improvements in training speed and significant reductions in power consumption. This required a deep understanding of the hardware capabilities and the ability to fine-tune the models for optimal performance.
Collaborating with Geneseeq was crucial to our success. The most important part was getting access to larger datasets and validating our models in real-world scenarios. With feedback from Geneseeq, we refined our approach to improve the accuracy and reliability of our AI models. This collaboration was instrumental in pushing the boundaries of what we could achieve.
Finally, here's a video that captures the spirit of our hard work and the milestones we've achieved on this journey:
Fun fact: This logo was generated by running LLama 2 with Ryzen AI on our UM790 Pro mini PC.
4. Acknowledgement:
We want to express our sincere gratitude to the experts from Geneseeq, namely Dr. Xue Wu, Dr. Hua Bao, Mr. Min Wu, and Mr. Xiaoxi Chen, for their invaluable consultation.