TinyML is a field of machine learning that focuses on bringing the power of artificial intelligence to low-power devices. It is particularly useful for applications that require real-time processing. One current challenge in machine learning is locating and gathering suitable datasets. Synthetic data, however, enables the training of ML models in a cost-effective and adaptable manner, eliminating the need for large quantities of real-world data.
In this project, I will show you how to create a baby cry detection system by training a model on the Edge Impulse platform and deploying it to an edge device like the Arduino Nicla Voice. By training a machine learning model on synthetic data, we can distinguish between a baby's cry and background noise.
Here is a sneak peek of what's about to happen:
- Collect a dataset for Edge Impulse using AudioLDM (text-to-audio) and ChatGPT.
- Train the model using Edge Impulse.
- Export the model and inspect it with netron.app.
- Deploy the model to the Arduino Nicla Voice.
- Evaluate and test with real-time data using the Arduino IDE.
The pipeline diagram shows the components and steps involved in deploying a machine learning model that detects two cases, baby crying and background noise, with ChatGPT used to generate the text prompts.
Here is a step-by-step breakdown of the components and their interactions in the pipeline diagram:
- ChatGPT: ChatGPT is the starting point of the pipeline. It generates text prompts for the two cases: baby crying and background noise.
- Text-to-Audio Conversion: After generating the text prompts, we send them to a module that converts text to audio. This module creates audio files corresponding to the prompts for both cases.
- Model Training: The generated audio files are uploaded to the Edge Impulse SaaS platform, a cloud-based platform that provides tools for developing, training, and deploying machine learning models on edge devices like microcontrollers.
- Model Deployment: After training, the machine learning model is deployed to the Arduino Nicla Voice development board, which is designed for building intelligent voice-enabled devices that can process audio and perform machine learning tasks.
- Inference: Once deployed, the machine learning model can process real-time audio input from a microphone. The model can detect whether the input audio represents a baby crying or background noise.
Potentially, the output of the machine learning model can be used to trigger an action, such as turning on a light or sending a notification to a smartphone.
Overview of the Arduino Nicla Voice development board
The Arduino Nicla Voice is a development board created in collaboration with Syntiant. Built around Syntiant's ultra-low-power deep learning processor, the board provides always-on speech, gesture, and motion recognition on the edge.
With its compact size, Nicla Voice can be incorporated into wearable devices, allowing for AI integration while requiring minimal energy consumption. By utilizing the Nicla Voice, you can develop customized voice recognition models and employ them with the board, enabling the Nicla Voice to recognize specific words or phrases by analyzing your voice.
Let’s get started!
Text prompt generation using ChatGPT
Using ChatGPT to generate different prompts can streamline the process of writing prompts for my machine learning model, which has two classes: baby cry and background noise. Generating prompts with ChatGPT saves the time and effort that would otherwise be spent on brainstorming and writing them by hand. It also yields a wider range of diverse prompts, which can improve the accuracy and effectiveness of the machine learning model.
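I generated my prompts interactively in the ChatGPT web interface, but this step can also be scripted. Below is a minimal sketch using the openai Python package (the 0.x ChatCompletion API that was current at the time of writing); the model name and the instruction text are my own choices, not part of the original workflow.
# Hypothetical sketch: scripting prompt generation with the openai package
# (0.x-style ChatCompletion API). Model name and instruction are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # replace with your own OpenAI API key

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Generate 50 short text-to-audio prompts describing "
                   "a baby crying, one per line, without numbering.",
    }],
)

# Collect the generated lines into a Python list of prompts
content = response["choices"][0]["message"]["content"]
prompts = [p.strip() for p in content.splitlines() if p.strip()]
print(prompts)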
Here are my text prompts for the baby crying scenario, generated using ChatGPT:
prompts = [
"Baby Crying",
"Baby crying in bedroom",
"Baby crying loudly",
"Infant crying",
"Newborn crying",
"Crying baby",
"Upset baby",
"Distressed baby",
"Fussy baby",
"Weeping infant",
"Sobbing baby",
"Whimpering baby",
"Wailing baby",
"Bawling baby",
"Crying newborn",
"Tearful baby",
"Bawling infant",
"Mourning baby",
"Bellowing baby",
"Screaming baby",
"Howling baby",
"Squalling baby",
"Yowling baby",
"Crying baby in nursery",
"Wailing infant in bedroom",
"Whimpering baby in crib",
"Sobbing baby in bassinet",
"Crying baby in the dark",
"Upset baby in bed",
"Distressed baby in room",
"Fussy baby in cradle",
"Weeping infant in playpen",
"Sobbing baby in the corner",
"Whimpering baby in the closet",
"Wailing baby in the crib",
"Bawling baby in the nursery",
"Crying newborn in the bedroom",
"Tearful baby in the playroom",
"Bawling infant in the den",
"Mourning baby in the living room",
"Bellowing baby in the kitchen",
"Screaming baby in the bathroom",
"Howling baby in the hallway",
"Squalling baby in the dining room",
"Yowling baby in the family room",
"Crying baby in the middle of the night",
"Wailing infant in the early morning",
"Whimpering baby during naptime",
"Sobbing baby during mealtime",
"Crying baby during bathtime",
"Upset baby during diaper change",
"Distressed baby during playtime",
"Fussy baby during bedtime",
"Weeping infant during storytime",
"Sobbing baby during teething",
"Whimpering baby during vaccination",
"Wailing baby during check-up",
"Bawling baby during colic",
"Crying newborn during feeding",
"Tearful baby during immunization",
"Bawling infant during growth spurt",
"Mourning baby during illness",
"Bellowing baby during teething",
"Screaming baby during reflux",
"Howling baby during ear infection",
"Squalling baby during constipation",
"Yowling baby during sleep regression",
"Crying baby during travel",
"Wailing infant during car ride",
"Whimpering baby during flight",
"Sobbing baby during road trip",
"Crying baby during vacation",
"Upset baby during change of environment",
"Distressed baby during new experiences",
"Fussy baby during unfamiliar situations",
"Weeping infant during loud noises",
"Sobbing baby during separation anxiety",
"Whimpering baby during stranger danger",
"Wailing baby during socialization",
"Bawling baby during weaning",
"Crying newborn during swaddling",
"Tearful baby during bath",
"Bawling infant during burping",
"Mourning baby during pacifier weaning",
"Bellowing baby during crawling",
"Screaming baby during walking",
]
Additionally, using a language model like ChatGPT can help me come up with creative and innovative prompts that I might not have thought of otherwise.
Here are the prompts for the background noise class:
prompts = [
"A hammer is hitting a wooden surface",
"A noise of nature",
"The sound of waves crashing on the shore",
"A thunderstorm in the distance",
"Traffic noise on a busy street",
"The hum of an air conditioning unit",
"Birds chirping in the morning",
"The sound of a train passing by",
"A group of people talking in a crowded room",
"The sound of raindrops hitting a tin roof",
"The buzz of a fluorescent light",
"The sound of footsteps on a wooden floor",
"The crackling of a campfire",
"The whirring of a ceiling fan",
"The sound of a basketball bouncing on concrete",
"A dog barking in the distance",
"The rustling of leaves in the wind",
"The buzzing of a bee or other insect",
"The sound of a church bell ringing",
"The roar of a waterfall",
"The tapping of a keyboard",
"The hiss of a steam engine",
"The clanging of pots and pans in a kitchen",
"The sound of a roaring fire in a fireplace",
"The hum of an electric generator",
"The sound of a lawnmower in the distance",
"The whistling of wind through a window crack",
"The clatter of dishes in a busy restaurant",
"The sound of a helicopter flying overhead",
"The tapping of rain on a metal roof",
"The gentle rustling of a book's pages turning",
"The creaking of a wooden chair",
"The sound of a pencil scratching on paper",
"The chirping of crickets at night",
"The crackling of a vinyl record playing",
"The hissing of an old radio",
"The sound of a pencil sharpener grinding",
"The gurgling of a coffee maker",
"The sound of a ticking clock",
"The roar of an airplane engine",
"The bubbling of a fish tank filter",
"The clanking of dishes being washed in a sink",
"The sound of a typewriter clacking",
"The roar of a lion in the wild",
"The whirring of a drone flying overhead",
"The beeping of a car horn in traffic",
"The sound of a door creaking open",
"The buzzing of a mosquito in the room",
"The sound of a blender mixing ingredients",
"The rumbling of a thunderstorm overhead",
"The tapping of a woodpecker on a tree trunk",
"The rustling of paper being shuffled",
"The sound of a busy office with people talking on the phone and typing on their keyboards",
"The sound of a construction site with heavy machinery and drilling",
"The sound of a dishwasher running in the kitchen",
"The chirping of birds in a forest",
"The sound of a police siren in the distance",
"The whistling of wind through tall grass",
"The sound of a cash register in a busy store",
"The buzzing of a fly or bee flying around",
"The sound of a bicycle bell ringing",
"The crackling of a fire in a fireplace"
]
That’s all for dataset generation!
Install AudioLDM (text-to-audio) for dataset generation
To produce audio files from text, the next step involves a text-to-audio generation tool called AudioLDM, developed by researchers from the University of Surrey and Imperial College London, UK. The tool uses a latent diffusion model to generate high-quality audio from text. To run AudioLDM, you will need a standalone computer with a powerful CPU; a dedicated GPU is recommended but not mandatory. To test the functionality of AudioLDM first, you can try it out online via Hugging Face.
We will now configure our Python environment. For managing virtual environments we'll use virtualenv, which can be installed as follows:
sudo pip3 install virtualenv virtualenvwrapper
To get virtualenv to work we need to add the following lines to the ~/.bashrc file:
nano ~/.bashrc
and add the following lines:
# virtualenv and virtualenvwrapper
export WORKON_HOME=$HOME/.virtualenvs
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
source /usr/local/bin/virtualenvwrapper.sh
To activate the changes, execute the following command:
source ~/.bashrc
Now we can create a virtual environment using the mkvirtualenv command.
mkvirtualenv audioldm -p python
Install PyTorch using pip.
pip3 install torch==2.0.0
Then install audioldm package.
pip3 install audioldm
Then run the command below to generate audio files from the text prompts generated with ChatGPT; the generate.py script can be found in the GitHub code section.
python3 generate.py
You should get the following output:
generated: A hammer is hitting a wooden surface
generated: A noise of nature
generated: The sound of waves crashing on the shore
generated: A thunderstorm in the distance
generated: Traffic noise on a busy street
generated: The hum of an air conditioning unit
generated: Birds chirping in the morning
generated: The sound of a train passing by
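For reference, the core of generate.py is simply a loop over the prompt lists using the audioldm package's Python API. Here is a minimal sketch of how such a script can look; the duration, guidance scale, and output naming are my own choices, so check the actual script in the GitHub code section.
# Minimal sketch of a generate.py-style script using the audioldm package.
# Duration, guidance scale, and file naming here are my own assumptions.
from audioldm import build_model, text_to_audio, save_wave

prompts = [
    "A hammer is hitting a wooden surface",
    "A noise of nature",
    # ... the rest of the ChatGPT-generated prompts
]

model = build_model()  # downloads the pretrained checkpoint on first run

for i, prompt in enumerate(prompts):
    waveform = text_to_audio(
        model,
        prompt,
        duration=5,          # length of each generated sample, in seconds
        guidance_scale=2.5,  # how strongly generation follows the text
    )
    save_wave(waveform, "./output", name=f"sample_{i:04d}")
    print("generated:", prompt)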
Once the WAV audio samples have been collected, they can be fed into the neural network to begin training it to automatically detect whether a baby is crying or background noise is present.
Model training using the Edge Impulse platform
Edge Impulse is a web-based tool that helps us quickly and easily create AI models that can then be used in all kinds of projects. We can create machine learning models in a few simple steps and build custom classifiers with nothing more than a web browser.
Go to the Edge Impulse platform, enter your credentials at Login (or create an account), and start a new project.
Download the Google Speech Commands dataset to obtain data for the background noise class:
wget http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
Upload the synthetic WAV audio files and the background noise class data from the Google Speech Commands dataset. In my case, I uploaded approximately 500 WAV files. You can always add more files later by labeling and uploading them in Data Acquisition and retraining the model. Note that the _background_noise_ folder of the Speech Commands dataset contains a handful of long recordings, while Edge Impulse expects short labeled samples; a small helper like the sketch below can split them into short clips before uploading.
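This is a minimal sketch using only the Python standard library; the paths and the one-second clip length are my own assumptions.
# Sketch: split the long _background_noise_ recordings from the Google
# Speech Commands dataset into 1-second clips for upload to Edge Impulse.
# The paths and clip length are assumptions; adjust them to your layout.
import os
import wave

SRC_DIR = "speech_commands/_background_noise_"
DST_DIR = "background_noise_clips"
CLIP_SECONDS = 1

os.makedirs(DST_DIR, exist_ok=True)

for fname in os.listdir(SRC_DIR):
    if not fname.endswith(".wav"):
        continue
    with wave.open(os.path.join(SRC_DIR, fname), "rb") as src:
        clip_frames = src.getframerate() * CLIP_SECONDS
        bytes_per_frame = src.getsampwidth() * src.getnchannels()
        index = 0
        while True:
            frames = src.readframes(clip_frames)
            if len(frames) < clip_frames * bytes_per_frame:
                break  # drop the short tail at the end of the file
            out_path = os.path.join(DST_DIR, f"{fname[:-4]}_{index:03d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(src.getframerate())
                dst.writeframes(frames)
            index += 1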
Once you have set up all of your classes and are happy with your datasets, it is time to train the model. Navigate to Create Impulse on the left navigation menu.
Select Add a processing block and add Audio (Syntiant), since it is well suited to Syntiant NDP120-based development boards. It converts audio into features based on time and frequency characteristics that help the classifier; a rough illustration of such features follows below. Then select Add a learning block and add Classification with two output classes.
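The sketch below is an illustration only, not the actual Syntiant DSP code: it computes a log-mel spectrogram, a common time-frequency feature for audio classification, using the librosa library. The input file name is hypothetical.
# Illustration only: log-mel features are a typical example of the kind of
# time-frequency representation an audio processing block produces.
# This is NOT the actual Syntiant DSP implementation.
import librosa

# hypothetical input file, resampled to the 16 kHz rate used on-device
y, sr = librosa.load("baby_cry_sample.wav", sr=16000)

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                     n_fft=512, hop_length=256)
log_mel = librosa.power_to_db(mel)

print(log_mel.shape)  # (n_mels, n_frames): frequency bands over time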
Then navigate to Syntiant. Under Syntiant, we will keep the default parameters. Click Save parameters.
Finally, click on the Generate features button. You should get a response that looks like the one below.
Train the model by pressing the Start training button. This process might take around 5-10 minutes, depending on your dataset size. If everything goes correctly, you should see the training results in the Edge Impulse Studio.
We get a validation accuracy of 90.7%. You should not expect 100% accuracy on your training dataset, as that is usually a sign of an overfitted model. Anything greater than 70% is solid model performance. Increasing the number of training epochs can potentially improve this accuracy score.
The .tflite file is our model. The final quantized model file (int8) is around 5 KB in size and achieves an accuracy of almost 90%.
It is always interesting to take a look at a model architecture as well as its input and output formats and shapes. You can use a program like Netron to view the neural network.
Click serving_default_x:0: we observe that the input is of type int8 with shape [1, 1600]. Now let's look at the outputs: we have 2 classes, so the output shape is [1, 2]. Quantization can reduce the performance of the model, as going from a 32-bit floating point to an 8-bit integer representation means a loss of precision.
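If you prefer to check these shapes programmatically rather than in Netron, the TensorFlow Lite interpreter can report them. A short sketch follows; the model file name is an assumption, so use whatever name your exported .tflite file has.
# Sketch: inspecting the quantized model's tensors with the TensorFlow
# Lite interpreter. The model file name is an assumption.
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="ei_model_quantized_int8.tflite")
interpreter.allocate_tensors()

for detail in interpreter.get_input_details():
    print("input:", detail["name"], detail["dtype"], detail["shape"])
for detail in interpreter.get_output_details():
    print("output:", detail["name"], detail["dtype"], detail["shape"])
For this model, it should report an int8 input of shape [1, 1600] and an output of shape [1, 2], matching what Netron shows.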
Once you are done building your model, head over to the Deployment section to deploy it to one of the supported edge devices. ML model deployment is the process of putting a trained and tested model into a production environment, such as an edge device, where it can serve its intended purpose.
Go to the Deployment tab in Edge Impulse and select the firmware for your edge device type; here, that is the Arduino Nicla Voice.
You might see the following log messages:
Total Parameter Memory: 1.375 KB out of 640.0 KB on the NDP120_B0 device.
Estimated Model Energy/Inference at 0.9V: 5.55404 (uJ)
This information is important because it indicates the memory efficiency of the model and whether it can be deployed on resource-limited devices like Arduino Nicla Voice.
A zip file will automatically download to your computer after you click the Build button. Unzip it.
Use the script below that corresponds to your operating system to flash the firmware to the Arduino Nicla Voice:
- Use flash_windows.bat if you are on Windows
- Use flash_mac.command if you are on macOS
- Use flash_linux.sh if you are on Linux
Open the Arduino IDE Serial Monitor. Set the baud rate to 115200. If everything goes correctly, you should see the following:
Hello from Edge Impulse on Arduino Nicla Voice
Compiled on Apr 16 2023 14:14:04
mcu_fw_120_v91.synpkg exist
dsp_firmware_v91.synpkg exist
ei_model.synpkg exist
dsp firmware version: 19.2.0
package version: Model generated by Edge Impulse
pbi version: 3.2.3-0
num of labels: 2
labels: NN0:Baby cry detected, NN0:Background noise
total deployed neural networks: 1
IMU not enabled
Inferencing settings:
Interval: 0.062500 ms.
Frame size: 15488
Sample length: 968 ms.
No. of classes: 2
Classes:
Baby cry detected
Background noise
Starting inferencing, press 'b' to break
Type AT+HELP to see a list of commands.
You should see output similar to the following in the Serial Monitor:
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Background noise
Match: NN0:Baby Cry
Match: NN0:Baby Cry
Match: NN0:Baby Cry
Match: NN0:Baby Cry
Here is a demonstration video of what the final result looks like.
I’ve taken the training data and trained a model in the cloud using the Edge Impulse platform, and that model is now running locally on the Arduino Nicla Voice, so it can be said that the model was successfully deployed to an edge device. The project could be improved by adding a trigger action, such as turning on a light or sending a notification to a smartphone; a rough sketch of such a trigger follows.
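For example, a host-side script can watch the board's serial output with pyserial and react whenever a baby-cry match is reported. The port name and the action taken are assumptions; swap in your own setup.
# Sketch: watching the Nicla Voice serial output from a host machine and
# reacting to baby-cry matches. The port name and the action taken are
# assumptions; replace them with your own setup.
import serial  # pip3 install pyserial

PORT = "/dev/ttyACM0"  # e.g. "COM3" on Windows

with serial.Serial(PORT, 115200, timeout=1) as ser:
    while True:
        line = ser.readline().decode(errors="ignore").strip()
        if not line:
            continue
        print(line)
        if line.startswith("Match:") and "Baby" in line:
            # Replace with a real action: turn on a light over GPIO/MQTT,
            # send a push notification, etc.
            print(">>> Baby cry detected, triggering action...")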
In conclusion, by leveraging the capabilities of TinyML and utilizing synthetic data generated through Text-to-Audio and ChatGPT, it is possible to enhance the efficiency and accuracy of detecting and responding to a baby's cries. The effectiveness of artificial data generation was demonstrated, which eliminates the need for manual dataset search.
Feel free to leave a comment below. Thank you for reading!