Note: This project aims to be an instrument that helps doctors diagnose voice pathology and a basis for future developments. It will not in any way substitute medical advice. If you have any doubt about your health status, please consult a doctor; use this application for testing purposes only.
Introduction
This project aims to identify different types of pathology from a few seconds of vowel sound. In particular, unhealthy voices can result from pathologies such as dysphonia and laryngitis.
The aim is to build an application capable of collecting 5 seconds of the vowel "a" with an embedded device and preprocessing the recording to highlight the most important features, which help the AI algorithm detect the response. Since the AI algorithm has to be trained, I needed a complete and well-structured dataset from which it could learn the specific behaviour of the problem.
For this purpose, I used the VOICED dataset (VOice ICar fEDerico II) (Cesari U. et al., 2021), containing 208 voice samples (150 pathological, 58 healthy), and the signals consist of 5 seconds of vowel 'a' vocalization without interruption.
The signals have been preprocessed, and a Support Vector Machine has been applied for the binary classification:
- A Butterworth filter has been applied to select only the voice frequency band,
- A Fast Fourier Transform has been applied to clean the sound from background noise, as reported in the original data preprocessing (a sketch of these first two steps follows this list),
- Each signal of the dataset has been converted into an image with the Gramian Angular Field technique, producing greyscale images,
- Finally, a Histogram of Oriented Gradients has been applied as a feature extractor.
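To make the first two steps concrete, here is a minimal sketch of a voice-band Butterworth filter and an FFT-based check with SciPy. The sampling rate of 8000 Hz and the 70-3400 Hz pass band are illustrative assumptions, not the exact values of the original preprocessing, and raw_signal stands for a 1-D array of samples:
import numpy as np
from scipy import signal
fs = 8000                      # sampling rate in Hz (assumed)
low, high = 70, 3400           # voice pass band in Hz (illustrative)
# Band-pass Butterworth filter in second-order sections for numerical stability
sos = signal.butter(4, [low, high], btype='bandpass', fs=fs, output='sos')
filtered = signal.sosfiltfilt(sos, raw_signal)
# Inspect the spectrum via FFT to verify the background noise is reduced
spectrum = np.abs(np.fft.rfft(filtered))
freqs = np.fft.rfftfreq(filtered.size, d=1/fs)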
import warnings
warnings.filterwarnings('ignore')
import os
import glob
from tqdm import tqdm
from PIL import Image
import pandas as pd
import numpy as np
#from scipy import signal
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from pyts.image import GramianAngularField
import cv2 #<- OPENCV HERE!
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_predict
from joblib import dump
Exploratory Data Analysis
The following code has been completely developed in Python with a Visual Studio Code notebook, which you can find in my GitHub repository linked below this page.
# Load the 5-second signal of subject 1 ("path" is a placeholder for the file location)
df1 = pd.read_csv("path", delimiter='\t')
plt.plot(df1)
plt.show()
The plot above represents the five-second "a" vowel signal of subject 1: the subject was a 32-year-old researcher with hyperkinetic dysphonia (VHI: 15, RSI: 5).
As we can see and as reported in the original dataset description, the signals have been filtered and cleaned from background noise.
In order to use one of the most powerful feature extractors from the OpenCV library, I need to transform the sequence into an image while preserving the relevant details.
The Gramian Angular Field technique was therefore used to transform the signal sequence into an image representing the correlation of each time step with past and future points. For this purpose, the summation GAF has been applied, resulting in a single-channel image of shape 128 x 128.
# pyts expects shape (n_samples, n_timestamps), so transpose the column vector
signal2 = np.transpose(signal1)
gasf = GramianAngularField(method='summation', image_size=128)
img = gasf.transform(signal2)
# Rescale the image to [0, 255] and convert to 8-bit greyscale
min_val = np.min(img)
max_val = np.max(img)
img = 255 * ((img - min_val) / (max_val - min_val))
img = img.astype(np.uint8)
img = img[0]  # drop the sample axis: (1, 128, 128) -> (128, 128)
plt.imshow(img, cmap='gray')
plt.show()
In order to detect whether the image obtained from the 5-second signal is healthy or pathological, I will apply a Histogram of Oriented Gradients as a feature extractor to detect the meaningful parts of the image, with the following hyperparameters (see the sketch after this list):
- a window size of 128 x 128 to match the entire image,
- a cell size of 2 x 2, i.e. 4 pixels per cell,
- a block size of 8 x 8 cells, i.e. 64 cells per block,
- a stride of 1 cell to move across all the cells and capture as many details as possible,
- 18 bins for the histogram.
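A minimal sketch of how these hyperparameters could map to OpenCV's HOGDescriptor, assuming the block and stride above are expressed in cells (OpenCV takes all sizes in pixels):
win_size = (128, 128)    # window matching the whole image
cell_size = (2, 2)       # 4 pixels per cell
block_size = (16, 16)    # 8 x 8 cells of 2 x 2 pixels each
block_stride = (2, 2)    # stride of one cell
n_bins = 18              # histogram bins
hog = cv2.HOGDescriptor(win_size, block_size, block_stride, cell_size, n_bins)
features = hog.compute(img)  # img: the 128 x 128 uint8 image from the GAF step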
The following code uses different HOG hyperparameters, chosen only to simplify the visualization. Credits: Dahi Nemutlu.
hog_descriptor_reshaped = features.reshape(15, 15, 2, 2, 9).transpose((1, 0, 2, 3, 4))
# Create an array that will hold the average gradients for each cell
ave_grad = np.zeros((16, 16, 9))
# Create an array that will count the number of histograms per cell
hist_counter = np.zeros((16, 16, 1))
# Add up all the histograms for each cell and count the number of histograms per cell
for i in range(2):
    for j in range(2):
        ave_grad[i:15 + i, j:15 + j] += hog_descriptor_reshaped[:, :, i, j, :]
        hist_counter[i:15 + i, j:15 + j] += 1
# Calculate the average gradient for each cell
ave_grad /= hist_counter
# Calculate the total number of vectors we have in all the cells.
len_vecs = ave_grad.shape[0] * ave_grad.shape[1] * ave_grad.shape[2]
# Create an array that has num_bins equally spaced between 0 and 180 degrees in radians.
deg = np.linspace(0, np.pi, num_bins, endpoint=False)
# Each cell will have a histogram with num_bins. For each cell, plot each bin as a vector (with its magnitude
# equal to the height of the bin in the histogram, and its angle corresponding to the bin in the histogram).
# To do this, create rank 1 arrays that will hold the (x,y)-coordinate of all the vectors in all the cells in the
# image. Also, create the rank 1 arrays that will hold all the (U,V)-components of all the vectors in all the
# cells in the image. Create the arrays that will hold all the vector positions and components.
U = np.zeros((len_vecs))
V = np.zeros((len_vecs))
X = np.zeros((len_vecs))
Y = np.zeros((len_vecs))
# Set the counter to zero
counter = 0
# Use the cosine and sine functions to calculate the vector components (U,V) from their magnitudes. Remember the
# cosine and sine functions take angles in radians. Calculate the vector positions and magnitudes from the
# average gradient array
for i in range(ave_grad.shape[0]):
    for j in range(ave_grad.shape[1]):
        for k in range(ave_grad.shape[2]):
            U[counter] = ave_grad[i, j, k] * np.cos(deg[k])
            V[counter] = ave_grad[i, j, k] * np.sin(deg[k])
            X[counter] = (cell_size[0] / 2) + (cell_size[0] * i)
            Y[counter] = (cell_size[1] / 2) + (cell_size[1] * j)
            counter = counter + 1
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
# Display the image
ax1.set(title='Grayscale Image')
ax1.imshow(img, cmap='gray')
# Plot the feature vector (HOG Descriptor)
ax2.set(title='HOG Descriptor')
ax2.quiver(Y, X, U, V, color='white', headwidth=0, headlength=0, scale_units='inches', scale=3)
ax2.invert_yaxis()
ax2.set_aspect(aspect=1)
ax2.set_facecolor('black')
The plot above represents the magnitude and direction of the gradients for each block, highlighting the most important features.
In particular, the HOG extractor divides the image into blocks, in which histograms of oriented gradients are computed. It highlights two important components through arrows, which represent the gradients (the weight of each pixel relative to the whole image):
- the orientation of the local gradients on the image, in terms of luminosity and colour, helps to identify the structure of the image,
- the magnitude of the gradient in that direction is represented through the length of the arrow.
Those extracted features are read as numbers by the AI algorithm, which learns from them to identify the differences between healthy and pathological voices.
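To make the two components concrete, here is a minimal sketch computing per-pixel gradient magnitude and unsigned orientation with Sobel filters, assuming img is the 128 x 128 greyscale GAF image from above:
# Gradients along x and y (1-pixel Sobel kernels, as HOG uses internally)
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=1)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=1)
# Arrow length (magnitude) and unsigned orientation in [0, 180) degrees
magnitude = np.sqrt(gx ** 2 + gy ** 2)
orientation = np.rad2deg(np.arctan2(gy, gx)) % 180
# Each pixel then votes into its cell's histogram bin, weighted by magnitude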
Preprocessing
images = []
paths = []
fig,ax = plt.subplots(1,2,figsize = (10, 5))
ax = ax.ravel()
# classes holds the two subfolder names, e.g. ['healthy', 'pathological']
for idx, i in enumerate(classes):
    img_base_path = "path" + str(i)  # path of the subfolder for each class
    timages = os.listdir(img_base_path)  # list of images inside each subfolder
    images_path_to_display = os.path.join(img_base_path, str(timages[0]))
    paths.append(images_path_to_display)
    img = Image.open(images_path_to_display)
    images.append(img)
    image = images[idx]
    ax[idx].axis('off')
    ax[idx].imshow(image, cmap='gray')
    ax[idx].set_title(str(i))
plt.suptitle('Image Example')
plt.show()
Here is an example from the final image dataset: on the left, a randomly picked healthy image; on the right, a randomly picked pathological image.
As the original dataset was quite imbalanced, with 58 healthy signals and 150 pathological signals, I needed to balance the images to prevent the model from learning features predominantly from one class.
Code by Ricardo Zuccolo https://medium.com/@ricardo.zuccolo
aug_images = []
# neg_imgs: paths of the minority-class images to augment
for i, path in enumerate(neg_imgs):
    # read image
    img = mpimg.imread(path)
    # append original image
    aug_images.append(img)
    # apply one random transformation (increase the range to generate more copies)
    for j in range(1):
        # transform (rotation, shear, translation)
        trans_img = affine_transform(img, 30, 2, 5)
        # append to list
        aug_images.append(trans_img)
    # display every tenth original image next to its augmented copy
    if i % 10 == 0:
        plt.figure(figsize=(8, 4))
        plt.subplot(1, 6, 1)
        plt.imshow(img)
        plt.axis('off')
        plt.title('Original')
        for k in range(2, 3):
            plt.subplot(1, 6, k)
            plt.imshow(aug_images[-1 * (k - 1)])
            plt.axis('off')
        plt.show()
        plt.close()
One of the most common ways to augment images is to apply transformations such as random cropping, horizontal and vertical flipping, and rotation, creating varied images without losing any information.
In particular, a set of geometric transformations is applied to change the position, size and orientation of the image (the affine_transform call above implements these; a sketch follows this list):
- a rotation of 30°,
- a shear along its axis by 2,
- a translation along its axis by 5.
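The affine_transform helper belongs to the credited code and is not shown here; the following is a hypothetical re-implementation with OpenCV, assuming the three arguments are bounds for a random rotation, shear and translation:
def affine_transform(img, rot_deg, shear, trans):
    # Hypothetical sketch of the credited helper: rotation, shear and
    # translation drawn uniformly within the given bounds.
    rows, cols = img.shape[:2]
    angle = np.random.uniform(-rot_deg, rot_deg)
    M_rot = cv2.getRotationMatrix2D((cols / 2, rows / 2), angle, 1.0)
    out = cv2.warpAffine(img, M_rot, (cols, rows))
    sh = np.random.uniform(-shear, shear) / cols        # shear factor
    tx, ty = np.random.uniform(-trans, trans, size=2)   # translation in pixels
    M_shear = np.float32([[1, sh, tx], [0, 1, ty]])
    return cv2.warpAffine(out, M_shear, (cols, rows))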
At this point, the dataset is ready to feed the Support Vector Machine algorithm, which will analyze the extracted features and predict the response:
# Initialize variables
img_folder_1 = 'path1'
img_folder_2 = 'path2'
imgs = []
labels = []
# Process images in folder 1 (label 0)
for img_name in os.listdir(img_folder_1):
    img_path = os.path.join(img_folder_1, img_name)
    img = cv2.imread(img_path)
    if img is not None:  # check before resizing: imread returns None on failure
        img = cv2.resize(img, (128, 128))
        img = hog.compute(img).flatten()  # flatten so imgs becomes 2-D (samples x features)
        imgs.append(img)
        labels.append(0)
# Process images in folder 2 (label 1)
for img_name in os.listdir(img_folder_2):
    img_path = os.path.join(img_folder_2, img_name)
    img = cv2.imread(img_path)
    if img is not None:
        img = cv2.resize(img, (128, 128))
        img = hog.compute(img).flatten()
        imgs.append(img)
        labels.append(1)
# Convert the lists to NumPy arrays
imgs = np.array(imgs)
labels = np.array(labels)
# Print the shape of the arrays to verify
print("Images shape:", imgs.shape)
print("Labels shape:", labels.shape)
scaler = MinMaxScaler()
x = scaler.fit_transform(imgs)
clf = SVC(random_state=46, kernel='linear')  # gamma is ignored by the linear kernel
cv = StratifiedKFold(shuffle=True, random_state=42)
param_grid = [{'C': [0.0001, 0.001, 0.01, 1, 10, 100]}]
search = HalvingGridSearchCV(clf, param_grid, cv = cv, random_state=42, scoring = 'f1', verbose=3, refit=True).fit(x, labels)
# Out-of-fold predictions for the confusion matrix (cross_val_predict was imported above)
y_pred = cross_val_predict(search.best_estimator_, x, labels, cv=cv)
ConfusionMatrixDisplay.from_predictions(labels, y_pred, cmap='coolwarm', normalize = 'true')
plt.show()
The plot above represents the confusion matrix, which compares the true and predicted labels. As we can see, 86% of the healthy images were correctly predicted, against 88% of the pathological images.
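Since classification_report is imported above, per-class precision and recall can be printed from the same out-of-fold predictions; the class names here are assumptions based on the label assignment earlier:
# Per-class precision, recall and F1 from the cross-validated predictions
print(classification_report(labels, y_pred, target_names=['healthy', 'pathological']))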
The model is now ready to be deployed!
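The dump import from joblib at the top suggests persisting the fitted model for deployment; a minimal sketch with an illustrative file name:
# Save the best estimator found by the halving grid search
dump(search.best_estimator_, 'voice_pathology_svm.joblib')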
Watch it in action!