Sitting in front of a computer for extended periods is potentially unhealthy, especially if the workspace is not adjusted properly. This is even more likely in home offices, where people are responsible themselves for maintaining a healthy environment.
Even with the best intentions, I observed that with increased stress levels and negative emotions, my posture gets worse. What follows are stiff shoulders and a stiff neck, which, if fostered over the years, subtly decrease concentration.
Idea
The idea is to create an assistant leveraging AI technology to
- verify the user's pose using a webcam (which is hardware that nearly anyone has available) and estimate the quality (regression task)
- verify the user's mood using the capture of the face
- maintain a context of pose, mood, and activity (plus some more context like computer usage statistics) and use it to generate feedback, advice, or actions with generative AI
- feedback could range from simple posture reminders to music recommendations to calm down, a small exercise, or maybe a joke to cheer up (it depends on what the AI generates)
Use existing solutions where feasible and combine them to achieve the above to deliver value to the user.
Research
Pose recognition projects
I searched the web with Google, GitHub projects with GitHub search, academic papers on arxiv.org and paperswithcode.com, and checked for already available apps.
There are different approaches based on the sensor, e.g. install a force sensor inside the chair, use a depth camera, use an IMU on the body or a simple web camera.
Projects using a simple web camera either used existing models to detect pose key landmarks and calculated angles between these landmarks, or created and trained their own dedicated deep-learning models.
Existing apps didn't tell much about how they worked. Some required additional hardware. Overall, no inspiration from here.
So my conclusion was to go with the "use existing models for pose key landmarks and calculate angles between these landmarks" approach.
Model for pose recognition
I had a look at several solutions for pose evaluation. They were Yolo Pose, MediaPipe and OpenPose.
- My favourite was MediaPipe as it seemed to be a high-performance, lightweight approach with good results. It uses a two-step approach where it detects the presence of a body and then estimates the key points. It even provides an estimate of the depth. However, Ryzen AI Software requires a quantized ONNX model running on the Vitis AI execution provider. The model was not convertible to ONNX and the creators did not recommend converting it (answered in GitHub issues). Furthermore, the details of the model output are not obvious; they are only explained in a GitHub issue. Ultimately, converting and running inference on the pose model alone meant that I had to pull it out of the surrounding processing, such as tracking the body across video frames. The benefit of a super easy-to-use pose detection was nullified and performance decreased.
- Yolo Pose was easy to convert to ONNX and with some analysis of the open source code of the usage samples, I could make it run within Ryzen AI software.
- Since Yolo Pose worked out, I skipped the evaluation of OpenPose.
Model for face recognition
An online search led me to the Deepface project. However, due to the ONNX model requirements, I decided to use a model from AMD's Hugging Face models since they are ready-made for Ryzen AI.
Model for emotion detection (using facial expression)
Since I wanted a state-of-the-art approach that I could reuse, I checked out paperswithcode.com.
However, I recognized that recent data sets are not freely available. They (e.g. AffectNet, RAF-DB) can only be requested with an academic email address.
Therefore I had to choose what I could get easily: FER2013 (the Facial Expression Recognition 2013 dataset). I checked kaggle.com for contests and highly voted solutions and selected https://www.kaggle.com/code/drcapa/facial-expression-eda-cnn.
LLM Model
At the time of the research, two new models had just been released: Llama3 and Phi3. I tried to get Llama3 running with the Ryzen AI Software. However, I ran into several issues, most likely because the Hugging Face Optimum library was not fully supported in the version used by the Ryzen AI workspace. With the latest change sets and some more trial-and-error debugging, I think I got it running.
However, due to the lack of NPU utilization monitoring, I could not be sure if it really worked. Experience showed that even if there is no error, there can be a fallback to the CPU. The CPU utilization was at 60% when running Llama3. Participants on Discord who said they also got it running on the NPU confirmed this CPU utilization.
See also https://huggingface.co/dahara1/llama3-8b-amd-npu and Running LLM on AMD NPU Hardware.
The Hardware Monitor software showed at least 0.3% utilization on the NPU. However, it never went above 1% for anything, including the Ryzen AI samples. AMD did not provide any concrete solution for measuring utilization. It seems that one laptop manufacturer includes software to measure it, but not the hardware that was distributed for this contest. Generic advice to check the AMD Devzone did not lead to results.
I don't remember for which model I took this screenshot, Llama2 or Llama3, but it should look like the above, where you see MATMULTINTEGER on the NPU.
Also, Llama3 took minutes to start up and used lots of memory (>10 GB, I don't remember the exact value).
I decided that it was not worth the effort to continue with Llama3 since at some point of time, AMD would release an update supporting Llama3.
Update end of July: There is now a model on hugging face since mid-July. A new release of the Ryzen AI Software is expected for the end of July, after the contest deadline.
Conclusion: I went on with the Llama2 PyTorch flow example as described in the Ryzen AI Software documentation.
Overall Architecture
The project is programmed in Python, as this is the language of choice for AI-related tasks. Most libraries and projects utilize Python, and the Ryzen AI Software provides a Python framework.
- OpenCV is used to capture images.
- Trigger sends requests to capture images periodically or on debugging keyboard input
- Camera Input captures an image from the web camera
- Pose Score Regression detects pose landmarks and calculates a score from certain angles (score calculation on the playground, to be integrated)
- Face Detection detects faces on the image
- Emotion Classification detects emotions in the faces
- Context Manager collects and keeps a history of the pose and emotions over time (prompt on the playground, to be integrated)
- AI Assistant asks an LLM for user feedback with a prompt template
- Feedback provides the user feedback to the user (to be implemented)
Communication
The modules need to run in different Python virtual environments since the environment setup differs between Ryzen AI for LLMs and for regular deep learning models.
- The IPC (Interprocess communication) method is JSON-based MQTT because it is easy to observe the messages for debugging, easy to inject messages for testing and easy to set up. It certainly has some overhead and performance is not optimal. Therefore it could be replaced in a later stage of development.
- Each module subscribes to topics as its input and publishes its output; a minimal sketch of this pattern follows below.
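To illustrate the pattern, here is a minimal sketch of the JSON-over-MQTT communication, assuming the paho-mqtt library (1.x client API) and a broker running on localhost; the topic names match the modules described below, everything else is illustrative.
import json
import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    # Every module decodes the JSON payload of the topics it subscribed to ...
    payload = json.loads(msg.payload.decode("utf-8"))
    print(f"received on {msg.topic}: {payload}")

client = mqtt.Client()          # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect("localhost")
client.subscribe("pose")
client.subscribe("emotion")
# ... and publishes its own results as JSON to its output topic.
client.publish("pose", json.dumps({"score": 4}))
client.loop_forever()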
Modules
Common
A common module abstracts the connection and handling of MQTT and, for the inference modules, the import and handling of the ONNX inference calls.
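As a rough illustration of the inference part of this common module, the sketch below wraps the creation of an ONNX Runtime session with the Vitis AI execution provider, mirroring the session setup shown in the inference code later in this article; the class name and the default config path are placeholders.
import onnxruntime

class OnnxModel:
    def __init__(self, model_path, vaip_config="resources/ryzen-ai/vaip_config.json"):
        # Quantized ONNX model running on the Vitis AI execution provider;
        # operators it cannot handle fall back to the CPU provider.
        self.session = onnxruntime.InferenceSession(
            model_path,
            providers=["VitisAIExecutionProvider"],
            provider_options=[{"config_file": vaip_config}],
        )
        self.input_name = self.session.get_inputs()[0].name

    def run(self, input_tensor):
        # Returns the raw model outputs; pre- and post-processing stay in each module.
        return self.session.run(None, {self.input_name: input_tensor})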
Trigger
The trigger listens for keyboard input to inject MQTT messages for debugging and testing. For the running system, it creates a periodic trigger to request an image.
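A minimal sketch of the periodic part, assuming the MQTT client from the communication sketch above; the interval is an arbitrary placeholder.
import json
import time

def run_trigger(client, interval_s=60):
    # client: connected MQTT client; interval_s is an assumed check interval.
    # The real module additionally listens for keyboard input to inject
    # debugging messages such as getSample.
    while True:
        client.publish("getImage", json.dumps({}))
        time.sleep(interval_s)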
CameraInput
The camera input uses OpenCV to access the web camera of the PC and captures an image when a message arrives on the topic getImage. The image is encoded into bytes and sent to the topic image. On the topic getSample, a sample image is read from a file and sent instead.
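The capture-and-encode step could look roughly like the following OpenCV sketch; the default webcam index and the MQTT client parameter are assumptions.
import cv2

def capture_and_publish(client):
    # client: connected MQTT client from the communication sketch above.
    cap = cv2.VideoCapture(0)                  # default webcam
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return
    ok, buffer = cv2.imencode(".jpg", frame)   # encode the frame into JPEG bytes
    if ok:
        client.publish("image", buffer.tobytes())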
FaceClassification
The face classification takes an image and detects faces on it. It uses the RetinaFace model available from AMD on Hugging Face at https://huggingface.co/amd/retinaface. The image preprocessing, i.e. resizing and reshaping, is taken from the utils.py code as is (as is the vis function from widerface_onnx_inference.py). The model expects an image of size 608x640 pixels. In the post-processing, the model's results are transformed back to fit the original image. The bounding box of the most probable face is taken and cropped. The face is converted to JPEG and sent to the topic face.
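The final crop-and-encode step is roughly the following; the layout of the detections array ([x1, y1, x2, y2, score, ...] rows in original image coordinates) is an assumption about the RetinaFace post-processing output, so treat this as a sketch rather than the exact code.
import cv2
import numpy as np

def crop_best_face(image, detections):
    # Pick the detection with the highest confidence score (assumed at column 4).
    best = detections[np.argmax(detections[:, 4])]
    x1, y1, x2, y2 = best[:4].astype(int)
    h, w = image.shape[:2]
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w, x2), min(h, y2)
    face = image[y1:y2, x1:x2]
    ok, jpg = cv2.imencode(".jpg", face)       # payload for the face topic
    return jpg.tobytes() if ok else None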
EmotionClassification
The emotion classification takes the faces from the face classification, preprocesses each image by scaling and reshaping it to fit the model, and runs inference on the emotion classification model built based on https://www.kaggle.com/code/drcapa/facial-expression-eda-cnn. The output is an emotion label, one of angry, disgust, fear, happy, sad, surprise or neutral, sent to the topic emotion.
PoseRecognition
The pose recognition takes an image and detects body landmarks using the Yolov8m pose model downloaded from https://docs.ultralytics.com/tasks/pose/. I converted the model into ONNX format and quantized it so that it can run on the AMD NPU. Here too, scaling and reshaping the image is required. Then the most probable landmarks are extracted from the results and the angles between (knee, hip, shoulder), (ear, hip, shoulder) and (hip, shoulder, ear) are calculated. The more these angles deviate from a good position (roughly 0 or 90 degrees, depending on the angle), the lower the resulting score. The pose score is sent to the topic pose. (From the angle calculation on, this is only implemented in the playground.py file; a sketch follows below.)
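The angle calculation itself is plain vector geometry. The sketch below computes the angle at the middle point of a landmark triple and maps the deviation from a target angle to a 0..5 score; the target angles and the scaling are assumptions, the values actually used live in playground.py.
import numpy as np

def angle(a, b, c):
    # Angle at point b (in degrees) between the segments b->a and b->c.
    ba = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bc = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc) + 1e-9)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def pose_score(landmarks, angle_specs):
    # landmarks: list of (x, y) keypoints from the pose model.
    # angle_specs: [((a, b, c), target_degrees), ...] for the knee-hip-shoulder,
    # ear-hip-shoulder and hip-shoulder-ear triples; the targets are assumptions here.
    deviations = [abs(angle(landmarks[a], landmarks[b], landmarks[c]) - target)
                  for (a, b, c), target in angle_specs]
    # Map the average deviation to a 0..5 score (larger is better); the scale is illustrative.
    return float(np.clip(5.0 - np.mean(deviations) / 18.0, 0.0, 5.0))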
Context
The context module collects data on the topics emotion and pose. It maintains a history of the last 60 values received. An average score is calculated where older values are given a lower weight (sketched below). Additionally, the context module collects statistics on the time since the last login and on the idle time to estimate the user's active time and break time. Based on a prompt template, the context module fills in the statistics on emotion, pose, active and break time and sends the result to the topic aiRquest.
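A sketch of the weighted history, assuming a simple exponential decay for the older values; the decay factor and the statistics on login and idle time are outside this sketch.
from collections import deque

class History:
    def __init__(self, maxlen=60, decay=0.95):
        # Keep only the last `maxlen` values; `decay` < 1 down-weights older entries.
        self.values = deque(maxlen=maxlen)
        self.decay = decay

    def add(self, value):
        self.values.append(value)

    def weighted_average(self):
        # The newest value gets weight 1, the previous one decay, then decay**2, ...
        weights = [self.decay ** i for i in range(len(self.values))][::-1]
        total = sum(weights)
        return sum(w * v for w, v in zip(weights, self.values)) / total if total else 0.0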
Assistant
The assistant module consumes the aiRquest topic and forwards the request to an LLM. The LLM is based on the Llama2 model set up according to the instructions in the Ryzen AI documentation. It uses the PyTorch-based flow with 4-bit quantization, flash attention and lm_head enabled. The result is published to the topic aiResponse.
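The generation step boils down to: read the request, run the model, publish the text. The sketch below assumes a standard Hugging Face tokenizer/model pair as a stand-in for the AWQ-quantized Llama2 loaded by the Ryzen AI transformers flow, and assumes the aiRquest payload carries the filled prompt in a "prompt" field; the actual loading code and payload layout differ in the real module.
import json

def handle_ai_request(client, tokenizer, model, payload):
    # payload: raw JSON bytes from the aiRquest topic ("prompt" field is an assumption).
    prompt = json.loads(payload)["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    client.publish("aiResponse", json.dumps({"response": text}))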
Feedback
The feedback module shall use the aiResponse and present it to the user. To be implemented.
Running on AMD's NPU
What is necessary to run on the NPU?
- It is required to have an NPU, to enable it in the BIOS (which was not the default for the Minisforum UM970Pro) and to install the drivers.
- It is required to use Ryzen AI Software 1.1 (its documentation describes dependencies like drivers and the development environment). It works only on Windows 11.
- Models must be converted to ONNX format.
- Models must be quantized to int8 format.
- Inference must run on ONNX Runtime with the Vitis AI execution provider (a quick check follows below).
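A quick way to check the last requirement in a given Python environment is to list the execution providers exposed by the installed ONNX Runtime; this uses only the standard onnxruntime API.
import onnxruntime
# After installing the Ryzen AI wheels, the Vitis AI EP should show up here,
# next to the always-present CPUExecutionProvider.
print(onnxruntime.get_available_providers())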
Setup of the environment (CNN-based models)
Installation according to https://ryzenai.docs.amd.com/en/latest/inst.html. For details and updates check out the link. Here are the main steps.
Added to PATH
C:\Program Files\CMake\bin
C:\Windows\System32\AMD
Installed PowerShell Prompt in Anaconda (Miniconda failed somewhere later on)
In Anaconda Navigator, launch PowerShell
cd C:\tools\ryzen-ai-sw-1.1
.\install.bat -env ai-assistant
-> the ai-assistant environment appears in Anaconda Navigator for selection
Do the quick test
cd C:\tools\ryzen-ai-sw-1.1\quicktest\
python .\quicktest.py
Setup of the environment (Transformer based models == LLMs)
Note that the transformer-based flow is a completely different environment with different requirements, scripts and configurations. Therefore a different Python virtual environment is necessary. I tried to unify both environments into one, but I did not succeed.
Setup according to https://github.com/amd/RyzenAI-SW/blob/main/example/transformers/README.md. For details and updates check out the link. Here are the main steps.
Create the Python virtual environment
cd c:\tools\RyzenAI-SW\
cd example\transformers
conda env create --file=env.yaml
conda activate ryzenai-transformers
Obtain precomputed scales
git lfs install
cd ext
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
Activate the environment. Pitfall: This only works in the Windows Command Prompt, not in PowerShell!
cd ..
setup.bat
Build dependencies. Pitfall: For this step, you must have installed the C++ workload when installing Visual Studio!
pip install ops\cpp --force-reinstall
Install ONNX EP for running ONNX-based flows
pip install onnxruntime
cd C:\tools\ryzen-ai-sw-1.1\voe-4.0-win_amd64
pip install voe-0.1.0-cp39-cp39-win_amd64.whl
pip install onnxruntime_vitisai-1.15.1-cp39-cp39-win_amd64.whl
python installer.py
Continue with the instructions of one of the models. In my case: Run Llama2 Model in Pytorch
Obtain weights of llama-2 from https://llama.meta.com/llama-downloads/ and run
python C:\tools\anaconda3\envs\ryzenai-transformers\Lib\site-packages\transformers\models\llama\convert_llama_weights_to_hf.py --input_dir c:\tools\llama --model_size 7B --output_dir c:\tools\RyzenAI-SW\example\transformers\models\llama2\llama-2-wts-hf\7B
Activate the environment (if not already active by the step above)
conda activate ryzenai-transformers
cd c:\tools\RyzenAI-SW\example\transformers
setup.bat
Quantize and run the sample
cd models\llama2
python run_awq.py --w_bit 4 --task quantize --lm_head --flash_attention
python run_awq.py --task decode --target aie --w_bit 4 --lm_head --flash_attention
My output looked like this
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(k_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(v_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(o_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:11008, bias:None, device:aie, w_bit:4 group_size:128 )
(up_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:11008, bias:None, device:aie, w_bit:4 group_size:128 )
(down_proj): ryzenAI.QLinearPerGrp(in_features:11008, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
**** Model size: 532.512MB
....
Example#:1 Prompt-len:7 New-tokens-generated:11 Total-time:3.888s Prefill-phase:600.422ms Time/token:328ms Tokens/sec:3.1
Example#:2 Prompt-len:5 New-tokens-generated:11 Total-time:3.722s Prefill-phase:525.616ms Time/token:319ms Tokens/sec:3.1
Example#:3 Prompt-len:8 New-tokens-generated:11 Total-time:3.867s Prefill-phase:563.654ms Time/token:329ms Tokens/sec:3.0
Example#:4 Prompt-len:8 New-tokens-generated:11 Total-time:3.721s Prefill-phase:556.358ms Time/token:315ms Tokens/sec:3.2
Example#:5 Prompt-len:6 New-tokens-generated:11 Total-time:3.881s Prefill-phase:525.240ms Time/token:335ms Tokens/sec:3.0
Example#:6 Prompt-len:5 New-tokens-generated:11 Total-time:3.685s Prefill-phase:510.104ms Time/token:316ms Tokens/sec:3.2
Example#:7 Prompt-len:9 New-tokens-generated:11 Total-time:4.656s Prefill-phase:972.228ms Time/token:367ms Tokens/sec:2.7
Example#:8 Prompt-len:8 New-tokens-generated:11 Total-time:3.724s Prefill-phase:575.125ms Time/token:314ms Tokens/sec:3.2
Example#:9 Prompt-len:9 New-tokens-generated:11 Total-time:4.097s Prefill-phase:971.000ms Time/token:311ms Tokens/sec:3.2
Example#:10 Prompt-len:7 New-tokens-generated:11 Total-time:3.681s Prefill-phase:538.271ms Time/token:313ms Tokens/sec:3.2
Model preparations
Face detection
AMD provides a ready-to-use RetinaFace model optimized for the NPU at the Hugging Face link above. The repository includes code to run inference for verification. I reused this model and extracted the minimal steps needed to run inference in order to integrate it into my software.
Emotion Classification
The emotion classification model is created (steps 1 and 2) according to https://www.kaggle.com/code/drcapa/facial-expression-eda-cnn. If this is useful for you, give them an upvote.
1. Unzip the training data
import os
import zipfile
import tarfile
# Unzip the training data for the emotion recognition model
zip_path = 'challenges-in-representation-learning-facial-expression-recognition-challenge.zip'
data_path = 'tmp/faces/'
tar_path = os.path.join(data_path, 'fer2013.tar.gz')
icml_face_data_path = os.path.join(data_path, 'icml_face_data.csv')
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_path)
with tarfile.open(tar_path, 'r:gz') as tar_ref:
    tar_ref.extractall(data_path)
2. Model definition and training
import os
import pandas as pd
import numpy as np
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
# pip install pandas keras tensorflow
from keras import models
from keras.optimizers import Adam
from keras.layers import Dense, Flatten, Conv2D, MaxPool2D
from keras.utils import to_categorical
data_path = 'tmp/faces/'
icml_face_data_path = os.path.join(data_path, 'icml_face_data.csv')
model_path = os.path.join(data_path, 'ferPlus-drcapa.h5')
# Load the image data with labels.
data = pd.read_csv(icml_face_data_path)
# We define some helper functions for preparing and plotting the data.
def prepare_data(data):
    """ Prepare data for modeling
        input: data frame with labels and pixel data
        output: image and label array """
    image_array = np.zeros(shape=(len(data), 48, 48))
    image_label = np.array(list(map(int, data['emotion'])))
    for i, row in enumerate(data.index):
        image = np.fromstring(data.loc[row, ' pixels'], dtype=int, sep=' ')
        image = np.reshape(image, (48, 48))
        image_array[i] = image
    return image_array, image_label
# Prepare data
emotions = {0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy', 4: 'Sad', 5: 'Surprise', 6: 'Neutral'}
# Define training, validation and test data:
train_image_array, train_image_label = prepare_data(data[data[' Usage']=='Training'])
val_image_array, val_image_label = prepare_data(data[data[' Usage']=='PrivateTest'])
test_image_array, test_image_label = prepare_data(data[data[' Usage']=='PublicTest'])
# Reshape and scale the images:
train_images = train_image_array.reshape((train_image_array.shape[0], 48, 48, 1))
train_images = train_images.astype('float32')/255
val_images = val_image_array.reshape((val_image_array.shape[0], 48, 48, 1))
val_images = val_images.astype('float32')/255
test_images = test_image_array.reshape((test_image_array.shape[0], 48, 48, 1))
test_images = test_images.astype('float32')/255
# Encoding of the target value:
train_labels = to_categorical(train_image_label)
val_labels = to_categorical(val_image_label)
test_labels = to_categorical(test_image_label)
# We define a simple CNN model:
model = models.Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(48, 48, 1)))
model.add(MaxPool2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPool2D((2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(7, activation='softmax'))
model.compile(optimizer=Adam(learning_rate=1e-3), loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
# Do the training
class_weight = dict(zip(range(0, 7), (((data[data[' Usage']=='Training']['emotion'].value_counts()).sort_index())/len(data[data[' Usage']=='Training']['emotion'])).tolist()))
history = model.fit(train_images, train_labels,
                    validation_data=(val_images, val_labels),
                    class_weight=class_weight,
                    epochs=12,
                    batch_size=64)
# save the model
model.save(model_path)
3. Model Quantization
This follows the instructions at https://ryzenai.docs.amd.com/en/latest/vai_quant/vai_q_onnx.html. The key point of the code is to create an image reader for the calibration step.
import os
import pandas as pd
import numpy as np
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
# pip install tf2onnx
import tensorflow as tf
import tf2onnx
import vai_q_onnx
from keras.models import load_model
from keras.utils import to_categorical
from onnxruntime.quantization.calibrate import CalibrationDataReader
import onnxruntime
data_path = 'tmp/faces/'
icml_face_data_path = os.path.join(data_path, 'icml_face_data.csv')
model_tf_path = os.path.join(data_path, 'ferPlus-drcapa.h5')
model_onnx_path = "resources/model/ferPlus-drcapa.onnx"
model_onnx_quant_path = "resources/model/ferPlus-drcapa-quantized.onnx"
model = load_model(model_tf_path)
# Convert Keras model to TensorFlow model
model = tf.keras.models.Model(model.inputs, model.outputs)
spec = (tf.TensorSpec((None, 48, 48, 1), tf.float32, name="input"),)
model_proto, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=18, output_path=model_onnx_path)
# We define some helper functions for preparing and plotting the data.
def prepare_data(data):
    """ Prepare data for modeling
        input: data frame with labels and pixel data
        output: image and label array """
    image_array = np.zeros(shape=(len(data), 48, 48))
    image_label = np.array(list(map(int, data['emotion'])))
    for i, row in enumerate(data.index):
        image = np.fromstring(data.loc[row, ' pixels'], dtype=int, sep=' ')
        image = np.reshape(image, (48, 48))
        image_array[i] = image
    return image_array, image_label
# Prepare data
emotions = {0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy', 4: 'Sad', 5: 'Surprise', 6: 'Neutral'}
class ImageDataReader(CalibrationDataReader):
    def __init__(self):
        # Load the image data with labels.
        self.data = pd.read_csv(icml_face_data_path)
        # Define training, validation and test data:
        # Using training data for calibration results in
        # numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.24 GiB for an array with shape (1943943808,) and data type float32
        self.train_image_array, _ = prepare_data(self.data[self.data[' Usage']=='PrivateTest'])
        # Reshape and scale the images:
        self.train_images = self.train_image_array.reshape((self.train_image_array.shape[0], 48, 48, 1))
        self.train_images = self.train_images.astype('float32')/255
        self.enum_data = None
        # Use inference session to get input shape.
        session = onnxruntime.InferenceSession(
            model_onnx_path, providers=['CPUExecutionProvider'])
        (_, _, height, width) = session.get_inputs()[0].shape
        # Convert image to input data
        self.train_images = np.expand_dims(self.train_images, axis=0)
        self.nhwc_data_list = self.train_images
        self.input_name = session.get_inputs()[0].name
        self.datasize = len(self.nhwc_data_list)

    def get_next(self):
        if self.enum_data is None:
            self.enum_data = iter([{
                self.input_name: nhwc_data
            } for nhwc_data in self.nhwc_data_list])
        return next(self.enum_data, None)

    def rewind(self):
        self.enum_data = None

    def reset(self):
        self.enum_data = None
dr = ImageDataReader()
vai_q_onnx.quantize_static(
    model_onnx_path,
    model_onnx_quant_path,
    dr,
    quant_format=vai_q_onnx.QuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=vai_q_onnx.QuantType.QUInt8,
    weight_type=vai_q_onnx.QuantType.QInt8,
    enable_ipu_cnn=True,
    extra_options={'ActivationSymmetric': True}
)
4. Model Inference
This is a test that the resulting model runs on the NPU before integrating it into the software.
import os
import onnxruntime
import cv2
import numpy as np
model_onnx_path = "resources/model/ferPlus-drcapa-quantized.onnx"
test_image = "resources/sampleImages/0.jpg"
emotions = {0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy', 4: 'Sad', 5: 'Surprise', 6: 'Neutral'}
providers = ["VitisAIExecutionProvider"]
provider_options = [{"config_file": "resources/ryzen-ai/vaip_config.json"}]
session = onnxruntime.InferenceSession(model_onnx_path, providers=providers, provider_options=provider_options)
session.get_inputs()[0]
print(session.get_outputs()[0].name)
print(session.get_outputs()[0].shape)
print(session.get_inputs()[0].name)
print(session.get_inputs()[0].shape)
image = cv2.imread(test_image)
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
resized_image = cv2.resize(gray_image, (48, 48))
image_normalized = resized_image / 255.0
image_batch = np.expand_dims(image_normalized, axis=0)
image_batch = np.expand_dims(image_batch, axis=3)
onnx_pred = session.run(None, {"input": image_batch.astype(np.float32)})
print(onnx_pred[0])
max_index = np.argmax(onnx_pred[0])
predicted_label = emotions[max_index]
print(predicted_label)
Pose recognition
The model is downloaded and exported to ONNX according to the instructions on https://docs.ultralytics.com/tasks/pose/. Quantization was done with vai_q_onnx.quantize_static(). As I did not have any training data, the calibration data reader was set to None (which likely decreased the quality of the model; it was still OK).
import os
import vai_q_onnx
# Due to older versions of Torch in the Python environment of Ryzen AI, opset 18 is not supported.
# Export e.g. in Colab or another environment with a newer version of Torch:
# pip install ultralytics
from ultralytics import YOLO
MODEL = 'yolov8m-pose.pt'
model = YOLO(MODEL)
model.export(format='onnx', imgsz=640, half=False, dynamic=False, simplify=True, opset=18)
model_onnx_path = "resources/model/yolov8m-pose.onnx"
model_onnx_quant_path = "resources/model/yolov8m-pose-quantized.onnx"
vai_q_onnx.quantize_static(
    model_onnx_path,
    model_onnx_quant_path,
    None,
    quant_format=vai_q_onnx.QuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=vai_q_onnx.QuantType.QUInt8,
    weight_type=vai_q_onnx.QuantType.QInt8,
    enable_ipu_cnn=True,
    extra_options={'ActivationSymmetric': True}
)
LLM
I decided to use the llama2 model with the setup as described by AMD documentation. See previous chapter Setup of the environment (Transformer based models == LLMs).
Prompt Engineering
A prompt instructs the LLM.
For prompt engineering, I used a framework called Promptfoo to build and test the prompts against various models. Promptfoo lets you define prompts that are sent to multiple LLMs (called providers) and test cases with common assertions, and it can use LLMs themselves to grade the responses. The issue with prompt engineering is that 1) prompt performance depends on the model, 2) there are various guides for creating "good" prompts, but a small change in the prompt can completely change the results, and 3) the output is somewhat random (read: hard to predict, and hard to explain why a certain output is observed). So without dedicated testing for your use case and the model in use, you cannot judge whether a change in your prompt improves the desired output or not.
I was using the providers gpt-4, gpt-3.5-turbo and llama3 via Ollama. (Note that gpt-4o and llama3.1 were not out yet when I did this.)
The following is the configuration
.env
export OLLAMA_BASE_URL="http://127.0.0.1:11434"
# only for ollama, but with the openai grader, it will fail
#export REQUEST_TIMEOUT_MS="200"
export OPENAI_API_KEY="..."
promptfooconfig.yaml
# Learn more: https://promptfoo.dev/docs/configuration/guide
description: 'Testing Assistant Prompts'
# Set an LLM
providers:
- id: openai:gpt-4
- id: openai:gpt-3.5-turbo
- id: ollama:chat:llama3:8b
# https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter
config:
# Enable Mirostat sampling for controlling perplexity. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
mirostat: 0
# Influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. (Default: 0.1)
mirostat_eta: 0.1
# Controls the balance between coherence and diversity of the output. A lower value will result in more focused and coherent text. (Default: 5.0)
mirostat_tau: 5.0
# Sets the size of the context window used to generate the next token. (Default: 2048)
num_ctx: 2048
# Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx)
repeat_last_n: 64
# Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1)
repeat_penalty: 1.1
# The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8)
temperature: 0.8
# Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0)
seed: 42
# # Sets the stop sequences to use. When this pattern is encountered the LLM will stop generating text and return. Multiple stop patterns may be set by specifying multiple separate stop parameters in a modelfile.
# stop: "AI assistant:"
# Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1)
tfs_z: 1
# Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context)
num_predict: 128
# Reduces the probability of generating nonsense. A higher value (e.g. 100) will give more diverse answers, while a lower value (e.g. 10) will be more conservative. (Default: 40) int top_k 40
top_k: 40
# Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9) float top_p 0.9
top_p: 0.9
# - exec:python langchain_example.py
# Load prompts from a file
prompts:
- prompts/assistant.json
tests:
- tests/assistant.yaml
# These test properties are applied to every test
defaultTest:
# This is used for the llm-rubric assertion
options:
# threshold: 0.5
provider: openai:gpt-3.5-turbo
# provider: ollama:llama3
assert:
- type: llm-rubric
value: Do not mention that you are an AI or chat assistant
- type: javascript
# Shorter is better
value: Math.max(0, Math.min(1, 1 - (output.length - 100) / 900));
prompts/assistant.json
[
{
"role": "system",
"content": "{{ system_message }}"
},
{
"role": "user",
"content": "{{ user_message }}"
}
]
prompts/assistant_systemMessage.txt
In markdown-structured text, I describe
- the role of the LLM
- the task
- the input data format and meaning, with examples
- instructions
- the expected output format
- and an example of input and output.
# Role
You are a helpful AI assistant designed to provide personalized feedback on posture and well-being.
Your role is to analyze measurements of the user and offer suggestions to improve posture, reduce stress, and create a healthier work environment.
# Task
## Overview
This is a system designed to promote healthy work habits by analyzing user behavior and providing personalized recommendations.
The workflow is
* A video stream is captured from a camera and preprocessed.
* The image will be fed to a vision model to detect the human body’s key points. Angles are calculated and used for a pose quality indicator.
* The image will be fed into another vision model to detect the face, its landmarks and the emotions out of it.
* The results are aggregated over time and saved together with other information to maintain a state.
* An assistant takes the aggregated data and creates a recommendation to improve posture, reduce stress, and create a healthier work environment.
## Task
You are in charge of giving feedback to the user to keep him healthy, focused and motivated.
The feedback should be positive, actionable, and tailored to the user's current needs.
Focus on suggesting concrete actions like stretching exercises, changing the pose, taking breaks, listening to relaxing music or telling a joke to cheer up.
# Context: Measurements and History
## Description
* Pose Score: A list of posture scores ordered by time where the latest is the newest. Scores are in the range of [0,5] where larger is better
* Emotions: A list of emotions classified based on the user's facial expression ordered by time. Categories are "Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise" and "Neutral".
* Work: Statistics on how long the user worked today, how many breaks he did and how long these breaks were in total.
* Lighting Conditions: A Measurement if the camera could take a proper image for the recognition or if the recording was too dark. Categories are "Good" and "Dark".
* History: A list of suggestions that were already made today.
## Example Data
* Pose Score: [5, 4, 3, 3, 3, 3, 2, 1, 4, 4, 5, 4]
* Emotions: [Happy, Happy, Neutral, Neutral, Neutral, Disgust, Angry, Angry, Neutral, Neutral, Happy, Happy]
* Work: 2 hours working, 2 breaks with a total of 10 minutes
* Lighting Conditions: [Good, Good, Good, Good, Good, Good, Good, Good, Good, Good, Good, Good]
* History: []
# Recommendations
Offer suggestions to improve posture, reduce stress, and create a healthier work environment.
## Examples for actions
* Keep the posture in mind and correct it
* Take different poses like using a standing desk
* Take breaks, step back a moment for difficult problems to see new angles later on
* Listen to music to calm down or focus
* Do a small stretch exercise
Above are a few examples. Further advice is welcome.
# Format
Provide one sentence that only contains the concrete action for the user.
# Instructions
Generate feedback based on the user's context. Use positive language and encourage healthy habits. If the user's posture is poor, suggest simple exercises to improve it. If the user seems stressed or overworked, offer tips to relax or take a break.
tests/assistant.yaml
- vars:
user_message: |
* Pose Score: [5, 4, 3, 3, 3, 3, 2, 1, 4, 4, 5, 4]
* Emotions: [Happy, Happy, Neutral, Neutral, Neutral, Disgust, Angry, Angry, Neutral, Neutral, Happy, Happy]
* Work: 2 hours working, 2 breaks with a total of 10 minutes
* Lighting Conditions: [Good, Good, Good, Good, Good, Good, Good, Good, Good, Good, Good, Good]
* History: []
system_message: file://prompts/assistant_systemMessage.txt
assert:
- type: llm-rubric
value: ensure that the output is actionable
# https://github.com/promptfoo/promptfoo/blob/main/examples/custom-grading-prompt/promptfooconfig.yaml
Running promptfoo eval then returns the graded results for each provider and test case.
- Accessing deep learning models became easy with only a few lines of code utilizing libraries. When it comes to leveraging acceleration hardware like the AMD NPU, however, models need to be converted and quantized, and they need to run on a dedicated runtime. This was harder and more time-consuming than I expected. No easy-to-use framework was available. I wish that e.g. Ollama or llama.cpp would support the AMD NPU in the future.
- Deep learning models are highly specialized and need a lot of training data to work, which you usually don't have. Several data sets are only available to academic staff. On the other hand, the upcoming LLM "foundation models" generalize very well and anyone can jump into crafting a prompt. Tuning is usually the second step if prompt engineering does not work out.
- Advances in the AI domain are quite fast nowadays. Frameworks, tools, and libraries pop up quickly. Not everything runs smoothly from the beginning and evaluation is required. I did not leverage Langchain here, but it might be a good candidate to manage context and provide additional data to the LLM.
- Currently, a basic flow runs through the modules up to the context module. The assistant module is integrated, but the pose score calculation and the prompting are still in the playground and need to be integrated. The next step is therefore to integrate the complete workflow and give feedback to the user.
- Models are not tuned. Likely each part needs some quality and performance optimization.
- With the release of Ryzen AI Software 1.2 at the end of July, I would expect an easier setup for LLMs and more models running out of the box. The next step would be to exchange the current model for the latest Llama3.1 or Phi3.
- Further evaluation is needed on whether a pose score can also be calculated with a front-facing camera, as this is the typical setup of a home office worker; setting up a camera to the side is rather difficult.
Thank you for reading this long story to the end. I hope that you found a part useful and could learn something new. My main motivation was to share my learnings and give back something to the community.
Although not fully complete, I hope you can find something useful in the shared code.
Disclaimers
This software is NOT a medical device and is not intended to be used as such or as an accessory to such, nor to diagnose or treat any conditions.