ActionAI
- Performs multi-person tracking and activity recognition
- Simply customize for your task
Check out the repo!
What is IVA?
Intelligent Video Analytics (IVA) applications often require the ability to detect and track objects over time in video.
Running object detection models as the primary inference engine provides the developer access to more spatial context in application logic.
IVA platforms like the Deepstream SDK also make it easy to cascade inference using a secondary model performing object detection or identifying object instance-level attributes with a classifier.
What are AI social applications?
Here, we refer to smart or ML-enabled applications designed to analyze or interact with humans as AI social applications.
Cheap hardware & software make vision a natural sensing modality for building intelligent applications that analyze people.
For example, many AI social applications might require the ability to recognize activities performed by humans in a camera's field of view.
Early approaches in this area concentrated on extending image-based deep learning techniques to video through 3D CNN architectures, essentially regarding time and space on the same footing.
Last summer, Google researchers described a number of innovations leading to a new state of the art in Human Activity Recognition (HAR) on the Kinetics dataset.
Last year, we also investigated HAR through the development of YogAI on a yoga pose dataset we gathered.
We used body keypoint estimation as a feature extractor before applying a secondary classifier to predict pose. We then extended this approach to classify a motion by concatenating a sequence of similar features over time.
Since modern pose estimation models support keypoint identification for multiple human instances, we can recognize activities as human-instance level attributes.
This stands in contrast to the results of Google's EvaNet, which yields classification labels characterizing a video segment rather than the activity of each individual person.
Since we want to make it easier to build new AI social applications, we emphasize how simple it is to specialize our approach to your custom HAR tasks.
The inductive bias of ActionAI's structured approach makes learning more efficient in scenes where activity is recognizable from body configuration over time. With this approach, the challenge is reduced to training a sequential model like an LSTM.
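As a rough illustration, a sequential classifier over windows of flattened keypoints might look like the following tf.keras sketch; the window length, layer sizes, and class count are illustrative assumptions, not the exact YogAI/ActionAI configuration.

```python
import numpy as np
import tensorflow as tf

WINDOW = 30        # frames per sample (assumed)
N_KEYPOINTS = 18   # keypoints per person
N_CLASSES = 4      # e.g. standing, sitting, lying down, pacing (illustrative)

# Each sample is a sequence of flattened (x, y) keypoints: (WINDOW, 2 * N_KEYPOINTS)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, 2 * N_KEYPOINTS)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Placeholder data: pose sequences X and integer activity labels y
X = np.random.rand(8, WINDOW, 2 * N_KEYPOINTS).astype("float32")
y = np.random.randint(0, N_CLASSES, size=8)
model.fit(X, y, epochs=1, verbose=0)
```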
Check out our video where we demo EvaNet and YogAI:
Generalizing YogAI
After showing that we could pair pose estimation with simple & lightweight machine learning algorithms to reliably perform HAR, we generalized the approach to easily support new use cases like Shoot Your Shot.
We also found this approach powerful in building the resource utilization and activity monitoring demo 'Yellow Couch' for a client as we explored cascaded inference to extract additional attributes about people.
- Pose estimates were used to produce regions of interest around the face for recognition and around the hands to run classifiers that help infer activities like drinking coffee or using cellphones and laptops.
- Another classifier works like YogAI to infer activities like: standing, passing, sitting, lying down.
- The raw images are 4K, so rather than resizing the full image first, we produced a cleaner segmentation mask by cropping to a region of interest before running inference. This region of interest is defined by the bounding box containing the body keypoints for each person instance (see the sketch after this list). This allowed us to run faster, smaller tflite models on devices like the Jetson TX2.
- In this resource-utilization use-case, information regarding a person's proximity to objects like a couch provides additional context for activity recognition. This is similar to the motivation behind Action Genome but narrowing in on the easily annotated spatial relationships between human actors and surrounding objects.
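A minimal sketch of the crop-before-inference step from the list above, assuming per-person keypoints arrive as pixel-space (x, y) pairs; the padding fraction and output resolution are placeholder choices.

```python
import numpy as np
import cv2

def crop_person_roi(frame, keypoints, pad=0.1, out_size=(224, 224)):
    """Crop a frame to the bounding box of a person's keypoints.

    frame:     HxWx3 image (e.g. a 4K frame)
    keypoints: (N, 2) array of pixel-space (x, y) keypoints for one person
    pad:       fractional padding around the keypoint bounding box
    out_size:  input resolution expected by the secondary (e.g. tflite) model
    """
    keypoints = np.asarray(keypoints, dtype=float)
    h, w = frame.shape[:2]
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    dx, dy = (x_max - x_min) * pad, (y_max - y_min) * pad
    x0, y0 = int(max(x_min - dx, 0)), int(max(y_min - dy, 0))
    x1, y1 = int(min(x_max + dx, w)), int(min(y_max + dy, h))
    roi = frame[y0:y1, x0:x1]
    return cv2.resize(roi, out_size)
```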
ActionAI generalizes the approach of YogAI and related projects, framing an IVA pipeline around trackers and multi-person pose estimation.
By baking pose estimation into the pipeline as the primary inference engine, the developer can focus on training simple image classification models based on low dimensional features or small, localized image crops.
Since popular IVA frameworks typically only support the most common computer vision tasks like object detection or image classification/segmentation, we needed to implement our own.
Many IVA frameworks use GStreamer to acquire and process video. For our video processing demo, OpenCV suffices. For pose estimation, we use OpenPose implementations in popular deep learning frameworks like TensorFlow and PyTorch.
Accurately recognizing some activities requires higher temporal resolution, i.e. higher frame rates, so we use TensorRT converters for optimized inference on edge AI prototyping devices like the Jetson Nano.
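For reference, the acquisition loop can be as simple as the following OpenCV sketch; run_pose_estimation is a hypothetical stand-in for the pose engine (OpenPose or a TensorRT-optimized model).

```python
import cv2

# Minimal OpenCV acquisition loop standing in for a full GStreamer pipeline.
cap = cv2.VideoCapture(0)          # USB camera index (assumed)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # poses = run_pose_estimation(frame)   # hypothetical call into the pose engine
    cv2.imshow("ActionAI demo", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```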
The main programming abstraction of ActionAI is a trackable person class, similar to this pyimagesearch trackable object. This object has a method to enqueue the configuration of N (14 or 18) new keypoints as a length-2N numpy array into a circular buffer. For computational efficiency, we prefer smaller buffers, but we balance that with the need to provide enough information as input for secondary models. This object also encapsulates ids, bounding boxes, and the results of running additional inference.
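A minimal sketch of such a trackable person class, using collections.deque as the circular buffer; the buffer length and attribute names are assumptions rather than ActionAI's exact implementation.

```python
from collections import deque
import numpy as np

class TrackablePerson:
    """Holds a person's id, bounding box, recent pose history, and inference results."""

    def __init__(self, person_id, bbox, n_keypoints=18, buffer_len=10):
        self.person_id = person_id
        self.bbox = bbox                                 # (x0, y0, x1, y1)
        self.n_keypoints = n_keypoints
        self.pose_history = deque(maxlen=buffer_len)     # circular buffer of pose vectors
        self.attributes = {}                             # results of secondary inference

    def enqueue_pose(self, keypoints):
        """Append an (n_keypoints, 2) array as a flat length-2N vector."""
        self.pose_history.append(np.asarray(keypoints, dtype=np.float32).reshape(-1))

    def pose_window(self):
        """Stack buffered poses into a (T, 2N) array for a sequential classifier."""
        return np.stack(self.pose_history) if self.pose_history else None
```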
To track person instances, we used a scikit-learn implementation of the Kuhn–Munkres algorithm based on the intersection over union of bounding boxes between consecutive time steps. This blog has a nice exposition on applying this algorithm to perform matching.
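The matching step might look like the sketch below; it uses SciPy's linear_sum_assignment, which implements the same Kuhn–Munkres (Hungarian) algorithm, with cost defined as 1 - IoU. The IoU threshold is an illustrative choice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Kuhn–Munkres / Hungarian algorithm

def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def match_tracks(prev_boxes, new_boxes, iou_threshold=0.3):
    """Match boxes between consecutive frames by maximizing total IoU."""
    cost = np.array([[1.0 - iou(p, n) for n in new_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)            # minimizes cost = 1 - IoU
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_threshold]
```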
Like other IVA frameworks, we incorporate visual overlays to support ML observability and introspection as well as visual storytelling.
Alternatively, we can integrate ActionAI with messaging brokers to stream inference results to the cloud for logging or additional processing, similar to the reference applications for the Deepstream SDK.
In another direction, by polling for button presses of a PS3 controller connected to the Jetson Nano by USB, we easily annotated activities for person instances at each time step interactively, like we did with the MuttMentor.
This makes ActionAI, a Jetson Nano, a USB camera, and the PS3 controller's rich input interface an ideal prototyping and data-gathering platform for Human Activity Recognition, Human Object Interaction, and Scene Understanding tasks.
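As a sketch of the annotation loop, the controller can be polled with pygame; the button-to-label mapping below is illustrative, and pygame is just one way to read the gamepad.

```python
import pygame

# Map controller buttons to activity labels (button indices are illustrative).
BUTTON_LABELS = {0: "standing", 1: "sitting", 2: "lying_down", 3: "pacing"}

pygame.init()
pygame.joystick.init()
controller = pygame.joystick.Joystick(0)   # PS3 controller connected over USB

def poll_label():
    """Return the activity label for the currently pressed button, if any."""
    pygame.event.pump()
    for button, label in BUTTON_LABELS.items():
        if controller.get_button(button):
            return label
    return None
```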
In YogAI, we found sequences of pose estimates to be powerful features in recognizing motions from relatively few samples. In ActionAI, by running model update steps inline with image acquisition and PS3 controller annotation, we can implement a demo similar to the teachable machine.
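A minimal sketch of such inline updates, assuming pose windows are flattened into fixed-length feature vectors and using scikit-learn's SGDClassifier.partial_fit as one possible incremental learner (not necessarily ActionAI's exact model); labels could come from the poll_label helper above.

```python
from sklearn.linear_model import SGDClassifier

CLASSES = ["standing", "sitting", "lying_down", "pacing"]   # illustrative label set
clf = SGDClassifier()                                       # supports incremental updates

def update_model(pose_window, label):
    """One online update: flatten a (T, 2N) pose window and fit on the new label."""
    x = pose_window.reshape(1, -1)
    clf.partial_fit(x, [label], classes=CLASSES)

def predict_activity(pose_window):
    return clf.predict(pose_window.reshape(1, -1))[0]
```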
Future Directions
We can use ActionAI to demo our own invisible keyboard, learning to associate patterns in hand keypoints with text using a keystroke logger! MediaPipe and TensorFlow.js provide models for face and hand pose tracking.
Or we can associate the movements of facial keypoints with spoken word! By demuxing video and using Automatic Speech Recognition to convert speech to text, we can apply ActionAI to learn to read lips.
Here, we experiment with facial key points, finding temporal inconsistencies in the eyebrows generated with deep fakes as part of Kaggle's DeepFake Challenge.
Besides additional keypoint models, we can introduce the context of image texture & color with a descriptor for each trackable person object. This can help perform more robust tracking and also improve activity recognition.
In Facebook's SlowFast, researchers structured two video processing pathways, modeled on the human vision system, to achieve a state-of-the-art in video recognition.
The dual pathways helped to efficiently process color & texture features less frequently, while analyzing motion-based information at a higher frequency.
We can introduce a similar slow-fast modeling paradigm by adding new methods to our trackers configured to run at different frequencies. Then ActionAI can incorporate the additional slowly-varying context of color and texture to perform action recognition.
ActionAI has worked well in running secondary LSTM-based classifiers to recognize activity from sequences of key point features in time. However, we can introduce methods to render spectrograms for Fourier smoothing of this sequence before running a CNN-based action recognition model.
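One way to sketch this, assuming a fixed frame rate, is to compute a spectrogram per keypoint coordinate with scipy.signal.spectrogram and stack the results as channels for a CNN; the frame rate and window parameters below are placeholders.

```python
import numpy as np
from scipy.signal import spectrogram

def pose_spectrogram(pose_sequence, fps=30, nperseg=16):
    """Turn a (T, 2N) keypoint sequence into a stack of per-coordinate spectrograms.

    Returns an array of shape (2N, n_freqs, n_times) that can be fed to a
    CNN-based action recognition model as a multi-channel 'image'.
    """
    channels = []
    for coord in pose_sequence.T:                      # iterate over the 2N coordinates
        _, _, sxx = spectrogram(coord, fs=fps, nperseg=nperseg)
        channels.append(sxx)
    return np.stack(channels)
```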
Finally, by developing a gstreamer plugin, we can achieve higher performance and generalize this work further while integrating this approach more seamlessly within applications developed using the Deepstream SDK.