When rock climbing was first introduced as a discipline in the Olympics a few years back, I wondered if the current state of pose estimation algorithms could detect the incredible poses that humans take while performing this sport.
Applying AI to Climbing - The Long Road to the Tokyo Olympics
My first attempts were really bad, and it was clear that the data sets used to train these algorithms did not include these human poses.
Applying AI to Climbing - Deep Learning Meets the Odd Human Data Set
More than a year later, LearnOpenCV published an article that caught my attention. They described the new YOLOv7 model, which not only drastically improved the detection of rock climbing poses, but also provided keypoint detection.
Applying AI to Climbing - Pose Estimation with YOLOv7
This was a great breakthrough, but I did not find time to pursue the project further.
I did, however, manually create this cool video with a combination of video editing and Python scripting:
This type of montage allows climbers to analyze their performance, and compare themselves to other climbers. It should be obvious in the video that a 5'8" climber (me, on the left) will not go about it the same way as a 5'2" climber (my wife, on the right).
Elite climbers can compare a performance with previous ones to track progress and identify areas that need improvement.
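For readers curious how such a montage can be stitched together, below is a minimal OpenCV sketch that stacks two clips side by side. It assumes both clips have already been trimmed to start at the same moment; the file names are placeholders, not the scripts I actually used.

```python
import cv2

# Minimal side-by-side montage: read two pre-trimmed clips and stack each
# pair of frames horizontally. File names are placeholders.
cap_a = cv2.VideoCapture("climb_left.mp4")
cap_b = cv2.VideoCapture("climb_right.mp4")

fps = cap_a.get(cv2.CAP_PROP_FPS) or 30.0
writer = None

while True:
    ok_a, frame_a = cap_a.read()
    ok_b, frame_b = cap_b.read()
    if not (ok_a and ok_b):
        break  # stop at the end of the shorter clip

    # Scale the second frame to the height of the first before stacking.
    h = frame_a.shape[0]
    scale = h / frame_b.shape[0]
    frame_b = cv2.resize(frame_b, (int(frame_b.shape[1] * scale), h))

    combined = cv2.hconcat([frame_a, frame_b])
    if writer is None:
        writer = cv2.VideoWriter("side_by_side.mp4",
                                 cv2.VideoWriter_fourcc(*"mp4v"),
                                 fps, (combined.shape[1], combined.shape[0]))
    writer.write(combined)

cap_a.release()
cap_b.release()
if writer is not None:
    writer.release()
```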
The Goal
Creating a side-by-side comparison view of two or more climbs is a tedious process.
A good strategy for this type of project is to start small and focused. My hope is to create a useful tool for climbing coaches who could use this type of montage as a training aid for their climbers (perhaps the next Olympic candidates).
Therefore, the first goal will be to align the videos of two climbs in order to perform a side-by-side comparison.
Automating this process with computer vision and machine learning is a great challenge to take on before the next Olympics, which will be held in Paris in July 2024.
The best way to get a project done in time is to establish a deadline.
For this, I will enter this project in the new OpenCV AI Competition, being hosted on Hackster.io.
An international open source competition on computer vision & AI by the OpenCV Foundation.
With up to $48,000+ in prizes, it is a great opportunity to flex your computer vision skills 😃
OpenCV AI Competition 2023
Whether or not my project is "accepted" in the competition, I will move forward with the project and use the competition deadlines and milestones as motivation 😃😃. Hopefully, it will also inspire others.
Standing on the Shoulders of Giants
I am a big fan of LearnOpenCV, from their blogs to their on-line courses.
The LearnOpenCV team provides tutorials and documentation for all the components that you may need to create your own project.
They also have on-line courses for every budget, from the free bootcamp courses to the premium in-depth courses.
https://opencv.org/university/free-courses/
https://opencv.org/university/cvdl-master/
I will leverage (and acknowledge) their content to implement this project.
The Proposed Solution
This project will be broken down into two main components:
- Video Annotation
- Side-by-Side Viewing
The Video Annotation component will use pose estimation to annotate each video. Since I will want to validate and edit the annotations, an interactive tool will be ideal. It will provide the option to correct the annotations (if required).
The starting point for this component will be the open-source annotation tool from OpenCV: pyOpenAnnotate
Building An Automated Image Annotation Tool: PyOpenAnnotate
Roadmap To an Automated Image Annotation Tool Using OpenCV Python
Several features make this tool interesting for reuse and customization:
- ability to annotate each frame of video
- ability to manually correct annotations
The following modifications will be made for the purpose of this project:
- replace existing thresholding based annotations (bounding boxes) with pose estimation annotations (bounding box + keypoints)
- identify climber of interest and track climber in video
- filter out unwanted annotations (i.e. not the climber of interest)
The Side-by-Side Viewer will use the pre-calculated annotations to view multiple climbs together. For this purpose, the following features will also need to be created:
- creation of synchronization points of climber on route
- video stretching to align climbs together (a rough sketch follows this list)
- identify background image of route (used for viewing options)
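To make the video-stretching idea above concrete, here is a rough sketch that maps frame indices of one climb onto another by piecewise-linear interpolation between synchronization points. The sync frame numbers are invented for the example; in the real tool they would come from the annotation step.

```python
import numpy as np

# Synchronization points: pairs of frame indices where both climbers are at
# the same spot on the route. The numbers below are made up for illustration.
sync_a = np.array([0, 120, 300, 450])   # frames in climb A (start, crux 1, crux 2, top)
sync_b = np.array([0, 180, 390, 600])   # matching frames in climb B

def frame_in_b(frame_a: float) -> int:
    """Return the frame of climb B that corresponds to a given frame of climb A."""
    return int(round(np.interp(frame_a, sync_a, sync_b)))

# Stepping through climb A, look up the aligned frame of climb B.
for f_a in (0, 60, 120, 300, 450):
    print(f"A frame {f_a:3d} -> B frame {frame_in_b(f_a)}")
```

Piecewise-linear interpolation keeps both climbers at the same hold at every sync point while stretching or compressing the sections in between.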
The Viewer will also provide several viewing options:
- side-by-side or overlay
- climber or stick-man
It all starts with the data. For this project, I gathered several videos of climbing footage of my wife Josie-Anne and me.
The footage covers several scenarios:
- 6 indoor videos / 42 outdoor videos
- mostly up-climbs, a few down-climbs (i.e. climbing back down)
- one climb with two different viewpoints
The videos were taken with an iPhone 14 Pro, in "accelerated mode".
The motivations (and advantages) of using this mode are:
- lower battery consumption (i.e. important for outdoor climbs)
- lower storage requirements
- smaller video files
Some disadvantages of using this mode are:
- a delay of 10 sec between frames, which is too long for climbing (i.e. a lot of movement is lost)
- in many frames the climber is blurred (i.e. less than ideal for pose estimation)
Another technique for gathering data will have to be considered, but for now, I will make use of these videos, which contain between 600 and 1200 frames each.
For this first iteration, the pyOpenAnnotate utility was modified as follows:
- modify the annotation read/write code to include 1 bounding box + 17 keypoints per person, including the confidence scores
- modify the annotation tool to reuse annotations if present
The annotation data contains the following information:
- bbox.id (integer) : unused for now (0)
- bbox.x (float) : normalized x coordinate for center of bounding box
- bbox.y (float) : normalized y coordinate for center of bounding box
- bbox.w (float) : normalized width for bounding box
- bbox.h (float) : normalized height for bounding box
- bbox.c (float) : confidence score for bounding box
- keypoint[0].x (float) : normalized x coordinate for keypoint[0]
- keypoint[0].y (float) : normalized y coordinate for keypoint[0]
- keypoint[0].c (float) : confidence score for keypoint[0]
- ...
- keypoint[16].x (float) : normalized x coordinate for keypoint[16]
- keypoint[16].y (float) : normalized y coordinate for keypoint[16]
- keypoint[16].c (float) : confidence score for keypoint[16]
Note that the 17 pose landmarks, or keypoints, follow the COCO keypoint convention (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).
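To make the format concrete, the snippet below parses one such annotation into a bounding box and 17 keypoints. It assumes one whitespace-separated line per detected person; the exact file layout in my modified pyOpenAnnotate may differ slightly.

```python
# Parse one annotation line: id, cx, cy, w, h, conf, then 17 keypoints of (x, y, conf).
# Assumes one whitespace-separated line per detected person.
def parse_annotation_line(line: str):
    values = [float(v) for v in line.split()]
    bbox = {
        "id": int(values[0]),
        "x": values[1], "y": values[2],   # normalized center
        "w": values[3], "h": values[4],   # normalized size
        "c": values[5],                   # confidence
    }
    keypoints = [
        {"x": values[6 + 3 * i], "y": values[7 + 3 * i], "c": values[8 + 3 * i]}
        for i in range(17)                # COCO order: nose, eyes, ears, shoulders, ...
    ]
    return bbox, keypoints

# Example with made-up values: one bounding box followed by 17 identical keypoints.
example = "0 0.52 0.40 0.18 0.55 0.91 " + " ".join(["0.5 0.4 0.8"] * 17)
bbox, kpts = parse_annotation_line(example)
print(bbox["c"], len(kpts))
```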
On a PC (without a GPU), it took about 20 hours of processing time to annotate the ~50 videos, or approximately 30 minutes per video.
The processing time per frame was approximately 2 seconds. The second graph shows this as fairly stable at 1.8 sec per frame for the first videos, then an unexpected spread of 1.5-3.0 sec per frame near the end.
Viewing Data (2023/09/03)
The following video illustrates the current status of the project:
The open-source annotation tool (pyOpenAnnotate) has been modified to use the YOLOv7 pose estimation model to annotate the video frames.
As seen previously, the annotation process is quite long, so it is executed in batch mode. The annotation tool has been modified to reuse the pre-calculated annotations for viewing and analysis.
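The core of that batch step could look roughly like the sketch below, loosely based on the LearnOpenCV YOLOv7 pose tutorial. It assumes the yolov7 repository is on the Python path and that the yolov7-w6-pose.pt weights have been downloaded; the helper names are quoted from memory, so check them against the tutorial before reuse.

```python
# Rough sketch of the batch pose-annotation step, loosely following the
# LearnOpenCV YOLOv7 pose tutorial. Helper names (letterbox,
# non_max_suppression_kpt, output_to_keypoint) come from the yolov7 repo
# and are quoted from memory.
import torch
from torchvision import transforms
from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.load("yolov7-w6-pose.pt", map_location=device)["model"].float().eval().to(device)

def annotate_frame(frame):
    """Return one row per detected person: [batch, class, cx, cy, w, h, conf, 17 x (x, y, c)]."""
    img = letterbox(frame, 960, stride=64, auto=True)[0]     # resize + pad to model input size
    tensor = transforms.ToTensor()(img).unsqueeze(0).to(device)
    with torch.no_grad():
        output, _ = model(tensor)
    output = non_max_suppression_kpt(output, 0.25, 0.65,
                                     nc=model.yaml["nc"], nkpt=model.yaml["nkpt"],
                                     kpt_label=True)
    return output_to_keypoint(output)   # coordinates are in letterboxed pixels, not normalized
```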
Three new sliders (track bars) have been added to the annotation tool (a minimal sketch follows the list):
- frameNum : quickly navigate in video to identify areas of interest
- threshBBOX : confidence threshold for bounding boxes
- threshKPTS : confidence threshold for keypoints
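In OpenCV's HighGUI, such sliders are created with cv2.createTrackbar. The sketch below is a minimal stand-in, not the actual tool code; window name, ranges and callbacks are placeholders.

```python
import cv2
import numpy as np

# Minimal sketch of the three sliders as OpenCV trackbars. Window name, ranges
# and callbacks are placeholders, not the actual annotation tool code.
WINDOW = "annotate"
cv2.namedWindow(WINDOW)

state = {"frameNum": 0, "threshBBOX": 50, "threshKPTS": 50}

def on_change(name):
    def callback(value):
        state[name] = value   # the real tool would redraw the frame here
    return callback

cv2.createTrackbar("frameNum",   WINDOW, 0,  999, on_change("frameNum"))
cv2.createTrackbar("threshBBOX", WINDOW, 50, 100, on_change("threshBBOX"))  # percent
cv2.createTrackbar("threshKPTS", WINDOW, 50, 100, on_change("threshKPTS"))  # percent

while True:
    # The real tool would show the selected frame with annotations filtered by
    # the two confidence thresholds; here we just display a blank canvas.
    canvas = np.zeros((200, 400, 3), dtype=np.uint8)
    cv2.imshow(WINDOW, canvas)
    if cv2.waitKey(30) & 0xFF == 27:   # Esc to quit
        break
cv2.destroyAllWindows()
```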
Although the ultimate goal is to identify the following frames automatically, the current annotation tool allows the user to manually specify (one possible way to store these markers is sketched after the list):
- background image
- start of climb
- end of climb
- cruxes (difficult parts of climb, areas of interest)
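One simple way to keep these manual markers next to the annotations is a small per-video JSON file. The layout below is hypothetical; the field names are mine, not the tool's.

```python
import json

# Hypothetical per-video marker file; field names are my own, not pyOpenAnnotate's.
markers = {
    "background_frame": 12,        # frame with an unobstructed view of the route
    "climb_start_frame": 35,
    "climb_end_frame": 410,
    "crux_frames": [120, 260],     # difficult parts of the climb, areas of interest
}
with open("climb_markers.json", "w") as f:
    json.dump(markers, f, indent=2)
```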
Note that the video is not showing the final viewer, but rather two instances of the annotation tool, giving a glimpse of what we will want to achieve with our side-by-side viewer.
Identifying the Climber of Interest (2023/09/13)
It is not always obvious which person is the climber of interest. The following series of four frames was taken from a video where the camera was placed on a trail with a lot of activity, in addition to another climber. Can you guess which climber is of interest?
Notice how the IDs of the detected people change at each frame. This is due to the lack of tracking in the current annotation tool.
In order to identify the climber of interest, it may be necessary for the user to specify the climber at the start of the climb, and possibly use tracking to follow the climber throughout the video.
The following LearnOpenCV article, describing Deep SORT tracking, will be used to "follow" a climber through the sequence of frames in the videos:
LearnOpenCV - Real Time Deep SORT with Torchvision Detectors
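The article relies on the deep-sort-realtime package. The sketch below shows how the pre-computed person detections could be fed to the tracker; the file name and the single hard-coded detection are placeholders.

```python
import cv2
from deep_sort_realtime.deepsort_tracker import DeepSort

# Sketch of feeding per-frame person detections to Deep SORT, assuming the
# deep-sort-realtime package used in the LearnOpenCV article. Each detection
# is a ([left, top, width, height], confidence, class) tuple in pixels.
tracker = DeepSort(max_age=30, embedder="mobilenet")

cap = cv2.VideoCapture("climb.mp4")          # placeholder file name
while True:
    ok, frame = cap.read()
    if not ok:
        break

    # In the real tool these would come from the pre-computed YOLOv7 annotations;
    # the single detection below is a placeholder.
    detections = [([100, 200, 80, 180], 0.9, "person")]

    tracks = tracker.update_tracks(detections, frame=frame)
    for track in tracks:
        if not track.is_confirmed():
            continue
        left, top, right, bottom = track.to_ltrb()
        print(f"person id={track.track_id} box=({left:.0f},{top:.0f},{right:.0f},{bottom:.0f})")
cap.release()
```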
Tracking the Climber (2023/09/15)
My first attempt to use the Deep SORT algorithm to track the climber of interest was not successful.
To be fair, this video is fairly complex, as there are a lot of people in the scene. When I start climbing, Deep SORT has allocated the following IDs:
- belayer (my wife, Josie-Anne) : ID 20
- climber (myself, Mario) : ID 21
What is very impressive is that despite the continual flow of people moving around the belayer, ID 20 remains the same throughout the video. This is a solid success!
For the climber, however, the ID changes from 21 to 45, to 47, to 49, to 56, to 57, to 61, to 68, to 69, to 72, to 74, to 75, to 76, to 83, to 92, etc. This is an epic failure!
But why ? What is going on ?
For this test, I used the "mobilenet" re-identification model. I wanted to also test the "torchreid" and "clip" re-id models, but could not get those working ...
The tracking is lost almost every time that I fall, which results in a sudden change in position. Since these videos were taken in "accelerated" mode, the changes in position are perhaps too great for the algorithm to work correctly.
To test this first theory, I will have to re-try with a normal video (i.e. 30 frames/sec instead of 0.1 frames/sec).
Another theory is that the re-id model has its attention on the rock features instead of the climber. One way to test this could be to do background subtraction, to remove the rock textures from the video being processed by the Deep SORT algorithm.
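OpenCV ships ready-made background subtractors that could serve for this experiment. Here is a minimal sketch using MOG2 to black out the (mostly static) rock before the frames reach Deep SORT; the file name is a placeholder, and the long gaps between "accelerated" frames may limit how well it works.

```python
import cv2

# Sketch of removing the (mostly static) rock background with OpenCV's MOG2
# subtractor before handing frames to the tracker. File name is a placeholder.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)

cap = cv2.VideoCapture("climb.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break

    mask = subtractor.apply(frame)                        # 0 = background, 255 = moving
    mask = cv2.medianBlur(mask, 5)                        # clean up speckle noise
    foreground = cv2.bitwise_and(frame, frame, mask=mask)

    # 'foreground' (climber kept, rock blacked out) would then be fed to Deep SORT.
    cv2.imshow("foreground", foreground)
    if cv2.waitKey(1) & 0xFF == 27:                       # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```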
Revisions
2023/08/25 - Project Draft
2023/08/29 - Project Entered in OpenCV AI Competition 2023
2023/08/31 - Update on Gathering and Annotating Data
2023/09/03 - Update on Viewing Data
2023/09/13 - Update on Isolating the Climber of Interest
2023/09/15 - Update on Tracking the Climber