Thanks to scientific advances in medicine, improved quality of life, and economic and social development, people are living longer, which has driven a global aging of the population. At the same time, the current pace of life, together with some of the extreme situations we have recently experienced, has increased social distancing. As a result, more and more elderly people spend their daily lives alone, and many of them are at risk because behaviors compatible with degenerative diseases, among which dementia is one of the most common, are not diagnosed in time.
In this project we propose HOSTIA, a computer vision-based detector of falls and other important circumstances that can affect the quality of life of the elderly.
The proposed pipeline is divided into several stages. First, person detection is performed by a deep-learning region detector. Second, each crop of a detected person is fed to a pose estimation network to obtain the coordinates of their skeleton. From these coordinates, different algorithms can detect different circumstances, such as falls, disorientation, or anomalous action durations (the latter may require an action recognition network).
Once one of these events is detected, the system can generate an alarm to alert the person in charge.
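To make the stages concrete, the following is a minimal Python sketch of this pipeline, assuming the Ultralytics YOLOv8 API and MMPose's high-level MMPoseInferencer; the alarm hook notify_caregiver and the event check looks_like_fall are hypothetical placeholders, not part of the actual implementation.

```python
# Minimal pipeline sketch: detect people, estimate their poses, and run
# event checks on the resulting skeletons.
import cv2
from ultralytics import YOLO
from mmpose.apis import MMPoseInferencer

detector = YOLO("yolov8n.pt")               # Stage 1: person detector
pose_estimator = MMPoseInferencer("human")  # Stage 2: 2D pose estimator

def looks_like_fall(keypoints):
    # Placeholder: the temporal criterion is described in Section 2.3.
    return False

def notify_caregiver(event):
    # Hypothetical alarm hook (SMS, call, push notification, ...).
    print(f"ALARM: {event}")

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Detect people (COCO class 0 = "person").
    boxes = detector(frame, classes=[0], verbose=False)[0].boxes
    for x1, y1, x2, y2 in boxes.xyxy.cpu().numpy().astype(int):
        crop = frame[y1:y2, x1:x2]
        # Estimate the pose of each detected person on their crop.
        result = next(pose_estimator(crop, show=False))
        for person in result["predictions"][0]:
            if looks_like_fall(person["keypoints"]):
                notify_caregiver("possible fall")
cap.release()
```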
The advantage of our system over traditional fall-detection methods is that the person does not need to wear any kind of device, which they might forget to use, for an automatic alarm to be generated. In addition, our approach can analyze the person's daily life in order to infer signs of cognitive impairment.
2. Methodology
2.1 Person detection
For pose recognition, it is first necessary to isolate the region of the scene containing the person. The result of person detection is a bounding box with the position of the person in the image.
The person detection system employed is YOLOv8, which provides a significant advantage in image-processing speed compared to YOLOv7. This is particularly important when considering the deployment of our system in a real-time environment for alerting a third party in the event of a fall or abnormal detection [2].
Furthermore, if a skeleton with more than 5 keypoints was detected in the previous frames and the bounding box suddenly disappears in the subsequent frame, we interpret this as a likely failure of the person detector. In such cases, we retain the bounding box from the previous frame for pose detection.
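A minimal sketch of this fallback logic, assuming detections is a list of candidate boxes sorted by confidence and prev_skeleton_size is the number of confidently detected keypoints in the previous frame:

```python
MIN_KEYPOINTS = 5  # "more than 5 keypoints" criterion from the text

last_bbox = None

def bbox_for_frame(detections, prev_skeleton_size):
    """Pick the bbox for pose estimation; fall back to the previous one
    when the detector suddenly loses a well-tracked person."""
    global last_bbox
    if detections:                        # detector found the person
        last_bbox = detections[0]
        return last_bbox
    if prev_skeleton_size > MIN_KEYPOINTS and last_bbox is not None:
        return last_bbox                  # likely a detector failure: reuse bbox
    return None                           # genuinely no person in view
```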
2.2 Pose estimation
Pose detection is a crucial step that tends to produce false negatives, i.e., failing to detect a person's skeleton when one is present. This is particularly exacerbated when the person is in an unusual position, such as the horizontal or diagonal postures commonly associated with falls [1].
To mitigate this issue, if no human figure is detected, the image is rotated in 90-degree increments until a human figure is found, allowing the system to obtain the associated bounding box for that detection.
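This rotation fallback could be sketched as follows with OpenCV, where detect_person is a hypothetical wrapper around the detector that returns a bounding box or None:

```python
import cv2

def detect_with_rotation(image, detect_person):
    """Retry detection on 90-degree rotations of the image."""
    rotations = [
        None,                              # original orientation
        cv2.ROTATE_90_CLOCKWISE,
        cv2.ROTATE_180,
        cv2.ROTATE_90_COUNTERCLOCKWISE,
    ]
    for rot in rotations:
        candidate = image if rot is None else cv2.rotate(image, rot)
        bbox = detect_person(candidate)
        if bbox is not None:
            return bbox, rot               # caller maps bbox back if needed
    return None, None
```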
The sensitivity of the 2D pose estimation system is a critical aspect to consider. In instances of false positives, where the bounding box erroneously indicates human presence, the pose system is prone to assigning keypoints to a non-existent human skeleton. This phenomenon can lead to inaccurate results and degrade the overall quality of detection [1].
To cope with this sensitivity and increase the reliability of the system, we apply a filtering criterion: all pose estimates with fewer than 5 keypoints are discarded. This filtering strategy reduces the probability of false positives by focusing on more robust and complete pose estimates, thus contributing to the overall accuracy of the system.
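A sketch of this filter, assuming each pose comes with per-joint confidence scores (the 0.3 cutoff is an assumption, not a value from the paper):

```python
KEYPOINT_SCORE_THRESHOLD = 0.3  # assumed per-joint confidence cutoff
MIN_KEYPOINTS = 5

def is_valid_pose(keypoint_scores):
    """Keep a pose only if at least MIN_KEYPOINTS joints are confidently
    detected; weaker estimates are treated as false positives."""
    confident = [s for s in keypoint_scores if s >= KEYPOINT_SCORE_THRESHOLD]
    return len(confident) >= MIN_KEYPOINTS
```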
Pose estimation is carried out using the MMPose library [3], with the specific keypoint layout shown in Figure 1. This layout was carefully chosen to handle the complexities associated with the diversity of human postures. Figure 1 visualizes the keypoint distribution, whose association with body joints is essential for the subsequent fall detection.
2.3 Fall detection
Fall detection does not only involve detecting the relative position of the person with respect to the ground in a static image, as there are circumstances in which lying down is justified, for example a camera in the living room recording a person sleeping.
To build a more robust system, a temporal analysis of the person's pose is required to detect sudden changes in the position of the joints of the head and trunk, which are very likely to correspond to a fall event.
The system works in real time and is able to analyze video sequences to detect such events, distinguishing between an accidental fall and the action of lying down.
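As an illustration of such a temporal criterion (not necessarily the exact rule used in the system), the following sketch flags a sudden vertical drop of the head and trunk keypoints, assuming COCO keypoint ordering and a hypothetical velocity threshold:

```python
import numpy as np

FALL_VELOCITY_THRESHOLD = 0.4  # assumed: normalized image heights per second

def sudden_vertical_drop(pose_history, fps):
    """Flag a fall when head/trunk keypoints drop sharply between frames.
    `pose_history` is a list of (17, 2) arrays of normalized keypoints
    in COCO order (0 = nose, 5/6 = shoulders, 11/12 = hips)."""
    if len(pose_history) < 2:
        return False
    head_trunk = [0, 5, 6, 11, 12]
    prev = np.asarray(pose_history[-2])[head_trunk, 1]   # y coordinates
    curr = np.asarray(pose_history[-1])[head_trunk, 1]
    # Image y grows downward, so a fall shows up as positive y velocity.
    velocity = (curr - prev).mean() * fps
    return velocity > FALL_VELOCITY_THRESHOLD
```

A slow, deliberate lie-down produces a much smaller velocity than an accidental fall, which is what allows the two to be distinguished.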
For the study presented in this document, we conducted experiments using two publicly available fall datasets. This selection of only two datasets is based on time constraints, the need for multiple perspectives of falls, and the inclusion of diverse subjects.
Both selected datasets, URFD [5] and HQFSD [7], present notable challenges, such as adverse lighting conditions and occlusions. These conditions occasionally affect the human detection system and skeleton identification, generating data of suboptimal quality. This issue has been addressed by transforming the skeletons in three different ways, thus allowing for more robust inputs adapted to adverse conditions (a feature-extraction sketch follows the list):
- The slope of the segment between each pair of points.
- The angle formed by each point with respect to the coordinate axis.
- The normalized distance between each point and every other point.
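One plausible implementation of these three transformations, assuming the slope and angle features are computed over pairs of keypoints:

```python
import numpy as np

def skeleton_features(keypoints):
    """Compute the three transformations for one skeleton.
    `keypoints` is an (N, 2) array of (x, y) coordinates."""
    kp = np.asarray(keypoints, dtype=float)
    i, j = np.triu_indices(len(kp), k=1)   # all unordered point pairs
    dx = kp[j, 0] - kp[i, 0]
    dy = kp[j, 1] - kp[i, 1]

    # 1) Slopes between pairs of points (eps avoids division by zero).
    slopes = dy / (dx + 1e-8)

    # 2) Angles with respect to the horizontal axis.
    angles = np.arctan2(dy, dx)

    # 3) Pairwise distances, normalized by the largest distance.
    dists = np.hypot(dx, dy)
    dists = dists / (dists.max() + 1e-8)

    return slopes, angles, dists
```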
We tested the system with the three types of skeleton transformations described above (slopes, angles, and distances), plus a fourth variant combining these values, evaluated using cross-validation. Since the no-fall class is clearly the majority class in all evaluated datasets, we used undersampling for the preliminary analysis.
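A minimal sketch of the undersampling step, randomly discarding no-fall samples until the classes are balanced (the random seed is arbitrary):

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop majority-class (no-fall) samples so both classes
    are the same size. X: feature array, y: binary labels (1 = fall)."""
    rng = np.random.default_rng(seed)
    fall_idx = np.flatnonzero(y == 1)
    nofall_idx = np.flatnonzero(y == 0)
    keep = rng.choice(nofall_idx, size=len(fall_idx), replace=False)
    idx = rng.permutation(np.concatenate([fall_idx, keep]))
    return X[idx], y[idx]
```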
The results, obtained using various neural network architectures, are shown in Table 2. Different levels of accuracy were achieved; due to time constraints, the table was compiled using the URFD dataset only.
Table 2 highlights the inputs with which the tested architectures perform best: the BRNN and LSTM architectures, fed with distance and angle features, yield the best results. Starting from this selection, we applied successive modifications to the selected network until the BRNN and CNN1D networks reached the reported accuracy.
Finally, to apply these results in a real-world setting and implement a system that calls an emergency service in the event of a fall, we decided to deploy the CNN1D model in real time. As a demonstration, we display the detected fall on screen at approximately 9-10 frames per second (FPS).
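The exact CNN1D architecture is not detailed here; as an illustration only, a minimal Keras 1D CNN for binary fall classification over windows of skeleton features might look like this (window length, feature count, and layer sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN = 30       # assumed: frames per input window
N_FEATURES = 64    # assumed: skeleton features per frame

def build_cnn1d():
    """Minimal 1D CNN for binary fall classification over
    (SEQ_LEN, N_FEATURES) windows of skeleton features."""
    model = tf.keras.Sequential([
        layers.Conv1D(32, kernel_size=3, activation="relu",
                      input_shape=(SEQ_LEN, N_FEATURES)),
        layers.MaxPooling1D(2),
        layers.Conv1D(64, kernel_size=3, activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(1, activation="sigmoid"),  # P(fall)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```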
3. Conclusions
In this work, we have developed a fall detection system using only information from images.
To do so, we combined a person detection method, a pose estimation network, and geometric calculations that allow us to detect falls in numerous situations.
The resulting system runs in real time, analyzing video sequences to detect such events and distinguishing between an accidental fall and the intentional action of lying down.
References
[1] Singh, A. K., Kumbhare, V. A., & Arthi, K. (2021). Real-time human pose detection and recognition using MediaPipe. In International Conference on Soft Computing and Signal Processing (pp. 145-154). Singapore: Springer Nature Singapore.
[2] Terven, J., & Cordova-Esparza, D. (2023). A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv preprint arXiv:2304.00501.
[3] 2D Body Keypoint Datasets — MMPose 1.2.0 documentation. https://mmpose.readthedocs.io/en/latest/dataset_zoo/2d_body_keypoint.html
[4] Charfi, I., Miteran, J., Dubois, J., Atri, M., & Tourki, R. (2013). Optimized spatio-temporal descriptors for real-time fall detection: Comparison of support vector machine and Adaboost-based classification. Journal of Electronic Imaging, 22(4), 041106.
[5] Kwolek, B., & Kepski, M. (2014). Human fall detection on embedded platform using depth maps and wireless accelerometer. Computer Methods and Programs in Biomedicine, 117(3), 489-501.
[6] Ma, X., Wang, H., Xue, B., Zhou, M., Ji, B., & Li, Y. (2014). Depth-based human fall detection via shape features and improved extreme learning machine. IEEE Journal of Biomedical and Health Informatics, 18(6), 1915-1922.
[7] Baldewijns, G., Debard, G., Mertes, G., Vanrumste, B., & Croonenborghs, T. (2016). Bridging the gap between real-life data and simulated data by providing a highly realistic fall dataset for evaluating camera-based fall detection algorithms. Healthcare Technology Letters, 3(1), 6-11.
[8] Sucerquia, A., López, J. D., & Vargas-Bonilla, J. F. (2017). SisFall: A fall and movement dataset. Sensors, 17(1), 198.
[9] Martínez-Villaseñor, L., Ponce, H., Brieva, J., Moya-Albor, E., Núñez-Martínez, J., & Peñafort-Asturiano, C. (2019). UP-Fall detection dataset: A multimodal approach. Sensors, 19(9), 1988.
[10] Maldonado-Bascon, S., Iglesias-Iglesias, C., Martín-Martín, P., & Lafuente-Arroyo, S. (2019). Fallen people detection capabilities using assistive robot. Electronics, 8(9), 915.
[11] Alam, E., Sufian, A., Dutta, P., & Leo, M. (2022). Vision-based human fall detection systems using deep learning: A review. Computers in Biology and Medicine, 146, 105626.
[12] Kiranyaz, S., Avci, O., Abdeljaber, O., Ince, T., Gabbouj, M., & Inman, D. J. (2021). 1D convolutional neural networks and applications: A survey. Mechanical Systems and Signal Processing, 151, 107398.
[13] Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681.
[14] Graves, A. (2012). Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks (pp. 37-45). Springer.
[15] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Technical Report: https://docs.google.com/document/d/1gvqV0v7gqSlJnR1j8oYsthYfFULZ0Tzeq3xxAU9Iua4/edit?usp=sharing
Video Demonstration