Cargando…
Human Interaction Classification in Sliding Video Windows Using Skeleton Data Tracking and Feature Extraction †
A “long short-term memory” (LSTM)-based human activity classifier is presented for skeleton data estimated in video frames. A strong feature engineering step precedes the deep neural network processing. The video was analyzed in short-time chunks created by a sliding window. A fixed number of video...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
MDPI
2023
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10384121/ https://www.ncbi.nlm.nih.gov/pubmed/37514573 http://dx.doi.org/10.3390/s23146279 |
Sumario: | A “long short-term memory” (LSTM)-based human activity classifier is presented for skeleton data estimated in video frames. A strong feature engineering step precedes the deep neural network processing. The video was analyzed in short-time chunks created by a sliding window. A fixed number of video frames was selected for every chunk and human skeletons were estimated using dedicated software, such as OpenPose or HRNet. The skeleton data for a given window were collected, analyzed, and eventually corrected. A knowledge-aware feature extraction from the corrected skeletons was performed. A deep network model was trained and applied for two-person interaction classification. Three network architectures were developed—single-, double- and triple-channel LSTM networks—and were experimentally evaluated on the interaction subset of the ”NTU RGB+D” data set. The most efficient model achieved an interaction classification accuracy of 96%. This performance was compared with the best reported solutions for this set, based on “adaptive graph convolutional networks” (AGCN) and “3D convolutional networks” (e.g., OpenConv3D). The sliding-window strategy was cross-validated on the ”UT-Interaction” data set, containing long video clips with many changing interactions. We concluded that a two-step approach to skeleton-based human activity classification (a skeleton feature engineering step followed by a deep neural network model) represents a practical tradeoff between accuracy and computational complexity, due to an early correction of imperfect skeleton data and a knowledge-aware extraction of relational features from the skeletons. |
---|