Fine-to-coarse self-attention graph convolutional network for skeleton-based action recognition

Kilic, Ugur; Oztimur Karadag, Ozge; TÜMÜKLÜ ÖZYER, Gülşah

doi:10.1016/j.asoc.2025.114268

Applied Soft Computing, vol. 186, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 186
Publication Date: 2026
DOI: 10.1016/j.asoc.2025.114268
Journal Name: Applied Soft Computing
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC
Keywords: Fine-to-coarse approach, Graph convolutional networks, Multi-scale, Skeletal data, Skeleton-based action recognition, Temporal self-attention
Affiliated with Atatürk University: Yes

Abstract

Skeleton data has become an important modality in action recognition due to its robustness to environmental changes, computational efficiency, compact structure, and privacy-oriented nature. With the rise of deep learning, many methods for action recognition using skeleton data have been developed. Among these methods, spatial-temporal graph convolutional networks (ST-GCNs) have seen growing popularity due to the suitability of skeleton data for graph-based modeling. However, ST-GCN models use fixed graph topologies and fixed-size spatial-temporal convolution kernels. This limits their ability to model coordinated movements of joints in different body regions and long-term spatial-temporal dependencies. To address these limitations, we propose a fine-to-coarse self-attention graph convolutional network (FCSA-GCN). Our approach employs a fine-to-coarse scaling strategy for multi-scale feature extraction. This strategy effectively models both local and global spatial-temporal relationships and better represents the interactions among joint groups in different body regions. By integrating a temporal self-attention mechanism (TSA) into the multi-scale feature extraction process, we enhance the model's ability to capture long-term temporal dependencies effectively. Additionally, during training, we employ the dynamic weight averaging (DWA) approach to ensure balanced optimization across the multi-scale feature extraction stages. Comprehensive experiments conducted on the NTU-60, NTU-120, and NW-UCLA datasets demonstrate that FCSA-GCN outperforms state-of-the-art methods. These results highlight that the proposed approach effectively addresses the current challenges in skeleton-based action recognition (SBAR).
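
The abstract mentions dynamic weight averaging (DWA) for balancing optimization across the multi-scale stages but does not give its formula. As a rough illustration only, the sketch below uses the formulation commonly seen in multi-task learning, where each loss term is weighted by a softmax over its recent rate of change. Whether FCSA-GCN uses exactly this variant is an assumption; the function name `dwa_weights`, its parameters, and the temperature default of 2.0 are hypothetical and not taken from the paper.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per loss term (hypothetical helper, not the paper's code).

    prev_losses:      loss of each stage at epoch t-1
    prev_prev_losses: loss of each stage at epoch t-2
    Assumes every loss is strictly positive.
    """
    if len(prev_losses) != len(prev_prev_losses):
        raise ValueError("both loss lists must have the same length")

    k = len(prev_losses)
    # Relative descent rate: values near 1 mean the loss has barely improved.
    ratios = [l1 / l2 for l1, l2 in zip(prev_losses, prev_prev_losses)]

    # Softmax over ratios / temperature, shifted by the maximum for numerical stability.
    scaled = [r / temperature for r in ratios]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)

    # Scale so the weights sum to the number of loss terms.
    return [k * e / total for e in exps]


# Example: the first stage improves more slowly, so it receives the larger weight.
weights = dwa_weights([0.9, 0.5], [1.0, 1.0])
```

In this formulation, stages whose loss is falling slowly receive larger weights, and a higher temperature pushes the weights toward uniform. For the first two epochs, where no loss history exists, the weights are usually just set to 1.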