Fine-to-coarse self-attention graph convolutional network for skeleton-based action recognition


Kilic U., Oztimur Karadag O., TÜMÜKLÜ ÖZYER G.

Applied Soft Computing, vol. 186, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 186
  • Publication Date: 2026
  • DOI: 10.1016/j.asoc.2025.114268
  • Journal Name: Applied Soft Computing
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC
  • Keywords: Fine-to-coarse approach, Graph convolutional networks, Multi-scale, Skeletal data, Skeleton-based action recognition, Temporal self-attention
  • Affiliated with Atatürk University: Yes

Abstract

Skeleton data has become an important modality in action recognition due to its robustness to environmental changes, computational efficiency, compact structure, and privacy-oriented nature. With the rise of deep learning, many methods for action recognition using skeleton data have been developed. Among these methods, spatial-temporal graph convolutional networks (ST-GCNs) have seen growing popularity due to the suitability of skeleton data for graph-based modeling. However, ST-GCN models use fixed graph topologies and fixed-size spatial-temporal convolution kernels. This limits their ability to model coordinated movements of joints in different body regions and long-term spatial-temporal dependencies. To address these limitations, we propose a fine-to-coarse self-attention graph convolutional network (FCSA-GCN). Our approach employs a fine-to-coarse scaling strategy for multi-scale feature extraction. This strategy models both local and global spatial-temporal relationships and better represents the interactions among joint groups in different body regions. By integrating a temporal self-attention (TSA) mechanism into the multi-scale feature extraction process, we enhance the model's ability to capture long-term temporal dependencies. Additionally, during training, we employ the dynamic weight averaging (DWA) approach to ensure balanced optimization across the multi-scale feature extraction stages. Comprehensive experiments conducted on the NTU-60, NTU-120, and NW-UCLA datasets demonstrate that FCSA-GCN outperforms state-of-the-art methods. These results highlight that the proposed approach effectively addresses the current challenges in skeleton-based action recognition (SBAR).
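
The abstract names dynamic weight averaging (DWA) but does not state how the weights are computed. As a rough illustration only, the sketch below follows the formulation commonly used in multi-task learning, where each loss term is weighted by a temperature-scaled softmax over its recent rate of change. Treating each feature-extraction scale as one loss term, the function name `dwa_weights`, the temperature of 2.0, and the sample loss values are all assumptions made for this example, not details taken from the paper.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per loss term; the weights sum to the number of terms.

    prev_losses:      per-term losses from epoch t-1 (hypothetical inputs)
    prev_prev_losses: per-term losses from epoch t-2 (hypothetical inputs)
    temperature:      softness of the weighting (assumed value, not from the paper)
    """
    if len(prev_losses) != len(prev_prev_losses):
        raise ValueError("both loss lists must have the same length")
    k = len(prev_losses)

    # Relative descending rate: a ratio near 1 means the loss barely improved.
    rates = [cur / old for cur, old in zip(prev_losses, prev_prev_losses)]

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    peak = max(rates)
    exps = [math.exp((r - peak) / temperature) for r in rates]
    total = sum(exps)

    # Scale so that the weights sum to k rather than to 1.
    return [k * e / total for e in exps]


# Hypothetical losses for three scales (fine, middle, coarse) over two epochs.
weights = dwa_weights([1.0, 0.5, 0.9], [1.0, 1.0, 1.0])
```

With these made-up losses, the second term fell fastest and therefore receives the smallest weight, so the slower-improving terms get more emphasis in the next epoch. When all terms improve at the same rate, every weight equals 1.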