IEEE Internet of Things Journal, 2025 (SCI-Expanded, Scopus)
As 6G-enabled Intelligent Internet of Vehicles (IIoV) deployments generate massive volumes of sensory data, traditional deep learning models struggle to capture long-range dependencies across heterogeneous sensor modalities while preserving privacy. This paper proposes DT-Trans, a privacy-preserving federated learning framework that combines Digital Twin (DT) technology with Vision Transformers. DT-Trans first trains a global perception model on synthetic digital twin data and then fine-tunes it efficiently for real-world vehicles. Its Twin-Enhanced Vision Transformer (TE-ViT) serves as the global perception backbone: it is pre-trained on massive synthetic DT data and then fine-tuned via parameter-efficient LoRA adapters to bridge the domain gap between the virtual and physical worlds. The Cluster-Enhanced Decoupled Personalized Federated Learning algorithm (CD-PFL-Trans) splits each TE-ViT into (i) a shared Transformer encoder (base layer) and (ii) client-specific Transformer decoder heads (personalized layer); hierarchical clustering on the decoder parameters groups vehicles with similar driving patterns, enabling group-wise aggregation of the personalized components without exchanging raw sensory data. DT-Trans outperforms CNN-based FedAvg/FedPer baselines by 9.3%–16.2% mAP on V&PKITTI perception tasks and by up to 42.8% accuracy on CINIC-10 classification under severe heterogeneity, while reducing on-device FLOPs by 34% via Transformer sparsity techniques. Our work advances Transformer architectures for scalable, privacy-preserving perception in IIoV.
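To make the parameter-efficient fine-tuning step concrete, the following is a minimal sketch of LoRA adaptation in PyTorch: the DT-pretrained backbone is frozen and only low-rank adapters are trained on real-world vehicle data. This is an illustrative assumption, not the paper's implementation; class names, the rank, and the scaling factor are all hypothetical.

```python
# Hypothetical LoRA fine-tuning sketch: freeze the DT-pretrained backbone,
# train only low-rank adapters to close the virtual-to-physical domain gap.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep pre-trained weights fixed
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base path + low-rank correction learned from real sensor data.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


def inject_lora(model: nn.Module, rank: int = 8) -> nn.Module:
    """Wrap nn.Linear submodules with LoRA (skipping fused attention internals)."""
    for name, child in model.named_children():
        if isinstance(child, nn.MultiheadAttention):
            continue  # its projections are accessed by attribute internally
        if isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, rank=rank))
        else:
            inject_lora(child, rank)
    return model


# Usage: freeze a stand-in encoder, then add a small trainable adapter budget.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=6)
for p in backbone.parameters():
    p.requires_grad = False
inject_lora(backbone)
n_train = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print(f"trainable adapter parameters: {n_train}")
```

Only the adapter matrices are ever updated on-device, which is what makes the fine-tuning stage cheap enough for vehicle hardware.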
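Similarly, one round of the decoupled aggregation in CD-PFL-Trans might look like the sketch below: FedAvg over the shared encoders of all clients, hierarchical clustering of clients by their flattened decoder parameters, and averaging of decoder heads only within each cluster. The cluster count, linkage method, and function names are assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch of one CD-PFL-Trans aggregation round, assuming each
# client model is already split into a shared encoder and a personal decoder.
from typing import Dict, List
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def fedavg(states: List[Dict[str, np.ndarray]]) -> Dict[str, np.ndarray]:
    """Uniform federated averaging over a list of parameter dictionaries."""
    return {k: np.mean([s[k] for s in states], axis=0) for k in states[0]}


def cd_pfl_round(encoders: List[Dict[str, np.ndarray]],
                 decoders: List[Dict[str, np.ndarray]],
                 num_clusters: int = 3):
    # 1) Base layer: aggregate the shared encoder across ALL clients.
    global_encoder = fedavg(encoders)

    # 2) Personalized layer: hierarchically cluster clients by their
    #    flattened decoder parameters (a proxy for driving-pattern similarity).
    features = np.stack([np.concatenate([v.ravel() for v in d.values()])
                         for d in decoders])
    labels = fcluster(linkage(features, method="ward"),
                      t=num_clusters, criterion="maxclust")

    # 3) Aggregate decoder heads only within each cluster; only parameters,
    #    never raw sensory data, leave the vehicle.
    group_decoders = {g: fedavg([d for d, l in zip(decoders, labels) if l == g])
                      for g in set(labels)}

    # Each client receives the global encoder plus its own group's decoder.
    return global_encoder, {i: group_decoders[l] for i, l in enumerate(labels)}
```

Clustering on decoder parameters rather than on data keeps the privacy guarantee intact: similarity is inferred from what each vehicle has learned, not from what it has observed.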