Journal of Translational Medicine, vol. 24, no. 1, 2026 (SCI-Expanded, Scopus)
Background: Machine-learning models based on tissue transcriptomic data are powerful tools for disease classification. However, their clinical adoption is limited by the invasive nature of tissue sampling. Furthermore, transcriptomic datasets are often affected by batch effects and gene-level noise, which compromise model generalizability across platforms and clinical cohorts.

Methods: We developed WBT-DC (Whole Blood Transcriptomics–based Disease Classification), a computational pipeline designed to overcome these challenges. WBT-DC integrates rank-based feature extraction to mitigate batch effects with an ensemble machine-learning framework that incorporates cross-validation and hyperparameter optimization. Its performance was systematically evaluated across five independent cohorts involving 2,164 participants and three disease contexts: Crohn’s disease (CD), ulcerative colitis (UC), and amyotrophic lateral sclerosis (ALS). We tested the model’s robustness across RNA-sequencing and microarray platforms. Additionally, an internal rheumatoid arthritis (RA) cohort (n = 165) was utilized for real-world prospective validation.

Results: WBT-DC demonstrated high accuracy, achieving ROC–AUC values of 0.90–0.94 in independent datasets when training and testing were conducted on the same platform. In cross-platform evaluations, the pipeline maintained robust performance with ROC–AUC values ranging from 0.71 to 0.84, consistently outperforming conventional gene expression-based models. In the RA validation cohort, WBT-DC achieved an ROC–AUC of 0.81, supporting its applicability in a real-world clinical setting.

Conclusions: WBT-DC provides a robust, non-invasive, and platform-agnostic framework for disease classification using whole-blood transcriptomics. By effectively addressing batch effects and platform variability, this pipeline offers a scalable solution for translating systems-level transcriptomic insights into applications.
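The abstract does not specify how WBT-DC computes its rank-based features, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes one common form of rank-based feature extraction: replacing each gene's expression value with its within-sample rank. The function name `rank_transform`, the expression values, and the "platform distortion" are all hypothetical and chosen purely for illustration. The point it shows is that within-sample ranks are unchanged by any strictly increasing rescaling of a sample, which is one reason rank features can be less sensitive to platform and batch differences than raw expression values.

```python
import math


def rank_transform(expression):
    """Replace each value with its within-sample rank, scaled to [0, 1].

    Tied values receive the average of the ranks they span.
    Hypothetical helper for illustration; not part of WBT-DC.
    """
    n = len(expression)
    # Indices of the genes, ordered from lowest to highest expression.
    order = sorted(range(n), key=lambda i: expression[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < n and expression[order[j + 1]] == expression[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0  # zero-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    denom = max(n - 1, 1)  # avoid division by zero for one-gene samples
    return [r / denom for r in ranks]


# Made-up expression values for five genes in a single sample.
sample_platform_a = [5.0, 120.0, 33.0, 33.0, 0.5]

# The same sample after an invented, strictly increasing distortion
# (log scale plus an offset), standing in for a platform difference.
sample_platform_b = [2.0 * math.log2(x + 1.0) + 3.0 for x in sample_platform_a]

features_a = rank_transform(sample_platform_a)
features_b = rank_transform(sample_platform_b)

# The raw values differ between the two versions, but the rank features match.
print(features_a)
print(features_a == features_b)
```

In a full pipeline, features like these would then be passed to the classifiers; real batch effects are not guaranteed to be monotone within a sample, so this invariance is an idealized case rather than a guarantee.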