Empirical Evaluation of Various Metaheuristics for Efficient Feature Selection in High-Dimensional Bioinformatics Data

ÇELİK E., Dertli S. E., DAL D.

9th International Symposium on Innovative Approaches in Smart Technologies, ISAS 2025, Gaziantep, Türkiye, 27-28 June 2025 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1109/isas66241.2025.11101825
City of Publication: Gaziantep
Country of Publication: Türkiye
Keywords: ant colony optimization, bioinformatics, classification, clustering, cross-validation, decision tree classifier, feature selection, gene expression data, genetic algorithm, k-means, k-nearest neighbors, metaheuristic, naive bayes, simulated annealing, support vector classification, wrapper approach
Affiliated with Atatürk University: Yes

Abstract

The analysis of gene expression data in bioinformatics is critical for understanding genetic and molecular biological processes, supporting the early diagnosis of diseases, and developing therapeutic strategies. However, because these data are typically high-dimensional and structurally complex, the analysis can be challenging. This study investigated the effectiveness of feature selection methods for improving the classification performance of high-dimensional gene expression data. To this end, a wrapper approach was developed to enhance classification accuracy through feature selection. In the first stage of the study, an unlabeled dataset was divided into two classes using the K-Means clustering method, thereby obtaining the label information needed for classification. Subsequently, feature selection was performed using three metaheuristics, namely Simulated Annealing (SA), Genetic Algorithm (GA), and Ant Colony Optimization (ACO). The accuracy of each algorithm was evaluated with four machine learning classifiers, namely Support Vector Classification (SVC), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Decision Tree Classifier (DTC), using 10-fold cross-validation. This process was carried out on feature subsets of 5, 10, 15, 20, 50, and 100 features, and execution times were also recorded. The findings showed that SA is a good option when speed matters, combining fast execution times with acceptable accuracy, while GA provided a balance between accuracy and execution time, making it suitable for optimization purposes. ACO, on the other hand, achieved high accuracy rates but required longer execution times.
In summary, this study emphasizes that the selection of the appropriate metaheuristic should be made according to application requirements and provides a detailed framework for examining the performance of three metaheuristics in bioinformatics analyses.
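
The wrapper idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes class labels are already available (the study derives them with K-Means), uses only SA with a single KNN classifier, and is written in plain Python. The function names (`knn_predict`, `cv_accuracy`, `sa_select`), the neighbor move (swapping one selected feature for an unselected one), the geometric cooling schedule, and every parameter value are hypothetical choices made for illustration.

```python
import math
import random

def knn_predict(train_X, train_y, x, k=3):
    # Majority vote among the k nearest training points (squared Euclidean distance).
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

def cv_accuracy(X, y, features, folds=10, k=3):
    # Mean cross-validated KNN accuracy using only the given feature indices.
    Xs = [[row[j] for j in features] for row in X]
    n = len(Xs)
    scores = []
    for f in range(folds):
        test_idx = list(range(f, n, folds))  # every folds-th sample forms one fold
        test_set = set(test_idx)
        train_idx = [i for i in range(n) if i not in test_set]
        train_X = [Xs[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, Xs[i], k) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / folds

def sa_select(X, y, subset_size, iters=200, t0=1.0, cooling=0.97, seed=0):
    # Simulated annealing over fixed-size feature subsets.
    # Assumes subset_size is smaller than the number of features.
    rng = random.Random(seed)
    n_features = len(X[0])
    current = rng.sample(range(n_features), subset_size)
    current_score = cv_accuracy(X, y, current)
    best, best_score = list(current), current_score
    temp = t0
    for _ in range(iters):
        # Neighbor move: swap one selected feature for one unselected feature.
        candidate = list(current)
        unselected = [j for j in range(n_features) if j not in current]
        candidate[rng.randrange(subset_size)] = rng.choice(unselected)
        cand_score = cv_accuracy(X, y, candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse subsets with probability exp(delta / temp).
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = list(current), current_score
        temp *= cooling  # geometric cooling
    return sorted(best), best_score

# Synthetic demo data (hypothetical): feature 0 separates the classes, the rest is noise.
data_rng = random.Random(42)
y = [0] * 30 + [1] * 30
X = [[10 * label + data_rng.uniform(-1, 1)] + [data_rng.uniform(-1, 1) for _ in range(19)]
     for label in y]
selected, score = sa_select(X, y, subset_size=5, iters=30)
print(selected, round(score, 3))
```

Because the cross-validated accuracy is recomputed for every candidate subset, the cost of a wrapper grows with the number of iterations and the cost of the classifier, which is consistent with the abstract's emphasis on recording execution times alongside accuracy.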