Better with fewer features: climate dynamics estimation for Van Lake basin using feature selection

ÇOBAN, Önder; Esit, Musa; Yalçın, Sercan; BOZKURT, Ferhat

doi:10.1007/s11356-025-36057-4

Environmental Science and Pollution Research, vol.32, no.10, pp.5849-5873, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 32 Issue: 10
Publication Date: 2025
DOI Number: 10.1007/s11356-025-36057-4
Journal Name: Environmental Science and Pollution Research
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, IBZ Online, ABI/INFORM, Aerospace Database, Agricultural & Environmental Science Database, Aqualine, Aquatic Science & Fisheries Abstracts (ASFA), BIOSIS, CAB Abstracts, EMBASE, Environment Index, Geobase, MEDLINE, Pollution Abstracts, Veterinary Science Database, Civil Engineering Abstracts
Page Numbers: pp.5849-5873
Keywords: Artificial intelligence, Climatological parameter estimation, Feature selection, Machine learning
Affiliated with Atatürk University: Yes

Abstract

Although many studies develop forecasting models based on machine learning (ML) or statistical techniques, the large majority do not employ feature selection. To fill this gap, this study builds prediction models that incorporate feature selection for one-step-ahead estimation of climatological parameters (i.e., temperature and evapotranspiration). In addition, the best models are used to make estimations over a long horizon of 30 years. Experiments on three stations located in the Van Lake closed basin of Turkey showed that the Bayesian Ridge regressor (BRR) often outperforms the other regressors. The best BRR-based models achieved R² scores ranging from 0.961 to 0.988. Moreover, feature selection allows any model to reach or exceed its baseline performance with fewer features. Finally, the approach has a limitation: it needs non-sparse, complete time series data to produce satisfactory results, so applying our regression-based ML pipeline to a sparse time series dataset would be challenging.
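
The general idea of one-step-ahead estimation with feature selection and a Bayesian Ridge regressor can be illustrated with a minimal sketch. This is not the authors' pipeline: the lagged-value features, the use of scikit-learn's `SelectKBest` with `f_regression` as the selector, the helper names `make_lagged` and `fit_and_score`, the parameter values, and the synthetic monthly series are all assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline


def make_lagged(series, n_lags):
    """Row t of X holds the n_lags values that precede the target y[t]."""
    n = len(series)
    X = np.column_stack([series[i:n - n_lags + i] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y


def fit_and_score(series, n_lags=12, k=4, test_fraction=0.2):
    """Fit selector + Bayesian Ridge on the early part, score on the rest."""
    X, y = make_lagged(np.asarray(series, dtype=float), n_lags)
    split = int(len(y) * (1 - test_fraction))  # chronological split, no shuffling
    # Hypothetical choice: keep the k lags with the highest univariate F-score.
    model = make_pipeline(SelectKBest(f_regression, k=k), BayesianRidge())
    model.fit(X[:split], y[:split])
    return model, r2_score(y[split:], model.predict(X[split:]))


# Synthetic stand-in for a monthly temperature record (not real station data).
rng = np.random.default_rng(0)
months = np.arange(360)
synthetic = 10 + 12 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1.0, months.size)

model, r2 = fit_and_score(synthetic)
n_selected = int(model.named_steps["selectkbest"].get_support().sum())
print(f"lags kept: {n_selected} of 12, test R2: {r2:.3f}")
```

The split is chronological so that the model is always evaluated on values later than those it was trained on. A complete, gap-free series is assumed here, which mirrors the limitation stated in the abstract: building lagged rows breaks down when observations are missing.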