Performance Analysis of Embedding Methods for Deep Learning-Based Turkish Sentiment Analysis Models

Alawi, Abdulfattah; BOZKURT, Ferhat

doi:10.1007/s13369-024-09360-4

Alawi A. B., BOZKURT F.

ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, vol.50, no.10, pp.7299-7321, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 50 Issue: 10
Publication Date: 2025
DOI Number: 10.1007/s13369-024-09360-4
Journal Name: ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Aerospace Database, Communication Abstracts, Metadex, Pollution Abstracts, zbMATH, Civil Engineering Abstracts
Page Numbers: pp.7299-7321
Keywords: Character embedding, Deep learning, Embedding technique, Text classification, Textual data mining, Turkish short text, Word embedding
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Atatürk University: Yes

Abstract

The complex syntactic structure of Turkish text makes sentiment analysis a challenging task in natural language processing (NLP). Conventional sentiment analysis methods often fail to identify attitudes in Turkish texts effectively, creating an urgent need for more efficient approaches. To address this need, our study investigates the effectiveness of several embedding techniques for sentiment analysis of Turkish short texts: pre-trained Turkish Word2Vec, GloVe, and FastText models, together with two character-level embedding methods, namely character-integer embedding (CIE) and character one-hot encoding embedding (COE). These embeddings are combined with deep learning (DL) models, specifically long short-term memory (LSTM), convolutional neural networks (CNNs), bidirectional LSTM (Bi-LSTM), and hybrid models. The DL-based models were evaluated on two datasets: an original Twitter (X) dataset and a publicly accessible hotel reviews dataset. In addition to providing an intensive performance analysis of the embedding strategies and assessing their efficacy in dealing with the linguistic intricacies of Turkish, this study proposes a previously unexplored method of Turkish text representation that relies on character-level one-hot encoding. The findings indicate positive progress from a novel dual-pathway architecture that processes text at both the character level and the word level, which constitutes a substantial contribution to NLP, specifically in the context of morphologically complex languages.
By employing a hybrid strategy that combines character and word levels on Twitter (X) data, the LSTM model obtained an F1 score of 0.835 ± 0.005 under cross-validation, while CNN-BiLSTM attained the highest F1 score (0.8392) under holdout validation. On the second, public dataset (hotel reviews), this strategy also produced consistent, modest improvements and ranked as the second most effective embedding technique, surpassed only by FastText. By providing an extensive performance analysis of embedding techniques and deep learning models for sentiment analysis of Turkish texts, the findings offer practical recommendations for practitioners on how to use sentiment analysis effectively to make informed decisions, which is crucial in the current age of data analysis.
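The two character-level representations named in the abstract, character-integer embedding (CIE) and character one-hot encoding (COE), can be illustrated with a minimal sketch. The abstract does not specify the alphabet, the padding scheme, or the sequence length, so everything below is an assumption made for illustration: the `ALPHABET` string, the reserved index 0 for padding and unknown characters, the all-zero one-hot rows for padding, and the helper names `turkish_lower`, `char_integer_encode`, and `char_one_hot_encode` are hypothetical and are not taken from the paper.

```python
# Assumed alphabet: the 29 lowercase Turkish letters, digits, and space.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz0123456789 "

# Assumption: index 0 is reserved for padding and unknown characters.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
VOCAB_SIZE = len(ALPHABET) + 1


def turkish_lower(text):
    """Lowercase with Turkish casing rules (I -> ı, İ -> i)."""
    # str.lower() alone would map "I" to "i" and "İ" to "i" plus a
    # combining dot, so the two Turkish-specific cases are handled first.
    return text.replace("I", "ı").replace("İ", "i").lower()


def char_integer_encode(text, max_len):
    """Map each character to an integer id, then truncate or zero-pad."""
    ids = [CHAR_TO_ID.get(ch, 0) for ch in turkish_lower(text)[:max_len]]
    return ids + [0] * (max_len - len(ids))


def char_one_hot_encode(text, max_len):
    """Return a max_len x VOCAB_SIZE matrix of 0/1 values.

    Padding and unknown characters become all-zero rows (an assumption).
    """
    rows = []
    for idx in char_integer_encode(text, max_len):
        row = [0] * VOCAB_SIZE
        if idx > 0:
            row[idx] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(char_integer_encode("Çok iyi", 10))
    print(len(char_one_hot_encode("Çok iyi", 10)), "rows of", VOCAB_SIZE)
```

In a setup like this, the integer sequence would typically feed a trainable embedding layer, whereas the one-hot matrix can be passed directly to a convolutional or recurrent layer. Which of these arrangements the authors actually used is not stated in the abstract.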