Domain Effect Investigation for Bert Models Fine-Tuned on Different Text Categorization Tasks


ÇOBAN Ö., YAĞANOĞLU M., BOZKURT F.

ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, vol. 49, no. 3, pp. 3685-3702, 2023 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 49 Issue: 3
  • Publication Date: 2023
  • DOI Number: 10.1007/s13369-023-08142-8
  • Journal Name: ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Aerospace Database, Communication Abstracts, Metadex, Pollution Abstracts, zbMATH, Civil Engineering Abstracts
  • Page Numbers: pp. 3685-3702
  • Keywords: Bidirectional Encoder Representations from Transformer, Deep learning, Text categorization, User-generated content
  • Atatürk University Affiliated: Yes

Abstract

Text categorization (TC) is one of the most useful tools for automatically organizing today's huge volumes of text data. Practitioners widely use it to classify texts for different purposes, including sentiment analysis, authorship detection, and spam detection. However, applying TC to different fields can be challenging, since it requires training a separate model on a large, labeled data set specific to each field. This is very time-consuming, and building a large, labeled, domain-specific data set is often very hard. To overcome this problem, language models have recently been employed to transfer information learned from a large corpus to downstream tasks. Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular language models and has been shown to provide very good results for TC tasks. Hence, in this study, we use four pretrained BERT models trained on formal text data as well as our own BERT models trained on Facebook messages. We then fine-tune these BERT models on downstream data sets collected from different domains, such as Twitter and Instagram. We aim to investigate whether fine-tuned BERT models can provide satisfying results on downstream tasks from different domains via transfer learning. The results of our extensive experiments show that BERT models provide very satisfying results and that selecting both the BERT model and the downstream task's data from the same or a similar domain tends to further improve performance. This shows that a well-trained language model can remove the need for a separate training process for each downstream TC task within the online social network (OSN) domain.
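For reference, the sketch below illustrates the kind of fine-tuning workflow the abstract describes: loading a pretrained BERT checkpoint, attaching a classification head, and training it on a small downstream text set with Hugging Face Transformers. The checkpoint name, the toy examples, and all hyperparameters are illustrative assumptions, not the specific models or data sets used in the paper.

```python
# Minimal sketch: fine-tune a pretrained BERT checkpoint for binary text
# categorization. Checkpoint, data, and hyperparameters are placeholders.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # assumed general-domain BERT model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy user-generated content standing in for a downstream OSN data set.
texts = ["great product, loved it", "terrible service, never again"]
labels = [1, 0]

class TextDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-tc", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=TextDataset(texts, labels),
)
trainer.train()  # fine-tunes all BERT weights plus the classification head
```

Swapping the checkpoint for a domain-specific BERT model (e.g., one pretrained on social media text) while keeping the rest of the pipeline unchanged is the kind of comparison the study's domain-effect analysis is concerned with.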