IRText: An Item Response Theory-Based Approach for Text Categorization


Coban Ö.

ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, vol.47, no.8, pp.9423-9439, 2022 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 47 Issue: 8
  • Publication Date: 2022
  • DOI: 10.1007/s13369-021-06238-7
  • Journal Name: ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, zbMATH
  • Page Numbers: pp.9423-9439
  • Keywords: Item response theory, Text categorization, Term weighting, Feature selection
  • Affiliated with Atatürk University: No

Abstract

Text categorization (TC) is a machine learning task that assigns a text to one of a set of predefined categories. In a nutshell, texts are converted into numerical feature vectors in which each feature is associated with a weight. A classifier is then trained on the vectorized texts and used to classify previously unseen documents. Feature selection (FS) is optionally applied to achieve better classification accuracy with fewer features. Item response theory (IRT), on the other hand, is a set of statistical models designed to characterize persons based on their responses to questions, under the assumption that the response to a given item is a function of both person and item properties. Although many studies are devoted to understanding, exploring, and improving methods in each field, no previous study has aimed to combine the strengths of the two. In this study, therefore, an IRT-based approach is proposed that uses the IRT score of a feature in both term weighting and FS, which are important intermediate steps of TC. The efficiency of the proposed approach is measured on two well-known benchmark datasets by comparing it with two traditional counterparts. Experimental results show that the IRT-based approach can be used for text FS and that there is room for further improvement. To the best of our knowledge, this study is the first of its kind to adapt IRT to classical TC.
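
As a rough illustration of the building blocks mentioned in the abstract, the sketch below shows a toy term-weighting and feature-selection step alongside a standard two-parameter logistic (2PL) item response function. The abstract does not state which IRT model the study uses or how a feature's IRT score is computed, so the 2PL form, the TF-IDF weighting, the document-frequency selection rule, and all names here (`tfidf_vectors`, `select_top_k`, `irt_2pl`, the toy corpus) are assumptions chosen for illustration and not the paper's method.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into {term: weight} dicts using plain TF-IDF."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors, df


def select_top_k(df, k):
    """Toy feature selection: keep the k terms with the highest document frequency.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(df, key=lambda t: (-df[t], t))
    return set(ranked[:k])


def irt_2pl(theta, a, b):
    """Standard 2PL item response function (an assumed model, see lead-in).

    theta: person ability, a: item discrimination, b: item difficulty.
    Returns the probability of a correct response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical three-document corpus, already tokenized.
docs = [
    "cheap pills buy now".split(),
    "meeting agenda for monday".split(),
    "buy cheap tickets now".split(),
]

vectors, df = tfidf_vectors(docs)
kept = select_top_k(df, 3)
# Drop every feature that did not survive selection.
reduced = [{t: w for t, w in v.items() if t in kept} for v in vectors]

# A person whose ability equals the item difficulty answers correctly
# with probability 0.5, whatever the discrimination.
p_equal = irt_2pl(theta=0.0, a=1.0, b=0.0)
```

In a full pipeline the reduced vectors would be passed to a classifier. The study's proposal, as the abstract describes it, is to let an IRT-derived score take the place of the conventional weighting and selection criteria used in this sketch.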