Turkish named entity discovery based on termsets


ÇOBAN Ö., ÖZEL S. A., İNAN A.

4th International Conference on Computer Science and Engineering, Samsun, Türkiye, 11-15 September 2019, pp. 28-32

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI Number: 10.1109/ubmk.2019.8907039
  • City of Publication: Samsun
  • Country of Publication: Türkiye
  • Pages: pp. 28-32
  • Keywords: Named entity recognition, frequent itemset mining, termsets, text classification, RECOGNITION, TEXT
  • Affiliated with Atatürk University: No

Abstract

Named Entity Recognition (NER) is a subtask of information extraction that aims to discover named entities in unstructured text. Previous studies on NER mostly use statistical machine learning models rather than classifiers, because treating the problem as a classification task requires dealing with very high-dimensional and sparse vector spaces. In this paper, we treat NER as a classical text classification problem and extract nominal features from each token in the unstructured text sequence. We convert each token into a document transaction, use frequent termset mining to extract termset features, and apply termset weighting to classify named entities. As a result, we work with lower-dimensional feature spaces. Our experimental results on a large Turkish dataset show that frequent termsets and their weighting scheme can be used for the NER task.
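
The pipeline described in the abstract (token to transaction, frequent termset mining, termset weighting, classification) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the four nominal features in `token_features`, the brute-force enumeration of termsets up to size 2 (standing in for a real frequent itemset miner such as Apriori), the class-confidence weighting in `termset_weights`, the summed-weight decision rule in `classify`, and the toy training sentences.

```python
from collections import Counter
from itertools import combinations


def token_features(tokens, i):
    """Hypothetical nominal features for token i (not the paper's feature set)."""
    tok = tokens[i]
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    return frozenset({
        f"lower={tok.lower()}",
        f"cap={tok[:1].isupper()}",
        f"suffix={tok[-2:].lower()}",
        f"prev={prev_tok}",
    })


def frequent_termsets(transactions, min_support, max_size=2):
    """Brute-force count of all termsets up to max_size; keep the frequent ones."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {ts: c for ts, c in counts.items() if c >= min_support}


def termset_weights(transactions, labels, termsets):
    """Assumed weighting: share of a termset's occurrences that carry each label."""
    weights = {}
    for ts, support in termsets.items():
        by_label = Counter(
            lab for t, lab in zip(transactions, labels) if ts <= t
        )
        weights[ts] = {lab: c / support for lab, c in by_label.items()}
    return weights


def classify(transaction, weights, default="O"):
    """Sum the label weights of every frequent termset contained in the transaction."""
    scores = Counter()
    for ts, per_label in weights.items():
        if ts <= transaction:
            for lab, w in per_label.items():
                scores[lab] += w
    return scores.most_common(1)[0][0] if scores else default


# Toy training data, invented for illustration only.
TRAIN = [
    (["Ahmet", "Ankara", "gitti"], ["PERSON", "LOCATION", "O"]),
    (["Mehmet", "Ankara", "geldi"], ["PERSON", "LOCATION", "O"]),
]

transactions, labels = [], []
for tokens, tags in TRAIN:
    for i, tag in enumerate(tags):
        transactions.append(token_features(tokens, i))
        labels.append(tag)

termsets = frequent_termsets(transactions, min_support=2)
weights = termset_weights(transactions, labels, termsets)

test_tokens = ["Selin", "Ankara", "geldi"]
predicted = [
    classify(token_features(test_tokens, i), weights)
    for i in range(len(test_tokens))
]
print(predicted)
```

Keeping only termsets that meet `min_support` is what shrinks the feature space in this sketch: rare features such as a token's own surface form drop out, and each token is scored only against the termsets that survive.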