q-gram hash comparison based multiple exact string matching algorithm for DNA sequences


Creative Commons License

Karcıoğlu A. A., Bulut H.

JOURNAL OF THE FACULTY OF ENGINEERING AND ARCHITECTURE OF GAZI UNIVERSITY, cilt.38, sa.2, ss.875-888, 2023 (SCI-Expanded) identifier identifier identifier

  • Yayın Türü: Makale / Tam Makale
  • Cilt numarası: 38 Sayı: 2
  • Basım Tarihi: 2023
  • Doi Numarası: 10.17341/gazimmfd.951157
  • Dergi Adı: JOURNAL OF THE FACULTY OF ENGINEERING AND ARCHITECTURE OF GAZI UNIVERSITY
  • Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Art Source, Compendex, TR DİZİN (ULAKBİM)
  • Sayfa Sayıları: ss.875-888
  • Anahtar Kelimeler: Multiple exact string matching, pattern matching, sequence analysis, hash function, wu-manber algorithm, PATTERN, CLASSIFICATION, REGIONS
  • Atatürk Üniversitesi Adresli: Evet

Özet

The exact string matching algorithms are among the important study topics in computer science due to their various applications in many fields such as medicine, bioinformatics, and biology. New algorithms have been developed recently, and the string matching on the text has been accelerated. The string matching algorithms are divided into two parts, single and multiple. . The string matching algorithms are divided into two parts, single and multiple. The multiple exact string matching algorithms involve finding d number patterns (P) in a given text T. In this study, the Wu-Manber algorithm, one of the hash-based multiple exact string matching algorithms, is discussed. Although the Wu-Manber algorithm is effective, it has some limitations, such as hash collisions. In our study, a new approach has is proposed for these limitations. In the proposed approach, unlike the traditional Wu-Manber algorithm, the searching in the sequences is performed by q-gram hash comparison, using the hash function that removes hash collisions in DNA sequences. The proposed approach has been compared with the multiple exact string matching algorithms with the well-known algorithms in the literature on E. Coli and Human Chromosome1 datasets. As a result of the experimental studies, better results have been achieved in terms of performance metrics such as the average runtime, the average number of character and hash comparisons in the proposed approach compared to the Wu-Manber algorithm. Also, the proposed approach is shown to be more efficient than well-known algorithms, such as Aho Corasick (AC) and Commentz Walter (CW).