The impact of N-gram on the Malay text document clustering

Rosmayati Mohemad; Nazratul Naziah Mohd Muhait; Noor Maizura Mohamad Noor; Zulaiha Ali Othman

doi:10.53840/myjict6-2-83

Authors

Rosmayati Mohemad Faculty of Ocean Engineering Technology & Informatics, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia.
Nazratul Naziah Mohd Muhait Faculty of Ocean Engineering Technology & Informatics, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia.
Noor Maizura Mohamad Noor Faculty of Ocean Engineering Technology & Informatics, Universiti Malaysia Terengganu, 21030 Kuala Nerus, Terengganu, Malaysia.
Zulaiha Ali Othman Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, MALAYSIA.

DOI:

https://doi.org/10.53840/myjict6-2-83

Keywords:

N-Gram, document clustering, Malay documents, k-means.

Abstract

Document preprocessing is a crucial element of the text mining framework because it determines the quality of the model available to machine learning applications. The process includes tokenizing, case transformation, stop word removal, and stemming. However, these sub-processes alone are not enough to optimize clustering performance, so text preprocessing can be improved by adding N-gram features. An N-gram is a sequence of N consecutive words generated from a text document. This study therefore evaluates the impact of using different N-gram models in text preprocessing. A total of 1000 Malay documents were tested using N-grams with the K-means clustering algorithm, and documents processed without N-grams were compared with documents processed with 2-grams, 3-grams, and 4-grams. Text document clustering using 4-grams shows the highest accuracy, 92.48%, compared with 87.32% for clustering without N-grams. These results indicate that applying N-grams to Malay document clustering with the K-means algorithm can increase clustering performance.
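To make the idea of word N-grams concrete, the following is a minimal sketch of how preprocessing and N-gram extraction might look in Python. It is not the authors' implementation: the function names, the sample sentence, and the tiny stop-word list are assumptions made for illustration, and stemming is left out.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# it is not the list used in the study.
STOP_WORDS = {"yang", "dan", "di", "ke", "itu"}


def preprocess(text):
    """Tokenize, lowercase, and drop stop words (stemming is omitted here)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, n):
    """Count the n-grams of one document; such counts could feed a clustering step."""
    return Counter(word_ngrams(preprocess(text), n))


# Assumed sample sentence, not taken from the study's data set.
doc = "Polis menahan suspek itu di Kuala Terengganu dan polis menahan saksi"
tokens = preprocess(doc)
bigrams = word_ngrams(tokens, 2)
features = ngram_features(doc, 2)
```

Because stop words are removed before the N-grams are formed in this sketch, words that were not adjacent in the original sentence can end up in the same N-gram. Whether the study orders the steps this way is not stated in the abstract.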
