A New Original Approach of Authorship Attribution based on Normalized Word Bigram Transition Probability: Application on the Quran and Hadith
DOI:
https://doi.org/10.53840/myjict10-2-222
Keywords:
NLP, Stylometry, Quran, Transition Probability, Authorship Attribution
Abstract
After several studies on the intrinsic characteristics of authors' styles, we observed a strong link between an author's style and the use of specific word bigrams: some bigrams have transition probabilities that are specific to a single author. We therefore propose a new set of features based on the normalized forward and backward probabilities, which we call Word Bigram Transition Probability (WBTP). To the author's knowledge, this approach is proposed here for the first time. It can be used in any authorship attribution task, whether the texts share the same topic or cover different topics. Three authorship attribution experiments are conducted: on the SIMSTYL corpus, on the HAT corpus, and on a subset of the Guardian corpus. In addition, the approach is applied to author discrimination between two ancient religious books, the Quran and the Hadith. The new feature set is also compared with several state-of-the-art features. The results show a high authorship attribution accuracy. They also show that the new feature is less sensitive to topic and can therefore be used with documents belonging to different topics. For the author discrimination between the Quran and the Hadith, the discrimination score is 100%, confirming once again that the two authors are different and that the holy Quran (believed to be a Divine revelation) could not have been invented by the Prophet.
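As an illustration of the forward and backward word bigram transition probabilities mentioned in the abstract, the following is a minimal sketch. The abstract does not give the exact formulas or the normalization used for WBTP, so the sketch assumes plain relative-frequency estimates: the forward probability of a bigram (w1, w2) is taken as its count divided by the number of bigrams starting with w1, and the backward probability as its count divided by the number of bigrams ending with w2. The function name `bigram_transition_probs` and the toy sentence are hypothetical.

```python
from collections import Counter


def bigram_transition_probs(tokens):
    """Return {(w1, w2): (forward, backward)} for a list of word tokens.

    Assumed estimates (not taken from the paper):
      forward  = count(w1, w2) / number of bigrams starting with w1
      backward = count(w1, w2) / number of bigrams ending with w2
    """
    bigrams = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)
    return {
        (w1, w2): (count / first_counts[w1], count / second_counts[w2])
        for (w1, w2), count in pair_counts.items()
    }


# Toy example: whitespace tokenization of a short English sentence.
tokens = "the cat sat on the mat and the cat slept".split()
probs = bigram_transition_probs(tokens)
forward, backward = probs[("the", "cat")]
print(f"forward={forward:.3f} backward={backward:.3f}")
```

In an attribution setting, a vector of such values over a fixed list of bigrams could serve as the feature vector for each document; how the paper selects bigrams and normalizes the values is not stated in the abstract.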
License
Copyright (c) 2025 Malaysian Journal of Information and Communication Technology (MyJICT)

This work is licensed under a Creative Commons Attribution 4.0 International License.
