Entity Clustering of User Reviews Using Topic Modelling and Similarity Scores

Authors

  • Vee Kin Wong, Faculty of Computer Science and Information Technology, University of Malaya, Malaysia
  • Sze Ting Bong, Faculty of Computer Science and Information Technology, University of Malaya, Malaysia
  • Chun Mun Lee, Faculty of Computer Science and Information Technology, University of Malaya, Malaysia
  • Vimala Balakrishnan, Faculty of Computer Science and Information Technology, University of Malaya, Malaysia
  • Pik Yin Lok, The Nielsen Company, Malaysia
  • Jiunn Jye Ng, Persistent System, Malaysia

DOI:

https://doi.org/10.53840/myjict4-2-90

Keywords:

Clustering, Latent Dirichlet Allocation, Latent Semantic Analysis, similarity, topic modeling, agglomerative clustering, business intelligence

Abstract

The popularity of online review sites such as Amazon and TripAdvisor has created a window into what matters to reviewers and, potentially, to customers. For businesses, this provides opportunities to improve their offerings and attract more customers. The unstructured nature of user reviews, however, makes them difficult to analyze, and natural language processing techniques are instrumental in extracting useful information from them. To digest free-form review texts, it is desirable to abstract the reviews into high-level themes. However, research in this area is still scarce. This paper details the process taken to perform this high-level abstraction, with an example drawn from the hotel industry (identity concealed for confidentiality purposes). More than 3000 sets of customer reviews were processed using the Extract, Transform and Load process, supported by topic modelling and clustering algorithms including Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA) and K-means. The study aims to identify significant business aspects or features based on the customer reviews. Preliminary results from these algorithms lacked usability, which can be attributed mainly to the unknown number of hidden topics and the brief nature of the review texts. To address this issue, a cosine similarity-based algorithm was introduced to make the thematic clustering more intuitive and actionable. Ten themes emerged, and using cosine similarity, the review texts in the corpus were clustered intuitively for business planning insights. With this method, the review texts can be classified in a matter of minutes, demonstrating the strength of natural language processing for insight mining.
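The abstract does not spell out the cosine similarity-based step, so the following is only a minimal sketch of one way such an assignment could work, not the authors' implementation. It assumes each theme is described by a short list of seed words and assigns a review to the theme whose word-count vector is most similar to the review's. The theme names, seed words, and the functions `tokenize`, `cosine_similarity`, and `assign_theme` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical themes and seed words (assumption): the abstract reports ten
# themes but does not list them, so three illustrative ones are used here.
THEMES = {
    "room": "room bed clean bathroom spacious",
    "staff": "staff friendly helpful service reception",
    "food": "breakfast food restaurant buffet coffee",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_theme(review, themes=THEMES):
    """Return (theme, score) for the most similar theme.

    Returns (None, 0.0) when the review shares no words with any theme.
    """
    review_vec = Counter(tokenize(review))
    scores = {
        name: cosine_similarity(review_vec, Counter(tokenize(seed)))
        for name, seed in themes.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None, 0.0
    return best, scores[best]


if __name__ == "__main__":
    sample_reviews = [
        "The staff were friendly and helpful",
        "Great breakfast buffet",
        "Parking was expensive",
    ]
    for review in sample_reviews:
        theme, score = assign_theme(review)
        print(f"{review!r} -> {theme} ({score:.2f})")
```

Because each review is compared against a fixed set of themes, this kind of assignment needs only one pass over the corpus and handles very short texts, which is consistent with the short-text difficulty the abstract attributes to LDA, LSA and K-means. A fuller pipeline would likely use TF-IDF weighting and stop-word removal rather than raw counts.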

Published

31-12-2019

Section

Articles

How to Cite

Wong, V. K., Bong, S. T., Lee, C. M., Balakrishnan, V., Lok, P. Y., & Ng, J. J. (2019). Entity Clustering of User Reviews Using Topic Modelling and Similarity Scores. Malaysian Journal of Information and Communication Technology (MyJICT), 4(2), 137-147. https://doi.org/10.53840/myjict4-2-90