arXiv:1608.03995 [cs.CL]

Analysis of Morphology in Topic Modeling

Chandler May, Ryan Cotterell, Benjamin Van Durme

Published 2016-08-13 (Version 1)

Topic models make strong assumptions about their data. In particular, different words are implicitly assumed to have different meanings: topic models are often used as human-interpretable dimensionality reductions and a proliferation of words with identical meanings would undermine the utility of the top-$m$ word list representation of a topic. Though a number of authors have added preprocessing steps such as lemmatization to better accommodate these assumptions, the effects of such data massaging have not been publicly studied. We make first steps toward elucidating the role of morphology in topic modeling by testing the effect of lemmatization on the interpretability of a latent Dirichlet allocation (LDA) model. Using a word intrusion evaluation, we quantitatively demonstrate that lemmatization provides a significant benefit to the interpretability of a model learned on Wikipedia articles in a morphologically rich language.
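The abstract describes the experimental pipeline only at a high level. The sketch below is not the authors' code; it illustrates the comparison under stated assumptions: train one LDA model on raw tokens and one on lemmatized tokens, then read off the top-$m$ word lists that a word intrusion evaluation would present to annotators. It uses gensim's LdaModel; the `lemmatize` function is a hypothetical placeholder for a language-specific morphological analyzer, and the toy documents stand in for tokenized Wikipedia articles.

```python
# Minimal sketch of the raw-vs.-lemmatized LDA comparison (assumptions noted above).
from gensim.corpora import Dictionary
from gensim.models import LdaModel


def lemmatize(token):
    # Hypothetical placeholder: in practice this would be a morphological
    # analyzer for the target (morphologically rich) language.
    return token.lower()


def top_m_words(docs, num_topics=2, top_m=5):
    """Train LDA on tokenized documents and return the top-m word list per topic."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5, random_state=0)
    return [[word for word, _ in lda.show_topic(k, topn=top_m)]
            for k in range(num_topics)]


# Placeholder tokenized documents; inflected forms collapse under lemmatization.
raw_docs = [["Spiel", "Spiele", "Spieler", "Spielern"],
            ["Stadt", "Städte", "Städten", "Stadtteil"]]
lemmatized_docs = [[lemmatize(tok) for tok in doc] for doc in raw_docs]

# The two sets of top-m word lists would then be compared via word intrusion.
topics_raw = top_m_words(raw_docs)
topics_lemmatized = top_m_words(lemmatized_docs)
```

In the lemmatized setting, inflectional variants of the same lemma no longer compete for slots in a topic's top-$m$ list, which is the mechanism by which lemmatization could improve interpretability.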

Related articles:
arXiv:2206.04221 [cs.CL] (Published 2022-06-09)
Analyzing Folktales of Different Regions Using Topic Modeling and Clustering
arXiv:2410.11627 [cs.CL] (Published 2024-10-15)
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5
arXiv:1706.06177 [cs.CL] (Published 2017-06-19)
Topic Modeling for Classification of Clinical Reports