arXiv:1608.03995 [cs.CL]

Analysis of Morphology in Topic Modeling

Chandler May, Ryan Cotterell, Benjamin Van Durme

Published 2016-08-13 (Version 1)

Topic models make strong assumptions about their data. In particular, different words are implicitly assumed to have different meanings: topic models are often used as human-interpretable dimensionality reductions and a proliferation of words with identical meanings would undermine the utility of the top-$m$ word list representation of a topic. Though a number of authors have added preprocessing steps such as lemmatization to better accommodate these assumptions, the effects of such data massaging have not been publicly studied. We make first steps toward elucidating the role of morphology in topic modeling by testing the effect of lemmatization on the interpretability of a latent Dirichlet allocation (LDA) model. Using a word intrusion evaluation, we quantitatively demonstrate that lemmatization provides a significant benefit to the interpretability of a model learned on Wikipedia articles in a morphologically rich language.
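The abstract describes the experimental pipeline only at a high level. The sketch below is not the authors' code; it illustrates the comparison under stated assumptions: train one LDA model on raw tokens and one on lemmatized tokens, then read off the top-$m$ word lists that a word intrusion evaluation would present to annotators. It uses gensim's LdaModel; the `lemmatize` function is a hypothetical placeholder for a language-specific morphological analyzer, and the toy documents stand in for tokenized Wikipedia articles.

```python
# Minimal sketch of the raw-vs.-lemmatized LDA comparison (assumptions noted above).
from gensim.corpora import Dictionary
from gensim.models import LdaModel


def lemmatize(token):
    # Hypothetical placeholder: in practice this would be a morphological
    # analyzer for the target (morphologically rich) language.
    return token.lower()


def top_m_words(docs, num_topics=2, top_m=5):
    """Train LDA on tokenized documents and return the top-m word list per topic."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5, random_state=0)
    return [[word for word, _ in lda.show_topic(k, topn=top_m)]
            for k in range(num_topics)]


# Placeholder tokenized documents; inflected forms collapse under lemmatization.
raw_docs = [["Spiel", "Spiele", "Spieler", "Spielern"],
            ["Stadt", "Städte", "Städten", "Stadtteil"]]
lemmatized_docs = [[lemmatize(tok) for tok in doc] for doc in raw_docs]

# The two sets of top-m word lists would then be compared via word intrusion.
topics_raw = top_m_words(raw_docs)
topics_lemmatized = top_m_words(lemmatized_docs)
```

In the lemmatized setting, inflectional variants of the same lemma no longer compete for slots in a topic's top-$m$ list, which is the mechanism by which lemmatization could improve interpretability.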

Related articles:
arXiv:2206.04221 [cs.CL] (Published 2022-06-09)
Analyzing Folktales of Different Regions Using Topic Modeling and Clustering
arXiv:2410.11627 [cs.CL] (Published 2024-10-15)
Tokenization and Morphology in Multilingual Language Models: A Comparative Analysis of mT5 and ByT5
arXiv:1706.06177 [cs.CL] (Published 2017-06-19)
Topic Modeling for Classification of Clinical Reports