arXiv:1605.03481 [cs.LG]

Tweet2Vec: Character-Based Distributed Representations for Social Media

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, William W. Cohen

Published 2016-05-11, Version 1

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.
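
The vocabulary-size argument above can be illustrated with a toy sketch. This is not the tweet2vec model or its encoder; it only counts distinct word types and character types as informal spelling variants accumulate. The example posts and the helper name `vocab_growth` are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
def vocab_growth(posts):
    """Return (word_types, char_types) seen so far after each post.

    Words are taken by naive whitespace splitting (an assumption made
    for this sketch, not the tokenization used in the paper).
    """
    words, chars = set(), set()
    growth = []
    for post in posts:
        text = post.lower()
        words.update(text.split())  # every new spelling is a new word type
        chars.update(text)          # characters are mostly already known
        growth.append((len(words), len(chars)))
    return growth


# Hypothetical posts: the same message with increasingly informal spelling.
posts = [
    "so happy today",
    "sooo happy 2day",
    "soooo hapy todayyy",
    "sooooo happpy 2dayyy",
]

growth = vocab_growth(posts)
for post, (n_words, n_chars) in zip(posts, growth):
    print(f"{post!r}: {n_words} word types, {n_chars} character types")
```

In this toy input every spelling variant adds new word types, while the character inventory stops growing after the second post. That is the property a character-level model can exploit on noisy text.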

Comments: 6 pages, 2 figures, 4 tables, accepted as conference paper at ACL 2016
Categories: cs.LG, cs.CL
Related articles:
arXiv:2410.16204 [cs.LG] (Published 2024-10-21, updated 2024-12-09)
Systematic Review: Text Processing Algorithms in Machine Learning and Deep Learning for Mental Health Detection on Social Media
arXiv:2304.10512 [cs.LG] (Published 2023-04-20)
"Can We Detect Substance Use Disorder?": Knowledge and Time Aware Classification on Social Media from Darkweb
arXiv:1403.5603 [cs.LG] (Published 2014-03-22)
Forecasting Popularity of Videos using Social Media