arXiv:2109.10252 [cs.LG]

Audiomer: A Convolutional Transformer for Keyword Spotting

Surya Kant Sahu, Sai Mitheran, Juhi Kamdar, Meet Gandhi

Published 2021-09-21, Version 1

Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. In audio tasks, however, they are either infeasible to train because of the extremely long sequence lengths of audio waveforms, or they reach competitive performance only after feature extraction through Fourier-based methods, which incurs a loss floor. In this work, we introduce an architecture, Audiomer, which combines 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in Keyword Spotting on raw audio waveforms, outperforming all previous methods while also being computationally cheaper and much more parameter- and data-efficient. Audiomer allows for deployment on compute-constrained devices and for training on smaller datasets.
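To make the sequence-length problem concrete, the following is a rough back-of-the-envelope sketch, not a description of the Audiomer implementation. It compares the number of entries in a full softmax attention matrix with the number of entries in the two feature maps that a Performer-style linear attention would build instead. The sample rate, clip length, and number of random features are assumptions chosen for illustration and are not taken from the paper; the function names are hypothetical.

```python
# Rough cost comparison for attention over a raw waveform.
# All numbers below are illustrative assumptions, not values from the paper.


def softmax_attention_entries(seq_len: int) -> int:
    """Entries in the full seq_len x seq_len attention matrix (per head)."""
    return seq_len * seq_len


def linear_attention_entries(seq_len: int, num_features: int) -> int:
    """Entries in the two seq_len x num_features feature maps (for queries
    and keys) used by a Performer-style approximation (per head)."""
    return 2 * seq_len * num_features


SAMPLE_RATE_HZ = 16_000  # assumed sample rate
CLIP_SECONDS = 1         # assumed clip length
NUM_FEATURES = 256       # assumed number of random features

seq_len = SAMPLE_RATE_HZ * CLIP_SECONDS
full = softmax_attention_entries(seq_len)
linear = linear_attention_entries(seq_len, NUM_FEATURES)

print(f"sequence length:          {seq_len}")
print(f"full attention entries:   {full}")
print(f"linear attention entries: {linear}")
print(f"ratio:                    {full / linear:.2f}x")
```

Under these assumptions the full attention matrix grows quadratically with the number of samples, while the feature maps grow linearly, which is why linear-attention variants are attractive when the input is a raw waveform rather than a shorter sequence of spectral frames.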

Comments: Submitted to NeurIPS 2021 ENLSP Workshop
Categories: cs.LG, cs.CL, cs.SD, eess.AS
Related articles:
arXiv:1711.08058 [cs.LG] (Published 2017-11-21)
Multiple-Instance, Cascaded Classification for Keyword Spotting in Narrow-Band Audio
arXiv:2210.04959 [cs.LG] (Published 2022-10-10)
Characterization of anomalous diffusion through convolutional transformers
arXiv:2305.05110 [cs.LG] (Published 2023-05-09)
Semi-Supervised Federated Learning for Keyword Spotting