arXiv:2109.10252 [cs.LG]

Audiomer: A Convolutional Transformer for Keyword Spotting

Surya Kant Sahu, Sai Mitheran, Juhi Kamdar, Meet Gandhi

Published 2021-09-21, Version 1

Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. In audio tasks, however, they are either infeasible to train because of the extremely long sequences of raw audio waveforms, or they reach competitive performance only after feature extraction through Fourier-based methods, which incurs a loss floor. In this work, we introduce Audiomer, an architecture that combines 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in Keyword Spotting on raw audio waveforms. It outperforms all previous methods while being computationally cheaper and much more parameter- and data-efficient. Audiomer allows for deployment on compute-constrained devices and training on smaller datasets.
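As a rough illustration of the sequence-length problem the abstract refers to, the sketch below compares how many entries a full self-attention matrix needs on a raw waveform with how many the query and key feature maps need in a kernelized, Performer-style linear attention. The 16 kHz sample rate, one-second clip length, and 256 random features are illustrative assumptions, not values taken from the abstract, and the helper names are hypothetical.

```python
def full_attention_entries(seq_len: int) -> int:
    """Entries in the seq_len x seq_len attention matrix of standard self-attention."""
    return seq_len * seq_len


def feature_map_entries(seq_len: int, num_features: int) -> int:
    """Entries in the query and key feature maps (each seq_len x num_features)
    that a kernelized linear attention stores instead of the full matrix."""
    return 2 * seq_len * num_features


# Illustrative assumptions (not stated in the abstract).
SAMPLE_RATE_HZ = 16_000   # assumed sample rate of the raw waveform
CLIP_SECONDS = 1          # assumed clip length
NUM_FEATURES = 256        # assumed number of random features

seq_len = SAMPLE_RATE_HZ * CLIP_SECONDS
full = full_attention_entries(seq_len)
linear = feature_map_entries(seq_len, NUM_FEATURES)

print(f"sequence length:        {seq_len}")
print(f"full attention entries: {full}")
print(f"feature map entries:    {linear}")
print(f"ratio (full / linear):  {full / linear}")
```

Under these assumptions the full attention matrix grows quadratically with the number of samples, while the feature maps grow linearly, which is the motivation for pairing a linear-attention variant with raw-waveform input.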

Comments: Submitted to NeurIPS 2021 ENLSP Workshop

Categories: cs.LG, cs.CL, cs.SD, eess.AS

Keywords: keyword spotting, convolutional transformer, computer vision tasks, extremely large sequence length, 1d residual networks

Related articles:

arXiv:1711.08058 [cs.LG] (Published 2017-11-21)

Multiple-Instance, Cascaded Classification for Keyword Spotting in Narrow-Band Audio

Ahmad AbdulKader, Kareem Nassar, Mohamed Mahmoud, Daniel Galvez, Chetan Patil

arXiv:2210.04959 [cs.LG] (Published 2022-10-10)

Characterization of anomalous diffusion through convolutional transformers

Nicolás Firbas, Òscar Garibo-i-Orts, Miguel Ángel Garcia-March, J. Alberto Conejero

arXiv:2305.05110 [cs.LG] (Published 2023-05-09)

Semi-Supervised Federated Learning for Keyword Spotting