arXiv:1611.02879 [cs.CV]

Audio Visual Speech Recognition using Deep Recurrent Neural Networks

Abhinav Thanda, Shankar M Venkatesan

Published 2016-11-09 (Version 1)

In this work, we propose a training algorithm for an audio-visual automatic speech recognition (AV-ASR) system using deep recurrent neural networks (RNNs). First, we train a deep RNN acoustic model with a Connectionist Temporal Classification (CTC) objective function. The frame labels obtained from the acoustic model are then used to perform a non-linear dimensionality reduction of the visual features using a deep bottleneck network. The audio and visual features are fused and used to train a fusion RNN. The use of bottleneck features for the visual modality helps the model converge properly during training. Our system is evaluated on the GRID corpus. Our results show that the presence of the visual modality gives a significant improvement in character error rate (CER) at various levels of noise, even when the model is trained without noisy data. We also provide a comparison of two fusion methods: feature fusion and decision fusion.
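The abstract outlines three training stages: a CTC-trained deep RNN acoustic model, a deep bottleneck network that compresses visual features under supervision from the acoustic model's frame labels, and a fusion RNN over the combined features. The PyTorch sketch below illustrates that pipeline under assumptions of our own: all dimensions (audio_dim=40, visual_dim=100, bottleneck=40), the layer counts, and the 28-class character set are hypothetical, not taken from the paper.

```python
# Minimal sketch of the three-stage AV-ASR training pipeline described in
# the abstract. All sizes and the label set are assumed, not from the paper.
import torch
import torch.nn as nn

NUM_CHARS = 28  # assumed: 26 letters + space + CTC blank (index 0)

class AcousticRNN(nn.Module):
    """Stage 1: deep bidirectional RNN acoustic model trained with CTC."""
    def __init__(self, audio_dim=40, hidden=256, layers=3):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, layers,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, NUM_CHARS)

    def forward(self, x):                   # x: (batch, time, audio_dim)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(-1)  # CTC expects log-probs

class VisualBottleneck(nn.Module):
    """Stage 2: deep bottleneck network; the narrow layer yields
    low-dimensional visual features, trained to predict the frame
    labels produced by the acoustic model."""
    def __init__(self, visual_dim=100, bottleneck=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(visual_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck))     # bottleneck layer
        self.classifier = nn.Linear(bottleneck, NUM_CHARS)

    def forward(self, v):
        z = self.encoder(v)
        return z, self.classifier(z)

class FusionRNN(nn.Module):
    """Stage 3 (feature fusion): audio frames concatenated with
    bottleneck visual features, again trained with the CTC objective."""
    def __init__(self, audio_dim=40, bottleneck=40, hidden=256, layers=3):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim + bottleneck, hidden, layers,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, NUM_CHARS)

    def forward(self, a, z):
        h, _ = self.rnn(torch.cat([a, z], dim=-1))
        return self.out(h).log_softmax(-1)

# Stage 1: CTC training of the acoustic model (toy shapes throughout).
ctc = nn.CTCLoss(blank=0)
acoustic = AcousticRNN()
audio = torch.randn(4, 75, 40)                  # (batch, frames, features)
targets = torch.randint(1, NUM_CHARS, (4, 20))  # character label sequences
in_lens = torch.full((4,), 75, dtype=torch.long)
tgt_lens = torch.full((4,), 20, dtype=torch.long)
log_probs = acoustic(audio)                     # (batch, T, C)
ctc(log_probs.transpose(0, 1), targets, in_lens, tgt_lens).backward()

# Stage 2: frame labels from the trained acoustic model supervise the
# visual bottleneck network via per-frame cross-entropy.
frame_labels = log_probs.argmax(-1).detach()    # (batch, T)
visual = torch.randn(4, 75, 100)
bn = VisualBottleneck()
z, logits = bn(visual)
nn.CrossEntropyLoss()(logits.reshape(-1, NUM_CHARS),
                      frame_labels.reshape(-1)).backward()

# Stage 3: feature fusion -- train the fusion RNN on [audio ; bottleneck],
# with the same CTC loss as stage 1.
fusion = FusionRNN()
fused_log_probs = fusion(audio, z.detach())
ctc(fused_log_probs.transpose(0, 1), targets, in_lens, tgt_lens).backward()

# Decision fusion (the alternative the paper compares against) would
# instead combine the per-stream posteriors at decode time rather than
# concatenating features before the RNN.
```

This is a sketch of the training flow only; in practice each stage would run over a data loader with an optimizer, and the frame labels would come from the fully trained acoustic model rather than a single forward pass.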
