arXiv:2108.07153 Abstract | arXiv Analytics

arXiv:2108.07153 [cs.CV]

Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism

Published 2021-08-16, Version 1

Softmax is widely used in neural networks for multiclass classification, gate structures, and attention mechanisms. The gradient stability of Softmax rests on the statistical assumption that its input is normally distributed. However, in attention mechanisms such as those in transformers, the correlation scores between embeddings are often not normally distributed, so the vanishing gradient problem appears; we confirm this experimentally. In this work, we suggest replacing the exponential function with periodic functions, and we examine several potential periodic alternatives to Softmax in terms of their values and gradients. In experiments on a simple demo model based on LeViT, our method is shown to alleviate the gradient problem and to yield substantial improvements over Softmax and its variants. Further, we analyze the impact of pre-normalization on Softmax and on our methods, both mathematically and experimentally. Lastly, we increase the depth of the demo model and show that our method remains applicable in deep structures.
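As a rough illustration of the idea in the abstract, the sketch below compares the gradient magnitude of standard Softmax with that of a periodic normalization when attention scores are widely spread. The replacement `1 + sin(x)` is an assumed stand-in chosen only because it is nonnegative and bounded; it is not necessarily one of the functions studied in the paper. The names `periodic_max` and `grad_norm` and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Standard Softmax, with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def periodic_max(scores, eps=1e-9):
    """Hypothetical periodic normalization: exp(x) is swapped for 1 + sin(x).

    1 + sin(x) is nonnegative and bounded, so the outputs are still valid
    weights. This is an illustrative assumption, not the paper's definition.
    """
    vals = [1.0 + math.sin(s) for s in scores]
    total = sum(vals) + eps  # eps guards against an all-zero numerator
    return [v / total for v in vals]

def grad_norm(fn, scores, h=1e-5):
    """Frobenius norm of the Jacobian of fn at scores (central differences)."""
    n = len(scores)
    sq = 0.0
    for j in range(n):
        up = list(scores)
        dn = list(scores)
        up[j] += h
        dn[j] -= h
        f_up, f_dn = fn(up), fn(dn)
        for i in range(n):
            d = (f_up[i] - f_dn[i]) / (2 * h)
            sq += d * d
    return math.sqrt(sq)

# Widely spread scores, standing in for non-normal correlation scores.
large = [20.0, -12.0, 4.0, 8.0]

softmax_grad = grad_norm(softmax, large)
periodic_grad = grad_norm(periodic_max, large)
print("softmax gradient norm: ", softmax_grad)
print("periodic gradient norm:", periodic_grad)
```

With scores this far apart, Softmax saturates toward a one-hot output and its Jacobian shrinks toward zero, whereas the bounded periodic numerator keeps the Jacobian at a usable scale. The trade-off is that a periodic function is not monotonic in the score, which is presumably part of why the abstract examines the alternatives from the view of both value and gradient.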

Comments: 18 pages, 16 figures

Categories: cs.CV, cs.LG

Keywords: attention mechanism, gradient vanishing problem appears, yield substantial improvements, potential periodic alternatives, multiclass classification

Related articles:

arXiv:2106.15067 [cs.CV] (Published 2021-06-29)

Towards Understanding the Effectiveness of Attention Mechanism

Xiang Ye, Zihang He, Heng Wang, Yong Li

arXiv:1812.10025 [cs.CV] (Published 2018-12-25)

Attention Branch Network: Learning of Attention Mechanism for Visual Explanation

Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi

arXiv:2006.05918 [cs.CV] (Published 2020-06-10)

Deep Learning with Attention Mechanism for Predicting Driver Intention at Intersection