arXiv:2311.12678 [cs.LG]

Interpretation of the Transformer and Improvement of the Extractor

Zhe Chen

Published 2023-11-21 (Version 1)

It has been over six years since the Transformer architecture was introduced. Surprisingly, the vanilla Transformer architecture is still widely used today. One reason is that the lack of a deep understanding and comprehensive interpretation of the architecture makes it challenging to improve upon. In this paper, we first interpret the Transformer architecture comprehensively in plain words, based on our understanding and experience. The interpretations are further proved and verified. They also cover the Extractor, a family of drop-in replacements for multi-head self-attention in the Transformer architecture. We then propose an improvement to a type of Extractor that outperforms self-attention, without introducing additional trainable parameters. Experimental results demonstrate that the improved Extractor performs even better, showing a way to improve the Transformer architecture.
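The abstract does not spell out the Extractor's internals, but "drop-in replacement for multi-head self-attention" implies a precise interface contract. Below is a minimal, hypothetical sketch of that contract; the module name `DropInMixer` and its parameter-free cumulative-mean mixing are illustrative assumptions, not the paper's Extractor. Any such replacement must map a `(batch, seq_len, d_model)` tensor back to the same shape, so the surrounding Transformer block is untouched.

```python
import torch
import torch.nn as nn


class DropInMixer(nn.Module):
    """Hypothetical stand-in (not the paper's Extractor) showing the
    interface a drop-in replacement for multi-head self-attention must
    satisfy: (batch, seq_len, d_model) -> (batch, seq_len, d_model)."""

    def __init__(self, d_model: int):
        super().__init__()
        # A single output projection, mirroring the projection that
        # follows multi-head self-attention in the standard block.
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Causal cumulative mean over past tokens: a parameter-free
        # token-mixing step standing in for the attention weights.
        denom = torch.arange(
            1, seq_len + 1, device=x.device, dtype=x.dtype
        ).view(1, -1, 1)
        mixed = x.cumsum(dim=1) / denom
        return self.proj(mixed)


# Usage: swap the mixer into a Transformer block in place of
# self-attention; shapes and the residual connection are unchanged.
mixer = DropInMixer(d_model=64)
x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)
out = x + mixer(x)          # residual connection, as in a Transformer block
assert out.shape == x.shape
```

Because the shape contract is all the block requires, such a module can be compared against self-attention head-to-head, which is how the abstract's "without introducing additional trainable parameters" claim becomes testable.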
