arXiv:2110.10090 [cs.LG]

Inductive Biases and Variable Creation in Self-Attention Mechanisms

Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang

Published 2021-10-19; updated 2022-06-24 (version 2)

Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
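To make the synthetic setup concrete, below is a minimal sketch (not the authors' code) of training a single-head attention model to learn a sparse Boolean function of a length-T input. The specific target (majority over k = 3 fixed coordinates), the learned-query readout, and all hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of learning a k-sparse Boolean function with one
# self-attention head. Architecture and hyperparameters are assumed
# for illustration; they are not the paper's experimental settings.
import torch
import torch.nn as nn

torch.manual_seed(0)
T, k, d = 64, 3, 32            # context length, sparsity, embed dim (assumed)
S = torch.tensor([5, 17, 42])  # the k relevant positions (arbitrary choice)

def sample_batch(n):
    """Uniform +/-1 sequences; label = majority vote of the k relevant bits."""
    x = torch.randint(0, 2, (n, T)).float() * 2 - 1
    y = (x[:, S].sum(dim=1) > 0).float()
    return x, y

class OneHeadAttn(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok = nn.Linear(1, d)                            # embed each +/-1 token
        self.pos = nn.Parameter(torch.randn(T, d) * 0.02)     # positional embeddings
        self.query = nn.Parameter(torch.randn(1, 1, d) * 0.02)  # learned readout query
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.readout = nn.Linear(d, 1)

    def forward(self, x):
        h = self.tok(x.unsqueeze(-1)) + self.pos         # (n, T, d)
        q = self.query.expand(x.size(0), -1, -1)         # one query per sequence
        out, _ = self.attn(q, h, h)                      # single head over T positions
        return self.readout(out.squeeze(1)).squeeze(-1)  # scalar logit

model = OneHeadAttn()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(2000):
    x, y = sample_batch(128)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Held-out accuracy; a well-trained head should attend to the positions in S.
with torch.no_grad():
    x, y = sample_batch(2000)
    acc = ((model(x) > 0).float() == y).float().mean()
print(f"test accuracy: {acc:.3f}")
```

One could probe the paper's logarithmic sample-complexity claim with this kind of sketch by varying T while holding k fixed and measuring the training-set size needed to reach a target test accuracy.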

Related articles:
arXiv:1802.04350 [cs.LG] (Published 2018-02-12)
On the Sample Complexity of Learning from a Sequence of Experiments
arXiv:1905.12624 [cs.LG] (Published 2019-05-28)
Combinatorial Bandits with Full-Bandit Feedback: Sample Complexity and Regret Minimization
arXiv:2211.14699 [cs.LG] (Published 2022-11-27)
A Theoretical Study of Inductive Biases in Contrastive Learning