arXiv Analytics

arXiv:2502.02732 [cs.LG]

Peri-LN: Revisiting Layer Normalization in the Transformer Architecture

Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, Kang Min Yoo

Published 2025-02-04 (Version 1)

Designing Transformer architectures with the optimal layer normalization (LN) strategy, one that ensures large-scale training stability and expedites convergence, has remained elusive, even in this era of large language models (LLMs). To this end, we present a comprehensive analytical foundation for understanding how different LN strategies influence training dynamics in large-scale Transformer training. Pre-LN and Post-LN have long dominated standard practice despite their limitations in large-scale training. However, several open-source large-scale models have recently begun quietly adopting a third strategy without much explanation. This strategy places LN peripherally around sublayers, a design we term Peri-LN. While Peri-LN has demonstrated promising empirical performance, its precise mechanisms and benefits remain largely unexplored. Our in-depth analysis shows that Peri-LN strikes an ideal balance in variance growth, unlike Pre-LN and Post-LN, which are prone to vanishing gradients and "massive activations." To validate our theoretical insight, we conduct large-scale experiments on Transformers with up to 3.2B parameters, showing that Peri-LN consistently achieves more balanced variance growth, steadier gradient flow, and convergence stability. Our results suggest that Peri-LN warrants broader consideration for large-scale Transformer architectures, providing renewed insight into the optimal placement and application of LN.
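
To make the placement difference concrete, below is a minimal PyTorch sketch of a single Transformer block with the LN placement the abstract calls Peri-LN: normalization applied peripherally, on both the input and the output of each sublayer, before the residual addition. This is an illustrative reading of the abstract, not the paper's reference implementation; the module names, layer sizes, and the use of nn.MultiheadAttention are assumptions for the example.

```python
import torch
import torch.nn as nn


class PeriLNBlock(nn.Module):
    """Illustrative Transformer block with Peri-LN placement.

    Each sublayer's input AND output are normalized before the residual
    addition. Contrast with Pre-LN (normalize only the sublayer input)
    and Post-LN (normalize after the residual addition).
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # "Peri" = peripheral: LN on both sides of each sublayer.
        self.ln_attn_in = nn.LayerNorm(d_model)
        self.ln_attn_out = nn.LayerNorm(d_model)
        self.ln_mlp_in = nn.LayerNorm(d_model)
        self.ln_mlp_out = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sublayer: normalize input, run sublayer,
        # normalize output, then add the residual.
        h = self.ln_attn_in(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.ln_attn_out(h)

        # MLP sublayer with the same peripheral placement.
        h = self.ln_mlp_in(x)
        h = self.mlp(h)
        x = x + self.ln_mlp_out(h)
        return x


if __name__ == "__main__":
    block = PeriLNBlock(d_model=64, n_heads=4, d_ff=256)
    out = block(torch.randn(2, 16, 64))  # (batch, seq_len, d_model)
    print(out.shape)  # torch.Size([2, 16, 64])
```

The output-side LayerNorm is what distinguishes this from a standard Pre-LN block; per the abstract, it is this peripheral placement that keeps hidden-state variance growth balanced across depth.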

Related articles:

arXiv:2002.04745 [cs.LG] (Published 2020-02-12)
On Layer Normalization in the Transformer Architecture
Ruibin Xiong et al.

arXiv:2005.03454 [cs.LG] (Published 2020-05-04)
Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture

arXiv:2410.13732 [cs.LG] (Published 2024-10-17)
Reducing the Transformer Architecture to a Minimum