A Brief Talk on Linear Attention
Introduction
Since Vaswani et al. proposed the Transformer architecture in 2017, Softmax Attention has been the core component of sequence modeling, underpinning the rapid development of Large Language Models (LLMs). However, Softmax Attention suffers from inherently quadratic computational complexity: as the sequence length $L$ grows, its $O(L^2 d)$ compute and memory overhead scales quadratically with $L$, which has become a core bottleneck for long-sequence modeling and for the efficient training and inference of LLMs. ...
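To make the $O(L^2 d)$ claim concrete, here is a minimal NumPy sketch of standard softmax attention (shapes and values are illustrative, not from the source): the score matrix $QK^\top$ has $L \times L$ entries, each requiring a length-$d$ dot product, which is where the quadratic cost arises.

```python
import numpy as np

# Illustrative sizes (assumed for this sketch): sequence length L, head dimension d
L, d = 1024, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# (L, L) score matrix: O(L^2 d) time and O(L^2) memory
scores = Q @ K.T / np.sqrt(d)

# Row-wise softmax (numerically stabilized)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Weighted sum of values: another O(L^2 d) matmul, output is (L, d)
out = weights @ V
print(out.shape)  # (1024, 64)
```

Doubling $L$ in this sketch quadruples both the size of `scores` and the work of the two matrix multiplications, which is exactly the scaling behavior the introduction describes.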