Is Attention All We Need?

(This is something I'm interested in but still learning about so take this all with a grain of salt.)

The heart of a transformer is the attention module introduced in 2017 in the "Attention Is All You Need" paper. In simplified terms, a transformer block is an attention module followed by a regular feed forward neural network. And an LLM is made up of some number of these blocks stacked one after the other, along with a few other smaller layers, with the attention heads inside each block running in parallel.
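
To make that concrete, here is a minimal sketch of one transformer block in PyTorch. The layer sizes, the six-block stack, and the placement of the normalization layers are illustrative choices for this sketch, not the layout of any particular model:

```python
# A minimal sketch of one transformer block: attention, then a feed-forward network,
# each with a residual connection. Sizes here are placeholders, not a real LLM config.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Attention mixes information across positions in the sequence...
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a
        # ...then the feed-forward network transforms each position independently.
        return x + self.ff(self.norm2(x))

# An LLM is, roughly, an embedding layer, a stack of these blocks, and an output head.
blocks = nn.Sequential(*[TransformerBlock() for _ in range(6)])
x = torch.randn(1, 16, 512)   # (batch, sequence length, model width)
print(blocks(x).shape)        # torch.Size([1, 16, 512])
```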

The attention module uses learned weight matrices to project the input into queries (Q), keys (K), and values (V), which get multiplied and combined so the model can dynamically focus on different parts of the input at different times. Since matrix multiplication can be done relatively efficiently, the number of parameters can be scaled up, leading to our current large language models. Larger models are more powerful but have the drawback of requiring more resources.
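
Here is that computation written out in NumPy; the projection matrices below are random placeholders standing in for the learned weights:

```python
# A sketch of scaled dot-product attention using plain NumPy.
import numpy as np

def attention(Q, K, V):
    # Scores measure how much each position should "look at" every other position.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted mix of the values

seq_len, d = 4, 8
X = np.random.randn(seq_len, d)
# In a real model Wq, Wk, Wv are learned; here they are random placeholders.
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (4, 8)
```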

Before the transformer came out, Recurrent Neural Networks (RNNs) were the predominant architecture, with ELMo probably being the last state-of-the-art RNN model. RNNs worked well and required fewer resources at prediction time, but they were slower to train because, being sequential, they could not fully utilize a GPU and so could not be scaled up as much. Consequently they soon lost out to the larger transformer-based systems in terms of "accuracy".
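
The sequential bottleneck is easy to see in a toy RNN: each step needs the hidden state from the previous step, so the loop over the sequence has to run in order rather than as one big parallel matrix multiply. (The update rule below is a generic tanh recurrence, just for illustration.)

```python
# A toy RNN step: h_t depends on h_{t-1}, so the loop must run strictly in sequence.
import numpy as np

d = 8
Wh, Wx = np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1
xs = np.random.randn(16, d)     # a sequence of 16 input vectors

h = np.zeros(d)
for x in xs:                    # sequential: step t needs the result of step t-1
    h = np.tanh(Wh @ h + Wx @ x)
print(h.shape)                  # (8,)
```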

Lately there's been a lot of work to address the drawbacks of large attention modules. Some of it focuses on optimization, such as FlashAttention, and Apple recently released a paper, "LLM in a flash: Efficient Large Language Model Inference with Limited Memory", on efficient model parameter loading and management.

Other work, such as Hyena, RWKV, RetNet, and Mamba, explores new architectures that either replace the attention module or combine it with different modules, some of them new takes on RNNs. Many of these improved RNNs allow more parallel processing during training while keeping the efficiency and predictability of RNN inference (a toy version of the idea is sketched below). However, some require, or would benefit from, accelerator features (such as Fast Fourier Transforms and their inverses, or custom convolution kernels) that may not be commonly available.
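
As a rough illustration of how a recurrence can be both sequential at inference time and parallel-friendly at training time, here is a toy *linear* recurrence evaluated both ways. This is only an illustrative example, not the actual update rule of Mamba, RWKV, or RetNet:

```python
# Toy linear recurrence h_t = a_t * h_{t-1} + b_t. It can be evaluated step by step
# (cheap at inference) or in closed form via cumulative products/sums, which is the
# kind of structure that parallel prefix scans exploit on GPUs during training.
import numpy as np

T = 8
a = 0.5 + 0.5 * np.random.rand(T)   # per-step decay factors in [0.5, 1)
b = np.random.randn(T)              # per-step inputs

# Sequential evaluation, as you would run it at inference time.
h_seq = np.zeros(T)
h = 0.0
for t in range(T):
    h = a[t] * h + b[t]
    h_seq[t] = h

# Parallelizable evaluation: h_t = sum_{k<=t} b_k * prod_{j=k+1..t} a_j.
cum = np.cumprod(a)                 # running products of the decay factors
h_par = cum * np.cumsum(b / cum)    # same values, computed without the loop
print(np.allclose(h_seq, h_par))    # True
```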

There's a lot of research (and learning) to be done, and the transformer's lead is going to be hard to beat, but if any of these approaches proves feasible we could see much smaller and faster LLMs in the future.
