Mixture of experts

In machine learning there is a concept of ensembling (ensemble learning) where you create a set of diverse models, have each of them make a prediction on the same data, and then combine their individual predictions into a final prediction. This is the basic idea behind random forests and similar architectures. In the combination step, the predictions can be averaged, or another model can be trained to pick the 'best' prediction.
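
For illustration, here is a minimal scikit-learn sketch of that idea: three different models are fit on the same data and their predicted probabilities are averaged ('soft' voting). The dataset and the particular models are just placeholders.

```python
# A minimal ensembling sketch: three diverse models vote on the same data,
# and their predicted probabilities are averaged into a final prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy dataset

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("rf", RandomForestClassifier(n_estimators=50)),
    ],
    voting="soft",  # average the predicted probabilities across models
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```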

Mixture of Experts (MoE), as implemented in LLMs, is vaguely related: instead of combining a variety of predictions, an input is routed to one (or a small number) of a set of 'expert' models ... though that is not exactly right either. Rather than a set of different LLMs, in current implementations we have one larger (sparse) model with multiple paths through it. The paths share some parameters but also have their own weights. Additional (internal) gate/router layers then choose which path/expert to activate on a token-by-token basis. (Side note: each path is a set of weights for the feed-forward blocks, while the self-attention and other blocks are shared.)
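
To make the routing idea concrete, here is a minimal, illustrative PyTorch sketch of a sparse MoE feed-forward layer (not taken from any particular model): a small router layer scores the experts for each token, the top-k experts are run, and their outputs are combined with the normalized router weights. The sizes and the choice of top-k = 2 are arbitrary for the example.

```python
# Illustrative sketch of a sparse MoE layer: a router picks the top-k expert
# feed-forward blocks per token; attention and other blocks would stay shared.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # the gate/router layer
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (n_tokens, d_model)
        scores = self.router(x)                # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)                   # 4 token embeddings, routed independently
print(MoEFeedForward()(tokens).shape)          # torch.Size([4, 512])
```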

The benefits of this approach include faster pre-training and the response quality of a larger (dense) model while only requiring the number of math operations of a smaller model. However, it still requires nearly the same amount of memory as the larger model, since all paths/experts need to be kept available in case they are selected (even though some parts are shared).
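
As a rough back-of-the-envelope illustration of that trade-off (with made-up sizes, not any real model's configuration): the memory footprint scales with the total number of experts, while the per-token compute scales only with how many are activated.

```python
# Back-of-the-envelope illustration with hypothetical sizes (not a real model):
# memory scales with all experts, per-token compute only with the active ones.
d_model, d_hidden = 4096, 14336              # hypothetical hidden sizes
n_layers, n_experts, top_k = 32, 8, 2

expert_ffn = 2 * d_model * d_hidden          # up- and down-projection weights per expert
shared = 2_000_000_000                       # stand-in for attention, embeddings, etc.

total_params = shared + n_layers * n_experts * expert_ffn   # all must sit in memory
active_params = shared + n_layers * top_k * expert_ffn      # multiplied per token

print(f"total parameters:            {total_params / 1e9:.1f}B")
print(f"active parameters per token: {active_params / 1e9:.1f}B")
```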

Also, actual performance/latency can be tricky to predict because it depends on issues like batching, how many GPUs are available, load balancing between the experts, etc. There is certainly a lot of research, with more to come, into making this architecture more efficient, higher quality, and easier to train and fine-tune.

I wonder if a near-term use of this approach is to use an MoE model to generate training data for a smaller dense model. I also wonder whether there will be developments with the simpler approach of having a router choose the best fine-tuned model and running that model in its entirety. This could be really interesting combined with the ability to dynamically load LoRA weights at inference time, and it would require a lot less memory, making it suitable for smaller devices.
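
Purely as a thought experiment, a sketch of that simpler routing idea might look like the following. Everything here is hypothetical: the adapter paths, the keyword-based route() stand-in, and the load_lora step mentioned in the comment are placeholders, not a real API.

```python
# Hypothetical sketch of "route to a whole fine-tuned model": a tiny router
# picks one specialist, and only that specialist's LoRA adapter is loaded.
ADAPTERS = {
    "code": "adapters/code-lora",
    "math": "adapters/math-lora",
    "chat": "adapters/chat-lora",
}

def route(prompt: str) -> str:
    """Stand-in router; in practice this could be a small trained classifier."""
    if "def " in prompt or "```" in prompt:
        return "code"
    if any(ch.isdigit() for ch in prompt):
        return "math"
    return "chat"

def generate(prompt: str) -> str:
    adapter_path = ADAPTERS[route(prompt)]
    # Here a hypothetical load_lora(base_model, adapter_path) would swap in the
    # adapter; only one adapter's weights need to be resident at a time.
    return f"[would run base model + {adapter_path} on: {prompt!r}]"

print(generate("what is 12 * 7?"))
```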
