Mistral AI’s open source Mixtral 8x7B model has generated a lot of buzz – here’s what’s inside
Mistral AI’s new Sparse Mixture of Experts (SMoE) LLM, Mixtral 8x7B, has recently made waves, with dramatic headlines like “Mistral AI Introduces Mixtral 8x7B: A Sparse Mixture of Experts (SMoE) Language Model Transforming Machine Learning” or “Mistral AI’s Mixtral 8x7B Outperforms GPT-3.5, Shaking Up the World of AI”.
Mistral AI is a French AI startup founded in 2023 by former Meta and Google engineers. When it released Mixtral 8x7B on December 8, 2023, the company simply dumped a torrent magnet link on its Twitter account, in what was perhaps the most unceremonious release in LLM history.
[Meme about Mistral’s unconventional model release method.]
The accompanying research paper, “Mixtral of Experts” (Jiang et al. 2024), was published on arXiv about a month later, on January 8, 2024. Let’s see if the hype is justified.
(Spoiler alert: Under the hood, there’s not much new technically.)
First, a little history for context.
Sparse MoE in LLMs: A brief history
Mixture of Experts (MoE) models date back to research from the early 1990s (Jacobs et al. 1991). The idea is to model the prediction y as a weighted sum of the outputs of experts E, with the weights determined by a gating network G. It is a way of breaking a large, complex problem down into smaller sub-problems: divide and conquer, if you will. For example, in the original study the authors showed how different experts learn to specialize on different decision boundaries in a vowel discrimination task.
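Written out as a formula (this is the standard formulation of the classic MoE prediction; the notation $E_i$ for the $i$-th of $n$ experts and $G$ for the gating network follows the description above), the model output for an input $x$ is

$$
y = \sum_{i=1}^{n} G(x)_i \, E_i(x),
$$

where $G(x)_i$ is the gating weight assigned to expert $i$, typically produced by a softmax over the experts so that the weights sum to one.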
But what really made MoE take off was top-k routing, an idea first introduced in the 2017 paper “Outrageously Large Neural Networks” (Shazeer et al. 2017). The key idea is to compute the outputs of only the top k experts rather than of all experts. This keeps the FLOPs constant even as: