What Is This?
When GPT-4 was released in 2023, the public expected a model roughly twice the size of GPT-3. What most people didn't know — and what OpenAI never officially confirmed — is that it likely isn't a single dense model at all. According to leaked information and subsequent industry analysis, GPT-4 is believed to be a Mixture of Experts model: eight separate networks of roughly 220 billion parameters each, totalling around 1.76 trillion parameters, with only two experts activating for any given token during inference.
This detail matters enormously. A 1.76 trillion parameter dense model would require compute costs that are genuinely prohibitive for inference at scale. But a sparse MoE model that only activates ~440 billion parameters per token runs at a fraction of that cost while retaining the quality benefits of a much larger total parameter count. The architecture is how frontier AI labs are getting more capability from the same compute budget — and why the cost of a given level of capability has continued to fall faster than hardware improvements alone would explain.^1
The core idea is over 30 years old. Mixture of Experts was introduced by Jacobs, Jordan, Nowlan, and Hinton in 1991 — long before transformers existed. The concept: rather than one generalist network handling every type of input, train multiple specialised networks (experts) and learn a routing function (the gating network) that decides which expert to send each input to. The model learns both the experts and the routing simultaneously.
Applied to modern language models, it works like this: inside each transformer block, instead of a single feed-forward network, there are N feed-forward networks (the experts) and a small router. For each token at each layer, the router computes a score for every expert and selects the top K (usually 2) to process that token. Only those K experts activate — the other N − K sit idle for that token. This is sparse activation: the total model is large, but the active computation per token is small.
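To make the routing step concrete, here is a minimal, illustrative sketch of a sparse top-K MoE layer in PyTorch. It is not any particular model's implementation — the class name, sizes, and the per-expert loop are simplifications for readability — but it shows the essential mechanism: a linear router scores every expert, only the top K experts run for each token, and their outputs are combined using the normalised router weights.

```python
# Illustrative sketch of a sparse top-K MoE layer (not any specific model's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The router: a single linear layer producing one score per expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # The experts: independent feed-forward networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) — one row per token at this layer.
        scores = self.router(x)                              # (tokens, experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)     # pick the K best experts per token
        top_w = F.softmax(top_w, dim=-1)                     # normalise the K routing weights

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Which tokens routed to expert e, and in which of their K slots?
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                     # this expert sits idle for the batch
            # Only the selected tokens are processed by this expert (sparse activation).
            out[token_ids] += top_w[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

Production implementations replace the per-expert Python loop with batched dispatch, capacity limits, and expert parallelism across devices, but the routing logic is the same.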
Mixtral 8x7B — Mistral AI's open model, released in December 2023 with an accompanying paper in January 2024 — made this architecture widely accessible and documented. Each layer has eight expert feed-forward networks, and each token is routed to 2 of them. Because the attention layers and embeddings are shared rather than replicated per expert, the total parameter count is ~47B (not the naive 8 × 7B = 56B) and the active parameter count per token is ~13B. It benchmarks comparably to Llama 2 70B (a dense model more than 5x its active size) while running at roughly the speed of a ~13B dense model.^2
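The parameter arithmetic is worth seeing once, because the "8x7B" name hides it. The back-of-the-envelope calculation below uses the hyperparameters reported in the Mixtral paper (approximate, ignoring normalisation layers) to show why the total lands near 47B rather than 56B, and why only ~13B are active per token.

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B, using the hyperparameters
# reported in the Mixtral paper (approximate; normalisation layers are ignored).
d_model, d_ff, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab, n_experts, top_k = 32_000, 8, 2

# Attention is shared by all experts: Q/O projections plus smaller K/V (grouped-query attention).
attn_per_layer = d_model * (2 * n_heads * head_dim + 2 * n_kv_heads * head_dim)

# Each expert is a SwiGLU feed-forward block: three d_model x d_ff matrices.
expert_params = 3 * d_model * d_ff

# Input embeddings plus the (untied) output head.
embeddings = 2 * vocab * d_model

shared = n_layers * attn_per_layer + embeddings
total  = shared + n_layers * n_experts * expert_params   # every expert lives in memory
active = shared + n_layers * top_k * expert_params       # only the top-k experts run per token

print(f"total  ≈ {total / 1e9:.1f}B")   # ≈ 46.7B, not the naive 8 x 7B = 56B
print(f"active ≈ {active / 1e9:.1f}B")  # ≈ 12.9B per token
```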
Why Does It Matter?
- It's why AI capability has compounded faster than compute budgets would suggest. The scaling hypothesis — more parameters + more data + more compute = better models — is true but incomplete. MoE is a way to decouple "total parameters" from "compute per inference." You can have a much larger model with diverse, specialised knowledge, but only pay for a fraction of it per forward pass. This is part of why each generation of frontier models benchmarks significantly better than the last without proportionally larger training clusters.
- Experts genuinely specialise. NVIDIA's analysis of Mixtral 8x7B found that experts do develop domain-specific preferences — some activate more frequently for code tokens, others for mathematical reasoning, others for natural language. This wasn't explicitly trained; it emerged from the routing optimisation. The specialisation is real, which is part of why MoE models can match dense models that use far more active parameters per token: the experts that activate are the right experts for the task.^3
- It explains the benchmark vs. cost gap that confuses most users. When you see a model that scores 90% on MMLU with "only" 12B active parameters, MoE is often the explanation. The model has 47B or 141B total parameters, but most of them sleep through any given token. The benchmark reflects total knowledge capacity; the inference cost reflects active compute. Understanding this distinction prevents the category error of thinking "bigger always means slower/more expensive."
- The memory problem is the primary constraint — and it shapes where this goes next. The catch with MoE: all experts must be loaded into memory (RAM/VRAM) even if only 2 activate per token. Mixtral 8x7B requires roughly 90GB of VRAM to run at 16-bit precision — far more than the ~26GB you'd expect from ~13B active parameters. This makes local deployment on consumer hardware difficult. It's why quantisation (reducing precision from 16- or 32-bit floats to 4-bit integers) is central to running MoE models locally — quantised Mixtral 8x7B fits in about 24GB (see the memory arithmetic sketched after this list). Much of the research on model offloading and quantisation is directly driven by this MoE memory constraint.^4
- DeepSeek's efficiency story is the MoE story. DeepSeek-V2 (2024) and DeepSeek-V3 (late 2024) both use a variant called DeepSeekMoE, built on many fine-grained experts (DeepSeek-V3 routes each token to 8 of 256 small routed experts) plus a shared expert mechanism (experts that always activate, handling common patterns). DeepSeek-V3 achieves GPT-4 class performance while its final training run reportedly cost about $5.6M in GPU time — a small fraction of what comparable frontier models are reported to cost, though that figure excludes prior research and ablation runs. MoE is a primary architectural reason. The "Chinese AI is cheap" story is, at its core, a MoE routing story.
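To put numbers on the memory constraint described above: weight memory is set by total parameters (every expert must be resident), while per-token compute is set by active parameters. A rough calculation, using the approximate Mixtral 8x7B figures from earlier and ignoring activations, KV cache, and runtime overhead:

```python
# Rough memory arithmetic for serving a sparse MoE model such as Mixtral 8x7B.
# All experts must be resident even though only two run per token, so VRAM is
# governed by TOTAL parameters; compute is governed by ACTIVE parameters.
GB = 1e9

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """GB needed just to hold the weights at a given numeric precision."""
    return n_params * bits_per_param / 8 / GB

total_params  = 46.7e9   # every expert, always loaded
active_params = 12.9e9   # what actually runs per token

print(f"16-bit, total : {weight_memory_gb(total_params, 16):.0f} GB")   # ~93 GB -> multi-GPU territory
print(f"16-bit, active: {weight_memory_gb(active_params, 16):.0f} GB")  # ~26 GB -> misleadingly small
print(f" 4-bit, total : {weight_memory_gb(total_params, 4):.0f} GB")    # ~23 GB -> fits a single 24 GB card
```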
Key People & Players
Robert Jacobs, Michael Jordan, Steven Nowlan, Geoffrey Hinton — Authors of the original 1991 MoE paper ("Adaptive mixtures of local experts"). Hinton won the Nobel Prize in Physics in 2024 for his broader contributions to neural networks. The foundational architecture was his lab's.
Noam Shazeer (Google Brain → Character.AI → back to Google in 2024) — Lead author of the 2017 paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" — the paper that scaled MoE to language models and demonstrated that massive sparse models could outperform dense models of equivalent compute. Shazeer was also a co-author of "Attention Is All You Need." He is arguably the most influential figure in practical LLM architecture.^5
Mistral AI — Released the Mixtral 8x7B weights openly in December 2023 and published the accompanying paper in January 2024, making a modern MoE architecture accessible and auditable in the open for the first time at this scale. The paper is freely available and the model weights are downloadable. Responsible for much of the current public understanding of how MoE actually works in practice.^6
DeepSeek AI — Took MoE further with fine-grained expert architectures that achieve better specialisation and load balancing. Their cost efficiency numbers are the most striking demonstration of what intelligent architecture choices can do to the compute cost curve.
William Fedus, Barret Zoph et al. (Google Brain) — Authors of "Switch Transformers" (2021), which simplified MoE by routing each token to exactly 1 expert (not 2) and demonstrated that trillion-parameter sparse models were trainable. The proof of concept at scale that preceded the current generation.
The Current State
MoE is now the dominant architecture at the frontier. Mixtral, DeepSeek-V3, and Grok-1 are openly documented MoE models; Gemini 1.5 Pro is described as a sparse MoE in its technical report; GPT-4 and Claude 3 are unconfirmed but widely suspected to use MoE variants as well. The transition from dense to sparse architectures at frontier scale happened between 2022 and 2024 and was almost completely invisible to most users.
The active research frontiers:
Expert specialisation vs. load balancing — Routers naturally prefer certain experts, creating load imbalance. Some experts get overloaded; others rarely activate. This wastes capacity and causes training instability. Auxiliary loss functions that penalise imbalance are the standard fix (a minimal version is sketched after this list), but they introduce a trade-off between letting experts specialise freely and keeping hardware utilisation balanced.
Fine-grained MoE — DeepSeek's approach: instead of a handful of large experts, use many small ones (the original DeepSeekMoE recipe activated 6 of 64 routed experts; later DeepSeek models scale the expert count much higher). More specialisation, better routing granularity, but more complex load balancing. The research on optimal expert count and size is active.
Mixture of Depths — A newer variant that routes tokens not to different expert networks but to different numbers of transformer layers. Some tokens are "easy" and only need shallow processing; others need deep reasoning. Applying compute proportionally to difficulty is the next efficiency gain.
Multimodal MoE — Routing different modalities (text tokens vs. image patches vs. audio segments) to different experts. Gemini 1.5's multimodal capability is partly explained by this architecture.
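As a concrete illustration of the load-balancing fix mentioned in the first item above, here is a minimal auxiliary loss in the style of the Switch Transformers formulation, lightly adapted for top-K routing. The function name and the default coefficient are illustrative, not taken from any specific codebase.

```python
# Minimal load-balancing auxiliary loss, in the spirit of Switch Transformers:
# penalise the product of (fraction of routing decisions sent to each expert) and
# (mean router probability for that expert), which is minimised by uniform routing.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2, alpha: float = 0.01) -> torch.Tensor:
    # router_logits: (num_tokens, num_experts) raw scores from the router.
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)                       # (tokens, experts)
    _, top_idx = probs.topk(top_k, dim=-1)                         # which experts each token picked

    # f_i: fraction of routing decisions that went to expert i.
    picked = F.one_hot(top_idx, num_experts).sum(dim=1).float()    # (tokens, experts), 0/1 per expert
    fraction_routed = picked.mean(dim=0) / top_k                   # sums to 1 across experts

    # P_i: mean router probability assigned to expert i across tokens.
    mean_prob = probs.mean(dim=0)

    # Scaled so that perfectly uniform routing gives loss = alpha.
    return alpha * num_experts * torch.sum(fraction_routed * mean_prob)
```

The coefficient alpha is exactly the trade-off knob described above: a larger value forces more uniform expert usage at the cost of constraining how freely the router can specialise.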
The practical implication: the "bigger model is always better" heuristic is increasingly obsolete. A well-designed ~47B MoE model beats a poorly designed 70B dense model. Architecture and routing quality matter as much as raw parameter count. Understanding what's under the hood of the models you're building with gives you better intuitions about why they behave the way they do — and what their actual cost structure is.
Best Resources to Learn More
- Hugging Face: Mixture of Experts Explained — The most thorough accessible technical explainer available. Covers history, architecture, training, and current models.^7
- Mixtral of Experts paper (arXiv, Jan 2024) — The Mistral AI paper. Readable, well-documented, the primary source for current MoE understanding.^8
- NVIDIA: Applying MoE in LLM Architectures — Practical engineering perspective on how MoE is implemented and optimised for inference.^9
- "Outrageously Large Neural Networks" — Shazeer et al. (2017) — The paper that brought MoE to language models at scale.^10
- DeepSeek-V3 Technical Report — The most recent and most extreme demonstration of what fine-grained MoE can achieve on a constrained compute budget.^11