Mixture of Experts (MoE)

Disclaimer: These are notes for CSE 599K "LLM Serving Systems" at the University of Washington, Spring 2025, instructed by Prof. Baris Kasikci and TA Kan Zhu

Overview

Mixture of Experts (MoE) is an architecture that replaces large feedforward networks with multiple expert networks and a selector/routing layer. The key advantage is that the number of experts (and hence parameters) can grow without increasing per-token FLOPs, since each token is routed to only a few experts, enabling massive parameter scaling at roughly constant computational cost.

1. Same FLOPs, More Parameters = Better Performance

2. Faster Training

3. Competitive Performance

Core MoE Architecture

Dense vs Sparse Model Comparison

Dense Model: FFN - Single large feedforward network
Sparse Model: MoE Layer - Multiple expert FFNs + Router

Key Components

Architecture Variations

What Varies Across MoE Models

  1. Routing Function: How tokens are assigned to experts
  2. Expert Sizes: Size and number of expert networks
  3. Training Objectives: Load balancing and auxiliary losses

Common Patterns

Routing Mechanisms

Formula: For each token, select top-k experts based on routing scores

$$h_t^l = \sum_{i=1}^{N} g_{i,t} \cdot \text{FFN}_i^{(l)}(u_t) + u_t$$

Where:

- $u_t$ is the hidden state of token $t$ entering the MoE layer (also added back as the residual)
- $\text{FFN}_i^{(l)}$ is the $i$-th expert at layer $l$
- $g_{i,t}$ is the gating weight of expert $i$ for token $t$ (zero for experts outside the top-k)
- $N$ is the number of experts
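
A minimal sketch of this computation in PyTorch is below; the `router` (a linear layer), the list of expert modules, and the renormalization of the top-k softmax scores are illustrative assumptions (gate conventions differ across models), not any specific model's implementation.

```python
import torch
import torch.nn as nn

def moe_layer(u: torch.Tensor, experts: nn.ModuleList, router: nn.Linear, k: int = 2):
    """Top-k MoE layer sketch. u: [T, d] token hidden states."""
    scores = torch.softmax(router(u), dim=-1)                 # [T, N] routing probabilities
    topk_scores, topk_idx = scores.topk(k, dim=-1)            # keep k experts per token
    gates = topk_scores / topk_scores.sum(-1, keepdim=True)   # renormalized g_{i,t}

    out = torch.zeros_like(u)
    for e, expert in enumerate(experts):                      # dense loop, for clarity only
        token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue                                          # expert e received no tokens
        out[token_ids] += gates[token_ids, slot].unsqueeze(-1) * expert(u[token_ids])
    return out + u                                            # residual connection (+ u_t)
```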

Routing Strategies

  1. Token Choice: Each token selects top-k experts
  2. Expert Choice: Each expert selects which tokens to process

| Model | Total Experts | Active Experts | Shared Experts | Top-K |
|---|---|---|---|---|
| Mixtral | 8 | 2 | 0 | 2 |
| DBRX | 16 | 4 | 0 | 4 |
| DeepSeek v1 | 64 | 6 | 2 | 6 |
| DeepSeek v3 | 256 | 8 | 1 | 8 |
| Qwen 1.5 | 60 | 4 | 4 | 4 |

Training Challenges and Solutions

Major Challenge: Non-Differentiable Routing

Problem: Sparse gating decisions break gradient flow - only selected experts receive gradients.

Solutions:

  1. Reinforcement Learning: Use REINFORCE to optimize routing policies
  2. Stochastic Perturbations: Add noise to make routing more robust
  3. Heuristic Balancing Losses: Force balanced expert usage

Load Balancing Loss

Critical Issue: Without load balancing, models collapse to using only 2 experts.

Switch Transformer Load Balancing Loss

Purpose: Systems efficiency requires using experts evenly to avoid bottlenecks.

$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

Where:

- $f_i$ is the fraction of the $T$ tokens in the batch dispatched to expert $i$ (non-differentiable)
- $P_i$ is the mean router probability assigned to expert $i$ over the batch (differentiable)
- $N$ is the number of experts and $\alpha$ is the loss coefficient

Key insight: The derivative with respect to $P_i$ is $\frac{\alpha N}{T}\sum_x \mathbf{1}_{\text{argmax } p(x)=i} = \alpha N f_i$, so the more frequently an expert is selected, the more strongly its routing probability is pushed down.
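
A small sketch of this loss in PyTorch is below; the tensor shapes, the default `alpha`, and the top-1 dispatch are illustrative assumptions rather than the exact Switch Transformer code.

```python
import torch

def switch_aux_loss(router_logits: torch.Tensor, num_experts: int, alpha: float = 0.01):
    """Auxiliary load-balancing loss sketch. router_logits: [T, N]."""
    probs = torch.softmax(router_logits, dim=-1)               # p(x) for each token
    expert_idx = probs.argmax(dim=-1)                          # top-1 dispatch decision
    # f_i: fraction of tokens routed to expert i (non-differentiable)
    f = torch.bincount(expert_idx, minlength=num_experts).float() / router_logits.shape[0]
    # P_i: mean routing probability assigned to expert i (carries the gradient)
    P = probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * P)
```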

DeepSeek MoE Balancing Variations

DeepSeek v1-v2: Dual Balancing

Per-device balancing loss: $$\mathcal{L}_{\text{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i^d \cdot P_i^d$$

Communication balancing loss (v2): $$\mathcal{L}_{\text{CommBal}} = \alpha_3 \sum_{i=1}^{D} f_i^{\text{in}} \cdot P_i^{\text{out}}$$

DeepSeek v3: Auxiliary Loss-Free Balancing

DeepSeek MoE Architecture Evolution

DeepSeek v1 (16B total, 2.8B active)

DeepSeek v2 (236B total, 21B active)

DeepSeek v3 (671B total, 37B active)

Fine-Grained Expert Architecture

Training Methods: Upcycling

Concept: Initialize MoE models from pre-trained dense language models.

Process

  1. Take a pre-trained dense model
  2. Copy weights to initialize multiple experts
  3. Add routing mechanism from scratch
  4. Continue training with additional data (see the sketch after this list)
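
A minimal sketch of the copy-and-add-router initialization (steps 2-3) is below; the helper name and the small router init are assumptions, and some recipes additionally perturb the expert copies to break symmetry.

```python
import copy
import torch.nn as nn

def upcycle_ffn(dense_ffn: nn.Module, num_experts: int, d_model: int):
    """Turn a pre-trained dense FFN into (experts, router) for an MoE layer."""
    # Step 2: every expert starts as an exact copy of the dense FFN weights.
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    # Step 3: the router is new and trained from scratch; a small init keeps
    # early routing close to uniform.
    router = nn.Linear(d_model, num_experts, bias=False)
    nn.init.normal_(router.weight, std=0.02)
    return experts, router
```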

Qwen MoE Example

System Optimizations

Training Optimizations

Expert Parallelism

All-to-All Communication Pattern

Purpose: Scatter/gather distinct messages from each participant to every other participant.

GPU0: [A0, A1, A2, A3]  →  GPU0: [A0, B0, C0, D0]
GPU1: [B0, B1, B2, B3]  →  GPU1: [A1, B1, C1, D1]
GPU2: [C0, C1, C2, C3]  →  GPU2: [A2, B2, C2, D2]
GPU3: [D0, D1, D2, D3]  →  GPU3: [A3, B3, C3, D3]

Process:

  1. Dispatch phase: Layout transformation → Group tokens by target expert → First All-to-All (see the sketch after this list)
  2. Expert compute: Each expert processes its assigned tokens
  3. Combine phase: Second All-to-All → Layout transformation → Restore original positions
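
A sketch of the dispatch All-to-All using `torch.distributed.all_to_all_single` is below; it assumes a process group is already initialized (CUDA tensors if the backend is NCCL) and that tokens have already been permuted so tokens destined for the same rank are contiguous. The combine phase runs the same exchange in reverse.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, input_splits: list[int]) -> torch.Tensor:
    """First All-to-All: scatter tokens to the ranks that host their experts.
    tokens: [T, d], grouped so tokens for the same rank are contiguous;
    input_splits[r] = number of local tokens destined for rank r."""
    # Exchange split sizes so each rank knows how many tokens it will receive.
    in_splits = torch.tensor(input_splits, dtype=torch.int64)
    out_splits = torch.empty_like(in_splits)
    dist.all_to_all_single(out_splits, in_splits)
    # Exchange the tokens themselves.
    recv = tokens.new_empty((int(out_splits.sum()), tokens.shape[1]))
    dist.all_to_all_single(recv, tokens,
                           output_split_sizes=out_splits.tolist(),
                           input_split_sizes=in_splits.tolist())
    return recv   # local experts process these tokens; combine reverses the exchange
```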

Communication Bottlenecks

Problem: All-to-All operations consume significant time, accounting for roughly a third (34.1%) of distributed MoE training time.

Training Optimizations: Lina

Core Strategy

Intuition: Always prioritize All-to-All and avoid bandwidth sharing.

Techniques:

  1. Tensor Partitioning: Break AllReduce into micro-operations
  2. Priority Scheduling: Give All-to-All operations higher priority
  3. Pipelining: Overlap computation with All-to-All communication

Results: Up to 2.4x speedup in MoE layer execution

Deployment Strategies

Memory Requirements

Mixtral 8x7B Example: ~47B total parameters but only ~13B active per token; in FP16 the weights alone take roughly 94 GB, more than a single 80 GB GPU can hold, even though per-token compute matches a ~13B dense model.

Inference Optimizations

1. Offloading Approaches

2. CPU Compute (Fiddler)

Core Idea: Compute experts on CPU instead of copying weights to GPU.

Strategy:

  1. Initialization: Keep attention weights on GPU, profile expert popularity
  2. Placement: Popular experts on GPU, others on CPU
  3. Execution: Decide per token whether to compute on CPU or GPU
  4. Optimization: Activation copying <0.1ms vs Weight copying >50ms

Latency Model:

$$\arg\min_{\text{cpu\_expert},\,\text{gpu\_expert}} \max\left(\sum_{i \in \text{cpu\_expert}} n_{\text{input},i} \times \text{latency}_{\text{cpu}},\ \sum_{i \in \text{gpu\_expert}} (1 - \text{is\_on\_gpu}_i) \times \text{latency}_{\text{gpu}}\right)$$
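
A brute-force sketch of this placement decision is below; the function and variable names are illustrative rather than Fiddler's actual API, and exhaustive search is only feasible because per-layer expert counts are small (e.g., 8 in Mixtral).

```python
from itertools import combinations

def place_experts(n_input: dict, is_on_gpu: dict, lat_cpu: float, lat_gpu: float):
    """Pick which experts to run on CPU vs. GPU by minimizing the slower side.
    n_input[e]: tokens routed to expert e; is_on_gpu[e]: 1 if weights are resident."""
    experts = list(n_input)
    best, best_cost = None, float("inf")
    for k in range(len(experts) + 1):
        for cpu_set in combinations(experts, k):
            gpu_set = [e for e in experts if e not in cpu_set]
            cpu_time = sum(n_input[e] * lat_cpu for e in cpu_set)           # CPU compute
            gpu_time = sum((1 - is_on_gpu[e]) * lat_gpu for e in gpu_set)   # weight copies
            cost = max(cpu_time, gpu_time)    # CPU and GPU sides run concurrently
            if cost < best_cost:
                best, best_cost = (set(cpu_set), set(gpu_set)), cost
    return best
```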

Performance: 8.2-10.1x faster than Mixtral-Offloading, 19.4-22.5x faster than DeepSpeed MII

3. Expert Popularity Profiling

Challenge: During inference, expert popularity differs from training due to load balancing losses.

Solution:

  1. Collect expert selection patterns during training (after load balancing converges), as sketched after this list
  2. Create expert selection paths across layers
  3. Use this profile to predict resource allocation during inference
  4. Allocate more resources to popular experts
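
A counting sketch of steps 1-2 is below; the log format and the normalization are assumptions, not a specific system's interface.

```python
from collections import Counter, defaultdict

def build_expert_profile(selection_log):
    """Build a per-layer expert-popularity profile.
    selection_log: iterable of (layer, expert_id) pairs recorded late in training,
    after the load-balancing loss has converged."""
    counts = defaultdict(Counter)
    for layer, expert_id in selection_log:
        counts[layer][expert_id] += 1
    profile = {}
    for layer, cnt in counts.items():
        total = sum(cnt.values())
        profile[layer] = {e: c / total for e, c in cnt.items()}  # relative popularity
    return profile   # popular experts get GPU placement / more replicas at inference
```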

DeepSeek V3 Deployment

Training Infrastructure

Prefill Stage (32-way Expert Parallelism)

Decode Stage (320 GPUs across 40 nodes)

Batching MoE Computation

GroupGemm Approach

Process:

  1. Routing: Determine expert assignments for each token
  2. Permutation: Group tokens by target expert using prefix sum
  3. Computation: Use GroupGemm for efficient batched computation
  4. Un-permutation: Restore tokens to original positions
  5. Mixing: Combine expert outputs with routing weights

Efficiency: Single GPU kernel with batching benefits across all experts.

Permutation Index Generation

Method: Use prefix sum (scan) operations for efficient parallel permutation index calculation:
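
A sketch of the index computation is below, assuming top-1 routing and written in PyTorch for brevity; real kernels compute the same exclusive scan and ranking in parallel on the GPU.

```python
import torch

def permutation_indices(expert_idx: torch.Tensor, num_experts: int) -> torch.Tensor:
    """expert_idx: [T] expert assignment per token (top-1 shown for simplicity).
    Returns perm with perm[t] = position of token t in the expert-grouped layout."""
    counts = torch.bincount(expert_idx, minlength=num_experts)   # tokens per expert
    offsets = torch.cumsum(counts, dim=0) - counts               # exclusive prefix sum
    perm = torch.empty_like(expert_idx)
    for e in range(num_experts):                                 # conceptually parallel
        token_ids = (expert_idx == e).nonzero(as_tuple=True)[0]
        perm[token_ids] = offsets[e] + torch.arange(token_ids.numel(),
                                                    device=expert_idx.device)
    return perm   # scatter with grouped[perm] = tokens to get expert-contiguous order
```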

Performance Results

Training Efficiency

Inference Improvements

Benchmark Performance

Mixtral 8x7B vs Dense Models:

Key Takeaways

  1. MoEs enable parameter scaling without proportional FLOP increases
  2. Load balancing is critical - models collapse without it
  3. Communication is the bottleneck in distributed MoE training (34.1% of training time)
  4. System optimizations are essential for practical deployment
  5. Recent models (DeepSeek V3) achieve competitive performance with massive scale
  6. Upcycling from dense models is a viable initialization strategy
  7. CPU-GPU hybrid approaches can dramatically improve inference efficiency
  8. Expert popularity profiling enables better resource allocation during inference