MobileMoE: Scaling On-Device Mixture of Experts

:::info Stub: Full Engineering Breakdown Coming
This paper was featured on Hugging Face Daily Papers on 2026-05-26 with 13 upvotes. A full breakdown with production viability rating, implementation notes, and honest limitations is being written.
:::


Authors	Yanbei Chen et al.
Year	2026
HF Upvotes	13
arXiv	2605.27358
PDF	Download
HF Page	View on Hugging Face

Abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot (moderate sparsity with fine-grained and shared experts) that is simultaneously memory- and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers 1.8-3.8× faster prefill and 2.2-3.4× faster decode than the dense baseline MobileLLM-Pro.

Engineering Breakdown

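The Problem

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored.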

The Approach

To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs.
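To make "fine-grained and shared experts" concrete, here is a minimal sketch of one MoE feed-forward layer in plain Python. It is not the MobileMoE implementation: the softmax router, the top-k selection with renormalized gate weights, the ReLU MLP experts, and every size below (`D_MODEL`, `N_ROUTED`, `N_SHARED`, `TOP_K`) are illustrative assumptions. It only shows the general pattern in which shared experts run for every token while each token activates a small subset of the routed experts.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_expert(rng, d_model, d_hidden):
    """A tiny two-layer MLP expert, stored as (W_in, W_out)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    return w_in, w_out


def expert_forward(expert, x):
    w_in, w_out = expert
    hidden = [max(0.0, v) for v in matvec(w_in, x)]  # ReLU (assumed activation)
    return matvec(w_out, hidden)


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def moe_forward(x, router, routed, shared, top_k):
    """Return (layer output, indices of the routed experts that ran)."""
    probs = softmax(matvec(router, x))
    # Keep the top_k routed experts with the highest router probability.
    chosen = sorted(range(len(routed)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    # Shared experts process every token, with no gating.
    for expert in shared:
        out = [o + v for o, v in zip(out, expert_forward(expert, x))]
    # Routed experts are weighted by their renormalized gate value.
    for i in chosen:
        gate = probs[i] / norm
        out = [o + gate * v for o, v in zip(out, expert_forward(routed[i], x))]
    return out, chosen


# Illustrative sizes only (assumptions, not taken from the paper).
rng = random.Random(0)
D_MODEL, D_HIDDEN = 8, 4
N_ROUTED, N_SHARED, TOP_K = 16, 1, 2

router = [[rng.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(N_ROUTED)]
routed = [make_expert(rng, D_MODEL, D_HIDDEN) for _ in range(N_ROUTED)]
shared = [make_expert(rng, D_MODEL, D_HIDDEN) for _ in range(N_SHARED)]

x = [rng.gauss(0, 1) for _ in range(D_MODEL)]
y, chosen = moe_forward(x, router, routed, shared, TOP_K)

# Fraction of expert weights touched per token in this toy layer.
active_fraction = (N_SHARED + TOP_K) / (N_SHARED + N_ROUTED)
```

In this toy layer only 3 of the 17 experts run per token, which is the sense in which active parameters can be far fewer than total parameters. "Fine-grained" here means many small experts (a small `D_HIDDEN`) rather than a few large ones.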

Key Results

Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4× fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters.
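As a rough back-of-envelope aid for reading these numbers, the sketch below turns the parameter ranges quoted in the abstract (0.3-0.9B active, 1.3-5.3B total) into approximate INT4 weight memory and per-token compute. Several things here are assumptions rather than facts from the paper: pairing the low ends and the high ends of the two ranges into two configurations, counting exactly 4 bits per weight with no quantization scales or higher-precision layers, and the common rule of thumb of about 2 FLOPs per active parameter per generated token. The helper names are hypothetical.

```python
def int4_weight_gb(total_params):
    """Approximate weight memory in GB (1e9 bytes) at 4 bits per weight.

    Assumption: ignores quantization scales, zero-points, and any layers
    kept at higher precision.
    """
    return total_params * 0.5 / 1e9


def decode_flops_per_token(active_params):
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter."""
    return 2 * active_params


# Assumed pairing of the range endpoints quoted in the abstract.
configs = {
    "low end": {"active": 0.3e9, "total": 1.3e9},
    "high end": {"active": 0.9e9, "total": 5.3e9},
}

summary = {}
for name, cfg in configs.items():
    summary[name] = {
        "int4_gb": int4_weight_gb(cfg["total"]),
        "gflops_per_token": decode_flops_per_token(cfg["active"]) / 1e9,
        "active_fraction": cfg["active"] / cfg["total"],
    }

for name, row in summary.items():
    print(
        f"{name}: ~{row['int4_gb']:.2f} GB INT4 weights, "
        f"~{row['gflops_per_token']:.1f} GFLOPs/token, "
        f"{row['active_fraction']:.0%} of weights active"
    )
```

These estimates are not the paper's measurements. They only illustrate why an MoE model can occupy the weight memory of a larger model while paying per-token compute closer to that of a much smaller one.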

Research Areas

This paper contributes to the following areas of AI/ML engineering:

Machine learning
Deep learning
Neural networks
Model optimization
AI systems
MobileMoE
