May 2026 roadmap overview

Milestone: May: prepping the scaling recipe, data mixture, and GPU training

Overview

May is the prep month for June's production run. April kicked off the preregistered 1e23 MoE and shipped the surrounding platform; May closes the loop: finish the 1e23, pin the scaling recipe and data mixture for the June 16B-A2B model, get H100s into the training fleet, and stand up an inference service for evals that survives preemption. Nine epics, grouped below by what they unlock. By end of May we should be able to forecast the June run with confidence; anything that slips past May turns the June forecast into a guess.

Epics

Pre-training

#4697 Experiment: 1e23 MoE Run

Hit (or miss) the preregistered 2.25 Paloma macro target on the d5120 / 129B-total / 16B-active MoE. Currently at ~487B tokens and macro 2.4986 after three crash-resume cycles; the recurring crash signature (TPU launch-id divergence caused by wandb.init(resume="allow")) has been bisected, and the validated fix is a fresh run id per attempt with resume="never" (sketched below).
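
A minimal sketch of the fix, assuming a wrapper relaunches the job after each preemption; the project name is a placeholder:

```python
import wandb

# resume="allow" let a relaunched attempt reattach to a stale run id (the
# launch-id divergence); instead, mint a fresh id on every attempt and
# forbid resumption outright.
run = wandb.init(
    project="moe-1e23",            # hypothetical project name
    id=wandb.util.generate_id(),   # fresh id per launch attempt
    resume="never",                # never reattach to a previous run
)
```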

#5358 Land the scaling recipe for June model

Integrate April's ablation results (AdamH-on-embed #5184 and friends), retune the learning rate, optionally explore long-context extension, and produce isoFLOP results that let us forecast the June run (fitting sketch below). Pre-registration of the June run is gated on #5359.
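
For illustration, one way the forecast falls out of the isoFLOP points: fit the usual saturating power law and extrapolate. Every number below is an invented placeholder; the real points come from this epic's ablations.

```python
import numpy as np
from scipy.optimize import curve_fit

# Compute is normalized to 1e20 FLOPs to keep the fit numerically stable.
flops = np.array([1e20, 3e20, 1e21, 3e21, 1e22])  # budgets swept (placeholder)
loss = np.array([3.10, 2.95, 2.80, 2.68, 2.58])   # best loss per budget (placeholder)

def power_law(x, a, b, l_inf):
    # L(C) = a * C^(-b) + L_inf, the usual saturating compute-scaling form
    return a * np.power(x, -b) + l_inf

x = flops / 1e20
params, _ = curve_fit(power_law, x, loss, p0=(0.7, 0.3, 2.4))
print("forecast loss at 1e23 FLOPs:", power_law(1e23 / 1e20, *params))
```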

Post-training

No dedicated post-training epic on the May milestone. The April synthetic-data threads — SWE-ZERO scale-out #4719, the 500K SFT dose-response point on Marin-8B #4898, and the Marin-32B TerminalCorpus midtrain #4760 — are open without a milestone home. A decision is pending on the parent epic #3192: whether these get absorbed elsewhere into May or a successor epic opens. Worth flagging.

Data & evals

#5359 Determine data mixture for pre- and mid-training

Active swarm over all sources in datakit/sources.py, de-risked on the existing #2345 swarm against UncheatableEval, HumanEval, MMLU, GPQA, and David's PPL sets. Must-have: a checked-in mixture file that beats proportional mixing over all sources and over a hand-picked high-quality subset. Stretch: fully locked by mid-June.
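
A hypothetical shape for the checked-in artifact, alongside the proportional baseline it has to beat. Source names are illustrative stand-ins for entries in datakit/sources.py; weights are sampling proportions over tokens.

```python
# Hypothetical mixture file contents; not the actual #5359 format.
MIXTURE = {
    "web-crawl": 0.55,
    "code": 0.20,
    "papers": 0.10,
    "books": 0.08,
    "math": 0.07,
}
assert abs(sum(MIXTURE.values()) - 1.0) < 1e-9

def proportional(token_counts: dict[str, int]) -> dict[str, float]:
    # Baseline: weight each source by its share of raw tokens.
    total = sum(token_counts.values())
    return {name: n / total for name, n in token_counts.items()}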

#5360 Data pipeline: quality scores + dedup param selection

Get dedup params and contamination detection (p0) ready in time for #5359's May 15 launch; quality scores (p1, related to #5200); domain tagging and embeddings as stretch.
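
As a concrete (and heavily assumed) illustration of what the p0 params are: if contamination detection is n-gram overlap against eval items, then n and the flagging threshold are exactly the knobs being selected. The actual #5360 mechanism may differ.

```python
# Flag a training document whose n-gram Jaccard overlap with any eval item
# crosses a threshold. n=13 follows a common convention; both values here
# are placeholders for whatever the param selection lands on.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(doc: str, eval_items: list[str],
                 n: int = 13, thresh: float = 0.8) -> bool:
    d = ngrams(doc, n)
    if not d:
        return False
    return any(
        len(d & e) / len(d | e) >= thresh
        for e in (ngrams(item, n) for item in eval_items)
    )
```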

#5367 Identifying perplexity gaps for eval and training

Build an aggregate PPL pool that positively correlates (p < 0.05) with post-trained performance, validated on Qwen 2.5 base → coder models. The output names the evals we're weak on and drives the data-mix decisions feeding #5359 and #5358 (acceptance-test sketch below).
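
The acceptance test is just a correlation with a p-value; a sketch with scipy, all values invented. Lower perplexity should track better downstream scores, so PPL is negated to test for a positive r.

```python
from scipy.stats import pearsonr

# Placeholder values; real ones come from paired checkpoints
# (e.g. Qwen 2.5 base -> coder).
ppl_pool = [2.91, 2.74, 2.63, 2.55, 2.50, 2.46]       # aggregate PPL per checkpoint
post_trained = [31.0, 38.5, 44.2, 49.0, 52.1, 54.8]   # downstream metric

r, p = pearsonr([-x for x in ppl_pool], post_trained)
print(f"r={r:.3f}, p={p:.4f}")  # want a significant positive correlation, p < 0.05
```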

#5368 Inference service for evals

Selected evals — MMLU SL Verb 5-shot and HumanEval 5-shot on the 1e22 MoE on v5p-8 — run on Iris behind vLLM with an OpenAI-compatible HTTP proxy, preemption-resilient: eval jobs no longer blow up when a worker gets evicted (client-side sketch below). Mega-Evals #2663 is the p1 follow-up.
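
From the eval-job side, resilience can be as simple as retrying through the proxy. A sketch assuming the proxy speaks the standard OpenAI /v1 surface; the base URL and model name are hypothetical.

```python
import time
from openai import OpenAI, APIConnectionError, APIStatusError

client = OpenAI(base_url="http://iris-proxy:8000/v1", api_key="unused")

def complete(prompt: str, retries: int = 8) -> str:
    # A worker eviction surfaces as a transient error; back off and retry
    # instead of failing the whole eval job.
    for attempt in range(retries):
        try:
            resp = client.completions.create(
                model="moe-1e22", prompt=prompt, max_tokens=256
            )
            return resp.choices[0].text
        except (APIConnectionError, APIStatusError):
            time.sleep(min(2 ** attempt, 60))  # wait while vLLM reschedules
    raise RuntimeError("inference service unavailable after retries")
```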

Infra

#5356 Run MoEs on multinode H100s on CoreWeave

Train the June 16B-A2B MoE for ~1k steps on 2+ GPU hosts. The June model is the first one we're not committing to TPU-only.
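
For a sense of what the ~1k-step run validates (multi-host init, sharding, checkpointing, not convergence), a minimal bring-up sketch. It assumes a JAX/XLA stack, which is a guess from the kernel epic below benchmarking against XLA; the coordinator address and env vars are placeholders supplied by whatever launcher schedules the hosts.

```python
import os
import jax

jax.distributed.initialize(
    coordinator_address="h100-host0:1234",        # hypothetical
    num_processes=int(os.environ["NUM_HOSTS"]),   # hypothetical env var
    process_id=int(os.environ["HOST_ID"]),        # hypothetical env var
)
print(f"{jax.device_count()} GPUs visible across {jax.process_count()} hosts")
```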

#5357 H100 kernel perf

Match Nemotron's MFU within ε on H100s (back-of-envelope below). The Triton fwd+bwd path from #5330 (1.91× XLA on H100×8) is the starting line; SonicMoE-style local-compute work in #5328 is the open back-half of the stack.
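
MFU is achieved model FLOPs over peak hardware FLOPs. A back-of-envelope check using the standard 6 · N_active · tokens/sec estimate; the throughput number is a made-up placeholder, and the peak is H100 SXM dense BF16 (~989 TFLOPS).

```python
PEAK_FLOPS_PER_GPU = 989e12  # H100 SXM, dense BF16 tensor-core peak
NUM_GPUS = 8

active_params = 2e9          # e.g. the A2B active-parameter count
tokens_per_sec = 250_000     # hypothetical throughput across all 8 GPUs

achieved = 6 * active_params * tokens_per_sec  # model FLOPs per second
mfu = achieved / (PEAK_FLOPS_PER_GPU * NUM_GPUS)
print(f"MFU = {mfu:.1%}")    # 3e15 / 7.9e15, ~38% with these placeholders
```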

#5369 Infra tune-up

Catch-all for the things that make daily work less painful: unified queries over operational data so agents can use it, zero-trust proxy, GH ↔ Iris integration, more consistent logging. This is also where the leftover canary stability work and post-Ray-sunset polish land.

What gates what