Milestone: May — prepping the scaling recipe, data mix, and GPU training
May is the prep month for June's production run. April kicked off the preregistered 1e23 MoE and shipped the surrounding platform; May closes the loop: finish the 1e23, pin the scaling recipe and data mixture for the June 16B-A2B model, get H100s into the training fleet, and stand up an inference service for evals that survives preemption. Nine epics, grouped below by what they unlock. By end of May we should be able to forecast the June run with confidence; anything that slips past May turns the June forecast into a guess.
Hit (or miss) the preregistered 2.25 Paloma macro target on the d5120 / 129B-total / 16B-active MoE. Currently at ~487B tokens and macro 2.4986 after three crash-resume cycles; the recurring crash signature (TPU launch-id divergence triggered by wandb.init(resume="allow")) has been bisected, and the validated fix is a fresh run id per attempt with resume="never".
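The validated fix is small enough to sketch. A minimal version, assuming the standard wandb API (the helper name is hypothetical):

```python
import uuid

def fresh_wandb_kwargs() -> dict:
    # Mint a brand-new run id on every launch attempt and refuse to
    # resume, so crash-restart cycles can't hit the launch-id
    # divergence triggered by resume="allow".
    return {"id": uuid.uuid4().hex[:8], "resume": "never"}

# Each attempt then calls: wandb.init(**fresh_wandb_kwargs(), project=...)
```

The run history fragments across wandb runs, but the training job itself resumes cleanly from checkpoints, which is the trade the fix makes.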
Integrate April's ablation results (AdamH-on-embed #5184 and friends), retune the LR, optionally explore long-context extension, and produce isoFLOP results that let us forecast the June run. Preregistration of the June run is gated on #5359.
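The forecasting step amounts to fitting a power law through the isoFLOP results and extrapolating to the June budget; a self-contained sketch on synthetic, noiseless data (all constants invented):

```python
import math

def fit_power_law(compute, loss):
    # Least-squares fit of loss ≈ a * C**b in log-log space.
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b

# Fit on smaller runs, then evaluate at the June compute budget.
compute = [1e20, 1e21, 1e22]                  # FLOPs (illustrative)
loss = [20.0 * c ** -0.05 for c in compute]   # synthetic power-law data
a, b = fit_power_law(compute, loss)
forecast_at_1e23 = a * 1e23 ** b
```

On real isoFLOP minima the fit carries noise, so the forecast comes with error bars; this is the "confidence vs. guess" distinction the milestone summary draws.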
No dedicated post-training epic lands on the May milestone. The April synthetic-data threads — SWE-ZERO scale-out #4719, the 500K SFT dose-response point on Marin-8B #4898, and the Marin-32B TerminalCorpus midtrain #4760 — are open without a milestone home. A decision is pending on the parent epic #3192: either they get absorbed elsewhere in May or a successor epic opens. Worth flagging.
Run an active swarm over all sources in datakit/sources.py, de-risked on the existing #2345 swarm against UncheatableEval, HumanEval, MMLU, GPQA, and David's PPL sets. Must-have: a checked-in mixture file that beats proportional mixing both over all sources and over a hand-picked high-quality subset. Stretch: the mixture fully locked by mid-June.
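For the must-have, the baseline being beaten is token-proportional sampling; a sketch of that baseline for concreteness (source names and counts are made up, not the real contents of datakit/sources.py):

```python
def proportional_weights(token_counts: dict) -> dict:
    # Baseline mixture: sample each source in proportion to its token
    # count. The checked-in mixture file must beat this on evals.
    total = sum(token_counts.values())
    return {source: n / total for source, n in token_counts.items()}

counts = {"web": 800e9, "code": 150e9, "papers": 50e9}  # hypothetical
baseline = proportional_weights(counts)
```

The swarm's job is to find a reweighting of these sources whose eval scores dominate the baseline's.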
Get dedup parameters and contamination detection (p0) ready in time for #5359's May 15 launch; quality scores are p1 (related to #5200); domain tagging and embeddings are stretch.
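Contamination detection presumably follows the usual n-gram overlap recipe (13-grams are a common convention; the tokenization and threshold here are assumptions):

```python
def ngram_set(tokens, n=13):
    # All length-n token windows in a document.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(doc_tokens, eval_ngrams, n=13, threshold=1):
    # Flag a training doc that shares >= threshold n-grams with the
    # union of eval-set n-grams.
    return len(ngram_set(doc_tokens, n) & eval_ngrams) >= threshold
```

Pinning n and the threshold (plus the dedup parameters) is exactly the p0 work this epic has to finish before the May 15 launch.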
Build an aggregate PPL pool that correlates positively (p < 0.05) with post-trained performance, validated on the Qwen 2.5 base → coder models. The output names the evals we're weak on and drives the data-mix decisions feeding #5359 and #5358.
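The significance check is cheap to run even without scipy; a self-contained sketch with invented per-checkpoint numbers, using a permutation test in place of the usual t-test (note raw PPL anticorrelates with score here — the sign convention depends on how the pool is aggregated):

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def permutation_p(xs, ys, trials=10_000, seed=0):
    # Two-sided permutation test: how often does shuffled data produce
    # a correlation at least as strong as the observed one?
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    ys = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(ys)
        if abs(pearson_r(xs, ys)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# Hypothetical: aggregate PPL vs. post-trained pass rate per checkpoint.
ppl = [8.1, 7.4, 6.9, 6.2, 5.8, 5.5]
score = [0.12, 0.18, 0.22, 0.29, 0.33, 0.36]
```

With only a handful of base → post-trained pairs, the permutation test is also honest about how little power the validation has.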
Selected evals — MMLU SL Verb 5-shot and HumanEval 5-shot on the 1e22 MoE on v5p-8 — run on Iris behind vLLM with an OpenAI-compatible HTTP proxy and are preemption-resilient: eval jobs no longer blow up when a worker gets evicted. Mega-Evals #2663 is the p1 follow-up.
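On the client side, preemption resilience boils down to retry-with-backoff around each request; a generic sketch (the error types and the way the proxy surfaces an evicted backend are assumptions):

```python
import time

def with_retries(request_fn, max_attempts=5, base_delay=1.0,
                 retryable=(ConnectionError, TimeoutError)):
    # Re-issue a request when the backing vLLM worker is evicted
    # mid-eval, backing off exponentially between attempts.
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface to the eval harness
            time.sleep(base_delay * 2 ** attempt)
```

The wrapper is agnostic to whether the failure is a dropped HTTP connection or the proxy reporting an evicted worker, as long as the proxy maps both onto retryable error types.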
Train the June 16B-A2B MoE for ~1k steps on 2+ GPU hosts. The June model is the first one we're not committing to TPU-only.
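Assuming a PyTorch-based GPU path, the two-host smoke run is a standard multi-node launch; everything below (script name, config path, rendezvous endpoint) is a placeholder:

```shell
# Run the same command on both hosts; the c10d rendezvous at
# host0:29500 assigns node ranks automatically.
torchrun \
  --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=host0:29500 \
  train_moe.py --config configs/june_16b_a2b.yaml --max-steps 1000
```

A ~1k-step run is enough to shake out NCCL, checkpointing, and dataloader behavior across hosts without committing real compute.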
Hit Nemotron ±ε MFU on H100s. The Triton fwd+bwd path from #5330 (1.91× XLA on H100×8) is the starting line; SonicMoE-style local-compute work in #5328 is the open back half of the stack.
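For reference, the MFU bookkeeping under the standard ~6·N FLOPs-per-token approximation (989 TFLOP/s is H100 SXM dense BF16 peak; the throughput and parameter numbers below are invented):

```python
def mfu(tokens_per_sec, active_params, n_gpus, peak_flops=989e12):
    # Achieved model FLOP/s over aggregate hardware peak. 6 * N_active
    # per token approximates forward + backward; for an MoE, only the
    # active-expert parameters count.
    achieved = 6 * active_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops)

# e.g. a 2B-active model at 330k tok/s aggregate on 8x H100:
u = mfu(tokens_per_sec=330_000, active_params=2e9, n_gpus=8)
```

Framing the target this way makes "Nemotron ±ε" checkable from throughput logs alone, without profiling.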
Catch-all for the things that make daily work less painful: unified queries over operational data so agents can use it, a zero-trust proxy, GH ↔ Iris integration, and more consistent logging. This is where the leftover canary-stability and post-Ray-sunset polish lands.