April 2026 milestone retrospective

Milestone: Kick-off pre-trained 100B-A13B 1.2T token MoE (preregistered)
Scope: weeks of 2026-03-30 through 2026-05-02 (5 weekly summaries)

Summary

8 completed, 4 partial, 0 missed of 12 epics

The April milestone was a kick-off: launch the preregistered 1e23 MoE with predictions filed, ship the surrounding platform, and set up May. On those terms it largely landed: Ray sunset complete, canonical pipeline shipped, observability posture materially improved, agent-driven experimentation producing real architectural wins, synthetic data delivering downstream signal, and the 1e23 run launched with a measurable trajectory against its preregistered target (the run continues in May via #4697). Technical follow-ups (H100 MFU, mixture optimization, perplexity gaps, eval inference) carry over to May with concrete sub-issues. Canary stability got close to target mid-month, then slipped at month-end on JAX 0.9.2 and bucket-consolidation fallout; both regressions are diagnosed with fixes in flight, but canary stability is no longer headlined as a May epic. Marin-as-a-library closed without an external consumer demonstrating the contract; it is worth checking whether that is a deliberate scope decision or a quiet drop.

Weekly summaries

Rollup

TPU chip-hours: 3,179,004 (across 5 weeks)
Theoretical peak FLOPs: 5.88e24 (BF16 peak × chip-hours)
PRs merged: 545 (158 opened)
Issues closed: 327 (255 opened)
Comments: 1,765 (non-bot)
Active contributors: 30 (distinct PR / issue authors)
Tokens added: 4.53T (to 18.09T total)
New datasets: 66 (registered in datakit)

TPU breakdown by type: v5p 1,758,802, v6e 714,623, v4 389,586, v5e 315,993 chip-hours.
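
The theoretical-peak figure in the rollup can be sanity-checked from the per-type breakdown. The sketch below is an illustration, not the dashboard's actual computation: the chip-hours come from this report, while the per-chip BF16 peak rates are an assumption taken from publicly listed Cloud TPU specifications.

```python
# Chip-hours per TPU generation, copied from the breakdown in this report.
CHIP_HOURS = {"v5p": 1_758_802, "v6e": 714_623, "v4": 389_586, "v5e": 315_993}

# ASSUMPTION: per-chip dense BF16 peak in TFLOP/s, from public Cloud TPU specs.
# The report itself does not state which peak values it used.
PEAK_BF16_TFLOPS = {"v5p": 459, "v6e": 918, "v4": 275, "v5e": 197}

SECONDS_PER_HOUR = 3600


def theoretical_peak_flops(chip_hours, peak_tflops):
    """Sum of (chip-hours x seconds per hour x peak FLOP/s) over TPU types."""
    return sum(
        hours * SECONDS_PER_HOUR * peak_tflops[chip] * 1e12
        for chip, hours in chip_hours.items()
    )


total_chip_hours = sum(CHIP_HOURS.values())
total_flops = theoretical_peak_flops(CHIP_HOURS, PEAK_BF16_TFLOPS)

print(f"{total_chip_hours:,} chip-hours")
print(f"{total_flops:.2e} theoretical peak FLOPs")
```

Under these assumed peak rates, the chip-hours sum to the reported 3,179,004 and the product rounds to the reported 5.88e24. This is an upper bound on available compute, not a measure of achieved utilization.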

Milestone epic grades

Completed (8 epics)

#4268 On-demand & reserved capacity [CLOSED]

Closed early in the cycle — capacity plumbing in place.

#4269 Single way of running jobs — off Ray completely [OPEN]

#5028, #5031, #5076, #5087, #5089, #5131, #5132, #5137, #5138, #5140 retired Ray end-to-end. Post-sunset shakeout (log-service split, coscheduling preemption fixes) absorbed the last week. Ray is dead.

#4271 Marin-as-a-library (Bolinas can import marin) [CLOSED]

Closed Apr 15. Wheels publish nightly; #5156 pruned ~2.9k lines of surface area. No external Bolinas-style consumer was demonstrated against the contract; closure reflects "platform side ready," not "consumer proven." Not on the May milestone.

#4272 Canonical pipeline (download → norm → dedup/quality → tokenize) [CLOSED]

End-to-end testbed shipped: ferry → tokenize → train arm off a single CLI, 102-dataset registry pinned by revision, datakit-smoke validator running alongside the daily ferry. Source coverage and Zephyr perf push (#5282: 17–26× test speedup) continue under May's #5360.

#4273 Improve Usability & Observability [OPEN]

Major surface-area shift: log/stats service split landed end-to-end (#5212, #5290, #5370), endpoint proxy with IAP fix, BackgroundTracker for W&B resilience, RPC stats redesign earlier in the month. Materially better observability posture than on April 1.

#4281 MoE Scaling up to April goal [OPEN]

The milestone name is Kick-off — the April scope was to launch the preregistered 1e23 MoE run with predictions filed, not to finish it. #4697 launched Apr 11 against the 2.25 paloma macro target from the isoflop fit at #4447, reached ~487B tokens at macro 2.4986 / uncheatable 2.169 by month-end, and the recurring crash signature was bisected to wandb.init(resume="allow") creating TPU launch-id divergence. The run itself (#4697) is now on the May milestone where it will finish.
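
As a rough check on the "1e23" label and on how far the run had progressed at month-end, the arithmetic can be sketched as below. This is an illustration under stated assumptions, not the project's own accounting: it reads "100B-A13B" as 13B active parameters, treats 1.2T tokens as the planned budget, uses the common 6 × N × D training-FLOPs approximation, and assumes the macro metric is loss-like (lower is better).

```python
# Figures taken from this report.
ACTIVE_PARAMS = 13e9      # ASSUMPTION: "A13B" means 13B active parameters
PLANNED_TOKENS = 1.2e12   # ASSUMPTION: "1.2T token" is the planned token budget
TOKENS_SO_FAR = 487e9     # approximate tokens reached by month-end
TARGET_MACRO = 2.25       # preregistered paloma macro target
CURRENT_MACRO = 2.4986    # paloma macro at month-end


def approx_train_flops(active_params, tokens):
    """Common 6*N*D approximation, counting only active parameters (assumption)."""
    return 6 * active_params * tokens


budget_flops = approx_train_flops(ACTIVE_PARAMS, PLANNED_TOKENS)
spent_flops = approx_train_flops(ACTIVE_PARAMS, TOKENS_SO_FAR)
token_fraction = TOKENS_SO_FAR / PLANNED_TOKENS
gap_to_target = CURRENT_MACRO - TARGET_MACRO

print(f"planned compute  ~{budget_flops:.2e} FLOPs")
print(f"spent so far     ~{spent_flops:.2e} FLOPs")
print(f"token progress   {token_fraction:.1%}")
print(f"gap to target    {gap_to_target:.4f}")
```

Under these assumptions the planned budget comes to about 9.4e22 FLOPs, consistent with the "1e23" shorthand, and the run was roughly 41% of the way through its tokens with about 0.25 of macro still to close.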

#4282 Agentify experimentation [OPEN]

Real signal, not just process: agent-driven ablation sweep produced AdamH-on-embed #5184 as the first cleanly Gate-2-promotable architecture change, closing the embed-norm-growth investigation #4569. MCP babysitter #5042 shipped; lightweight design-doc workflow #5210 established.

#3192 Synthetic data (research + critical path for post-training) [OPEN]

Real downstream signal achieved: SWE-ZERO went from "the pipeline works" to a measurable curve — 0% → 3.3% → 4.0% → 5.3% SWE-bench Verified at 10K/50K/100K trajectories on Marin-8B (#4898). 140B-token scale-out at ~52% PR coverage. The Marin-32B agentic-prior gap surfaced via TerminalCorpus midtrain #4760 is a within-epic finding, not a missed deliverable.

Partial (4 epics)

#4270 Canary pass rate to 90%+ [OPEN]

Got close to target mid-month (TPU 100%, GPU 87%, datakit 86% in week of 4/20). Slipped at month-end: GPU dropped to 60% on the JAX 0.9.2 NCCL break #5377, datakit to 71% from bucket-consolidation fallout #5376. Both are diagnosed with fixes in flight. Not headlined as a May epic — folded implicitly into infra tune-up #5369.
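
The distance from the 90% target at each point can be tabulated with a few lines. This is a minimal sketch using only the pass rates quoted above; the snapshot labels and the helper name are hypothetical, and the month-end TPU rate is left out because the report does not give it.

```python
TARGET_PCT = 90  # epic target: canary pass rate of 90%+

# Pass rates in percent, as quoted in this report.
SNAPSHOTS = {
    "week of 4/20": {"TPU": 100, "GPU": 87, "datakit": 86},
    "month-end": {"GPU": 60, "datakit": 71},  # TPU month-end rate not reported
}


def shortfall(rates, target=TARGET_PCT):
    """Return {canary: percentage points below target} for canaries under target."""
    return {name: target - rate for name, rate in rates.items() if rate < target}


for label, rates in SNAPSHOTS.items():
    print(label, shortfall(rates))
```

Mid-month, GPU and datakit were 3 and 4 points short of target; at month-end the shortfalls widened to 30 and 19 points.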

#4283 MoE MFU at scale [OPEN]

Kernel-level wins: Triton ragged_dot landed (#4297), JAX 0.9.2 unblocked Triton fwd+bwd path measuring 1.91× XLA on H100×8 (#5330). But the explicit 8×H100 end-to-end and Megatron head-to-head sub-issues #4302, #4311, #4312, #4313 never produced a measurement. Carried into May as #5356 and #5357.

#4474 Levanter Store, K8s Logging, and Infrastructure Improvements [OPEN]

Newer epic that opened mid-cycle. Neville's consolidate_shard_caches replacement #4814 is still being stress-tested against real loads. BackgroundTracker resilience #5332 and era-shuffle deprecation #5246 shipped. Fleet-observability proposal #5198 opened. Continuing into May without a renamed epic.

#3100 Data sources for pre-training / mid-training [OPEN]

Diagnosis machinery landed: gap-report tooling #4962, "mineshaft gap" sweep against Qwen3+Llama, confidence-portfolio epic #5005. New tranches ingested (HPLT v3.0 #4326, davinci-dev, SVG corpus). Mixture optimization unfinished — Calvin's anchor regression doesn't convert Uncheatable advantage to benchmark scores. May target #5359 formalized.