● #4268 On-demand & reserved capacityCLOSED
Closed early in the cycle — capacity plumbing in place.
Milestone: Kick-off pre-trained 100B-A13B 1.2T token MoE (preregistered)
Scope: weeks of 2026-03-30 through 2026-05-02 (5 weekly summaries)
● 8 completed ● 4 partial ● 0 missed of 12 epics
The April milestone was a kick-off — launch the preregistered 1e23 MoE with predictions filed, ship the surrounding platform, set up May. On those terms it largely landed: Ray sunset complete, canonical pipeline shipped, observability posture materially improved, agent-driven experimentation producing real architectural wins, synthetic data delivering downstream signal, and the 1e23 run launched with a measurable trajectory against its preregistered target (run continues on May via #4697). Technical follow-ups (H100 MFU, mixture optimization, perplexity gaps, eval inference) carry over to May with concrete sub-issues. Canary stability got close to target mid-month and slipped on JAX 0.9.2 fallout at month-end — diagnosed with fixes in flight, but no longer headlined as a May epic. Marin-as-a-library closed without an external consumer demonstrating the contract — worth checking whether that's a deliberate scope decision or a quiet drop.
TPU breakdown by type: v5p 1,758,802, v6e 714,623, v4 389,586, v5e 315,993 chip-hours.
Closed early in the cycle — capacity plumbing in place.
#5028, #5031, #5076, #5087, #5089, #5131, #5132, #5137, #5138, #5140 retired Ray end-to-end. Post-sunset shakeout (log-service split, coscheduling preemption fixes) absorbed the last week. Ray is dead.
Closed Apr 15. Wheels publish nightly; #5156 pruned ~2.9k lines of surface area. No external Bolinas-style consumer demonstrated against the contract; closure reflects "platform side ready," not "consumer proven." Not on May.
End-to-end testbed shipped: ferry → tokenize → train arm off a single CLI, 102-dataset registry pinned by revision, datakit-smoke validator running alongside the daily ferry. Source coverage and Zephyr perf push (#5282: 17–26× test speedup) continue under May's #5360.
Major surface-area shift: log/stats service split landed end-to-end (#5212, #5290, #5370), endpoint proxy with IAP fix, BackgroundTracker for W&B resilience, RPC stats redesign earlier in month. Materially better observability posture than April 1.
The milestone name is Kick-off — the April scope was to launch the preregistered 1e23 MoE run with predictions filed, not to finish it. #4697 launched Apr 11 against the 2.25 paloma macro target from the isoflop fit at #4447, reached ~487B tokens at macro 2.4986 / uncheatable 2.169 by month-end, and the recurring crash signature was bisected to wandb.init(resume="allow") creating TPU launch-id divergence. The run itself (#4697) is now on the May milestone where it will finish.
Real signal, not just process: agent-driven ablation sweep produced AdamH-on-embed #5184 as the first cleanly Gate-2-promotable architecture change, closing the embed-norm-growth investigation #4569. MCP babysitter #5042 shipped; lightweight design-doc workflow #5210 established.
Real downstream signal achieved: SWE-ZERO went from "the pipeline works" to a measurable curve — 0% → 3.3% → 4.0% → 5.3% SWE-bench Verified at 10K/50K/100K trajectories on Marin-8B (#4898). 140B-token scale-out at ~52% PR coverage. The Marin-32B agentic-prior gap surfaced via TerminalCorpus midtrain #4760 is a within-epic finding, not a missed deliverable.
Got close to target mid-month (TPU 100%, GPU 87%, datakit 86% in week of 4/20). Slipped at month-end: GPU dropped to 60% on the JAX 0.9.2 NCCL break #5377, datakit to 71% from bucket-consolidation fallout #5376. Both are diagnosed with fixes in flight. Not headlined as a May epic — folded implicitly into infra tune-up #5369.
Kernel-level wins: Triton ragged_dot landed (#4297), JAX 0.9.2 unblocked Triton fwd+bwd path measuring 1.91× XLA on H100×8 (#5330). But the explicit 8×H100 end-to-end and Megatron head-to-head sub-issues #4302, #4311, #4312, #4313 never produced a measurement. Carried into May as #5356 and #5357.
Newer epic that opened mid-cycle. Neville's consolidate_shard_caches replacement #4814 still being stress-tested against real loads. BackgroundTracker resilience #5332, era-shuffle deprecation #5246 shipped. Fleet-observability proposal #5198 opened. Continuing into May without a renamed epic.
Diagnosis machinery landed: gap-report tooling #4962, "mineshaft gap" sweep against Qwen3+Llama, confidence-portfolio epic #5005. New tranches ingested (HPLT v3.0 #4326, davinci-dev, SVG corpus). Mixture optimization unfinished — Calvin's anchor regression doesn't convert Uncheatable advantage to benchmark scores. May target #5359 formalized.