A major infrastructure push hardened Iris for multi-region and multi-cloud training, while Levanter's fused cross-entropy and attention kernels matured, Zephyr gained resilience to preemption, and the project's workflow orchestration was significantly rearchitected.
Levanter core improvements. @dlwh refactored the LM data path around a new GrugLmExample abstraction (#2727), replacing unnamed data structures with typed examples and attention masks. The Checkpointer API was simplified to take explicit state and step arguments (#2916), and evaluator infrastructure was migrated to a callback-driven TaggedEvaluator API (#2889). A pure compute_watch_stats helper was extracted for non-Trainer reuse (#2900). Deprecated shard_map imports were updated (#2919) and embedding initialization switched to init_scale (#2920).
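The motivation behind moving to typed examples can be illustrated with a small sketch. This is hypothetical code, not GrugLmExample's actual definition: the field names and the packed-segment mask helper below are assumptions used only to show the pattern of named, typed fields replacing anonymous tuples.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class LmExample:
    """Hypothetical typed LM example (illustrative; not GrugLmExample's real fields)."""
    tokens: np.ndarray       # [seq_len] int token ids
    loss_mask: np.ndarray    # [seq_len] bool, True where loss is computed
    segment_ids: np.ndarray  # [seq_len] int, distinguishes packed sequences

    def __post_init__(self):
        # Typed examples let us validate shape invariants in one place.
        assert self.tokens.shape == self.loss_mask.shape == self.segment_ids.shape

def causal_packed_mask(segment_ids: np.ndarray) -> np.ndarray:
    """Attention mask that is causal AND confined to each packed segment."""
    n = segment_ids.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    return causal & same_segment

ex = LmExample(
    tokens=np.array([1, 2, 3, 4]),
    loss_mask=np.array([True, True, True, False]),
    segment_ids=np.array([0, 0, 1, 1]),
)
mask = causal_packed_mask(ex.segment_ids)
```

The payoff over unnamed tuples is that call sites read `ex.loss_mask` rather than `ex[1]`, and invariants live on the type rather than being re-checked ad hoc.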
Zephyr resilience and tokenization. @ravwojdyla and @ravwojdyla-agent hardened Zephyr against preemption: pipelines now retry on coordinator death (#2928), worker pools recover from preemptible node loss (#2946), and exponential backoff no longer overflows on high attempt counts (#2868). The tokenize pipeline was refactored to extract helpers and tune batch configuration (#2934), gained a ThreadedBatchWriter for background writes (#2933), and temporary data now lives under UUID-namespaced paths with atomic renames (#2785).
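The backoff overflow fix (#2868) addresses a classic failure mode: computing `factor ** attempt` before clamping blows up at high attempt counts. A minimal sketch of the usual remedy, clamping the exponent first, is below; the function name and parameters are illustrative, not Zephyr's actual helpers.

```python
import math
import random

def backoff_delay(attempt: int, base: float = 0.5, factor: float = 2.0,
                  max_delay: float = 60.0) -> float:
    """Overflow-safe exponential backoff with full jitter (illustrative sketch).

    Clamping the exponent *before* exponentiation keeps factor**exp bounded
    even for attempt counts in the thousands; exponentiating first and then
    min()-ing can overflow to inf (or raise OverflowError) instead.
    """
    # Largest exponent that can still matter given max_delay.
    max_exp = math.log(max_delay / base, factor) if base < max_delay else 0.0
    exp = min(attempt, max_exp)
    delay = min(base * (factor ** exp), max_delay)
    return random.uniform(0.0, delay)  # full jitter spreads retries apart
```

With this shape, `backoff_delay(10_000)` stays within `[0, max_delay]` rather than overflowing.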
Workflow orchestration. @ravwojdyla landed the StepSpec + Artifact system (#2494), a significant rearchitecture of marin's workflow orchestration that replaces implicit magic with explicit, typed step definitions and artifact tracking. Temp bucket utilities were added to both marin core (#2882) and Zephyr (#2879) for consistent temporary storage management.
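The shape of an explicit step-and-artifact system can be sketched as follows. This is an illustrative pattern only: the class and field names are assumptions and need not match marin's actual StepSpec or Artifact types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    """A named, addressable step output (illustrative; not marin's real API)."""
    name: str
    path: str  # e.g. a storage URI where the producing step writes

@dataclass(frozen=True)
class StepSpec:
    """An explicit, typed step: declared inputs and outputs, no implicit magic."""
    name: str
    inputs: tuple[Artifact, ...]
    outputs: tuple[Artifact, ...]

def topo_order(steps: list[StepSpec]) -> list[StepSpec]:
    """Order steps so each runs after the steps that produce its inputs."""
    produced: set[str] = set()
    ordered: list[StepSpec] = []
    pending = list(steps)
    while pending:
        ready = [s for s in pending if all(a.name in produced for a in s.inputs)]
        if not ready:
            raise ValueError("cycle or missing producer among steps")
        for s in ready:
            ordered.append(s)
            produced.update(a.name for a in s.outputs)
            pending.remove(s)
    return ordered
```

Because dependencies are declared as typed artifacts rather than inferred, the scheduler can validate the graph (and reject cycles or missing producers) before anything runs.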
Evaluation. @AlienKevin added Harbor evaluator support (#2808), enabling agentic evaluation against 45+ benchmarks from the Harbor registry. A more reliable, unbiased combinatorial estimator for pass@k (#2493) was also merged.
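The standard unbiased combinatorial estimator for pass@k (presumably what #2493 implements, though the code isn't shown in this summary) computes, from n samples of which c pass, the exact probability that at least one of k samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples per problem, c: samples that passed, k <= n.
    Unlike naive (c/n)-based estimates, this is the exact probability that
    a size-k subset drawn without replacement contains at least one pass.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so every k-subset passes
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=2 samples and c=1 correct, pass@1 is exactly 0.5, whereas a biased per-sample estimate can drift once k > 1.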
Testing and CI. Test infrastructure saw broad cleanup: TPU test runtime was reduced (#2918), torch tests were fixed to use locked CPU wheels (#2897), gated HF tokenizer loads were removed from scaling-law tests (#2861), and flaky tests were stabilized (#2909). Claude Code workflows were hardened with timeouts and concurrency groups (#2855, #2816), and a policy for agent-generated GitHub activity was established (#2949).
A compute-heavy week dominated by scaling-law experiments and an SFT milestone. Will Held ran 53 isoflop scaling-law runs across 7 FLOPs budgets (1e18 to 3e19), mapping optimal model size as a function of compute and achieving a best loss of 2.949 at the 1.8e19 budget with a 2.2B-parameter model. The Tomol25 1B pre-training run completed its full 48k-step schedule at 54.8% MFU. Kevin Li completed a 37-hour Qwen3-8B agentic SFT run on OpenThoughts-Agent-v1 data using a v5p-32. Calvin Xu ran 35 data-mixture ablations testing two-phase StarCoder configurations at small scale.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Results | Status |
|---|---|---|---|---|---|---|---|
| #2499 Tomol25 1B pre-training | @Helw150 | v5p-8 (4 chips) | 5e19 | 39.6h | 54.8% | loss=3.830, 1B params, 48k steps | completed #1 |
| #2601 Agentic SFT: Qwen3-8B on OpenThoughts-Agent-v1 | @AlienKevin | v5p-32 (16 chips) | 3e19 | 36.9h | 42.4% | loss=0.025, Qwen3-8B SFT, seq_len=32768 | completed #1 |
| #2499 Isoflop scaling laws (3e19 FLOPs budget) | @Helw150 | v5p-8 (4 chips) | 3e19 | 10.2-11.1h per run | 46.5-48.8% | best loss=3.004 (d1024-L11), 4 model sizes from 160M to 1B, Nemotron LR schedule | completed #1 #2 #3 #4 |
| #2499 Isoflop scaling laws (1.8e19 FLOPs budget) | @Helw150 | v5p-8 (4 chips) | 1.8e19 | 6.4-9.8h per run | 44.0-48.9% | best loss=2.949 (d2176-L22), 14 model sizes from 160M to 2.5B | completed #1 |
| #2499 Isoflop scaling laws (1e19 FLOPs budget) | @Helw150 | v5p-8 (4 chips) | 1e19 | 4.1-6.0h per run | 40.4-45.0% | best loss=3.031, 11 model sizes | completed #1 |
| Data mixture sweep: two-phase StarCoder v4 | @Calvin-Xu | v5p-8 (4 chips) | 1e18 | 1.2-2.3h per run | 35.7-36.9% | best loss=2.288, 35 runs, 768-dim 10-layer Qwen3 arch, seq_len=2048 | completed #1 |
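The budgets in the table can be sanity-checked with the common C ≈ 6·N·D rule of thumb (training compute ≈ 6 × parameters × tokens). This is an approximation, not the exact accounting used in these runs:

```python
def tokens_for_budget(flops: float, n_params: float) -> float:
    """Approximate training tokens D for a FLOPs budget C, via C ~= 6*N*D."""
    return flops / (6.0 * n_params)

# Under this approximation, the 1.8e19-budget optimum at ~2.2B params
# implies roughly 1.8e19 / (6 * 2.2e9) ~= 1.4e9 training tokens.
d = tokens_for_budget(1.8e19, 2.2e9)
```

The same arithmetic explains why the 1e18 data-mixture sweep fits in a couple of hours per run: at fixed model size, tokens (and thus wall time) scale linearly with the FLOPs budget.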