A week dominated by Iris infrastructure hardening and a major new Pallas fused cross-entropy kernel, alongside foundational SFT improvements.
@rjpower drove a sustained push to stabilize and refine Iris, the project's cluster orchestration layer. The autoscaler was rewritten with a correct demand algorithm that reasons about individual tasks and scaling groups (#2653), and zombie TPU slices are now properly terminated via heartbeat-driven liveness checks (#2703). Job naming moved to a consistent filesystem-inspired /parent/child/task convention (#2645), the CLI was consolidated from four scattered modules into a unified interface (#2632), and raw time values were replaced throughout with type-safe Timestamp/Duration/Deadline primitives to prevent unit-confusion bugs (#2599). Stale Docker containers are now cleaned up during heartbeat reconciliation (#2671), the threading model was cleaned up with a global registry (#2620), and environment propagation was fixed for both local mode and child jobs (#2678). Additional polish included server-side pagination and filtering for the jobs list (#2676), automatic SSH tunneling for RPC operations (#2609), a simplified log viewer (#2610), and dashboard fixes for job status display (#2615). The Fray integration was cleaned up by removing FRAY_CLIENT_SPEC in favor of auto-detection (#2605), and multiple rounds of Zephyr integration fixes landed (#2675, #2616, #2698).
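The Iris PRs themselves aren't reproduced here, but the idea behind the type-safe time primitives in #2599 is worth a sketch: once seconds and milliseconds live behind distinct types, unit confusion becomes a type error rather than a runtime bug. The classes and methods below are illustrative, not Iris's actual API.

```python
# Illustrative sketch of type-safe time primitives (not Iris's actual API).
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Duration:
    """A span of time, always stored in seconds."""
    seconds: float

    @classmethod
    def from_ms(cls, ms: float) -> "Duration":
        return cls(ms / 1000.0)


@dataclass(frozen=True)
class Timestamp:
    """An absolute point in time (epoch seconds)."""
    epoch_seconds: float

    @classmethod
    def now(cls) -> "Timestamp":
        return cls(time.time())

    def __add__(self, d: Duration) -> "Timestamp":
        return Timestamp(self.epoch_seconds + d.seconds)


@dataclass(frozen=True)
class Deadline:
    """A timestamp by which something must happen."""
    at: Timestamp

    def expired(self) -> bool:
        return Timestamp.now().epoch_seconds >= self.at.epoch_seconds


# A heartbeat liveness check can now only be written in these terms;
# passing a bare 30000 (ms? s?) no longer typechecks.
deadline = Deadline(Timestamp.now() + Duration.from_ms(30_000))
if deadline.expired():
    ...  # e.g. treat the TPU slice as a zombie and terminate it
```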
@dlwh landed a new TPU Pallas kernel for fused softmax-cross-entropy that streams over the vocabulary dimension, keeping only a single block in VMEM at a time (#2521). The follow-up made this streaming kernel the default path, removed the legacy forward-plus-split-backward implementation, and added tuned block sizes for v4 and v5p hardware (#2637). This work directly enables larger-vocabulary models to train without running out of HBM on the cross-entropy step.
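The Pallas kernel itself is hardware-specific, but the streaming trick it relies on can be sketched in plain JAX: maintain a running max and sum-of-exponentials (online logsumexp) while walking the vocabulary one block at a time, so the full [batch, vocab] logit matrix never exists at once. Everything below (function name, block size, shapes) is illustrative, not the kernel's interface.

```python
# Plain-JAX sketch of the streaming idea behind #2521/#2637: compute softmax
# cross-entropy one vocabulary block at a time, keeping only a [batch, block]
# slab of logits live instead of the full [batch, vocab] array.
import jax
import jax.numpy as jnp


def streaming_cross_entropy(hidden, lm_head, targets, block_size=1024):
    """hidden: [B, D], lm_head: [D, V], targets: [B] int labels."""
    batch, _ = hidden.shape
    vocab = lm_head.shape[1]
    num_blocks = vocab // block_size  # assume V % block_size == 0 for brevity

    def body(i, carry):
        running_max, running_sumexp, target_logit = carry
        start = i * block_size
        block = jax.lax.dynamic_slice_in_dim(lm_head, start, block_size, axis=1)
        logits = hidden @ block  # [B, block]: the only logits in memory

        # Online logsumexp update (same recurrence as online softmax).
        block_max = logits.max(axis=-1)
        new_max = jnp.maximum(running_max, block_max)
        running_sumexp = running_sumexp * jnp.exp(running_max - new_max) + jnp.exp(
            logits - new_max[:, None]
        ).sum(axis=-1)

        # Pick out the target logit if it lives in this block.
        local = targets - start
        in_block = (local >= 0) & (local < block_size)
        picked = jnp.take_along_axis(
            logits, jnp.clip(local, 0, block_size - 1)[:, None], axis=-1
        )[:, 0]
        target_logit = jnp.where(in_block, picked, target_logit)
        return new_max, running_sumexp, target_logit

    init = (jnp.full((batch,), -jnp.inf), jnp.zeros((batch,)), jnp.zeros((batch,)))
    running_max, running_sumexp, target_logit = jax.lax.fori_loop(
        0, num_blocks, body, init
    )
    # Per-example loss: logsumexp(logits) - logit[target].
    return running_max + jnp.log(running_sumexp) - target_logit
```

The real kernel fuses the matmul, the logsumexp, and the backward pass inside VMEM; this sketch only shows why peak memory scales with the block size rather than the vocabulary size.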
@moojink improved SFT support for Marin, Llama 3.1, and Qwen 2.5/3 models by adding chat templates with {% generation %} tags, updating model configs, and adding gradient accumulation and tokenizer padding utilities (#2689). A train/val overlap bug was fixed by splitting before shuffling (#2700), and @Calvin-Xu fixed training resume from a final checkpoint (#2659).
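The overlap bug fixed in #2700 is easy to reproduce in any pipeline that shuffles before splitting: if the shuffle seed ever differs across workers or restarts, the train/val boundary moves and examples leak across it. A generic sketch of the safe ordering (illustrative, not the PR's code):

```python
# Generic sketch of the ordering fixed in #2700 (not the PR's actual code).
import random


def split_then_shuffle(examples, val_fraction=0.01, seed=0):
    # 1) Split on the *unshuffled* order, so the train/val boundary is fixed
    #    regardless of how, or how often, shuffling happens afterwards.
    n_val = int(len(examples) * val_fraction)
    val, train = examples[:n_val], examples[n_val:]

    # 2) Shuffle only within each split.
    random.Random(seed).shuffle(train)
    return train, val
```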
@teetone added support for overriding N_REPEAT in Evalchemy evaluations, allowing control over how many seeds are used for math benchmarks like AIME and AMC (#2584). @Helw150 fixed token stats computation (#2688) and made parallel steps resilient so that one child failing no longer kills the entire batch (#2621). @dlwh added a retry loop for dev TPU reacquisition (#2701) and ensured the watchdog dependency is installed for dev TPU watch mode (#2631).
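The failure-isolation fix in #2621 amounts to collecting per-child errors instead of failing fast. A minimal sketch of the pattern with concurrent.futures (names are illustrative, not the actual executor code):

```python
# Minimal sketch of the failure isolation in #2621 (names illustrative).
# Each child step runs to completion; one exception no longer cancels its
# siblings, and all failures are reported together at the end.
from concurrent.futures import ThreadPoolExecutor


def run_parallel_steps(steps):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(step) for step in steps]
        failures = []
        for fut in futures:
            try:
                fut.result()  # waits on this child; siblings keep running
            except Exception as exc:
                failures.append(exc)  # record instead of re-raising immediately
    if failures:
        raise RuntimeError(f"{len(failures)} of {len(steps)} steps failed: {failures}")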
Experiment activity was dominated by two large-scale campaigns: Kaiyue Wen's MuonHC optimizer baselines on a 128-chip v5litepod (up to 193B tokens per run at 1.2B parameters), and Calvin Xu's LR sweep of the 1.2B Attn-Gate architecture on v5p-32 (60B tokens each). Pranshu Chaturvedi continued OLMoE size sweeps at smaller scale, and Moo Jin Kim ran SFT fine-tuning on Qwen3-32B and Marin-8B.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Final Loss / Tokens | Status |
|---|---|---|---|---|---|---|---|
| MuonHC 1.2B baseline (lr=5e-3, 184K steps) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e21 | 98.3h | 25.8% | loss=2.483, 193B tokens | completed #1 #2 |
| MuonHC 1.2B baseline (lr=1e-2, 184K steps) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e21 | 98.0h | 25.8% | loss=2.532, 193B tokens | completed #1 |
| Attn-Gate 1.2B LR sweep (lr=x1, best) | @Calvin-Xu | v5p-32 (16 chips) | 5.2e20 | 46.9h | 43.8% | loss=2.321, 60B tokens | completed #1 #2 #3 #4 #5 #6 #7 |
| OLMoE-L LR sweep (seq4096, best lr) | @pc0618 | v5p-32 (16 chips) | 6.5e19 | 16.0h | 16.6% | loss=2.841, 14.7B tokens | completed #1 #2 #3 |
| OLMoE-1.7B LR sweep (seq4096) | @pc0618 | v5p-32 (16 chips) | 1.2e20 | 29.5h | 16.4% | loss=2.706, 12.6B tokens | crashed #1 |
| SFT Qwen3-32B on math30k | @moojink | v5p-64 (32 chips) | — | 15.2h | 51.7% | loss=0.0001, 9.8B tokens | completed #1 |
| SFT Qwen3-2.35B-A2.2B on math30k | @moojink | v5litepod-256 (128 chips) | — | 12.9h | 58.4% | loss=0.0003, 9.8B tokens | completed #1 |
| SFT long-context Marin-8B (lr=1e-4) | @moojink | v5p-128 (64 chips) | — | 11.2h | 58.8% | loss=0.949, 9.5B tokens | crashed #1 |