This week saw a massive infrastructure push on Iris: the cluster orchestrator gained autoscaling, a revamped dashboard, controller-initiated heartbeats, chaos testing, and Zephyr GPU support. Alongside it came data pipeline refactoring, DNA experiment consolidation, RL model expansion, and early Fray v2 API work.
The bulk of the week's effort went into Iris, Marin's cluster orchestration layer, with @rjpower driving a sweeping set of changes across reliability, observability, and GPU support.
The largest training run this week was a 50-hour SFT of Qwen3-Base 7B on OpenThoughts4 math data (240k examples) by Moo Jin Kim, reaching 38.9% MFU on a v5p-32. Will Held launched a mega-sweep campaign for 2.5B Qwen3 configurations (33 runs across two batch-size/hidden-dim combos) as groundwork for MoE scaling decisions, plus a 1B Grugformer dense speedrun. Calvin Xu ran the most extensive data mixture experiments yet: 124 runs comparing single-phase epoch and no-epoch strategies.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| #2199 SFT: Qwen3-Base on OpenThoughts4 + math (240k examples) | @moojink | v5p-32 (16 chips) | 5e19 | 50.0h | 38.9% | loss=0.145, Qwen3-Base 7B SFT, 4682 steps, seq_len=32768, bs=256 | completed #1 |
| #2499 Grugformer dense 1B speedrun | @Helw150 | v5p-8 (4 chips) | 2.5e18 | 0.9h | 30.1% | loss=4.624, 1B Grugformer dense model speedrun | completed #1 |
| #2499 Mega-sweep: 2.5B Qwen3 (bs32, hid512) | @Helw150 | v5p-8 (4 chips) | 2.5e18 | 0.9-1.4h per run | 34.6-35.6% | 16 trials, loss~3.5-3.7, large-batch MoE hyperparameter search | completed #1 |
| #2499 Mega-sweep: 2.5B Qwen3 (bs16, hid256) | @Helw150 | v5p-8 (4 chips) | 2.5e18 | 1.3-2.2h per run | 11.6-16.3% | 16 trials, loss~4.1-4.3, smaller hidden dim search | completed #1 |
| Data mixture: single-phase epoch sweep | @Calvin-Xu | v5p-8 (4 chips) | 1e18 | 1.3-3.4h per run | 35.5-36.7% | 62 runs + baselines, loss range 0.48-3.3, epoch-based data mixing | completed #1 |
| Data mixture: single-phase no-epoch sweep | @Calvin-Xu | v5p-8 (4 chips) | 1e18 | 1.5-3.4h per run | 20.0-36.6% | 62 runs + baselines, loss range 2.96-3.4, no-epoch data mixing | completed #1 |
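
The MFU column reports model FLOPs utilization: the model FLOP/s a run actually achieved, divided by the aggregate peak FLOP/s of its hardware. The following is a minimal sketch of that ratio, not the project's own accounting. The function name `mfu`, its parameters, and the example inputs are all hypothetical, and the per-chip peak is a placeholder to be replaced with the accelerator's published figure.

```python
def mfu(total_model_flops: float, wall_time_s: float,
        num_chips: int, peak_flops_per_chip: float) -> float:
    """Return model FLOPs utilization as a fraction (0.5 means 50%)."""
    if wall_time_s <= 0 or num_chips <= 0 or peak_flops_per_chip <= 0:
        raise ValueError("wall time, chip count, and peak FLOP/s must be positive")
    achieved_flops_per_s = total_model_flops / wall_time_s
    peak_flops_per_s = num_chips * peak_flops_per_chip
    return achieved_flops_per_s / peak_flops_per_s


# Illustrative inputs only; these are NOT taken from any run in the table.
example = mfu(
    total_model_flops=7.2e17,    # hypothetical total model FLOPs for the run
    wall_time_s=3600.0,          # hypothetical one-hour run
    num_chips=4,                 # hypothetical chip count
    peak_flops_per_chip=1.0e14,  # placeholder peak, not a real hardware spec
)
print(f"MFU: {example:.1%}")
```

Whether wall time includes compilation, checkpointing, and evaluation pauses changes the result, so figures computed this way are only comparable when runs use the same convention.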