A massive Iris infrastructure push — controller checkpointing, reservation system, flexible device scheduling, and a complete logging overhaul — alongside MoE scaling law experiments by @ClassicLarry and continued kernel/optimizer work from @dlwh.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
Building on last week's reliability push, Iris saw its biggest week yet. @rjpower landed controller snapshot/checkpointing #3167 (so restarts no longer orphan running jobs), a reservation system for pre-provisioning worker capacity #3123, #3223, and flexible device variant scheduling #3254 that lets jobs specify multiple acceptable TPU types for cross-region placement. The autoscaler was reworked with token-bucket rate limiting for scale-down #3212 and reduced lock contention under high task counts #3356. A complete logging overhaul replaced GCS-based log reads with heartbeat-forwarded logs stored in SQLite on the controller #3244, #3301, #3283, #3296, #3325, fixing dropped logs on task completion and eliminating file descriptor exhaustion. MirrorFileSystem #3258 provides transparent cross-region file access, while CrossRegionGuardedFS #3162 blocks large cross-region reads. On the training side, @dlwh continued building on last week's Grug refactor with improved variant contract checks #3169, visual diff tooling for reviewing template-heavy Grug code #3127, a new modular_opt variant #3293, and MoE ring expert-parallel optimizations #3377, #3398 that closed the EP benchmark milestone #2710. Fused cross-entropy was stabilized: one production forward path #3125, miss-only autotune sweeps #3251, a backend-dispatched GMM API with GPU fallback #3256, and v4 vmem fallback stabilization #3354. @yonromai fixed Pallas GPU CE tracing on non-GB10 #3148, added NVIDIA weight tile limits #3160, and enabled S3 compilation caching #3195.
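For context on the scale-down change in #3212: a token bucket caps the average rate of an action while still allowing short bursts, so the autoscaler cannot tear down capacity faster than tokens refill. A minimal Python sketch of the idea, with illustrative class and parameter names rather than the actual Iris API:

```python
import time

class TokenBucket:
    """Illustrative token bucket: allow at most `rate` scale-down actions
    per second on average, with bursts capped at `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # rate limit hit; skip this scale-down pass

# e.g. release at most one worker every 30s on average, bursting up to 5 at once
scale_down_limiter = TokenBucket(rate=1 / 30, capacity=5)
```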
Following last week's initial MoE experiments on v4 and v5p, @ClassicLarry ran an extensive set of scaling law experiments — replicating expert-count sweeps from the TGL paper #3182 and progressing to full isoflop sweeps #2167. Key findings: DeepSeek-style aux-loss-free load balancing outperformed traditional LBL across configurations, and at high sparsity ratios (2:128) a 4x LBL coefficient boost showed only marginal benefit. The sweep converged on a baseline architecture — 64 routed experts, K=2, with aux-loss-free balancing (bias_rate=0.01) plus a 0.001 aux loss — now tracked as moe_iteration_01. @yonromai added a MoE canary ferry for daily TPU regression testing #3342 alongside canary diagnostics improvements: data loader stall monitoring #3346, always-on profiling with persistent artifacts #3299, MFU gating on trailing p50 #3279, and CW canary OOM fixes #3217. @dlwh set block shuffle as the default for new Grug runs #3371, dispatched Grug through Fray jobs to fix multinode training #3269, and added auto-detection of multinode TPUs to set replica counts and coscheduling #3233.
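For reference, the aux-loss-free scheme leaves the router's combine weights untouched and instead adds a per-expert bias used only for top-k selection, nudging that bias by a fixed step (the bias_rate) whenever an expert is over- or under-loaded. A minimal JAX sketch of that update, assuming a [tokens, experts] score matrix; this illustrates the technique, not the Marin/Grug implementation:

```python
import jax.numpy as jnp

def route_with_bias(scores, bias, k):
    """Pick top-k experts per token using bias-adjusted scores;
    the bias only affects selection, not the combine weights."""
    selected = jnp.argsort(scores + bias, axis=-1)[:, -k:]      # [tokens, k]
    weights = jnp.take_along_axis(scores, selected, axis=-1)    # unbiased combine weights
    return selected, weights

def update_bias(bias, selected, num_experts, bias_rate=0.01):
    """Aux-loss-free balancing: push the bias of overloaded experts
    down and underloaded experts up by a fixed step (bias_rate)."""
    load = jnp.bincount(selected.reshape(-1), length=num_experts)
    overloaded = load > load.mean()
    return jnp.where(overloaded, bias - bias_rate, bias + bias_rate)
```

In the baseline configuration above, a small (0.001) auxiliary load-balancing loss is kept alongside this bias update.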
Summary: We will need 20T of high-quality tokens (including, in particular, code) for our large MoE runs in Q2/Q3; this is the work we will do in March to enable that.
After last week's tokenization debugging, @ravwojdyla completed Nemotron-CC tokenization at scale — processing nearly 2 trillion tokens across all 7 quality tiers at ~150M tokens/sec across 512 workers #2829. The Luxical embedding experiment #3191 kicked off to evaluate frozen embeddings as general-purpose quality/topic classifiers, with Luxical's creator @lukemerrick offering usage guidance #3049. @Helw150 added Nemotron V2 data #3317. Zephyr saw Vortex upgraded to support GCS #3268, group_by enhancements including secondary sort and generator reducers #3250, #3247, and download reliability fixes #3324, #3142.
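For readers unfamiliar with the group_by pattern behind #3250 and #3247 (group records by a primary key, re-sort each group by a secondary key, then stream the group through a generator-style reducer), here is a plain-Python sketch; `group_by`, `newest`, and `crawl_records` are illustrative names, not Zephyr's API:

```python
from itertools import groupby
from operator import itemgetter

def group_by(records, key, secondary_key=None, reducer=None):
    """Group records by `key`, optionally sorting each group by
    `secondary_key`, then stream each group through a generator reducer."""
    records = sorted(records, key=itemgetter(key))
    for group_key, group in groupby(records, key=itemgetter(key)):
        group = list(group)
        if secondary_key is not None:
            group.sort(key=itemgetter(secondary_key))
        if reducer is None:
            yield group_key, group
        else:
            # Generator reducer: may emit zero or many outputs per group.
            yield from reducer(group_key, group)

# e.g. keep only the newest record per URL (hypothetical input data)
def newest(url, docs):
    yield docs[-1]

deduped = list(group_by(crawl_records, key="url",
                        secondary_key="timestamp", reducer=newest))
```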
Progress on the SFT front after last week's 0% resolve rate: @AlienKevin achieved 5/43 on the Rust subset of SWE-bench Multilingual by switching to Qwen2.5-Coder-32B-Instruct as the student model and using the exact SWE-smith hyperparameters #2956. A TRL sanity check on Modal confirmed the earlier Qwen3-8B repetition issues were not a Marin-specific bug. On OpenThoughts4, @moojink completed follow-up experiments with the larger Qwen3-235B-A22B teacher #2262 — surprisingly, it showed only a modest advantage over the 32B teacher for Llama3.1-8B-Instruct students, suggesting diminishing returns from teacher scale.
@gonzalobenegas added LLR-based variant effect prediction evaluation for DNA models #3144 and an EDA notebook on perplexity vs. downstream task performance #3333. @dlwh-golem aligned documentation across TPU cluster setup #3415, contributing hooks #3307, MkDocs commands #3271, and README paths #3235, and CI was improved to run PR checks on all target branches #3150.
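LLR-based variant effect prediction scores a variant by comparing the model's log-probability of the alternate allele against the reference allele at the variant position. A minimal sketch of that scoring, assuming per-position logits over the nucleotide vocabulary; it illustrates the general recipe rather than the code in #3144:

```python
import jax

def variant_llr(logits, pos, ref_id, alt_id):
    """Log-likelihood ratio of the alternate vs. reference allele at `pos`,
    given a DNA LM's per-position logits of shape [seq_len, vocab_size].
    Negative values mean the model prefers the reference allele."""
    log_probs = jax.nn.log_softmax(logits[pos])
    return log_probs[alt_id] - log_probs[ref_id]

# Hypothetical usage: score a SNP from the model's logits and a nucleotide vocab.
# llr = variant_llr(model_logits, pos=snp_position, ref_id=vocab["A"], alt_id=vocab["G"])
```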
The largest single run this week was Will Held's 10B-parameter AdamH scaling ladder at 1e22 FLOP budget on v4-256 (128 chips), reaching 47.5% MFU and processing 116B tokens over 59 hours before crashing — the first Marin run at this scale. MoE scaling law experiments intensified: ClassicLarry ran two complete TGL Phase 1 expert-count sweeps (2 to 256 experts, 8.3B tokens each) on v4-8 plus an EKN nano scaling sweep testing K and LBL coefficient interactions across ~20 configurations. The 256-expert configs consistently achieved the best loss (3.060-3.186). Moo Jin Kim completed a large OpenThoughts4 SFT run (Qwen3-1.7B with Qwen3-32B teacher, 100k steps on v4-128 at 34.7% MFU). David Hall began 32B-A4B MoE bring-up on v5p-64, running several short profiling attempts at 11-20% MFU that crashed but established baseline performance numbers. Parallel-attn-mlp experiments on v6e hardware investigated persistent NaN issues under restart.
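A note on reading the FLOP Budget column in the table below: the reported MFU matches the ratio of model FLOPs to hardware FLOPs for each run, e.g. for the first row:

```python
# MFU as the fraction of the hardware FLOP budget spent on model FLOPs.
model_flops, hw_flops = 8.18e22, 2.65e23   # first row of the table below
print(f"{model_flops / hw_flops:.0%}")      # ~31%, matching the reported MFU
```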
| Run | Owner | Hardware | FLOP Budget | Wall Time | Loss / Evals | Links |
|---|---|---|---|---|---|---|
| adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb (running) | Will Held | TPU v4 (512 chips) | 8.18e22 model / 2.65e23 HW (31% MFU) | 22.1d | BPB: 0.796 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v5-025b0e | Will Held | TPU v4 (256 chips) | 10.00e21 model / 1.99e22 HW (50% MFU) | 3.5d | BPB: 0.768 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v6-500e71 (crashed) | Will Held | TPU v4 (256 chips) | 7.23e21 model / 1.52e22 HW (48% MFU) | 2.5d | BPB: 0.950 | W&B |
| (done) exp2262pt3c_pt2_qwen3_1pt7b_base_ot4_240k_math_qwen3_4b_32768tok-a9bd48 | Moo Jin Kim | TPU v4 (128 chips) | 2.59e21 model / 7.71e21 HW (34% MFU) | 2.4d | BPB: 0.067 | W&B |
| (done) exp2262pt3d_pt2_qwen3_1pt7b_base_ot4_240k_math_qwen3_32b_32768to-58b647 | Moo Jin Kim | TPU v4 (128 chips) | 2.59e21 model / 7.46e21 HW (35% MFU) | 2.1d | BPB: 0.146 | W&B |
| AdamH scaling ladder 10B (Nemotron-optimal, 1e22 budget) (crashed) | @Helw150 | v4-256 (128 chips) | (47.5% MFU) | 59.0h | loss=2.702, 115.9B tokens | W&B |
| adamh-v6-scaling-ladder-nemotron-optimal-1e+23-a128a5 (crashed) | Will Held | TPU v4 (256 chips) | 1.25e21 model / 3.22e21 HW (39% MFU) | 15.8h | BPB: 1.190 | W&B |
| adamh-v6-scaling-ladder-nemotron-optimal-1e+22-81073a (crashed) | Will Held | TPU v4 (256 chips) | 1.07e21 model / 2.93e21 HW (36% MFU) | 16.1h | BPB: 0.948 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-019021 | Will Held | TPU v4 (64 chips) | 1.00e21 model / 1.90e21 HW (53% MFU) | 1.3d | BPB: 0.844 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v6-77f848 | Will Held | TPU v4 (64 chips) | 1.00e21 model / 1.89e21 HW (53% MFU) | 1.3d | BPB: 0.844 | W&B |
| (done) exp2262pt3i_100k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-a594bd | Moo Jin Kim | TPU v4 (128 chips) | 5.41e20 model / 1.56e21 HW (35% MFU) | 13.3h | BPB: 0.145 | W&B |
| (done) exp2262pt3i_100k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-092c35 | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 model / 1.50e21 HW (36% MFU) | 1.2d | BPB: 0.135 | W&B |
| (done) exp2262pt3h_100k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-0c0f70 | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 model / 1.50e21 HW (36% MFU) | 18.5h | BPB: 0.064 | W&B |
| exp2262pt3h_100k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-ce68a9 | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 model / 1.50e21 HW (36% MFU) | 1.2d | BPB: 0.078 | W&B |
| (done) exp2262pt3i_100k_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768tokens-227b3e | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 model / 1.50e21 HW (36% MFU) | 1.7d | BPB: 0.135 | W&B |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v8 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.002 | W&B |
| OpenThoughts4 SFT: Qwen3-1.7B base w/ Qwen3-32B teacher (100k steps) | @moojink | v4-128 (64 chips) | (34.7% MFU) | 13.3h | loss=0.166, 16.4B tokens | #2262 W&B |
| TGL Phase 1 MoE run3 (2 experts, v5p-16) (crashed) | @ClassicLarry | v5p-16 (8 chips) | (18.3% MFU) | 15.5h | loss=3.393, 20.7B tokens (crashed) | #3182 W&B |
| TGL Phase 1 MoE expert-count sweep (run5, 2-256 experts) | @ClassicLarry | v4-8 (4 chips) | (14-16% MFU) | 10-17h per run | loss=3.186 (256E) to 3.541 (2E), 8.3B tokens each, 8 configs | #3182 W&B W&B W&B |
| TGL Phase 1 MoE expert-count sweep (run2-v2, 2-256 experts) | @ClassicLarry | v4-8 (4 chips) | (14-16% MFU) | 10-14h per run | loss=3.060 (256E) to 3.423 (2E), 8.3B tokens each, 8 configs | #3182 W&B W&B W&B |
| EKN MoE nano scaling sweep (K=4/8, E=8-128, LBL ablation) | @ClassicLarry | v4-8 (4 chips) | (10-13% MFU) | 10-14h per run | loss=3.215-3.823, 6.6B tokens each, ~20 configs | #3182 W&B W&B |
| Grug MoE 32B-A4B v5p-64 perf bring-up (profiling) (crashed) | @dlwh | v5p-64 (32 chips) | (11-20% MFU) | 0.3-0.6h per attempt | profiling only, ~20M tokens per attempt | #3357 W&B W&B W&B |
| Parallel-attn-mlp v6e NaN investigation sweeps (crashed) | @dlwh | v6e-8 (4 chips) | (5-7% MFU) | 0.4-1.5h per run | loss=NaN investigation, 0.8-1.5B tokens (crashed) | #3316 W&B W&B W&B |