A massive Iris architecture overhaul — SQLite as canonical state, JWT auth, Vue 3 dashboard, and multi-VM CoreWeave support — alongside MoE isoflop sweeps, the MoE architecture advancing to iter_03 with sigmoid routing and AdamH, and Zephyr's shuffle rewrite enabling Nemotron-scale dedup.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
Iris underwent a deep architectural overhaul this week. @rjpower made SQLite the canonical state store #3408, replacing the prior in-memory state with a proper schema, migration system, and normalized tables for scaling groups and slices #3514. The controller now checkpoints its SQLite state to GCS for post-mortem analysis #3497. Auth was overhauled from per-RPC DB lookups to HMAC-SHA256 JWTs #3630, #3537. The dashboard was rewritten from Preact+HTM to Vue 3 with TypeScript, Rsbuild, and Tailwind v4 — 26 components covering all existing tabs plus a new task detail page #3511. Multi-VM CoreWeave support landed with JAX coordinator bootstrap for distributed training across VMs #3638, alongside namespace-qualified RBAC so multiple Iris instances on the same cluster no longer interfere #3703. Job lifecycle got new preemption and existing-job policies #3685, proper slice reaping on worker failure #3425, and ghost-slice prevention during scale-down #3571. The autoscaler saw deadlock fixes and rate-limit logging #3580, #3531, #3616. On the data processing side, @ravwojdyla rewrote Zephyr's shuffle to use Parquet instead of per-chunk pickle blobs #3482, #3656, added dynamic batch sizing for writers #3498, and made exact dedup work at Nemotron scale via single-pass hash-group-write #3442. Protobuf generation was moved from checked-in files to an auto-generating hatch build hook #3631. @dlwh batch-tiled the XLA fused cross-entropy path to handle long sequences without hitting the TPU int32 word-count limit #3533, tuned mixed-dtype block sizes #3452, and fixed GCS executor step-lock races under worker churn #3541. @rjpower also introduced nightshift — automated GitHub Actions workflows using Claude agents for overnight cleanup, dead code removal, and issue triage #3557, #3615.
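The JWT change is worth unpacking: rather than hitting the database on every RPC, the controller signs a token once and workers verify it statelessly with a shared secret. Below is a minimal sketch of HMAC-SHA256 (HS256) token signing and verification using only the Python standard library; the claim names, secret handling, and helper functions are illustrative assumptions, not Iris's actual auth code.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_token(claims: dict, secret: bytes) -> str:
    """Build an HS256 JWT: base64url(header).base64url(payload).base64url(signature)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"


def verify_token(token: str, secret: bytes) -> dict:
    """Verify the signature and expiry without any database lookup."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims


# Hypothetical usage: the controller signs once, workers verify per RPC.
token = sign_token({"sub": "worker-7", "exp": time.time() + 3600}, b"shared-secret")
print(verify_token(token, b"shared-secret")["sub"])
```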
Building on last week's initial isoflop sweep, @ClassicLarry completed full results for the 15-config isoflop grid across three FLOP budgets (1e18, 3e18, 1e19) with five model widths each #2167. The sweep confirmed d768 as the optimal width at 1e18 and 3e18, with d1024 competitive at 3e18. Parallel EKN scaling experiments #3182 explored expert count and granularity — 128 experts with K=2 achieved the best BPB (1.0565) but at significantly lower MFU (14.9%) than the K=4/E=32 baseline. Aux-loss-free balancing consistently outperformed traditional load balancing loss. @Helw150 pushed the architecture to iter_03 with sigmoid routing (replacing softmax), independent per-expert gating, and an AdamH optimizer sweep on the updated recipe. @WhenWen ran AdamH comparisons showing ~0.009–0.01 BPB improvements over the v02 baseline at smaller dimensions, and tested Gated Norm from recent literature on top of AdamH. A new sub-issue tracks testing the architecture at 1e21 and 1e22 FLOP scales #3800 to validate routing stability at production scale. On the GPU side, @chloechiaw is evaluating Tokamax's Pallas Triton ragged_dot kernel as a JAX-native alternative to custom GPU MoE kernels #2828. @yonromai consolidated all canaries into a single Grug MoE entry point running through Iris #3587, with fused CE autotune fallback fixes #3605.
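For readers unfamiliar with the routing change: softmax routing normalizes gate scores across experts so they compete for a fixed probability mass, while sigmoid routing gives each expert an independent gate in (0, 1). A minimal JAX sketch of the sigmoid-routing idea follows; the shapes, top-k selection, and renormalization convention here are assumptions for illustration, not the iter_03 implementation.

```python
import jax
import jax.numpy as jnp


def sigmoid_route(x, w_router, k=4):
    """Pick top-k experts per token with independent sigmoid gates.

    x:        [tokens, d_model] token activations
    w_router: [d_model, n_experts] router projection
    Returns (expert_ids [tokens, k], gate_weights [tokens, k]).
    """
    logits = x @ w_router                   # [tokens, n_experts]
    gates = jax.nn.sigmoid(logits)          # independent per-expert gate in (0, 1)
    weights, ids = jax.lax.top_k(gates, k)  # keep the k strongest gates per token
    # One common convention: renormalize the kept gates so they sum to 1 per token.
    weights = weights / jnp.sum(weights, axis=-1, keepdims=True)
    return ids, weights


# Hypothetical usage on random data.
x = jax.random.normal(jax.random.PRNGKey(0), (8, 768))
w = jax.random.normal(jax.random.PRNGKey(1), (768, 32)) * 0.02
ids, weights = sigmoid_route(x, w, k=4)
print(ids.shape, weights.shape)  # (8, 4) (8, 4)
```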
Summary: We will need 20T of high-quality tokens (code in particular) for our large MoE runs in Q2/Q3; this epic tracks the work we will do in March to enable that.
Following last week's Nemotron-CC tokenization milestone, @ravwojdyla completed large-scale fuzzy dedup of Nemotron splits using the rewritten group-by pipeline #2829, validating that the Parquet-based Zephyr shuffle handles production-scale data without the file-count blowup that plagued the pickle approach. The exact dedup rewrite #3442 replaces the two-pass dup-map strategy with single-pass hash-group-annotate, outputting Vortex files directly. In a new embedding evaluation effort #3535, @ravwojdyla tested Luxical-One embeddings as a general-purpose representation for data curation. Quality filtering via linear probe scaled from Spearman 0.485 at N=125 to 0.75 at N=10K — promising but below the 0.8 go threshold. Luxical outperformed Arctic-L and BGE-large at 6× fewer dimensions. Topic clustering was a clear no-go across all models (best NMI 0.478). @gonzalobenegas continued DNA model work with separate uppercase/lowercase weight handling and a functional_pos experiment tracking functional vs nonfunctional log-likelihood across model sizes #3483.
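The single-pass idea is that hashing, grouping, and writing survivors all happen in one traversal of the corpus, with no separate duplicate map to build and then re-read. Below is a toy in-memory sketch of exact dedup in that style, assuming everything fits in one process; the real pipeline shards hash groups across Zephyr workers and writes Vortex files rather than yielding strings.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Single-pass exact dedup: hash each document, keep the first one seen per hash.

    A toy stand-in for hash-group-write: the "group" here is just the set of hashes
    already seen, and surviving documents are emitted immediately in the same pass.
    """
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest in seen:
            continue      # exact duplicate of an earlier document; drop it
        seen.add(digest)
        yield doc         # written out in the same pass; no second dup-map pass


corpus = ["the cat", "a dog", "the cat", "a dog", "new text"]
print(list(exact_dedup(corpus)))  # ['the cat', 'a dog', 'new text']
```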
On the OpenThoughts4 front, @moojink ran follow-up experiments with smaller models — training Qwen3-1.7B-Base on data from Qwen3-4B and Qwen3-32B teachers, plus a rejection sampling variant where Qwen3-4B generates and Qwen3-32B verifies #2262. These experiments test whether the modest advantage of larger teachers observed with Llama3.1-8B holds across student scales. @natolambert flagged the need to test multiple teachers as the key takeaway, with rejection sampling results still pending.
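As a reference for the rejection sampling variant: the smaller teacher proposes candidate solutions and the larger model only accepts or rejects them, so the quality of the resulting SFT data comes from the verifier rather than the generator. A schematic sketch of that filtering loop; `generate` and `verify` are hypothetical stand-ins for the two model calls, not the experiment's actual code.

```python
from typing import Callable


def rejection_sample(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],  # e.g. Qwen3-4B: prompt -> n candidate solutions
    verify: Callable[[str, str], bool],         # e.g. Qwen3-32B: (prompt, solution) -> accept?
    samples_per_prompt: int = 4,
) -> list[tuple[str, str]]:
    """Keep only (prompt, solution) pairs that the verifier accepts."""
    accepted = []
    for prompt in prompts:
        for solution in generate(prompt, samples_per_prompt):
            if verify(prompt, solution):
                accepted.append((prompt, solution))
                break  # one accepted solution per prompt is enough for SFT data
    return accepted


# Toy usage with stand-in functions (the real setup calls the two models).
pairs = rejection_sample(
    ["1+1=?", "2+2=?"],
    generate=lambda p, n: [f"{p} answer {i}" for i in range(n)],
    verify=lambda p, s: s.endswith("0"),
)
print(pairs)  # only candidates the verifier accepted
```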
@dlwh added an HBM optimization guide #3595 and refactored the Pallas kernel skill with reference guides #3570. @dlwh-golem continued documentation alignment — contributing guides #3520, #3596, docs build guides #3551, #3607, eval entrypoint drift fixes #3549, and agent workflow skills #3663, #3662.
MoE scaling dominated the week: a 15-config isoflop sweep at 1e18–1e19 FLOPs settled on d768 as optimal, alongside long-run architecture comparisons on preemptible v5p-64 and a ~235B-parameter trial on v5p-256 whose first two attempts failed. Agentic SFT reproduced NemotronTerminal baselines on Terminal-Bench.
| Run | Owner | Hardware | FLOP Budget (MFU) | Wall Time | Results | Links |
|---|---|---|---|---|---|---|
| adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb (running) | Will Held | TPU v4 (512 chips) | 8.18e22 model, 2.65e23 HW (31% MFU) | 22.1d | BPB: 0.796 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v6-500e71 (crashed) | Will Held | TPU v4 (256 chips) | 7.23e21 model, 1.52e22 HW (48% MFU) | 2.5d | BPB: 0.950 | W&B |
| exp2262pt3k_100k_llama_3_2_1b_ot4_math_qwen3_32b_32768tokens-ad9dd3 | Moo Jin Kim | TPU v4 (128 chips) | 1.01e21 model, 5.01e21 HW (20% MFU) | 1.7d | BPB: 0.175 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v7-5f064e (crashed) | Will Held | TPU v4 (256 chips) | 1.92e21 model, 3.84e21 HW (50% MFU) | 15.7h | BPB: 0.959 | W&B |
| exp2262pt3l_240k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-1b291f | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.143 | W&B |
| exp2262pt3i_240k_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768tokens-ec97cb | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.3d | BPB: 0.126 | W&B |
| exp2262pt3i_240k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-2ff561 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.139 | W&B |
| exp2262pt3i_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-eeb1d6 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.181 | W&B |
| exp2262pt3h_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-007999 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.085 | W&B |
| exp2262pt3h_240k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-29bb5e | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.061 | W&B |
| exp2262pt3l_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-5eb7b0 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.167 | W&B |
| exp2262pt3h_240k_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tokens-985c01 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.3d | BPB: 0.055 | W&B |
| exp2262pt3l_240k_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thinkin-b8cff7 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.142 | W&B |
| exp2262pt3k_50k_llama_3_2_1b_ot4_math_qwen3_32b_32768tokens-d60dcb | Moo Jin Kim | TPU v5 (32 chips) | 5.03e20 model, 1.62e21 HW (31% MFU) | 1.3d | BPB: 0.195 | W&B |
| exp2262pt3l_100k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-427e5c (done) | Moo Jin Kim | TPU v4 (128 chips) | 5.41e20 model, 1.62e21 HW (33% MFU) | 12.9h | BPB: 0.157 | W&B |
| MoE 1e18 real-data recipe sweep (8 runs) | @dlwh | v5p-8 (4 chips) | 21–23.3% MFU | 6–9.5h per run | E=64/K=4. Best: h1024 loss=3.404 at 2.6B tokens (23.3% MFU). Long 30k-step: loss=3.433 at 4.2B tokens. | #3522 W&B W&B W&B |
| MoE isoflop sweep (15 configs × 3 budgets) | @ClassicLarry | v5p-8 (4 chips) | 17–28% MFU | ~1.2h (1e18) / ~3h (3e18) / ~9h (1e19) per config | Best at 1e18: d768 loss=3.406. Best at 3e18: d1536 loss=3.090. Best at 1e19: d1536 loss=2.931. | #3522 W&B |
| MoE nano scaling + weight decay sweep | @ClassicLarry | v5p-8 (4 chips) | 17.0–17.1% MFU | ~1.7h per wd run, ~2h (3e18 d768) | Best wd=0.08 loss=3.402. 3e18 d768: loss=3.110 at 2.5B tokens. | #3466 W&B |
| Architecture compare: es3r2 vs g15 vs sab4 vs es1/es2 | @dlwh | v5p-64 (32 chips), preemptible | es3r2: 23.6%, g15: 22.2%, sab4: 21.9%, es2: 22.5%, es1: 20.6% MFU | ~20min each (profiling runs) | es3r2 promoted as primary target shape. Throughput profiling at v5p-64 scale. | #3528 W&B W&B W&B W&B W&B |
| Grug MoE ~235B topk4 shared2x trial (failed) | @dlwh | v5p-256 (128 chips), preemptible | — | failed after ~30min (2 attempts) | 238B-param MoE, topk=4, 2× shared expert. Both attempts failed. | #3536 |
| NemotronTerminal-8B SFT (full corpus) | @AlienKevin | v5p-32 (16 chips) SFT, v5p-8 (4 chips) eval, preemptible | 40.9% MFU | 120h (SFT), ~1h (eval) | SFT loss=0.360. TB2.0: 13.5% (matches 13.0±2.2). TBLite: 16.0%, mean reward 0.180. | #3490 W&B W&B |
| Grug MoE canary (GPU + TPU) (running) | @yonromai | H100 (CoreWeave) + GCP TPU | 3.7% MFU (canary, not a perf target) | continuous | Consolidated all canaries to a single Grug MoE entry point via Iris. | #3505 #3587 |
| LLaMA-50M Muon speedrun (1× Chinchilla) | @redagavin | — | — | — | Muon optimizer at 1× Chinchilla scale. loss=4.061, 0.7B tokens. | #2185 W&B |
| GPU MoE EP benchmark (ragged all-to-all) | @chloechiaw | 8× H100 (CoreWeave) | — | benchmark | GPU MoE expert parallelism with ragged all-to-all kernel. | #3633 |
| JAX DeepEP vs Megatron head-to-head | @dlwh | 8× H100 (CoreWeave) | — | benchmark | JAX custom-call DeepEP benchmarked against Megatron baseline. | #3665 |