The 1e22 MoE run launched on v4-512 while capacity-factor and LR-decay ablations refined the recipe, and the Delphi 1e23 dense scaling ladder continued on v4-1024 at 30% MFU. Iris performance work cut controller lock hold time from 80ms to under 5ms and delivered 4x faster provider heartbeat loops via two-pass batching. Agentic SFT v2 runs fixed critical gradient-clipping and think-token issues, matching the released OT-Agent 32K benchmarks at 13% on SWE-bench.
moe-sharded-qb-gn-xsa emerged as the best iter-04 configuration, with a Vizier hyperparameter sweep running on us-east5-a #2167. On the Delphi dense scaling ladder #1337, the 1e23 run on v4-1024 continued at 30% MFU with Paloma macro BPB 0.79, and the 1e22 seed42 run finished at 47% MFU (macro BPB 0.84). @dlwh landed XLA-first Mamba-3 SISO and MIMO TPU kernels #3961 (+5,352 lines) with a sharding-safe ranked public API #4149, and @msclar merged the AdaMuon optimizer implementation #3300.

Epic summary: train a 50B MoE model on GPU hardware reliably, from data preparation through sustained multi-node training with automatic fault recovery. The epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
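For reference, BPB here and in the run table below is bits per byte: total cross-entropy converted from nats to bits and normalized by the UTF-8 byte count of the evaluated text ("macro" presumably averages per-domain BPB across Paloma's sources rather than pooling tokens). A minimal sketch of the conversion; the function name and the example numbers are illustrative, not taken from these runs:

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Summed cross-entropy (nats, over all predicted tokens) -> bits per UTF-8 byte."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Illustrative numbers: 5.5e8 nats of total loss over a 1e9-byte slice
# works out to ~0.79 BPB, the range reported for the 1e23 run.
print(bits_per_byte(5.5e8, 1_000_000_000))  # ~0.793
```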
Controller lock hold time in drain_dispatch_all dropped from 80ms to under 5ms #4222, two-pass heartbeat batching delivered 4x faster provider loops #4210, lightweight job-state polling reduced controller load #4209, the ORM query builder was replaced with raw SQL #4181, and log fetching was unified under FetchLogs with LIKE patterns #4202 (-598 lines). Checkpoint management got zstd compression and old-checkpoint pruning #4143. @ravwojdyla shipped the actor proxy service for external access to cluster actors #4126, refactored Zephyr chunking for improved shuffle scalability #3839 (+1,357/-539 lines), and allowed specifying coordinator resources #4095. User-defined counters (MapReduce-style per-job stats) were added #4085, with records_in/records_out counters for readers and writers #4189 and per-worker counter queries #4164. @yonromai fixed a TensorStore handle leak in cache-copy (~14 MiB/shard) #4198, fixed stale coordinators killing new workers on retry #4199, and added Slack alerts and Claude triage to the canary ferry #4158 #4177. @dlwh landed the region-aware executor on Iris #3824 (+1,075 lines). The integration test suite was redesigned #4009, and optional auth mode landed for gradual adoption #3937.
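The two-pass heartbeat change suggests the classic pattern behind such lock-hold reductions: hold the controller lock only long enough to swap out pending work, then do the expensive per-worker processing with the lock released. A minimal sketch of that pattern, not the actual Iris code (class and method names are illustrative):

```python
import threading

class HeartbeatBatcher:
    """Illustrative two-pass heartbeat handling: pass 1 swaps out the pending
    queue under the lock (cheap), pass 2 does per-worker processing off-lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []  # heartbeats appended by RPC handlers

    def record(self, heartbeat) -> None:
        with self._lock:
            self._pending.append(heartbeat)

    def drain_and_process(self) -> None:
        # Pass 1: take ownership of the batch while holding the lock briefly.
        with self._lock:
            batch, self._pending = self._pending, []
        # Pass 2: heavy bookkeeping and dispatch decisions run off-lock,
        # so concurrent record() calls are never blocked behind them.
        for hb in batch:
            self._apply(hb)

    def _apply(self, heartbeat) -> None:
        ...  # placeholder for per-worker state updates / dispatch decisions
```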
| Run | User | Hardware | Duration | Model FLOPs | HW FLOPs (MFU) | BPB |
|---|---|---|---|---|---|---|
| #1337 adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb | Will Held | TPU v4 (512 chips) | 22.0d | 8.16e22 | 2.64e23 (31%) | 0.796 |
| #1337 adamh-scaling-ladder-nemotron-optimal-1e+22-v5-seed42-deeff4 | Will Held | TPU v4 (256 chips) | 3.4d | 1.00e22 | 2.12e22 (47%) | 0.769 |
| exp3490b_sft_nemotron_terminal_corpus_full_qwen3_8b_32768tokens_-3da6c1 | Kevin Li | TPU v5 (32 chips) | 5.0d | 2.49e21 | 6.08e21 (41%) | — |
| moe-d2304-1e21 | Larry Dial | TPU v4 (128 chips) | 2.1d | 1.00e21 | 5.59e21 (18%) | 0.823 |
| exp2262pt3h_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-f3ec95 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.124 |
| exp2262pt3i_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-eeb1d6 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.181 |
| exp2262pt3h_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-007999 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.085 |
| exp2262pt3l_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-5eb7b0 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.167 |
| exp2262pt3i_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-84f4d3 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.243 |
| exp2262pt3l_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-77d4d8 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.209 |
| exp3897_sft_ota_131k_qwen3_8b_131072tokens_v5p256-f7d21a | Kevin Li | TPU v5 (128 chips) | 16.1h | 1.16e21 | 3.16e21 (37%) | — |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-seed42-e251d0 | Will Held | TPU v4 (64 chips) | 1.3d | 1.00e21 | 2.05e21 (49%) | 0.844 |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-seed62746-659a1b | Will Held | TPU v4 (64 chips) | 1.3d | 1.00e21 | 1.89e21 (53%) | 0.845 |
| isoflop-moe-v2-1e+20-d1536-bs128 | Larry Dial | TPU v4 (64 chips) | 12.5h | 1.00e20 | 5.73e20 (17%) | 0.918 |
| #2167 isoflop-moe-adamh-gatednorm-v5p64-r2-1e20-d1536-retry25 | Kaiyue Wen | TPU v5 (32 chips) | 11.7h | 9.36e19 | 5.13e20 (18%) | 0.913 |
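The FLOP columns appear consistent with the HW figure being chips × peak bf16 throughput × wall-clock time. A rough cross-check of the 1e23 row, assuming a TPU v4 peak of ~275 TFLOP/s bf16 per chip (an assumption, not stated in the report):

```python
# Rough cross-check of the FLOP-budget columns for the 1e23 scaling-ladder row.
V4_PEAK_BF16 = 275e12            # FLOP/s per chip (assumed peak)

chips, days = 512, 22.0          # from the table row
hw_flops = chips * V4_PEAK_BF16 * days * 86_400   # ~2.68e23 (table: 2.64e23)
model_flops = 8.16e22                              # from the table row
mfu = model_flops / hw_flops                       # ~0.30, close to the 31% shown
print(f"HW budget ≈ {hw_flops:.2e} FLOPs, MFU ≈ {mfu:.0%}")
```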