Week of March 30th summary for marin-community/marin

Milestone: Kick off a 32B-A4B 10T-token MoE training run, advance scaling-laws work, and get ~15T+ tokens ready
88 PRs merged · 24 opened · 36 issues closed · 16 contributors · 2 epics · 268 comments this week
W&B: 3.37e23 HW FLOPs this week (1.10e23 model FLOPs)

The 1e22 MoE run launched on v4-512 while capacity-factor and LR-decay ablations refined the recipe, and the Delphi 1e23 dense scaling ladder continued on v4-1024 at 30% MFU. Iris performance work cut controller lock hold time from 80ms to under 5ms and delivered 4x faster heartbeat batching. Agentic SFT v2 runs fixed critical gradient-clipping and think-token issues, reaching 13% SWE-bench, in line with the released OT-Agent 32K's 14%.

#3096 Pre-training: MoE Scaling Laws


2/6 sub-issues closed
@ClassicLarry launched the 1e22 MoE run #3800, a 34.6B-total / 5.1B-active model on v4-512 with capacity factor 1.0, running at 488k tok/s and 23% MFU (W&B). The capacity factor was set to 1.0 based on ablation results #4016 showing cap=1.0 is 8.3% faster than cap=1.25 with negligible loss impact at 1e20 scale. The v7 isoflop sweep #4225 mapped LR-schedule and decay interactions with AdamH across model dimensions, finding that removing linear decay at small step counts (d2048 @ 1e18) improved BPB by 0.03, which suggests the decay fraction needs to scale with training length. @Helw150 resolved the QB overhead problem #3972: the async CPU-overlap approach turned out to be a dead end (it had looked promising only due to a misleading benchmark), but sharded microbatch QB eliminated the 1.2 MFU-point overhead entirely. He also updated the MoE baseline #4084 and validated moe-sharded-qb-gn-xsa as the best iter-04 configuration, with a Vizier hyperparameter sweep running on us-east5-a #2167. On the Delphi dense scaling ladder #1337, the 1e23 run on v4-1024 continued at 30% MFU with Paloma macro BPB 0.79, and the 1e22 seed42 run finished at 47% MFU (macro BPB 0.84). @dlwh landed XLA-first Mamba-3 SISO and MIMO TPU kernels #3961 (+5,352 lines) with a sharding-safe ranked public API #4149, and @msclar merged the AdaMuon optimizer implementation #3300.
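The reported throughput and MFU are roughly self-consistent. Here is a back-of-envelope check, assuming the standard ~6·N_active FLOPs/token estimate for forward+backward and 275 bf16 TFLOP/s peak per TPU v4 chip (a v4-512 slice has 256 chips); these constants are assumptions, not marin's own accounting:

```python
# Sanity-check the 1e22 MoE run's reported 23% MFU from its throughput.
active_params = 5.1e9      # active parameters per token
tokens_per_s = 488e3       # reported throughput on v4-512
chips, peak = 256, 275e12  # chips in a v4-512; assumed per-chip bf16 peak FLOP/s

model_flops_per_s = 6 * active_params * tokens_per_s  # ~1.5e16 FLOP/s
mfu = model_flops_per_s / (chips * peak)
print(f"MFU ~= {mfu:.1%}")  # ~21%, close to the reported 23%
```

The small residual is expected, since the 6N rule ignores attention and router FLOPs.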
1 PR this week, 9 new comments, and 0 new issues (6 total)

#2836 Infrastructure: MoE Training Support


Summary: Train a 50B MoE model reliably on GPU hardware, from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks the infrastructure, data-pipeline, and training work needed to get there by March 31.

35/49 sub-issues closed
@rjpower drove a major Iris controller performance push: lock hold time in drain_dispatch_all dropped from 80ms to under 5ms #4222, two-pass heartbeat batching delivered 4x faster provider loops #4210, lightweight job-state polling reduced controller load #4209, the ORM query builder was replaced with raw SQL #4181, and log fetching was unified under FetchLogs with LIKE patterns #4202 (-598 lines). Checkpoint management gained zstd compression and old-checkpoint pruning #4143. @ravwojdyla shipped the actor proxy service for external access to cluster actors #4126, refactored Zephyr chunking for improved shuffle scalability #3839 (+1,357/-539 lines), and allowed specifying coordinator resources #4095. User-defined counters (MapReduce-style per-job stats) were added #4085, with records_in/records_out counters for readers and writers #4189 and per-worker counter queries #4164. @yonromai fixed a TensorStore handle leak in cache-copy (~14 MiB/shard) #4198, fixed stale coordinators killing new workers on retry #4199, and added Slack alerts and Claude triage to the canary ferry #4158 #4177. @dlwh landed the region-aware executor on Iris #3824 (+1,075 lines). The integration test suite was redesigned #4009, and optional auth mode landed for gradual adoption #3937.
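The lock-hold and heartbeat numbers follow a common concurrency pattern: take the lock only long enough to snapshot pending work, then do the slow per-item processing with the lock released. A minimal sketch of that two-pass shape; this is illustrative, not Iris's actual code, and Dispatcher/enqueue/_dispatch are hypothetical names:

```python
import threading

class Dispatcher:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []

    def enqueue(self, item):
        with self._lock:
            self._pending.append(item)

    def drain_dispatch_all(self):
        # Pass 1: O(1) swap under the lock, so hold time stays tiny.
        with self._lock:
            batch, self._pending = self._pending, []
        # Pass 2: slow per-item work (RPCs, DB writes) runs lock-free.
        for item in batch:
            self._dispatch(item)

    def _dispatch(self, item):
        ...  # hypothetical per-item RPC / DB write
```

Hold time then scales with the list swap rather than with per-item latency, which is consistent with the 80ms-to-under-5ms drop.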
0 PRs this week and 0 new issues (49 total)

Other Changes


@dlwh drafted a staged modeling experiment skill #4166 to make modeling experiments Grug-first with W&B view tagging and optional stage gates.
112 PRs this week, 205 new comments, and 36 issues closed (36 total)

External Contributions


Chloe Chia (San Francisco): 1 PR, 2 comments

2 comments on 2 threads
  • #4297 Add GPU Triton kernel for ragged_dot MoE grouped matmul
  • #2828 Port MoE training to GPU: kernel experiments and performance validation

eramis73: 2 PRs

Rohith Kuditipudi (CS PhD @ Stanford): 1 PR, 1 comment

1 comment on 1 thread
  • #4389 Identify a soft proxy for agentic benchmarks to support data-mixture studies

Gavin Yang (NY, USA; first-year CS PhD student at Northeastern University, B.S. in CS & DS from NYU): 1 PR, 1 comment

1 comment on 1 thread
  • #2185 speedrun submission: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale

Top 15 runs by FLOPs this week (completed, running, or crashed)


The largest active run is the Delphi 1e23 dense scaling ladder (W&B), a 25B-parameter model on v4-1024 now at 608B tokens with Paloma macro BPB 0.79 and train loss 2.08; @Helw150 noted the loss trend during cooldown is tracking between the pessimistic and optimistic forecasts #1337. The 1e22 MoE v7 run (W&B) launched mid-week on v4-512 at 23% MFU, with an estimated 7.7 days to completion #3800. Two finished v7 isoflop runs at 1e20 validated capacity factor 1.0 as 8.3% faster with near-identical loss #4016. On the post-training side, @AlienKevin's 32K v2 OT-Agent SFT (W&B) reached 13% SWE-bench, matching the released model's 14% #3896, while the 131K v2 run on v5p-256 (W&B) revealed that batch-size-scaled gradient clipping is needed at 131K context #3897. Three Qwen3-14B resilience SFT runs for math and medical domains completed on v5p-16 at 56-57% MFU.
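One plausible reading of the 131K finding is that a clip threshold tuned at shorter context clips relatively harder as the tokens contributing to each optimizer step grow. A minimal sketch of batch-size-scaled clipping in optax; this is an assumption about the shape of the fix, not the actual #3897 patch, and BASE_CLIP / BASE_TOKENS are illustrative constants:

```python
import optax

BASE_CLIP = 1.0             # clip norm tuned at the reference setting (hypothetical)
BASE_TOKENS = 256 * 32_768  # reference tokens per optimizer step (hypothetical)

def make_clipper(tokens_per_step: int) -> optax.GradientTransformation:
    # Scale the global-norm threshold with tokens per step, so a 131K-context
    # batch is not clipped against a constant tuned for 32K sequences.
    return optax.clip_by_global_norm(BASE_CLIP * tokens_per_step / BASE_TOKENS)
```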
| Run | User | Hardware | Time | Model FLOPs | HW FLOPs (MFU) | BPB |
|---|---|---|---|---|---|---|
| #1337 adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb | Will Held | TPU v4 (512 chips) | 22.0d | 8.16e22 | 2.64e23 (31%) | 0.796 |
| #1337 adamh-scaling-ladder-nemotron-optimal-1e+22-v5-seed42-deeff4 | Will Held | TPU v4 (256 chips) | 3.4d | 1.00e22 | 2.12e22 (47%) | 0.769 |
| exp3490b_sft_nemotron_terminal_corpus_full_qwen3_8b_32768tokens_-3da6c1 | Kevin Li | TPU v5 (32 chips) | 5.0d | 2.49e21 | 6.08e21 (41%) | — |
| moe-d2304-1e21 | Larry Dial | TPU v4 (128 chips) | 2.1d | 1.00e21 | 5.59e21 (18%) | 0.823 |
| exp2262pt3h_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-f3ec95 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.124 |
| exp2262pt3i_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-eeb1d6 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.181 |
| exp2262pt3h_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-007999 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.085 |
| exp2262pt3l_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-5eb7b0 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.167 |
| exp2262pt3i_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-84f4d3 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.243 |
| exp2262pt3l_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-77d4d8 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.209 |
| exp3897_sft_ota_131k_qwen3_8b_131072tokens_v5p256-f7d21a | Kevin Li | TPU v5 (128 chips) | 16.1h | 1.16e21 | 3.16e21 (37%) | — |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-seed42-e251d0 | Will Held | TPU v4 (64 chips) | 1.3d | 1.00e21 | 2.05e21 (49%) | 0.844 |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-seed62746-659a1b | Will Held | TPU v4 (64 chips) | 1.3d | 1.00e21 | 1.89e21 (53%) | 0.845 |
| isoflop-moe-v2-1e+20-d1536-bs128 | Larry Dial | TPU v4 (64 chips) | 12.5h | 1.00e20 | 5.73e20 (17%) | 0.918 |
| #2167 isoflop-moe-adamh-gatednorm-v5p64-r2-1e20-d1536-retry25 | Kaiyue Wen | TPU v5 (32 chips) | 11.7h | 9.36e19 | 5.13e20 (18%) | 0.913 |