Week of March 23rd summary for marin-community/marin

Milestone: Kick-off a 32B-A4B 10T token MoE training run & advance scaling laws work & get ~15T+ tokens ready
119 PRs merged · 59 PRs opened · 75 issues closed · 15 contributors · 4 epics · 282 comments this week
GCP TPU: 1.02e24 HW FLOPs (0 reserved) · W&B: 3.77e23 HW FLOPs (1.18e23 model FLOPs)
Data: 13.6T tokens · 66.1% synthetic · 67 datasets · 🤗 collection
web 7.7T (57.1%) · multilingual 3.9T (28.4%) · code 1.3T (9.3%) · math 377.1B (2.8%) · specialized 330.0B (2.4%)

The 1e22 MoE run launched on v4-512 while capacity-factor and LR-decay ablations refined the recipe, and the Delphi 1e23 dense scaling ladder continued on v4-1024 at 30% MFU. Iris performance work cut controller lock hold time from 80ms to under 5ms and delivered 4x faster heartbeat batching. Agentic SFT v2 runs fixed critical gradient clipping and think-token issues, matching released OT-Agent 32K benchmarks at 13% SWE-bench.

#3192 Synthetic Data & Post-training


0/4 sub-issues closed

The agentic SFT reproduction made significant progress this week. @AlienKevin identified the root causes of failures in the v1 32K run #3896: catastrophic overfitting (loss dropping to 0.00003) caused by max_grad_norm=1.0 instead of OT-Agent's 1e-4, and generation of <|start_think|> instead of native <think> tokens. The v2 32K SFT run with these fixes reached 13% on SWE-bench (matching the released 14%) and 7.9% on TB2 (matching the released ~8.1%), though TB-Lite averaged 12% vs. the released 18%. For 131K context #3897, the v2 run on v5p-256 with max_grad_norm=1e-4 regressed to 15% SWE-bench (from v1's 30%): with only 248 steps at batch=128, 1e-4 clipping was too aggressive, cutting gradients by ~1300x every step. A v2a run with sqrt-scaled hyperparameters (LR × sqrt(8), grad_norm × 8) showed much healthier loss curves before preemption. @AlienKevin also successfully reproduced NemotronTerminal #3490, reaching 15.9% on TB2 (vs. released 13.0%) and 29.0% on TB-Lite (vs. released 23.0%). @eramis73 merged DspyEvaluator improvements with ToonAdapter support #4213, and @taivu1998 drafted a long-context evaluation lane for the exp2062 long-context plan #4249.
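The v2a sqrt-scaling recipe can be written as a simple rule of thumb: when the effective batch grows by a factor k, scale the learning rate by sqrt(k) and the grad-clip threshold linearly (the global gradient norm grows with batch size, so a fixed 1e-4 clip becomes far too tight). A minimal sketch; the function name and the reference values below are hypothetical, not the run's actual hyperparameters.

```python
import math

def scale_hparams(ref_lr, ref_grad_clip, ref_batch, new_batch):
    """Heuristic batch-size scaling, matching the v2a recipe's shape:
    LR scales with sqrt(batch ratio), grad-clip threshold scales
    linearly (an assumption; the run's exact rule may differ)."""
    k = new_batch / ref_batch
    return ref_lr * math.sqrt(k), ref_grad_clip * k

# Hypothetical reference values, for illustration only:
# an 8x larger batch gives LR x sqrt(8) and grad_norm x 8.
lr, clip = scale_hparams(ref_lr=1e-5, ref_grad_clip=1e-4,
                         ref_batch=16, new_batch=128)
print(lr, clip)
```

With the numbers from the run, an unscaled 1e-4 threshold against gradients ~1300x larger means nearly all of each update is clipped away, which matches the observed regression.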

0 PRs this week, and 0 new issues (4 total)

#3100 Data Sources for Pre-training


Summary: We will need 20T high-quality tokens (in particular code) for our large MoE runs in Q2/Q3; this epic tracks the March work to enable that.

0/5 sub-issues closed

The synthetic reasoning bootstrap corpus #4148 expanded rapidly, with 24 comments this week from @dlwh adding deterministic generators for BFS shortest path, binary search, coin-change DP, connected components, Dijkstra, and more algorithmic reasoning domains, all producing step-by-step traces without LLM labels. @ravwojdyla shipped major data infrastructure work: fuzzy dedup was updated to scale to Nemotron #3750 (+465/-514 lines), and the new datakit tool consolidated and cleaned up data downloads #4142 (+878/-1,331 lines). FlatMixture was added for virtual dataset concatenation with a global shuffle, without re-tokenizing #4133. @yonromai fixed a TensorStore handle leak that was accumulating ~14 MiB per shard during cache-copy #4198, and a shared ts.Transaction for metadata consolidation cut serial copy time #4105. @ahmeda14960 added generation_config.json support for chat model checkpoints #4160 and shipped data browser quality-of-life updates #3954.
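The FlatMixture idea, viewing several tokenized datasets as one flat, globally shuffled dataset without copying or re-tokenizing anything, can be sketched as an index mapping: a seeded permutation over the concatenated length, plus a prefix-sum lookup to route each global index to (dataset, local index). All names below are hypothetical; the actual #4133 implementation certainly differs.

```python
import bisect
import random

class FlatMixtureSketch:
    """Virtual concatenation + global shuffle over index space only
    (a sketch, not the real FlatMixture API)."""

    def __init__(self, datasets, seed=0):
        self.datasets = datasets
        # Prefix sums of dataset lengths: offsets[d] is the global
        # index where dataset d starts in the virtual concatenation.
        self.offsets = [0]
        for d in datasets:
            self.offsets.append(self.offsets[-1] + len(d))
        # Global shuffle: a seeded permutation of all global indices.
        self.perm = list(range(self.offsets[-1]))
        random.Random(seed).shuffle(self.perm)

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, i):
        g = self.perm[i]  # shuffled global index
        # Route the global index to its source dataset.
        d = bisect.bisect_right(self.offsets, g) - 1
        return self.datasets[d][g - self.offsets[d]]

mix = FlatMixtureSketch([["a0", "a1", "a2"], ["b0", "b1", "b2", "b3", "b4"]])
print(len(mix), sorted(mix[i] for i in range(len(mix))))
```

The point of the design is that only the permutation is materialized; the underlying token caches are read in place, so no re-tokenization or data movement is needed to change the mixture or the shuffle seed.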

0 PRs this week, 24 new comments, and 1 new issue (5 total)

#3096 Pre-training: MoE Scaling Laws


1/6 sub-issues closed

@ClassicLarry launched the 1e22 MoE run #3800, a 34.6B-total / 5.1B-active model on v4-512 with capacity factor 1.0, running at 488k tok/s and 23% MFU (W&B). The capacity factor was set to 1.0 based on ablation results #4016 showing cap=1.0 is 8.3% faster than 1.25 with negligible loss impact at 1e20 scale. The v7 isoflop sweep #4225 mapped LR-schedule and decay interactions with AdamH across dimensions, finding that removing linear decay at small step counts (d2048 @ 1e18) improved BPB by 0.03, suggesting the decay fraction needs to scale with training length. @Helw150 resolved the QB overhead problem #3972: the async CPU-overlap approach was a dead end (its apparent win came from a misleading benchmark), but sharded microbatch QB eliminated the 1.2 MFU-point overhead entirely. He also updated the MoE baseline #4084 and validated moe-sharded-qb-gn-xsa as the best iter-04 configuration, with a Vizier hyperparameter sweep running on us-east5-a #2167. On the Delphi dense scaling ladder #1337, the 1e23 run on v4-1024 continued at 30% MFU with Paloma macro BPB 0.79, and the 1e22 seed42 run finished at 47% MFU (macro BPB 0.84). @dlwh landed XLA-first Mamba-3 SISO and MIMO TPU kernels #3961 (+5,352 lines) with a sharding-safe ranked public API #4149, and @msclar merged the AdaMuon optimizer implementation #3300.
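For context on the capacity-factor ablation: in a standard token-dropping MoE router, each expert gets a fixed number of slots per batch, so cap=1.0 shrinks the padded expert buffers (faster) at the cost of dropping more tokens when routing is unbalanced. A generic textbook sketch, not the run's actual router code.

```python
import math

def expert_capacity(tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Per-expert slot count in a capacity-limited MoE router.
    Tokens routed to a full expert are dropped (or passed through
    the residual), so capacity_factor trades compute for coverage."""
    return math.ceil(capacity_factor * tokens / num_experts)

# Illustrative numbers (hypothetical batch, not the run's config):
# cap=1.0 gives exactly tokens/num_experts slots; cap=1.25 pads by 25%.
print(expert_capacity(4096, 64, 1.0))
print(expert_capacity(4096, 64, 1.25))
```

The ablation result (8.3% faster at cap=1.0 with negligible loss change) suggests the router was balanced enough at 1e20 scale that the extra 25% padding was mostly wasted compute.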

0 PRs this week, 13 new comments, and 1 new issue (6 total)

#2836 Infrastructure: MoE Training Support


Summary: Train a 50B MoE model on GPU hardware reliably, from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data-pipeline, and training work needed to get there by March 31.

35/49 sub-issues closed

@rjpower drove a major Iris controller performance push: lock hold time in drain_dispatch_all dropped from 80ms to under 5ms #4222, a two-pass heartbeat batch delivered 4x faster provider loops #4210, lightweight job state polling reduced controller load #4209, the ORM query builder was replaced with raw SQL #4181, and log fetching was unified under FetchLogs with LIKE patterns #4202 (-598 lines). Checkpoint management got zstd compression and old-checkpoint pruning #4143. @ravwojdyla shipped the actor proxy service for external access to cluster actors #4126, refactored Zephyr chunking for improved shuffle scalability #3839 (+1,357/-539 lines), and allowed specifying coordinator resources #4095. User-defined counters (MapReduce-style per-job stats) were added #4085 with records_in/records_out counters for readers and writers #4189 and per-worker counter queries #4164. @yonromai fixed a TensorStore handle leak in cache-copy (~14 MiB/shard) #4198, fixed stale coordinators killing new workers on retry #4199, and added Slack alerts and Claude triage to the canary ferry #4158 #4177. @dlwh landed the region-aware executor on Iris #3824 (+1,075 lines). The integration test suite was redesigned #4009 and optional auth mode landed for gradual adoption #3937.
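The drain_dispatch_all lock-hold-time fix follows a classic two-pass pattern: hold the lock just long enough to snapshot the pending queue, then do the slow dispatch work (RPCs, DB writes) with no lock held, so heartbeats and enqueues are never blocked behind it. A minimal sketch with hypothetical structure; the real Iris controller is more involved.

```python
import threading

class Dispatcher:
    """Sketch of reducing lock hold time in a drain-and-dispatch loop."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []
        self.dispatched = []

    def enqueue(self, item):
        with self._lock:
            self._pending.append(item)

    def drain_dispatch_all(self):
        # Pass 1: briefly hold the lock to take ownership of the queue.
        with self._lock:
            batch, self._pending = self._pending, []
        # Pass 2: do the slow per-item work with no lock held, so
        # concurrent enqueue() calls are never blocked behind it.
        for item in batch:
            self.dispatched.append(item)
        return len(batch)

d = Dispatcher()
for i in range(5):
    d.enqueue(i)
print(d.drain_dispatch_all())
```

The same shape explains the two-pass heartbeat batch in #4210: collect all due heartbeats under the lock in one pass, then send them outside it.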

0 PRs this week, and 0 new issues (49 total)

Other Changes


@dlwh drafted a staged modeling experiment skill #4166 to make modeling experiments Grug-first with W&B view tagging and optional stage gates.

1 PR this week, 11 new comments, and 75 issues closed (75 total)

Top 15 runs (by FLOPs) this week (completed, running, crashed)


The largest active run is the Delphi 1e23 dense scaling ladder (W&B), a 25B-parameter model on v4-1024 now at 608B tokens with Paloma macro BPB 0.79 and train loss 2.08; @Helw150 noted the loss trend during cooldown is tracking between the pessimistic and optimistic forecasts #1337. The 1e22 MoE v7 run (W&B) launched mid-week on v4-512 at 23% MFU, with an estimated 7.7 days to completion #3800. Two finished v7 isoflop runs at 1e20 validated capacity factor 1.0 as 8.3% faster with near-identical loss #4016. On the post-training side, @AlienKevin's 32K v2 OT-Agent SFT (W&B) reached 13% SWE-bench (matching the released 14%) #3896, while the 131K v2 run on v5p-256 (W&B) showed that gradient clipping must be scaled with batch size at 131K context #3897. Three Qwen3-14B resilience SFT runs for the math and medical domains completed on v5p-16 at 56-57% MFU.

Run · User · Hardware · Hours · FLOPs (model / HW, MFU) · BPB
#1337 adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb · Will Held · TPU v4 (512 chips) · 26.2d · 9.69e22 model / 3.23e23 HW (30%) · BPB 0.725
#1337 adamh-scaling-ladder-nemotron-optimal-1e+22-v5-seed42-deeff4 · Will Held · TPU v4 (256 chips) · 3.4d · 1.00e22 model / 2.12e22 HW (47%) · BPB 0.769
#3800 moe-v7-1e22-d3200 · Larry Dial · TPU v4 (256 chips) · 2.7d · 3.37e21 model / 1.46e22 HW (23%) · BPB 0.834
#1337 adamh-scaling-ladder-nemotron-optimal-1e+22-v5-seed62746-10f597 · Will Held · TPU v4 (256 chips) · 1.7d · 4.86e21 model / 9.64e21 HW (50%) · BPB 0.873
#3897 exp3897v2_sft_ota_131k_qwen3_8b_131072tokens_v5p256-de2c26 · Kevin Li · TPU v5 (128 chips) · 16.5h · 1.16e21 model / 3.16e21 HW (37%) · —
#4016 isoflop-moe-v7-1e+20-d1536 · Larry Dial · TPU v4 (64 chips) · 15.1h · 1.28e20 model / 7.18e20 HW (18%) · BPB 0.893
#4016 isoflop-moe-v7-1e+20-d1536-cap1p0 · Larry Dial · TPU v4 (64 chips) · 14.4h · 1.28e20 model / 6.63e20 HW (19%) · BPB 0.900
#3897 exp3897v2_sft_ota_131k_qwen3_8b_131072tokens_v5p32-913963 · Kevin Li · TPU v5 (16 chips) · 1.2d · 2.40e20 model / 6.56e20 HW (37%) · —
#2167 isoflop-moe-adamh-gatednorm-v5p64-r2-1e20-d1536-retry25 · Kaiyue Wen · TPU v5 (32 chips) · 11.7h · 9.36e19 model / 5.13e20 HW (18%) · BPB 0.913
#3896 exp3896v2_sft_ota_32k_qwen3_8b_32768tokens_v5p32-dbc611 · Kevin Li · TPU v5 (16 chips) · 22.2h · 2.15e20 model / 5.09e20 HW (42%) · —
math-14b-resili-best-extract-qwen3-14b-base-a7e7d9 · Michael Ryan · TPU v5 (16 chips) · 1.0d · 2.78e20 model / 4.94e20 HW (56%) · —
math-14b-resili-best-resili-qwen3-14b-base-c0112d · Michael Ryan · TPU v5 (16 chips) · 1.0d · 2.78e20 model / 4.94e20 HW (56%) · —
math-14b-resili-default-qwen3-14b-base-d81f39 · Michael Ryan · TPU v5 (16 chips) · 21.6h · 2.78e20 model / 4.84e20 HW (57%) · —
medical-14b-resili-best-extract-qwen3-14b-base-537b2a · Michael Ryan · TPU v5 (16 chips) · 19.4h · 2.29e20 model / 3.99e20 HW (57%) · —
medical-14b-resili-best-resili-qwen3-14b-base-973573 · Michael Ryan · TPU v5 (16 chips) · 19.5h · 2.29e20 model / 3.96e20 HW (58%) · —

Data: weekly-data-2026-03-23_2026-03-29.json · sections-2026-03-23_2026-03-29.json · wandb-flops-2026-03-23_2026-03-29.json · tpu-usage-2026-03-23_2026-03-29.json · token-counts-2026-03-23_2026-03-29.json