Week of March 2nd summary for marin-community/marin

Milestone: Kick off a 32B-A4B 10T-token MoE training run, advance scaling-laws work, and get ~15T+ tokens ready
156 merged · 11 opened · 77 issues closed · 14 contributors · 4 epics · 653 comments this week

A massive Iris infrastructure push — controller checkpointing, reservation system, flexible device scheduling, and a complete logging overhaul — alongside MoE scaling law experiments by @ClassicLarry and continued kernel/optimizer work from @dlwh.

#2836 Infrastructure: MoE Training Support


Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.

24/41 sub-issues closed

Building on last week's reliability push, Iris saw its biggest week yet. @rjpower landed controller snapshot/checkpoint #3167 so restarts no longer orphan running jobs, a reservation system for pre-provisioning worker capacity #3123, #3223, and flexible device variant scheduling #3254 that lets jobs specify multiple acceptable TPU types for cross-region placement. The autoscaler was reworked with token-bucket rate limiting for scale-down #3212 and reduced lock contention under high task counts #3356. A complete logging overhaul replaced GCS-based log reads with heartbeat-forwarded logs stored in SQLite on the controller #3244, #3301, #3283, #3296, #3325, fixing dropped logs on task completion and eliminating file descriptor exhaustion. MirrorFileSystem #3258 provides transparent cross-region file access, while CrossRegionGuardedFS #3162 blocks large cross-region reads.

On the training side, @dlwh continued building on last week's Grug refactor with improved variant contract checks #3169, visual diff tooling for reviewing template-heavy Grug code #3127, a new modular_opt variant #3293, and MoE ring expert-parallel optimizations #3377, #3398 that closed the EP benchmark milestone #2710. Fused cross-entropy was stabilized: one production forward path #3125, miss-only autotune sweeps #3251, a backend-dispatched GMM API with GPU fallback #3256, and v4 vmem fallback stabilization #3354. @yonromai fixed Pallas GPU CE tracing on non-GB10 #3148, added NVIDIA weight tile limits #3160, and enabled S3 compilation caching #3195.
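The scale-down throttling in #3212 is described as token-bucket rate limiting; as a rough illustration of that pattern (class and parameter names here are hypothetical, not the Iris implementation):

```python
import time

class TokenBucket:
    """Allow at most `rate` scale-down actions per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Illustrative usage: an autoscaler would call try_acquire() before releasing each
# idle worker, so a burst of idle signals cannot tear down the whole pool at once.
bucket = TokenBucket(rate=1 / 60, capacity=3)  # ~1 scale-down per minute, bursts of 3
```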

126 PRs this week, 24 new comments, and 1 new issue (41 total); 5 potentially related items are listed under Other Changes.

#3096 Pre-training: MoE Scaling Laws


1/6 sub-issues closed

Following last week's initial MoE experiments on v4 and v5p, @ClassicLarry ran an extensive set of scaling law experiments, replicating expert-count sweeps from the TGL paper #3182 and progressing to full isoflop sweeps #2167. Key findings: DeepSeek-style aux-loss-free load balancing outperformed traditional LBL across configurations, and at high sparsity ratios (2:128) a 4x LBL coefficient boost showed only marginal benefit. The sweep converged on a baseline architecture of 64 routed experts, K=2, with AF balancing (bias_rate=0.01) plus 0.001 aux loss, now tracked as moe_iteration_01.

@yonromai added a MoE canary ferry for daily TPU regression testing #3342 alongside canary diagnostics improvements: data loader stall monitoring #3346, always-on profiling with persistent artifacts #3299, MFU gating on trailing p50 #3279, and CW canary OOM fixes #3217. @dlwh made block shuffle the default for new Grug runs #3371, dispatched Grug through Fray jobs to fix multinode training #3269, and added multinode TPU auto-detection to set replicas and coscheduling #3233.
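For background on the AF-balancing baseline above: DeepSeek-style aux-loss-free load balancing adds a per-expert bias to the router scores used only for top-K expert selection, and nudges that bias toward underloaded experts instead of relying solely on an auxiliary loss. A rough NumPy sketch of the idea, with illustrative names rather than Marin's Grug code:

```python
import numpy as np

def route_with_bias(scores, bias, k):
    """Top-K routing where the bias affects expert *selection* only, not the gate weights."""
    selected = np.argsort(scores + bias, axis=-1)[:, -k:]   # [tokens, k] expert ids
    gates = np.take_along_axis(scores, selected, axis=-1)   # unbiased gate values
    return selected, gates

def update_bias(bias, selected, n_experts, bias_rate=0.01):
    """Push bias down for overloaded experts and up for underloaded ones."""
    load = np.bincount(selected.ravel(), minlength=n_experts)
    return bias - bias_rate * np.sign(load - load.mean())

# Toy usage: 1024 tokens, 64 routed experts, K=2 (the moe_iteration_01 baseline shape).
tokens, n_experts, k = 1024, 64, 2
scores = np.random.rand(tokens, n_experts)
bias = np.zeros(n_experts)
selected, gates = route_with_bias(scores, bias, k)
bias = update_bias(bias, selected, n_experts)
```

The bias never touches the gate values themselves, which is what lets it rebalance expert load without perturbing the training objective the way a large LBL coefficient can.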

12 PRs this week, 49 new comments, and 3 new issues (6 total); 6 potentially related items are listed under Other Changes.

#3100 Data Sources for Pre-training


Summary: We will need 20T of high-quality tokens (in particular, code) for our large MoE runs in Q2/Q3; this epic tracks the March work needed to enable that.

0/4 sub-issues closed

After last week's tokenization debugging, @ravwojdyla completed Nemotron-CC tokenization at scale, processing nearly 2 trillion tokens across all 7 quality tiers at ~150M tokens/sec on 512 workers #2829. The Luxical embedding experiment #3191 kicked off to evaluate frozen embeddings as general-purpose quality/topic classifiers, with Luxical's creator @lukemerrick offering usage guidance #3049. @Helw150 added Nemotron V2 data #3317. Zephyr saw Vortex upgraded to support GCS #3268, group_by enhancements including secondary sort and generator reducers #3250, #3247, and download reliability fixes #3324, #3142.
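For scale, a back-of-envelope check of the #2829 throughput figures (assuming the ~150M tokens/sec is an aggregate rate over the 512 workers, which the numbers suggest but the issue does not spell out here):

```python
total_tokens = 2e12    # ~2 trillion tokens across the 7 Nemotron-CC quality tiers
agg_rate = 150e6       # ~150M tokens/sec, assumed to be aggregate across workers
workers = 512

per_worker_rate = agg_rate / workers                 # ~293k tokens/sec per worker
wall_clock_hours = total_tokens / agg_rate / 3600    # ~3.7 hours of sustained processing

print(f"{per_worker_rate:,.0f} tokens/s per worker, ~{wall_clock_hours:.1f} h wall clock")
```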

4 PRs this week, 6 new comments, and 2 new issues (4 total); 2 potentially related items are listed under Other Changes.

#3192 Synthetic Data


0/4 sub-issues closed

Progress on the SFT front after last week's 0% resolve rate: @AlienKevin achieved 5/43 on the Rust subset of SWE-bench Multilingual by switching to Qwen2.5-Coder-32B-Instruct as the student model and using the exact SWE-smith hyperparameters #2956. A TRL sanity check on Modal confirmed the earlier Qwen3-8B repetition issues were not a Marin-specific bug. On OpenThoughts4, @moojink completed follow-up experiments with the larger Qwen3-235B-A22B teacher #2262: surprisingly, it showed only a modest advantage over 32B for Llama3.1-8B-Instruct students, suggesting diminishing returns from teacher scale.

0 PRs this week, 6 new comments, and 0 new issues (4 total)

Other Changes


@gonzalobenegas added LLR-based variant effect prediction evaluation for DNA models #3144 and an EDA notebook on perplexity vs. downstream task performance #3333. @dlwh-golem aligned documentation across TPU cluster setup #3415, contributing hooks #3307, MkDocs commands #3271, and README paths #3235. CI now runs PR checks on all target branches #3150.
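For reference, LLR-based variant effect prediction (#3144) scores a variant by the model's log-likelihood ratio between the alternate and reference allele at the variant position. A toy sketch of that scoring (vocabulary and probabilities are made up; this is not the actual evaluation code):

```python
import numpy as np

def llr_score(log_probs, ref_base, alt_base, vocab):
    """LLR = log P(alt | context) - log P(ref | context) at the variant position.

    `log_probs` is the model's log-softmax over the nucleotide vocabulary at that
    position; a more negative score means the variant is less likely than the
    reference, i.e. predicted to be more deleterious.
    """
    return log_probs[vocab[alt_base]] - log_probs[vocab[ref_base]]

# Toy example with a 4-letter vocabulary and made-up log-probabilities.
vocab = {"A": 0, "C": 1, "G": 2, "T": 3}
log_probs = np.log(np.array([0.70, 0.10, 0.15, 0.05]))  # model strongly prefers the reference A
print(llr_score(log_probs, ref_base="A", alt_base="T", vocab=vocab))  # ≈ -2.64
```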

29 PRs this week, 53 new comments, and 77 issues closed (77 total)

Top 15 runs (by FLOPs) this week (completed, running, crashed)


The largest single run this week was Will Held's 10B-parameter AdamH scaling ladder at 1e22 FLOP budget on v4-256 (128 chips), reaching 47.5% MFU and processing 116B tokens over 59 hours before crashing — the first Marin run at this scale. MoE scaling law experiments intensified: ClassicLarry ran two complete TGL Phase 1 expert-count sweeps (2 to 256 experts, 8.3B tokens each) on v4-8 plus an EKN nano scaling sweep testing K and LBL coefficient interactions across ~20 configurations. The 256-expert configs consistently achieved the best loss (3.060-3.186). Moo Jin Kim completed a large OpenThoughts4 SFT run (Qwen3-1.7B with Qwen3-32B teacher, 100k steps on v4-128 at 34.7% MFU). David Hall began 32B-A4B MoE bring-up on v5p-64, running several short profiling attempts at 11-20% MFU that crashed but established baseline performance numbers. Parallel-attn-mlp experiments on v6e hardware investigated persistent NaN issues under restart.
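A note on reading the table: the FLOPs column pairs model FLOPs with hardware FLOPs (presumably the hardware's available compute over the run's wall time), and the reported MFU is their ratio. A quick check against the top row:

```python
# MFU as reported in the table: model FLOPs divided by hardware FLOPs for the run.
model_flops = 8.18e22   # top row: adamh-scaling-ladder-nemotron-optimal-1e+23-v5
hw_flops = 2.65e23
print(f"MFU ≈ {model_flops / hw_flops:.0%}")   # ≈ 31%, matching the reported value
```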

Run | Owner | Hardware | FLOPs (model / HW, MFU) | Wall time | Loss / evals
adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb (running) | Will Held | TPU v4 (512 chips) | 8.18e22 / 2.65e23 (31% MFU) | 22.1d | Final loss 2.2834, BPB 0.796
adamh-scaling-ladder-nemotron-optimal-1e+22-v5-025b0e | Will Held | TPU v4 (256 chips) | 10.00e21 / 1.99e22 (50% MFU) | 3.5d | Final loss 2.2582, BPB 0.768
adamh-scaling-ladder-nemotron-optimal-1e+22-v6-500e71 (crashed) | Will Held | TPU v4 (256 chips) | 7.23e21 / 1.52e22 (48% MFU) | 2.5d | Final loss 3.4445, BPB 0.950
exp2262pt3c_pt2_qwen3_1pt7b_base_ot4_240k_math_qwen3_4b_32768tok-a9bd48 (done) | Moo Jin Kim | TPU v4 (128 chips) | 2.59e21 / 7.71e21 (34% MFU) | 2.4d | Final loss 0.0532, BPB 0.067
exp2262pt3d_pt2_qwen3_1pt7b_base_ot4_240k_math_qwen3_32b_32768to-58b647 (done) | Moo Jin Kim | TPU v4 (128 chips) | 2.59e21 / 7.46e21 (35% MFU) | 2.1d | Final loss 0.1431, BPB 0.146
AdamH scaling ladder 10B (Nemotron-optimal, 1e22 budget) (crashed) | @Helw150 | v4-256 (128 chips) | 47.5% MFU | 59.0h | loss 2.702, 115.9B tokens
adamh-v6-scaling-ladder-nemotron-optimal-1e+23-a128a5 (crashed) | Will Held | TPU v4 (256 chips) | 1.25e21 / 3.22e21 (39% MFU) | 15.8h | Final loss 3.3285, BPB 1.190
adamh-v6-scaling-ladder-nemotron-optimal-1e+22-81073a (crashed) | Will Held | TPU v4 (256 chips) | 1.07e21 / 2.93e21 (36% MFU) | 16.1h | Final loss 2.7127, BPB 0.948
adamh-scaling-ladder-nemotron-optimal-1e+21-v5-019021 | Will Held | TPU v4 (64 chips) | 1.00e21 / 1.90e21 (53% MFU) | 1.3d | Final loss 2.4289, BPB 0.844
adamh-scaling-ladder-nemotron-optimal-1e+21-v6-77f848 | Will Held | TPU v4 (64 chips) | 1.00e21 / 1.89e21 (53% MFU) | 1.3d | Final loss 2.4282, BPB 0.844
exp2262pt3i_100k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-a594bd (done) | Moo Jin Kim | TPU v4 (128 chips) | 5.41e20 / 1.56e21 (35% MFU) | 13.3h | Final loss 0.1505, BPB 0.145
exp2262pt3i_100k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-092c35 (done) | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 / 1.50e21 (36% MFU) | 1.2d | Final loss 0.1773, BPB 0.135
exp2262pt3h_100k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-0c0f70 (done) | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 / 1.50e21 (36% MFU) | 18.5h | Final loss 0.0693, BPB 0.064
exp2262pt3h_100k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-ce68a9 | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 / 1.50e21 (36% MFU) | 1.2d | Final loss 0.0493, BPB 0.078
exp2262pt3i_100k_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768tokens-227b3e (done) | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 / 1.50e21 (36% MFU) | 1.7d | Final loss 0.2166, BPB 0.135
isoflop-3e+20-d768-L8-B1024-adamh_scaling_v8 | Will Held | TPU v4 (32 chips) | 3.00e20 / 1.29e21 (23% MFU) | 1.8d | Final loss 2.9525, BPB 1.002
OpenThoughts4 SFT: Qwen3-1.7B base w/ Qwen3-32B teacher (100k steps) | @moojink | v4-128 (64 chips) | 34.7% MFU | 13.3h | loss 0.166, 16.4B tokens
TGL Phase 1 MoE run3 (2 experts, v5p-16) (crashed) | @ClassicLarry | v5p-16 (8 chips) | 18.3% MFU | 15.5h | loss 3.393, 20.7B tokens
TGL Phase 1 MoE expert-count sweep (run5, 2-256 experts) | @ClassicLarry | v4-8 (4 chips) | 14-16% MFU | 10-17h per run | loss 3.186 (256E) to 3.541 (2E), 8.3B tokens each, 8 configs
TGL Phase 1 MoE expert-count sweep (run2-v2, 2-256 experts) | @ClassicLarry | v4-8 (4 chips) | 14-16% MFU | 10-14h per run | loss 3.060 (256E) to 3.423 (2E), 8.3B tokens each, 8 configs
EKN MoE nano scaling sweep (K=4/8, E=8-128, LBL ablation) | @ClassicLarry | v4-8 (4 chips) | 10-13% MFU | 10-14h per run | loss 3.215-3.823, 6.6B tokens each, ~20 configs
Grug MoE 32B-A4B v5p-64 perf bring-up (profiling) (crashed) | @dlwh | v5p-64 (32 chips) | 11-20% MFU | 0.3-0.6h per attempt | profiling only, ~20M tokens per attempt
Parallel-attn-mlp v6e NaN investigation sweeps (crashed) | @dlwh | v6e-8 (4 chips) | 5-7% MFU | 0.4-1.5h per run | NaN investigation, 0.8-1.5B tokens