Week of March 16th summary for marin-community/marin

Milestone: Kick off a 32B-A4B 10T-token MoE training run & advance scaling laws work & get ~15T+ tokens ready
120 PRs merged · 20 PRs opened · 82 issues closed · 13 contributors · 4 epics · 616 comments this week

A week of deep infrastructure hardening — Iris got security-first auth, DuckDB log storage, 5x controller performance, and multi-host GPU on CoreWeave — while MoE scaling research advanced to iter_03 with PID-controlled sigmoid routing, AdamH, and Gated Norm showing uniform BPB improvements across the isoflop grid.

#2836 Infrastructure: MoE Training Support


Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.

24/41 sub-issues closed

Iris saw major performance and security work this week. @rjpower landed a 5x+ controller speedup via a SQLite read pool, query consolidation, and dashboard SQL rewrites #3719, then followed up with further heartbeat and scheduling query optimizations #3881 and faster dashboard queries with expanded benchmark coverage #3791. The log store was overhauled twice: first replacing SQLite with DuckDB plus rotating Parquet #3828, then optimizing with connection pooling, sorted segments, and page indexes #3837, together with a fix for the DuckDB store's 60GB memory usage #3843. The BundleStore was also replaced with flat-file storage #3776, and migration scripts were added for the SQLite→fsspec/Parquet transitions #3852.

Security hardening landed with default-deny auth, CSRF protection, auth DB isolation, and traceback sanitization #3894, plus sensitive env var redaction in API responses #3889. Multi-host GPU support for CoreWeave shipped via the new KubernetesProvider #3806, with smoke CI added #3927, endpoint leak fixes #3814, #3740, tmpfs for task workdirs #3696, #3770, and stale heartbeat reaping on controller restart #3772. The Dockerfiles were unified into a single multi-stage build #3735, and container OOM issues were addressed with memory limits #3840.

@dlwh made the executor region-aware, decoupling it from GCS dependencies when running on Iris #3824, fixed Grug checkpoint resume #3790, handled Pallas autotune misses under mosaic partitioning #3669, and added fused CE autotune fixes #3949, #3963. @yonromai refactored the Zephyr coordinator to use host_actor and worker group polling #3861, added S3/R2 conditional-write locking #3874, and fixed sync actor RPC kwargs #3907. @ravwojdyla added combiner support to group-by #3725, iris resource_utils for task resource queries #3847, map_shard with shard info #3757, and improved the job detail view with child jobs and sorting #3733. The harbor dependency was moved out to marin-community/harbor #3836, removing ~200K lines. A major Platform abstraction elimination is in progress #3900, reorganizing the code into Service + Provider layers.
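The DuckDB-plus-rotating-Parquet log store follows a common pattern: append closed log segments as immutable Parquet files and let DuckDB scan them lazily at query time. A minimal sketch of that pattern is below; the directory layout and column names are illustrative assumptions, not Iris's actual schema.

```python
import duckdb

# Sketch of querying rotated Parquet log segments with DuckDB.
# The file layout and columns (task_id, ts, level, message) are placeholders.
con = duckdb.connect()  # in-memory connection; no server process required

recent_errors = con.execute(
    """
    SELECT task_id, ts, message
    FROM read_parquet('logs/segment_*.parquet')  -- glob across rotated segments
    WHERE level = 'ERROR'
      AND ts >= now() - INTERVAL 1 HOUR
    ORDER BY ts DESC
    LIMIT 100
    """
).fetch_df()
print(recent_errors.head())
```

Because rotated segments are immutable and sorted, readers never contend with the writer, and Parquet row-group statistics let DuckDB skip data outside a query's time range, which is what the page-index work above exploits.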

113 PRs this week, 8 new comments, and 0 new issues (41 total)

#3096 Pre-training: MoE Scaling Laws


1/6 sub-issues closed

Building on last week's iter_02 improvements, @Helw150 pushed the MoE architecture to iter_03 with five major changes: sigmoid routing replacing softmax for independent per-expert selection, a PID bias controller (Kp/Ki/Kd) replacing the bang-bang balancer, AdamH (scale-invariant Adam that keeps weight Frobenius norms constant) replacing standard Adam, removal of all auxiliary losses (load balancing, router z-loss, logit z-loss), and variance-preserving residual connections. Two parallel Vizier sweeps — 160 trials each at d=1024/3e18 FLOPs — are comparing iter_02 baseline against iter_03 AdamH to find optimal hyperparameters at matched compute #2167. @WhenWen completed the full AdamH isoflop grid, showing 0.009–0.01 BPB improvements over v02 at most configurations, and added Gated Norm on top — a learned sigmoid gate that compensates for AdamH's bounded activation norms — producing uniform improvements of 0.002–0.006 BPB across all 15 grid points, with the d2048 regression at 1e18 nearly eliminated. @ClassicLarry ran the recipe at 1e20 scale with directionally good results, completed iter_01 vs iter_02 comparisons showing consistent improvements across the full grid, then kicked off a 1e21 run (14B total params, 2B active, 75B tokens) before stepping out #3800. On the GPU front, @chloechiaw is testing Tokamax's Pallas Triton ragged_dot kernel as a JAX-native alternative for GPU MoE #2828. @dlwh added an explicit EP implementation selector for the Grug MoE path #3779, an ArrayStacked grug variant with stack-aware optimizer support #3797, and only-store-EMA-when-enabled optimization #3670. Selective MoE remat checkpoint policies were added and then scoped down after feedback that the model file was getting too heavy #3657.
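To make the iter_03 routing changes concrete, here is a small numpy sketch of sigmoid top-k routing with a PID-controlled per-expert bias standing in for auxiliary load-balancing losses. The shapes, gains, and control signal are assumptions for illustration only, not the actual iter_03 implementation.

```python
import numpy as np

def route(logits, bias, k=2):
    # Sigmoid scores treat each expert independently (no softmax competition).
    # The bias only shifts *which* experts are selected; gate values use the
    # unbiased scores, so balancing never distorts the forward computation.
    scores = 1.0 / (1.0 + np.exp(-logits))
    topk = np.argsort(scores + bias, axis=-1)[:, -k:]   # [tokens, k] expert ids
    gates = np.take_along_axis(scores, topk, axis=-1)   # [tokens, k] gate values
    return topk, gates

class PIDBalancer:
    """Adjusts per-expert bias so observed load tracks the uniform target."""
    def __init__(self, n_experts, kp=1e-2, ki=1e-3, kd=1e-3):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = np.zeros(n_experts)
        self.prev_err = np.zeros(n_experts)
        self.bias = np.zeros(n_experts)

    def update(self, expert_counts, n_tokens, k):
        target = k * n_tokens / len(self.bias)            # ideal tokens per expert
        err = target - expert_counts                      # > 0 means underloaded
        self.integral += err
        self.bias += (self.kp * err
                      + self.ki * self.integral
                      + self.kd * (err - self.prev_err))
        self.prev_err = err
        return self.bias

# Usage: route a batch, then feed realized expert loads back into the controller.
rng = np.random.default_rng(0)
n_tokens, n_experts, k = 1024, 8, 2
balancer = PIDBalancer(n_experts)
logits = rng.normal(size=(n_tokens, n_experts))
topk, gates = route(logits, balancer.bias, k)
counts = np.bincount(topk.reshape(-1), minlength=n_experts)
balancer.update(counts, n_tokens, k)
```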

4 PRs this week, 17 new comments, and 2 new issues (6 total)

#3100 Data Sources for Pre-training


Summary: We will need 20T of high-quality tokens (in particular code) for our large MoE runs in Q2/Q3; this epic tracks the March work needed to enable that.

0/4 sub-issues closed

Following last week's exact dedup rewrite, @ravwojdyla confirmed successful large-scale group-by for fuzzy dedup on a Nemotron split #2829, validating the pipeline at production scale. @rjpower switched Nemotron-CC downloads to streaming .jsonl.zst output #3796. @dlwh added support for hf://buckets paths in default_download and default_tokenize #3793. The dupekit Rust code was refactored into rust/ with CI wheels and a mode switch #3850. A fuzzy dedup update to scale to Nemotron is in progress #3750.
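For context on the streaming .jsonl.zst change, the usual pattern is to decompress and parse records lazily so a shard never has to be fully materialized. A short sketch using the zstandard package; the path and record layout are placeholders, not the actual pipeline schema.

```python
import io
import json
import zstandard as zstd

# Stream records out of a .jsonl.zst shard with constant memory usage.
def iter_records(path):
    with open(path, "rb") as raw:
        reader = zstd.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)

# Usage: one pass over a shard without ever holding it fully in memory.
# n_docs = sum(1 for _ in iter_records("nemotron_cc/shard_00000.jsonl.zst"))
```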

4 PRs this week, and 0 new issues (4 total)

#3192 Synthetic Data & Post-training


0/4 sub-issues closed

@ahmeda14960 opened a major PR for an alignment function that goes from spec to synthetic preference data to DPO #3950, alongside work migrating the RL pipeline from Fray v1 (Ray) to Fray v2 (Iris) #3960. A collocated RL project doc and Tunix RL reference was also opened #3948. @teetone added new Evalchemy evals and fixes #3690.
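For reference, the DPO step at the end of that spec → preference data → DPO pipeline optimizes a simple pairwise objective. The sketch below is the standard formulation, not the code in #3950; inputs are summed per-response log-probs under the policy and a frozen reference model.

```python
import jax
import jax.numpy as jnp

# Schematic DPO loss over a batch of (chosen, rejected) preference pairs.
# Each argument is a [batch] vector of summed response log-probs.
def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin) == softplus(-margin); softplus is numerically stable.
    return jnp.mean(jax.nn.softplus(-margin))
```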

1 PR this week, 1 new comment, and 0 new issues (4 total)

Other Notable Changes


@dlwh opened a PR for XLA-first Mamba-3 SISO and MIMO TPU kernels #3961, a significant new architecture direction. @RohithKuditipudi fixed final checkpoint saving #3958. The nightshift agent system continued evolving — skills were separated for PR authoring vs review #3920, triage was upgraded to opus #3749, and turn limits were increased 10x #3935. @Helw150 has a tokenized molecule training draft in progress #3742.

23 PRs this week, 31 new comments, and 82 issues closed (82 total)

Top 15 runs (by FLOPs) this week (completed, running, crashed)


| Run | User | Hardware | Duration | FLOP Budget (model / HW) | Loss (BPB) |
|---|---|---|---|---|---|
| adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb | Will Held | TPU v4 (512 chips) | 22.0d | 8.16e22 / 2.64e23 (31% MFU) | 0.796 |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v5-seed42-deeff4 | Will Held | TPU v4 (256 chips) | 3.4d | 10.00e21 / 2.12e22 (47% MFU) | 0.769 |
| exp3490b_sft_nemotron_terminal_corpus_full_qwen3_8b_32768tokens_-3da6c1 | Kevin Li | TPU v5 (32 chips) | 5.0d | 2.49e21 / 6.08e21 (41% MFU) | n/a |
| moe-d2304-1e21 | Larry Dial | TPU v4 (128 chips) | 2.1d | 1.00e21 / 5.59e21 (18% MFU) | 0.823 |
| exp2262pt3h_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-f3ec95 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 / 3.73e21 (35% MFU) | 0.124 |
| exp2262pt3i_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-eeb1d6 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 / 3.73e21 (35% MFU) | 0.181 |
| exp2262pt3h_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-007999 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 / 3.73e21 (35% MFU) | 0.085 |
| exp2262pt3l_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-5eb7b0 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 / 3.73e21 (35% MFU) | 0.167 |
| exp2262pt3i_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-84f4d3 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 / 3.73e21 (35% MFU) | 0.243 |
| exp2262pt3l_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-77d4d8 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 / 3.73e21 (35% MFU) | 0.209 |
| exp3897_sft_ota_131k_qwen3_8b_131072tokens_v5p256-f7d21a | Kevin Li | TPU v5 (128 chips) | 16.1h | 1.16e21 / 3.16e21 (37% MFU) | n/a |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-seed42-e251d0 | Will Held | TPU v4 (64 chips) | 1.3d | 1.00e21 / 2.05e21 (49% MFU) | 0.844 |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-seed62746-659a1b | Will Held | TPU v4 (64 chips) | 1.3d | 1.00e21 / 1.89e21 (53% MFU) | 0.845 |
| isoflop-moe-v2-1e+20-d1536-bs128 | Larry Dial | TPU v4 (64 chips) | 12.5h | 10.00e19 / 5.73e20 (17% MFU) | 0.918 |
| isoflop-moe-adamh-gatednorm-v5p64-r2-1e20-d1536-retry25 | Kaiyue Wen | TPU v5 (32 chips) | 11.7h | 9.36e19 / 5.13e20 (18% MFU) | 0.913 |