Week of March 2nd summary for marin-community/marin

Milestone: Kick off a 32B-A4B 10T-token MoE training run, advance scaling-laws work, and get ~15T+ tokens ready
154 PRs merged · 26 PRs opened · 73 issues closed · 14 contributors · 4 epics · 466 comments this week

Iris moved from last week's reliability hardening to a full log storage rewrite and reservation system. The MoE expert-parallel benchmark thread concluded with production compaction optimizations, and @ClassicLarry submitted a 15-run isoflop scaling sweep after last week's initial MoE experiments.

#2836 Infrastructure: MoE Training Support


Summary: Reliably train a 50B MoE model on GPU hardware — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.

24/41 sub-issues closed

Building on last week's Iris reliability push, @rjpower replaced GCS-based log reads with a SQLite-backed controller log store forwarded via heartbeat (#3301, #3244) — a full rewrite of the log pipeline. He also landed a reservation system for pre-provisioning worker capacity (#3123, #3223), and fixed FD exhaustion under load (#3389), controller lock contention (#3356), and delivery-failure retry budget inflation (#3366, #3367). @dlwh continued the Pallas kernel cleanup from last week, consolidating to a single production forward path for fused cross-entropy (#3125) and stabilizing it across TPU v4/v5e/v6e (#3354). The MoE ring expert-parallel benchmark thread (#2710) concluded this week with production compaction optimizations merged (#3377, #3398). @yonromai hardened CoreWeave deployment with konnectivity tunnel retries (#3323), Docker CLI pinning for TPU host compat (#3348), and compilation cache improvements (#3195). @ravwojdyla parallelized GCP zone queries (#3259) and added MirrorFileSystem for transparent cross-region file access (#3258).
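
For intuition on the log-pipeline rewrite, here is a minimal sketch of its shape: workers piggyback recent log lines on their heartbeats, and the controller persists them in SQLite for indexed reads instead of fetching from GCS per request. Class and method names (ControllerLogStore, on_heartbeat, tail) are illustrative, not Iris's actual API.

```python
import sqlite3
import time

class ControllerLogStore:
    """Sketch of a controller-local log store: workers attach recent log
    lines to their heartbeat, and the controller keeps them in SQLite for
    fast indexed reads instead of fetching objects from GCS."""

    def __init__(self, path: str = "logs.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS logs (worker_id TEXT, ts REAL, line TEXT)"
        )
        self.db.execute(
            "CREATE INDEX IF NOT EXISTS logs_by_worker ON logs (worker_id, ts)"
        )

    def on_heartbeat(self, worker_id: str, log_batch: list[str]) -> None:
        # Each heartbeat carries the lines emitted since the previous one.
        now = time.time()
        self.db.executemany(
            "INSERT INTO logs VALUES (?, ?, ?)",
            [(worker_id, now, line) for line in log_batch],
        )
        self.db.commit()

    def tail(self, worker_id: str, n: int = 100) -> list[str]:
        # One indexed query serves the "recent logs for this worker" path.
        rows = self.db.execute(
            "SELECT line FROM logs WHERE worker_id = ? ORDER BY ts DESC LIMIT ?",
            (worker_id, n),
        ).fetchall()
        return [line for (line,) in reversed(rows)]
```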

108 PRs this week, 22 new comments, and 1 new issue (41 total)

#3096 Pre-training: 32B MoE Kick-off


1/4 sub-issues closed

Following last week's initial MoE experiments on v4 and v5p, @ClassicLarry completed Phase 1 scaling law replication (#3182) and submitted a full 15-run isoflop sweep varying expert counts, granularity, and activation ratios (#2167). The canonical Grug MoE module and template variant landed (#3046). @yonromai built on last week's CW canary ferry by adding a TPU canary (#3342), data loader stall diagnostics (#3346), always-on profiling with persistent artifacts (#3299), and MFU gating on trailing p50 windows (#3279). Grug MoE ring EP got block shuffle as the new default (#3371) and loop profiler annotations (#3376).
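
To make the MFU-gating idea concrete: gate on the median of a trailing window of per-step MFU samples, so a single slow step does not trip the check. A minimal sketch with illustrative names and an assumed threshold, not the actual Marin gate:

```python
from collections import deque
from statistics import median

class MFUGate:
    """Trailing-window MFU gate: pass/fail on the window's p50 rather than
    instantaneous readings, so one transient stall doesn't trip the gate."""

    def __init__(self, threshold: float = 0.35, window: int = 50):
        self.threshold = threshold  # illustrative value, not Marin's
        self.samples: deque[float] = deque(maxlen=window)

    def record_step(self, step_flops: float, step_seconds: float,
                    peak_flops_per_sec: float) -> None:
        # MFU = achieved FLOP/s divided by the hardware's peak FLOP/s.
        self.samples.append(step_flops / step_seconds / peak_flops_per_sec)

    def healthy(self) -> bool:
        # Don't gate until the window is full; then compare p50 to threshold.
        if len(self.samples) < self.samples.maxlen:
            return True
        return median(self.samples) >= self.threshold
```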

16 PRs this week, 33 new comments, and 1 new issue (4 total)

#3100 Data Sources for Pre-training


Summary: We will need 20T of high-quality tokens (code in particular) for our large MoE runs in Q2/Q3; this epic tracks the March work to enable that.

0/4 sub-issues closed

After last week's tokenization debugging, @ravwojdyla shifted to the Luxical embedding experiment for quality and topic evaluation (#3191) — the Luxical creator @lukemerrick dropped in with guidance on embedding storage and model usage (#3049). Vortex got GCS support (#3268) and @Helw150 added Nemotron V2 data (#3317). Zephyr gained group_by enhancements — secondary sort (#3250) and generator reducers (#3247). Tokenization tuning (#3170) and download reliability fixes (#3324, #3142) rounded out the pipeline work.
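
As a rough illustration of the group_by enhancements (not Zephyr's actual API): secondary sort orders records within each group, and a generator reducer streams output rows instead of materializing the whole group.

```python
from itertools import groupby
from typing import Iterable, Iterator

def group_by(records: Iterable[dict], key: str,
             secondary_sort: str) -> Iterator[tuple[object, Iterator[dict]]]:
    # Secondary sort: records inside each group arrive ordered by a second key.
    ordered = sorted(records, key=lambda r: (r[key], r[secondary_sort]))
    for k, group in groupby(ordered, key=lambda r: r[key]):
        yield k, group

def dedup_reducer(group: Iterator[dict]) -> Iterator[dict]:
    # Generator reducer: yield output rows lazily, one at a time.
    seen = set()
    for rec in group:
        if rec["doc_id"] not in seen:
            seen.add(rec["doc_id"])
            yield rec

# Usage: consume each group's reducer output before advancing to the next.
# for key, group in group_by(records, key="domain", secondary_sort="timestamp"):
#     for row in dedup_reducer(group):
#         emit(row)
```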

13 PRs this week, 6 new comments, and 2 new issues (4 total)

#3192 Synthetic Data


0/4 sub-issues closed

Progress on the SFT front after last week's 0% resolve rate: @AlienKevin reported that switching the student model to Qwen2.5-Coder-32B-Instruct reached 5/43 resolved on the Rust subset of SWE-bench Multilingual (#2956). A TRL sanity check on Modal confirmed the Marin SFT pipeline isn't at fault for the earlier repetition issues. @moojink followed up with experiments using the larger Qwen3-235B-A22B teacher model and rejection sampling (#2262). No merged PRs this week, but active experimental progress.
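
The rejection-sampling recipe in #2262 follows a standard pattern: draw several candidate solutions from the teacher and keep only those that pass the task's tests. A sketch, where teacher.generate and problem.run_tests are hypothetical interfaces standing in for the actual teacher endpoint and SWE-bench harness:

```python
def rejection_sample_sft(teacher, problem, k: int = 8) -> list[dict]:
    """Draw k candidate patches from the teacher; keep only the ones that
    pass the problem's test suite as SFT training pairs."""
    kept = []
    for _ in range(k):
        patch = teacher.generate(problem.prompt, temperature=0.8)
        if problem.run_tests(patch):  # reject samples that fail the tests
            kept.append({"prompt": problem.prompt, "completion": patch})
    return kept
```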

0 PRs this week, 3 new comments, and 0 new issues (4 total)

Other Changes


Documentation alignment across the repo by @dlwh-golem — TPU cluster setup (#3415), contributing hooks (#3307), MkDocs commands (#3271), README paths (#3235). @gonzalobenegas added LLR-based VEP eval for DNA models (#3144) and perplexity vs. downstream task performance EDA (#3333). CI improvements for PR checks on all target branches (#3150).
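
For context on the LLR-based VEP eval: the standard formulation scores a single-nucleotide variant as log P(alt) minus log P(ref) under the DNA LM at the variant position, with strongly negative scores suggesting the model finds the alternate allele implausible in context. A sketch of that core, assuming per-position log-softmax outputs over the 4-letter nucleotide vocabulary (not necessarily #3144's exact code):

```python
import numpy as np

def variant_llr(log_probs: np.ndarray, pos: int, ref: int, alt: int) -> float:
    """LLR score for a single-nucleotide variant.

    log_probs: per-position log-softmax outputs from a DNA LM over the
    nucleotide vocabulary, shape [seq_len, 4] (A, C, G, T).
    """
    # log P(alt | context) - log P(ref | context) at the variant position.
    return float(log_probs[pos, alt] - log_probs[pos, ref])
```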

43 PRs this week, 71 new comments, and 73 issues closed (73 total)
