Week of February 23rd summary for marin-community/marin

Milestone: Kick-off a 32B-A4B 10T token MoE training run & advance scaling laws work & get ~15T+ tokens ready
73 merged, 5 opened, 39 issues closed, 11 contributors, 4 epics, and 237 comments this week

This week centered on Iris reliability hardening, Grug's module-first API refactor, and early MoE training experiments on TPU. The first CoreWeave GPU canary ferry was stood up, and @ClassicLarry got Grug MoE running with replicated weights on v4 and v5p.

#2836 Infrastructure: MoE Training Support


Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.

24/41 sub-issues closed

Iris saw a major reliability push: @yonromai added request-level RPC observability (#3073), auto-retry on DEADLINE_EXCEEDED (#3067), container phase tracking in place of a boolean (#3105, #3106), local K8s e2e tests (#3097), and CW RBAC + scheduling fixes (#3111). @rjpower landed live resource monitoring in the dashboard (#3085), fractional CPU support via millicores (#3040), and a unified WorkerConfig proto (#3077), and replaced multi-region AR push with GHCR + AR remote repos (#2996). @dlwh refactored Grug to a module-first API (#3017), tuned fused cross-entropy for v4 with an XLA fallback (#3052), set TPU VMEM defaults (#3053), and added Pallas cost estimates (#2999). In Zephyr, @ravwojdyla added heartbeat timeout suppression (#3014) and replaced shared data with disk-based serialization (#2986).

60 PRs this week, 49 new comments, and 3 new issues (41 total)

#3100 Data Sources for Pre-training


Summary: We will need 20T of high-quality (including / in particular code) tokens for our large MoE runs in Q2/Q3; this is the work in March that we will do to enable that.

0/4 sub-issues closed

@ravwojdyla spent the week debugging tokenization at scale in preemptible environments (#2829), cataloging and fixing a series of Zephyr bugs involving shared data, heartbeat timeouts, and Zarr metadata errors. Nemotron tokenization got a disk_cache tokenizer and configurable shard counts (#2984, #3036). @Helw150 provided context on the removal of the BERT/fastText quality classifiers, which motivates the Luxical embedding exploration (#3049).

3 PRs this week, 8 new comments, and 4 new issues (4 total)

#3096 Pre-training: 32B MoE Kick-off


1/4 sub-issues closed

@ClassicLarry got Grug MoE running with replicated weights on v5p (#3064), benchmarking MFU across configurations and demonstrating that v5p's 95GB HBM allows full weight replication with batch-only sharding. @yonromai stood up the first daily CoreWeave GPU canary ferry workflow (#3050). There was active discussion of MoE expert-parallel benchmarks by @dlwh on v5p-8 (#2710), and @ClassicLarry began scoping the MoE isoflop sweep (#2167).

3 PRs this week, 43 new comments, and 1 new issue (4 total)

#3192 Synthetic Data


0/4 sub-issues closed

@AlienKevin's SFT run on SWE-smith trajectories finished in 8.5 hours but yielded a 0% resolve rate on SWE-bench Multilingual with gpt-5-mini as the teacher (#2956). Two alternative teacher models (minimax-m2.5 and another) are being tried. Repetition issues with Qwen3-8B during SFT prompted a sanity check using TRL on Modal. @moojink continued the OpenThoughts4 teacher model comparison (#2262), with @natolambert following along.

0 PRs this week, 7 new comments, and 2 new issues (4 total)

Other Changes


@gonzalobenegas added DNA experiments covering promoters, genomic regions, and k-mer tokenization (#2992), plus auto-detection of BOS/EOS tokens in the DNA batch tokenizer (#3055). @teetone updated Evalchemy non-math evaluation domains (#3128). Agent recipe and scrub skill improvements came from @dlwh (#3129) and @rjpower (#3056).

12 PRs this week, 24 new comments, and 39 issues closed (39 total)