Week of February 9th summary for marin-community/marin

Milestone: Reliable, repeatable, enjoyable infrastructure
64 merged · 0 opened · 40 issues closed · 10 contributors · 0 epics · 252 comments this week

This week was dominated by a major infrastructure push on Iris: the cluster orchestrator gained autoscaling, a revamped dashboard, controller-initiated heartbeats, chaos testing, and Zephyr GPU support. Alongside it came data pipeline refactoring, DNA experiment consolidation, RL model expansion, and early Fray v2 API work.

Other Changes


Iris cluster orchestrator overhaul

The bulk of the week's effort went into Iris, Marin's cluster orchestration layer, with @rjpower driving a sweeping set of changes across reliability, observability, and GPU support.

64 PRs this week, 100 new comments, and 40 issues closed (40 total)

Training Runs This Week


The largest training run this week was a 50-hour SFT of Qwen3-Base 7B on OpenThoughts4 math data (240k examples) by Moo Jin Kim, reaching 38.9% MFU on a v5p-32. Will Held launched a mega-sweep campaign for 2.5B Qwen3 configurations (33 runs across two batch-size/hidden-dim combos) as groundwork for MoE scaling decisions, plus a 1B Grugformer dense speedrun. Calvin Xu ran the most extensive data mixture experiments yet: 124 runs comparing single-phase epoch and no-epoch strategies.

| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| #2199 SFT: Qwen3-Base on OpenThoughts4 + math (240k examples) | @moojink | v5p-32 (16 chips) | 5e19 | 50.0h | 38.9% | loss=0.145, Qwen3-Base 7B SFT, 4682 steps, seq_len=32768, bs=256 | completed #1 |
| #2499 Grugformer dense 1B speedrun | @Helw150 | v5p-8 (4 chips) | 2.5e18 | 0.9h | 30.1% | loss=4.624, 1B Grugformer dense model speedrun | completed #1 |
| #2499 Mega-sweep: 2.5B Qwen3 (bs32, hid512) | @Helw150 | v5p-8 (4 chips) | 2.5e18 | 0.9-1.4h per run | 34.6-35.6% | 16 trials, loss~3.5-3.7, large-batch MoE hyperparameter search | completed #1 |
| #2499 Mega-sweep: 2.5B Qwen3 (bs16, hid256) | @Helw150 | v5p-8 (4 chips) | 2.5e18 | 1.3-2.2h per run | 11.6-16.3% | 16 trials, loss~4.1-4.3, smaller hidden dim search | completed #1 |
| Data mixture: single-phase epoch sweep | @Calvin-Xu | v5p-8 (4 chips) | 1e18 | 1.3-3.4h per run | 35.5-36.7% | 62 runs + baselines, loss range 0.48-3.3, epoch-based data mixing | completed #1 |
| Data mixture: single-phase no-epoch sweep | @Calvin-Xu | v5p-8 (4 chips) | 1e18 | 1.5-3.4h per run | 20.0-36.6% | 62 runs + baselines, loss range 2.96-3.4, no-epoch data mixing | completed #1 |