Week of February 9th summary for marin-community/marin

The week was dominated by a massive infrastructure push on Iris: the cluster orchestrator gained autoscaling, a revamped dashboard, controller-initiated heartbeats, chaos testing, and Zephyr GPU support. Other work included data pipeline refactoring, DNA experiment consolidation, RL model expansion, and early Fray v2 API work.

Other Changes

Iris cluster orchestrator overhaul

The bulk of the week's effort went into Iris, Marin's cluster orchestration layer, with @rjpower driving a sweeping set of changes across reliability, observability, and GPU support.

64 PRs this week, 100 new comments, and 40 issues closed (40 total)

#2807 Ignore entire dot dir +1 −0 @ravwojdyla
#2806 Iris/Fray/Marin nits: simplify TPU env, separate loop intervals, improve error reporting 💬1 +20 −63 @rjpower
#2805 iris: show accelerator in dashboard resource card + coscheduling TPU test 💬1 +83 −1 @rjpower
#2800 Codex/tighten claude review prompt 💬1 +42 −19 @rjpower
#2799 Fix Claude workflow config for commits and PR reviews 💬1 +33 −21 @rjpower
#2798 Try to fix claude PRs/branches for fix_issue recipe 💬1 +6 −2 @ravwojdyla
#2797 Extract device env vars into shared env.py module 💬3 +480 −180 @rjpower
#2796 Replace v3-8 TPU references with v4-8 in tests +7 −7 @rjpower
#2795 Restrict Claude Code Action to claude[bot] user +2 −0 @rjpower
#2794 iris: add 100MB max bundle size validation 💬3 +47 −3 @rjpower
#2793 Ideas for agent PRs. 💬2 +151 −18 @rjpower
#2792 Tweaks to get iris configs fixed, error on missing tpus. +38 −43 @rjpower
#2791 Refactor time utilities to use Deadline and Timer classes +35 −49 @rjpower
#2788 Frayv2 - Fix logging for local backends 💬1 +28 −5 @rjpower
#2786 Iris: fix resource leak in coscheduled failure cascade +62 −5 @rjpower
#2784 Clarify TPU fleet-size metric guidance +6 −2 @yonromai
#2783 Iris: improve uv and cargo caching for task containers 💬3 +51 −52 @rjpower
#2782 Fix scheduler error messages to show actual rejection reasons 💬14 +479 −112 @claude
#2781 Fix TPU chip counts in cluster configs by deriving from topology +12 −17 @moojink
#2780 Fix autoscaler routing demand for dead/cancelled jobs +45 −1 @rjpower
#2779 Integrate Evalchemy reasoning evals into Marin +2067 −0 @moojink
#2778 Iris: fix CLI tunnel hang and local cluster lifecycle +113 −78 @yonromai
#2773 Replace ThreadPoolExecutor with daemon threads in cluster shutdown +44 −29 @rjpower
#2771 Iris: fix autoscaler dashboard and add slice visibility 💬1 +199 −32 @rjpower
#2770 Replace Copilot PR reviews with Claude PR reviews +58 −112 @rjpower
#2767 Reduce pre-commit output noise 💬1 +97 −94 @rjpower
#2765 Install Python and uv in Claude Code workflow +9 −1 @rjpower
#2764 Iris: Add type-safe memray memory profiling 💬4 +742 −329 @rjpower
#2763 Always run Zephyr CI on merge to main +0 −4 @ravwojdyla
#2760 Iris: fix slice reaping issues +541 −1146 @rjpower
#2758 Iris: depth-first scheduler to guarantee job tree progress 💬6 +351 −15 @rjpower
#2757 Fixup Iris cluster fixture 💬1 +2 −3 @ravwojdyla-agent
#2756 Iris: E2E testing normalization 💬13 +2478 −2417 @rjpower
#2755 Add --no-sync flag to dev_tpu execute +21 −8 @dlwh
#2753 Fix Iris transactions tab to show newest first and limit to 100 +3 −2 @rjpower
#2750 pyrefly: disable ignore-file filtering in workspace config +4 −0 @dlwh
#2748 Update dev TPU guide and docs navigation +148 −23 @dlwh
#2747 Script to configure marin temp buckets 💬10 +183 −3 @ravwojdyla
#2746 Fix fray CLI +8 −13 @ravwojdyla
#2743 Iris Platform redesign: replace cluster/vm/ with cluster/platform/ 💬2 +8562 −8326 @rjpower
#2732 Fix mixed-precision dtype and OOM in fused Pallas backward kernel +8 −8 @moojink
#2731 Address comments from zephyr merge 💬2 +72 −622 @rjpower
#2728 Add py-spy profiling integration for Iris 💬8 +819 −212 @rjpower
#2725 Support resumable writes in write_levanter_cache 💬2 +222 −16 @Calvin-Xu
#2723 Iris: add validation for config loading +99 −11 @rjpower
#2722 Move autoscaler from vm/ to controller/ layer 💬6 +688 −598 @rjpower
#2720 Fix child jobs disappearing when parent succeeds (#2694) +31 −0 @rjpower
#2718 Add a default prompt for the claude yml so it actually runs tests. +10 −4 @rjpower
#2717 Prune dead workers from controller state on failure +43 −0 @rjpower
#2716 Migrate repo license headers to SPDX identifiers +648 −7704 @dlwh
#2715 Add ability to shuffle dataset before train/val split +42 −10 @moojink
#2714 Remove in-progress dataset length APIs and waiting semantics +99 −388 @dlwh
#2713 Add first/all exhaustion stop strategies to MixtureDataset +200 −28 @dlwh
#2712 Move uv/pip command construction from types.py to entrypoint.py 💬2 +28 −30 @rjpower
#2711 Use SliceLifecycleState enum instead of string literals 💬6 +84 −35 @rjpower
#2709 Make dupekit a proxy package that defers Rust build until use +73 −36 @rjpower
#2707 Allow scaling groups to accept demand in non-terminal states and clean up task stopping +31 −29 @rjpower
#2706 Default JAX_PLATFORMS only for CPU-only fray jobs +118 −8 @dlwh
#2705 docs: add agent-directed research logbook recipe 💬1 +267 −0 @dlwh
#2697 WandB Aesthetics and Off-By-One +52 −8 @Helw150
#2612 Make feistel default; remove linear permutation 💬1 +40 −75 @pc0618
#2565 Frayv2: Migrate Zephyr/Levanter/Executor 💬4 +3897 −2013 @rjpower
#2475 Add TPU observability doc +446 −0 @yonromai
#2281 Gated Attention & Scaling Speedruns 💬2 +2536 −29 @Calvin-Xu

Training Runs This Week

The largest training run this week was a 50-hour SFT of Qwen3-Base 7B on OpenThoughts4 math data (240k examples) by Moo Jin Kim, reaching 38.9% MFU on a v5p-32. Will Held launched a mega-sweep campaign for 2.5B Qwen3 configurations (33 runs across two batch-size/hidden-dim combos) as groundwork for MoE scaling decisions, plus a 1B Grugformer dense speedrun. Calvin Xu ran the most extensive data mixture experiments yet: 124 runs comparing single-phase epoch and no-epoch strategies.

Run	Owner	Hardware	FLOPs	Wall Time	MFU	Evals	Status
#2199 SFT: Qwen3-Base on OpenThoughts4 + math (240k examples)	@moojink	v5p-32 (16 chips)	5e19	50.0h	38.9%	loss=0.145, Qwen3-Base 7B SFT, 4682 steps, seq_len=32768, bs=256	completed #1
#2499 Grugformer dense 1B speedrun	@Helw150	v5p-8 (4 chips)	2.5e18	0.9h	30.1%	loss=4.624, 1B Grugformer dense model speedrun	completed #1
#2499 Mega-sweep: 2.5B Qwen3 (bs32, hid512)	@Helw150	v5p-8 (4 chips)	2.5e18	0.9-1.4h per run	34.6-35.6%	16 trials, loss~3.5-3.7, large-batch MoE hyperparameter search	completed #1
#2499 Mega-sweep: 2.5B Qwen3 (bs16, hid256)	@Helw150	v5p-8 (4 chips)	2.5e18	1.3-2.2h per run	11.6-16.3%	16 trials, loss~4.1-4.3, smaller hidden dim search	completed #1
Data mixture: single-phase epoch sweep	@Calvin-Xu	v5p-8 (4 chips)	1e18	1.3-3.4h per run	35.5-36.7%	62 runs + baselines, loss range 0.48-3.3, epoch-based data mixing	completed #1
Data mixture: single-phase no-epoch sweep	@Calvin-Xu	v5p-8 (4 chips)	1e18	1.5-3.4h per run	20.0-36.6%	62 runs + baselines, loss range 2.96-3.4, no-epoch data mixing	completed #1
