A major infrastructure week: Iris got a deep architectural overhaul (platform abstraction, depth-first scheduling, profiling, slice management), Fray v2 absorbed the Zephyr and Levanter workloads, and the training stack picked up mixed-precision kernel fixes, dataset API improvements, and new evaluation tooling.
The Iris cluster orchestrator received its most sweeping set of changes yet. @rjpower replaced the tangled cluster/vm/ package with a clean cluster/platform/ abstraction (#2743), moved the autoscaler into the controller layer where it logically belongs (#2722), and introduced a depth-first scheduler that guarantees progress for job trees by prioritizing child tasks over unrelated root-level work (#2758). Slice management was simplified by removing the GCP API probing cache in favor of pure worker-timeout-based reclamation (#2760).
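The scheduling rule is easy to state: deeper tasks always outrank shallower ones, so an admitted job tree drains before new root-level work starts. A minimal sketch of that priority order (task names and tuple layout are illustrative, not the actual Iris scheduler structures):

```python
import heapq

# Each pending entry is (-depth, submit_order, task_name); heapq pops the
# smallest tuple, so the deepest task always runs next and FIFO order
# breaks ties among tasks at the same depth.
queue = []
heapq.heappush(queue, (0, 0, "root-b"))  # unrelated root-level job
heapq.heappush(queue, (-1, 1, "root-a/child"))
heapq.heappush(queue, (-2, 2, "root-a/child/grandchild"))

while queue:
    _, _, name = heapq.heappop(queue)
    print(name)  # grandchild, then child, then root-b last
```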
Observability and profiling got substantial attention: py-spy and memray profiling are now available on demand through the dashboard, CLI, and RPC (#2728, #2764), the autoscaler dashboard was fixed and now shows individual slice rows (#2771), and dead workers are properly pruned from controller state (#2717). Several resource-management bugs were fixed, including a resource leak in coscheduled failure cascades (#2786) and the autoscaler routing demand to jobs that had already died (#2780).
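For a sense of how on-demand profiling can work, here is a sketch of a handler that shells out to py-spy against a live worker process; the function name and return shape are assumptions, but `py-spy record --pid --duration --output` is the tool's real CLI:

```python
import subprocess
import tempfile

def profile_worker(pid: int, duration_s: int = 10) -> bytes:
    """Record a CPU flame graph (SVG) for a running worker process."""
    # Note: py-spy needs ptrace permission on the target process.
    with tempfile.NamedTemporaryFile(suffix=".svg") as out:
        subprocess.run(
            ["py-spy", "record", "--pid", str(pid),
             "--duration", str(duration_s), "--output", out.name],
            check=True,
        )
        return out.read()
```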
@yonromai fixed CLI tunnel hangs and local cluster lifecycle issues (#2778), and config validation was consolidated into a single entry point (#2723). The E2E test suite was normalized into a single tests/e2e/ directory with Playwright-based dashboard assertions (#2756).
@rjpower completed the migration of the Zephyr and Levanter workloads to Fray v2 (#2565), a major lift-and-shift that moves the core training and data-processing orchestration onto the new execution framework. Follow-up work cleaned up the post-merge code (#2731), fixed local backend logging (#2788), and repaired the Fray CLI (@ravwojdyla, #2746). @dlwh ensured CPU-only Fray jobs correctly default to JAX_PLATFORMS=cpu (#2706).
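The CPU defaulting fix boils down to one environment rule. A minimal sketch, assuming a hypothetical `job_env` helper (JAX_PLATFORMS itself is JAX's real platform-selection variable):

```python
def job_env(base_env: dict[str, str], num_tpu_chips: int) -> dict[str, str]:
    """Build the environment for a job before launch."""
    env = dict(base_env)
    if num_tpu_chips == 0:
        # Without this, JAX probes for accelerators at startup and can
        # fail or hang on machines that have none.
        env.setdefault("JAX_PLATFORMS", "cpu")
    return env
```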
TPU environment setup was consolidated: device env var construction was extracted into a shared env.py module used by both Docker and process runtimes (#2797), and TPU chip counts in cluster configs are now derived from topology rather than hardcoded (@moojink, #2781).
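Deriving chip counts from topology is a small computation: a topology string like "2x2x1" multiplies out to the slice's chip count. A sketch, with the helper name being illustrative:

```python
from math import prod

def chips_from_topology(topology: str) -> int:
    """Chip count for a TPU slice topology such as '2x2x1' or '4x4x8'."""
    return prod(int(dim) for dim in topology.split("x"))

assert chips_from_topology("2x2x1") == 4    # e.g. a v5p-8 slice
assert chips_from_topology("4x4x8") == 128
```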
@Calvin-Xu landed gated attention with scaling speedruns (#2281), implementing the mechanism from Qiu et al. and sweeping for optimal learning-rate scaling. @moojink fixed a mixed-precision dtype mismatch in the fused Pallas cross-entropy backward kernel that was causing OOMs and crashes in bfloat16 training (#2732).
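For readers unfamiliar with the mechanism: Qiu et al. gate the attention output elementwise with a sigmoid computed from the same hidden states. A minimal JAX sketch of that idea (shapes and parameter names are illustrative, not Levanter's actual module):

```python
import jax

def gated_attention_output(hidden, attn_out, w_gate):
    # hidden:   [seq, d_model]  pre-attention hidden states
    # attn_out: [seq, d_model]  concatenated head outputs, pre-W_O
    # w_gate:   [d_model, d_model]  learned gate projection
    gate = jax.nn.sigmoid(hidden @ w_gate)  # values in (0, 1)
    return gate * attn_out                  # elementwise gating; W_O follows
```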
Dataset handling saw several improvements: @dlwh added first/all exhaustion stop strategies to MixtureDataset (#2713) and removed the in-progress dataset length APIs, simplifying the interface to finite-known-length vs infinite semantics (#2714). @moojink added the ability to shuffle datasets before train/val split (#2715), and @pc0618 made feistel the default (and only) permutation type for LM mixtures (#2612). @Calvin-Xu added resumable writes to write_levanter_cache so that worker preemption no longer wipes hours of tokenization progress (#2725).
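The Feistel permutation is worth a brief aside, since it is what makes these shuffles deterministic and seekable: a few keyed mixing rounds define a bijection on a power-of-two domain, and cycle-walking restricts it to [0, n). A sketch under those assumptions (round function and constants are illustrative, not the actual #2612 code):

```python
def _feistel(i: int, half_bits: int, key: int, rounds: int) -> int:
    mask = (1 << half_bits) - 1
    left, right = i >> half_bits, i & mask
    for r in range(rounds):
        # Keyed round function; it need not be invertible for the
        # overall Feistel structure to be a bijection.
        f = ((right * 0x9E3779B1) ^ key ^ r) & mask
        left, right = right, left ^ f
    return (left << half_bits) | right

def feistel_permute(i: int, n: int, key: int, rounds: int = 4) -> int:
    """Deterministic pseudorandom permutation of range(n), O(1) per index."""
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    j = _feistel(i, half_bits, key, rounds)
    while j >= n:  # cycle-walk values that land outside [0, n)
        j = _feistel(j, half_bits, key, rounds)
    return j

assert sorted(feistel_permute(i, 10, key=42) for i in range(10)) == list(range(10))
```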
@Helw150 fixed WandB double-initialization that caused BrokenPipeErrors and an off-by-one in tracker metrics (#2697).
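The usual guard against this failure mode is to make initialization idempotent. A sketch (not the actual patch), using wandb's real `wandb.run` / `wandb.init` API:

```python
import wandb

def get_or_init_run(**init_kwargs):
    """Return the active W&B run, initializing one only if none exists."""
    if wandb.run is not None:  # a run is already live in this process
        return wandb.run
    return wandb.init(**init_kwargs)
```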
@moojink integrated the Evalchemy evaluation library for reasoning tasks (AIME, AMC, HMMT, MATH500, HumanEval+, MBPP+, LiveCodeBench) into Marin (#2779), adding a new EvalchemyEvaluator that runs these benchmarks as part of the standard eval pipeline.
@rjpower replaced Copilot PR reviews with Claude-based reviews (#2770) and tightened the review prompt for terse, actionable output (#2800). The Claude Code GitHub Action was configured with proper Python/uv setup (#2765) and restricted to the claude[bot] user for security (#2795). A scheduler error message improvement was itself authored by Claude, replacing boolean rejection returns with structured RejectionReason objects (#2782).
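The pattern is worth showing because it generalizes: instead of returning False, the scheduler returns an object that explains the rejection, and the error message renders it directly. A sketch with illustrative field names, not the actual #2782 types:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RejectionReason:
    resource: str      # e.g. "tpu_chips"
    requested: int
    available: int

    def __str__(self) -> str:
        return (f"insufficient {self.resource}: requested "
                f"{self.requested}, available {self.available}")

def try_place(requested: int, available: int) -> Optional[RejectionReason]:
    """Return None on success, or a reason the placement was rejected."""
    if requested > available:
        return RejectionReason("tpu_chips", requested, available)
    return None
```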
@dlwh migrated the entire repo's license headers to SPDX identifiers (#2716) and added an agent-directed research logbook recipe (#2705). Pre-commit output was streamlined to single ok/FAIL lines with a failure summary (#2767). @yonromai added a comprehensive TPU observability doc (#2475), and @dlwh updated the dev TPU guide (#2748).
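For reference, an SPDX migration replaces each multi-line license header with a single machine-readable comment; assuming the repo is Apache-2.0 licensed, a Python file's header shrinks to:

```python
# SPDX-License-Identifier: Apache-2.0
```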
@rjpower made dupekit a proxy package that defers the Rust build until actual use (#2709), eliminating build friction for contributors who don't need deduplication.
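The proxy trick likely relies on lazy module attributes (PEP 562); a sketch of the idea, with the extension module name hypothetical and the actual #2709 mechanics possibly differing:

```python
# dupekit/__init__.py (sketch): defer importing the compiled Rust
# extension until an attribute is first used, so installation never
# triggers a Rust toolchain.
import importlib

def __getattr__(name: str):
    # "_dupekit_rs" stands in for the real extension module; the build
    # cost (or failure) is paid on first use, not at install time.
    ext = importlib.import_module("_dupekit_rs")
    return getattr(ext, name)
```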
An infrastructure-focused week with lighter training activity. @Helw150 ran a 16-trial Vizier hyperparameter sweep for the 130M Qwen3 reference configuration on 10B tokens, achieving a best loss of 3.333, and continued the loop-3 sweeps from prior weeks. @Calvin-Xu ran 7 baseline data-mixture configurations. No large-scale pre-training runs landed this week, as the focus was on the Iris platform overhaul and the Fray v2 migration.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Results / Notes | Status |
|---|---|---|---|---|---|---|---|
| #2499 Ref sweep: Qwen3-130M Vizier 10B tokens | @Helw150 | v5p-8 (4 chips) | 1e19 | 3.5-3.6h per run | 30.0-35.4% | best loss=3.333, 16 Vizier trials, 130M Qwen3 on 10B tokens, bs=64 | completed |
| #2499 Ref sweep: Qwen3-130M Vizier v3 (continued) | @Helw150 | v5p-8 (4 chips) | 3e18 | 0.6-0.7h per run | 36.3-36.5% | loss~3.6-3.7, loop-3 continuation, bs=64 | completed |
| Data mixture sweep: two-phase StarCoder v4 (baselines) | @Calvin-Xu | v5p-8 (4 chips) | 1e18 | 1.1-1.2h per run | 35.9-37.0% | 7 baseline runs, loss range 0.14-3.42, 768-dim Qwen3 | completed |