Week of February 2nd summary for marin-community/marin

Milestone: Reliable, repeatable, enjoyable infrastructure
42 PRs merged · 1 opened · 93 issues closed · 6 contributors · 0 epics · 124 comments this week

A major infrastructure week: Iris got a deep architectural overhaul (platform abstraction, depth-first scheduling, profiling, slice management), Fray v2 absorbed Zephyr and Levanter workloads, and the training stack saw fixes to mixed-precision kernels, dataset APIs, and evaluation tooling.

This Week's Work


Iris Platform Overhaul

The Iris cluster orchestrator received its most sweeping set of changes yet. @rjpower replaced the tangled cluster/vm/ package with a clean cluster/platform/ abstraction (#2743), moved the autoscaler into the controller layer where it logically belongs (#2722), and introduced a depth-first scheduler that guarantees progress for job trees by prioritizing child tasks over unrelated root-level work (#2758). Slice management was simplified by removing the GCP API probing cache in favor of pure worker-timeout-based reclamation (#2760).
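The depth-first policy can be illustrated with a small sketch (names and structure are hypothetical; the real Iris scheduler in #2758 is more involved): pending tasks are ordered so that descendants of in-flight jobs run before unrelated root-level work, which keeps started job trees progressing.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    depth: int      # 0 for root jobs; children are parent depth + 1
    submitted: int  # monotonically increasing submission order

def depth_first_order(pending: list[Task]) -> list[Task]:
    # Deeper tasks first: children of running jobs beat new root work.
    # Ties break on submission order so scheduling stays deterministic.
    return sorted(pending, key=lambda t: (-t.depth, t.submitted))

pending = [
    Task("new-root-job", depth=0, submitted=3),
    Task("train/shard-0", depth=1, submitted=1),
    Task("train/shard-0/eval", depth=2, submitted=2),
]
ordered = depth_first_order(pending)
```

With this ordering, the eval task spawned by `train/shard-0` is placed ahead of the freshly submitted root job even though it arrived later.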

Observability and profiling got substantial attention: py-spy and memray profiling are now available on-demand through the dashboard, CLI, and RPC (#2728, #2764), the autoscaler dashboard was fixed and now shows individual slice rows (#2771), and dead workers are properly pruned from controller state (#2717). Several resource-management bugs were fixed, including a leak in coscheduled failure cascades (#2786) and autoscaler routing demand for dead jobs (#2780).

@yonromai fixed CLI tunnel hangs and local cluster lifecycle issues (#2778), and config validation was consolidated into a single entry point (#2723). The E2E test suite was normalized into a single tests/e2e/ directory with Playwright-based dashboard assertions (#2756).

Fray v2 and Execution Infrastructure

@rjpower completed the migration of Zephyr and Levanter workloads to Fray v2 (#2565), a major lift-and-shift that moves the core training and data-processing orchestration onto the new execution framework. Follow-up work cleaned up the post-merge code (#2731), fixed local backend logging (#2788), and fixed the Fray CLI (@ravwojdyla, #2746). @dlwh ensured CPU-only Fray jobs default JAX_PLATFORMS=cpu correctly (#2706).
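The CPU-defaulting behavior from #2706 amounts to something like the following sketch (function and parameter names are illustrative, not the actual Fray code): jobs that request no accelerators get `JAX_PLATFORMS=cpu` so JAX does not probe for missing devices, while an explicit user setting always wins.

```python
def resolve_jax_platforms(requested_tpus: int, env: dict[str, str]) -> dict[str, str]:
    # Illustrative sketch: CPU-only jobs default JAX_PLATFORMS=cpu,
    # but never override a value the user set explicitly.
    env = dict(env)
    if requested_tpus == 0:
        env.setdefault("JAX_PLATFORMS", "cpu")
    return env
```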

TPU environment setup was consolidated: device env var construction was extracted into a shared env.py module used by both Docker and process runtimes (#2797), and TPU chip counts in cluster configs are now derived from topology rather than hardcoded (@moojink, #2781).
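Deriving chip counts from topology (#2781) rests on a simple identity: a TPU topology string like "2x2x1" encodes the chip grid, and the chip count is the product of its dimensions. A minimal sketch (the helper name is hypothetical):

```python
import math

def chips_from_topology(topology: str) -> int:
    # A topology string such as "2x2x1" or "4x4x4" lists grid
    # dimensions; the total chip count is their product.
    return math.prod(int(dim) for dim in topology.split("x"))
```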

Training and Data Pipeline

@Calvin-Xu landed gated attention with scaling speedruns (#2281), implementing the mechanism from Qiu et al. and sweeping for optimal learning rate scaling. @moojink fixed a mixed-precision dtype mismatch in the fused Pallas cross-entropy backward kernel that was causing OOMs and crashes in bfloat16 training (#2732).

Dataset handling saw several improvements: @dlwh added first/all exhaustion stop strategies to MixtureDataset (#2713) and removed the in-progress dataset length APIs, simplifying the interface to finite-known-length vs infinite semantics (#2714). @moojink added the ability to shuffle datasets before train/val split (#2715), and @pc0618 made feistel the default (and only) permutation type for LM mixtures (#2612). @Calvin-Xu added resumable writes to write_levanter_cache so that worker preemption no longer wipes hours of tokenization progress (#2725).
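Feistel permutations are attractive for shuffling LM mixtures because they give a deterministic pseudorandom bijection over indices in O(1) memory, with no shuffle table to materialize. A minimal sketch of the technique (not Marin's implementation), using cycle-walking to stay inside [0, n):

```python
import hashlib

def _round_fn(value: int, key: int, rnd: int, half_bits: int) -> int:
    # Keyed round function: hash (key, round, value) down to half_bits.
    digest = hashlib.blake2b(f"{key}:{rnd}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << half_bits) - 1)

def feistel_permute(index: int, n: int, key: int, rounds: int = 4) -> int:
    # Balanced Feistel network over 2*half bits; outputs that fall
    # outside [0, n) are re-permuted (cycle-walking) until they land
    # back in the domain, which preserves bijectivity on [0, n).
    half = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half) - 1
    x = index
    while True:
        left, right = x >> half, x & mask
        for rnd in range(rounds):
            left, right = right, left ^ _round_fn(right, key, rnd, half)
        x = (left << half) | right
        if x < n:
            return x
```

Because the Feistel rounds are invertible regardless of the round function, the map is a permutation by construction: the same `(n, key)` pair always yields the same shuffle, and resuming mid-epoch needs only the index.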

@Helw150 fixed WandB double-initialization that caused BrokenPipeErrors and an off-by-one in tracker metrics (#2697).

Evaluation

@moojink integrated the Evalchemy evaluation library for reasoning tasks (AIME, AMC, HMMT, MATH500, HumanEval+, MBPP+, LiveCodeBench) into Marin (#2779), adding a new EvalchemyEvaluator that runs these benchmarks as part of the standard eval pipeline.

Developer Experience and CI

@rjpower replaced Copilot PR reviews with Claude-based reviews (#2770) and tightened the review prompt for terse, actionable output (#2800). The Claude Code GitHub Action was configured with proper Python/uv setup (#2765) and restricted to the claude[bot] user for security (#2795). A scheduler error message improvement was itself authored by Claude, replacing boolean rejection returns with structured RejectionReason objects (#2782).

@dlwh migrated the entire repo's license headers to SPDX identifiers (#2716) and added an agent-directed research logbook recipe (#2705). Pre-commit output was streamlined to single ok/FAIL lines with a failure summary (#2767). @yonromai added a comprehensive TPU observability doc (#2475), and @dlwh updated the dev TPU guide (#2748).

@rjpower made dupekit a proxy package that defers the Rust build until actual use (#2709), eliminating build friction for contributors who don't need deduplication.
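The deferred-build pattern can be sketched with a lazy module proxy (illustrative, not dupekit's actual code): attribute access on the proxy triggers the real import, so the expensive native module is only loaded (and its build only required) when something actually uses it.

```python
import importlib
import types

def make_proxy(target: str) -> types.ModuleType:
    # Proxy module whose attributes resolve the real module lazily on
    # first access, via a PEP 562-style module-level __getattr__.
    proxy = types.ModuleType(f"{target}_proxy")
    state = {"real": None}

    def __getattr__(name):
        if state["real"] is None:
            state["real"] = importlib.import_module(target)  # deferred import
        return getattr(state["real"], name)

    proxy.__getattr__ = __getattr__
    return proxy
```

Merely creating the proxy does no importing; only touching an attribute does, which is what keeps `import dupekit` cheap for contributors who never call into the Rust code.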


Training Runs This Week


An infrastructure-focused week with lighter training activity. Will Held ran a 16-trial Vizier hyperparameter sweep for the 130M Qwen3 reference configuration on 10B tokens, achieving a best loss of 3.333, and continued loop-3 sweeps from prior weeks. Calvin Xu ran 7 baseline data mixture configurations. There were no large-scale pre-training runs this week, as the focus was on the Iris platform overhaul and Fray v2 migration.

#2499 Ref sweep: Qwen3-130M Vizier, 10B tokens
  Owner: @Helw150 · Hardware: v5p-8 (4 chips) · FLOPs: 1e19 · Wall time: 3.5-3.6h per run · MFU: 30.0-35.4%
  Evals: best loss=3.333, 16 Vizier trials, 130M Qwen3 on 10B tokens, bs=64 · Status: completed

#2499 Ref sweep: Qwen3-130M Vizier v3 (continued)
  Owner: @Helw150 · Hardware: v5p-8 (4 chips) · FLOPs: 3e18 · Wall time: 0.6-0.7h per run · MFU: 36.3-36.5%
  Evals: loss~3.6-3.7, loop 3 continuation, bs=64 · Status: completed

Data mixture sweep: two-phase StarCoder v4 (baselines)
  Owner: @Calvin-Xu · Hardware: v5p-8 (4 chips) · FLOPs: 1e18 · Wall time: 1.1-1.2h per run · MFU: 35.9-37.0%
  Evals: 7 baseline runs, loss range 0.14-3.42, 768-dim Qwen3 · Status: completed
