A major infrastructure week: Iris got a deep architectural overhaul (platform abstraction, depth-first scheduling, profiling, slice management), Fray v2 absorbed the Zephyr and Levanter workloads, and the training stack picked up mixed-precision kernel fixes, dataset API improvements, and new evaluation tooling.
The Iris cluster orchestrator received its most sweeping set of changes yet. @rjpower replaced the tangled cluster/vm/ package with a clean cluster/platform/ abstraction (#2743), moved the autoscaler into the controller layer where it logically belongs (#2722), and introduced a depth-first scheduler that guarantees progress for job trees by prioritizing child tasks over unrelated root-level work (#2758). Slice management was simplified by removing the GCP API probing cache in favor of pure worker-timeout-based reclamation (#2760).
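The scheduling rule is easy to state: deeper tasks always outrank shallower ones, so an admitted job tree drains before new root-level work starts. A minimal sketch of that priority order (task names and tuple layout are illustrative, not the actual Iris scheduler structures):

```python
import heapq

# Each pending entry is (-depth, submit_order, task_name); heapq pops the
# smallest tuple, so the deepest task always runs next and FIFO order
# breaks ties among tasks at the same depth.
queue = []
heapq.heappush(queue, (0, 0, "root-b"))  # unrelated root-level job
heapq.heappush(queue, (-1, 1, "root-a/child"))
heapq.heappush(queue, (-2, 2, "root-a/child/grandchild"))

while queue:
    _, _, name = heapq.heappop(queue)
    print(name)  # grandchild, then child, then root-b last
```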
Observability and profiling got substantial attention: py-spy and memray profiling are now available on demand through the dashboard, CLI, and RPC (#2728, #2764), the autoscaler dashboard was fixed and now shows individual slice rows (#2771), and dead workers are properly pruned from controller state (#2717). Several resource-management bugs were fixed, including a resource leak in coscheduled failure cascades (#2786) and the autoscaler routing demand to jobs that had already died (#2780).
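For a sense of how on-demand profiling can work, here is a sketch of a handler that shells out to py-spy against a live worker process; the function name and return shape are assumptions, but `py-spy record --pid --duration --output` is the tool's real CLI:

```python
import subprocess
import tempfile

def profile_worker(pid: int, duration_s: int = 10) -> bytes:
    """Record a CPU flame graph (SVG) for a running worker process."""
    # Note: py-spy needs ptrace permission on the target process.
    with tempfile.NamedTemporaryFile(suffix=".svg") as out:
        subprocess.run(
            ["py-spy", "record", "--pid", str(pid),
             "--duration", str(duration_s), "--output", out.name],
            check=True,
        )
        return out.read()
```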
@yonromai fixed CLI tunnel hangs and local cluster lifecycle issues (#2778), and config validation was consolidated into a single entry point (#2723). The E2E test suite was normalized into a single tests/e2e/ directory with Playwright-based dashboard assertions (#2756).
@rjpower completed the migration of the Zephyr and Levanter workloads to Fray v2 (#2565), a major lift-and-shift that moves the core training and data-processing orchestration onto the new execution framework. Follow-up work cleaned up the post-merge code (#2731), fixed local backend logging (#2788), and repaired the Fray CLI (@ravwojdyla, #2746). @dlwh ensured CPU-only Fray jobs correctly default to JAX_PLATFORMS=cpu (#2706).
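The CPU defaulting fix boils down to one environment rule. A minimal sketch, assuming a hypothetical `job_env` helper (JAX_PLATFORMS itself is JAX's real platform-selection variable):

```python
def job_env(base_env: dict[str, str], num_tpu_chips: int) -> dict[str, str]:
    """Build the environment for a job before launch."""
    env = dict(base_env)
    if num_tpu_chips == 0:
        # Without this, JAX probes for accelerators at startup and can
        # fail or hang on machines that have none.
        env.setdefault("JAX_PLATFORMS", "cpu")
    return env
```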
TPU environment setup was consolidated: device env var construction was extracted into a shared env.py module used by both Docker and process runtimes (#2797), and TPU chip counts in cluster configs are now derived from topology rather than hardcoded (@moojink, #2781).
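Deriving chip counts from topology is a small computation: a topology string like "2x2x1" multiplies out to the slice's chip count. A sketch, with the helper name being illustrative:

```python
from math import prod

def chips_from_topology(topology: str) -> int:
    """Chip count for a TPU slice topology such as '2x2x1' or '4x4x8'."""
    return prod(int(dim) for dim in topology.split("x"))

assert chips_from_topology("2x2x1") == 4    # e.g. a v5p-8 slice
assert chips_from_topology("4x4x8") == 128
```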
@Calvin-Xu landed gated attention with scaling speedruns (#2281), implementing the mechanism from Qiu et al. and sweeping for optimal learning-rate scaling. @moojink fixed a mixed-precision dtype mismatch in the fused Pallas cross-entropy backward kernel that was causing OOMs and crashes in bfloat16 training (#2732).
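For readers unfamiliar with the mechanism: Qiu et al. gate the attention output elementwise with a sigmoid computed from the same hidden states. A minimal JAX sketch of that idea (shapes and parameter names are illustrative, not Levanter's actual module):

```python
import jax

def gated_attention_output(hidden, attn_out, w_gate):
    # hidden:   [seq, d_model]  pre-attention hidden states
    # attn_out: [seq, d_model]  concatenated head outputs, pre-W_O
    # w_gate:   [d_model, d_model]  learned gate projection
    gate = jax.nn.sigmoid(hidden @ w_gate)  # values in (0, 1)
    return gate * attn_out                  # elementwise gating; W_O follows
```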
Dataset handling saw several improvements: @dlwh added first/all exhaustion stop strategies to MixtureDataset (#2713) and removed the in-progress dataset length APIs, simplifying the interface to finite-known-length vs infinite semantics (#2714). @moojink added the ability to shuffle datasets before train/val split (#2715), and @pc0618 made feistel the default (and only) permutation type for LM mixtures (#2612). @Calvin-Xu added resumable writes to write_levanter_cache so that worker preemption no longer wipes hours of tokenization progress (#2725).
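The Feistel permutation is worth a brief aside, since it is what makes these shuffles deterministic and seekable: a few keyed mixing rounds define a bijection on a power-of-two domain, and cycle-walking restricts it to [0, n). A sketch under those assumptions (round function and constants are illustrative, not the actual #2612 code):

```python
def _feistel(i: int, half_bits: int, key: int, rounds: int) -> int:
    mask = (1 << half_bits) - 1
    left, right = i >> half_bits, i & mask
    for r in range(rounds):
        # Keyed round function; it need not be invertible for the
        # overall Feistel structure to be a bijection.
        f = ((right * 0x9E3779B1) ^ key ^ r) & mask
        left, right = right, left ^ f
    return (left << half_bits) | right

def feistel_permute(i: int, n: int, key: int, rounds: int = 4) -> int:
    """Deterministic pseudorandom permutation of range(n), O(1) per index."""
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    j = _feistel(i, half_bits, key, rounds)
    while j >= n:  # cycle-walk values that land outside [0, n)
        j = _feistel(j, half_bits, key, rounds)
    return j

assert sorted(feistel_permute(i, 10, key=42) for i in range(10)) == list(range(10))
```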
@Helw150 fixed WandB double-initialization that caused BrokenPipeErrors and an off-by-one in tracker metrics (#2697).
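The usual guard against this failure mode is to make initialization idempotent. A sketch (not the actual patch), using wandb's real `wandb.run` / `wandb.init` API:

```python
import wandb

def get_or_init_run(**init_kwargs):
    """Return the active W&B run, initializing one only if none exists."""
    if wandb.run is not None:  # a run is already live in this process
        return wandb.run
    return wandb.init(**init_kwargs)
```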
@moojink integrated the Evalchemy evaluation library for reasoning tasks (AIME, AMC, HMMT, MATH500, HumanEval+, MBPP+, LiveCodeBench) into Marin (#2779), adding a new EvalchemyEvaluator that runs these benchmarks as part of the standard eval pipeline.
@rjpower replaced Copilot PR reviews with Claude-based reviews (#2770) and tightened the review prompt for terse, actionable output (#2800). The Claude Code GitHub Action was configured with proper Python/uv setup (#2765) and restricted to the claude[bot] user for security (#2795). A scheduler error message improvement was itself authored by Claude, replacing boolean rejection returns with structured RejectionReason objects (#2782).
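The pattern is worth showing because it generalizes: instead of returning False, the scheduler returns an object that explains the rejection, and the error message renders it directly. A sketch with illustrative field names, not the actual #2782 types:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RejectionReason:
    resource: str      # e.g. "tpu_chips"
    requested: int
    available: int

    def __str__(self) -> str:
        return (f"insufficient {self.resource}: requested "
                f"{self.requested}, available {self.available}")

def try_place(requested: int, available: int) -> Optional[RejectionReason]:
    """Return None on success, or a reason the placement was rejected."""
    if requested > available:
        return RejectionReason("tpu_chips", requested, available)
    return None
```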
@dlwh migrated the entire repo's license headers to SPDX identifiers (#2716) and added an agent-directed research logbook recipe (#2705). Pre-commit output was streamlined to single ok/FAIL lines with a failure summary (#2767). @yonromai added a comprehensive TPU observability doc (#2475), and @dlwh updated the dev TPU guide (#2748).
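For reference, an SPDX migration replaces each multi-line license header with a single machine-readable comment; assuming the repo is Apache-2.0 licensed, a Python file's header shrinks to:

```python
# SPDX-License-Identifier: Apache-2.0
```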
@rjpower made dupekit a proxy package that defers the Rust build until actual use (#2709), eliminating build friction for contributors who don't need deduplication.
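The proxy trick likely relies on lazy module attributes (PEP 562); a sketch of the idea, with the extension module name hypothetical and the actual #2709 mechanics possibly differing:

```python
# dupekit/__init__.py (sketch): defer importing the compiled Rust
# extension until an attribute is first used, so installation never
# triggers a Rust toolchain.
import importlib

def __getattr__(name: str):
    # "_dupekit_rs" stands in for the real extension module; the build
    # cost (or failure) is paid on first use, not at install time.
    ext = importlib.import_module("_dupekit_rs")
    return getattr(ext, name)
```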
An infrastructure-focused week with lighter training activity. @Helw150 ran a 16-trial Vizier hyperparameter sweep for the 130M Qwen3 reference configuration on 10B tokens, achieving a best loss of 3.333, and continued the loop-3 sweeps from prior weeks. @Calvin-Xu ran 7 baseline data-mixture configurations. No large-scale pre-training runs landed this week, as the focus was on the Iris platform overhaul and the Fray v2 migration.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Results / Notes | Status |
|---|---|---|---|---|---|---|---|
| #2499 Ref sweep: Qwen3-130M Vizier 10B tokens | @Helw150 | v5p-8 (4 chips) | 1e19 | 3.5-3.6h per run | 30.0-35.4% | best loss=3.333, 16 Vizier trials, 130M Qwen3 on 10B tokens, bs=64 | completed |
| #2499 Ref sweep: Qwen3-130M Vizier v3 (continued) | @Helw150 | v5p-8 (4 chips) | 3e18 | 0.6-0.7h per run | 36.3-36.5% | loss~3.6-3.7, loop-3 continuation, bs=64 | completed |
| Data mixture sweep: two-phase StarCoder v4 (baselines) | @Calvin-Xu | v5p-8 (4 chips) | 1e18 | 1.1-1.2h per run | 35.9-37.0% | 7 baseline runs, loss range 0.14-3.42, 768-dim Qwen3 | completed |