Week of January 5th summary for marin-community/marin

Milestone: Iris V1
20 PRs merged, 2 opened, 24 issues closed, 9 contributors, 0 epics, 67 comments this week

A week dominated by Iris infrastructure hardening and a major new Pallas fused cross-entropy kernel, with foundational SFT improvements landing alongside.

This Week's Work


Iris cluster manager overhaul

@rjpower drove a sustained push to stabilize and refine Iris, the project's cluster orchestration layer. The autoscaler was rewritten with a correct demand algorithm that reasons about individual tasks and scaling groups (#2653), and zombie TPU slices are now properly terminated via heartbeat-driven liveness checks (#2703). Job naming moved to a consistent filesystem-inspired /parent/child/task convention (#2645), the CLI was consolidated from four scattered modules into a unified interface (#2632), and raw time values were replaced throughout with type-safe Timestamp/Duration/Deadline primitives to prevent unit-confusion bugs (#2599). Stale Docker containers are now cleaned up during heartbeat reconciliation (#2671), the threading model was cleaned up with a global registry (#2620), and environment propagation was fixed for both local mode and child jobs (#2678). Additional polish included server-side pagination and filtering for the jobs list (#2676), automatic SSH tunneling for RPC operations (#2609), a simplified log viewer (#2610), and dashboard fixes for job status display (#2615). The Fray integration was cleaned up by removing FRAY_CLIENT_SPEC in favor of auto-detection (#2605), and multiple rounds of Zephyr integration fixes landed (#2675, #2616, #2698).
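The Timestamp/Duration/Deadline change (#2599) replaces raw floats with distinct types so seconds and milliseconds can no longer be silently confused. A minimal sketch of that kind of type-safe time primitive is below; the class and method names are hypothetical illustrations, not the actual Iris API.

```python
from dataclasses import dataclass
import time

# Hypothetical sketch of type-safe time primitives in the spirit of #2599;
# the real Iris types and method names may differ.

@dataclass(frozen=True)
class Duration:
    seconds: float

    @classmethod
    def from_ms(cls, ms: float) -> "Duration":
        return cls(ms / 1000.0)

@dataclass(frozen=True)
class Timestamp:
    epoch_seconds: float

    @classmethod
    def now(cls) -> "Timestamp":
        return cls(time.time())

    def __add__(self, d: Duration) -> "Timestamp":
        return Timestamp(self.epoch_seconds + d.seconds)

@dataclass(frozen=True)
class Deadline:
    at: Timestamp

    @classmethod
    def after(cls, d: Duration) -> "Deadline":
        return cls(Timestamp.now() + d)

    def expired(self) -> bool:
        return time.time() >= self.at.epoch_seconds

# A caller must say what unit it means; passing a bare float where a
# Duration is expected becomes a type error rather than a unit bug.
deadline = Deadline.after(Duration.from_ms(50))
```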

Pallas fused cross-entropy kernel

@dlwh landed a new TPU Pallas kernel for fused softmax-cross-entropy that streams over the vocabulary dimension, keeping only a single block in VMEM at a time (#2521). The follow-up made this streaming kernel the default path, removed the legacy forward-plus-split-backward implementation, and added tuned block sizes for v4 and v5p hardware (#2637). This work directly enables larger-vocabulary models to train without running out of HBM on the cross-entropy step.
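The key idea behind streaming over the vocabulary is the online logsumexp update: each block's contribution to the softmax normalizer folds into a running maximum and running sum, so only one block of logits needs to be resident at a time. Here is a NumPy sketch of that reduction, assuming a hypothetical `logits_fn(start, size)` interface standing in for the kernel's streamed reads; the actual Pallas kernel (#2521) is a TPU implementation of the same math.

```python
import numpy as np

def streaming_cross_entropy(logits_fn, labels, vocab_size, block_size):
    """Softmax cross-entropy computed one vocab block at a time.

    logits_fn(start, size) returns the [batch, size] logits slice for
    vocab ids [start, start + size); only that block is "resident",
    mirroring the single-block-in-VMEM property of the fused kernel.
    """
    batch = labels.shape[0]
    running_max = np.full(batch, -np.inf)
    running_sum = np.zeros(batch)     # sum of exp(logit - running_max)
    label_logit = np.zeros(batch)
    for start in range(0, vocab_size, block_size):
        size = min(block_size, vocab_size - start)
        block = logits_fn(start, size)
        # Online logsumexp: rescale the old sum to the new max, then
        # add this block's exponentials.
        block_max = block.max(axis=-1)
        new_max = np.maximum(running_max, block_max)
        running_sum = running_sum * np.exp(running_max - new_max) \
            + np.exp(block - new_max[:, None]).sum(axis=-1)
        running_max = new_max
        # Capture the target label's logit if it falls in this block.
        in_block = (labels >= start) & (labels < start + size)
        idx = np.clip(labels - start, 0, size - 1)
        label_logit = np.where(
            in_block, block[np.arange(batch), idx], label_logit)
    # loss = logsumexp(logits) - logit[label]
    return (running_max + np.log(running_sum)) - label_logit
```

The result matches the dense `logsumexp(logits) - logits[label]` computation, but peak memory scales with the block size rather than the full vocabulary.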

SFT and training fixes

@moojink improved SFT support for Marin, Llama 3.1, and Qwen 2.5/3 models by adding chat templates with {% generation %} tags, updating model configs, and adding gradient accumulation and tokenizer padding utilities (#2689). A train/val overlap bug was fixed by splitting before shuffling (#2700), and @Calvin-Xu fixed training resume from a final checkpoint (#2659).
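The overlap bug fixed in #2700 is a classic ordering hazard: if you shuffle before splitting, and the shuffle order is not perfectly reproducible across runs, the same example can land in train on one run and in val on another. A minimal illustration of the safe ordering, with hypothetical function and parameter names:

```python
import random

def split_then_shuffle(examples, val_fraction=0.1, seed=0):
    """Split deterministically first, then shuffle only the train set.

    Illustrative sketch of the bug class fixed in #2700: shuffling
    before splitting lets a non-reproducible order leak validation
    examples into training across runs.
    """
    n_val = int(len(examples) * val_fraction)
    val = examples[:n_val]        # fixed prefix: stable across runs
    train = examples[n_val:]
    rng = random.Random(seed)     # seeded shuffle of train only
    rng.shuffle(train)
    return train, val
```

Because the split boundary is fixed before any randomness is applied, the validation set is identical on every run and can never overlap the (shuffled) training set.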

Evaluation and reliability

@teetone added support for overriding N_REPEAT in Evalchemy evaluations, allowing control over how many seeds are used for math benchmarks like AIME and AMC (#2584). @Helw150 fixed token stats computation (#2688) and made parallel steps resilient so that one child failing no longer kills the entire batch (#2621). @dlwh added a retry loop for dev TPU reacquisition (#2701) and ensured the watchdog dependency is installed for dev TPU watch mode (#2631).
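The resilience fix for parallel steps (#2621) follows a common pattern: wrap each child so its exception becomes a recorded result instead of propagating and aborting the batch. A hypothetical sketch of that pattern (the actual Marin executor code surely differs):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(steps):
    """Run callables concurrently, recording per-step success or failure.

    Illustrative sketch of the pattern in #2621: one child raising no
    longer kills its siblings; its error is captured alongside the
    results of the steps that succeeded.
    """
    def guarded(step):
        try:
            return ("ok", step())
        except Exception as e:   # capture, don't propagate
            return ("error", e)

    with ThreadPoolExecutor() as pool:
        return list(pool.map(guarded, steps))
```

A caller can then inspect the `("error", exc)` entries and decide whether to retry, skip, or fail the batch deliberately, rather than losing all sibling work to a single crash.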

22 PRs this week, 15 new comments, and 24 issues closed

Training Runs This Week


Dominated by two large-scale experiment campaigns: Kaiyue Wen's MuonHC optimizer baselines on a 128-chip v5litepod (up to 193B tokens per run at 1.2B params), and Calvin Xu's 1.2B Attn-Gate architecture LR sweep on v5p-32 (60B tokens each). Pranshu Chaturvedi continued OLMoE size sweeps at smaller scale, and Moo Jin Kim ran SFT fine-tuning on Qwen3-32B and Marin-8B.

| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| MuonHC 1.2B baseline (lr=5e-3, 184K steps) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e21 | 98.3h | 25.8% | loss=2.483, 193B tokens | completed |
| MuonHC 1.2B baseline (lr=1e-2, 184K steps) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e21 | 98.0h | 25.8% | loss=2.532, 193B tokens | completed |
| Attn-Gate 1.2B LR sweep (lr=x1, best) | @Calvin-Xu | v5p-32 (16 chips) | 5.2e20 | 46.9h | 43.8% | loss=2.321, 60B tokens | completed |
| OLMoE-L LR sweep (seq4096, best lr) | @pc0618 | v5p-32 (16 chips) | 6.5e19 | 16.0h | 16.6% | loss=2.841, 14.7B tokens | completed |
| OLMoE-1.7B LR sweep (seq4096) | @pc0618 | v5p-32 (16 chips) | 1.2e20 | 29.5h | 16.4% | loss=2.706, 12.6B tokens | crashed |
| SFT Qwen3-32B on math30k | @moojink | v5p-64 (32 chips) | n/a | 15.2h | 51.7% | loss=0.0001, 9.8B tokens | completed |
| SFT Qwen3-2.35B-A2.2B on math30k | @moojink | v5litepod-256 (128 chips) | n/a | 12.9h | 58.4% | loss=0.0003, 9.8B tokens | completed |
| SFT long-context Marin-8B (lr=1e-4) | @moojink | v5p-128 (64 chips) | n/a | 11.2h | 58.8% | loss=0.949, 9.5B tokens | crashed |
