A week dominated by Iris infrastructure hardening and a major new Pallas fused cross-entropy kernel, alongside foundational SFT improvements.
@rjpower drove a sustained push to stabilize and refine Iris, the project's cluster orchestration layer. The autoscaler was rewritten with a correct demand algorithm that reasons about individual tasks and scaling groups (#2653), and zombie TPU slices are now properly terminated via heartbeat-driven liveness checks (#2703). Job naming moved to a consistent filesystem-inspired /parent/child/task convention (#2645), the CLI was consolidated from four scattered modules into a unified interface (#2632), and raw time values were replaced throughout with type-safe Timestamp/Duration/Deadline primitives to prevent unit-confusion bugs (#2599). Stale Docker containers are now cleaned up during heartbeat reconciliation (#2671), the threading model was cleaned up with a global registry (#2620), and environment propagation was fixed for both local mode and child jobs (#2678). Additional polish included server-side pagination and filtering for the jobs list (#2676), automatic SSH tunneling for RPC operations (#2609), a simplified log viewer (#2610), and dashboard fixes for job status display (#2615). The Fray integration was cleaned up by removing FRAY_CLIENT_SPEC in favor of auto-detection (#2605), and multiple rounds of Zephyr integration fixes landed (#2675, #2616, #2698).
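The Iris PRs themselves aren't reproduced here, but the idea behind the type-safe time primitives in #2599 is worth a sketch: once seconds and milliseconds live behind distinct types, unit confusion becomes a type error rather than a runtime bug. The classes and methods below are illustrative, not Iris's actual API.

```python
# Illustrative sketch of type-safe time primitives (not Iris's actual API).
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Duration:
    """A span of time, always stored in seconds."""
    seconds: float

    @classmethod
    def from_ms(cls, ms: float) -> "Duration":
        return cls(ms / 1000.0)


@dataclass(frozen=True)
class Timestamp:
    """An absolute point in time (epoch seconds)."""
    epoch_seconds: float

    @classmethod
    def now(cls) -> "Timestamp":
        return cls(time.time())

    def __add__(self, d: Duration) -> "Timestamp":
        return Timestamp(self.epoch_seconds + d.seconds)


@dataclass(frozen=True)
class Deadline:
    """A timestamp by which something must happen."""
    at: Timestamp

    def expired(self) -> bool:
        return Timestamp.now().epoch_seconds >= self.at.epoch_seconds


# A heartbeat liveness check can now only be written in these terms;
# passing a bare 30000 (ms? s?) no longer typechecks.
deadline = Deadline(Timestamp.now() + Duration.from_ms(30_000))
if deadline.expired():
    ...  # e.g. treat the TPU slice as a zombie and terminate it
```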
@dlwh landed a new TPU Pallas kernel for fused softmax-cross-entropy that streams over the vocabulary dimension, keeping only a single block in VMEM at a time (#2521). The follow-up made this streaming kernel the default path, removed the legacy forward-plus-split-backward implementation, and added tuned block sizes for v4 and v5p hardware (#2637). This work directly enables larger-vocabulary models to train without running out of HBM on the cross-entropy step.
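The Pallas kernel itself is hardware-specific, but the streaming trick it relies on can be sketched in plain JAX: maintain a running max and sum-of-exponentials (online logsumexp) while walking the vocabulary one block at a time, so the full [batch, vocab] logit matrix never exists at once. Everything below (function name, block size, shapes) is illustrative, not the kernel's interface.

```python
# Plain-JAX sketch of the streaming idea behind #2521/#2637: compute softmax
# cross-entropy one vocabulary block at a time, keeping only a [batch, block]
# slab of logits live instead of the full [batch, vocab] array.
import jax
import jax.numpy as jnp


def streaming_cross_entropy(hidden, lm_head, targets, block_size=1024):
    """hidden: [B, D], lm_head: [D, V], targets: [B] int labels."""
    batch, _ = hidden.shape
    vocab = lm_head.shape[1]
    num_blocks = vocab // block_size  # assume V % block_size == 0 for brevity

    def body(i, carry):
        running_max, running_sumexp, target_logit = carry
        start = i * block_size
        block = jax.lax.dynamic_slice_in_dim(lm_head, start, block_size, axis=1)
        logits = hidden @ block  # [B, block]: the only logits in memory

        # Online logsumexp update (same recurrence as online softmax).
        block_max = logits.max(axis=-1)
        new_max = jnp.maximum(running_max, block_max)
        running_sumexp = running_sumexp * jnp.exp(running_max - new_max) + jnp.exp(
            logits - new_max[:, None]
        ).sum(axis=-1)

        # Pick out the target logit if it lives in this block.
        local = targets - start
        in_block = (local >= 0) & (local < block_size)
        picked = jnp.take_along_axis(
            logits, jnp.clip(local, 0, block_size - 1)[:, None], axis=-1
        )[:, 0]
        target_logit = jnp.where(in_block, picked, target_logit)
        return new_max, running_sumexp, target_logit

    init = (jnp.full((batch,), -jnp.inf), jnp.zeros((batch,)), jnp.zeros((batch,)))
    running_max, running_sumexp, target_logit = jax.lax.fori_loop(
        0, num_blocks, body, init
    )
    # Per-example loss: logsumexp(logits) - logit[target].
    return running_max + jnp.log(running_sumexp) - target_logit
```

The real kernel fuses the matmul, the logsumexp, and the backward pass inside VMEM; this sketch only shows why peak memory scales with the block size rather than the vocabulary size.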
@moojink improved SFT support for Marin, Llama 3.1, and Qwen 2.5/3 models by adding chat templates with {% generation %} tags, updating model configs, and adding gradient accumulation and tokenizer padding utilities (#2689). A train/val overlap bug was fixed by splitting before shuffling (#2700), and @Calvin-Xu fixed training resume from a final checkpoint (#2659).
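The overlap bug fixed in #2700 is easy to reproduce in any pipeline that shuffles before splitting: if the shuffle seed ever differs across workers or restarts, the train/val boundary moves and examples leak across it. A generic sketch of the safe ordering (illustrative, not the PR's code):

```python
# Generic sketch of the ordering fixed in #2700 (not the PR's actual code).
import random


def split_then_shuffle(examples, val_fraction=0.01, seed=0):
    # 1) Split on the *unshuffled* order, so the train/val boundary is fixed
    #    regardless of how, or how often, shuffling happens afterwards.
    n_val = int(len(examples) * val_fraction)
    val, train = examples[:n_val], examples[n_val:]

    # 2) Shuffle only within each split.
    random.Random(seed).shuffle(train)
    return train, val
```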
@teetone added support for overriding N_REPEAT in Evalchemy evaluations, allowing control over how many seeds are used for math benchmarks like AIME and AMC (#2584). @Helw150 fixed token stats computation (#2688) and made parallel steps resilient so that one child failing no longer kills the entire batch (#2621). @dlwh added a retry loop for dev TPU reacquisition (#2701) and ensured the watchdog dependency is installed for dev TPU watch mode (#2631).
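The failure-isolation fix in #2621 amounts to collecting per-child errors instead of failing fast. A minimal sketch of the pattern with concurrent.futures (names are illustrative, not the actual executor code):

```python
# Minimal sketch of the failure isolation in #2621 (names illustrative).
# Each child step runs to completion; one exception no longer cancels its
# siblings, and all failures are reported together at the end.
from concurrent.futures import ThreadPoolExecutor


def run_parallel_steps(steps):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(step) for step in steps]
        failures = []
        for fut in futures:
            try:
                fut.result()  # waits on this child; siblings keep running
            except Exception as exc:
                failures.append(exc)  # record instead of re-raising immediately
    if failures:
        raise RuntimeError(f"{len(failures)} of {len(steps)} steps failed: {failures}")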
Experiment activity was dominated by two large-scale campaigns: Kaiyue Wen's MuonHC optimizer baselines on a 128-chip v5litepod (up to 193B tokens per run at 1.2B parameters), and Calvin Xu's LR sweep of the 1.2B Attn-Gate architecture on v5p-32 (60B tokens each). Pranshu Chaturvedi continued OLMoE size sweeps at smaller scale, and Moo Jin Kim ran SFT fine-tuning on Qwen3-32B and Marin-8B.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Final Loss / Tokens | Status |
|---|---|---|---|---|---|---|---|
| MuonHC 1.2B baseline (lr=5e-3, 184K steps) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e21 | 98.3h | 25.8% | loss=2.483, 193B tokens | completed #1 #2 |
| MuonHC 1.2B baseline (lr=1e-2, 184K steps) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e21 | 98.0h | 25.8% | loss=2.532, 193B tokens | completed #1 |
| Attn-Gate 1.2B LR sweep (lr=x1, best) | @Calvin-Xu | v5p-32 (16 chips) | 5.2e20 | 46.9h | 43.8% | loss=2.321, 60B tokens | completed #1 #2 #3 #4 #5 #6 #7 |
| OLMoE-L LR sweep (seq4096, best lr) | @pc0618 | v5p-32 (16 chips) | 6.5e19 | 16.0h | 16.6% | loss=2.841, 14.7B tokens | completed #1 #2 #3 |
| OLMoE-1.7B LR sweep (seq4096) | @pc0618 | v5p-32 (16 chips) | 1.2e20 | 29.5h | 16.4% | loss=2.706, 12.6B tokens | crashed #1 |
| SFT Qwen3-32B on math30k | @moojink | v5p-64 (32 chips) | — | 15.2h | 51.7% | loss=0.0001, 9.8B tokens | completed #1 |
| SFT Qwen3-2.35B-A2.2B on math30k | @moojink | v5litepod-256 (128 chips) | — | 12.9h | 58.4% | loss=0.0003, 9.8B tokens | completed #1 |
| SFT long-context Marin-8B (lr=1e-4) | @moojink | v5p-128 (64 chips) | — | 11.2h | 58.8% | loss=0.949, 9.5B tokens | crashed #1 |