A week of major infrastructure overhauls — Iris gained co-scheduling and task abstractions for TPU support, Grugformer landed as a simpler JAX LM implementation, and the RL pipeline saw critical fixes to loss computation and weight transfer efficiency.
@rjpower landed two foundational changes to Iris, Marin's job scheduler. #2382 restructured the entire scheduler around "tasks" and task attempts — a prerequisite for TPU VM support where a single job needs to coordinate across multiple workers sharing a failure domain. Building on that, #2422 added co-scheduling constraints so jobs can request that their tasks be placed together on the same TPU pod. Together these changes prepare Iris for the multi-host TPU training runs the MoE effort will require.
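The co-scheduling idea can be sketched as a gang-placement check: a task group is placed only if a single pod can host every member, otherwise the whole group stays pending. Everything below (the `Pod` class, `co_schedule`) is a hypothetical illustration of the concept, not Iris's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    capacity: int
    tasks: list = field(default_factory=list)

def co_schedule(task_group, pods):
    """Place a group of tasks together on one pod, or not at all.

    Sketch only: a real scheduler also tracks task attempts and
    failure domains, but the all-or-nothing placement is the core idea."""
    for pod in pods:
        if pod.capacity - len(pod.tasks) >= len(task_group):
            pod.tasks.extend(task_group)
            return pod.name
    return None  # no pod fits the whole group; keep it pending

pods = [Pod("pod-a", capacity=2), Pod("pod-b", capacity=4)]
placed = co_schedule(["worker-0", "worker-1", "worker-2"], pods)
print(placed)  # the 3-task group skips pod-a and lands together on pod-b
```

Note that pod-a has free capacity but is skipped entirely: partial placement would split the gang across failure domains, which is exactly what the constraint forbids.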
@dlwh introduced Grugformer (#2171), a deliberately minimal language model implementation in JAX that favors explicit sharding annotations and top-level functions over deep abstraction layers. It plugs into the existing Levanter trainer pipeline via a thin adapter, providing a more transparent alternative for experimentation. Separately, #2320 moved vLLM into a Docker sidecar container, decoupling its tight version pins from the rest of the stack and simplifying dependency management.
@AlienKevin shipped several fixes to the reinforcement learning pipeline. #2376 corrected a regression in DAPO loss computation — per-example normalization was giving shorter responses disproportionate gradient weight, hurting math reasoning tasks. #2392 halved network transfer time by converting weights to bfloat16 during Arrow Flight serialization (32GB to 16GB for Llama-3.1-8B). #2356 fixed WandB logging continuity across preemptions by using stable worker indices instead of process IDs, and #2375 pinned triton to resolve a torch 2.9.0 dependency conflict.
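The normalization bug is easy to see with a toy example. The sketch below (made-up loss values, not from the actual run) contrasts per-example averaging, which overweights tokens in short responses, with normalizing once by the total token count:

```python
import numpy as np

# Per-token losses for a short (2-token) and a long (8-token) response.
# Values are illustrative only.
short = np.array([4.0, 4.0])
long_ = np.array([1.0] * 8)

# Buggy: normalize within each example, then average across examples.
# Each short-response token then carries 1/(2*2) = 0.25 of the gradient,
# each long-response token only 1/(2*8) ~ 0.06 -- short responses dominate.
buggy = 0.5 * (short.mean() + long_.mean())

# Fixed: normalize by the total token count, so every token carries the
# same 1/10 weight regardless of which response it came from.
fixed = (short.sum() + long_.sum()) / (short.size + long_.size)

print(buggy, fixed)  # 2.5 vs 1.6
```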
@ravwojdyla tackled a series of pain points in the data pipeline. Tokenization jobs were failing due to HuggingFace rate limits when many parallel workers fetched tokenizers simultaneously — #2396 addressed this with smarter caching, while #2395 added proper backoff with jitter to HF downloads. A previous attempt to make tokenization more robust was reverted in #2394 after discovering it could silently lose data. The deduplication pipeline got a thorough cleanup (#2402), splitting monolithic logic into separate modules for exact and fuzzy dedup with improved tests.
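Backoff with jitter can be sketched as follows; `download_with_retries` and `flaky` are hypothetical stand-ins for illustration, not the code from #2395:

```python
import random
import time

def backoff_with_jitter(attempt, base=1.0, cap=60.0):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)].
    Randomizing the delay keeps parallel workers from retrying in lockstep
    and re-triggering the rate limit all at once."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

def download_with_retries(fetch, max_attempts=5, base=1.0):
    """Retry a flaky fetch() with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except IOError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_with_jitter(attempt, base=base))

# Simulate a download that is rate-limited twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("429 Too Many Requests")
    return "tokenizer.json"

result = download_with_retries(flaky, base=0.01)  # tiny base for the demo
print(result)  # "tokenizer.json" on the third attempt
```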
@Helw150 turned scaling plots and analysis into a first-class Executor step (#2243), making scaling law analysis reproducible as part of the pipeline rather than ad-hoc notebook work. He also fixed a subtle optimizer bug where beta2 was becoming unreasonably low at very large batch sizes (#2447) and added WandB logging for mixture weights (#2420). @gonzalobenegas landed loss downweighting for repetitive DNA elements (#2310), adding a preprocessing step that identifies repetitive regions and a dataloader that reads per-token loss weights from disk.
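Per-token loss downweighting reduces to a weighted negative log-likelihood. A minimal numpy sketch, with hypothetical weights and probabilities rather than the actual #2310 code:

```python
import numpy as np

def weighted_nll(token_logprobs, weights):
    """Negative log-likelihood with per-token weights: repetitive positions
    get weight < 1 so they contribute less to the loss. In the real
    pipeline these weights are precomputed and read from disk."""
    token_logprobs = np.asarray(token_logprobs)
    weights = np.asarray(weights)
    return -(weights * token_logprobs).sum() / weights.sum()

# A 6-token DNA sequence where positions 2-4 were flagged as repetitive.
# The model predicts the easy repetitive tokens with high probability.
logprobs = np.log([0.5, 0.4, 0.9, 0.9, 0.9, 0.3])
weights = np.array([1.0, 1.0, 0.1, 0.1, 0.1, 1.0])

print(weighted_nll(logprobs, weights))
```

With uniform weights this reduces to the plain mean NLL; downweighting the easy repetitive tokens shifts the loss toward the informative non-repetitive positions.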
@rjpower added a Claude Code GitHub Actions workflow (#2384) so Claude can respond to PR and issue comments automatically, simplified the pre-commit script to use uv's native script mode (#2440), added a logging panel to the dashboard proxy (#2438), and normalized Jupyter notebooks on precommit to reduce git churn (#2372). @yonromai fixed pytest skipif conditions that were silently broken (#2453), added Iris tests to CI (#2450) and an RL integration test (#2445), and disabled WandB logging during test runs (#2379). @ravwojdyla enabled pyrefly type checking across all libs (#2425). @Helw150 cleaned up legacy code ahead of infrastructure improvements (#2391) and added a guard against accidentally tokenizing test data into training sets (#2381).
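A common way skipif conditions break silently (the specifics of #2453 may differ) is passing an always-truthy value to `pytest.mark.skipif`, such as a function object instead of its return value, so the test skips everywhere without failing anywhere:

```python
# Hypothetical capability probe; imagine a real hardware check here.
def tpu_available():
    return False

# Buggy: the function object itself is used as the condition, e.g.
# @pytest.mark.skipif(tpu_available, reason="needs TPU").
# Function objects are always truthy, so the test is always skipped.
buggy_condition = tpu_available

# Fixed: call the probe so the condition is an actual boolean.
fixed_condition = tpu_available()

print(bool(buggy_condition), bool(fixed_condition))  # True False
```

Because a skipped test reports as passing CI, this class of bug tends to go unnoticed until someone audits which tests actually ran.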
@RohithKuditipudi fixed eval_lm by correctly passing the device mesh to the dataloader (#2411). @pc0618 fixed HF checkpoint export for training-from-scratch configs (#2319) and added WANDB_GROUP support for organizing sweeps (#2412). @Calvin-Xu added max_concurrent to the Executor (#2426) for controlling parallelism in experiment sweeps.
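The effect of `max_concurrent` can be sketched with a bounded thread pool; `run_sweep` below is a hypothetical illustration of the idea, not the Executor's implementation:

```python
import concurrent.futures
import threading
import time

def run_sweep(jobs, max_concurrent=2):
    """Run sweep jobs with a cap on how many execute simultaneously,
    tracking the peak concurrency actually observed."""
    running = 0
    peak = 0
    lock = threading.Lock()

    def wrapped(job):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return job()
        finally:
            with lock:
                running -= 1

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(wrapped, jobs))
    return results, peak

# Six toy "experiments"; at most two run at any moment.
jobs = [lambda i=i: (time.sleep(0.05), i)[1] for i in range(6)]
results, peak = run_sweep(jobs, max_concurrent=2)
print(results, peak)  # six ordered results, never more than 2 in flight
```

Capping parallelism this way matters for sweeps where each job grabs scarce resources (accelerator pods, API quota) and launching everything at once would fail or thrash.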
Kaiyue Wen ran a massive optimizer comparison campaign testing Adam, Muon, and MuonH across learning rates at 128B-token scale on v5litepod-128 pods. Will Held's Nemotron scaling ladder reached 1e21 FLOPs on v5p-64. Pranshu Chaturvedi ran OLMoE 1.7B size sweeps with bilinear and SwiGLU variants on v5p-64 pods. Moo Jin Kim ran SFT fine-tuning for Marin-8B long-context and Qwen3-8B on v5p and v4 hardware.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| Optimizer sweep: MuonH 128B tokens (best: lr=2.5e-3) | Kaiyue Wen | v5litepod-128 (64 chips) | 2.6e20 | 27.4h | 24.3% | loss=3.042, 128B tokens | completed #1 #2 #3 #4 |
| Optimizer sweep: Muon 128B tokens (best: lr=2.5e-3) | Kaiyue Wen | v5litepod-128 (64 chips) | 2.6e20 | 25.9h | 21.3% | loss=3.097, 128B tokens | completed #1 #2 #3 #4 |
| Optimizer sweep: Adam 128B tokens (best: lr=2.5e-3) | Kaiyue Wen | v5litepod-128 (64 chips) | 2.6e20 | 24.2h | 16.6% | loss=3.115, 128B tokens | completed #1 #2 #3 #4 |
| Nemotron scaling ladder 1e21 FLOPs | @Helw150 | v5p-64 (32 chips) | 1.0e21 | 45.3h | 37.0% | loss=2.424, 46.3B tokens | completed #1 |
| OLMoE-1.7B ferry pre-train to cooldown | @pc0618 | v5p-64 (32 chips) | 2.5e20 | 18.4h | 26.6% | loss=2.659, 23.1B tokens | crashed #1 |
| OLMoE-1.7B bilinear size sweep (lr x2) | @pc0618 | v5p-64 (32 chips) | 1.9e20 | 19.7h | 19.2% | loss=26.707, 20.6B tokens | crashed #1 #2 #3 #4 |
| SFT long-context Marin-8B redo4 (lr=3e-4) | @moojink | v5p-128 (64 chips) | 1.8e21 | 29.5h | 58.2% | loss=0.928, 24.9B tokens | crashed #1 |
| SFT Qwen3-8B (lr=8e-5) | @moojink | v5p-128 (64 chips) | 1.3e21 | 23.7h | 55.1% | loss=0.870, 17.8B tokens | crashed #1 |
| SFT Llama-3.1-8B-Instruct (lr=8e-5) | @moojink | v4-256 (128 chips) | 1.4e21 | 23.5h | 58.2% | loss=3.602, 19.5B tokens | crashed #1 |
| Qwen3-1.2B MuonH optimizer variants sweep | | v5litepod-256 (128 chips) | 2.1e20 | 7.1h | 36.4% | loss=2.370 (sink), 24B tokens | completed #1 #2 #3 #4 #5 |