Week of January 12th summary for marin-community/marin

Milestone: Iris V1
17 PRs merged · 0 opened · 9 issues closed · 5 contributors · 0 epics · 133 comments this week

A week of major infrastructure overhauls — Iris gained co-scheduling and task abstractions for TPU support, Grugformer landed as a simpler JAX LM implementation, and the RL pipeline saw critical fixes to loss computation and weight transfer efficiency.

This Week's Work


Iris: Co-scheduling and Task Abstractions

@rjpower landed two foundational changes to Iris, Marin's job scheduler. #2382 restructured the entire scheduler around "tasks" and task attempts — a prerequisite for TPU VM support where a single job needs to coordinate across multiple workers sharing a failure domain. Building on that, #2422 added co-scheduling constraints so jobs can request that their tasks be placed together on the same TPU pod. Together these changes prepare Iris for the multi-host TPU training runs the MoE effort will require.
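The shape of these abstractions can be sketched generically: a job fans out into indexed tasks with retryable attempts, and a co-scheduling constraint means the whole gang lands on one pod or waits. This is a minimal illustration with hypothetical names, not Iris's actual internals:

```python
from dataclasses import dataclass

@dataclass
class Task:
    job_id: str
    index: int        # task index within the job
    attempt: int = 0  # bumped each time the task is retried

def coschedule(tasks, pod_free_slots):
    """Gang placement: put every task of a job on one pod, or nowhere.
    A TPU pod is a shared failure domain, so the gang is never split."""
    for pod, free in pod_free_slots.items():
        if free >= len(tasks):
            return {t.index: pod for t in tasks}
    return None  # no single pod fits the whole gang: queue rather than split

gang = [Task("train-moe", i) for i in range(4)]
placement = coschedule(gang, {"pod-a": 2, "pod-b": 8})
# -> every task on pod-b, the only pod with >= 4 free slots
```

The key property is all-or-nothing placement: a partial assignment across pods would put tasks in different failure domains, defeating the point of the constraint.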

Grugformer: A Simpler JAX LM

@dlwh introduced Grugformer (#2171), a deliberately minimal language model implementation in JAX that favors explicit sharding annotations and top-level functions over deep abstraction layers. It plugs into the existing Levanter trainer pipeline via a thin adapter, providing a more transparent alternative for experimentation. Separately, #2320 moved vLLM into a Docker sidecar container, decoupling its tight version pins from the rest of the stack and simplifying dependency management.
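The "thin adapter" idea can be sketched in the abstract: wrap the minimal model behind whatever interface the existing trainer expects, so the trainer never sees the difference. All names below are hypothetical stand-ins, not Levanter's or Grugformer's actual interfaces:

```python
class GrugModel:
    """Stand-in for a deliberately minimal model: plain methods, no deep hierarchy."""

    def forward(self, tokens):
        return [t + 1 for t in tokens]  # placeholder compute

class TrainerAdapter:
    """Adapts GrugModel to a trainer that calls `compute_loss(batch)`."""

    def __init__(self, model):
        self.model = model

    def compute_loss(self, batch):
        preds = self.model.forward(batch["tokens"])
        # toy loss: mean absolute difference against targets
        diffs = [abs(p, ) if False else abs(p - t) for p, t in zip(preds, batch["targets"])]
        return sum(diffs) / len(diffs)

adapter = TrainerAdapter(GrugModel())
loss = adapter.compute_loss({"tokens": [1, 2, 3], "targets": [2, 3, 5]})
# forward gives [2, 3, 4]; diffs are [0, 0, 1], so loss is 1/3
```

The adapter carries the entire integration cost, which is what lets the model itself stay explicit and flat.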

RL Pipeline Hardening

@AlienKevin shipped several fixes to the reinforcement learning pipeline. #2376 corrected a regression in DAPO loss computation — per-example normalization was giving shorter responses disproportionate gradient weight, hurting math reasoning tasks. #2392 halved network transfer time by converting weights to bfloat16 during Arrow Flight serialization (32GB to 16GB for Llama-3.1-8B). #2356 fixed WandB logging continuity across preemptions by using stable worker indices instead of process IDs, and #2375 pinned triton to resolve a torch 2.9.0 dependency conflict.
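The normalization bug behind #2376 is easiest to see numerically: per-example normalization divides each response's loss by its own length before averaging, so every token in a short response carries far more gradient weight than a token in a long one; normalizing once over all tokens weights them equally. A minimal sketch, not the repo's actual loss code:

```python
def per_example_loss(responses):
    # each response normalized by its own length, then averaged:
    # tokens in short responses get outsized weight
    means = [sum(toks) / len(toks) for toks in responses]
    return sum(means) / len(means)

def token_level_loss(responses):
    # one global normalization over all tokens in the batch
    flat = [x for toks in responses for x in toks]
    return sum(flat) / len(flat)

batch = [[1.0, 1.0], [0.5] * 8]   # a 2-token and an 8-token response
print(per_example_loss(batch))    # 0.75 -> the short response dominates
print(token_level_loss(batch))    # 0.6  -> every token weighted equally
```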

Data Pipeline Reliability

@ravwojdyla tackled a series of pain points in the data pipeline. Tokenization jobs were failing due to HuggingFace rate limits when many parallel workers fetched tokenizers simultaneously — #2396 addressed this with smarter caching, while #2395 added proper backoff with jitter to HF downloads. A previous attempt to make tokenization more robust was reverted in #2394 after discovering it could silently lose data. The deduplication pipeline got a thorough cleanup (#2402), splitting monolithic logic into separate modules for exact and fuzzy dedup with improved tests.
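The backoff-with-jitter pattern from #2395 can be sketched in a few lines: exponential backoff spaces retries out, and full jitter randomizes them so many parallel workers don't hammer the rate limit in lockstep. This is a generic sketch, not the PR's actual implementation:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base=1.0, cap=30.0):
    """Retry `fetch` on failure with capped exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            # full jitter: sleep a random amount up to the capped backoff
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Without the jitter, every worker that hit the same 429 would retry at the same instant and trigger the limit again.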

Scaling Analysis and Training Improvements

@Helw150 turned scaling plots and analysis into a first-class Executor step (#2243), making scaling law analysis reproducible as part of the pipeline rather than ad-hoc notebook work. He also fixed a subtle optimizer bug where beta2 was becoming unreasonably low at very large batch sizes (#2447) and added WandB logging for mixture weights (#2420). @gonzalobenegas landed loss downweighting for repetitive DNA elements (#2310), adding a preprocessing step that identifies repetitive regions and a dataloader that reads per-token loss weights from disk.
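The mechanism in #2310 — per-token loss weights stored alongside the data — amounts to a weighted average of token losses, with weights near zero inside repetitive regions. A toy sketch with hypothetical weight values; the real dataloader and weight files differ:

```python
def weighted_loss(token_losses, weights):
    """Downweight repetitive regions: weight 1.0 for normal tokens,
    a small value (e.g. 0.1) for tokens inside a repetitive element."""
    assert len(token_losses) == len(weights)
    return sum(l * w for l, w in zip(token_losses, weights)) / sum(weights)

losses  = [2.0, 2.0, 2.0, 2.0]
weights = [1.0, 1.0, 0.1, 0.1]  # last two tokens fall in a repeat
# weighted mean is still 2.0 here (all losses equal), but the
# repeat tokens contribute almost nothing to the gradient
```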

Developer Experience and CI

@rjpower added a Claude Code GitHub Actions workflow (#2384) so Claude can respond to PR and issue comments automatically, simplified the pre-commit script to use uv's native script mode (#2440), added a logging panel to the dashboard proxy (#2438), and normalized Jupyter notebooks on precommit to reduce git churn (#2372). @yonromai fixed pytest skipif conditions that were silently broken (#2453), added Iris tests to CI (#2450), an RL integration test (#2445), and disabled WandB logging during test runs (#2379). @ravwojdyla enabled pyrefly type checking across all libs (#2425). @Helw150 cleaned up legacy code ahead of infrastructure improvements (#2391) and added a guard against accidentally tokenizing test data into training sets (#2381).
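uv's native script mode (the mechanism behind #2440) refers to PEP 723 inline metadata: a script declares its own Python version and dependencies in a comment block, and `uv run` resolves them on the fly, with no separately managed hook environment. A minimal illustration — the dependency list and body are placeholders, not the repo's actual pre-commit script:

```python
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = []   # the real hook script would list its tools here
# ///
import sys

def main() -> int:
    # pre-commit checks would run here; stdlib-only for this sketch
    print(f"running hooks under Python {sys.version_info.major}.{sys.version_info.minor}")
    return 0

if __name__ == "__main__":
    main()
```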

Evaluation and Experiment Support

@RohithKuditipudi fixed eval_lm by correctly passing the device mesh to the dataloader (#2411). @pc0618 fixed HF checkpoint export for training-from-scratch configs (#2319) and added WANDB_GROUP support for organizing sweeps (#2412). @Calvin-Xu added max_concurrent to the Executor (#2426) for controlling parallelism in experiment sweeps.
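A `max_concurrent` knob like #2426's can be sketched with a semaphore gating how many steps run at once, independent of how many are submitted. Hypothetical shape, not the Executor's actual API:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_steps(steps, max_concurrent=2):
    """Run callables in parallel, but never more than max_concurrent at a time."""
    gate = threading.Semaphore(max_concurrent)

    def gated(step):
        with gate:  # blocks while max_concurrent steps are already running
            return step()

    with ThreadPoolExecutor(max_workers=len(steps)) as pool:
        futures = [pool.submit(gated, s) for s in steps]
        return [f.result() for f in futures]

results = run_steps([lambda i=i: i * i for i in range(5)], max_concurrent=2)
# -> [0, 1, 4, 9, 16], computed at most two at a time
```

This matters for sweeps: submitting fifty experiments shouldn't mean fifty simultaneous cluster jobs.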


Training Runs This Week


Kaiyue Wen ran a massive optimizer comparison campaign testing Adam, Muon, and MuonH across learning rates at 128B-token scale on v5litepod-128 pods. Will Held's Nemotron scaling ladder reached 1e21 FLOPs on v5p-64. Pranshu Chaturvedi ran OLMoE-1.7B size sweeps with bilinear and SwiGLU variants on v5p-64 pods. Moo Jin Kim ran SFT fine-tuning for long-context Marin-8B and Qwen3-8B on v5p and v4 hardware.

| Run | Owner | Hardware | FLOPs | Wall time | MFU | Evals | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Optimizer sweep: MuonH 128B tokens (best: lr=2.5e-3) | | v5litepod-128 (64 chips) | 2.6e20 | 27.4h | 24.3% | loss=3.042, 128B tokens | completed #1 #2 #3 #4 |
| Optimizer sweep: Muon 128B tokens (best: lr=2.5e-3) | | v5litepod-128 (64 chips) | 2.6e20 | 25.9h | 21.3% | loss=3.097, 128B tokens | completed #1 #2 #3 #4 |
| Optimizer sweep: Adam 128B tokens (best: lr=2.5e-3) | | v5litepod-128 (64 chips) | 2.6e20 | 24.2h | 16.6% | loss=3.115, 128B tokens | completed #1 #2 #3 #4 |
| Nemotron scaling ladder 1e21 FLOPs | @Helw150 | v5p-64 (32 chips) | 1.0e21 | 45.3h | 37.0% | loss=2.424, 46.3B tokens | completed #1 |
| OLMoE-1.7B ferry pre-train to cooldown | @pc0618 | v5p-64 (32 chips) | 2.5e20 | 18.4h | 26.6% | loss=2.659, 23.1B tokens | crashed #1 |
| OLMoE-1.7B bilinear size sweep (lr x2) | @pc0618 | v5p-64 (32 chips) | 1.9e20 | 19.7h | 19.2% | loss=26.707, 20.6B tokens | crashed #1 #2 #3 #4 |
| SFT long-context Marin-8B redo4 (lr=3e-4) | @moojink | v5p-128 (64 chips) | 1.8e21 | 29.5h | 58.2% | loss=0.928, 24.9B tokens | crashed #1 |
| SFT Qwen3-8B (lr=8e-5) | @moojink | v5p-128 (64 chips) | 1.3e21 | 23.7h | 55.1% | loss=0.870, 17.8B tokens | crashed #1 |
| SFT Llama-3.1-8B-Instruct (lr=8e-5) | @moojink | v4-256 (128 chips) | 1.4e21 | 23.5h | 58.2% | loss=3.602, 19.5B tokens | crashed #1 |
| Qwen3-1.2B MuonH optimizer variants sweep | | v5litepod-256 (128 chips) | 2.1e20 | 7.1h | 36.4% | loss=2.370 (sink), 24B tokens | completed #1 #2 #3 #4 #5 |
