A week of major infrastructure overhauls — Iris gained co-scheduling and task abstractions for TPU support, Grugformer landed as a simpler JAX LM implementation, and the RL pipeline saw critical fixes to loss computation and weight transfer efficiency.
@rjpower landed two foundational changes to Iris, Marin's job scheduler. #2382 restructured the entire scheduler around "tasks" and task attempts — a prerequisite for TPU VM support where a single job needs to coordinate across multiple workers sharing a failure domain. Building on that, #2422 added co-scheduling constraints so jobs can request that their tasks be placed together on the same TPU pod. Together these changes prepare Iris for the multi-host TPU training runs the MoE effort will require.
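The co-scheduling idea can be sketched as a gang-placement check: a task group is placed only if a single pod can host every member, otherwise the whole group stays pending. Everything below (the `Pod` class, `co_schedule`) is a hypothetical illustration of the concept, not Iris's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    capacity: int
    tasks: list = field(default_factory=list)

def co_schedule(task_group, pods):
    """Place a group of tasks together on one pod, or not at all.

    Sketch only: a real scheduler also tracks task attempts and
    failure domains, but the all-or-nothing placement is the core idea."""
    for pod in pods:
        if pod.capacity - len(pod.tasks) >= len(task_group):
            pod.tasks.extend(task_group)
            return pod.name
    return None  # no pod fits the whole group; keep it pending

pods = [Pod("pod-a", capacity=2), Pod("pod-b", capacity=4)]
placed = co_schedule(["worker-0", "worker-1", "worker-2"], pods)
print(placed)  # the 3-task group skips pod-a and lands together on pod-b
```

Note that pod-a has free capacity but is skipped entirely: partial placement would split the gang across failure domains, which is exactly what the constraint forbids.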
@dlwh introduced Grugformer (#2171), a deliberately minimal language model implementation in JAX that favors explicit sharding annotations and top-level functions over deep abstraction layers. It plugs into the existing Levanter trainer pipeline via a thin adapter, providing a more transparent alternative for experimentation. Separately, #2320 moved vLLM into a Docker sidecar container, decoupling its tight version pins from the rest of the stack and simplifying dependency management.
@AlienKevin shipped several fixes to the reinforcement learning pipeline. #2376 corrected a regression in DAPO loss computation — per-example normalization was giving shorter responses disproportionate gradient weight, hurting math reasoning tasks. #2392 halved network transfer time by converting weights to bfloat16 during Arrow Flight serialization (32GB to 16GB for Llama-3.1-8B). #2356 fixed WandB logging continuity across preemptions by using stable worker indices instead of process IDs, and #2375 pinned triton to resolve a torch 2.9.0 dependency conflict.
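The normalization bug is easy to see with a toy example. The sketch below (made-up loss values, not from the actual run) contrasts per-example averaging, which overweights tokens in short responses, with normalizing once by the total token count:

```python
import numpy as np

# Per-token losses for a short (2-token) and a long (8-token) response.
# Values are illustrative only.
short = np.array([4.0, 4.0])
long_ = np.array([1.0] * 8)

# Buggy: normalize within each example, then average across examples.
# Each short-response token then carries 1/(2*2) = 0.25 of the gradient,
# each long-response token only 1/(2*8) ~ 0.06 -- short responses dominate.
buggy = 0.5 * (short.mean() + long_.mean())

# Fixed: normalize by the total token count, so every token carries the
# same 1/10 weight regardless of which response it came from.
fixed = (short.sum() + long_.sum()) / (short.size + long_.size)

print(buggy, fixed)  # 2.5 vs 1.6
```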
@ravwojdyla tackled a series of pain points in the data pipeline. Tokenization jobs were failing due to HuggingFace rate limits when many parallel workers fetched tokenizers simultaneously — #2396 addressed this with smarter caching, while #2395 added proper backoff with jitter to HF downloads. A previous attempt to make tokenization more robust was reverted in #2394 after discovering it could silently lose data. The deduplication pipeline got a thorough cleanup (#2402), splitting monolithic logic into separate modules for exact and fuzzy dedup with improved tests.
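Backoff with jitter can be sketched as follows; `download_with_retries` and `flaky` are hypothetical stand-ins for illustration, not the code from #2395:

```python
import random
import time

def backoff_with_jitter(attempt, base=1.0, cap=60.0):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)].
    Randomizing the delay keeps parallel workers from retrying in lockstep
    and re-triggering the rate limit all at once."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

def download_with_retries(fetch, max_attempts=5, base=1.0):
    """Retry a flaky fetch() with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except IOError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_with_jitter(attempt, base=base))

# Simulate a download that is rate-limited twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("429 Too Many Requests")
    return "tokenizer.json"

result = download_with_retries(flaky, base=0.01)  # tiny base for the demo
print(result)  # "tokenizer.json" on the third attempt
```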
@Helw150 turned scaling plots and analysis into a first-class Executor step (#2243), making scaling law analysis reproducible as part of the pipeline rather than ad-hoc notebook work. He also fixed a subtle optimizer bug where beta2 was becoming unreasonably low at very large batch sizes (#2447) and added WandB logging for mixture weights (#2420). @gonzalobenegas landed loss downweighting for repetitive DNA elements (#2310), adding a preprocessing step that identifies repetitive regions and a dataloader that reads per-token loss weights from disk.
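Per-token loss downweighting reduces to a weighted negative log-likelihood. A minimal numpy sketch, with hypothetical weights and probabilities rather than the actual #2310 code:

```python
import numpy as np

def weighted_nll(token_logprobs, weights):
    """Negative log-likelihood with per-token weights: repetitive positions
    get weight < 1 so they contribute less to the loss. In the real
    pipeline these weights are precomputed and read from disk."""
    token_logprobs = np.asarray(token_logprobs)
    weights = np.asarray(weights)
    return -(weights * token_logprobs).sum() / weights.sum()

# A 6-token DNA sequence where positions 2-4 were flagged as repetitive.
# The model predicts the easy repetitive tokens with high probability.
logprobs = np.log([0.5, 0.4, 0.9, 0.9, 0.9, 0.3])
weights = np.array([1.0, 1.0, 0.1, 0.1, 0.1, 1.0])

print(weighted_nll(logprobs, weights))
```

With uniform weights this reduces to the plain mean NLL; downweighting the easy repetitive tokens shifts the loss toward the informative non-repetitive positions.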
@rjpower added a Claude Code GitHub Actions workflow (#2384) so Claude can respond to PR and issue comments automatically, simplified the pre-commit script to use uv's native script mode (#2440), added a logging panel to the dashboard proxy (#2438), and normalized Jupyter notebooks on precommit to reduce git churn (#2372). @yonromai fixed pytest skipif conditions that were silently broken (#2453), added Iris tests to CI (#2450) and an RL integration test (#2445), and disabled WandB logging during test runs (#2379). @ravwojdyla enabled pyrefly type checking across all libs (#2425). @Helw150 cleaned up legacy code ahead of infrastructure improvements (#2391) and added a guard against accidentally tokenizing test data into training sets (#2381).
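A common way skipif conditions break silently (the specifics of #2453 may differ) is passing an always-truthy value to `pytest.mark.skipif`, such as a function object instead of its return value, so the test skips everywhere without failing anywhere:

```python
# Hypothetical capability probe; imagine a real hardware check here.
def tpu_available():
    return False

# Buggy: the function object itself is used as the condition, e.g.
# @pytest.mark.skipif(tpu_available, reason="needs TPU").
# Function objects are always truthy, so the test is always skipped.
buggy_condition = tpu_available

# Fixed: call the probe so the condition is an actual boolean.
fixed_condition = tpu_available()

print(bool(buggy_condition), bool(fixed_condition))  # True False
```

Because a skipped test reports as passing CI, this class of bug tends to go unnoticed until someone audits which tests actually ran.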
@RohithKuditipudi fixed eval_lm by correctly passing the device mesh to the dataloader (#2411). @pc0618 fixed HF checkpoint export for training-from-scratch configs (#2319) and added WANDB_GROUP support for organizing sweeps (#2412). @Calvin-Xu added max_concurrent to the Executor (#2426) for controlling parallelism in experiment sweeps.
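The effect of `max_concurrent` can be sketched with a bounded thread pool; `run_sweep` below is a hypothetical illustration of the idea, not the Executor's implementation:

```python
import concurrent.futures
import threading
import time

def run_sweep(jobs, max_concurrent=2):
    """Run sweep jobs with a cap on how many execute simultaneously,
    tracking the peak concurrency actually observed."""
    running = 0
    peak = 0
    lock = threading.Lock()

    def wrapped(job):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            return job()
        finally:
            with lock:
                running -= 1

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(wrapped, jobs))
    return results, peak

# Six toy "experiments"; at most two run at any moment.
jobs = [lambda i=i: (time.sleep(0.05), i)[1] for i in range(6)]
results, peak = run_sweep(jobs, max_concurrent=2)
print(results, peak)  # six ordered results, never more than 2 in flight
```

Capping parallelism this way matters for sweeps where each job grabs scarce resources (accelerator pods, API quota) and launching everything at once would fail or thrash.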
Kaiyue Wen ran a massive optimizer comparison campaign testing Adam, Muon, and MuonH across learning rates at 128B-token scale on v5litepod-128 pods. Will Held's Nemotron scaling ladder reached 1e21 FLOPs on v5p-64. Pranshu Chaturvedi ran OLMoE 1.7B size sweeps with bilinear and SwiGLU variants on v5p-64 pods. Moo Jin Kim ran SFT fine-tuning for Marin-8B long-context and Qwen3-8B on v5p and v4 hardware.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| Optimizer sweep: MuonH 128B tokens (best: lr=2.5e-3) | Kaiyue Wen | v5litepod-128 (64 chips) | 2.6e20 | 27.4h | 24.3% | loss=3.042, 128B tokens | completed #1 #2 #3 #4 |
| Optimizer sweep: Muon 128B tokens (best: lr=2.5e-3) | Kaiyue Wen | v5litepod-128 (64 chips) | 2.6e20 | 25.9h | 21.3% | loss=3.097, 128B tokens | completed #1 #2 #3 #4 |
| Optimizer sweep: Adam 128B tokens (best: lr=2.5e-3) | Kaiyue Wen | v5litepod-128 (64 chips) | 2.6e20 | 24.2h | 16.6% | loss=3.115, 128B tokens | completed #1 #2 #3 #4 |
| Nemotron scaling ladder 1e21 FLOPs | @Helw150 | v5p-64 (32 chips) | 1.0e21 | 45.3h | 37.0% | loss=2.424, 46.3B tokens | completed #1 |
| OLMoE-1.7B ferry pre-train to cooldown | @pc0618 | v5p-64 (32 chips) | 2.5e20 | 18.4h | 26.6% | loss=2.659, 23.1B tokens | crashed #1 |
| OLMoE-1.7B bilinear size sweep (lr x2) | @pc0618 | v5p-64 (32 chips) | 1.9e20 | 19.7h | 19.2% | loss=26.707, 20.6B tokens | crashed #1 #2 #3 #4 |
| SFT long-context Marin-8B redo4 (lr=3e-4) | @moojink | v5p-128 (64 chips) | 1.8e21 | 29.5h | 58.2% | loss=0.928, 24.9B tokens | crashed #1 |
| SFT Qwen3-8B (lr=8e-5) | @moojink | v5p-128 (64 chips) | 1.3e21 | 23.7h | 55.1% | loss=0.870, 17.8B tokens | crashed #1 |
| SFT Llama-3.1-8B-Instruct (lr=8e-5) | @moojink | v4-256 (128 chips) | 1.4e21 | 23.5h | 58.2% | loss=3.602, 19.5B tokens | crashed #1 |
| Qwen3-1.2B MuonH optimizer variants sweep | | v5litepod-256 (128 chips) | 2.1e20 | 7.1h | 36.4% | loss=2.370 (sink), 24B tokens | completed #1 #2 #3 #4 #5 |