An infrastructure-heavy week focused on eliminating GCSFuse dependencies, cleaning up dead code, landing new model architectures (OLMo 3, Apertus), and laying groundwork for MoE scaling experiments.
GCS and storage infrastructure. @dlwh landed a sweeping removal of all GCSFuse usage across the codebase, replacing it with direct GCS reads or local tmp downloads (#2308). This eliminated a fragile dependency that had caused intermittent failures in training and eval pipelines. @dlwh also pushed compute_next_token_loss into LmHeadModel (#2302), simplifying the loss computation path and making the upcoming Grug MoE integration cleaner by removing the activations/loss separation.
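For context, a minimal sketch of what the move away from GCSFuse looks like in practice: small objects are read straight from `gs://` URLs and larger artifacts are staged to local tmp first. The helper names and paths below are illustrative assumptions, not the actual code from #2308, and assume `fsspec`/`gcsfs` are available.

```python
# Illustrative sketch only (not the code from #2308): replacing a GCSFuse mount
# with direct gs:// reads via fsspec, or a local tmp download for large files.
import os
import tempfile

import fsspec


def read_small_file(gcs_path: str) -> bytes:
    """Read a small object (e.g. a config) directly from gs:// instead of a fuse mount."""
    with fsspec.open(gcs_path, "rb") as f:
        return f.read()


def download_to_tmp(gcs_path: str) -> str:
    """Stage a larger artifact to local tmp before use."""
    fs, _, (key,) = fsspec.get_fs_token_paths(gcs_path)
    local_path = os.path.join(tempfile.mkdtemp(), os.path.basename(key))
    fs.get(key, local_path)
    return local_path


# Hypothetical usage:
# config = read_small_file("gs://my-bucket/run/config.yaml")
# ckpt = download_to_tmp("gs://my-bucket/run/checkpoint-1000.tar")
```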
Codebase cleanup. @rjpower removed over 11,000 lines of dead code — the unused Sophia optimizer, stale processing and generation library code, and deprecated functions (#2290, #2265). He also replaced the fasttext dependency with floret for macOS ARM64 compatibility (#2283), fixing a long-standing pain point for developers on Apple Silicon.
New model architectures. @Helw150 added implementations of both OLMo 3 (#2267) and Apertus (#2272), expanding the set of architectures available for training and evaluation. He also landed multilingual log-probability evals that compare models against Llama 3 8B across languages (#2172), and added handling for owner-died errors so they are treated as preemptions rather than hard failures (#2309).
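A rough sketch of how owner-died errors can be reclassified as preemptions follows; the `Preempted` exception and wrapper are hypothetical, not the actual code from #2309, while `OwnerDiedError` is Ray's real exception for objects whose owning worker has died.

```python
# Hypothetical sketch, not the actual change in #2309: treat Ray's OwnerDiedError
# (the owning worker was torn down, e.g. a reclaimed TPU VM) as a preemption.
import ray
from ray.exceptions import OwnerDiedError


class Preempted(Exception):
    """Hypothetical marker exception that the job runner retries instead of failing."""


def get_result(ref):
    try:
        return ray.get(ref)
    except OwnerDiedError as e:
        # The object's owner process died out from under us; surface this as a
        # preemption so the task is rescheduled rather than marked as failed.
        raise Preempted(str(e)) from e
```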
MoE groundwork. @pc0618 added speedrun and profiling helpers for OLMoE and Mixtral (#2249), establishing the A/B testing infrastructure for MoE kernels (ragged-dot vs grouped-matmul); a sketch of the comparison follows below. He also fixed hardcoded eval paths and MoE GMM sharding under shard_map (#2248).
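As a sketch of the A/B being set up (shapes, sizes, and the uniform tokens-per-expert layout are illustrative assumptions, not the actual helpers from #2249): `jax.lax.ragged_dot` contracts contiguous row groups of tokens against per-expert weights, while a grouped matmul batches the same computation as a dense einsum.

```python
# Illustrative A/B sketch (not the actual helpers from #2249): ragged-dot vs a
# grouped matmul for MoE expert projections. Tokens are assumed pre-sorted by
# expert, with a uniform tokens-per-expert count to keep the example simple.
import jax
import jax.numpy as jnp

num_experts, tokens_per_expert, d_model, d_ff = 8, 128, 512, 1024
m = num_experts * tokens_per_expert

kx, kw = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (m, d_model))                    # tokens, sorted by expert
w = jax.random.normal(kw, (num_experts, d_model, d_ff))    # one weight matrix per expert
group_sizes = jnp.full((num_experts,), tokens_per_expert, dtype=jnp.int32)


@jax.jit
def ragged(x, w, group_sizes):
    # Each contiguous row group of x is contracted against its expert's weights.
    return jax.lax.ragged_dot(x, w, group_sizes)


@jax.jit
def grouped(x, w):
    # Naive grouped matmul: batch the per-expert blocks and einsum them.
    xg = x.reshape(num_experts, tokens_per_expert, d_model)
    return jnp.einsum("etd,edf->etf", xg, w).reshape(m, d_ff)


out_a = ragged(x, w, group_sizes)
out_b = grouped(x, w)
assert jnp.allclose(out_a, out_b, atol=1e-2)  # same math, different kernels
```

Profiling would then time each jitted variant (e.g. with `block_until_ready`) under the sharding actually used in training.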
Levanter and Haliax fixes. @dlwh fixed TensorStore restore when exemplar pytrees use AbstractMesh shardings, which was needed for Grug (#2289), and added support for explicit mesh axes plus disambiguated haliax.take behavior under sharding (#2288).
Speedrun and cluster updates. @Calvin-Xu fixed speedruns to work again after the Fray migration, particularly for local clusters with hardware accelerators (#2299). @ravwojdyla documented the manual Ray cluster restart policy (#2303), fixed an eval download typing bug (#2305), and updated the cluster Docker image (#2301).
VLM and open work. @ruili33 opened a large PR adding Vision-Language Model training support to Levanter, with SigLIP/SigLIP2 vision encoders and the LLaVA OneVision architecture (#2298).
@Helw150's scaling ladder campaign hit its largest runs yet: a 1e23 FLOPs Nemotron run on 512 v4 chips (375B tokens, crashed after 369h) and 1e22 FLOPs runs on 256 v4 chips for both Nemotron and COMMA. Kaiyue Wen ran MuonHT optimizer sweeps at 130M and 1.2B scale, and @Calvin-Xu continued Attn-Gate architecture sweeps at 520M and 1.2B scale on v5p-32.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| Nemotron scaling ladder 1e23 FLOPs | @Helw150 | v4-1024 (512 chips) | 6.9e22 | 369.1h | 41.9% | loss=5.456 (crashed), 375B tokens | crashed #1 |
| COMMA scaling ladder 1e22 FLOPs | @Helw150 | v4-512 (256 chips) | 1.0e22 | 92.2h | 44.0% | loss=1.671, 192B tokens | completed #1 |
| Nemotron scaling ladder 1e22 FLOPs | @Helw150 | v4-512 (256 chips) | 1.0e22 | 89.8h | 45.3% | loss=2.196, 160B tokens | completed #1 |
| COMMA scaling ladder 1e21 FLOPs | @Helw150 | v5p-32 (16 chips) | 1.0e21 | 83.1h | 49.3% | loss=1.906, 62.1B tokens | completed #1 #2 #3 |
| Attn-Gate 520M LR sweep (best: lr x2) | @Calvin-Xu | v5p-32 (16 chips) | 1.0e20 | 16.0h | 26.0% | loss=2.666, 25.1B tokens | completed #1 #2 |
| Attn-Gate 1.2B LR sweep (stage 2, lr x2) | @Calvin-Xu | v5p-32 (16 chips) | 2.0e20 | 18.2h | 43.6% | loss=2.685, 23.1B tokens | crashed #1 |
| Qwen3-1.2B MuonHT optimizer (lr=0.01) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e20 | 9.2h | 36.8% | loss=2.409, 24B tokens | completed #1 #2 |
| MuonH/MuonHT 130M batch-size sweep (8x, best lr=0.01) | Kaiyue Wen | v5litepod-128 (64 chips) | 1.4e19 | 2.6h | 19.3% | loss=3.123, 20.8B tokens | completed #1 #2 #3 #4 |