Week of January 19th summary for marin-community/marin

Milestone: Iris V1
48 PRs merged, 1 opened, 23 issues closed, 10 contributors, 100 comments this week

An infrastructure-heavy week focused on eliminating GCSFuse dependencies, cleaning up dead code, landing new model architectures (OLMo 3, Apertus), and laying groundwork for MoE scaling experiments.

This Week's Work


GCS and storage infrastructure. @dlwh landed a sweeping removal of all GCSFuse usage across the codebase, replacing it with direct GCS reads or local tmp downloads (#2308). This eliminated a fragile dependency that had caused intermittent failures in training and eval pipelines. @dlwh also pushed compute_next_token_loss into LmHeadModel (#2302), simplifying the loss computation path and making the upcoming Grug MoE integration cleaner by removing the activations/loss separation.

Codebase cleanup. @rjpower removed over 11,000 lines of dead code — the unused Sophia optimizer, stale processing and generation library code, and deprecated functions (#2290, #2265). He also replaced the fasttext dependency with floret for macOS ARM64 compatibility (#2283), fixing a long-standing pain point for developers on Apple Silicon.

New model architectures. @Helw150 added implementations of both OLMo 3 (#2267) and Apertus (#2272), expanding the set of architectures available for training and evaluation. He also landed multilingual log-probability evals that compare models against Llama 3 8B across languages (#2172), and added handling for owner-died preemption errors so they are correctly treated as preemptions rather than hard failures (#2309).
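The preemption-handling change amounts to classifying "owner died" errors into the retry path rather than the hard-failure path. A hypothetical sketch of that pattern; the function names, exception type, and message substrings are assumptions for illustration, not Marin's actual handler:

```python
def is_preemption(err: Exception) -> bool:
    # Message substrings are assumptions: match errors raised when a
    # distributed task's owner process was preempted out from under it.
    msg = str(err).lower()
    return "owner has died" in msg or "owner died" in msg or "preempted" in msg

def run_with_retries(step, max_retries: int = 3):
    """Re-run `step` on preemption-like errors; re-raise real failures."""
    for attempt in range(max_retries + 1):
        try:
            return step()
        except RuntimeError as err:
            if not is_preemption(err) or attempt == max_retries:
                raise  # genuine failure (or retries exhausted): surface it
            # preemption: fall through and reschedule the step
    raise AssertionError("unreachable")
```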

MoE groundwork. @pc0618 added speedrun and profiling helpers for OLMoE and Mixtral (#2249), establishing the A/B testing infrastructure for MoE kernels (ragged-dot vs grouped-matmul). He also fixed eval hardcoded path handling and MoE GMM sharding under shard_map (#2248).
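The ragged-dot vs grouped-matmul A/B compares two kernel strategies for the same MoE operation: multiplying each expert's slice of the (expert-sorted) token batch by that expert's weight matrix. A hedged NumPy sketch of the reference semantics, not the actual TPU kernels being benchmarked:

```python
import numpy as np

def grouped_matmul(tokens, expert_weights, group_sizes):
    """Reference semantics for an MoE grouped matmul.

    tokens:         (num_tokens, d_model), already sorted by expert
    expert_weights: (num_experts, d_model, d_ff)
    group_sizes:    per-expert token counts, summing to num_tokens
    returns:        (num_tokens, d_ff)
    """
    out = np.empty((tokens.shape[0], expert_weights.shape[2]))
    start = 0
    for expert, n in enumerate(group_sizes):
        # Each expert multiplies only its contiguous slice of tokens.
        out[start:start + n] = tokens[start:start + n] @ expert_weights[expert]
        start += n
    return out
```

The fused kernels under test compute this same result without a per-expert Python loop; the A/B measures which strategy does so faster at scale.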

Levanter and Haliax fixes. @dlwh fixed Tensorstore restore when exemplar pytrees use AbstractMesh shardings, which was needed for Grug (#2289), and added support for explicit mesh axes plus disambiguated haliax.take behavior under sharding (#2288).

Speedrun and cluster updates. @Calvin-Xu fixed speedruns to work again after the Fray migration, particularly for local clusters with hardware accelerators (#2299). @ravwojdyla documented the manual Ray cluster restart policy (#2303), fixed an eval download typing bug (#2305), and updated the cluster Docker image (#2301).

VLM and open work. @ruili33 opened a large PR adding Vision-Language Model training support to Levanter with SigLIP/SigLIP 2 vision encoders and the LLaVA OneVision architecture (#2298).


Training Runs This Week


Will Held's scaling ladder campaign hit its largest runs yet — a 1e23 FLOPs Nemotron run on 512 v4 chips (375B tokens, crashed after 369h) and 1e22 FLOPs runs on 256 v4 chips for both Nemotron and COMMA classifiers. Kaiyue Wen ran MuonHT optimizer sweeps at 130M and 1.2B scale. Calvin Xu continued Attn-Gate architecture sweeps at 520M and 1.2B scale on v5p-32.

| Run | Owner | Hardware | FLOPs | Wall time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| Nemotron scaling ladder 1e23 FLOPs | @Helw150 | v4-1024 (512 chips) | 6.9e22 | 369.1h | 41.9% | loss=5.456, 375B tokens | crashed |
| COMMA scaling ladder 1e22 FLOPs | @Helw150 | v4-512 (256 chips) | 1.0e22 | 92.2h | 44.0% | loss=1.671, 192B tokens | completed |
| Nemotron scaling ladder 1e22 FLOPs | @Helw150 | v4-512 (256 chips) | 1.0e22 | 89.8h | 45.3% | loss=2.196, 160B tokens | completed |
| COMMA scaling ladder 1e21 FLOPs | @Helw150 | v5p-32 (16 chips) | 1.0e21 | 83.1h | 49.3% | loss=1.906, 62.1B tokens | completed |
| Attn-Gate 520M LR sweep (best: lr x2) | @Calvin-Xu | v5p-32 (16 chips) | 1.0e20 | 16.0h | 26.0% | loss=2.666, 25.1B tokens | completed |
| Attn-Gate 1.2B LR sweep (stage 2, lr x2) | @Calvin-Xu | v5p-32 (16 chips) | 2.0e20 | 18.2h | 43.6% | loss=2.685, 23.1B tokens | crashed |
| Qwen3-1.2B MuonHT optimizer (lr=0.01) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e20 | 9.2h | 36.8% | loss=2.409, 24B tokens | completed |
| MuonH/MuonHT 130M batch-size sweep (8x, best lr=0.01) | Kaiyue Wen | v5litepod-128 (64 chips) | 1.4e19 | 2.6h | 19.3% | loss=3.123, 20.8B tokens | completed |
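For reference, the MFU column is achieved model FLOPs divided by the chips' peak throughput over the wall-clock time. A quick sanity check against the COMMA 1e22 row, assuming ~275e12 bf16 FLOP/s peak per TPU v4 chip (the published spec); the exact accounting in the dashboard may differ slightly:

```python
def mfu(model_flops: float, wall_hours: float, chips: int,
        peak_flops_per_chip: float) -> float:
    """Model FLOPs utilization: achieved FLOPs over peak chip-seconds."""
    return model_flops / (peak_flops_per_chip * chips * wall_hours * 3600)

# COMMA 1e22 row: 1.0e22 FLOPs over 92.2h on 256 v4 chips.
# With ~275 TFLOP/s peak per chip this lands near the reported 44.0%.
print(round(mfu(1.0e22, 92.2, 256, 275e12), 3))  # → 0.428
```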