An infrastructure-heavy week focused on eliminating GCSFuse dependencies, cleaning up dead code, landing new model architectures (OLMo 3, Apertus), and laying groundwork for MoE scaling experiments.
GCS and storage infrastructure. @dlwh landed a sweeping removal of all GCSFuse usage across the codebase, replacing it with direct GCS reads or local tmp downloads (#2308). This eliminated a fragile dependency that had caused intermittent failures in training and eval pipelines. @dlwh also pushed compute_next_token_loss into LmHeadModel (#2302), simplifying the loss computation path and making the upcoming Grug MoE integration cleaner by removing the activations/loss separation.
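For context, a minimal sketch of what the move away from GCSFuse looks like in practice: small objects are read straight from `gs://` URLs and larger artifacts are staged to local tmp first. The helper names and paths below are illustrative assumptions, not the actual code from #2308, and assume `fsspec`/`gcsfs` are available.

```python
# Illustrative sketch only (not the code from #2308): replacing a GCSFuse mount
# with direct gs:// reads via fsspec, or a local tmp download for large files.
import os
import tempfile

import fsspec


def read_small_file(gcs_path: str) -> bytes:
    """Read a small object (e.g. a config) directly from gs:// instead of a fuse mount."""
    with fsspec.open(gcs_path, "rb") as f:
        return f.read()


def download_to_tmp(gcs_path: str) -> str:
    """Stage a larger artifact to local tmp before use."""
    fs, _, (key,) = fsspec.get_fs_token_paths(gcs_path)
    local_path = os.path.join(tempfile.mkdtemp(), os.path.basename(key))
    fs.get(key, local_path)
    return local_path


# Hypothetical usage:
# config = read_small_file("gs://my-bucket/run/config.yaml")
# ckpt = download_to_tmp("gs://my-bucket/run/checkpoint-1000.tar")
```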
Codebase cleanup. @rjpower removed over 11,000 lines of dead code — the unused Sophia optimizer, stale processing and generation library code, and deprecated functions (#2290, #2265). He also replaced the fasttext dependency with floret for macOS ARM64 compatibility (#2283), fixing a long-standing pain point for developers on Apple Silicon.
New model architectures. @Helw150 added implementations of both OLMo 3 (#2267) and Apertus (#2272), expanding the set of architectures available for training and evaluation. He also landed multilingual log-probability evals that compare models against Llama 3 8B across languages (#2172), and added handling for owner-died errors so they are treated as preemptions rather than hard failures (#2309).
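A rough sketch of how owner-died errors can be reclassified as preemptions follows; the `Preempted` exception and wrapper are hypothetical, not the actual code from #2309, while `OwnerDiedError` is Ray's real exception for objects whose owning worker has died.

```python
# Hypothetical sketch, not the actual change in #2309: treat Ray's OwnerDiedError
# (the owning worker was torn down, e.g. a reclaimed TPU VM) as a preemption.
import ray
from ray.exceptions import OwnerDiedError


class Preempted(Exception):
    """Hypothetical marker exception that the job runner retries instead of failing."""


def get_result(ref):
    try:
        return ray.get(ref)
    except OwnerDiedError as e:
        # The object's owner process died out from under us; surface this as a
        # preemption so the task is rescheduled rather than marked as failed.
        raise Preempted(str(e)) from e
```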
MoE groundwork. @pc0618 added speedrun and profiling helpers for OLMoE and Mixtral (#2249), establishing the A/B testing infrastructure for MoE kernels (ragged-dot vs grouped-matmul); a sketch of the comparison follows below. He also fixed hardcoded eval paths and MoE GMM sharding under shard_map (#2248).
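As a sketch of the A/B being set up (shapes, sizes, and the uniform tokens-per-expert layout are illustrative assumptions, not the actual helpers from #2249): `jax.lax.ragged_dot` contracts contiguous row groups of tokens against per-expert weights, while a grouped matmul batches the same computation as a dense einsum.

```python
# Illustrative A/B sketch (not the actual helpers from #2249): ragged-dot vs a
# grouped matmul for MoE expert projections. Tokens are assumed pre-sorted by
# expert, with a uniform tokens-per-expert count to keep the example simple.
import jax
import jax.numpy as jnp

num_experts, tokens_per_expert, d_model, d_ff = 8, 128, 512, 1024
m = num_experts * tokens_per_expert

kx, kw = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (m, d_model))                    # tokens, sorted by expert
w = jax.random.normal(kw, (num_experts, d_model, d_ff))    # one weight matrix per expert
group_sizes = jnp.full((num_experts,), tokens_per_expert, dtype=jnp.int32)


@jax.jit
def ragged(x, w, group_sizes):
    # Each contiguous row group of x is contracted against its expert's weights.
    return jax.lax.ragged_dot(x, w, group_sizes)


@jax.jit
def grouped(x, w):
    # Naive grouped matmul: batch the per-expert blocks and einsum them.
    xg = x.reshape(num_experts, tokens_per_expert, d_model)
    return jnp.einsum("etd,edf->etf", xg, w).reshape(m, d_ff)


out_a = ragged(x, w, group_sizes)
out_b = grouped(x, w)
assert jnp.allclose(out_a, out_b, atol=1e-2)  # same math, different kernels
```

Profiling would then time each jitted variant (e.g. with `block_until_ready`) under the sharding actually used in training.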
Levanter and Haliax fixes. @dlwh fixed TensorStore restore when exemplar pytrees use AbstractMesh shardings, which was needed for Grug (#2289), and added support for explicit mesh axes plus disambiguated haliax.take behavior under sharding (#2288).
Speedrun and cluster updates. @Calvin-Xu fixed speedruns to work again after the Fray migration, particularly for local clusters with hardware accelerators (#2299). @ravwojdyla documented the manual Ray cluster restart policy (#2303), fixed an eval download typing bug (#2305), and updated the cluster Docker image (#2301).
VLM and open work. @ruili33 opened a large PR adding Vision-Language Model training support to Levanter, with SigLIP/SigLIP2 vision encoders and the LLaVA OneVision architecture (#2298).
@Helw150's scaling ladder campaign hit its largest runs yet: a 1e23 FLOPs Nemotron run on 512 v4 chips (375B tokens, crashed after 369h) and 1e22 FLOPs runs on 256 v4 chips for both Nemotron and COMMA. Kaiyue Wen ran MuonHT optimizer sweeps at 130M and 1.2B scale, and @Calvin-Xu continued Attn-Gate architecture sweeps at 520M and 1.2B scale on v5p-32.
| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| Nemotron scaling ladder 1e23 FLOPs | @Helw150 | v4-1024 (512 chips) | 6.9e22 | 369.1h | 41.9% | loss=5.456 (crashed), 375B tokens | crashed #1 |
| COMMA scaling ladder 1e22 FLOPs | @Helw150 | v4-512 (256 chips) | 1.0e22 | 92.2h | 44.0% | loss=1.671, 192B tokens | completed #1 |
| Nemotron scaling ladder 1e22 FLOPs | @Helw150 | v4-512 (256 chips) | 1.0e22 | 89.8h | 45.3% | loss=2.196, 160B tokens | completed #1 |
| COMMA scaling ladder 1e21 FLOPs | @Helw150 | v5p-32 (16 chips) | 1.0e21 | 83.1h | 49.3% | loss=1.906, 62.1B tokens | completed #1 #2 #3 |
| Attn-Gate 520M LR sweep (best: lr x2) | @Calvin-Xu | v5p-32 (16 chips) | 1.0e20 | 16.0h | 26.0% | loss=2.666, 25.1B tokens | completed #1 #2 |
| Attn-Gate 1.2B LR sweep (stage 2, lr x2) | @Calvin-Xu | v5p-32 (16 chips) | 2.0e20 | 18.2h | 43.6% | loss=2.685, 23.1B tokens | crashed #1 |
| Qwen3-1.2B MuonHT optimizer (lr=0.01) | Kaiyue Wen | v5litepod-256 (128 chips) | 2.1e20 | 9.2h | 36.8% | loss=2.409, 24B tokens | completed #1 #2 |
| MuonH/MuonHT 130M batch-size sweep (8x, best lr=0.01) | Kaiyue Wen | v5litepod-128 (64 chips) | 1.4e19 | 2.6h | 19.3% | loss=3.123, 20.8B tokens | completed #1 #2 #3 #4 |