147 PRs merged · 21 issues opened · 80 issues closed · 12 contributors · 4 epics · 377 comments this week
Iris got a full-stack overhaul — JWT auth, Vue 3 dashboard, SQLite as canonical state, multi-VM CoreWeave support, and automated nightshift maintenance — while @ClassicLarry completed a systematic MoE isoflop sweep and @dlwh drove a gruggification refactor across the training codebase.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
24/41 sub-issues closed
Following last week's controller checkpointing and reservation work, @rjpower made SQLite the canonical state store for the Iris controller (#3408) and added GCS checkpointing for post-mortem analysis (#3497). The Iris dashboard was rewritten from Preact+HTM to Vue 3 + TypeScript + Tailwind v4 (#3511) with 26 components. Auth landed as a full system: GCP/static-user auth workflows (#3537), followed by a switch to HMAC-SHA256 JWTs for zero-DB-hit verification (#3630).

Multi-VM CoreWeave support shipped with JAX coordinator bootstrap (#3638) and CW-specific fixes: an R2 endpoint correction (#3629), interruptible-taint toleration (#3609), and hardened port-forward tunnels (#3540). The autoscaler got deadlock fixes and rate-limit logging (#3580), direct worker ID assignment (#3512), and ghost-slice prevention on failed scale-down (#3571). @rjpower introduced nightshift: 7 scheduled GitHub Actions workflows that use Claude agents for overnight maintenance, covering cleanup, doc drift, issue triage, and automated PR fixes (#3557, #3612, #3614, #3615). @yonromai consolidated all canaries onto Grug MoE via Iris (#3587), replacing the old Ray-based TPU canary and the separate CW script.

On the training side, @dlwh landed a series of gruggification PRs: sweeping hax.* annotations across model modules (#3326), removing direct haliax imports (#3327), decoupling eval from model.Pos (#3328), adding explicit axis-mapping foundations (#3329, #3331), an array-loss bridge for LM/ASR eval (#3313), and optimizer support for eqx linear masks (#3318). Fused cross-entropy got batch-tiled XLA streaming to avoid int32 word-count limits on long sequences (#3533), plus an autotune ExceptionGroup fallback (#3605). GCS executor lock races under worker churn were fixed (#3541), and cross-region transfer was unified under a single 10GB budget (#3627).
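The zero-DB-hit idea behind the JWT switch (#3630) is that an HMAC-SHA256 token carries its own proof of authenticity, so a request can be verified without touching the SQLite state store. A minimal stdlib-only sketch of the pattern; the secret, claim names, and helper functions here are hypothetical illustrations, not Iris's actual code:

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical shared secret; a real deployment loads this from config.
SECRET = b"iris-demo-secret"

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def sign(claims: dict) -> str:
    """Build a compact JWT: base64url(header).base64url(payload).base64url(sig)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify(token: str):
    """Return the claims if signature and expiry check out, else None.

    No database lookup: the HMAC alone proves the token was minted by a
    holder of SECRET, and the expiry bounds how long it stays valid.
    """
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return None
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) < time.time():  # missing exp is treated as expired
        return None
    return claims
```

The usual trade-off of the pattern: without a DB hit there is no per-token revocation, so a token stays valid until `exp` and short lifetimes matter.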
143 PRs this week, 4 new comments, and 0 new issues (41 total)
#3472 feat(iris): add cloud-mode smoke test to CI (2 comments, +142/−2) @rjpower
Building on last week's baseline architecture (64 experts, K=4, AF balancing), @ClassicLarry ran a full isoflop sweep across three FLOP budgets (1e18, 3e18, 1e19) and five hidden dimensions. The sweep architecture used DeepSeek-style MoE with 1 shared expert, top-4 routing, and 2 leading dense layers, testing both AdamW and Muon optimizers with aux-free and load-balance-loss variants. Key finding: at 1e19 FLOPs, d_model=1024 achieved the best BPB (0.9943), the sweep's sweet spot between parameter efficiency and training-token ratio. Muon's BPB was competitive with AdamW's at larger scales, though the comparison was muddied by small batch sizes inflating AdamW's buffer-retune advantage. @ClassicLarry also found that lr warmup is the key differentiator: Muon doesn't need it, while Adam diverges without it, giving Muon a 100-500 step head start. The routing bias term may need to scale with lr at very small batch sizes. Results are tracked on the moe_scaling_iter_01 branch. @dlwh selected the TPU fused CE backend explicitly for the MoE loss path (#3644) and started work on selective remat checkpoint policies (#3657).
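An isoflop sweep holds compute fixed and trades parameters against training tokens at each budget. A back-of-envelope sketch of how such a grid is laid out, using the common C ≈ 6·N·D approximation and a dense 12·L·d_model² parameter estimate; the actual sweep used a DeepSeek-style MoE, so the layer count, widths other than 1024, and both formulas are illustrative assumptions, not the sweep's real configuration:

```python
def isoflop_grid(flop_budgets, d_models, n_layers=24):
    """Enumerate (budget, width) points with the token counts they imply."""
    grid = []
    for c in flop_budgets:
        for d in d_models:
            n_params = 12 * n_layers * d ** 2   # rough dense transformer estimate
            n_tokens = c / (6 * n_params)       # tokens that spend the whole budget
            grid.append({
                "flops": c,
                "d_model": d,
                "params": n_params,
                "tokens": int(n_tokens),
                "tokens_per_param": n_tokens / n_params,
            })
    return grid

# The three budgets from the sweep; widths and layer count are assumptions.
for point in isoflop_grid([1e18, 3e18, 1e19], [512, 768, 1024, 1536, 2048]):
    print(point)
```

The tokens_per_param column is what makes the "sweet spot" visible: at a fixed budget, narrower models see more tokens per parameter and wider models fewer, and the best BPB lands somewhere in between.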
2 PRs this week, 26 new comments, and 0 new issues (4 total)
#3644 [grug/moe] Select TPU fused CE backend in MoE loss path (3 comments, +4/−2) @dlwh
Summary: We will need 20T high-quality tokens (code in particular) for our large MoE runs in Q2/Q3; this epic tracks the March work that will enable that.
0/4 sub-issues closed
After last week's Nemotron-CC tokenization milestone, @ravwojdyla focused on Zephyr reliability at scale. The shuffle layer was rewritten to use Parquet instead of per-chunk pickle blobs (#3482), with a pickle-in-parquet fallback for non-Arrow-serializable items (#3656) — eliminating the M×R×C file blowup that had been a bottleneck. Writers got dynamic batch sizing targeting 64MB buffers (#3498) and streaming Vortex output (#3479). Exact dedup was rewritten for Nemotron scale with single-pass hash-group-write and Vortex output (#3442). The Zephyr coordinator hang when all workers OOM was fixed (#3600). @Helw150 added data inspection improvements for debugging training spikes (#3284).
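The dynamic batch sizing in #3498 amounts to flushing a writer buffer once its estimated serialized size reaches a target, rather than after a fixed row count. A minimal sketch of size-targeted batching; only the 64MB target comes from the PR description, while `batched` and `estimate_size` are hypothetical names for illustration:

```python
TARGET_BYTES = 64 * 1024 * 1024  # 64MB buffer target, per #3498

def batched(rows, estimate_size, target_bytes=TARGET_BYTES):
    """Yield batches whose estimated serialized size just reaches target_bytes.

    estimate_size(row) -> int is any cheap per-row size estimate
    (e.g. serialized byte length).
    """
    batch, size = [], 0
    for row in rows:
        batch.append(row)
        size += estimate_size(row)
        if size >= target_bytes:
            yield batch
            batch, size = [], 0
    if batch:  # flush the partial final batch
        yield batch
```

Compared with a fixed row count, this keeps writer memory roughly constant even when row sizes vary by orders of magnitude across shards.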
2 PRs this week, and 0 new issues (4 total)
#3600 Fix Zephyr coordinator hang when all workers OOM (6 comments, +86/−10) @rjpower
#3284 Tweaks to data inspection from debugging spikes (5 comments, +339/−88) @Helw150
Issues
#3049Test Luxical as a General Tool for Data Integration Pipelines
Light activity this week. On OpenThoughts4, @natolambert commented that the modest advantage of 235B over 32B as teacher is "the most interesting thing" and urged testing multiple teachers and investigating rejection sampling findings (#2262). No new PRs landed in this area.
0 PRs this week, 2 new comments, and 0 new issues (4 total)
Issues
#2956[Agentic SFT] SFT Qwen3-8B on 5K SWE-smith trajectories and show improvement on SWE-bench
#2905[Agentic SFT] Generate 30K Coding Trajectories across 6 Languages
#3093[Agentic SFT] Tracking SFT datasets for SWE tasks
#2262 Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B (2 comments)
@gonzalobenegas generalized DNA model tokenization with separate uppercase/lowercase weights and added a functional_pos experiment (#3483). @dlwh-golem continued documentation alignment across tutorials, contributing guides, and eval docs. @teetone opened Evalchemy eval fixes (#3690).
23 PRs this week, 44 new comments, and 80 issues closed (80 total)