73 merged · 5 opened · 39 issues closed · 11 contributors · 4 epics · 237 comments this week
Iris reliability hardening, Grug's module-first API refactor, and early MoE training experiments on TPU. The first CoreWeave GPU canary ferry was stood up and @ClassicLarry got Grug MoE running with replicated weights on v4 and v5p.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
24/41 sub-issues closed
Iris saw a major reliability push: @yonromai added request-level RPC observability (#3073), auto-retry on DEADLINE_EXCEEDED (#3067), container phase tracking replacing a boolean flag (#3105, #3106), local K8s e2e tests (#3097), and CW RBAC + scheduling fixes (#3111). @rjpower landed live resource monitoring in the dashboard (#3085), fractional CPU support via millicores (#3040), and a unified WorkerConfig proto (#3077), and replaced the multi-region AR push with GHCR + AR remote repos (#2996). @dlwh refactored Grug to a module-first API (#3017), tuned fused cross-entropy for v4 with an XLA fallback (#3052), set TPU VMEM defaults (#3053), and added Pallas cost estimates (#2999). In Zephyr, @ravwojdyla added heartbeat timeout suppression (#3014) and replaced shared data with disk-based serialization (#2986).
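The auto-retry in #3067 targets gRPC's DEADLINE_EXCEEDED status. A minimal sketch of the pattern in plain Python, with `TimeoutError` standing in for the gRPC status code; the helper name, attempt count, and backoff schedule are illustrative, not Iris's actual implementation:

```python
import time


def call_with_retry(rpc, *, max_attempts=3, base_delay=0.01):
    """Retry an RPC-style callable on timeout, with exponential backoff.

    TimeoutError stands in for gRPC's DEADLINE_EXCEEDED here; any
    other exception is considered non-retryable and raised immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of retries; surface the deadline error
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated flaky RPC: times out twice, then succeeds.
calls = {"n": 0}

def flaky_rpc():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("DEADLINE_EXCEEDED")
    return "ok"

result = call_with_retry(flaky_rpc, base_delay=0)
```

The key design point is retrying only the timeout status: errors like INVALID_ARGUMENT would fail the same way on every attempt, so retrying them just wastes the deadline budget.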
60 PRs this week, 49 new comments, and 3 new issues (41 total)
Summary: We will need 20T high-quality tokens (in particular, code) for our large MoE runs in Q2/Q3; this epic tracks the March work needed to enable that.
0/4 sub-issues closed
@ravwojdyla spent the week debugging tokenization at scale in preemptible environments (#2829), cataloging and fixing a series of Zephyr bugs around shared data, heartbeat timeouts, and Zarr metadata errors. Nemotron tokenization got a disk_cache tokenizer and configurable shard counts (#2984, #3036). @Helw150 provided context on the removal of Bert/Fasttext quality classifiers, motivating the Luxical embedding exploration (#3049).
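The disk_cache tokenizer with configurable shard counts (#2984, #3036) follows a common pattern for preemptible environments: hash each document to a fixed shard, write finished shards to disk, and skip already-written shards on rerun. A minimal sketch under those assumptions; the whitespace tokenizer, function names, and JSON shard format are illustrative, not Zephyr's actual code:

```python
import hashlib
import json
import pathlib


def tokenize(text):
    # Toy stand-in tokenizer (whitespace split); the real pipeline
    # uses a Nemotron tokenizer, not reproduced here.
    return text.split()


def cached_tokenize(docs, cache_dir, num_shards=4):
    """Tokenize docs into num_shards shard files, reusing shards on disk.

    Shard assignment hashes each doc id, so a rerun after preemption
    routes every doc to the same shard and completed shards are skipped.
    """
    cache_dir = pathlib.Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Deterministic doc -> shard assignment via a stable hash.
    shards = [[] for _ in range(num_shards)]
    for doc_id, text in docs.items():
        h = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
        shards[h % num_shards].append((doc_id, text))

    paths = []
    for i, shard in enumerate(shards):
        path = cache_dir / f"shard-{i:05d}.json"
        if not path.exists():  # resume: completed shards are not recomputed
            tokens = {doc_id: tokenize(text) for doc_id, text in shard}
            path.write_text(json.dumps(tokens))
        paths.append(path)
    return paths
```

Making the shard count configurable matters at scale: more shards mean less recomputation lost per preemption, at the cost of more small files.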
3 PRs this week, 8 new comments, and 4 new issues (4 total)
@ClassicLarry got Grug MoE running with replicated weights on v5p (#3064), benchmarking MFU across configurations and demonstrating that v5p's 95GB HBM allows full weight replication with batch-only sharding. @yonromai stood up the first daily CoreWeave GPU canary ferry workflow (#3050). Active discussion on MoE expert-parallel benchmarks by @dlwh on v5p-8 (#2710), and @ClassicLarry began scoping the MoE isoflop sweep (#2167).
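Full weight replication with batch-only sharding (#3064) means every device holds a complete copy of the parameters while the batch is split along its leading axis, so the forward pass needs no weight all-gathers. A minimal JAX sketch of that layout; the shapes and the `data` mesh axis name are illustrative, not Grug's actual configuration:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1-D device mesh with a single "data" axis over all available devices.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Empty PartitionSpec => fully replicated: each device holds all weights.
w_sharding = NamedSharding(mesh, P())
# Shard only the leading (batch) dimension across the "data" axis.
x_sharding = NamedSharding(mesh, P("data"))

w = jax.device_put(jnp.ones((4, 4)), w_sharding)  # replicated weights
x = jax.device_put(jnp.ones((8, 4)), x_sharding)  # batch-sharded inputs

@jax.jit
def forward(w, x):
    # Each device multiplies its batch slice by its local full weight copy.
    return x @ w

y = forward(w, x)
```

The trade-off is HBM for communication: replication spends memory to avoid gathering weights each step, which is exactly what v5p's 95GB HBM makes affordable for this model size.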
3 PRs this week, 43 new comments, and 1 new issue (4 total)
@AlienKevin's SFT run on SWE-smith trajectories finished in 8.5 hours but yielded a 0% resolve rate on SWE-bench Multilingual with the gpt-5-mini teacher (#2956). Two alternative teacher models (minimax-m2.5 and another) are being tried. Repetition issues with Qwen3-8B during SFT prompted a sanity check using TRL on Modal. @moojink continued the OpenThoughts4 teacher model comparison (#2262), with @natolambert following along.
0 PRs this week, 7 new comments, and 2 new issues (4 total)
Issues
#2956 🆕 [Agentic SFT] SFT Qwen3-8B on 5K SWE-smith trajectories and show improvement on SWE-bench (4 comments)
#2905 [Agentic SFT] Generate 30K Coding Trajectories across 6 Languages
#3093 🆕 [Agentic SFT] Tracking SFT datasets for SWE tasks
#2262 Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B (3 comments)
Other Changes
@gonzalobenegas added DNA experiments covering promoters, genomic regions, and k-mer tokenization (#2992), plus auto-detection of BOS/EOS tokens in the DNA batch tokenizer (#3055). @teetone updated Evalchemy non-math evaluation domains (#3128). Agent recipe and scrub skill improvements by @dlwh (#3129) and @rjpower (#3056).
12 PRs this week, 24 new comments, and 39 issues closed (39 total)
#3129 Add scrub skills for docs parity and self-improvement (2 comments, +64 −0) @dlwh