120 merged · 20 opened · 82 issues closed · 13 contributors · 4 epics · 616 comments this week
A week of deep infrastructure hardening — Iris got security-first auth, DuckDB log storage, 5x controller performance, and multi-host GPU on CoreWeave — while MoE scaling research advanced to iter_03 with PID-controlled sigmoid routing, AdamH, and Gated Norm showing uniform BPB improvements across the isoflop grid.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
24/41 sub-issues closed
Iris saw major performance and security work this week. @rjpower landed a 5x+ controller speedup via a SQLite read pool, query consolidation, and dashboard SQL rewrites #3719, then followed up with further heartbeat and scheduling query optimizations #3881 and faster dashboard queries with expanded benchmark coverage #3791.

The log store was overhauled twice — first replacing SQLite with DuckDB + rotating Parquet #3828 (a rough sketch of the pattern follows this section), then optimizing with connection pooling, sorted segments, and page indexes #3837 — following a fix for the DuckDB store's 60GB memory usage #3843. The BundleStore was also replaced with flat-file storage #3776, and migration scripts were added for the SQLite→fsspec/Parquet transitions #3852.

Security hardening landed with default-deny auth, CSRF protection, auth DB isolation, and traceback sanitization #3894, plus sensitive env var redaction in API responses #3889.

Multi-host GPU support for CoreWeave shipped via the new KubernetesProvider #3806, with smoke CI added #3927, endpoint leak fixes #3814, #3740, tmpfs for task workdirs #3696, #3770, and stale heartbeat reaping on controller restart #3772. The Dockerfiles were unified into a single multi-stage build #3735, and container OOM issues were addressed with memory limits #3840.

@dlwh made the executor region-aware, decoupling Iris from GCS dependencies #3824, fixed Grug checkpoint resume #3790, handled Pallas autotune misses under mosaic partitioning #3669, and added fused CE autotune fixes #3949, #3963. @yonromai refactored the Zephyr coordinator to use host_actor and worker group polling #3861, added S3/R2 conditional-write locking #3874 (see the second sketch below), and fixed sync actor RPC kwargs #3907. @ravwojdyla added combiner support to group-by #3725, iris resource_utils for task resource queries #3847, map_shard with shard info #3757, and improved the job detail view with child jobs and sorting #3733.

The harbor dependency was moved out to marin-community/harbor #3836, removing ~200K lines. A major Platform abstraction elimination is in progress #3900, reorganizing into Service + Provider layers.
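For readers unfamiliar with the DuckDB-over-rotating-Parquet pattern, here is a minimal sketch of the idea: append logs into immutable Parquet segments, then query all segments through DuckDB. The `ParquetLogStore` class, its methods, and the rotation threshold are illustrative assumptions, not Iris's actual API:

```python
import time
from pathlib import Path

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq


class ParquetLogStore:
    """Append logs to rotating Parquet segments; query them all via DuckDB."""

    def __init__(self, root: str, rotate_rows: int = 100_000):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.rotate_rows = rotate_rows
        self.buffer: list[dict] = []

    def append(self, task_id: str, line: str) -> None:
        self.buffer.append({"ts": time.time(), "task_id": task_id, "line": line})
        if len(self.buffer) >= self.rotate_rows:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # Each flush writes an immutable segment; old segments are never rewritten.
        table = pa.Table.from_pylist(self.buffer)
        pq.write_table(table, self.root / f"segment-{time.time_ns()}.parquet")
        self.buffer.clear()

    def query(self, task_id: str) -> list[tuple]:
        # DuckDB scans every segment, pushing the predicate down into Parquet metadata.
        return duckdb.execute(
            "SELECT ts, line FROM read_parquet(?) WHERE task_id = ? ORDER BY ts",
            [str(self.root / "*.parquet"), task_id],
        ).fetchall()
```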
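And a hedged sketch of the conditional-write locking idea behind #3874: an `If-None-Match: *` precondition makes the PUT succeed only if the key does not exist yet, which turns object creation into a crude mutex on S3 or R2. The bucket name, lock key, and error handling are assumptions for illustration, not the Iris code:

```python
import boto3
import botocore.exceptions

s3 = boto3.client("s3")  # R2 works the same way via its S3-compatible endpoint

def try_acquire_lock(bucket: str, key: str, owner: str) -> bool:
    """Atomically create the lock object; fail if another writer already holds it."""
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=owner.encode(), IfNoneMatch="*")
        return True
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] in ("PreconditionFailed", "412"):
            return False  # someone else created the lock object first
        raise

if try_acquire_lock("my-bucket", "locks/coordinator", "worker-0"):
    print("lock acquired")
```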
113 PRs this week, 8 new comments, and 0 new issues (41 total)
- #3779 [grug/moe] Add explicit EP implementation selector (💬6, +264/−9) @dlwh-golem
- #3963 Lower shard_map tracers directly in fused CE autotune (💬1, +132/−8) @dlwh
Building on last week's iter_02 improvements, @Helw150 pushed the MoE architecture to iter_03 with five major changes: sigmoid routing replacing softmax for independent per-expert selection, a PID bias controller (Kp/Ki/Kd) replacing the bang-bang balancer, AdamH (a scale-invariant Adam that keeps weight Frobenius norms constant) replacing standard Adam, removal of all auxiliary losses (load balancing, router z-loss, logit z-loss), and variance-preserving residual connections. Two parallel Vizier sweeps — 160 trials each at d=1024/3e18 FLOPs — are comparing the iter_02 baseline against iter_03 AdamH to find optimal hyperparameters at matched compute #2167. (Rough sketches of the PID-controlled router and the AdamH/Gated Norm ideas follow this section.)

@WhenWen completed the full AdamH isoflop grid, showing 0.009–0.01 BPB improvements over v02 at most configurations, and added Gated Norm on top — a learned sigmoid gate that compensates for AdamH's bounded activation norms — producing uniform improvements of 0.002–0.006 BPB across all 15 grid points, with the d2048 regression at 1e18 nearly eliminated. @ClassicLarry ran the recipe at 1e20 scale with directionally good results, completed iter_01 vs iter_02 comparisons showing consistent improvements across the full grid, then kicked off a 1e21 run (14B total params, 2B active, 75B tokens) before stepping out #3800.

On the GPU front, @chloechiaw is testing Tokamax's Pallas Triton ragged_dot kernel as a JAX-native alternative for GPU MoE #2828. @dlwh added an explicit EP implementation selector for the Grug MoE path #3779, an ArrayStacked grug variant with stack-aware optimizer support #3797, and an only-store-EMA-when-enabled optimization #3670. Selective MoE remat checkpoint policies were added and then scoped down after feedback that the model file was getting too heavy #3657.
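A hedged sketch of what sigmoid routing with a PID-controlled bias might look like, reconstructed only from the description above (independent sigmoid scores, a bias that steers top-k selection but not combine weights, and PID terms replacing the bang-bang sign update). All names, gains, and shapes are illustrative assumptions, not the iter_03 implementation:

```python
import jax
import jax.numpy as jnp

def route(logits, bias, k):
    """Sigmoid routing: independent per-expert scores; bias steers selection only."""
    scores = jax.nn.sigmoid(logits)                       # [tokens, experts]
    _, idx = jax.lax.top_k(scores + bias, k)              # biased expert selection
    weights = jnp.take_along_axis(scores, idx, axis=-1)   # unbiased combine weights
    return idx, weights

def pid_bias_update(state, load_frac, kp=1e-2, ki=1e-3, kd=1e-3):
    """Replace the bang-bang sign(error) update with a full PID term."""
    bias, integral, prev_err = state
    err = load_frac - 1.0 / load_frac.shape[-1]  # deviation from uniform expert load
    integral = integral + err
    deriv = err - prev_err
    # Overloaded experts (err > 0) get their bias pushed down, and vice versa.
    bias = bias - (kp * err + ki * integral + kd * deriv)
    return bias, integral, err
```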
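Similarly, a minimal sketch of the two optimizer/norm ideas as one plausible reading of their one-line descriptions: AdamH's constant-Frobenius-norm property rendered as a post-step projection, and Gated Norm as an RMSNorm modulated by a learned sigmoid gate. Both are assumptions, not the actual recipe code:

```python
import jax
import jax.numpy as jnp

def project_to_norm(w, target_norm):
    """After the Adam step, rescale the weight matrix back to a fixed Frobenius norm."""
    return w * (target_norm / (jnp.linalg.norm(w) + 1e-8))

def gated_rmsnorm(x, scale, gate, eps=1e-6):
    """RMSNorm whose output is modulated by a learned sigmoid gate, letting the
    network recover amplitude that bounded weight norms would otherwise cap."""
    rms = jnp.sqrt(jnp.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * scale * jax.nn.sigmoid(gate)
```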
4 PRs this week, 17 new comments, and 2 new issues (6 total)
Summary: We will need 20T of high-quality tokens (in particular code) for our large MoE runs in Q2/Q3; this is the work we will do in March to enable that.
0/4 sub-issues closed
Following last week's exact dedup rewrite, @ravwojdyla confirmed successful large-scale group-by for fuzzy dedup on a Nemotron split #2829, validating the pipeline at production scale. @rjpower switched Nemotron-CC downloads to streaming .jsonl.zst output #3796 (sketched below). @dlwh added support for hf://buckets paths in default_download and default_tokenize #3793. The dupekit Rust code was refactored into rust/ with CI wheels and a mode switch #3850. A fuzzy dedup update to scale to Nemotron is in progress #3750.
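As a rough illustration of the streaming pattern in #3796, here is a hedged sketch that pulls records over HTTP and re-emits them as zstd-compressed JSONL without ever materializing the whole file; the URL and record handling are placeholders, not the Marin pipeline:

```python
import json

import requests
import zstandard as zstd

def stream_to_jsonl_zst(url: str, out_path: str) -> None:
    """Stream newline-delimited records from `url` straight into a .jsonl.zst file."""
    cctx = zstd.ZstdCompressor(level=3)
    with requests.get(url, stream=True) as resp, open(out_path, "wb") as fh:
        resp.raise_for_status()
        with cctx.stream_writer(fh) as writer:
            for line in resp.iter_lines():
                if not line:
                    continue
                record = json.loads(line)  # validate / transform each record here
                writer.write(json.dumps(record).encode() + b"\n")
```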
4 PRs this week, and 0 new issues (4 total)
- #3796 Stream Nemotron-CC download and output as .jsonl.zst (💬1, +99/−144) @rjpower
- #3793 Support hf://buckets paths in default_download and default_tokenize (💬2, +177/−17) @dlwh
- #3850 Refactor Rust code: move dupekit to rust/, add CI wheels and mode switch (💬12, +375/−63) @rjpower
@ahmeda14960 opened a major PR for an alignment function that goes from spec to synthetic preference data to DPO #3950 (a DPO-loss sketch follows below), alongside work migrating the RL pipeline from Fray v1 (Ray) to Fray v2 (Iris) #3960. A colocated RL project doc and Tunix RL reference were also opened #3948. @teetone added new Evalchemy evals and fixes #3690.
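For context on the DPO stage named in #3950, a minimal sketch of the standard DPO objective (Rafailov et al.) in JAX; the inputs are assumed to be per-sequence sums of token log-probs, and this is the textbook loss, not the PR's code:

```python
import jax
import jax.numpy as jnp

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO: prefer the chosen response by a margin over the reference model.

    Each argument is the per-sequence sum of token log-probs, shape [batch].
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # -log(sigmoid(x)) written stably as softplus(-x)
    return jnp.mean(jax.nn.softplus(-beta * (chosen_margin - rejected_margin)))
```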
1 PR this week, 1 new comment, and 0 new issues (4 total)
- #2262 Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B (💬1)
Other Notable Changes
@dlwh opened a PR for XLA-first Mamba-3 SISO and MIMO TPU kernels #3961, a significant new architecture direction (a generic SSM-scan sketch follows below). @RohithKuditipudi fixed final checkpoint saving #3958. The nightshift agent system continued evolving — skills were separated for PR authoring vs. review #3920, triage was upgraded to opus #3749, and turn limits were increased 10x #3935. @Helw150 has a tokenized molecule training draft in progress #3742.
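For readers unfamiliar with the kernel family behind #3961, a hedged sketch of the core SISO (single-input, single-output) linear state-space recurrence h_t = a_t·h_{t−1} + b_t·x_t, expressed as a parallel associative scan in JAX; this is the generic primitive such kernels accelerate, not the Mamba-3 kernel itself:

```python
import jax
import jax.numpy as jnp

def ssm_scan(a, bx):
    """Parallel scan for h_t = a_t * h_{t-1} + bx_t over the time axis."""
    def combine(left, right):
        a_l, h_l = left
        a_r, h_r = right
        # Composing two linear steps stays linear: decays multiply, and the
        # earlier state is carried through the later step's decay.
        return a_l * a_r, h_l * a_r + h_r
    _, h = jax.lax.associative_scan(combine, (a, bx))
    return h

# Tiny usage example: one scalar channel over 8 timesteps.
a = jnp.full((8,), 0.9)                 # per-step decay a_t
bx = jnp.arange(8, dtype=jnp.float32)   # per-step input b_t * x_t
print(ssm_scan(a, bx))
```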
23 PRs this week, 31 new comments, and 82 issues closed (82 total)
- #3865 [docs] Fix stale evaluation tutorial links (💬1, +3/−5) @dlwh