Iris moved from last week's reliability hardening to a full log storage rewrite and reservation system. The MoE expert-parallel benchmark thread concluded with production compaction optimizations, and @ClassicLarry submitted a 15-run isoflop scaling sweep after last week's initial MoE experiments.
#2836
Infrastructure: MoE Training Support
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
Building on last week's Iris reliability push, @rjpower replaced GCS-based log reads with a SQLite-backed controller log store forwarded via heartbeat (#3301, #3244), a full rewrite of the log pipeline. He also landed a reservation system for pre-provisioning worker capacity (#3123, #3223), and fixed FD exhaustion under load (#3389), controller lock contention (#3356), and delivery-failure retry budget inflation (#3366, #3367). @dlwh continued the Pallas kernel cleanup from last week, consolidating to a single production forward path for fused cross-entropy (#3125) and stabilizing it across TPU v4/v5e/v6e (#3354). The MoE ring expert-parallel benchmark thread (#2710) concluded this week with production compaction optimizations merged (#3377, #3398). @yonromai hardened CoreWeave deployment with konnectivity tunnel retries (#3323), Docker CLI pinning for TPU host compatibility (#3348), and compilation cache improvements (#3195). @ravwojdyla parallelized GCP zone queries (#3259) and added MirrorFileSystem for transparent cross-region file access (#3258).
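The heartbeat-forwarded log store pattern can be sketched as follows. This is a hypothetical illustration, not the actual Iris schema or API: workers append log lines to a local SQLite table, and each heartbeat ships only the rows past the cursor the controller last acknowledged, so nothing is re-read from object storage.

```python
import sqlite3

# Hypothetical sketch of a heartbeat-forwarded SQLite log store.
# Table and column names are illustrative, not the Iris implementation.

def make_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS logs ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  task TEXT NOT NULL,"
        "  line TEXT NOT NULL)"
    )
    return db

def append(db, task, line):
    db.execute("INSERT INTO logs (task, line) VALUES (?, ?)", (task, line))
    db.commit()

def fetch_since(db, cursor):
    # The controller tracks the last autoincrement id it has seen;
    # each heartbeat carries only the delta past that cursor.
    rows = db.execute(
        "SELECT id, task, line FROM logs WHERE id > ? ORDER BY id", (cursor,)
    ).fetchall()
    new_cursor = rows[-1][0] if rows else cursor
    return rows, new_cursor

db = make_store()
append(db, "train-0", "step 1: loss=3.2")
append(db, "train-0", "step 2: loss=3.1")
rows, cur = fetch_since(db, 0)   # first heartbeat: both rows
rows2, _ = fetch_since(db, cur)  # next heartbeat: nothing new
```

The autoincrement id doubles as a monotonic cursor, which is what makes the prefix-based fetching in #3360 cheap: a heartbeat is a single indexed range scan.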
108 PRs this week, 22 new comments, and 1 new issue (41 total)
#3398
Simplify Grug MoE ring EP local counts
+26 −10
@dlwh
#3377
Optimize Grug MoE ring EP compaction
+16 −12
@dlwh
#3167
iris: controller snapshot/checkpoint
+4647 −488
@rjpower
#3162
Add CrossRegionGuardedFS to block large cross-region GCS reads
+774 −312
@rjpower
#3427
Improve non-daemon thread reporting by grouping identical stacks
+14 −4
@rjpower
#3424
cleanup: remove autoscaler unmet demand log spam
+2 −12
@rjpower
#3411
[Iris] Align TPU VM metadata for v5p, v5e-8, and v6e-8
+21 −22
@dlwh
#3407
Fix log store WAL bloat during eviction
+15 −4
@rjpower
#3405
iris: reject unschedulable coscheduled jobs at submission time
+172 −60
@rjpower
#3403
Fix SSH tunnel poisoning parent stdout with O_NONBLOCK
+5 −2
@yonromai
#3401
fix: handle unhealthy worker under lock to prevent race condition
+24 −22
@rjpower
#3397
iris: generalize profiling with threads support and target-based routing
+571 −274
@rjpower
#3395
Use human-readable slice names instead of epoch milliseconds
+28 −9
@rjpower
#3393
iris: add worker health checks and unify heartbeat paths
+307 −44
@app/claude
#3390
iris: default endpoints tab to 100 rows with prefix search
+78 −11
@rjpower
#3389
Fix controller FD exhaustion under load
+5 −1
@rjpower
#3387
iris: disable coredumps in worker and task containers
+4 −0
@rjpower
#3382
fix: add null guards for users data in dashboard
+2 −2
@rjpower
#3369
fix(iris): log perf, scheduling fixes, holder task device constraints
+1213 −529
@rjpower
#3368
fix(iris): forward server-side tracebacks in RPC errors
+233 −193
@rjpower
#3367
Fix preemption_count inflation from delivery failures
+38 −49
@rjpower
#3366
Fix delivery failure handling: don't count undelivered tasks against retry budgets
+281 −37
@rjpower
#3364
fix(zephyr,iris): add retry+backoff for controller RPCs and pipeline retries
+16 −2
@rjpower
#3363
Handle reservation holder task worker deaths gracefully
+222 −5
@rjpower
#3362
Fix profiling summary pipeline on GPU traces
+125 −29
@yonromai
#3361
Improve log propagation reliability and shutdown handling
+32 −9
@rjpower
#3360
feat(iris): prefix-based log fetching with autoincrement cursor
+434 −455
@rjpower
#3356
iris: reduce controller lock contention and RPC overhead
+420 −301
@rjpower
#3354
[Levanter] Stabilize fused CE TPU v4 tuning and vmem fallback
+1511 −324
@dlwh
#3352
cleanup: remove iris demo cluster & yaml
+16 −1460
@chonky-bot
#3350
SSH to GCP util
+109 −0
@ravwojdyla
#3348
Pin Docker CLI to 24.0 in worker image for TPU host compat
+4 −1
@chonky-bot
#3344
iris: use pd-ssd boot disk for controller VM
+20 −1
@app/claude
#3343
Allow to filter iris logs by level in CLI
+26 −1
@ravwojdyla
#3340
Search all zones for the TPU
+14 −8
@ravwojdyla
#3325
iris: use shared logging widget for process & task logs
+829 −2244
@rjpower
#3323
coreweave: widen tunnel timeout and add diagnostics for konnectivity startup race
+26 −4
@yonromai
#3306
Update job-monitoring loop for Ray and Iris tracks
+159 −96
@dlwh
#3305
Validate --region/--zone CLI values in iris job run
+147 −1
@ravwojdyla-agent
#3303
Fixup sqlite logs locking, counts
+123 −105
@ravwojdyla
#3302
Document iris push auth
+15 −0
@ravwojdyla
#3301
Use sqlite3 for log storage on the controller.
+191 −120
@rjpower
#3298
iris: use tmpfs for uv sync and .venv IO on GCE
+119 −5
@app/claude
#3296
normalize logging: unified format, level tagging, and log filtering
+650 −225
@rjpower
#3288
Extend optimizer linear transforms to eqx and marker linears
+96 −45
@dlwh
#3287
Bind eval and inference runtime to explicit mesh resources
+99 −62
@dlwh
#3286
fix: normalize abbreviated bucket names in region_from_prefix
+34 −3
@chonky-bot
#3283
iris: fix dropped logs on task completion + remove FetchTaskLogs RPC
+667 −558
@rjpower
#3281
Use SSH BatchMode to force errors on missing keys
+4 −0
@rjpower
#3273
fray/iris: forward JobRequest retry budgets on submit
+2 −1
@dlwh
#3272
iris smoke-test: fix _run_iris logging format mismatch
+1 −1
@dlwh-golem
#3270
step_runner: preserve underlying step failure cause
+38 −4
@dlwh
#3269
grug: dispatch through fray jobs (to fix multinode)
+102 −6
@dlwh
#3267
distributed_lock: read legacy worker_id lock leases
+48 −48
@dlwh
#3265
executor: run distributed lock and cache on executor node
+154 −59
@app/claude
#3263
Remove resources/env_vars from ExecutorStep; use @remote for dispatch
+42 −53
@rjpower
#3260
Fix runtime_env propagation to TPU SliceActor workers
+75 −19
@Calvin-Xu
#3259
Parallelize GCP zone queries in list_slices and list_vms
+125 −13
@rjpower
#3258
Add MirrorFileSystem for transparent cross-region file access
+917 −228
@rjpower
#3257
Add Cache-Control headers to static assets
+28 −1
@rjpower
#3256
Add backend-dispatched GMM API with GPU fallback
+150 −34
@dlwh-golem
#3254
feat(iris): flexible device variant requests
+3944 −3363
@rjpower
#3251
Add miss-only autotune sweep and cache for pallas fused CE
+619 −9
@dlwh-golem
#3248
[levanter] Default fused CE TPU path to XLA and retune v4 huge-batch blocks
+101 −3
@dlwh
#3244
iris: forward task logs via heartbeat instead of reading from GCS
+496 −435
@rjpower
#3242
feat(iris): parallelize scaling group restoration on startup
+88 −30
@rjpower
#3241
fix(iris): fix parallel CLI tunnel port collisions and improve RPC retry
+18 −11
@rjpower
#3239
Add inline cache to Iris image publishes
+2 −1
@yonromai
#3234
Fix Iris marin_prefix mapping for europe-west4
+8 −1
@dlwh
#3233
feat(iris): auto-detect multinode TPUs and set replicas/coscheduling
+63 −3
@dlwh
#3232
fix(iris): use kebab-case CLI command in bug report autoscaler status hint
+1 −1
@dlwh
#3223
Autoscaler: model reservations as first-class objects with synthetic holder tasks
+441 −286
@rjpower
#3222
Set min-slices and fix worker dashboard.
+208 −68
@rjpower
#3221
iris: validate TPU replicas match topology vm_count
+72 −2
@rjpower
#3218
Use GHCR weekly image as Docker build cache source
+6 −0
@yonromai
#3215
Add Meta-Llama-3.1-8B-Instruct to _KNOWN_VOCAB_SIZES to avoid HF access during dry-runs
+6 −0
@rjpower
#3214
Cache per-window layout and permutations in BlockShufflingDataset
+4 −0
@yonromai
#3213
Iris: derive smoke test bundle prefix from cluster config
+6 −6
@app/claude
#3212
iris: use token bucket for scale-down rate limiting instead of cooldown
+168 −102
@rjpower
#3209
fix(iris): remove duplicate retry loop from actor resolution
+86 −64
@rjpower
#3207
Set max_task_retries=10 for Zephyr workers to survive transient errors
+14 −0
@rjpower
#3206
iris: add network bandwidth tracking and disk sparkline for workers
+396 −195
@app/claude
#3199
iris: make task memory/CPU bars show human-readable values with sparklines
+256 −21
@rjpower
#3195
Enable S3 compilation cache and disable XLA autotune sub-cache
+26 −29
@yonromai
#3193
Make gpu_type required in ResourceConfig.with_gpu()
+6 −6
@yonromai
#3188
Iris: autoscaler fixes, heartbeat performance, and observability
+4424 −1199
@rjpower
#3169
grug: improved variant contract checks
+299 −244
@dlwh
#3168
Show reservation device in dashboard job detail
+22 −1
@rjpower
#3163
Cleanup job CLI and autoscaler visualization.
+249 −60
@rjpower
#3161
Fall back to local compilation cache when MARIN_PREFIX is S3
+18 −3
@yonromai
#3160
Fix RESOURCE_EXHAUSTED: add NVIDIA weight tile limit for Pallas CE kernel
+7 −4
@yonromai
#3158
Retry port-forward tunnel on konnectivity failure
+40 −20
@yonromai
#3156
Fall back to local compilation cache when MARIN_PREFIX is S3
+18 −3
@yonromai
#3154
fix: resolve worker detail 404 when using worker name
+285 −441
@app/claude
#3152
Update cluster configs
+30 −0
@ravwojdyla
#3148
Fix Pallas GPU CE custom backward tracing on non-GB10
+54 −29
@yonromai
#3146
Fix rollout-restart race in CW controller startup
+5 −0
@yonromai
#3143
Improve profile gap attribution and trace quality diagnostics
+269 −37
@dlwh
#3137
refactor(iris): SE cleanup — dead code, deduplication, interface clarity
+949 −1283
@rjpower
#3136
cleanup: replace remaining private API usage in core modules
+46 −25
@dlwh
#3135
trainer/eval_harness: batch-size int-ification for data loaders
+22 −9
@dlwh
#3134
optim: group linear-like routing into explicit marker range
+176 −27
@dlwh
#3133
grug: remove Axis dependency from train/eval dataset wiring
+46 −2
@dlwh
#3127
Add Grug variant visual diff tooling and PR workflow
+1188 −0
@dlwh
#3125
Clean linear CE TPU kernel variants and keep one production forward path
+1864 −405
@dlwh
#3123
Iris: Add reservation system for pre-provisioning worker capacity
+2994 −677
@rjpower
#3119
Name Ray actor processes after the actor group name
+56 −6
@ravwojdyla-agent
#3045
Fix Pallas GPU CE backward gradient tracing on non-GB10
+54 −29
@dlwh
#2822
Iris: Implement CoreWeave platform
#2823
Iris: Improve worker/process status visibility and post-mortem log access
#2824
Iris: Multi-region support with per-scaling-group environment configuration
#2825
Iris: Quota-aware scheduling and cross-zone fallback
#2826
Iris: Richer profiling and worker-level observability
#2827
Iris: Proactive unhealthy/degraded node identification
#2829
Data processing pipeline: validate end-to-end tokenization for all target datasets
#2830
Training monitoring: alerting on stalled/diverging loss and health dashboard
#2831
Validate fault tolerance: checkpoint resume and preemption recovery on CoreWeave
#2832
Agent can run a small model E2E without human intervention
#2833
Establish daily canary training runs
#2834
Executor v2: split out caching module and simplify step API
#2835
Standardize on Vortex format with typed dataset schemas
#2629
Iris: bootstrap script templates are too fragile
#2377
Jobs are not tolerant to the node where `self._run_steps` is running being preempted.
#2651
Iris: Resolver/Actor system should always auto-resolve on transient errors
#2809
Iris: Survey threading and timeouts for the controller
#2810
Iris: benchmark test for controller performance
#2424
Iris - initial resource observability
#2710
Experiment: MoE EP benchmark milestone
#2418
Add AdamC, fp32 router compute, router_topk_then_softmax, qk-norm option for MoE stability sweeps
#2414
Experiment: OLMoE size sweep with MoE stability measures
#2804
fsspec should reject cross region reads (or those over X MB)
#2744
Iris: bootstrap should probably live on the scaling group
#2745
Iris: Add attributes to ScaleGroupConfig for scheduling-level metadata
#2642
Iris: preemptible should be a taint, not an attribute
#2735
Iris: Zone-aware scheduler
#2762
Iris: fair scheduler
#2625
Iris: Users and Priorities
#2749
iris: Migrate GCP platform from gcloud CLI to Python API
#2772
Iris: add proxy for worker view
#2803
iris-controller: add built-in py-spy profiling endpoint to dashboard
#2754
Embed speedscope in Iris dashboard for one-click profile viewing
#2413
SwiGLU vs Bilinear MLP layers for MoE Experts
#2708
Zephyr: auto-scale worker groups up to match demand
#2535
Iris: Integrate chronos virtual time into chaos test suite
#2849
Iris: add smoke test into CI
#2926
Iris: Add Levanter health check in Iris
#3035
StepRunner shouldn't launch tasks with Fray by default
#3098
Evaluate (first few steps) x00B MoE on TPU and GPU
#3164
🆕 Iris: allow controller restarts without resetting tasks
21 potentially related in Other Changes
#3415
docs: align TPU cluster setup tutorial with cluster CLI
+3 −8
@dlwh-golem
#3337
Add Ray cluster safety rule to AGENTS
+1 −0
@dlwh
#3150
ci: run PR checks on all target branches, not just main
+0 −19
@yonromai
#3392
[iris] Bound Docker task workdirs with tmpfs
+201 −0
@rjpower
#3353
iris: remove GCE VM spec defaults, require explicit config
+54 −21
@app/claude
#3331
gruggification: pass explicit axis mappings through train/eval callers
+31 −8
@dlwh
#3329
gruggification: explicit axis-mapping foundation for LM loss path
+23 −9
@dlwh
#3328
gruggification: decouple eval and inference surface from model.Pos
+1 −5
@dlwh
#3327
gruggification: remove remaining direct haliax symbol imports
+619 −542
@dlwh
#3315
lm_model: migrate public LM surface to array-native protocols
+256 −208
@dlwh
#3314
main: default train/eval/lora/viz to array-first Grug datasets
+171 −69
@dlwh
#3313
lm/eval: add array-loss bridge for LM and ASR
+1397 −53
@dlwh
#3312
trainer/runtime: bind execution to explicit mesh resources
+518 −290
@dlwh
#3311
partitioning: complete named_jit facade migration
+20 −18
@dlwh
#3310
mesh/models: centralize scan and partitioning foundations
+427 −234
@dlwh
#3309
eval: explicit batch-resource wiring and compute-axis naming
+213 −82
@dlwh
#3290
Default trainer meshes to explicit axis types
+12 −3
@dlwh
#3289
Add tensor-opaque LM model adapters for array migration
+75 −20
@dlwh
#3275
Rjpower/flatten monorepo
+7760 −32666
@rjpower
#3245
Extract shared utilities into new rigging package
+729 −593
@rjpower
#3153
fix: default `ray_run` entrypoint to 1 CPU
+10 −4
@app/claude
#3096
Pre-training: 32B MoE Kick-off
Following last week's initial MoE experiments on v4 and v5p, @ClassicLarry completed Phase 1 scaling law replication (#3182) and submitted a full 15-run isoflop sweep varying expert counts, granularity, and activation ratios (#2167). The canonical Grug MoE module and template variant landed (#3046). @yonromai built on last week's CW canary ferry by adding a TPU canary (#3342), data loader stall diagnostics (#3346), always-on profiling with persistent artifacts (#3299), and MFU gating on trailing p50 windows (#3279). Grug MoE ring EP got block shuffle as the new default (#3371) and loop profiler annotations (#3376).
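The idea behind gating on a trailing p50 rather than per-step MFU can be sketched as below. This is a hypothetical illustration of the technique, not the canary's actual code; the window size and threshold are made up.

```python
from collections import deque
from statistics import median

# Hypothetical sketch of a trailing-window p50 MFU gate: one stalled
# step should not fail the canary, but a sustained drop should.
class MFUGate:
    def __init__(self, threshold, window=20):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def record(self, mfu):
        self.samples.append(mfu)

    def healthy(self):
        # Gate on the median (p50) of the trailing window, which is
        # robust to the one-off stalls that trip a per-step check.
        if len(self.samples) < self.samples.maxlen:
            return True  # not enough data yet; don't gate
        return median(self.samples) >= self.threshold

gate = MFUGate(threshold=0.40, window=5)
for mfu in [0.45, 0.05, 0.44, 0.46, 0.43]:  # one stalled step
    gate.record(mfu)
ok = gate.healthy()  # p50 = 0.44, still above the gate
```

A mean-based gate over the same window would read 0.366 and fail the run; the median discards the outlier, which is the point of switching the gate to p50.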
16 PRs this week, 33 new comments, and 1 new issue (4 total)
#3342
Add MoE canary ferry for daily TPU regression testing
+206 −0
@yonromai
#3046
Add canonical Grug MoE module and template variant
+1832 −9
@dlwh
#3229
grug/moe: restore aux-loss metrics and remove smoke launcher
+70 −9
@dlwh
#3293
Add modular_opt variant and move Grug variant docs
+1092 −0
@dlwh
#3371
Set block shuffle as the new-run default for Grug
+33 −26
@dlwh
#3376
[grug] Add loop profiler annotations
+41 −32
@dlwh
#3346
canary: add data loader stall diagnostics + keep_nodepool option
+33 −4
@yonromai
#3299
[canary] Enable always-on profiling with persistent artifacts for agent triage
+52 −9
@yonromai
#3280
canary: log fused CE implementation + enable profiler
+8 −0
@yonromai
#3279
canary: switch MFU gate to p50 over trailing window
+31 −2
@yonromai
#3277
ci: delete CW NodePools after canary ferry workflow
+13 −0
@yonromai
#3240
Re-enable daily CW GPU canary ferry schedule
+4 −3
@yonromai
#3217
Fix CW canary OOM and improve training observability
+33 −5
@yonromai
#3177
Use GCS compilation cache for CW canary ferry
+15 −0
@yonromai
#3171
Bump canary ferry timeout from 2h to 4h
+1 −1
@yonromai
#3187
update v6e configs to not request so many big v6e slices
+4 −8
@dlwh
#2371
Grug MoE
#2167
Add a version of isoflop_sweep for MoE's
#3182
🆕 Determine optimal scaling parameters for MoE
#2828
Port MoE training to GPU: kernel experiments and performance validation
17 potentially related in Other Changes
#3331
gruggification: pass explicit axis mappings through train/eval callers
+31 −8
@dlwh
#3329
gruggification: explicit axis-mapping foundation for LM loss path
+23 −9
@dlwh
#3328
gruggification: decouple eval and inference surface from model.Pos
+1 −5
@dlwh
#3327
gruggification: remove remaining direct haliax symbol imports
+619 −542
@dlwh
#3315
lm_model: migrate public LM surface to array-native protocols
+256 −208
@dlwh
#3314
main: default train/eval/lora/viz to array-first Grug datasets
+171 −69
@dlwh
#3313
lm/eval: add array-loss bridge for LM and ASR
+1397 −53
@dlwh
#3312
trainer/runtime: bind execution to explicit mesh resources
+518 −290
@dlwh
#3311
partitioning: complete named_jit facade migration
+20 −18
@dlwh
#3310
mesh/models: centralize scan and partitioning foundations
+427 −234
@dlwh
#3309
eval: explicit batch-resource wiring and compute-axis naming
+213 −82
@dlwh
#3300
AdaMuon implementation
+525 −0
@msclar
#3292
Delphi Scaling Setup
+1405 −617
@Helw150
#3290
Default trainer meshes to explicit axis types
+12 −3
@dlwh
#3289
Add tensor-opaque LM model adapters for array migration
+75 −20
@dlwh
#3274
[Speedrun] Submit NAMO-D LLaMA-300M run
+392 −0
@suraj-ranganath
#3237
Grug Demo, small scale feature maxed moe
+1393 −23
@ClassicLarry
#3100
Data Sources for Pre-training
Summary: We will need 20T of high-quality tokens (code in particular) for our large MoE runs in Q2/Q3; this is the March work that will enable that.
After last week's tokenization debugging, @ravwojdyla shifted to the Luxical embedding experiment for quality and topic evaluation (#3191); the Luxical creator @lukemerrick dropped in with guidance on embedding storage and model usage (#3049). Vortex got GCS support (#3268) and @Helw150 added Nemotron V2 data (#3317). Zephyr's group_by gained secondary sort (#3250) and generator reducers (#3247). Tokenization tuning (#3170) and download reliability fixes (#3324, #3142) rounded out the pipeline work.
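A group_by with a secondary sort key and a generator reducer can be sketched as follows. Function and field names here are illustrative, not the Zephyr API: the secondary key orders records within each group before the reducer sees them, and the reducer is a generator so per-group output streams lazily.

```python
from itertools import groupby

# Hypothetical sketch of group_by with secondary sort + generator
# reducers, in the spirit of the Zephyr enhancements; not the real API.
def group_by(records, key, secondary=None, reducer=None):
    # Sort by (key, secondary) so items within a group arrive ordered.
    records = sorted(
        records, key=lambda r: (key(r), secondary(r) if secondary else 0)
    )
    for k, group in groupby(records, key=key):
        # A generator reducer streams per-group output lazily instead
        # of materializing each group as a list.
        yield from reducer(k, group)

def best_score(k, group):
    # Example reducer: emit the group key with its smallest score,
    # relying on the secondary sort for ordering within the group.
    first = next(group)
    yield (k, first["score"])

data = [
    {"doc": "a", "score": 3},
    {"doc": "b", "score": 1},
    {"doc": "a", "score": 2},
]
out = list(group_by(data, key=lambda r: r["doc"],
                    secondary=lambda r: r["score"], reducer=best_score))
# out == [("a", 2), ("b", 1)]
```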
13 PRs this week, 6 new comments, and 2 new issues (4 total)
3 potentially related in Other Changes
#3192
Synthetic Data
Progress on the SFT front after last week's 0% resolve rate: @AlienKevin reported that switching to Qwen2.5-Coder-32B-Instruct as the student model reached 5/43 on the Rust subset of SWE-bench Multilingual (#2956). A TRL sanity check on Modal confirmed the Marin SFT pipeline isn't at fault for the earlier repetition issues. @moojink followed up with experiments using the larger Qwen3-235B-A22B teacher model and rejection sampling (#2262). No PRs merged this week, but experimental progress was active.
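Rejection sampling for SFT data, as used in the teacher-model experiments above, can be sketched like this. Everything here is a hypothetical stand-in: the real pipeline would generate trajectories from the teacher and verify them against the task's test suite.

```python
import random

# Hypothetical sketch of rejection sampling for SFT data: draw up to k
# candidate trajectories per task from a teacher model and keep only
# those that pass a verifier (e.g. the repo's tests). Names are made up.
def rejection_sample(tasks, generate, verify, k=4):
    kept = []
    for task in tasks:
        for _ in range(k):
            traj = generate(task)
            if verify(task, traj):
                kept.append((task, traj))
                break  # keep at most one passing trajectory per task
    return kept

# Toy stand-ins for a teacher model and a test harness:
random.seed(0)
generate = lambda task: random.random()
verify = lambda task, traj: traj > 0.5
data = rejection_sample(["t1", "t2", "t3"], generate, verify, k=4)
```

The filter rate of the verifier is what determines data yield, which is why swapping in a stronger teacher (more passing samples per k draws) matters more than raising k.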
0 PRs this week, 3 new comments, and 0 new issues (4 total)
#2956
[Agentic SFT] SFT Qwen3-8B on 5K SWE-smith trajectories and show improvement on SWE-bench
#2905
[Agentic SFT] Generate 30K Coding Trajectories across 6 Languages
#3093
[Agentic SFT] Tracking SFT datasets for SWE tasks
#2262
Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B
Other Changes
Documentation alignment across the repo by @dlwh-golem: TPU cluster setup (#3415), contributing hooks (#3307), MkDocs commands (#3271), and README paths (#3235). @gonzalobenegas added LLR-based VEP eval for DNA models (#3144) and an EDA of perplexity vs. downstream task performance (#3333). CI now runs PR checks on all target branches (#3150).
43 PRs this week, 71 new comments, and 73 issues closed (73 total)