Iris moved from last week's reliability hardening to a full log storage rewrite and reservation system. The MoE expert-parallel benchmark thread concluded with production compaction optimizations, and @ClassicLarry submitted a 15-run isoflop scaling sweep after last week's initial MoE experiments.
#2836
Infrastructure: MoE Training Support
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
Building on last week's Iris reliability push, @rjpower replaced GCS-based log reads with a SQLite-backed controller log store forwarded via heartbeat (#3301, #3244), a full rewrite of the log pipeline. He also landed a reservation system for pre-provisioning worker capacity (#3123, #3223), and fixed FD exhaustion under load (#3389), controller lock contention (#3356), and delivery-failure retry budget inflation (#3366, #3367). @dlwh continued the Pallas kernel cleanup from last week, consolidating to a single production forward path for fused cross-entropy (#3125) and stabilizing it across TPU v4/v5e/v6e (#3354). The MoE ring expert-parallel benchmark thread (#2710) concluded this week with production compaction optimizations merged (#3377, #3398). @yonromai hardened CoreWeave deployment with konnectivity tunnel retries (#3323), Docker CLI pinning for TPU host compatibility (#3348), and compilation cache improvements (#3195). @ravwojdyla parallelized GCP zone queries (#3259) and added MirrorFileSystem for transparent cross-region file access (#3258).
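The heartbeat-forwarded log store pattern can be sketched as follows. This is a hypothetical illustration, not the actual Iris schema or API: workers append log lines to a local SQLite table, and each heartbeat ships only the rows past the cursor the controller last acknowledged, so nothing is re-read from object storage.

```python
import sqlite3

# Hypothetical sketch of a heartbeat-forwarded SQLite log store.
# Table and column names are illustrative, not the Iris implementation.

def make_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS logs ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  task TEXT NOT NULL,"
        "  line TEXT NOT NULL)"
    )
    return db

def append(db, task, line):
    db.execute("INSERT INTO logs (task, line) VALUES (?, ?)", (task, line))
    db.commit()

def fetch_since(db, cursor):
    # The controller tracks the last autoincrement id it has seen;
    # each heartbeat carries only the delta past that cursor.
    rows = db.execute(
        "SELECT id, task, line FROM logs WHERE id > ? ORDER BY id", (cursor,)
    ).fetchall()
    new_cursor = rows[-1][0] if rows else cursor
    return rows, new_cursor

db = make_store()
append(db, "train-0", "step 1: loss=3.2")
append(db, "train-0", "step 2: loss=3.1")
rows, cur = fetch_since(db, 0)   # first heartbeat: both rows
rows2, _ = fetch_since(db, cur)  # next heartbeat: nothing new
```

The autoincrement id doubles as a monotonic cursor, which is what makes the prefix-based fetching in #3360 cheap: a heartbeat is a single indexed range scan.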
108 PRs this week, 22 new comments, and 1 new issue (41 total)
#3398
Simplify Grug MoE ring EP local counts
+26 −10
@dlwh
#3377
Optimize Grug MoE ring EP compaction
+16 −12
@dlwh
#3167
iris: controller snapshot/checkpoint
+4647 −488
@rjpower
#3162
Add CrossRegionGuardedFS to block large cross-region GCS reads
+774 −312
@rjpower
#3427
Improve non-daemon thread reporting by grouping identical stacks
+14 −4
@rjpower
#3424
cleanup: remove autoscaler unmet demand log spam
+2 −12
@rjpower
#3411
[Iris] Align TPU VM metadata for v5p, v5e-8, and v6e-8
+21 −22
@dlwh
#3407
Fix log store WAL bloat during eviction
+15 −4
@rjpower
#3405
iris: reject unschedulable coscheduled jobs at submission time
+172 −60
@rjpower
#3403
Fix SSH tunnel poisoning parent stdout with O_NONBLOCK
+5 −2
@yonromai
#3401
fix: handle unhealthy worker under lock to prevent race condition
+24 −22
@rjpower
#3397
iris: generalize profiling with threads support and target-based routing
+571 −274
@rjpower
#3395
Use human-readable slice names instead of epoch milliseconds
+28 −9
@rjpower
#3393
iris: add worker health checks and unify heartbeat paths
+307 −44
@app/claude
#3390
iris: default endpoints tab to 100 rows with prefix search
+78 −11
@rjpower
#3389
Fix controller FD exhaustion under load
+5 −1
@rjpower
#3387
iris: disable coredumps in worker and task containers
+4 −0
@rjpower
#3382
fix: add null guards for users data in dashboard
+2 −2
@rjpower
#3369
fix(iris): log perf, scheduling fixes, holder task device constraints
+1213 −529
@rjpower
#3368
fix(iris): forward server-side tracebacks in RPC errors
+233 −193
@rjpower
#3367
Fix preemption_count inflation from delivery failures
+38 −49
@rjpower
#3366
Fix delivery failure handling: don't count undelivered tasks against retry budgets
+281 −37
@rjpower
#3364
fix(zephyr,iris): add retry+backoff for controller RPCs and pipeline retries
+16 −2
@rjpower
#3363
Handle reservation holder task worker deaths gracefully
+222 −5
@rjpower
#3362
Fix profiling summary pipeline on GPU traces
+125 −29
@yonromai
#3361
Improve log propagation reliability and shutdown handling
+32 −9
@rjpower
#3360
feat(iris): prefix-based log fetching with autoincrement cursor
+434 −455
@rjpower
#3356
iris: reduce controller lock contention and RPC overhead
+420 −301
@rjpower
#3354
[Levanter] Stabilize fused CE TPU v4 tuning and vmem fallback
+1511 −324
@dlwh
#3352
cleanup: remove iris demo cluster & yaml
+16 −1460
@chonky-bot
#3350
SSH to GCP util
+109 −0
@ravwojdyla
#3348
Pin Docker CLI to 24.0 in worker image for TPU host compat
+4 −1
@chonky-bot
#3344
iris: use pd-ssd boot disk for controller VM
+20 −1
@app/claude
#3343
Allow to filter iris logs by level in CLI
+26 −1
@ravwojdyla
#3340
Search all zones for the TPU
+14 −8
@ravwojdyla
#3325
iris: use shared logging widget for process & task logs
+829 −2244
@rjpower
#3323
coreweave: widen tunnel timeout and add diagnostics for konnectivity startup race
+26 −4
@yonromai
#3306
Update job-monitoring loop for Ray and Iris tracks
+159 −96
@dlwh
#3305
Validate --region/--zone CLI values in iris job run
+147 −1
@ravwojdyla-agent
#3303
Fixup sqlite logs locking, counts
+123 −105
@ravwojdyla
#3302
Document iris push auth
+15 −0
@ravwojdyla
#3301
Use sqlite3 for log storage on the controller.
+191 −120
@rjpower
#3298
iris: use tmpfs for uv sync and .venv IO on GCE
+119 −5
@app/claude
#3296
normalize logging: unified format, level tagging, and log filtering
+650 −225
@rjpower
#3288
Extend optimizer linear transforms to eqx and marker linears
+96 −45
@dlwh
#3287
Bind eval and inference runtime to explicit mesh resources
+99 −62
@dlwh
#3286
fix: normalize abbreviated bucket names in region_from_prefix
+34 −3
@chonky-bot
#3283
iris: fix dropped logs on task completion + remove FetchTaskLogs RPC
+667 −558
@rjpower
#3281
Use SSH BatchMode to force errors on missing keys
+4 −0
@rjpower
#3273
fray/iris: forward JobRequest retry budgets on submit
+2 −1
@dlwh
#3272
iris smoke-test: fix _run_iris logging format mismatch
+1 −1
@dlwh-golem
#3270
step_runner: preserve underlying step failure cause
+38 −4
@dlwh
#3269
grug: dispatch through fray jobs (to fix multinode)
+102 −6
@dlwh
#3267
distributed_lock: read legacy worker_id lock leases
+48 −48
@dlwh
#3265
executor: run distributed lock and cache on executor node
+154 −59
@app/claude
#3263
Remove resources/env_vars from ExecutorStep; use @remote for dispatch
+42 −53
@rjpower
#3260
Fix runtime_env propagation to TPU SliceActor workers
+75 −19
@Calvin-Xu
#3259
Parallelize GCP zone queries in list_slices and list_vms
+125 −13
@rjpower
#3258
Add MirrorFileSystem for transparent cross-region file access
+917 −228
@rjpower
#3257
Add Cache-Control headers to static assets
+28 −1
@rjpower
#3256
Add backend-dispatched GMM API with GPU fallback
+150 −34
@dlwh-golem
#3254
feat(iris): flexible device variant requests
+3944 −3363
@rjpower
#3251
Add miss-only autotune sweep and cache for pallas fused CE
+619 −9
@dlwh-golem
#3248
[levanter] Default fused CE TPU path to XLA and retune v4 huge-batch blocks
+101 −3
@dlwh
#3244
iris: forward task logs via heartbeat instead of reading from GCS
+496 −435
@rjpower
#3242
feat(iris): parallelize scaling group restoration on startup
+88 −30
@rjpower
#3241
fix(iris): fix parallel CLI tunnel port collisions and improve RPC retry
+18 −11
@rjpower
#3239
Add inline cache to Iris image publishes
+2 −1
@yonromai
#3234
Fix Iris marin_prefix mapping for europe-west4
+8 −1
@dlwh
#3233
feat(iris): auto-detect multinode TPUs and set replicas/coscheduling
+63 −3
@dlwh
#3232
fix(iris): use kebab-case CLI command in bug report autoscaler status hint
+1 −1
@dlwh
#3223
Autoscaler: model reservations as first-class objects with synthetic holder tasks
+441 −286
@rjpower
#3222
Set min-slices and fix worker dashboard.
+208 −68
@rjpower
#3221
iris: validate TPU replicas match topology vm_count
+72 −2
@rjpower
#3218
Use GHCR weekly image as Docker build cache source
+6 −0
@yonromai
#3215
Add Meta-Llama-3.1-8B-Instruct to _KNOWN_VOCAB_SIZES to avoid HF access during dry-runs
+6 −0
@rjpower
#3214
Cache per-window layout and permutations in BlockShufflingDataset
+4 −0
@yonromai
#3213
Iris: derive smoke test bundle prefix from cluster config
+6 −6
@app/claude
#3212
iris: use token bucket for scale-down rate limiting instead of cooldown
+168 −102
@rjpower
#3209
fix(iris): remove duplicate retry loop from actor resolution
+86 −64
@rjpower
#3207
Set max_task_retries=10 for Zephyr workers to survive transient errors
+14 −0
@rjpower
#3206
iris: add network bandwidth tracking and disk sparkline for workers
+396 −195
@app/claude
#3199
iris: make task memory/CPU bars show human-readable values with sparklines
+256 −21
@rjpower
#3195
Enable S3 compilation cache and disable XLA autotune sub-cache
+26 −29
@yonromai
#3193
Make gpu_type required in ResourceConfig.with_gpu()
+6 −6
@yonromai
#3188
Iris: autoscaler fixes, heartbeat performance, and observability
+4424 −1199
@rjpower
#3169
grug: improved variant contract checks
+299 −244
@dlwh
#3168
Show reservation device in dashboard job detail
+22 −1
@rjpower
#3163
Cleanup job CLI and autoscaler visualization.
+249 −60
@rjpower
#3161
Fall back to local compilation cache when MARIN_PREFIX is S3
+18 −3
@yonromai
#3160
Fix RESOURCE_EXHAUSTED: add NVIDIA weight tile limit for Pallas CE kernel
+7 −4
@yonromai
#3158
Retry port-forward tunnel on konnectivity failure
+40 −20
@yonromai
#3156
Fall back to local compilation cache when MARIN_PREFIX is S3
+18 −3
@yonromai
#3154
fix: resolve worker detail 404 when using worker name
+285 −441
@app/claude
#3152
Update cluster configs
+30 −0
@ravwojdyla
#3148
Fix Pallas GPU CE custom backward tracing on non-GB10
+54 −29
@yonromai
#3146
Fix rollout-restart race in CW controller startup
+5 −0
@yonromai
#3143
Improve profile gap attribution and trace quality diagnostics
+269 −37
@dlwh
#3137
refactor(iris): SE cleanup — dead code, deduplication, interface clarity
+949 −1283
@rjpower
#3136
cleanup: replace remaining private API usage in core modules
+46 −25
@dlwh
#3135
trainer/eval_harness: batch-size int-ification for data loaders
+22 −9
@dlwh
#3134
optim: group linear-like routing into explicit marker range
+176 −27
@dlwh
#3133
grug: remove Axis dependency from train/eval dataset wiring
+46 −2
@dlwh
#3127
Add Grug variant visual diff tooling and PR workflow
+1188 −0
@dlwh
#3125
Clean linear CE TPU kernel variants and keep one production forward path
+1864 −405
@dlwh
#3123
Iris: Add reservation system for pre-provisioning worker capacity
+2994 −677
@rjpower
#3119
Name Ray actor processes after the actor group name
+56 −6
@ravwojdyla-agent
#3045
Fix Pallas GPU CE backward gradient tracing on non-GB10
+54 −29
@dlwh
#2822
Iris: Implement CoreWeave platform
#2823
Iris: Improve worker/process status visibility and post-mortem log access
#2824
Iris: Multi-region support with per-scaling-group environment configuration
#2825
Iris: Quota-aware scheduling and cross-zone fallback
#2826
Iris: Richer profiling and worker-level observability
#2827
Iris: Proactive unhealthy/degraded node identification
#2829
Data processing pipeline: validate end-to-end tokenization for all target datasets
#2830
Training monitoring: alerting on stalled/diverging loss and health dashboard
#2831
Validate fault tolerance: checkpoint resume and preemption recovery on CoreWeave
#2832
Agent can run a small model E2E without human intervention
#2833
Establish daily canary training runs
#2834
Executor v2: split out caching module and simplify step API
#2835
Standardize on Vortex format with typed dataset schemas
#2629
Iris: bootstrap script templates are too fragile
#2377
Jobs are not tolerant to the node where `self._run_steps` is running being preempted.
#2651
Iris: Resolver/Actor system should always auto-resolve on transient errors
#2809
Iris: Survey threading and timeouts for the controller
#2810
Iris: benchmark test for controller performance
#2424
Iris - initial resource observability
#2710
Experiment: MoE EP benchmark milestone
#2418
Add AdamC, fp32 router compute, router_topk_then_softmax, qk-norm option for MoE stability sweeps
#2414
Experiment: OLMoE size sweep with MoE stability measures
#2804
fsspec should reject cross region reads (or those over X MB)
#2744
Iris: bootstrap should probably live on the scaling group
#2745
Iris: Add attributes to ScaleGroupConfig for scheduling-level metadata
#2642
Iris: preemptible should be a taint, not an attribute
#2735
Iris: Zone-aware scheduler
#2762
Iris: fair scheduler
#2625
Iris: Users and Priorities
#2749
iris: Migrate GCP platform from gcloud CLI to Python API
#2772
Iris: add proxy for worker view
#2803
iris-controller: add built-in py-spy profiling endpoint to dashboard
#2754
Embed speedscope in Iris dashboard for one-click profile viewing
#2413
SwiGLU vs Bilinear MLP layers for MoE Experts
#2708
Zephyr: auto-scale worker groups up to match demand
#2535
Iris: Integrate chronos virtual time into chaos test suite
#2849
Iris: add smoke test into CI
#2926
Iris: Add Levanter health check in Iris
#3035
StepRunner shouldn't launch tasks with Fray by default
#3098
Evaluate (first few steps) x00B MoE on TPU and GPU
#3164
🆕 Iris: allow controller restarts without resetting tasks
21 potentially related in Other Changes
#3415
docs: align TPU cluster setup tutorial with cluster CLI
+3 −8
@dlwh-golem
#3337
Add Ray cluster safety rule to AGENTS
+1 −0
@dlwh
#3150
ci: run PR checks on all target branches, not just main
+0 −19
@yonromai
#3392
[iris] Bound Docker task workdirs with tmpfs
+201 −0
@rjpower
#3353
iris: remove GCE VM spec defaults, require explicit config
+54 −21
@app/claude
#3331
gruggification: pass explicit axis mappings through train/eval callers
+31 −8
@dlwh
#3329
gruggification: explicit axis-mapping foundation for LM loss path
+23 −9
@dlwh
#3328
gruggification: decouple eval and inference surface from model.Pos
+1 −5
@dlwh
#3327
gruggification: remove remaining direct haliax symbol imports
+619 −542
@dlwh
#3315
lm_model: migrate public LM surface to array-native protocols
+256 −208
@dlwh
#3314
main: default train/eval/lora/viz to array-first Grug datasets
+171 −69
@dlwh
#3313
lm/eval: add array-loss bridge for LM and ASR
+1397 −53
@dlwh
#3312
trainer/runtime: bind execution to explicit mesh resources
+518 −290
@dlwh
#3311
partitioning: complete named_jit facade migration
+20 −18
@dlwh
#3310
mesh/models: centralize scan and partitioning foundations
+427 −234
@dlwh
#3309
eval: explicit batch-resource wiring and compute-axis naming
+213 −82
@dlwh
#3290
Default trainer meshes to explicit axis types
+12 −3
@dlwh
#3289
Add tensor-opaque LM model adapters for array migration
+75 −20
@dlwh
#3275
Rjpower/flatten monorepo
+7760 −32666
@rjpower
#3245
Extract shared utilities into new rigging package
+729 −593
@rjpower
#3153
fix: default `ray_run` entrypoint to 1 CPU
+10 −4
@app/claude
#3096
Pre-training: 32B MoE Kick-off
Following last week's initial MoE experiments on v4 and v5p, @ClassicLarry completed Phase 1 scaling law replication (#3182) and submitted a full 15-run isoflop sweep varying expert counts, granularity, and activation ratios (#2167). The canonical Grug MoE module and template variant landed (#3046). @yonromai built on last week's CW canary ferry by adding a TPU canary (#3342), data loader stall diagnostics (#3346), always-on profiling with persistent artifacts (#3299), and MFU gating on trailing p50 windows (#3279). Grug MoE ring EP got block shuffle as the new default (#3371) and loop profiler annotations (#3376).
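The idea behind gating on a trailing p50 rather than per-step MFU can be sketched as below. This is a hypothetical illustration of the technique, not the canary's actual code; the window size and threshold are made up.

```python
from collections import deque
from statistics import median

# Hypothetical sketch of a trailing-window p50 MFU gate: one stalled
# step should not fail the canary, but a sustained drop should.
class MFUGate:
    def __init__(self, threshold, window=20):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def record(self, mfu):
        self.samples.append(mfu)

    def healthy(self):
        # Gate on the median (p50) of the trailing window, which is
        # robust to the one-off stalls that trip a per-step check.
        if len(self.samples) < self.samples.maxlen:
            return True  # not enough data yet; don't gate
        return median(self.samples) >= self.threshold

gate = MFUGate(threshold=0.40, window=5)
for mfu in [0.45, 0.05, 0.44, 0.46, 0.43]:  # one stalled step
    gate.record(mfu)
ok = gate.healthy()  # p50 = 0.44, still above the gate
```

A mean-based gate over the same window would read 0.366 and fail the run; the median discards the outlier, which is the point of switching the gate to p50.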
16 PRs this week, 33 new comments, and 1 new issue (4 total)
#3342
Add MoE canary ferry for daily TPU regression testing
+206 −0
@yonromai
#3046
Add canonical Grug MoE module and template variant
+1832 −9
@dlwh
#3229
grug/moe: restore aux-loss metrics and remove smoke launcher
+70 −9
@dlwh
#3293
Add modular_opt variant and move Grug variant docs
+1092 −0
@dlwh
#3371
Set block shuffle as the new-run default for Grug
+33 −26
@dlwh
#3376
[grug] Add loop profiler annotations
+41 −32
@dlwh
#3346
canary: add data loader stall diagnostics + keep_nodepool option
+33 −4
@yonromai
#3299
[canary] Enable always-on profiling with persistent artifacts for agent triage
+52 −9
@yonromai
#3280
canary: log fused CE implementation + enable profiler
+8 −0
@yonromai
#3279
canary: switch MFU gate to p50 over trailing window
+31 −2
@yonromai
#3277
ci: delete CW NodePools after canary ferry workflow
+13 −0
@yonromai
#3240
Re-enable daily CW GPU canary ferry schedule
+4 −3
@yonromai
#3217
Fix CW canary OOM and improve training observability
+33 −5
@yonromai
#3177
Use GCS compilation cache for CW canary ferry
+15 −0
@yonromai
#3171
Bump canary ferry timeout from 2h to 4h
+1 −1
@yonromai
#3187
update v6e configs to not request so many big v6e slices
+4 −8
@dlwh
#2371
Grug MoE
#2167
Add a version of isoflop_sweep for MoE's
#3182
🆕 Determine optimal scaling parameters for MoE
#2828
Port MoE training to GPU: kernel experiments and performance validation
17 potentially related in Other Changes
#3331
gruggification: pass explicit axis mappings through train/eval callers
+31 −8
@dlwh
#3329
gruggification: explicit axis-mapping foundation for LM loss path
+23 −9
@dlwh
#3328
gruggification: decouple eval and inference surface from model.Pos
+1 −5
@dlwh
#3327
gruggification: remove remaining direct haliax symbol imports
+619 −542
@dlwh
#3315
lm_model: migrate public LM surface to array-native protocols
+256 −208
@dlwh
#3314
main: default train/eval/lora/viz to array-first Grug datasets
+171 −69
@dlwh
#3313
lm/eval: add array-loss bridge for LM and ASR
+1397 −53
@dlwh
#3312
trainer/runtime: bind execution to explicit mesh resources
+518 −290
@dlwh
#3311
partitioning: complete named_jit facade migration
+20 −18
@dlwh
#3310
mesh/models: centralize scan and partitioning foundations
+427 −234
@dlwh
#3309
eval: explicit batch-resource wiring and compute-axis naming
+213 −82
@dlwh
#3300
AdaMuon implementation
+525 −0
@msclar
#3292
Delphi Scaling Setup
+1405 −617
@Helw150
#3290
Default trainer meshes to explicit axis types
+12 −3
@dlwh
#3289
Add tensor-opaque LM model adapters for array migration
+75 −20
@dlwh
#3274
[Speedrun] Submit NAMO-D LLaMA-300M run
+392 −0
@suraj-ranganath
#3237
Grug Demo, small scale feature maxed moe
+1393 −23
@ClassicLarry
#3100
Data Sources for Pre-training
Summary: We will need 20T of high-quality tokens (code in particular) for our large MoE runs in Q2/Q3; this is the March work that will enable that.
After last week's tokenization debugging, @ravwojdyla shifted to the Luxical embedding experiment for quality and topic evaluation (#3191); the Luxical creator @lukemerrick dropped in with guidance on embedding storage and model usage (#3049). Vortex got GCS support (#3268) and @Helw150 added Nemotron V2 data (#3317). Zephyr's group_by gained secondary sort (#3250) and generator reducers (#3247). Tokenization tuning (#3170) and download reliability fixes (#3324, #3142) rounded out the pipeline work.
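A group_by with a secondary sort key and a generator reducer can be sketched as follows. Function and field names here are illustrative, not the Zephyr API: the secondary key orders records within each group before the reducer sees them, and the reducer is a generator so per-group output streams lazily.

```python
from itertools import groupby

# Hypothetical sketch of group_by with secondary sort + generator
# reducers, in the spirit of the Zephyr enhancements; not the real API.
def group_by(records, key, secondary=None, reducer=None):
    # Sort by (key, secondary) so items within a group arrive ordered.
    records = sorted(
        records, key=lambda r: (key(r), secondary(r) if secondary else 0)
    )
    for k, group in groupby(records, key=key):
        # A generator reducer streams per-group output lazily instead
        # of materializing each group as a list.
        yield from reducer(k, group)

def best_score(k, group):
    # Example reducer: emit the group key with its smallest score,
    # relying on the secondary sort for ordering within the group.
    first = next(group)
    yield (k, first["score"])

data = [
    {"doc": "a", "score": 3},
    {"doc": "b", "score": 1},
    {"doc": "a", "score": 2},
]
out = list(group_by(data, key=lambda r: r["doc"],
                    secondary=lambda r: r["score"], reducer=best_score))
# out == [("a", 2), ("b", 1)]
```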
13 PRs this week, 6 new comments, and 2 new issues (4 total)
3 potentially related in Other Changes
#3192
Synthetic Data
Progress on the SFT front after last week's 0% resolve rate: @AlienKevin reported that switching to Qwen2.5-Coder-32B-Instruct as the student model reached 5/43 on the Rust subset of SWE-bench Multilingual (#2956). A TRL sanity check on Modal confirmed the Marin SFT pipeline isn't at fault for the earlier repetition issues. @moojink followed up with experiments using the larger Qwen3-235B-A22B teacher model and rejection sampling (#2262). No PRs merged this week, but experimental progress was active.
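Rejection sampling for SFT data, as used in the teacher-model experiments above, can be sketched like this. Everything here is a hypothetical stand-in: the real pipeline would generate trajectories from the teacher and verify them against the task's test suite.

```python
import random

# Hypothetical sketch of rejection sampling for SFT data: draw up to k
# candidate trajectories per task from a teacher model and keep only
# those that pass a verifier (e.g. the repo's tests). Names are made up.
def rejection_sample(tasks, generate, verify, k=4):
    kept = []
    for task in tasks:
        for _ in range(k):
            traj = generate(task)
            if verify(task, traj):
                kept.append((task, traj))
                break  # keep at most one passing trajectory per task
    return kept

# Toy stand-ins for a teacher model and a test harness:
random.seed(0)
generate = lambda task: random.random()
verify = lambda task, traj: traj > 0.5
data = rejection_sample(["t1", "t2", "t3"], generate, verify, k=4)
```

The filter rate of the verifier is what determines data yield, which is why swapping in a stronger teacher (more passing samples per k draws) matters more than raising k.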
0 PRs this week, 3 new comments, and 0 new issues (4 total)
#2956
[Agentic SFT] SFT Qwen3-8B on 5K SWE-smith trajectories and show improvement on SWE-bench
#2905
[Agentic SFT] Generate 30K Coding Trajectories across 6 Languages
#3093
[Agentic SFT] Tracking SFT datasets for SWE tasks
#2262
Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B
Other Changes
Documentation alignment across the repo by @dlwh-golem: TPU cluster setup (#3415), contributing hooks (#3307), MkDocs commands (#3271), and README paths (#3235). @gonzalobenegas added LLR-based VEP eval for DNA models (#3144) and an EDA of perplexity vs. downstream task performance (#3333). CI now runs PR checks on all target branches (#3150).
43 PRs this week, 71 new comments, and 73 issues closed (73 total)