A massive Iris infrastructure push — controller checkpointing, reservation system, flexible device scheduling, and a complete logging overhaul — alongside MoE scaling law experiments by @ClassicLarry and continued kernel/optimizer work from @dlwh.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
Building on last week's reliability push, Iris saw its biggest week yet. @rjpower landed controller snapshot/checkpointing #3167 (so restarts no longer orphan running jobs), a reservation system for pre-provisioning worker capacity #3123, #3223, and flexible device variant scheduling #3254 that lets jobs specify multiple acceptable TPU types for cross-region placement. The autoscaler was reworked with token-bucket rate limiting for scale-down #3212 and reduced lock contention under high task counts #3356. A complete logging overhaul replaced GCS-based log reads with heartbeat-forwarded logs stored in SQLite on the controller #3244, #3301, #3283, #3296, #3325, fixing dropped logs on task completion and eliminating file descriptor exhaustion. MirrorFileSystem #3258 provides transparent cross-region file access, while CrossRegionGuardedFS #3162 blocks large cross-region reads. On the training side, @dlwh continued building on last week's Grug refactor with improved variant contract checks #3169, visual diff tooling for reviewing template-heavy Grug code #3127, a new modular_opt variant #3293, and MoE ring expert-parallel optimizations #3377, #3398 that closed the EP benchmark milestone #2710. Fused cross-entropy was stabilized: one production forward path #3125, miss-only autotune sweeps #3251, a backend-dispatched GMM API with GPU fallback #3256, and v4 vmem fallback stabilization #3354. @yonromai fixed Pallas GPU CE tracing on non-GB10 #3148, added NVIDIA weight tile limits #3160, and enabled S3 compilation caching #3195.
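For context on the scale-down change in #3212: a token bucket caps the average rate of an action while still allowing short bursts, so the autoscaler cannot tear down capacity faster than tokens refill. A minimal Python sketch of the idea, with illustrative class and parameter names rather than the actual Iris API:

```python
import time

class TokenBucket:
    """Illustrative token bucket: allow at most `rate` scale-down actions
    per second on average, with bursts capped at `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # rate limit hit; skip this scale-down pass

# e.g. release at most one worker every 30s on average, bursting up to 5 at once
scale_down_limiter = TokenBucket(rate=1 / 30, capacity=5)
```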
Following last week's initial MoE experiments on v4 and v5p, @ClassicLarry ran an extensive set of scaling law experiments — replicating expert-count sweeps from the TGL paper #3182 and progressing to full isoflop sweeps #2167. Key findings: DeepSeek-style aux-loss-free load balancing outperformed traditional LBL across configurations, and at high sparsity ratios (2:128) a 4x LBL coefficient boost showed only marginal benefit. The sweep converged on a baseline architecture — 64 routed experts, K=2, with aux-loss-free balancing (bias_rate=0.01) plus a 0.001 aux loss — now tracked as moe_iteration_01. @yonromai added a MoE canary ferry for daily TPU regression testing #3342 alongside canary diagnostics improvements: data loader stall monitoring #3346, always-on profiling with persistent artifacts #3299, MFU gating on trailing p50 #3279, and CW canary OOM fixes #3217. @dlwh set block shuffle as the default for new Grug runs #3371, dispatched Grug through Fray jobs to fix multinode training #3269, and added auto-detection of multinode TPUs to set replica counts and coscheduling #3233.
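For reference, the aux-loss-free scheme leaves the router's combine weights untouched and instead adds a per-expert bias used only for top-k selection, nudging that bias by a fixed step (the bias_rate) whenever an expert is over- or under-loaded. A minimal JAX sketch of that update, assuming a [tokens, experts] score matrix; this illustrates the technique, not the Marin/Grug implementation:

```python
import jax.numpy as jnp

def route_with_bias(scores, bias, k):
    """Pick top-k experts per token using bias-adjusted scores;
    the bias only affects selection, not the combine weights."""
    selected = jnp.argsort(scores + bias, axis=-1)[:, -k:]      # [tokens, k]
    weights = jnp.take_along_axis(scores, selected, axis=-1)    # unbiased combine weights
    return selected, weights

def update_bias(bias, selected, num_experts, bias_rate=0.01):
    """Aux-loss-free balancing: push the bias of overloaded experts
    down and underloaded experts up by a fixed step (bias_rate)."""
    load = jnp.bincount(selected.reshape(-1), length=num_experts)
    overloaded = load > load.mean()
    return jnp.where(overloaded, bias - bias_rate, bias + bias_rate)
```

In the baseline configuration above, a small (0.001) auxiliary load-balancing loss is kept alongside this bias update.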
Summary: We will need 20T of high-quality tokens (including, in particular, code) for our large MoE runs in Q2/Q3; this is the work we will do in March to enable that.
After last week's tokenization debugging, @ravwojdyla completed Nemotron-CC tokenization at scale — processing nearly 2 trillion tokens across all 7 quality tiers at ~150M tokens/sec across 512 workers #2829. The Luxical embedding experiment #3191 kicked off to evaluate frozen embeddings as general-purpose quality/topic classifiers, with Luxical's creator @lukemerrick offering usage guidance #3049. @Helw150 added Nemotron V2 data #3317. Zephyr saw Vortex upgraded to support GCS #3268, group_by enhancements including secondary sort and generator reducers #3250, #3247, and download reliability fixes #3324, #3142.
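For readers unfamiliar with the group_by pattern behind #3250 and #3247 (group records by a primary key, re-sort each group by a secondary key, then stream the group through a generator-style reducer), here is a plain-Python sketch; `group_by`, `newest`, and `crawl_records` are illustrative names, not Zephyr's API:

```python
from itertools import groupby
from operator import itemgetter

def group_by(records, key, secondary_key=None, reducer=None):
    """Group records by `key`, optionally sorting each group by
    `secondary_key`, then stream each group through a generator reducer."""
    records = sorted(records, key=itemgetter(key))
    for group_key, group in groupby(records, key=itemgetter(key)):
        group = list(group)
        if secondary_key is not None:
            group.sort(key=itemgetter(secondary_key))
        if reducer is None:
            yield group_key, group
        else:
            # Generator reducer: may emit zero or many outputs per group.
            yield from reducer(group_key, group)

# e.g. keep only the newest record per URL (hypothetical input data)
def newest(url, docs):
    yield docs[-1]

deduped = list(group_by(crawl_records, key="url",
                        secondary_key="timestamp", reducer=newest))
```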
Progress on the SFT front after last week's 0% resolve rate: @AlienKevin achieved 5/43 on the Rust subset of SWE-bench Multilingual by switching to Qwen2.5-Coder-32B-Instruct as the student model and using the exact SWE-smith hyperparameters #2956. A TRL sanity check on Modal confirmed the earlier Qwen3-8B repetition issues were not a Marin-specific bug. On OpenThoughts4, @moojink completed follow-up experiments with the larger Qwen3-235B-A22B teacher #2262 — surprisingly, it showed only a modest advantage over the 32B teacher for Llama3.1-8B-Instruct students, suggesting diminishing returns from teacher scale.
@gonzalobenegas added LLR-based variant effect prediction evaluation for DNA models #3144 and an EDA notebook on perplexity vs. downstream task performance #3333. @dlwh-golem aligned documentation across TPU cluster setup #3415, contributing hooks #3307, MkDocs commands #3271, and README paths #3235, and CI was improved to run PR checks on all target branches #3150.
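LLR-based variant effect prediction scores a variant by comparing the model's log-probability of the alternate allele against the reference allele at the variant position. A minimal sketch of that scoring, assuming per-position logits over the nucleotide vocabulary; it illustrates the general recipe rather than the code in #3144:

```python
import jax

def variant_llr(logits, pos, ref_id, alt_id):
    """Log-likelihood ratio of the alternate vs. reference allele at `pos`,
    given a DNA LM's per-position logits of shape [seq_len, vocab_size].
    Negative values mean the model prefers the reference allele."""
    log_probs = jax.nn.log_softmax(logits[pos])
    return log_probs[alt_id] - log_probs[ref_id]

# Hypothetical usage: score a SNP from the model's logits and a nucleotide vocab.
# llr = variant_llr(model_logits, pos=snp_position, ref_id=vocab["A"], alt_id=vocab["G"])
```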
The largest single run this week was Will Held's 10B-parameter AdamH scaling ladder at 1e22 FLOP budget on v4-256 (128 chips), reaching 47.5% MFU and processing 116B tokens over 59 hours before crashing — the first Marin run at this scale. MoE scaling law experiments intensified: ClassicLarry ran two complete TGL Phase 1 expert-count sweeps (2 to 256 experts, 8.3B tokens each) on v4-8 plus an EKN nano scaling sweep testing K and LBL coefficient interactions across ~20 configurations. The 256-expert configs consistently achieved the best loss (3.060-3.186). Moo Jin Kim completed a large OpenThoughts4 SFT run (Qwen3-1.7B with Qwen3-32B teacher, 100k steps on v4-128 at 34.7% MFU). David Hall began 32B-A4B MoE bring-up on v5p-64, running several short profiling attempts at 11-20% MFU that crashed but established baseline performance numbers. Parallel-attn-mlp experiments on v6e hardware investigated persistent NaN issues under restart.
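A note on reading the FLOP Budget column in the table below: the reported MFU matches the ratio of model FLOPs to hardware FLOPs for each run, e.g. for the first row:

```python
# MFU as the fraction of the hardware FLOP budget spent on model FLOPs.
model_flops, hw_flops = 8.18e22, 2.65e23   # first row of the table below
print(f"{model_flops / hw_flops:.0%}")      # ~31%, matching the reported MFU
```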
| Run | Owner | Hardware | FLOP Budget | Wall Time | Loss / Evals | Links |
|---|---|---|---|---|---|---|
| adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb (running) | Will Held | TPU v4 (512 chips) | 8.18e22 model / 2.65e23 HW (31% MFU) | 22.1d | BPB: 0.796 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v5-025b0e | Will Held | TPU v4 (256 chips) | 10.00e21 model / 1.99e22 HW (50% MFU) | 3.5d | BPB: 0.768 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v6-500e71 (crashed) | Will Held | TPU v4 (256 chips) | 7.23e21 model / 1.52e22 HW (48% MFU) | 2.5d | BPB: 0.950 | W&B |
| (done) exp2262pt3c_pt2_qwen3_1pt7b_base_ot4_240k_math_qwen3_4b_32768tok-a9bd48 | Moo Jin Kim | TPU v4 (128 chips) | 2.59e21 model / 7.71e21 HW (34% MFU) | 2.4d | BPB: 0.067 | W&B |
| (done) exp2262pt3d_pt2_qwen3_1pt7b_base_ot4_240k_math_qwen3_32b_32768to-58b647 | Moo Jin Kim | TPU v4 (128 chips) | 2.59e21 model / 7.46e21 HW (35% MFU) | 2.1d | BPB: 0.146 | W&B |
| AdamH scaling ladder 10B (Nemotron-optimal, 1e22 budget) (crashed) | @Helw150 | v4-256 (128 chips) | (47.5% MFU) | 59.0h | loss=2.702, 115.9B tokens | W&B |
| adamh-v6-scaling-ladder-nemotron-optimal-1e+23-a128a5 (crashed) | Will Held | TPU v4 (256 chips) | 1.25e21 model / 3.22e21 HW (39% MFU) | 15.8h | BPB: 1.190 | W&B |
| adamh-v6-scaling-ladder-nemotron-optimal-1e+22-81073a (crashed) | Will Held | TPU v4 (256 chips) | 1.07e21 model / 2.93e21 HW (36% MFU) | 16.1h | BPB: 0.948 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-019021 | Will Held | TPU v4 (64 chips) | 1.00e21 model / 1.90e21 HW (53% MFU) | 1.3d | BPB: 0.844 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v6-77f848 | Will Held | TPU v4 (64 chips) | 1.00e21 model / 1.89e21 HW (53% MFU) | 1.3d | BPB: 0.844 | W&B |
| (done) exp2262pt3i_100k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-a594bd | Moo Jin Kim | TPU v4 (128 chips) | 5.41e20 model / 1.56e21 HW (35% MFU) | 13.3h | BPB: 0.145 | W&B |
| (done) exp2262pt3i_100k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-092c35 | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 model / 1.50e21 HW (36% MFU) | 1.2d | BPB: 0.135 | W&B |
| (done) exp2262pt3h_100k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-0c0f70 | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 model / 1.50e21 HW (36% MFU) | 18.5h | BPB: 0.064 | W&B |
| exp2262pt3h_100k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-ce68a9 | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 model / 1.50e21 HW (36% MFU) | 1.2d | BPB: 0.078 | W&B |
| (done) exp2262pt3i_100k_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768tokens-227b3e | Moo Jin Kim | TPU v5 (32 chips) | 5.41e20 model / 1.50e21 HW (36% MFU) | 1.7d | BPB: 0.135 | W&B |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v8 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.002 | W&B |
| OpenThoughts4 SFT: Qwen3-1.7B base w/ Qwen3-32B teacher (100k steps) | @moojink | v4-128 (64 chips) | (34.7% MFU) | 13.3h | loss=0.166, 16.4B tokens | #2262 W&B |
| TGL Phase 1 MoE run3 (2 experts, v5p-16) (crashed) | @ClassicLarry | v5p-16 (8 chips) | (18.3% MFU) | 15.5h | loss=3.393, 20.7B tokens (crashed) | #3182 W&B |
| TGL Phase 1 MoE expert-count sweep (run5, 2-256 experts) | @ClassicLarry | v4-8 (4 chips) | (14-16% MFU) | 10-17h per run | loss=3.186 (256E) to 3.541 (2E), 8.3B tokens each, 8 configs | #3182 W&B W&B W&B |
| TGL Phase 1 MoE expert-count sweep (run2-v2, 2-256 experts) | @ClassicLarry | v4-8 (4 chips) | (14-16% MFU) | 10-14h per run | loss=3.060 (256E) to 3.423 (2E), 8.3B tokens each, 8 configs | #3182 W&B W&B W&B |
| EKN MoE nano scaling sweep (K=4/8, E=8-128, LBL ablation) | @ClassicLarry | v4-8 (4 chips) | (10-13% MFU) | 10-14h per run | loss=3.215-3.823, 6.6B tokens each, ~20 configs | #3182 W&B W&B |
| Grug MoE 32B-A4B v5p-64 perf bring-up (profiling) (crashed) | @dlwh | v5p-64 (32 chips) | (11-20% MFU) | 0.3-0.6h per attempt | profiling only, ~20M tokens per attempt | #3357 W&B W&B W&B |
| Parallel-attn-mlp v6e NaN investigation sweeps (crashed) | @dlwh | v6e-8 (4 chips) | (5-7% MFU) | 0.4-1.5h per run | loss=NaN investigation, 0.8-1.5B tokens (crashed) | #3316 W&B W&B W&B |