A massive Iris architecture overhaul — SQLite as canonical state, JWT auth, Vue 3 dashboard, and multi-VM CoreWeave support — alongside MoE isoflop sweeps, the MoE architecture advancing to iter_03 with sigmoid routing and AdamH, and Zephyr's shuffle rewrite enabling Nemotron-scale dedup.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
Iris underwent a deep architectural overhaul this week. @rjpower made SQLite the canonical state store #3408, replacing the prior in-memory state with a proper schema, migration system, and normalized tables for scaling groups and slices #3514. The controller now checkpoints its SQLite state to GCS for post-mortem analysis #3497. Auth was overhauled from per-RPC DB lookups to HMAC-SHA256 JWTs #3630, #3537. The dashboard was rewritten from Preact+HTM to Vue 3 with TypeScript, Rsbuild, and Tailwind v4 — 26 components covering all existing tabs plus a new task detail page #3511. Multi-VM CoreWeave support landed with JAX coordinator bootstrap for distributed training across VMs #3638, alongside namespace-qualified RBAC so multiple Iris instances on the same cluster no longer interfere #3703. Job lifecycle got new preemption and existing-job policies #3685, proper slice reaping on worker failure #3425, and ghost-slice prevention during scale-down #3571. The autoscaler saw deadlock fixes and rate-limit logging #3580, #3531, #3616. On the data processing side, @ravwojdyla rewrote Zephyr's shuffle to use Parquet instead of per-chunk pickle blobs #3482, #3656, added dynamic batch sizing for writers #3498, and made exact dedup work at Nemotron scale via single-pass hash-group-write #3442. Protobuf generation was moved from checked-in files to an auto-generating hatch build hook #3631. @dlwh batch-tiled the XLA fused cross-entropy path to handle long sequences without hitting the TPU int32 word-count limit #3533, tuned mixed-dtype block sizes #3452, and fixed GCS executor step-lock races under worker churn #3541. @rjpower also introduced nightshift — automated GitHub Actions workflows using Claude agents for overnight cleanup, dead code removal, and issue triage #3557, #3615.
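The JWT change is worth unpacking: rather than hitting the database on every RPC, the controller signs a token once and workers verify it statelessly with a shared secret. Below is a minimal sketch of HMAC-SHA256 (HS256) token signing and verification using only the Python standard library; the claim names, secret handling, and helper functions are illustrative assumptions, not Iris's actual auth code.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_token(claims: dict, secret: bytes) -> str:
    """Build an HS256 JWT: base64url(header).base64url(payload).base64url(signature)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"


def verify_token(token: str, secret: bytes) -> dict:
    """Verify the signature and expiry without any database lookup."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims


# Hypothetical usage: the controller signs once, workers verify per RPC.
token = sign_token({"sub": "worker-7", "exp": time.time() + 3600}, b"shared-secret")
print(verify_token(token, b"shared-secret")["sub"])
```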
Building on last week's initial isoflop sweep, @ClassicLarry completed full results for the 15-config isoflop grid across three FLOP budgets (1e18, 3e18, 1e19) with five model widths each #2167. The sweep confirmed d768 as the optimal width at 1e18 and 3e18, with d1024 competitive at 3e18. Parallel EKN scaling experiments #3182 explored expert count and granularity — 128 experts with K=2 achieved the best BPB (1.0565) but at significantly lower MFU (14.9%) than the K=4/E=32 baseline. Aux-loss-free balancing consistently outperformed traditional load balancing loss. @Helw150 pushed the architecture to iter_03 with sigmoid routing (replacing softmax), independent per-expert gating, and an AdamH optimizer sweep on the updated recipe. @WhenWen ran AdamH comparisons showing ~0.009–0.01 BPB improvements over the v02 baseline at smaller dimensions, and tested Gated Norm from recent literature on top of AdamH. A new sub-issue tracks testing the architecture at 1e21 and 1e22 FLOP scales #3800 to validate routing stability at production scale. On the GPU side, @chloechiaw is evaluating Tokamax's Pallas Triton ragged_dot kernel as a JAX-native alternative to custom GPU MoE kernels #2828. @yonromai consolidated all canaries into a single Grug MoE entry point running through Iris #3587, with fused CE autotune fallback fixes #3605.
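For readers unfamiliar with the routing change: softmax routing normalizes gate scores across experts so they compete for a fixed probability mass, while sigmoid routing gives each expert an independent gate in (0, 1). A minimal JAX sketch of the sigmoid-routing idea follows; the shapes, top-k selection, and renormalization convention here are assumptions for illustration, not the iter_03 implementation.

```python
import jax
import jax.numpy as jnp


def sigmoid_route(x, w_router, k=4):
    """Pick top-k experts per token with independent sigmoid gates.

    x:        [tokens, d_model] token activations
    w_router: [d_model, n_experts] router projection
    Returns (expert_ids [tokens, k], gate_weights [tokens, k]).
    """
    logits = x @ w_router                   # [tokens, n_experts]
    gates = jax.nn.sigmoid(logits)          # independent per-expert gate in (0, 1)
    weights, ids = jax.lax.top_k(gates, k)  # keep the k strongest gates per token
    # One common convention: renormalize the kept gates so they sum to 1 per token.
    weights = weights / jnp.sum(weights, axis=-1, keepdims=True)
    return ids, weights


# Hypothetical usage on random data.
x = jax.random.normal(jax.random.PRNGKey(0), (8, 768))
w = jax.random.normal(jax.random.PRNGKey(1), (768, 32)) * 0.02
ids, weights = sigmoid_route(x, w, k=4)
print(ids.shape, weights.shape)  # (8, 4) (8, 4)
```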
Summary: We will need 20T of high-quality tokens (code in particular) for our large MoE runs in Q2/Q3; this epic tracks the work we will do in March to enable that.
Following last week's Nemotron-CC tokenization milestone, @ravwojdyla completed large-scale fuzzy dedup of Nemotron splits using the rewritten group-by pipeline #2829, validating that the Parquet-based Zephyr shuffle handles production-scale data without the file-count blowup that plagued the pickle approach. The exact dedup rewrite #3442 replaces the two-pass dup-map strategy with single-pass hash-group-annotate, outputting Vortex files directly. In a new embedding evaluation effort #3535, @ravwojdyla tested Luxical-One embeddings as a general-purpose representation for data curation. Quality filtering via linear probe scaled from Spearman 0.485 at N=125 to 0.75 at N=10K — promising but below the 0.8 go threshold. Luxical outperformed Arctic-L and BGE-large at 6× fewer dimensions. Topic clustering was a clear no-go across all models (best NMI 0.478). @gonzalobenegas continued DNA model work with separate uppercase/lowercase weight handling and a functional_pos experiment tracking functional vs nonfunctional log-likelihood across model sizes #3483.
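The single-pass idea is that hashing, grouping, and writing survivors all happen in one traversal of the corpus, with no separate duplicate map to build and then re-read. Below is a toy in-memory sketch of exact dedup in that style, assuming everything fits in one process; the real pipeline shards hash groups across Zephyr workers and writes Vortex files rather than yielding strings.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Single-pass exact dedup: hash each document, keep the first one seen per hash.

    A toy stand-in for hash-group-write: the "group" here is just the set of hashes
    already seen, and surviving documents are emitted immediately in the same pass.
    """
    seen: set[bytes] = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).digest()
        if digest in seen:
            continue      # exact duplicate of an earlier document; drop it
        seen.add(digest)
        yield doc         # written out in the same pass; no second dup-map pass


corpus = ["the cat", "a dog", "the cat", "a dog", "new text"]
print(list(exact_dedup(corpus)))  # ['the cat', 'a dog', 'new text']
```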
On the OpenThoughts4 front, @moojink ran follow-up experiments with smaller models — training Qwen3-1.7B-Base on data from Qwen3-4B and Qwen3-32B teachers, plus a rejection sampling variant where Qwen3-4B generates and Qwen3-32B verifies #2262. These experiments test whether the modest advantage of larger teachers observed with Llama3.1-8B holds across student scales. @natolambert flagged the need to test multiple teachers as the key takeaway, with rejection sampling results still pending.
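As a reference for the rejection sampling variant: the smaller teacher proposes candidate solutions and the larger model only accepts or rejects them, so the quality of the resulting SFT data comes from the verifier rather than the generator. A schematic sketch of that filtering loop; `generate` and `verify` are hypothetical stand-ins for the two model calls, not the experiment's actual code.

```python
from typing import Callable


def rejection_sample(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],  # e.g. Qwen3-4B: prompt -> n candidate solutions
    verify: Callable[[str, str], bool],         # e.g. Qwen3-32B: (prompt, solution) -> accept?
    samples_per_prompt: int = 4,
) -> list[tuple[str, str]]:
    """Keep only (prompt, solution) pairs that the verifier accepts."""
    accepted = []
    for prompt in prompts:
        for solution in generate(prompt, samples_per_prompt):
            if verify(prompt, solution):
                accepted.append((prompt, solution))
                break  # one accepted solution per prompt is enough for SFT data
    return accepted


# Toy usage with stand-in functions (the real setup calls the two models).
pairs = rejection_sample(
    ["1+1=?", "2+2=?"],
    generate=lambda p, n: [f"{p} answer {i}" for i in range(n)],
    verify=lambda p, s: s.endswith("0"),
)
print(pairs)  # only candidates the verifier accepted
```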
@dlwh added an HBM optimization guide #3595 and refactored the Pallas kernel skill with reference guides #3570. @dlwh-golem continued documentation alignment — contributing guides #3520, #3596, docs build guides #3551, #3607, eval entrypoint drift fixes #3549, and agent workflow skills #3663, #3662.
MoE scaling dominated the week: a 15-config isoflop sweep at 1e18–1e19 FLOPs settled on d768 as optimal, alongside long-run architecture comparisons on preemptible v5p-64 and a ~235B-parameter trial on v5p-256 whose first two attempts failed. Agentic SFT reproduced NemotronTerminal baselines on Terminal-Bench.
| Run | Owner | Hardware | FLOP Budget (MFU) | Wall Time | Results | Links |
|---|---|---|---|---|---|---|
| adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb (running) | Will Held | TPU v4 (512 chips) | 8.18e22 model, 2.65e23 HW (31% MFU) | 22.1d | BPB: 0.796 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v6-500e71 (crashed) | Will Held | TPU v4 (256 chips) | 7.23e21 model, 1.52e22 HW (48% MFU) | 2.5d | BPB: 0.950 | W&B |
| exp2262pt3k_100k_llama_3_2_1b_ot4_math_qwen3_32b_32768tokens-ad9dd3 | Moo Jin Kim | TPU v4 (128 chips) | 1.01e21 model, 5.01e21 HW (20% MFU) | 1.7d | BPB: 0.175 | W&B |
| adamh-scaling-ladder-nemotron-optimal-1e+22-v7-5f064e (crashed) | Will Held | TPU v4 (256 chips) | 1.92e21 model, 3.84e21 HW (50% MFU) | 15.7h | BPB: 0.959 | W&B |
| exp2262pt3l_240k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-1b291f | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.143 | W&B |
| exp2262pt3i_240k_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768tokens-ec97cb | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.3d | BPB: 0.126 | W&B |
| exp2262pt3i_240k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-2ff561 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.139 | W&B |
| exp2262pt3i_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-eeb1d6 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.181 | W&B |
| exp2262pt3h_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-007999 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.085 | W&B |
| exp2262pt3h_240k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-29bb5e | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.061 | W&B |
| exp2262pt3l_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-5eb7b0 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.167 | W&B |
| exp2262pt3h_240k_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tokens-985c01 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.3d | BPB: 0.055 | W&B |
| exp2262pt3l_240k_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thinkin-b8cff7 | Moo Jin Kim | TPU v4 (128 chips) | 1.30e21 model, 3.73e21 HW (35% MFU) | 1.2d | BPB: 0.142 | W&B |
| exp2262pt3k_50k_llama_3_2_1b_ot4_math_qwen3_32b_32768tokens-d60dcb | Moo Jin Kim | TPU v5 (32 chips) | 5.03e20 model, 1.62e21 HW (31% MFU) | 1.3d | BPB: 0.195 | W&B |
| exp2262pt3l_100k_pt2_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-427e5c (done) | Moo Jin Kim | TPU v4 (128 chips) | 5.41e20 model, 1.62e21 HW (33% MFU) | 12.9h | BPB: 0.157 | W&B |
| MoE 1e18 real-data recipe sweep (8 runs) | @dlwh | v5p-8 (4 chips) | 21–23.3% MFU | 6–9.5h per run | E=64/K=4. Best: h1024 loss=3.404 at 2.6B tokens (23.3% MFU). Long 30k-step: loss=3.433 at 4.2B tokens. | #3522 W&B W&B W&B |
| MoE isoflop sweep (15 configs × 3 budgets) | @ClassicLarry | v5p-8 (4 chips) | 17–28% MFU | ~1.2h (1e18) / ~3h (3e18) / ~9h (1e19) per config | Best at 1e18: d768 loss=3.406. Best at 3e18: d1536 loss=3.090. Best at 1e19: d1536 loss=2.931. | #3522 W&B |
| MoE nano scaling + weight decay sweep | @ClassicLarry | v5p-8 (4 chips) | 17.0–17.1% MFU | ~1.7h per wd run, ~2h (3e18 d768) | Best wd=0.08 loss=3.402. 3e18 d768: loss=3.110 at 2.5B tokens. | #3466 W&B |
| Architecture compare: es3r2 vs g15 vs sab4 vs es1/es2 | @dlwh | v5p-64 (32 chips), preemptible | es3r2: 23.6%, g15: 22.2%, sab4: 21.9%, es2: 22.5%, es1: 20.6% MFU | ~20min each (profiling runs) | es3r2 promoted as primary target shape. Throughput profiling at v5p-64 scale. | #3528 W&B W&B W&B W&B W&B |
| Grug MoE ~235B topk4 shared2x trial (failed) | @dlwh | v5p-256 (128 chips), preemptible | — | failed after ~30min (2 attempts) | 238B-param MoE, topk=4, 2× shared expert. Both attempts failed. | #3536 |
| NemotronTerminal-8B SFT (full corpus) | @AlienKevin | v5p-32 (16 chips) SFT, v5p-8 (4 chips) eval, preemptible | 40.9% MFU | 120h (SFT), ~1h (eval) | SFT loss=0.360. TB2.0: 13.5% (matches 13.0±2.2). TBLite: 16.0%, mean reward 0.180. | #3490 W&B W&B |
| Grug MoE canary (GPU + TPU) (running) | @yonromai | H100 (CoreWeave) + GCP TPU | 3.7% MFU (canary, not a perf target) | continuous | Consolidated all canaries to a single Grug MoE entry point via Iris. | #3505 #3587 |
| LLaMA-50M Muon speedrun (1× Chinchilla) | @redagavin | — | — | — | Muon optimizer at 1× Chinchilla scale. loss=4.061, 0.7B tokens. | #2185 W&B |
| GPU MoE EP benchmark (ragged all-to-all) | @chloechiaw | 8× H100 (CoreWeave) | — | benchmark | GPU MoE expert parallelism with ragged all-to-all kernel. | #3633 |
| JAX DeepEP vs Megatron head-to-head | @dlwh | 8× H100 (CoreWeave) | — | benchmark | JAX custom-call DeepEP benchmarked against Megatron baseline. | #3665 |