147 PRs merged · 21 issues opened · 80 issues closed · 12 contributors · 4 epics · 377 comments this week
Iris got a full-stack overhaul — JWT auth, Vue 3 dashboard, SQLite as canonical state, multi-VM CoreWeave support, and automated nightshift maintenance — while @ClassicLarry completed a systematic MoE isoflop sweep and @dlwh drove a gruggification refactor across the training codebase.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
24/41 sub-issues closed
Following last week's controller checkpointing and reservation work, @rjpower made SQLite the canonical state store for the Iris controller (#3408) and added GCS checkpointing for post-mortem analysis (#3497). The Iris dashboard was rewritten from Preact+HTM to Vue 3 + TypeScript + Tailwind v4 (#3511) with 26 components. Auth landed as a full system: GCP/static-user auth workflows (#3537), followed by a switch to HMAC-SHA256 JWTs for zero-DB-hit verification (#3630).

Multi-VM CoreWeave support shipped with JAX coordinator bootstrap (#3638) and CW-specific fixes: an R2 endpoint correction (#3629), interruptible-taint toleration (#3609), and hardened port-forward tunnels (#3540). The autoscaler got deadlock fixes and rate-limit logging (#3580), direct worker ID assignment (#3512), and ghost-slice prevention on failed scale-down (#3571). @rjpower introduced nightshift: 7 scheduled GitHub Actions workflows that use Claude agents for overnight maintenance, covering cleanup, doc drift, issue triage, and automated PR fixes (#3557, #3612, #3614, #3615). @yonromai consolidated all canaries onto Grug MoE via Iris (#3587), replacing the old Ray-based TPU canary and the separate CW script.

On the training side, @dlwh landed a series of gruggification PRs: sweeping hax.* annotations across model modules (#3326), removing direct haliax imports (#3327), decoupling eval from model.Pos (#3328), adding explicit axis-mapping foundations (#3329, #3331), an array-loss bridge for LM/ASR eval (#3313), and optimizer support for eqx linear masks (#3318). Fused cross-entropy got batch-tiled XLA streaming to avoid int32 word-count limits on long sequences (#3533), plus an autotune ExceptionGroup fallback (#3605). GCS executor lock races under worker churn were fixed (#3541), and cross-region transfer was unified under a single 10GB budget (#3627).
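The zero-DB-hit idea behind the JWT switch (#3630) is that an HMAC-SHA256 token carries its own proof of authenticity, so a request can be verified without touching the SQLite state store. A minimal stdlib-only sketch of the pattern; the secret, claim names, and helper functions here are hypothetical illustrations, not Iris's actual code:

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical shared secret; a real deployment loads this from config.
SECRET = b"iris-demo-secret"

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def sign(claims: dict) -> str:
    """Build a compact JWT: base64url(header).base64url(payload).base64url(sig)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify(token: str):
    """Return the claims if signature and expiry check out, else None.

    No database lookup: the HMAC alone proves the token was minted by a
    holder of SECRET, and the expiry bounds how long it stays valid.
    """
    try:
        header, payload, sig = token.split(".")
    except ValueError:
        return None
    signing_input = f"{header}.{payload}".encode()
    expected = _b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(_b64url_decode(payload))
    if claims.get("exp", 0) < time.time():  # missing exp is treated as expired
        return None
    return claims
```

The usual trade-off of the pattern: without a DB hit there is no per-token revocation, so a token stays valid until `exp` and short lifetimes matter.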
143 PRs this week, 4 new comments, and 0 new issues (41 total)
#3472 feat(iris): add cloud-mode smoke test to CI (2 comments, +142/−2) @rjpower
Building on last week's baseline architecture (64 experts, K=4, AF balancing), @ClassicLarry ran a full isoflop sweep across three FLOP budgets (1e18, 3e18, 1e19) and five hidden dimensions. The sweep architecture used DeepSeek-style MoE with 1 shared expert, top-4 routing, and 2 leading dense layers, testing both AdamW and Muon optimizers with aux-free and load-balance-loss variants. Key finding: at 1e19 FLOPs, d_model=1024 achieved the best BPB (0.9943), the sweep's sweet spot between parameter efficiency and training-token ratio. Muon's BPB was competitive with AdamW's at larger scales, though the comparison was muddied by small batch sizes inflating AdamW's buffer-retune advantage. @ClassicLarry also found that lr warmup is the key differentiator: Muon doesn't need it, while Adam diverges without it, giving Muon a 100-500 step head start. The routing bias term may need to scale with lr at very small batch sizes. Results are tracked on the moe_scaling_iter_01 branch. @dlwh selected the TPU fused CE backend explicitly for the MoE loss path (#3644) and started work on selective remat checkpoint policies (#3657).
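An isoflop sweep holds compute fixed and trades parameters against training tokens at each budget. A back-of-envelope sketch of how such a grid is laid out, using the common C ≈ 6·N·D approximation and a dense 12·L·d_model² parameter estimate; the actual sweep used a DeepSeek-style MoE, so the layer count, widths other than 1024, and both formulas are illustrative assumptions, not the sweep's real configuration:

```python
def isoflop_grid(flop_budgets, d_models, n_layers=24):
    """Enumerate (budget, width) points with the token counts they imply."""
    grid = []
    for c in flop_budgets:
        for d in d_models:
            n_params = 12 * n_layers * d ** 2   # rough dense transformer estimate
            n_tokens = c / (6 * n_params)       # tokens that spend the whole budget
            grid.append({
                "flops": c,
                "d_model": d,
                "params": n_params,
                "tokens": int(n_tokens),
                "tokens_per_param": n_tokens / n_params,
            })
    return grid

# The three budgets from the sweep; widths and layer count are assumptions.
for point in isoflop_grid([1e18, 3e18, 1e19], [512, 768, 1024, 1536, 2048]):
    print(point)
```

The tokens_per_param column is what makes the "sweet spot" visible: at a fixed budget, narrower models see more tokens per parameter and wider models fewer, and the best BPB lands somewhere in between.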
2 PRs this week, 26 new comments, and 0 new issues (4 total)
#3644 [grug/moe] Select TPU fused CE backend in MoE loss path (3 comments, +4/−2) @dlwh
Summary: We will need 20T high-quality tokens (code in particular) for our large MoE runs in Q2/Q3; this epic tracks the March work that will enable that.
0/4 sub-issues closed
After last week's Nemotron-CC tokenization milestone, @ravwojdyla focused on Zephyr reliability at scale. The shuffle layer was rewritten to use Parquet instead of per-chunk pickle blobs (#3482), with a pickle-in-parquet fallback for non-Arrow-serializable items (#3656) — eliminating the M×R×C file blowup that had been a bottleneck. Writers got dynamic batch sizing targeting 64MB buffers (#3498) and streaming Vortex output (#3479). Exact dedup was rewritten for Nemotron scale with single-pass hash-group-write and Vortex output (#3442). The Zephyr coordinator hang when all workers OOM was fixed (#3600). @Helw150 added data inspection improvements for debugging training spikes (#3284).
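The dynamic batch sizing in #3498 amounts to flushing a writer buffer once its estimated serialized size reaches a target, rather than after a fixed row count. A minimal sketch of size-targeted batching; only the 64MB target comes from the PR description, while `batched` and `estimate_size` are hypothetical names for illustration:

```python
TARGET_BYTES = 64 * 1024 * 1024  # 64MB buffer target, per #3498

def batched(rows, estimate_size, target_bytes=TARGET_BYTES):
    """Yield batches whose estimated serialized size just reaches target_bytes.

    estimate_size(row) -> int is any cheap per-row size estimate
    (e.g. serialized byte length).
    """
    batch, size = [], 0
    for row in rows:
        batch.append(row)
        size += estimate_size(row)
        if size >= target_bytes:
            yield batch
            batch, size = [], 0
    if batch:  # flush the partial final batch
        yield batch
```

Compared with a fixed row count, this keeps writer memory roughly constant even when row sizes vary by orders of magnitude across shards.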
2 PRs this week, and 0 new issues (4 total)
#3600 Fix Zephyr coordinator hang when all workers OOM (6 comments, +86/−10) @rjpower
#3284 Tweaks to data inspection from debugging spikes (5 comments, +339/−88) @Helw150
Issues
#3049Test Luxical as a General Tool for Data Integration Pipelines
Light activity this week. On OpenThoughts4, @natolambert commented that the modest advantage of 235B over 32B as teacher is "the most interesting thing" and urged testing multiple teachers and investigating rejection sampling findings (#2262). No new PRs landed in this area.
0 PRs this week, 2 new comments, and 0 new issues (4 total)
Issues
#2956[Agentic SFT] SFT Qwen3-8B on 5K SWE-smith trajectories and show improvement on SWE-bench
#2905[Agentic SFT] Generate 30K Coding Trajectories across 6 Languages
#3093[Agentic SFT] Tracking SFT datasets for SWE tasks
#2262 Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B (2 comments)
@gonzalobenegas generalized DNA model tokenization with separate uppercase/lowercase weights and added a functional_pos experiment (#3483). @dlwh-golem continued documentation alignment across tutorials, contributing guides, and eval docs. @teetone opened Evalchemy eval fixes (#3690).
23 PRs this week, 44 new comments, and 80 issues closed (80 total)