73 merged · 5 opened · 39 issues closed · 11 contributors · 4 epics · 237 comments this week
Iris reliability hardening, Grug's module-first API refactor, and early MoE training experiments on TPU. The first CoreWeave GPU canary ferry was stood up and @ClassicLarry got Grug MoE running with replicated weights on v4 and v5p.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
24/41 sub-issues closed
Iris saw a major reliability push: @yonromai added request-level RPC observability (#3073), auto-retry on DEADLINE_EXCEEDED (#3067), container phase tracking replacing a boolean flag (#3105, #3106), local K8s e2e tests (#3097), and CW RBAC + scheduling fixes (#3111). @rjpower landed live resource monitoring in the dashboard (#3085), fractional CPU support via millicores (#3040), and a unified WorkerConfig proto (#3077), and replaced the multi-region AR push with GHCR + AR remote repos (#2996). @dlwh refactored Grug to a module-first API (#3017), tuned fused cross-entropy for v4 with an XLA fallback (#3052), set TPU VMEM defaults (#3053), and added Pallas cost estimates (#2999). In Zephyr, @ravwojdyla added heartbeat timeout suppression (#3014) and replaced shared data with disk-based serialization (#2986).
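The auto-retry in #3067 targets gRPC's DEADLINE_EXCEEDED status. A minimal sketch of the pattern in plain Python, with `TimeoutError` standing in for the gRPC status code; the helper name, attempt count, and backoff schedule are illustrative, not Iris's actual implementation:

```python
import time


def call_with_retry(rpc, *, max_attempts=3, base_delay=0.01):
    """Retry an RPC-style callable on timeout, with exponential backoff.

    TimeoutError stands in for gRPC's DEADLINE_EXCEEDED here; any
    other exception is considered non-retryable and raised immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of retries; surface the deadline error
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated flaky RPC: times out twice, then succeeds.
calls = {"n": 0}

def flaky_rpc():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("DEADLINE_EXCEEDED")
    return "ok"

result = call_with_retry(flaky_rpc, base_delay=0)
```

The key design point is retrying only the timeout status: errors like INVALID_ARGUMENT would fail the same way on every attempt, so retrying them just wastes the deadline budget.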
60 PRs this week, 49 new comments, and 3 new issues (41 total)
Summary: We will need 20T high-quality tokens (in particular, code) for our large MoE runs in Q2/Q3; this epic tracks the March work needed to enable that.
0/4 sub-issues closed
@ravwojdyla spent the week debugging tokenization at scale in preemptible environments (#2829), cataloging and fixing a series of Zephyr bugs around shared data, heartbeat timeouts, and Zarr metadata errors. Nemotron tokenization got a disk_cache tokenizer and configurable shard counts (#2984, #3036). @Helw150 provided context on the removal of Bert/Fasttext quality classifiers, motivating the Luxical embedding exploration (#3049).
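The disk_cache tokenizer with configurable shard counts (#2984, #3036) follows a common pattern for preemptible environments: hash each document to a fixed shard, write finished shards to disk, and skip already-written shards on rerun. A minimal sketch under those assumptions; the whitespace tokenizer, function names, and JSON shard format are illustrative, not Zephyr's actual code:

```python
import hashlib
import json
import pathlib


def tokenize(text):
    # Toy stand-in tokenizer (whitespace split); the real pipeline
    # uses a Nemotron tokenizer, not reproduced here.
    return text.split()


def cached_tokenize(docs, cache_dir, num_shards=4):
    """Tokenize docs into num_shards shard files, reusing shards on disk.

    Shard assignment hashes each doc id, so a rerun after preemption
    routes every doc to the same shard and completed shards are skipped.
    """
    cache_dir = pathlib.Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Deterministic doc -> shard assignment via a stable hash.
    shards = [[] for _ in range(num_shards)]
    for doc_id, text in docs.items():
        h = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
        shards[h % num_shards].append((doc_id, text))

    paths = []
    for i, shard in enumerate(shards):
        path = cache_dir / f"shard-{i:05d}.json"
        if not path.exists():  # resume: completed shards are not recomputed
            tokens = {doc_id: tokenize(text) for doc_id, text in shard}
            path.write_text(json.dumps(tokens))
        paths.append(path)
    return paths
```

Making the shard count configurable matters at scale: more shards mean less recomputation lost per preemption, at the cost of more small files.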
3 PRs this week, 8 new comments, and 4 new issues (4 total)
@ClassicLarry got Grug MoE running with replicated weights on v5p (#3064), benchmarking MFU across configurations and demonstrating that v5p's 95GB HBM allows full weight replication with batch-only sharding. @yonromai stood up the first daily CoreWeave GPU canary ferry workflow (#3050). Active discussion on MoE expert-parallel benchmarks by @dlwh on v5p-8 (#2710), and @ClassicLarry began scoping the MoE isoflop sweep (#2167).
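Full weight replication with batch-only sharding (#3064) means every device holds a complete copy of the parameters while the batch is split along its leading axis, so the forward pass needs no weight all-gathers. A minimal JAX sketch of that layout; the shapes and the `data` mesh axis name are illustrative, not Grug's actual configuration:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1-D device mesh with a single "data" axis over all available devices.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Empty PartitionSpec => fully replicated: each device holds all weights.
w_sharding = NamedSharding(mesh, P())
# Shard only the leading (batch) dimension across the "data" axis.
x_sharding = NamedSharding(mesh, P("data"))

w = jax.device_put(jnp.ones((4, 4)), w_sharding)  # replicated weights
x = jax.device_put(jnp.ones((8, 4)), x_sharding)  # batch-sharded inputs

@jax.jit
def forward(w, x):
    # Each device multiplies its batch slice by its local full weight copy.
    return x @ w

y = forward(w, x)
```

The trade-off is HBM for communication: replication spends memory to avoid gathering weights each step, which is exactly what v5p's 95GB HBM makes affordable for this model size.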
3 PRs this week, 43 new comments, and 1 new issue (4 total)
@AlienKevin's SFT run on SWE-smith trajectories finished in 8.5 hours but yielded a 0% resolve rate on SWE-bench Multilingual with the gpt-5-mini teacher (#2956). Two alternative teacher models (minimax-m2.5 and another) are being tried. Repetition issues with Qwen3-8B during SFT prompted a sanity check using TRL on Modal. @moojink continued the OpenThoughts4 teacher model comparison (#2262), with @natolambert following along.
0 PRs this week, 7 new comments, and 2 new issues (4 total)
Issues
#2956 🆕 [Agentic SFT] SFT Qwen3-8B on 5K SWE-smith trajectories and show improvement on SWE-bench (4 comments)
#2905 [Agentic SFT] Generate 30K Coding Trajectories across 6 Languages
#3093 🆕 [Agentic SFT] Tracking SFT datasets for SWE tasks
#2262 Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B (3 comments)
Other Changes
@gonzalobenegas added DNA experiments covering promoters, genomic regions, and k-mer tokenization (#2992), plus auto-detection of BOS/EOS tokens in the DNA batch tokenizer (#3055). @teetone updated Evalchemy non-math evaluation domains (#3128). Agent recipe and scrub skill improvements by @dlwh (#3129) and @rjpower (#3056).
12 PRs this week, 24 new comments, and 39 issues closed (39 total)
#3129 Add scrub skills for docs parity and self-improvement (2 comments, +64 −0) @dlwh