This update covers Iris reliability hardening, Grug's module-first API refactor, and early MoE training experiments on TPU. The first CoreWeave GPU canary ferry was stood up, and @ClassicLarry got Grug MoE running with replicated weights on v4 and v5p.
@gonzalobenegas added DNA experiments covering promoters, genomic regions, and k-mer tokenization #2992, plus auto-detection of BOS/EOS tokens in the DNA batch tokenizer #3055. @teetone updated the Evalchemy non-math evaluation domains #3128. @dlwh #3129 and @rjpower #3056 landed agent recipe and scrub skill improvements.
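For background on the k-mer tokenization used in those DNA experiments: a k-mer tokenizer slides a fixed-length window over the nucleotide sequence and maps each window to a vocabulary id, optionally wrapping the result in the BOS/EOS tokens whose auto-detection #3055 addresses. A minimal illustrative sketch, not the actual tokenizer from #2992; all names here are hypothetical:

```python
from itertools import product

def build_kmer_vocab(k: int, alphabet: str = "ACGT") -> dict[str, int]:
    """Enumerate all 4^k DNA k-mers, reserving ids 0/1 for BOS/EOS."""
    vocab = {"<bos>": 0, "<eos>": 1}
    for kmer in product(alphabet, repeat=k):
        vocab["".join(kmer)] = len(vocab)
    return vocab

def tokenize_kmers(seq: str, k: int, vocab: dict[str, int],
                   stride: int = 1) -> list[int]:
    """Slide a length-k window over the sequence (overlapping when
    stride < k) and wrap the resulting ids in BOS/EOS."""
    ids = [vocab[seq[i:i + k]] for i in range(0, len(seq) - k + 1, stride)]
    return [vocab["<bos>"], *ids, vocab["<eos>"]]

vocab = build_kmer_vocab(k=3)
print(tokenize_kmers("ACGTAC", k=3, vocab=vocab))  # BOS, 4 overlapping 3-mers, EOS
```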
Dense isoflop scaling experiments continued with a full v8 rerun of the scaling_v3 sweeps at 3e20 and 2e20 FLOP budgets, showing slightly improved losses vs. v3 (e.g., 2.576 vs. 2.577 at the 2.5B/3e20 point). The first Grug MoE runs appeared: @dlwh ran a 286M MoE with replicated weights on v5p-8 to 11.3B tokens at 21.6% MFU as a flopmatch baseline against the dense canary (a sketch of the replicated-weights layout follows the table), and attempted a 2-host v5p-16 trial that crashed. @ClassicLarry submitted a 300M speedrun on v5p-16. The first CoreWeave GPU canary ferry launched on H100x8 but crashed early at 3.7% MFU, beginning the GPU platform bring-up. The AdamH hyperparameter mega-sweep continued with 50+ additional 5B-token runs.
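For reading the table below: the FLOP Budget column lists both model FLOPs and the implied hardware FLOPs at the stated MFU (HW = model / MFU), and the isoflop sweeps fix a compute budget C while varying model size N, which pins the token budget at D = C / (6N). A minimal sketch of that accounting, assuming the standard 6ND approximation for dense transformer training FLOPs (the approximation is our assumption, not something the table states):

```python
def model_flops(n_params: float, n_tokens: float) -> float:
    """Training FLOPs under the standard 6*N*D approximation
    (~6 FLOPs per parameter per token, forward + backward)."""
    return 6.0 * n_params * n_tokens

def hardware_flops(model: float, mfu: float) -> float:
    """Total accelerator FLOPs implied by a model-FLOP utilization:
    the table's 'HW' figure is model FLOPs divided by MFU."""
    return model / mfu

# Isoflop sweeps fix the compute budget C and vary model size N,
# which pins the token budget at D = C / (6 * N).
C, N = 3.0e20, 2.5e9           # the 2.5B point of the 3e20 sweep
D = C / (6.0 * N)              # => 2e10, i.e. ~20B tokens
assert abs(model_flops(N, D) - C) < 1e6

# Cross-check a table row: 3.00e20 model FLOPs at 22% MFU implies
# ~1.36e21 hardware FLOPs, matching the v4 isoflop rows below.
print(f"{hardware_flops(3.00e20, 0.22):.3g}")  # 1.36e+21
```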
| Run | Owner | Hardware | FLOP Budget | Wall Time | Loss | Evals | Links |
|---|---|---|---|---|---|---|---|
| (done) exp2262pt2f_sft_qwen2pt5_ot4_30k_math_qwen3_235b_a22b_32768token-ad1022 | Moo Jin Kim | TPU v6 lite (128 chips) | 6.45e20 model / 2.61e21 HW (25% MFU) | 23.3h | BPB: 0.500 | W&B | |
| (done) exp2262pt2g_sft_qwen2pt5_ot4_30k_math_n1_rejsamp_qwen3_32b_32768-689abd | Moo Jin Kim | TPU v6 lite (128 chips) | 6.45e20 model / 2.61e21 HW (25% MFU) | 17.6h | BPB: 0.279 | W&B | |
| (done) exp2262pt3d_qwen3_1pt7b_base_ot4_240k_math_qwen3_32b_32768tokens-dec321 | Moo Jin Kim | TPU v5 (16 chips) | 1.30e21 model / 2.60e21 HW (50% MFU) | 2.8d | BPB: 0.129 | W&B | |
| (done) exp2262pt3g_qwen3_1pt7b_base_ot4_30k_math_n8_rejsamp_soft_qwen3_-63d27f | Moo Jin Kim | TPU v5 (32 chips) | 1.01e21 model / 2.19e21 HW (46% MFU) | 2.0d | BPB: 0.062 | W&B | |
| (done) exp2262pt3c_qwen3_1pt7b_base_ot4_240k_math_qwen3_4b_32768tokens-557d96 | Moo Jin Kim | TPU v5 lite (128 chips) | 1.30e21 model / 2.08e21 HW (62% MFU) | 2.3d | BPB: 0.056 | W&B | |
| isoflop-3e+20-d4096-L40-B16-adamh_scaling_v6 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.36e21 HW (22% MFU) | 2.4d | BPB: 0.955 | W&B | |
| isoflop-3e+20-d4096-L40-B16-adamh_scaling_v3 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.36e21 HW (22% MFU) | 2.3d | BPB: 0.955 | W&B | |
| (done) exp2262pt2g_3_llama3pt1_ot4_30k_math_n1_rejsamp_qwen3_32b_32768t-662abd | Moo Jin Kim | TPU v5 lite (256 chips) | 4.76e20 model / 1.30e21 HW (36% MFU) | 8.9h | BPB: 0.173 | W&B | |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v5 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.005 | W&B | |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v8 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.002 | W&B | |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v3 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.013 | W&B | |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v6 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.004 | W&B | |
| isoflop-3e+20-d1024-L11-B512-adamh_scaling_v7 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.23e21 HW (24% MFU) | 1.7d | BPB: 0.947 | W&B | |
| isoflop-3e+20-d1024-L11-B512-adamh_scaling_v8 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.23e21 HW (24% MFU) | 1.7d | BPB: 0.946 | W&B | |
| isoflop-3e+20-d1024-L11-B512-adamh_scaling_v6 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.23e21 HW (24% MFU) | 1.7d | BPB: 0.947 | W&B | |
| Dense isoflop v8 sweep (3e20 budget, 273M-4.3B, 7 sizes) | @Helw150 | v4-32 (16 chips) | (23-44% MFU) | 30-48h per run | loss=2.576-2.937, 11-223B tokens per run | W&B W&B W&B | |
| Dense isoflop v8 sweep (2e20 budget, 273M-2.5B, 6 sizes) | @Helw150 | v4-16 (8 chips) | (30-42% MFU) | 25-39h per run | loss=2.605-2.911, 11-134B tokens per run | W&B | |
| Grug MoE flopmatch daily (286M MoE, replicated weights) | @dlwh | v5p-8 (4 chips) | (21.6% MFU) | 12.5h | loss=3.122, 11.3B tokens | #2710 W&B W&B | |
| Grug MoE v5p-16 trial (286M, 2-host), crashed | @dlwh | v5p-16 (8 chips) | (21.5% MFU) | 2.9h | loss=N/A, 4.2B tokens (crashed) | #2710 W&B | |
| 300M speedrun (stdattn, 4096 ctx) | @ClassicLarry | v5p-16 (8 chips) | (20.9% MFU) | 2.2h | loss=2.926, 6.0B tokens | #2184 W&B | |
| CoreWeave GPU canary ferry (H100x8), crashed | @rjpower | H100x8 (8 GPUs) | (3.7% MFU) | 0.3h | 20M tokens (crashed early) | #3022 W&B | |
| AdamH hyperparameter mega-sweep v3 (loop 3-9, 157M) | @Helw150 | v5p-8 (4 chips) | (23% MFU) | 5h per run | loss=3.48-3.75, 5B tokens each, 50+ runs | W&B | |
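The Grug MoE flopmatch runs above keep the expert weights replicated on every device and shard only the batch, i.e., pure data parallelism. A minimal JAX sketch of that layout with top-1 routing, assuming nothing about Grug's actual API; all names are illustrative:

```python
import functools
import jax
import jax.numpy as jnp

def init_params(key, d_model=64, n_experts=8, d_ff=128):
    k1, k2, k3 = jax.random.split(key, 3)
    scale = 0.02
    return {
        "router": scale * jax.random.normal(k1, (d_model, n_experts)),
        "w_in": scale * jax.random.normal(k2, (n_experts, d_model, d_ff)),
        "w_out": scale * jax.random.normal(k3, (n_experts, d_ff, d_model)),
    }

def moe_layer(params, x):
    # x: [tokens, d_model]; top-1 routing for simplicity.
    logits = x @ params["router"]                      # [tokens, n_experts]
    expert = jnp.argmax(logits, axis=-1)               # [tokens]
    gate = jnp.take_along_axis(
        jax.nn.softmax(logits, axis=-1), expert[:, None], axis=-1
    )                                                  # [tokens, 1]
    # Gather each token's expert weights (fine at toy scale).
    w_in = params["w_in"][expert]                      # [tokens, d_model, d_ff]
    w_out = params["w_out"][expert]                    # [tokens, d_ff, d_model]
    h = jax.nn.gelu(jnp.einsum("td,tdf->tf", x, w_in))
    return gate * jnp.einsum("tf,tfd->td", h, w_out)

def loss_fn(params, x):
    return jnp.mean((moe_layer(params, x) - x) ** 2)   # toy objective

@functools.partial(jax.pmap, axis_name="dp")
def train_step(params, x):
    loss, grads = jax.value_and_grad(loss_fn)(params, x)
    # All-reduce the gradients so every replica applies the identical
    # update and the replicated weight copies stay in sync.
    grads = jax.lax.pmean(grads, axis_name="dp")
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, jax.lax.pmean(loss, axis_name="dp")

# Replicate the full weight tree onto every local device; only the
# leading batch dimension is sharded across the data-parallel axis.
devices = jax.local_devices()
params = jax.device_put_replicated(init_params(jax.random.PRNGKey(0)), devices)
batch = jax.random.normal(jax.random.PRNGKey(1), (len(devices), 16, 64))
params, loss = train_step(params, batch)
```

With this layout each device must hold the entire expert stack, so memory, not compute, bounds the model size; the `pmean` all-reduce is what keeps the replicated copies bit-identical across steps.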