Week of March 23rd summary for marin-community/marin

The 1e22 MoE run launched on v4-512 while capacity-factor and LR-decay ablations refined the recipe, and the Delphi 1e23 dense scaling ladder continued on v4-1024 at 30% MFU. Iris performance work cut controller lock hold time from 80ms to under 5ms and delivered 4x faster heartbeat batching. Agentic SFT v2 runs fixed critical gradient clipping and think-token issues, matching released OT-Agent 32K benchmarks at 13% SWE-bench.

#3192 Synthetic Data & Post-training

0/4 sub-issues closed

The agentic SFT reproduction made significant progress this week. @AlienKevin identified four root causes in the v1 32K run #3896: catastrophic overfitting (loss dropping to 0.00003) caused by max_grad_norm=1.0 instead of OT-Agent's 1e-4, and generating <|start_think|> instead of native <think> tokens. The v2 32K SFT run with these fixes reached 13% SWE-bench (matching released 14%) and 7.9% TB2 (matching released ~8.1%), though TB-Lite averaged 12% vs released 18%. For 131K context #3897, the v2 run on v5p-256 with max_grad_norm=1e-4 regressed to 15% SWE-bench (from v1's 30%) because 1e-4 clipping with only 248 steps at batch=128 was too aggressive — gradients were clipped by ~1300x every step. A v2a run with sqrt-scaled hyperparams (LR × sqrt(8), grad_norm × 8) showed much healthier loss curves before preemption. @AlienKevin also successfully reproduced NemotronTerminal #3490, reaching 15.9% on TB2 (vs released 13.0%) and 29.0% TB-Lite (vs released 23.0%). @eramis73 merged DspyEvaluator improvements with ToonAdapter support #4213, and @taivu1998 drafted a long-context evaluation lane for the exp2062 long-context plan #4249.

0 PRs this week, and 0 new issues (4 total)

Sort:

Issues
#2956 [Agentic SFT] SFT Qwen3-8B on 5K SWE-smith trajectories and show improvement on SWE-bench
#2905 [Agentic SFT] Generate 30K Coding Trajectories across 6 Languages
#3093 [Agentic SFT] Tracking SFT datasets
#2262 Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B

13 autocategorized

#4249 [Eval] Add long-context evaluation lane for exp2062 +890 −2 @taivu1998
#3948 docs: collocated RL project doc and Tunix RL reference 💬1 +16948 −190 @ahmeda14960
#4213 Add DspyEvaluator improvements and ToonAdapter +514 −34 @eramis73
#2759 Add multihost inference support for TPU 💬1 +29653 −132 @ahmeda14960
#4153 [levanter] HF checkpoint save overwrites eos_token_id from source model with tokenizer default 💬2
#4159 [levanter] Add generation_config.json support for chat model checkpoints 💬1
#3826 [levanter] Multi-host TPU inference for GRPO sampling 💬1
#3490 [Agentic SFT] Reproduce NemotronTerminal 💬3
#3896 [Agentic SFT] Reproduce OT-Agent's best 32K context length SFT 💬9
#3897 [Agentic SFT] Reproduce OT-Agent's best 131K context length SFT 💬9
#2062 [Epic] Long Context Plan 💬2
#1747 RL: Investigate slow copies off of TPU in training worker (~1GB/s)
#3792 Ray Multi-Host vLLM on TPU: Serve 235B+ Models Across Multiple Hosts 💬1

#3100 Data Sources for Pre-training

Summary: We will need 20T of high-quality (including / in particular code) tokens for our large MoE runs in Q2/Q3; this is the work in March that we will do to enable that.

0/5 sub-issues closed

The synthetic reasoning bootstrap corpus #4148 expanded rapidly with 24 comments this week from @dlwh, adding deterministic generators for BFS shortest path, binary search, coin change DP, connected components, Dijkstra, and more algorithmic reasoning domains — all producing step-by-step traces without LLM labels. @ravwojdyla shipped major data infrastructure work: fuzzy dedup was updated to scale to Nemotron #3750 (+465/-514 lines), and the new datakit tool consolidated and cleaned up data downloads #4142 (+878/-1,331 lines). FlatMixture was added for virtual dataset concatenation with global shuffle without re-tokenizing #4133. @yonromai fixed a TensorStore handle leak that was accumulating ~14 MiB per shard during cache-copy #4198, and shared ts.Transaction for metadata consolidation cut serial copy time #4105. @ahmeda14960 added generation_config.json support for chat model checkpoints #4160 and shipped data browser QOL updates #3954.

0 PRs this week, 24 new comments, and 1 new issue (5 total)

Sort:

Issues
#3049 Test Luxical as a General Tool for Data Integration Pipelines
#3101 Ensure we have 20T deduped tokens of data
#3183 Software Heritage Foundation license
#3194 Gather code environments
#4148 🆕 Experiment: synthetic reasoning bootstrap corpus 💬24

27 autocategorized

#4219 Use hardcoded path for nemotron download to stabilize version hash 💬1 +7 −3 @rjpower
#4198 Fix TensorStore handle leak in cache-copy (~14 MiB/shard) +45 −28 @yonromai
#4163 fix(fray): eliminate ray.kill() and __ray_terminate__ from actor shutdown 💬1 +79 −154 @rjpower
#4162 [dedup] Replace DupCounters with zephyr.counters for live monitoring +138 −103 @ravwojdyla-agent
#4160 [levanter] Add generation_config.json support for chat model checkpoints +365 −4 @ahmeda14960
#4142 Bootstrap `datakit` - consolidate and cleanup downloads 💬1 +878 −1331 @ravwojdyla-agent
#4133 Add FlatMixture: virtual dataset concatenation with global shuffle 💬3 +154 −3 @claude
#4105 Use shared ts.Transaction for metadata consolidation 💬2 +157 −60 @yonromai
#3999 chore: update dupekit wheels +29 −10 @github-actions
#3955 chore: update dupekit wheels +1 −1 @github-actions
#3954 QOL updates for data browser +484 −87 @ahmeda14960
#3938 [dedup] Replace side-effect mutation with reduce in fuzzy dedup map +9 −11 @rjpower
#3750 Update fuzzy dedup to scale to nemotron +465 −514 @ravwojdyla
#4247 Rebase dna branch onto main +89848 −240257 @eric-czech
#4214 Skip tokenizer tests that require HuggingFace downloads in CI +22 −31 @rjpower
#4188 datakit: add normalize step (download → standard Parquet) +849 −1 @ravwojdyla-agent
#4184 Fix fasttext classifier for distributed/S3 environments +22 −28 @claude
#4171 [download] Replace HfFileSystem.find() with HfApi.list_repo_files() to fix silent truncation 💬2 +42 −5 @claude
#4077 Remove unused modules from download and transform 💬1 +0 −1122 @ravwojdyla
#4154 [levanter] Preserve source model token IDs in HF checkpoint saves 💬2 +23 −0 @ahmeda14960
#4100 Serial metadata copy in consolidate_shard_caches takes hours for large tokenization jobs 💬2
#4104 Experiment: Ising tokenizer rollouts and critical interpolation
#4132 🤖 FlatMixture: virtual dataset concatenation without re-tokenizing 💬2
#4170 🤖 HfFileSystem.find() silently truncates file lists on large repos, causing incomplete downloads
#4204 changes to get_nemotron_split_paths causes downstream steps to rerun
#2383 Initial VLM Training Experiment 💬3
#2345 Swarm Runs Proposal 💬1

#3096 Pre-training: MoE Scaling Laws

1/6 sub-issues closed

@ClassicLarry launched the 1e22 MoE run #3800 — a 34.6B-total / 5.1B-active model on v4-512 with capacity factor 1.0, running at 488k tok/s and 23% MFU (W&B). The capacity factor was set to 1.0 based on ablation results #4016 showing cap=1.0 is 8.3% faster than 1.25 with negligible loss impact at 1e20 scale. The v7 isoflop sweep #4225 mapped LR schedule and decay interactions with AdamH across dimensions, finding that removing linear decay at small step counts (d2048 @ 1e18) improved BPB by 0.03 — suggesting decay fraction needs to scale with training length. @Helw150 resolved the QB overhead problem #3972: the async CPU overlap approach was a dead end (a misleading benchmark), but sharded microbatch QB eliminated the 1.2 MFU-point overhead entirely. He also updated the MoE baseline #4084 and validated moe-sharded-qb-gn-xsa as the best iter-04 configuration with a Vizier hyperparameter sweep running on us-east5-a #2167. On the Delphi dense scaling ladder #1337, the 1e23 run on v4-1024 continued at 30% MFU with Paloma macro BPB 0.79, and the 1e22 seed42 run finished at 47% MFU (macro BPB 0.84). @dlwh landed XLA-first Mamba-3 SISO and MIMO TPU kernels #3961 (+5,352 lines) with a sharding-safe ranked public API #4149, and @msclar merged the AdaMuon optimizer implementation #3300.

0 PRs this week, 13 new comments, and 1 new issue (6 total)

Sort:

Issues
#2371 Grug MoE
#2167 Add a version of isoflop_sweep for MoE's 💬1
#3182 Determine optimal scaling parameters for MoE
#2828 Port MoE training to GPU: kernel experiments and performance validation
#3800 Test MoE Arch at 1e21 and 1e22 Flop Scales 💬12
#4012 🆕 [moe] Experiment: compute-optimal 32B-A4B (~1e22 FLOPs) on TPU

104 autocategorized

#4149 [mamba3] Make ranked public API sharding-safe +214 −196 @dlwh
#4130 [levanter] Share Pallas autotune helpers and restore compile offload 💬3 +437 −183 @dlwh
#4118 Bring back `profiler` to `GrugMoeLaunchConfig` 💬2 +2 −1 @ravwojdyla
#4114 Use `run_grug_moe_trial` instead of `run_grug_moe` +2 −2 @ravwojdyla
#3961 [Levanter] Add XLA-first Mamba-3 SISO and MIMO TPU kernels 💬1 +5352 −0 @dlwh
#3300 AdaMuon implementation +525 −0 @msclar
#4244 [levanter] Add attention instability metrics to pretraining +415 −0 @claude
#4243 [levanter] Add NaN/Inf/spike counters to trainer metrics +230 −7 @claude
#4242 [levanter] Add gradient zero-count and zero-fraction metrics +113 −5 @claude
#4241 [levanter] Add rank-level straggler reporting callback +326 −0 @claude
#4240 [levanter] Add rank-level straggler reporting callback +265 −0 @claude
#4234 [Grug] Upcast MoE router logits to fp32 💬1 +2 −1 @dlwh
#4233 [grug] Add activation logging on heavy watch steps 💬1 +460 −34 @claude
#4092 [grug] Add hybrid Mamba-3 sweep and split-LR ladder 💬1 +15013 −46 @dlwh
#4075 [moe] Add great 10T expert count sweep (E in {128,256,512}) +234 −0 @claude
#4074 [moe] Add great 10T K sweep in {1,2,4,8} +231 −0 @claude
#4073 [moe] Add multi-budget SWA ablation for great 10T gate +246 −1 @claude
#4072 [moe] Add great 10T gated norm ablation experiment +510 −2 @claude
#4071 [moe] Add great 10T inv-sqrt LR schedule ablation +289 −0 @claude
#4069 [moe] Add multi-scale AdamH vs Adam isoflop experiment +664 −0 @claude
#4068 [moe] Add great 10T first-k-dense isoflop ablation sweep +264 −0 @claude
#4067 [moe] Add K sweep experiment (1,2,4,8) for good 10T gate +273 −0 @claude
#4066 [moe] Add attention gate support to grug MoE model +177 −0 @claude
#4065 [moe] Add Great 10T router z-loss ablation experiment +282 −0 @claude
#4064 [moe] Add headwise attention gate and ablation launch script +190 −2 @claude
#4063 [moe] Add sigmoid gating option for great-gate 10T comparison +166 −1 @claude
#4062 [moe] Add multi-budget shared expert ablation for great 10T gate +247 −0 @claude
#4061 [moe] Add residual bottleneck variant for 10T gate experiments +1278 −0 @claude
#4060 [moe] Add good-10T E=128 isoflop experiment script +200 −0 @claude
#4059 [moe] Add AdamH vs Adam comparison experiment at 1e19 FLOPs +325 −0 @claude
#4058 [moe] Add num_dense_layers to Grug MoE for first-k-dense ablation +151 −4 @claude
#4057 [moe] Add gated norm support and ablation launch script +225 −2 @claude
#4056 [moe] Make capacity_factor configurable and add sweep script +144 −1 @claude
#4055 [moe] Add sliding_window to GrugModelConfig and SWA ablation experiment +186 −1 @claude
#4054 [moe] Add 3e18 FLOP hparam sweep for MoE recipe search +297 −0 @claude
#4053 [moe] Add Muon throughput experiment for grug MoE +88 −0 @claude
#4052 [moe] Wire capacity overflow reporting into Grug MoE training metrics +39 −1 @claude
#4051 [moe] Add shared-expert ablation experiment at ~1e19 FLOPs +194 −0 @claude
#4050 [moe] Add inv-sqrt LR schedule experiment for Good 10T gate +68 −0 @claude
#4049 [moe] Add expert count sweep for E in {128,256,512} +129 −0 @claude
#3978 Fix fused CE autotune under shard_map +227 −16 @dlwh
#2185 speedrun submission: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale 💬1 +221 −0 @redagavin
#4119 Trigger levanter and marin tests for grug changes 💬1 +1 −0 @ravwojdyla
#4084 Updated MoE Baseline 💬1 +441 −139 @Helw150
#3661 [levanter] Add tree memory-kind helper and HBM guide usage +47 −2 @dlwh
#4083 [optim] Add PolynomialLrSchedule and InvSqrtDecayLrSchedule 💬2 +179 −0 @claude
#3290 Default trainer meshes to explicit axis types with explicit-sharding fixes 💬6 +110 −38 @dlwh
#3314 main: default train/eval/lora/viz to array-first Grug datasets 💬2 +143 −84 @dlwh
#4110 Experiment: MuonHR vs MuonH at 3e18 FLOPs on Nemotron
#4122 [grug] Pallas TPU load-balancing loss kernel crashes on v5p-8: Mosaic auto-partition not supported
#4129 [levanter] Extract shared Pallas autotune sharding helpers
#4131 lint is failing in main 💬1
#4225 Experiment: Map out LR schedule and tuned value interactions with AdamH 💬13
#4232 [grug] Add activation logging on heavy watch steps
#4235 [levanter] Add rank-level straggler reporting to training
#4236 [levanter] Add rank-level straggler reporting to training
#4237 [levanter] Add gradient zero-count and zero-fraction metrics
#4238 [levanter] Add NaN/Inf/spike counters to trainer metrics
#4239 [levanter] Add attention-instability metrics to pretraining
#4251 Experiment: Bolinas DNA transferred scaling sweep
#3930 Experiment: direct compare es3r2 vs Qwen3 30B-A3B / Qwen3.5 35B-A3B full-attention baselines 💬1
#3964 Experiment: AdamHR vs AdamH at 3e18 FLOPs on Nemotron 💬3
#3972 Make QB Fast Enough to Use 💬2
#3993 Compare finished 1e20 d1536 runs in dial_moe 💬1
#4013 [moe] Good 10T gate for #3469
#4014 [moe] Great 10T gate for #3469
#4015 [moe] Good 10T: E=128
#4016 [moe] Good 10T: measure capacity overflow 💬7
#4017 [moe] Good 10T: sweep capacity factor
#4018 [moe] Good 10T: MoE hparams at 3e18 FLOPs
#4019 [moe] Good 10T: scaling hparams for MoE
#4020 [moe] Good 10T: ablate attention gate
#4021 [moe] Good 10T: ablate shared expert
#4022 [moe] Good 10T: ablate first-k dense
#4023 [moe] Good 10T: compare QB vs sign+aux
#4024 [moe] Good 10T: compare AdamH vs Adam
#4025 [moe] Good 10T: ablate exclusive self-attention
#4026 [moe] Good 10T: ablate gated norms 💬1
#4027 [moe] Good 10T: ablate sliding-window attention
#4028 [moe] Good 10T: inv-sqrt LR schedule 💬2
#4029 [moe] Good 10T: sweep K in {1,2,4,8}
#4030 [moe] Good 10T: sweep E in {128,256,512}
#4031 [moe] Great 10T: latent MoE perf
#4032 [moe] Great 10T: latent MoE loss
#4033 [moe] Great 10T: Muon perf on MoE
#4034 [moe] Great 10T: Muon loss on MoE
#4035 [moe] Great 10T: residual bottleneck experiments
#4036 [moe] Great 10T: router z-loss
#4037 [moe] Great 10T: sigmoid vs softmax gating
#4038 [moe] Great 10T: ablate attention gate
#4039 [moe] Great 10T: ablate shared expert
#4040 [moe] Great 10T: ablate first-k dense
#4041 [moe] Great 10T: compare QB vs sign+aux
#4042 [moe] Great 10T: compare AdamH vs Adam
#4043 [moe] Great 10T: ablate exclusive self-attention
#4044 [moe] Great 10T: ablate gated norms 💬1
#4045 [moe] Great 10T: ablate sliding-window attention 💬2
#4046 [moe] Great 10T: inv-sqrt LR schedule 💬4
#4047 [moe] Great 10T: sweep K in {1,2,4,8} 💬2
#4048 [moe] Great 10T: sweep E in {128,256,512} 💬2
#1337 Delphi: Create a modern scaling suite ("modernized Pythia") 💬2
#3868 Experiment: ship Mamba-3 XLA SISO/MIMO TPU kernel 💬4
#4147 Summarize current scaling heuristics for run planning
#3660 Add tree memory-kind helper and document HBM offload usage

#2836 Infrastructure: MoE Training Support

Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.

35/49 sub-issues closed

@rjpower drove a major Iris controller performance push: lock hold time in drain_dispatch_all dropped from 80ms to under 5ms #4222, a two-pass heartbeat batch delivered 4x faster provider loops #4210, lightweight job state polling reduced controller load #4209, the ORM query builder was replaced with raw SQL #4181, and log fetching was unified under FetchLogs with LIKE patterns #4202 (-598 lines). Checkpoint management got zstd compression and old-checkpoint pruning #4143. @ravwojdyla shipped the actor proxy service for external access to cluster actors #4126, refactored Zephyr chunking for improved shuffle scalability #3839 (+1,357/-539 lines), and allowed specifying coordinator resources #4095. User-defined counters (MapReduce-style per-job stats) were added #4085 with records_in/records_out counters for readers and writers #4189 and per-worker counter queries #4164. @yonromai fixed a TensorStore handle leak in cache-copy (~14 MiB/shard) #4198, fixed stale coordinators killing new workers on retry #4199, and added Slack alerts and Claude triage to the canary ferry #4158 #4177. @dlwh landed the region-aware executor on Iris #3824 (+1,075 lines). The integration test suite was redesigned #4009 and optional auth mode landed for gradual adoption #3937.

0 PRs this week, and 0 new issues (49 total)

Sort:

Issues
#2822 Iris: Implement CoreWeave platform
#2823 Iris: Improve worker/process status visibility and post-mortem log access
#2824 Iris: Multi-region support with per-scaling-group environment configuration
#2825 Iris: Quota-aware scheduling and cross-zone fallback
#2826 Iris: Richer profiling and worker-level observability
#3699 [iris] Add task-level profiling to CLI
#2827 Iris: Proactive unhealthy/degraded node identification
#2829 Data processing pipeline: validate end-to-end tokenization for all target datasets
#3004 Corrupted Levanter cache
#2927 Zephyr coordinator death causes unrecoverable pipeline failure
#2943 Zephyr coordinator stuck with workers
#2982 Replace Zephyr shared data broadcast with disk_cache-based approach
#3066 Handle GCP auth/credentials error
#2830 Training monitoring: alerting on stalled/diverging loss and health dashboard
#2929 Canary regression gate: alert on metric thresholds after canary ferry
#2930 Live monitoring and stall detection for training jobs
#2831 Validate fault tolerance: checkpoint resume and preemption recovery on CoreWeave
#2832 Agent can run a small model E2E without human intervention
#2833 Establish daily canary training runs
#2834 Executor v2: split out caching module and simplify step API
#2835 Standardize on Vortex format with typed dataset schemas
#2629 Iris: bootstrap script templates are too fragile
#2377 Jobs are not tolerant to the node where `self._run_steps` is running being preempted.
#2651 Iris: Resolver/Actor system should always auto-resolve on transient errors
#2809 Iris: Survey threading and timeouts for the controller
#2810 Iris: benchmark test for controller performance
#2424 Iris - initial resource observability
#2710 Experiment: MoE EP benchmark milestone
#2418 Add AdamC, fp32 router compute, router_topk_then_softmax, qk-norm option for MoE stability sweeps
#2414 Experiment: OLMoE size sweep with MoE stability measures
#2804 fsspec should reject cross region reads (or those over X MB)
#2744 Iris: bootstrap should probably live on the scaling group
#2745 Iris: Add attributes to ScaleGroupConfig for scheduling-level metadata
#2642 Iris: preemptible shouuld be a taint, not an attribute
#2735 Iris: Zone-aware scheduler
#2762 Iris: fair scheduler
#2625 Iris: Users and Priorities
#2749 iris: Migrate GCP platform from gcloud CLI to Python API
#2772 Iris: add proxy for worker view
#2803 iris-controller: add built-in py-spy profiling endpoint to dashboard
#2754 Embed speedscope in Iris dashboard for one-click profile viewing
#2413 SwiGLU vs Bilinear MLP layers for MoE Experts
#2708 Zephyr: auto-scale worker groups up to match demand
#2535 Iris: Integrate chronos virtual time into chaos test suite
#2849 Iris: add smoke test into CI
#2926 Iris: Add Levanter health check in Iris
#3035 StepRunner shouldn't launch tasks with Fray by default
#3098 Evaluate (first few steps) x00B MoE on TPU and GPU
#3164 Iris: allow controller restarts without resetting tasks

180 autocategorized

#4229 [iris] Add job state filter to dashboard JobsTab +28 −5 @claude
#4228 [iris] Add sorting to child jobs table in dashboard +59 −6 @claude
#4222 perf(iris): reduce drain_dispatch_all lock hold time from 80ms to <5ms +193 −65 @rjpower
#4217 [iris] Fix bundle exclude patterns for docs and test snapshots +5 −5 @rjpower
#4216 fix(iris): remove --nonblocking from py-spy to fix native CPU profiles +1 −2 @rjpower
#4215 [iris] Skip per-task detail in GetJobStatus by default +32 −16 @rjpower
#4212 [zephyr] Fix counter loss and double-counting via generation tags +112 −64 @ravwojdyla-agent
#4210 [iris] Two-pass heartbeat batch: 4x faster provider loop phase 3 +267 −57 @rjpower
#4209 Add lightweight job state polling to reduce controller load +94 −12 @rjpower
#4207 [iris] Build native extensions for Rust crates in rust-dev mode +12 −0 @rjpower
#4206 fix(iris): strip trailing slash from MirrorFileSystem prefix +1 −1 @yonromai
#4205 fix(zephyr): don't SHUTDOWN idle workers while shards are in-flight 💬3 +36 −2 @yonromai
#4203 [iris] Fix flaky test_smoke_gcp_config_boots_locally timeout 💬1 +35 −41 @rjpower
#4202 [iris] Unify log fetching: FetchLogs with LIKE patterns, deprecate GetTaskLogs 💬1 +416 −598 @rjpower
#4199 [zephyr/iris] Fix stale coordinator killing new workers on retry 💬2 +48 −38 @rjpower
#4197 [iris] Remove obsolete CoreWeave CI workflows and configs +62 −1267 @rjpower
#4195 fix(canary): lower GPU canary batch size to 16 to fix OOM +1 −1 @yonromai
#4194 [iris] Capture periodic memory profiles alongside CPU profiles 💬2 +127 −32 @claude
#4192 Add --frozen flag to uv sync and optimize PYTHONDONTWRITEBYTECODE +9 −3 @rjpower
#4191 [nightshift] Deduplicate vLLM server polling loop and env defaults +122 −107 @github-actions
#4189 [zephyr] Add records_in / records_out counters to readers and writers 💬1 +22 −1 @ravwojdyla-agent
#4186 Enable native frame collection in CPU profiling by default 💬1 +4 −1 @rjpower
#4185 Switch SQLite backups from WAL to DELETE journal mode 💬2 +17 −2 @rjpower
#4182 Use iris bundle for Ray working_dir to include pb2 files 💬3 +213 −145 @rjpower
#4181 [iris] Remove ORM query builder, replace with raw SQL 💬1 +430 −987 @rjpower
#4180 [iris] Fix VM bootstrap monitor killing healthy workers +242 −23 @rjpower
#4179 [iris] Fix scheduling loop: filter reservation jobs at SQL level +81 −7 @rjpower
#4177 Add Claude triage step to canary ferry workflows +146 −12 @yonromai
#4176 Tolerate controller unavailability for up to 1h in job monitoring 💬1 +276 −32 @rjpower
#4175 Halve Ray cluster minimums, boost Iris capacity +28 −28 @rjpower
#4174 Add CoreWeave CI workflow for Iris PRs 💬4 +645 −155 @rjpower
#4173 Fix canary validation artifact replication +1 −0 @yonromai
#4172 [nightshift] Extract duplicated dedup aggregator and finalization +100 −121 @github-actions
#4165 docs: add actor RPC section to debug-zephyr-job skill 💬1 +23 −0 @ravwojdyla-agent
#4164 [zephyr] Per-worker counter queries, accumulate across stages +14 −4 @ravwojdyla-agent
#4161 feat(iris): route actor RPC calls through proxy and decode responses 💬1 +72 −1 @ravwojdyla-agent
#4158 Add Slack alerts for canary workflow failures +18 −0 @yonromai
#4157 fix(iris): handle scheme in actor proxy upstream URL +45 −27 @ravwojdyla-agent
#4155 fix(iris): fixup Starlette `on_shutdown` +26 −5 @ravwojdyla
#4152 fix(iris): add Docker Buildx setup to GPU canary workflow 💬1 +3 −0 @yonromai
#4150 [infra] Revert pull_request_target to pull_request and restore id-token: write +9 −5 @rjpower
#4145 iris: add --dry-run mode to controller startup 💬2 +542 −157 @rjpower
#4144 fix(iris): add Docker Buildx setup to smoke test workflows +6 −0 @rjpower
#4143 [iris] Compress checkpoints with zstd, prune old ones, improve restart UX +250 −101 @rjpower
#4140 [zephyr] Fix _check_worker_group false abort after completed stage 💬1 +101 −0 @rjpower
#4139 [iris] Use buildx --push for atomic build+push+cache-write +35 −19 @rjpower
#4138 [infra] Remove OIDC auth from Claude workflows, use OAuth token +2 −6 @rjpower
#4137 [iris] Use npx for Playwright system deps on CI cache hit +2 −2 @rjpower
#4136 fix(iris): hash workdir tokens to prevent ENAMETOOLONG for nested jobs +23 −3 @rjpower
#4135 Fix ResourceConfig serialization and remove asia-northeast1 buckets +4 −4 @rjpower
#4128 Migrate dev TPU guides into skills +277 −292 @dlwh
#4126 feat(iris): add actor proxy service for external access to cluster actors +359 −8 @ravwojdyla
#4125 [Iris] Remove caps from marin-dev cluster config +10 −10 @ravwojdyla-agent
#4124 Update iris skill, ref config +14 −1 @ravwojdyla
#4123 [canary] Forward R2/S3 credentials to GPU canary task pod +3 −0 @rjpower
#4116 Parallelize data_size computation in consolidate_shard_caches +58 −5 @yonromai
#4112 [ferry] Use mirror:// for canary validation, drop hardcoded MARIN_PREFIX +9 −8 @rjpower
#4108 [Iris] Fix stale detail state and nested child jobs in dashboard 💬1 +230 −119 @ahmeda14960
#4106 [fray] Use graceful actor termination to avoid Ray task_manager assertion 💬1 +30 −4 @rjpower
#4101 [canary] add logging for task pods, `kind` test 💬2 +441 −0 @rjpower
#4099 fix: worker processes exit cleanly after SHUTDOWN +136 −5 @yonromai
#4097 `StepSpec` -> `ExecutorStep` and use `StepSpec` in integration test 💬2 +141 −29 @ravwojdyla
#4095 zephyr: allow to specify coordinator resources (#4005) +5 −1 @ravwojdyla
#4094 [tests] Fix tests that hit HuggingFace during CI +55 −36 @rjpower
#4091 [iris] Increase GCP smoke test shell poll timeout +1 −1 @rjpower
#4089 [iris] Fix dev cluster restart ghcr.io auth failure +16 −0 @rjpower
#4087 fix: address iris integration test review feedback 💬3 +129 −7 @claude
#4086 [nightshift] Remove dead code and deduplicate helpers in decon.py +9 −59 @github-actions
#4085 [iris] Add user-defined counters (MapReduce-style per-job stats) 💬12 +167 −6 @ravwojdyla-agent
#4081 Fix worker pod over-provisioning and sequential cleanup 💬2 +156 −13 @yonromai
#4078 Fix hatch build proto staleness detection +31 −4 @rjpower
#4076 [zephyr] Fix tests that relied on closure mutation for call counting +39 −53 @yonromai
#4011 [iris] Label always-on CoreWeave nodes as system-critical +3 −0 @rjpower
#4010 [CI] Fix required checks hanging + rename to project-scoped names +38 −80 @rjpower
#4009 [iris] Redesign integration test suite 💬1 +1069 −2 @claude
#4008 [zephyr] Fix coordinator loop crash causing silent pipeline hangs +41 −9 @yonromai
#4005 zephyr: allow to specify coordinator resources +5 −1 @ravwojdyla
#4002 [iris] CoreWeave dev cluster: always-on CPU, scale-to-zero GPU, nightly restart 💬1 +136 −353 @rjpower
#3997 fix: regenerate uv.lock in dupekit wheels workflow +6 −1 @rjpower
#3991 [iris] Make controller restart idempotent +23 −4 @rjpower
#3990 [executor] Add mirrored() for region-agnostic data references 💬1 +326 −25 @rjpower
#3986 [CI] Fix required checks hanging due to workflow path filters 💬1 +229 −47 @rjpower
#3984 [iris] Add dev cluster config with daily automated restart 💬1 +210 −0 @rjpower
#3982 [iris] Add events permission to CoreWeave RBAC ClusterRole +5 −0 @rjpower
#3980 [nightshift] Deduplicate distributed lock + heartbeat logic +49 −81 @github-actions
#3966 [nightshift] fix documentation drift in tutorials +12 −9 @github-actions
#3946 Babysit/debug Iris job skills 💬1 +290 −188 @ravwojdyla
#3937 [iris] Add optional auth mode for gradual adoption +269 −35 @claude
#3910 [zephyr/iris] Fix ConfigMap size limit for large pipelines +66 −6 @yonromai
#3900 iris: eliminate Platform abstraction, reorganize into Service + Provider layers +14497 −15020 @rjpower
#3849 [nightshift] Remove duplicated vLLM model resolution from EvalchemyEvaluator +3 −41 @github-actions
#3839 zephyr: refactor chunking, improve shuffle scalability 💬1 +1357 −539 @ravwojdyla
#3675 Fix Claude PR review for fork PR OIDC +3 −3 @dlwh
#4252 [RL] Add GPU smoke probes and resource-aware config +1361 −203 @taivu1998
#4246 [iris] Offload large workdir files to blob store in launch_job +224 −1 @rjpower
#4231 Remove deprecated GetTaskLogs RPC +7 −99 @claude
#4224 [executor] Add .mirrored() to ExecutorStep and InputName 💬6 +79 −36 @claude
#4218 Fix raw dataset tokenize input paths +12 −6 @yonromai
#4178 [zephyr] Arrow-native scatter/reduce: 1.1x reduce speedup 💬9 +816 −405 @rjpower
#4169 [iris] Fix get_job_info crash on constraint JSON with unknown fields 💬1 +27 −1 @claude
#4168 [iris] Ignore unknown inherited constraint fields +5103 −16 @dlwh
#4121 [iris] Auto-quarantine workers with consecutive SIGSEGV task crashes +195 −6 @claude
#4096 Add user budgets, priority bands and preemption to Iris 💬1 +3706 −29 @rjpower
#4090 [classification] Migrate inference pipeline from Ray to Fray v2 +210 −286 @claude
#4080 [iris] Add GCP impersonation based on logged-in user identity 💬3 +237 −0 @rjpower
#4070 Add Cloudflare Tunnel integration for public controller access +803 −15 @rjpower
#3983 Skip region validation for cached executor steps +150 −0 @claude
#3824 Make executor region-aware on Iris from GCS dependencies 💬3 +1075 −34 @dlwh
#3691 docs: refresh Ray cluster setup tutorial 💬1 +15 −17 @dlwh-golem
#4208 Add make rust-package: unified dupekit wheel build, publish, and pin 💬2 +406 −99 @rjpower
#4221 [iris] Reduce controller write lock contention in drain_dispatch_all and prune_old_data 💬1 +154 −26 @claude
#4245 [iris] Add blob storage to BundleStore for externalized workdir files 💬1 +93 −39 @rjpower
#4187 Log final counters on coordinator shutdown +5 −0 @rjpower
#4102 Handle empty string 'WANDB_API_KEY' in init_wandb +2 −1 @ravwojdyla
#3988 Soften issue-linking rule in pull-request skill +6 −4 @claude
#4230 Defer workspace bundle creation until job submission +11 −11 @rjpower
#4151 Prefer origin pushes for Marin PR branches +137 −14 @dlwh-golem
#4098 Worker processes never exit after SHUTDOWN — _host_actor() blocks forever on anonymous Event 💬2
#4103 [iris] Truncate workdir tokens for deeply nested job names 💬1
#4107 [Iris] Fix stale detail state and nested child jobs in dashboard 💬1
#4109 [iris] Actor proxy service for external access to cluster actors 💬2
#4113 [executor] Retokenize data when it doesn't exist in the local region
#4117 [iris] Flaky test_marin_pipeline_on_iris: ZephyrWorkerError in dedup_fuzzy_document
#4120 [iris] Auto-quarantine and delete TPU nodes that crash with SIGSEGV
#4127 Migrate dev TPU docs into skills
#3103 iris: worker kubectl saturation causes heartbeat timeouts and task pod disappearances
#4141 iris: add a --dry-run to controller startup
#3643 zephyr: batch instead of hard coded `chunk_size`
#4156 RayActorGroup.shutdown() should disable retries for __ray_terminate__ 💬1
#4167 [iris] get_job_info crashes on constraint JSON with mode field 💬1
#2639 Iris: Zephyr v2 context is unclear
#4183 Fasttext classifier code doesn't work in distributed/S3 environments
#3674 Claude PR Review fails on fork PRs: missing OIDC env in pull_request runs
#4190 [zephyr] Fix counter loss and double-counting between heartbeat and report_result
#4193 Iris: capture periodic memory profiles alongside CPU profiles
#4196 Memory leak in cache-copy: TensorStore handles accumulate ~14 MiB per shard
#3172 Iris: fix memray, needs LLDB in the task container
#3174 Iris: mark non-preemptible slices with a autoscaling taint
#4200 Zephyr coordinator stalls when last-shard worker dies or hangs
#4201 Remove deprecated GetTaskLogs RPC 💬1
#3185 iris: consolidate controller to single-thread event loop model
#3706 [iris] Separate auth and log tables into attached SQLite databases
#4220 perf(iris): Controller write lock contention causes 2s+ RegisterEndpoint latency at scale
#4223 Add .mirrored() to ExecutorStep and InputName for cross-region reads
#3201 iris: controller optimization for heartbeat loop
#4226 Subtasks in the Iris dashboard should be sorted by the main sort criteria
#4227 Filter and Reset Buttons Don't Work in the Iris Dashboard
#3266 smoke-test full mode: controller restart fails (VM not discoverable) + logging format bug
#2244 Consider replacing fsspec with obstore
#2257 Update docs for GPU Docker images
#3798 [zephyr] Coordinator OOM from quadratic scatter metadata growth
#3359 iris: track task/job state in DB for O(1) child status lookups in get_task_logs
#2364 Iris - Backup/Restore and post-morterm visibility
#3404 iris: missing TPUs should be marked as FAILED by autoscaler
#3926 iris: extend cloud smoke test github workflow to test against at CPU CW cluster
#3932 [iris] test_iris_run_cli_simple_job times out since #3899 preemptible heuristic
#3936 iris: add OptionalAuth mechanism to allow gradual adoption
#3947 Attempt to get collocated RL on single TPU worker 💬1
#3953 QOL Updates for Data Browser
#3968 iris: dev vs prod clusters
#3969 iris: create dev cluster config with daily automated restart 💬2
#2944 Iris: introduce provider abstraction
#3467 Iris: dashboard should fetch top-level jobs, zoom in as needed
#3981 [canary] TPU ferry fails: inherited Iris region pin conflicts with MARIN_PREFIX 💬2
#3985 Required checks hang when workflow path filters skip execution
#3989 Hardcoded gs:// paths in experiment configs pin execution to specific regions
#3992 [iris] Unify TaskProvider and K8sTaskProvider behind a single protocol
#3994 [zephyr] External sort shuffle 💬1
#3995 [iris] Add user-defined counters (MapReduce-style per-job stats) 💬4
#3996 [zephyr] Coordinator thread crashes silently, causing pipelines to hang at N-1/N 💬1
#3998 iris: decompose K8sControllerProvider into focused modules
#4001 [iris] CoreWeave dev cluster: always-on CPU, scale-to-zero GPU, nightly restart
#4003 [zephyr] On-demand workers with heterogeneous resource specs 💬2
#4004 [zephyr] Coordinator thread has no lifecycle logging — hangs are undiagnosable
#4007 iris: split smoke test into separate explicit commands
#3538 iris: merge LogStore tables into ControllerDB
#4079 Iris launches and cleans up thousands of worker pods after coordinator exits 💬3
#3058 Iris: add "thread dump" link for tasks
#4088 iris: migrate classification/inference to Fray & Iris
#4093 zephyr: workers restart race condition? 💬1

Run	User	Hardware(?)	Hours(?)	FLOP Budget(?)	BPB(?)
#1337 adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb	Will Held	TPU v4 (512 chips)	26.2d	9.69e22 model 3.23e23 HW (30%)	BPB: 0.725
#1337 adamh-scaling-ladder-nemotron-optimal-1e+22-v5-seed42-deeff4	Will Held	TPU v4 (256 chips)	3.4d	10.00e21 model 2.12e22 HW (47%)	BPB: 0.769
#3800 moe-v7-1e22-d3200	Larry Dial	TPU v4 (256 chips)	2.7d	3.37e21 model 1.46e22 HW (23%)	BPB: 0.834
#1337 adamh-scaling-ladder-nemotron-optimal-1e+22-v5-seed62746-10f597	Will Held	TPU v4 (256 chips)	1.7d	4.86e21 model 9.64e21 HW (50%)	BPB: 0.873
#3897 exp3897v2_sft_ota_131k_qwen3_8b_131072tokens_v5p256-de2c26	Kevin Li	TPU v5 (128 chips)	16.5h	1.16e21 model 3.16e21 HW (37%)	—
#4016 isoflop-moe-v7-1e+20-d1536	Larry Dial	TPU v4 (64 chips)	15.1h	1.28e20 model 7.18e20 HW (18%)	BPB: 0.893
#4016 isoflop-moe-v7-1e+20-d1536-cap1p0	Larry Dial	TPU v4 (64 chips)	14.4h	1.28e20 model 6.63e20 HW (19%)	BPB: 0.900
#3897 exp3897v2_sft_ota_131k_qwen3_8b_131072tokens_v5p32-913963	Kevin Li	TPU v5 (16 chips)	1.2d	2.40e20 model 6.56e20 HW (37%)	—
#2167 isoflop-moe-adamh-gatednorm-v5p64-r2-1e20-d1536-retry25	Kaiyue Wen	TPU v5 (32 chips)	11.7h	9.36e19 model 5.13e20 HW (18%)	BPB: 0.913
#3896 exp3896v2_sft_ota_32k_qwen3_8b_32768tokens_v5p32-dbc611	Kevin Li	TPU v5 (16 chips)	22.2h	2.15e20 model 5.09e20 HW (42%)	—
math-14b-resili-best-extract-qwen3-14b-base-a7e7d9	Michael Ryan	TPU v5 (16 chips)	1.0d	2.78e20 model 4.94e20 HW (56%)	—
math-14b-resili-best-resili-qwen3-14b-base-c0112d	Michael Ryan	TPU v5 (16 chips)	1.0d	2.78e20 model 4.94e20 HW (56%)	—
math-14b-resili-default-qwen3-14b-base-d81f39	Michael Ryan	TPU v5 (16 chips)	21.6h	2.78e20 model 4.84e20 HW (57%)	—
medical-14b-resili-best-extract-qwen3-14b-base-537b2a	Michael Ryan	TPU v5 (16 chips)	19.4h	2.29e20 model 3.99e20 HW (57%)	—
medical-14b-resili-best-resili-qwen3-14b-base-973573	Michael Ryan	TPU v5 (16 chips)	19.5h	2.29e20 model 3.96e20 HW (58%)	—

Week of March 23rd summary for marin-community/marin

#3192 Synthetic Data & Post-training

#3100 Data Sources for Pre-training

#3096 Pre-training: MoE Scaling Laws

#2836 Infrastructure: MoE Training Support

Other Changes

Top 15 runs (by FLOPs) this week (completed, running, crashed)

#3192 Synthetic Data & Post-training

#3100 Data Sources for Pre-training

#3096 Pre-training: MoE Scaling Laws

#2836 Infrastructure: MoE Training Support

Other Changes

Top 15 runs (by FLOPs) this week (completed, running, crashed)

Keyboard shortcuts