120 merged · 20 opened · 82 issues closed · 13 contributors · 4 epics · 616 comments this week
A week of deep infrastructure hardening — Iris got security-first auth, DuckDB log storage, 5x controller performance, and multi-host GPU on CoreWeave — while MoE scaling research advanced to iter_03 with PID-controlled sigmoid routing, AdamH, and Gated Norm showing uniform BPB improvements across the isoflop grid.
Summary: Train a 50B MoE model on GPU hardware reliably — from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
24/41 sub-issues closed
Iris saw major performance and security work this week. @rjpower landed a 5x+ controller speedup via a SQLite read pool, query consolidation, and dashboard SQL rewrites #3719, then followed up with further heartbeat and scheduling query optimizations #3881 and faster dashboard queries with expanded benchmark coverage #3791.

The log store was overhauled twice — first replacing SQLite with DuckDB + rotating Parquet #3828 (a rough sketch of the pattern follows this section), then optimizing with connection pooling, sorted segments, and page indexes #3837 — following a fix for the DuckDB store's 60GB memory usage #3843. The BundleStore was also replaced with flat-file storage #3776, and migration scripts were added for the SQLite→fsspec/Parquet transitions #3852.

Security hardening landed with default-deny auth, CSRF protection, auth DB isolation, and traceback sanitization #3894, plus sensitive env var redaction in API responses #3889.

Multi-host GPU support for CoreWeave shipped via the new KubernetesProvider #3806, with smoke CI added #3927, endpoint leak fixes #3814, #3740, tmpfs for task workdirs #3696, #3770, and stale heartbeat reaping on controller restart #3772. The Dockerfiles were unified into a single multi-stage build #3735, and container OOM issues were addressed with memory limits #3840.

@dlwh made the executor region-aware, decoupling Iris from GCS dependencies #3824, fixed Grug checkpoint resume #3790, handled Pallas autotune misses under mosaic partitioning #3669, and added fused CE autotune fixes #3949, #3963. @yonromai refactored the Zephyr coordinator to use host_actor and worker group polling #3861, added S3/R2 conditional-write locking #3874 (see the second sketch below), and fixed sync actor RPC kwargs #3907. @ravwojdyla added combiner support to group-by #3725, iris resource_utils for task resource queries #3847, map_shard with shard info #3757, and improved the job detail view with child jobs and sorting #3733.

The harbor dependency was moved out to marin-community/harbor #3836, removing ~200K lines. A major Platform abstraction elimination is in progress #3900, reorganizing into Service + Provider layers.
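For readers unfamiliar with the DuckDB-over-rotating-Parquet pattern, here is a minimal sketch of the idea: append logs into immutable Parquet segments, then query all segments through DuckDB. The `ParquetLogStore` class, its methods, and the rotation threshold are illustrative assumptions, not Iris's actual API:

```python
import time
from pathlib import Path

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq


class ParquetLogStore:
    """Append logs to rotating Parquet segments; query them all via DuckDB."""

    def __init__(self, root: str, rotate_rows: int = 100_000):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.rotate_rows = rotate_rows
        self.buffer: list[dict] = []

    def append(self, task_id: str, line: str) -> None:
        self.buffer.append({"ts": time.time(), "task_id": task_id, "line": line})
        if len(self.buffer) >= self.rotate_rows:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # Each flush writes an immutable segment; old segments are never rewritten.
        table = pa.Table.from_pylist(self.buffer)
        pq.write_table(table, self.root / f"segment-{time.time_ns()}.parquet")
        self.buffer.clear()

    def query(self, task_id: str) -> list[tuple]:
        # DuckDB scans every segment, pushing the predicate down into Parquet metadata.
        return duckdb.execute(
            "SELECT ts, line FROM read_parquet(?) WHERE task_id = ? ORDER BY ts",
            [str(self.root / "*.parquet"), task_id],
        ).fetchall()
```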
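And a hedged sketch of the conditional-write locking idea behind #3874: an `If-None-Match: *` precondition makes the PUT succeed only if the key does not exist yet, which turns object creation into a crude mutex on S3 or R2. The bucket name, lock key, and error handling are assumptions for illustration, not the Iris code:

```python
import boto3
import botocore.exceptions

s3 = boto3.client("s3")  # R2 works the same way via its S3-compatible endpoint

def try_acquire_lock(bucket: str, key: str, owner: str) -> bool:
    """Atomically create the lock object; fail if another writer already holds it."""
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=owner.encode(), IfNoneMatch="*")
        return True
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] in ("PreconditionFailed", "412"):
            return False  # someone else created the lock object first
        raise

if try_acquire_lock("my-bucket", "locks/coordinator", "worker-0"):
    print("lock acquired")
```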
113 PRs this week, 8 new comments, and 0 new issues (41 total)
- #3779 [grug/moe] Add explicit EP implementation selector (💬6, +264/−9) @dlwh-golem
- #3963 Lower shard_map tracers directly in fused CE autotune (💬1, +132/−8) @dlwh
Building on last week's iter_02 improvements, @Helw150 pushed the MoE architecture to iter_03 with five major changes: sigmoid routing replacing softmax for independent per-expert selection, a PID bias controller (Kp/Ki/Kd) replacing the bang-bang balancer, AdamH (a scale-invariant Adam that keeps weight Frobenius norms constant) replacing standard Adam, removal of all auxiliary losses (load balancing, router z-loss, logit z-loss), and variance-preserving residual connections. Two parallel Vizier sweeps — 160 trials each at d=1024/3e18 FLOPs — are comparing the iter_02 baseline against iter_03 AdamH to find optimal hyperparameters at matched compute #2167. (Rough sketches of the PID-controlled router and the AdamH/Gated Norm ideas follow this section.)

@WhenWen completed the full AdamH isoflop grid, showing 0.009–0.01 BPB improvements over v02 at most configurations, and added Gated Norm on top — a learned sigmoid gate that compensates for AdamH's bounded activation norms — producing uniform improvements of 0.002–0.006 BPB across all 15 grid points, with the d2048 regression at 1e18 nearly eliminated. @ClassicLarry ran the recipe at 1e20 scale with directionally good results, completed iter_01 vs iter_02 comparisons showing consistent improvements across the full grid, then kicked off a 1e21 run (14B total params, 2B active, 75B tokens) before stepping out #3800.

On the GPU front, @chloechiaw is testing Tokamax's Pallas Triton ragged_dot kernel as a JAX-native alternative for GPU MoE #2828. @dlwh added an explicit EP implementation selector for the Grug MoE path #3779, an ArrayStacked grug variant with stack-aware optimizer support #3797, and an only-store-EMA-when-enabled optimization #3670. Selective MoE remat checkpoint policies were added and then scoped down after feedback that the model file was getting too heavy #3657.
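A hedged sketch of what sigmoid routing with a PID-controlled bias might look like, reconstructed only from the description above (independent sigmoid scores, a bias that steers top-k selection but not combine weights, and PID terms replacing the bang-bang sign update). All names, gains, and shapes are illustrative assumptions, not the iter_03 implementation:

```python
import jax
import jax.numpy as jnp

def route(logits, bias, k):
    """Sigmoid routing: independent per-expert scores; bias steers selection only."""
    scores = jax.nn.sigmoid(logits)                       # [tokens, experts]
    _, idx = jax.lax.top_k(scores + bias, k)              # biased expert selection
    weights = jnp.take_along_axis(scores, idx, axis=-1)   # unbiased combine weights
    return idx, weights

def pid_bias_update(state, load_frac, kp=1e-2, ki=1e-3, kd=1e-3):
    """Replace the bang-bang sign(error) update with a full PID term."""
    bias, integral, prev_err = state
    err = load_frac - 1.0 / load_frac.shape[-1]  # deviation from uniform expert load
    integral = integral + err
    deriv = err - prev_err
    # Overloaded experts (err > 0) get their bias pushed down, and vice versa.
    bias = bias - (kp * err + ki * integral + kd * deriv)
    return bias, integral, err
```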
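Similarly, a minimal sketch of the two optimizer/norm ideas as one plausible reading of their one-line descriptions: AdamH's constant-Frobenius-norm property rendered as a post-step projection, and Gated Norm as an RMSNorm modulated by a learned sigmoid gate. Both are assumptions, not the actual recipe code:

```python
import jax
import jax.numpy as jnp

def project_to_norm(w, target_norm):
    """After the Adam step, rescale the weight matrix back to a fixed Frobenius norm."""
    return w * (target_norm / (jnp.linalg.norm(w) + 1e-8))

def gated_rmsnorm(x, scale, gate, eps=1e-6):
    """RMSNorm whose output is modulated by a learned sigmoid gate, letting the
    network recover amplitude that bounded weight norms would otherwise cap."""
    rms = jnp.sqrt(jnp.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * scale * jax.nn.sigmoid(gate)
```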
4 PRs this week, 17 new comments, and 2 new issues (6 total)
Summary: We will need 20T of high-quality tokens (in particular code) for our large MoE runs in Q2/Q3; this is the work we will do in March to enable that.
0/4 sub-issues closed
Following last week's exact dedup rewrite, @ravwojdyla confirmed successful large-scale group-by for fuzzy dedup on a Nemotron split #2829, validating the pipeline at production scale. @rjpower switched Nemotron-CC downloads to streaming .jsonl.zst output #3796 (sketched below). @dlwh added support for hf://buckets paths in default_download and default_tokenize #3793. The dupekit Rust code was refactored into rust/ with CI wheels and a mode switch #3850. A fuzzy dedup update to scale to Nemotron is in progress #3750.
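As a rough illustration of the streaming pattern in #3796, here is a hedged sketch that pulls records over HTTP and re-emits them as zstd-compressed JSONL without ever materializing the whole file; the URL and record handling are placeholders, not the Marin pipeline:

```python
import json

import requests
import zstandard as zstd

def stream_to_jsonl_zst(url: str, out_path: str) -> None:
    """Stream newline-delimited records from `url` straight into a .jsonl.zst file."""
    cctx = zstd.ZstdCompressor(level=3)
    with requests.get(url, stream=True) as resp, open(out_path, "wb") as fh:
        resp.raise_for_status()
        with cctx.stream_writer(fh) as writer:
            for line in resp.iter_lines():
                if not line:
                    continue
                record = json.loads(line)  # validate / transform each record here
                writer.write(json.dumps(record).encode() + b"\n")
```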
4 PRs this week, and 0 new issues (4 total)
- #3796 Stream Nemotron-CC download and output as .jsonl.zst (💬1, +99/−144) @rjpower
- #3793 Support hf://buckets paths in default_download and default_tokenize (💬2, +177/−17) @dlwh
- #3850 Refactor Rust code: move dupekit to rust/, add CI wheels and mode switch (💬12, +375/−63) @rjpower
@ahmeda14960 opened a major PR for an alignment function that goes from spec to synthetic preference data to DPO #3950 (a DPO-loss sketch follows below), alongside work migrating the RL pipeline from Fray v1 (Ray) to Fray v2 (Iris) #3960. A colocated RL project doc and Tunix RL reference were also opened #3948. @teetone added new Evalchemy evals and fixes #3690.
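For context on the DPO stage named in #3950, a minimal sketch of the standard DPO objective (Rafailov et al.) in JAX; the inputs are assumed to be per-sequence sums of token log-probs, and this is the textbook loss, not the PR's code:

```python
import jax
import jax.numpy as jnp

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO: prefer the chosen response by a margin over the reference model.

    Each argument is the per-sequence sum of token log-probs, shape [batch].
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # -log(sigmoid(x)) written stably as softplus(-x)
    return jnp.mean(jax.nn.softplus(-beta * (chosen_margin - rejected_margin)))
```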
1 PR this week, 1 new comment, and 0 new issues (4 total)
- #2262 Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B (💬1)
Other Notable Changes
@dlwh opened a PR for XLA-first Mamba-3 SISO and MIMO TPU kernels #3961, a significant new architecture direction (a generic SSM-scan sketch follows below). @RohithKuditipudi fixed final checkpoint saving #3958. The nightshift agent system continued evolving — skills were separated for PR authoring vs. review #3920, triage was upgraded to opus #3749, and turn limits were increased 10x #3935. @Helw150 has a tokenized molecule training draft in progress #3742.
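For readers unfamiliar with the kernel family behind #3961, a hedged sketch of the core SISO (single-input, single-output) linear state-space recurrence h_t = a_t·h_{t−1} + b_t·x_t, expressed as a parallel associative scan in JAX; this is the generic primitive such kernels accelerate, not the Mamba-3 kernel itself:

```python
import jax
import jax.numpy as jnp

def ssm_scan(a, bx):
    """Parallel scan for h_t = a_t * h_{t-1} + bx_t over the time axis."""
    def combine(left, right):
        a_l, h_l = left
        a_r, h_r = right
        # Composing two linear steps stays linear: decays multiply, and the
        # earlier state is carried through the later step's decay.
        return a_l * a_r, h_l * a_r + h_r
    _, h = jax.lax.associative_scan(combine, (a, bx))
    return h

# Tiny usage example: one scalar channel over 8 timesteps.
a = jnp.full((8,), 0.9)                 # per-step decay a_t
bx = jnp.arange(8, dtype=jnp.float32)   # per-step input b_t * x_t
print(ssm_scan(a, bx))
```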
23 PRs this week, 31 new comments, and 82 issues closed (82 total)
- #3865 [docs] Fix stale evaluation tutorial links (💬1, +3/−5) @dlwh