Week of February 16th summary for marin-community/marin

A major infrastructure push hardened Iris for multi-region and multi-cloud training, while Levanter's fused cross-entropy and attention kernels matured, Zephyr gained resilience to preemption, and the project's workflow orchestration was significantly rearchitected.

Other Notable Work

Levanter core improvements. @dlwh refactored the LM data path around a new GrugLmExample abstraction (#2727), replacing unnamed data structures with typed examples and attention masks. The Checkpointer API was simplified to take explicit state and step arguments (#2916), and evaluator infrastructure was migrated to a callback-driven TaggedEvaluator API (#2889). A pure compute_watch_stats helper was extracted for non-Trainer reuse (#2900). Deprecated shard_map imports were updated (#2919) and embedding initialization switched to init_scale (#2920).

Zephyr resilience and tokenization. @ravwojdyla and @ravwojdyla-agent hardened Zephyr against preemption: pipelines now retry on coordinator death (#2928), worker pools recover from preemptible node loss (#2946), and exponential backoff no longer overflows on high attempt counts (#2868). The tokenize pipeline was refactored to extract helpers and tune batch configuration (#2934), gained a ThreadedBatchWriter for background writes (#2933), and temporary data now lives under UUID-namespaced paths with atomic renames (#2785).

Workflow orchestration. @ravwojdyla landed the StepSpec + Artifact system (#2494), a significant rearchitecture of marin's workflow orchestration that replaces implicit magic with explicit, typed step definitions and artifact tracking. Temp bucket utilities were added to both marin core (#2882) and Zephyr (#2879) for consistent temporary storage management.

Evaluation. @AlienKevin added Harbor evaluator support (#2808), enabling agentic evaluation against 45+ benchmarks from the Harbor registry. A more reliable, unbiased combinatorial estimator for pass@k (#2493) was also merged.

Testing and CI. Test infrastructure saw broad cleanup: TPU test runtime was reduced (#2918), torch tests were fixed to use locked CPU wheels (#2897), gated HF tokenizer loads were removed from scaling-law tests (#2861), and flaky tests were stabilized (#2909). Claude Code workflows were hardened with timeouts and concurrency groups (#2855, #2816), and a policy for agent-generated GitHub activity was established (#2949).

87 PRs this week, 148 new comments, and 56 issues closed (56 total)

Sort:

#2955 Daily ferry 2026-02-22: run closure log and seal 💬2 +17 −0 @dlwh
#2951 Make fused CE default to XLA custom VJP on TPU 💬1 +480 −10 @dlwh
#2949 Add policy for agent-generated GitHub activity 💬4 +5 −0 @dlwh
#2948 Tune TPU v4 fused CE block sizes and fix pallas dtype handling +79 −17 @dlwh
#2946 Fix Zephyr worker pool degradation on preemptible nodes 💬5 +71 −1 @ravwojdyla-agent
#2935 Fixup Zephyr coordinator retry for Iris tests +9 −1 @ravwojdyla
#2934 Refactor tokenize pipeline: extract helpers, tune batch/resource config +67 −34 @ravwojdyla-agent
#2933 Add ThreadedBatchWriter and batch_size plumbing in Zephyr 💬5 +102 −12 @ravwojdyla-agent
#2932 Remove unused zephyr_num_cpus/zephyr_memory from TokenizeConfig 💬2 +2 −11 @ravwojdyla-agent
#2931 Replicate WandB info in default_train 💬1 +1 −0 @Helw150
#2928 Retry Zephyr pipeline on coordinator death 💬2 +114 −15 @ravwojdyla-agent
#2925 Remove unused attr of `WriteOp` +0 −2 @ravwojdyla
#2923 Fix ragged paged attention static-arg tracer hashing 💬3 +72 −3 @dlwh
#2922 Silence non-actionable external warning noise 💬2 +15 −4 @dlwh
#2921 Reduce expected fallback warning noise 💬2 +112 −27 @dlwh
#2920 Use init_scale for embedding initialization 💬2 +5 −9 @dlwh
#2919 Replace deprecated shard_map imports 💬2 +39 −16 @dlwh
#2918 Reduce TPU test runtime pressure in levanter-tests 💬1 +77 −69 @dlwh
#2916 Make Checkpointer take explicit saveable_state and step 💬2 +118 −36 @dlwh
#2915 Run levanter tests on main +0 −5 @ravwojdyla
#2914 Move tokenize stats out of Zephyr writer into tokenize top-level +12 −31 @ravwojdyla-agent
#2913 Add fixture to set marin prefix in levanter tests 💬3 +12 −0 @dlwh
#2912 tests: avoid long hf gpt2 serialize torch runs 💬1 +16 −8 @dlwh
#2911 Tokenize write both shard and total to be safe +1 −1 @ravwojdyla
#2909 Fix flaky `test_update_slice_activity_tracks_active_slices` +4 −3 @ravwojdyla
#2908 Improve logging in `tokenize` +45 −7 @ravwojdyla
#2906 Expose no-worker-timeout and bump to 10 minutes for Ray 💬2 +28 −7 @ravwojdyla
#2903 Add daily canary ferry workflow (#2873) 💬1 +73 −3 @yonromai
#2902 Add multi-zone scale group expansion for Iris 💬1 +1302 −87 @rjpower
#2901 Add update-token and restart-runners commands for TPU CI 💬3 +113 −1 @rjpower
#2900 callbacks/watch: extract pure compute_watch_stats helper for non-Trainer reuse 💬1 +254 −88 @dlwh
#2899 In zephyr tests allow to print failure summary +2 −1 @ravwojdyla
#2897 Fix levanter torch CI to install locked CPU torch wheel 💬1 +26 −7 @dlwh
#2896 iris: add IN/disjunction constraint for multi-region scheduling 💬2 +388 −180 @claude
#2894 Trigger Zephyr tests on Iris change 💬1 +2 −0 @ravwojdyla
#2893 Fix ResourceSpec regions argument after Iris multi-region migration 💬3 +4 −2 @ravwojdyla-agent
#2892 Refactor iris smoke test to exercise CLI instead of internal APIs 💬8 +428 −361 @rjpower
#2889 Refactor TaggedEvaluator migration and array where support 💬2 +277 −156 @dlwh
#2888 Bump webpack from 5.94.0 to 5.105.2 in /data_browser +242 −208 @dependabot
#2887 Bump qs from 6.14.1 to 6.14.2 in /data_browser +3 −3 @dependabot
#2886 Bump bytes from 1.11.0 to 1.11.1 in /lib/dupekit +2 −2 @dependabot
#2885 Bump werkzeug from 3.1.3 to 3.1.5 in /data_browser +3 −3 @dependabot
#2884 Bump lodash from 4.17.21 to 4.17.23 in /data_browser +4 −3 @dependabot
#2883 Reduce e2e test timeouts and sleeps to cut idle wait time 💬1 +36 −36 @rjpower
#2882 Use temp bucket in marin - util + JAX compilation cache 💬2 +225 −32 @ravwojdyla
#2881 Allow `gh pr review` to claude review +1 −1 @ravwojdyla
#2880 Fix claude-review workflow cancelling itself on first run +1 −1 @yonromai
#2879 Use temp bucket in zephyr 💬3 +37 −22 @ravwojdyla
#2876 Reference for Canary Training 💬6 +71 −0 @Helw150
#2872 yolo -- Increase max-turns from 15 to 100 in claude.yml 💬1 +1 −1 @rjpower
#2871 Add GHCR push support for iris Docker images 💬1 +196 −34 @rjpower
#2870 Iris: Add basic multi-region support +2091 −274 @rjpower
#2868 Fix ExponentialBackoff overflow on high attempt counts +29 −2 @ravwojdyla-agent
#2864 iris: use sensible defaults for smoke-test CLI 💬3 +118 −34 @rjpower
#2861 Avoid gated HF tokenizer loads in scaling-law unit tests +3 −8 @dlwh
#2860 Bump aiohttp from 3.13.2 to 3.13.3 in /data_browser +88 −88 @dependabot
#2859 Bump filelock from 3.20.1 to 3.20.3 in /lib/dupekit +3 −3 @dependabot
#2858 Bump cryptography from 46.0.3 to 46.0.5 +37 −39 @dependabot
#2857 Bump protobuf from 6.33.4 to 6.33.5 +9 −9 @dependabot
#2856 Bump nbconvert from 7.16.6 to 7.17.0 +3 −3 @dependabot
#2855 Harden Claude Code GitHub Action workflows 💬4 +4 −0 @yonromai
#2853 iris: CoreWeave design doc + initial platform refactor 💬2 +2497 −1139 @rjpower
#2850 Iris: store worker and task logs on remote storage 💬2 +2012 −662 @rjpower
#2847 Update job-monitoring-loop guidance 💬2 +32 −8 @dlwh
#2845 Handle duplicate job names by adopting existing jobs 💬13 +160 −46 @rjpower
#2842 docs: add agent-directed research logbook recipe 💬1 +300 −0 @dlwh
#2841 zephyr: defer worker creation to first execute(), size to demand 💬2 +93 −40 @yonromai
#2840 Fix bundle cache race condition, bump defaults, pin uv 💬1 +34 −20 @rjpower
#2839 Trigger claude review via comment +26 −11 @ravwojdyla
#2837 Fix no `track_progress` on labeled event +1 −1 @ravwojdyla
#2819 Iris: Replace fragile bootstrap template system 💬6 +147 −36 @rjpower
#2818 iris: increase uvicorn timeout_keep_alive to 120s 💬1 +8 −0 @rjpower
#2816 Add concurrency groups to remaining CI workflows 💬1 +12 −0 @rjpower
#2815 Add Iris controller performance benchmark +1456 −470 @rjpower
#2814 iris: cache RPC stubs, eliminate redundant worker timeout 💬1 +105 −244 @rjpower
#2813 Add retry logic with exponential backoff to Iris RemoteClusterClient 💬2 +387 −24 @claude
#2812 docs(iris): document multi-region bundle storage design decision 💬1 +11 −0 @rjpower
#2811 Migrate Claude workflow prompt to system-prompt parameter 💬1 +1 −13 @rjpower
#2808 Support Harbor evaluator for agentic eval +200826 −132 @AlienKevin
#2785 Atomic rename uuid, zephyr temp data under UUID 💬1 +155 −16 @ravwojdyla
#2727 Refactor LM data path around GrugLmExample +505 −66 @dlwh
#2724 Default LM next-token loss to fused path and pin Pallas kernel 💬1 +671 −91 @dlwh
#2702 Guard rebase_file_path extension handling +41 −3 @kiankyars
#2494 `StepSpec` + `Artifact` for no-magic workflow orchestration 💬7 +1142 −329 @ravwojdyla
#2493 [RL] Adopt a more reliable, unbiased combinatorial estimator for pass@k 💬1 +55 −3 @AlienKevin
#2863 Add GPU attention tuned block-size defaults and dispatch 💬16 +2330 −5 @dlwh
#2848 Flash Gated DeltaNet Pallas Kernels 💬1 +15885 −522 @Calvin-Xu

Run	Owner	Hardware	FLOPs	Wall Time	MFU	Evals	Status
Dense isoflop 3e20 sweep (273M-4.3B, 7 sizes)	@Helw150	v4-32 (16 chips)	3e20	30-48h per run	23-44%	loss=2.577-2.966, 11-223B tokens per run	completed #1 #2 #3
Dense isoflop 2e20 sweep (273M-2.5B, 6 sizes)	@Helw150	v4-16 (8 chips)	2e20	25-39h per run	30-42%	loss=2.604-2.939, 11-134B tokens per run	completed #1 #2
Dense isoflop 9e19 sweep (273M-1.4B, 5 sizes)	@Helw150	v4-8 (4 chips) / v4-16 (8 chips)	9e19	17-33h per run	33-43%	loss=2.774-2.980, 11-67B tokens per run	completed #1
Qwen3-1.7B rephraser mid SFT v1	@MichaelRyan	v5p-8 (4 chips)	6.5e20	196.8h	53.8%	loss=0.293, 19.7B tokens	completed #1
Qwen3-0.6B rephraser mid SFT v1	@MichaelRyan	v5p-8 (4 chips)	2.8e20	159.3h	28.8%	loss=0.324, 19.7B tokens	completed #1
#2954 Daily canary ferry 125M (fused-CE argmax verification)	@dlwh	v5p-8 (4 chips)	7.4e18	4.7h	27.7%	loss=3.425, 10.8B tokens	completed #1 #2
AdamH hyperparameter mega-sweep (69M-157M, 5B tokens each)	@Helw150	v5p-8 (4 chips)	~5e18 per run	4-5h per run	11-23%	loss=3.5-4.5, 5B tokens per run, ~60 runs	completed #1

Other Notable Work

Training Runs This Week

Keyboard shortcuts