Week of February 16th summary for marin-community/marin

Milestone: Reliable, repeatable, enjoyable infrastructure
85 PRs merged · 2 opened · 56 issues closed · 11 contributors · 0 epics · 359 comments this week

A major infrastructure push hardened Iris for multi-region and multi-cloud training, while Levanter's fused cross-entropy and attention kernels matured, Zephyr gained resilience to preemption, and the project's workflow orchestration was significantly rearchitected.

Other Notable Work


Levanter core improvements. @dlwh refactored the LM data path around a new GrugLmExample abstraction (#2727), replacing unnamed data structures with typed examples and attention masks. The Checkpointer API was simplified to take explicit state and step arguments (#2916), and evaluator infrastructure was migrated to a callback-driven TaggedEvaluator API (#2889). A pure compute_watch_stats helper was extracted for non-Trainer reuse (#2900). Deprecated shard_map imports were updated (#2919) and embedding initialization switched to init_scale (#2920).

Zephyr resilience and tokenization. @ravwojdyla and @ravwojdyla-agent hardened Zephyr against preemption: pipelines now retry on coordinator death (#2928), worker pools recover from preemptible node loss (#2946), and exponential backoff no longer overflows on high attempt counts (#2868). The tokenize pipeline was refactored to extract helpers and tune batch configuration (#2934), gained a ThreadedBatchWriter for background writes (#2933), and temporary data now lives under UUID-namespaced paths with atomic renames (#2785).
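The backoff overflow fix (#2868) addresses a common failure mode: raw `2 ** attempt` grows without bound as attempt counts climb. A minimal sketch of the usual remedy, clamping the exponent before exponentiating (an illustration of the technique, not Zephyr's actual implementation):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff with full jitter.

    Clamping the exponent itself (not just the final delay) keeps
    2 ** attempt from producing a huge intermediate value at high
    attempt counts.
    """
    # Smallest exponent whose delay already exceeds the cap.
    max_exp = int(cap / base).bit_length()
    exp = min(attempt, max_exp)
    delay = min(cap, base * (2 ** exp))
    # Full jitter: sleep a uniform fraction of the computed delay.
    return random.uniform(0.0, delay)
```

Without the clamp, an attempt counter in the millions would overflow the float range before `min(cap, ...)` ever runs.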

Workflow orchestration. @ravwojdyla landed the StepSpec + Artifact system (#2494), a significant rearchitecture of marin's workflow orchestration that replaces implicit magic with explicit, typed step definitions and artifact tracking. Temp bucket utilities were added to both marin core (#2882) and Zephyr (#2879) for consistent temporary storage management.
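The core idea of explicit, typed step definitions with declared artifacts can be sketched roughly as follows. This is a hypothetical illustration: the names `StepSpec` and `Artifact` come from the PR title, but every field and method here is an assumption, not marin's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    """A named, addressable step output (e.g. a path in a bucket).
    Hypothetical shape, for illustration only."""
    name: str
    uri: str

@dataclass(frozen=True)
class StepSpec:
    """An explicit step definition: inputs and outputs are declared
    up front instead of discovered implicitly at run time."""
    name: str
    inputs: tuple[Artifact, ...] = ()
    outputs: tuple[Artifact, ...] = ()

    def depends_on(self, other: "StepSpec") -> bool:
        # A step depends on another if it consumes any artifact
        # the other produces.
        produced = {a.uri for a in other.outputs}
        return any(a.uri in produced for a in self.inputs)

# Dependencies fall out of the declared artifacts, not naming magic:
tokenized = Artifact("tokenized", "gs://bucket/tokenized")
tokenize = StepSpec("tokenize", outputs=(tokenized,))
train = StepSpec("train", inputs=(tokenized,))
```

Declaring inputs and outputs as data makes the dependency graph inspectable before anything runs, which is the usual payoff of replacing implicit wiring with explicit specs.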

Evaluation. @AlienKevin added Harbor evaluator support (#2808), enabling agentic evaluation against 45+ benchmarks from the Harbor registry. A more reliable, unbiased combinatorial estimator for pass@k (#2493) was also merged.
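The standard unbiased combinatorial estimator the pass@k work refers to (popularized by Chen et al. for HumanEval) is 1 − C(n−c, k) / C(n, k), where n is the number of samples and c the number that pass; it avoids the bias of naively averaging per-problem success rates. A minimal sketch in the usual numerically stable product form (the PR's actual implementation is not shown here):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), computed as a
    running product to avoid huge binomial coefficients.
    n = total samples drawn, c = samples that passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws with failures
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For c = 0 the product range is empty (NumPy's empty product is 1.0), so the estimate is correctly 0.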

Testing and CI. Test infrastructure saw broad cleanup: TPU test runtime was reduced (#2918), torch tests were fixed to use locked CPU wheels (#2897), gated HF tokenizer loads were removed from scaling-law tests (#2861), and flaky tests were stabilized (#2909). Claude Code workflows were hardened with timeouts and concurrency groups (#2855, #2816), and a policy for agent-generated GitHub activity was established (#2949).


Training Runs This Week


Dense model isoflop scaling law experiments dominated compute this week. Will Held ran complete AdamH scaling_v3 sweeps at 3e20, 2e20, and 9e19 FLOP budgets across 7 model sizes (273M to 4.3B parameters) on TPU v4 pods, processing hundreds of billions of tokens to map optimal model size vs. token count frontiers. Separately, a large AdamH hyperparameter mega-sweep ran ~60 small runs at 5B tokens each to tune optimizer settings. Michael Ryan completed long-running Qwen3 rephraser SFT runs (0.6B and 1.7B) at 53.8% MFU on v5p, training on 19.7B tokens each. The first daily canary ferry runs validated the fused cross-entropy pipeline end-to-end at 125M scale.
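Isoflop sweeps fix a compute budget C and vary model size N, so the token count D is implied. Under the common C ≈ 6ND rule of thumb for dense transformer training (an approximation, not necessarily marin's exact FLOP accounting), backing out D is one line:

```python
def tokens_for_budget(flops: float, params: float) -> float:
    """Token count implied by a fixed compute budget under the
    common C ~= 6 * N * D approximation for dense transformers."""
    return flops / (6.0 * params)

# e.g. a 273M-parameter model at the 9e19 FLOP budget:
d = tokens_for_budget(9e19, 273e6)  # ~55B tokens
```

Sweeping N at fixed C and fitting loss against N traces out one isoflop curve; repeating at several budgets (3e20, 2e20, 9e19) gives the frontier of optimal model size versus token count.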

| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| Dense isoflop 3e20 sweep (273M-4.3B, 7 sizes) | @Helw150 | v4-32 (16 chips) | 3e20 | 30-48h per run | 23-44% | loss=2.577-2.966, 11-223B tokens per run | completed #1 #2 #3 |
| Dense isoflop 2e20 sweep (273M-2.5B, 6 sizes) | @Helw150 | v4-16 (8 chips) | 2e20 | 25-39h per run | 30-42% | loss=2.604-2.939, 11-134B tokens per run | completed #1 #2 |
| Dense isoflop 9e19 sweep (273M-1.4B, 5 sizes) | @Helw150 | v4-8 (4 chips) / v4-16 (8 chips) | 9e19 | 17-33h per run | 33-43% | loss=2.774-2.980, 11-67B tokens per run | completed #1 |
| Qwen3-1.7B rephraser mid SFT v1 | @MichaelRyan | v5p-8 (4 chips) | 6.5e20 | 196.8h | 53.8% | loss=0.293, 19.7B tokens | completed #1 |
| Qwen3-0.6B rephraser mid SFT v1 | @MichaelRyan | v5p-8 (4 chips) | 2.8e20 | 159.3h | 28.8% | loss=0.324, 19.7B tokens | completed #1 |
| #2954 Daily canary ferry 125M (fused-CE argmax verification) | @dlwh | v5p-8 (4 chips) | 7.4e18 | 4.7h | 27.7% | loss=3.425, 10.8B tokens | completed #1 #2 |
| AdamH hyperparameter mega-sweep (69M-157M, 5B tokens each) | @Helw150 | v5p-8 (4 chips) | ~5e18 per run | 4-5h per run | 11-23% | loss=3.5-4.5, 5B tokens per run, ~60 runs | completed #1 |
