Week of January 26th summary for marin-community/marin

Milestone: Iris V1
This week: 67 PRs merged, 2 opened, 58 issues closed, 10 contributors, 93 comments.

A major infrastructure push hardened Iris for multi-region and multi-cloud training, while Levanter's fused cross-entropy and attention kernels matured, Zephyr gained resilience to preemption, and the project's workflow orchestration was significantly rearchitected.

Other Notable Work


Levanter core improvements. @dlwh refactored the LM data path around a new GrugLmExample abstraction (#2727), replacing unnamed data structures with typed examples and attention masks. The Checkpointer API was simplified to take explicit state and step arguments (#2916), and evaluator infrastructure was migrated to a callback-driven TaggedEvaluator API (#2889). A pure compute_watch_stats helper was extracted for non-Trainer reuse (#2900). Deprecated shard_map imports were updated (#2919) and embedding initialization switched to init_scale (#2920).

Zephyr resilience and tokenization. @ravwojdyla and @ravwojdyla-agent hardened Zephyr against preemption: pipelines now retry on coordinator death (#2928), worker pools recover from preemptible node loss (#2946), and exponential backoff no longer overflows on high attempt counts (#2868). The tokenize pipeline was refactored to extract helpers and tune batch configuration (#2934), gained a ThreadedBatchWriter for background writes (#2933), and temporary data now lives under UUID-namespaced paths with atomic renames (#2785).
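The backoff overflow fix (#2868) addresses a common pitfall: naively computing `base * 2 ** attempt` grows without bound, and after enough preemption-driven retries the exponent overflows or produces absurd delays. The sketch below is illustrative only (not Zephyr's actual code); the function name and constants are assumptions. It shows the standard remedy of clamping the exponent and capping the delay, with full jitter:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Capped exponential backoff with full jitter (illustrative sketch).

    Clamping the exponent before exponentiating keeps 2**attempt finite
    even when the attempt count grows very large during a long outage.
    """
    exponent = min(attempt, 30)          # clamp so 2**exponent stays small
    delay = min(cap, base * (2 ** exponent))
    return random.uniform(0.0, delay)    # jitter spreads retries from many workers
```

Capping before exponentiating (rather than capping the result) is the key detail: `2 ** attempt` must never be evaluated with an unbounded exponent.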

Workflow orchestration. @ravwojdyla landed the StepSpec + Artifact system (#2494), a significant rearchitecture of marin's workflow orchestration that replaces implicit magic with explicit, typed step definitions and artifact tracking. Temp bucket utilities were added to both marin core (#2882) and Zephyr (#2879) for consistent temporary storage management.
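The actual StepSpec + Artifact API lives in #2494; the sketch below is a hypothetical reconstruction of the idea only, with every field and helper name assumed rather than taken from marin. It illustrates what "explicit, typed step definitions with artifact tracking" means in practice: each step declares the artifacts it consumes and produces, so execution order falls out of the declarations instead of implicit magic:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Artifact:
    """A named output a step produces (illustrative, not marin's API)."""
    name: str
    path: str

@dataclass(frozen=True)
class StepSpec:
    """An explicit step: the function to run plus its declared inputs/outputs."""
    name: str
    fn: Callable[..., None]
    inputs: tuple[Artifact, ...] = ()
    outputs: tuple[Artifact, ...] = ()

def topo_order(steps: list[StepSpec]) -> list[StepSpec]:
    """Order steps so each runs after the steps producing its inputs."""
    produced: set[str] = set()
    ordered: list[StepSpec] = []
    pending = list(steps)
    while pending:
        ready = [s for s in pending if all(a.name in produced for a in s.inputs)]
        if not ready:
            raise ValueError("cycle or missing artifact among steps")
        for s in ready:
            ordered.append(s)
            produced.update(a.name for a in s.outputs)
            pending.remove(s)
    return ordered
```

Because dependencies are data rather than call order, the orchestrator can also diff declared outputs against what already exists and skip completed steps.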

Evaluation. @AlienKevin added Harbor evaluator support (#2808), enabling agentic evaluation against 45+ benchmarks from the Harbor registry. A more reliable, unbiased combinatorial estimator for pass@k (#2493) was also merged.
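PR #2493's exact implementation isn't shown here, but the standard unbiased combinatorial estimator for pass@k (from the Codex evaluation literature) is worth stating, since the naive "did any of the first k samples pass" rate is biased for k < n. The function name below is my own; the formula is the established one:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: samples that passed, k: budget.
    Equals the probability that at least one of k samples drawn
    without replacement from the n generated samples passes.
    """
    if n - c < k:
        return 1.0  # too few failures to fill all k draws with failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over problems gives an unbiased estimate of pass@k, whereas truncating to the first k samples systematically misestimates it.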

Testing and CI. Test infrastructure saw broad cleanup: TPU test runtime was reduced (#2918), torch tests were fixed to use locked CPU wheels (#2897), gated HF tokenizer loads were removed from scaling-law tests (#2861), and flaky tests were stabilized (#2909). Claude Code workflows were hardened with timeouts and concurrency groups (#2855, #2816), and a policy for agent-generated GitHub activity was established (#2949).


Training Runs This Week


A compute-heavy week dominated by scaling-law experiments and an SFT milestone. Will Held ran 53 isoflop scaling-law runs across 7 FLOPs budgets (1e18 to 3e19), mapping optimal model size as a function of compute and achieving a best loss of 2.949 at the 1.8e19 budget with a 2.2B-parameter model. The Tomol25 1B pre-training run completed its full 48k steps at 54.8% MFU. Kevin Li completed a 37-hour Qwen3-8B agentic SFT run on OpenThoughts-Agent-v1 data using a v5p-32. Calvin Xu ran 35 data-mixture ablations testing two-phase StarCoder configurations at small scale.

| Run | Owner | Hardware | FLOPs | Wall Time | MFU | Evals | Status |
|---|---|---|---|---|---|---|---|
| #2499 Tomol25 1B pre-training | @Helw150 | v5p-8 (4 chips) | 5e19 | 39.6h | 54.8% | loss=3.830, 1B params, 48k steps | completed |
| #2601 Agentic SFT: Qwen3-8B on OpenThoughts-Agent-v1 | @AlienKevin | v5p-32 (16 chips) | 3e19 | 36.9h | 42.4% | loss=0.025, Qwen3-8B SFT, seq_len=32768 | completed |
| #2499 Isoflop scaling laws (3e19 FLOPs budget) | @Helw150 | v5p-8 (4 chips) | 3e19 | 10.2-11.1h per run | 46.5-48.8% | best loss=3.004 (d1024-L11), 4 model sizes from 160M to 1B, Nemotron LR schedule | completed |
| #2499 Isoflop scaling laws (1.8e19 FLOPs budget) | @Helw150 | v5p-8 (4 chips) | 1.8e19 | 6.4-9.8h per run | 44.0-48.9% | best loss=2.949 (d2176-L22), 14 model sizes from 160M to 2.5B | completed |
| #2499 Isoflop scaling laws (1e19 FLOPs budget) | @Helw150 | v5p-8 (4 chips) | 1e19 | 4.1-6.0h per run | 40.4-45.0% | best loss=3.031, 11 model sizes | completed |
| Data mixture sweep: two-phase StarCoder v4 | @Calvin-Xu | v5p-8 (4 chips) | 1e18 | 1.2-2.3h per run | 35.7-36.9% | best loss=2.288, 35 runs, 768-dim 10-layer Qwen3 arch, seq_len=2048 | completed |
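As a rough sanity check on the isoflop numbers above, the common dense-transformer approximation C ≈ 6·N·D relates compute budget C, parameter count N, and training tokens D. This is an approximation only, and the runs' actual token counts are not reported here; the helper name is my own:

```python
def tokens_for_budget(flops: float, params: float) -> float:
    """Token count implied by the C ~= 6 * N * D rule of thumb."""
    return flops / (6.0 * params)

# Best 1.8e19-budget run used a ~2.2B-parameter model,
# implying on the order of 1.4e9 training tokens under this rule.
d = tokens_for_budget(1.8e19, 2.2e9)
```

The same arithmetic explains why the smaller 1e18 sweep pairs naturally with sub-1B models: at a fixed budget, parameter count and token count trade off inversely.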