85 merged, 2 opened, 56 issues closed, 11 contributors, 0 epics, 359 comments this week
A major infrastructure push hardened Iris for multi-region and multi-cloud training, while Levanter's fused cross-entropy and attention kernels matured, Zephyr gained resilience to preemption, and the project's workflow orchestration was significantly rearchitected.
Other Notable Work
Levanter core improvements. @dlwh refactored the LM data path around a new GrugLmExample abstraction (#2727), replacing unnamed data structures with typed examples and attention masks. The Checkpointer API was simplified to take explicit state and step arguments (#2916), and evaluator infrastructure was migrated to a callback-driven TaggedEvaluator API (#2889). A pure compute_watch_stats helper was extracted for non-Trainer reuse (#2900). Deprecated shard_map imports were updated (#2919), and embedding initialization switched to init_scale (#2920).
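To give a flavor of what a typed LM example buys over unnamed tuples, here is a minimal sketch in the spirit of GrugLmExample. The field and method names are assumptions for illustration, not Levanter's actual class:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LmExample:
    """Illustrative typed LM training example (not Levanter's actual API).

    Bundling tokens, loss mask, and segment ids into one named structure
    replaces positional tuples whose meaning lived only in readers' heads.
    """
    tokens: tuple       # [seq_len] input token ids
    loss_mask: tuple    # [seq_len] 1 where a position contributes to the loss
    segment_ids: tuple  # [seq_len] segment id per position, for packed sequences

    def attention_allowed(self, q: int, k: int) -> bool:
        # Causal attention, restricted to positions in the same packed segment.
        return k <= q and self.segment_ids[q] == self.segment_ids[k]
```

With the mask carried on the example itself, packed-sequence attention rules travel with the data instead of being reconstructed ad hoc in each consumer.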
Zephyr resilience and tokenization. @ravwojdyla and @ravwojdyla-agent hardened Zephyr against preemption: pipelines now retry on coordinator death (#2928), worker pools recover from preemptible node loss (#2946), and exponential backoff no longer overflows on high attempt counts (#2868). The tokenize pipeline was refactored to extract helpers and tune batch configuration (#2934), gained a ThreadedBatchWriter for background writes (#2933), and temporary data now lives under UUID-namespaced paths with atomic renames (#2785).
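The backoff overflow fix in #2868 addresses a common pitfall: computing `2 ** attempt` before capping. A minimal sketch of the overflow-safe pattern (names and constants are illustrative, not Zephyr's actual code):

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter, safe at high attempt counts.

    Clamping the exponent *before* exponentiation keeps base * 2**attempt
    from overflowing when it is converted to a float.
    """
    exp = min(attempt, 32)            # clamp first, so 2**attempt stays bounded
    delay = min(cap, base * (2 ** exp))
    return random.uniform(0.0, delay)  # full jitter in [0, delay]
```

Without the clamp, a long-lived retry loop at attempt 1000 would attempt to convert an astronomically large integer to a float and raise rather than wait.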
Workflow orchestration.@ravwojdyla landed the StepSpec + Artifact system (#2494), a significant rearchitecture of marin's workflow orchestration that replaces implicit magic with explicit, typed step definitions and artifact tracking. Temp bucket utilities were added to both marin core (#2882) and Zephyr (#2879) for consistent temporary storage management.
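To illustrate the "explicit, typed step definitions" idea, here is a hypothetical sketch of what StepSpec-style orchestration can look like. The class and field names are assumptions, not marin's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    # A named, addressable output of a step (illustrative only).
    path: str


@dataclass(frozen=True)
class StepSpec:
    # Each step declares its inputs and outputs explicitly, so the
    # orchestrator can derive the dependency graph from the declarations
    # instead of implicit discovery.
    name: str
    inputs: tuple
    outputs: tuple


def topo_order(steps: list) -> list:
    """Order steps so each runs after the producers of its inputs."""
    produced_by = {a: s.name for s in steps for a in s.outputs}
    order, seen = [], set()

    def visit(s: StepSpec) -> None:
        if s.name in seen:
            return
        seen.add(s.name)
        for a in s.inputs:
            if a in produced_by:
                visit(next(t for t in steps if t.name == produced_by[a]))
        order.append(s.name)

    for s in steps:
        visit(s)
    return order
```

Because dependencies are stated in the step definitions themselves, execution order falls out of the declarations rather than from side-channel conventions.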
Evaluation.@AlienKevin added Harbor evaluator support (#2808), enabling agentic evaluation against 45+ benchmarks from the Harbor registry. A more reliable, unbiased combinatorial estimator for pass@k (#2493) was also merged.
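The unbiased combinatorial pass@k estimator is presumably the standard form popularized by the Codex paper: given n samples of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). A sketch of the textbook computation (not necessarily the PR's actual code), in the numerically stable product form:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct.

    pass@k = 1 - C(n-c, k) / C(n, k), i.e. the probability that a
    without-replacement draw of k samples contains at least one success.
    Computed as a running product to avoid huge intermediate binomials.
    """
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

The naive estimate (c/n)^... of "any of k independent tries" is biased when the k draws come from a fixed pool of n samples; the combinatorial form corrects for drawing without replacement.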
Testing and CI. Test infrastructure saw broad cleanup: TPU test runtime was reduced (#2918), torch tests were fixed to use locked CPU wheels (#2897), gated HF tokenizer loads were removed from scaling-law tests (#2861), and flaky tests were stabilized (#2909). Claude Code workflows were hardened with timeouts and concurrency groups (#2855, #2816), and a policy for agent-generated GitHub activity was established (#2949).
87 PRs this week, 148 new comments, and 56 issues closed (56 total)
#2955 Daily ferry 2026-02-22: run closure log and seal (💬 2, +17 −0) @dlwh
#2951 Make fused CE default to XLA custom VJP on TPU (💬 1, +480 −10) @dlwh
#2949 Add policy for agent-generated GitHub activity (💬 4, +5 −0) @dlwh
#2948 Tune TPU v4 fused CE block sizes and fix pallas dtype handling (+79 −17) @dlwh
Dense model isoflop scaling law experiments dominated compute this week. Will Held ran complete AdamH scaling_v3 sweeps at 3e20, 2e20, and 9e19 FLOP budgets across 7 model sizes (273M to 4.3B parameters) on TPU v4 pods, processing hundreds of billions of tokens to map optimal model size vs. token count frontiers. Separately, a large AdamH hyperparameter mega-sweep ran ~60 small runs at 5B tokens each to tune optimizer settings. Michael Ryan completed long-running Qwen3 rephraser SFT runs (0.6B and 1.7B) at 53.8% MFU on v5p, training on 19.7B tokens each. The first daily canary ferry runs validated the fused cross-entropy pipeline end-to-end at 125M scale.
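As a rough sanity check on these budgets, the usual dense-transformer approximation C ≈ 6·N·D relates a FLOP budget C to model size N and token count D. A hypothetical helper (the sweep's actual accounting may differ):

```python
def tokens_for_budget(flops: float, params: float) -> float:
    """Tokens trainable under a FLOP budget, via the C ≈ 6*N*D rule of
    thumb for dense transformers (illustrative, not the sweep's code)."""
    return flops / (6.0 * params)


# e.g. a 3e20 FLOP budget at 1B parameters allows roughly 50B tokens;
# at 4.3B parameters the same budget covers only about 11.6B tokens,
# which is why an isoflop sweep traces out a model-size/token trade-off.
```

Sweeping model size at fixed C and plotting loss against N traces the isoflop frontier whose minimum gives the compute-optimal size for that budget.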