Week of January 26th summary for marin-community/marin

A major infrastructure push hardened Iris for multi-region and multi-cloud training, while Levanter's fused cross-entropy and attention kernels matured, Zephyr gained resilience to preemption, and the project's workflow orchestration was significantly rearchitected.

Other Notable Work

Levanter core improvements. @dlwh refactored the LM data path around a new GrugLmExample abstraction (#2727), replacing unnamed data structures with typed examples and attention masks. The Checkpointer API was simplified to take explicit state and step arguments (#2916), and evaluator infrastructure was migrated to a callback-driven TaggedEvaluator API (#2889). A pure compute_watch_stats helper was extracted for non-Trainer reuse (#2900). Deprecated shard_map imports were updated (#2919) and embedding initialization switched to init_scale (#2920).

Zephyr resilience and tokenization. @ravwojdyla and @ravwojdyla-agent hardened Zephyr against preemption: pipelines now retry on coordinator death (#2928), worker pools recover from preemptible node loss (#2946), and exponential backoff no longer overflows on high attempt counts (#2868). The tokenize pipeline was refactored to extract helpers and tune batch configuration (#2934), gained a ThreadedBatchWriter for background writes (#2933), and temporary data now lives under UUID-namespaced paths with atomic renames (#2785).

Workflow orchestration. @ravwojdyla landed the StepSpec + Artifact system (#2494), a significant rearchitecture of marin's workflow orchestration that replaces implicit magic with explicit, typed step definitions and artifact tracking. Temp bucket utilities were added to both marin core (#2882) and Zephyr (#2879) for consistent temporary storage management.

Evaluation. @AlienKevin added Harbor evaluator support (#2808), enabling agentic evaluation against 45+ benchmarks from the Harbor registry. A more reliable, unbiased combinatorial estimator for pass@k (#2493) was also merged.

Testing and CI. Test infrastructure saw broad cleanup: TPU test runtime was reduced (#2918), torch tests were fixed to use locked CPU wheels (#2897), gated HF tokenizer loads were removed from scaling-law tests (#2861), and flaky tests were stabilized (#2909). Claude Code workflows were hardened with timeouts and concurrency groups (#2855, #2816), and a policy for agent-generated GitHub activity was established (#2949).

69 PRs this week, 22 new comments, and 58 issues closed (58 total)

Sort:

#2603 Fray/Iris fixes for Zephyr integration. +251 −50 @rjpower
#2598 Iris: migrate thread management to ManagedThread/ThreadRegistry +567 −560 @rjpower
#2595 Iris: move Dockerfile generation from worker to client +487 −308 @rjpower
#2593 Iris: Fix builder logs not showing during BUILDING state 💬4 +119 −8 @rjpower
#2591 Skip flaky test_rpc_heartbeat_via_connect_client +2 −1 @rjpower
#2590 Try to fix unittest caching. +1 −0 @rjpower
#2588 Fix Iris CLI task log fetching to use controller proxy 💬5 +36 −28 @rjpower
#2585 Iris: let controller re-bootstrap VMs on reload +17 −80 @rjpower
#2583 Retry with backoff on lm_eval +92 −3 @Helw150
#2582 Job Monitor LLM Agent Skill +92 −0 @Helw150
#2581 Add Zephyr support to Iris worker builds +216 −113 @rjpower
#2573 Iris: Controller-initiated heartbeat protocol +1132 −1718 @rjpower
#2571 Iris: Switch buf to npx buf with pinned version +12 −3 @rjpower
#2568 Iris: Refactor iris_run.py for better testing and error handling +292 −396 @rjpower
#2563 Add fray.v2 API surface +7091 −0 @rjpower
#2562 Iris: Add cluster reload CLI command +100 −7 @rjpower
#2560 Iris: Enable all chaos tests, fix heartbeat completion delivery and infinite retries +359 −270 @rjpower
#2559 Iris: Fix background thread logging errors during test teardown +55 −20 @rjpower
#2558 Iris: Move replicas from ResourceSpec to LaunchJobRequest +146 −188 @rjpower
#2557 Iris: Expose worker preemptible status as a schedulable constraint +443 −166 @rjpower
#2554 Iris: Refactor LocalClusterClient to use ClusterManager +664 −394 @rjpower
#2549 Iris: Task attempt tracking for stale attempt detection +351 −125 @rjpower
#2548 Iris: Make logs a tab instead of separate page +10 −10 @rjpower
#2546 Iris: Extract TaskAttempt class for cleaner worker task lifecycle +806 −489 @rjpower
#2545 Refactor Newton-Schulz iteration with configurable coefficients +130 −73 @WhenWen
#2544 Iris: Add explicit TASK_STATE_ASSIGNED for dispatched tasks +54 −39 @rjpower
#2543 Iris: Rate-limit autoscaler evaluation and async scale-up +627 −74 @rjpower
#2539 Iris: Replace /api/logs REST endpoint with GetControllerLogs RPC +559 −222 @rjpower
#2537 Iris: Chaos testing framework +1903 −0 @rjpower
#2532 Update latest Docker image to 20260129 to update pandas to 3.0.0 +10 −10 @moojink
#2531 Fix TPU name stripping for multi-host TPUs, simplify smoke test cleanup +33 −51 @rjpower
#2530 Iris: Fix infinite refresh loop in job detail page +100 −80 @rjpower
#2528 Iris: Migrate dashboard to Preact+HTM components +1969 −2234 @rjpower
#2524 Force Run Failed Default True +5 −5 @Helw150
#2523 Iris: Dashboard v2 — flat tables, diagnostics, autoscaler, log proxy, VM detail +2599 −783 @rjpower
#2522 Add --fast mode to smoke test for iterative development 💬2 +207 −56 @rjpower
#2519 Iris: add centralized ring buffer logging with dashboard UI +344 −40 @rjpower
#2517 Iris: add ClusterManager for local smoke testing, deduplicate SSH log commands +1626 −1195 @rjpower
#2516 Iris: Add a real JAX TPU test, fixing TPU env vars in the worker +598 −290 @rjpower
#2515 Iris smoke test: logging improvements and cleanup 💬2 +1787 −185 @rjpower
#2514 Retry transient HF Hub errors in tokenizer load +56 −4 @pc0618
#2512 Iris Smoke Test and fixes +5766 −1721 @rjpower
#2510 Bump @remix-run/router and react-router-dom in /data_browser +17 −14 @dependabot
#2508 Add TPU auto-detection for ray_run +44 −12 @dlwh
#2507 fix some dependabot stuff +13 −6 @dlwh
#2506 Bump pyasn1 from 0.6.1 to 0.6.2 in /data_browser +3 −3 @dependabot
#2505 Bump python-multipart from 0.0.21 to 0.0.22 +3 −3 @dependabot
#2504 Bump qs and body-parser in /data_browser +89 −38 @dependabot
#2503 Bump urllib3 from 2.5.0 to 2.6.3 in /data_browser +3 −3 @dependabot
#2502 fix getting pspec for None-backed named arrays +17 −6 @dlwh
#2501 Make vLLM GPU sidecar usable on DGX Spark + add completions smoke test path 💬2 +165 −36 @dlwh
#2490 Refactor DNA experiments to reduce code duplication +412 −947 @gonzalobenegas
#2484 Pick These Up Automatically since you basically have to pass them anyhow +4 −0 @Helw150
#2474 [Logging] Add HF download file size validation, load_jsonl print bad file path +72 −19 @Calvin-Xu
#2472 Promoter + CDS mixture experiment +162 −0 @gonzalobenegas
#2471 CDS YOLO run +95 −0 @gonzalobenegas
#2470 mRNA + promoters YOLO run +207 −0 @gonzalobenegas
#2464 Add Native Grep and Tail Functionality +27 −2 @Helw150
#2463 fix potential sharding inside hax.vmap +89 −4 @dlwh
#2458 [RL] Align model and vLLM vocabulary sizes +45 −10 @AlienKevin
#2457 [RL] Move vLLM configuration from env vars to native arguments +14 −6 @AlienKevin
#2456 [RL] Support Qwen 2.5 in RL weight transfer and model registry 💬1 +118 −20 @AlienKevin
#2444 GCSFuse is Gone, Sniff Downloads More Directly +34 −9 @Helw150
#2437 Promoters YOLO run 💬1 +223 −0 @gonzalobenegas
#2436 Iris: introduce autoscalar +16545 −2397 @rjpower
#2433 Split text data module, refactor data config classes, add prebuilt caches +3525 −3760 @dlwh
#2323 Fix Issue when non-Marin Community users run speedruns without any entity 💬1 +29 −14 @Helw150
#2518 The plan 💬2 +4884 −39 @dlwh
#2477 Prototype changes for Marin as a library 💬4 +1215 −435 @rjpower

Run	Owner	Hardware	FLOPs	Wall Time	MFU	Evals	Status
#2499 Tomol25 1B pre-training	@Helw150	v5p-8 (4 chips)	5e19	39.6h	54.8%	loss=3.830, 1B params, 48k steps	completed #1
#2601 Agentic SFT: Qwen3-8B on OpenThoughts-Agent-v1	@AlienKevin	v5p-32 (16 chips)	3e19	36.9h	42.4%	loss=0.025, Qwen3-8B SFT, seq_len=32768	completed #1
#2499 Isoflop scaling laws (3e19 FLOPs budget)	@Helw150	v5p-8 (4 chips)	3e19	10.2-11.1h per run	46.5-48.8%	best loss=3.004 (d1024-L11), 4 model sizes from 160M to 1B, Nemotron LR schedule	completed #1 #2 #3 #4
#2499 Isoflop scaling laws (1.8e19 FLOPs budget)	@Helw150	v5p-8 (4 chips)	1.8e19	6.4-9.8h per run	44.0-48.9%	best loss=2.949 (d2176-L22), 14 model sizes from 160M to 2.5B	completed #1
#2499 Isoflop scaling laws (1e19 FLOPs budget)	@Helw150	v5p-8 (4 chips)	1e19	4.1-6.0h per run	40.4-45.0%	best loss=3.031, 11 model sizes	completed #1
Data mixture sweep: two-phase StarCoder v4	@Calvin-Xu	v5p-8 (4 chips)	1e18	1.2-2.3h per run	35.7-36.9%	best loss=2.288, 35 runs, 768-dim 10-layer Qwen3 arch, seq_len=2048	completed #1

Other Notable Work

Training Runs This Week

Keyboard shortcuts