The preregistered 1e23 MoE run launched Friday on v4-512 — the arc the project has been building toward since the 1e18–1e19 isoflop sweeps in early March and the 1e21 and 1e22 runs in the weeks since (the milestone is literally named for it). The v16 isoflop sweep that sized the run predicted Paloma c4_en/bpb 2.598 at 1e21 and the actual came in at 2.599; the macro-loss prediction at 1e23 is ~2.25, with the run training a 131B-total / ~16B-active model on ~1.2T tokens of Nemotron mix. The 1e22 v7 run that preceded it closed out with a clear capacity-factor finding (cf=4.0 beats cf=1.0 by 0.027 macro loss), and the actual 2.432 came in above the earlier 2.389 prediction — @ClassicLarry traced this to a too-optimistic irreducible-loss asymptote in the prior three-point fit, a calibration that fed directly into the 1e23 prediction.
The Ray-to-Iris migration reached its endgame: classification inference, FastText training, and Levanter’s Ray cache_dataset entrypoint were all deleted, capping the multi-month Iris architecture push that ran through SQLite and JWT (week of March 9), DuckDB and multi-host GPU (March 16), and the controller-performance work that cut lock hold time from 80ms to under 5ms (March 23). Genuinely new this week: the marin-* library packages began publishing nightly to PyPI, the canonical datakit pipeline gained a daily end-to-end smoke ferry (now visible in the cluster-stability strip above), and SWE-ZERO synthetic data crossed 55% of its preregistered 1B-token target after the execution-free MVP and multi-language scaling experiments both closed. Nightshift’s automated scout pipeline began reliably delivering findings after a git worktree race fix that had been silently killing most jobs.
The migration’s endgame was not a clean cutover. The central2 Ray cluster took three manual restarts between Apr 6 and Apr 8 (SIGABRT in the GCS server, resource starvation), and @yonromai’s mid-week cap on remaining Ray workers in #4604 ran headlong into a paper-deadline SFT branch that depended on the old pools — a conversation between @Helw150, @yonromai, and Tony that sharpened why Iris’s fair-share scheduling is worth the cost of the forced deprecation. Alongside the migration, @rjpower surfaced a ~$60k/month GCS bill driven mostly by checkpoint storage: a one-time purge against a 3-day soft-delete window ran Apr 8, default training paths now keep only final checkpoints permanently, and a block-shuffle recommendation went out to cut class-B operation costs on hyperparameter sweeps. Outside the core team, four new members introduced themselves — framed in the Community Pulse below — alongside substantive GitHub contributions from @eric-czech, @RohithKuditipudi, @chloechiaw, @taivu1998, @redagavin, and @nevillelyh.
Summary: Measurable: Bolinas can `import marin` and use it as a library.
This week saw a concentrated push to make Marin's library packages genuinely usable as standalone dependencies. @rjpower introduced a MarinTokenizer Protocol in #4405 that decouples all tokenizer usage from HuggingFace's transformers package — the new abstraction uses the Rust-backed tokenizers library directly, which eliminates the torch-at-import problem and enables process-isolated encode_batch calls to contain memory leaks. Every runtime tokenizer path (data processing, RL training, eval, inference, visualization) was migrated in that PR. Follow-on work in #4451 added as_hf_tokenizer() to the Protocol so that padding, save_pretrained, and LoRA export can bridge back to HF APIs when needed, and fixed kitoken resolution in levanter's pyproject.toml. A residual tokenizer consistency issue tracked in #1753 was closed as resolved by the unified pipeline, though #4678 documents a remaining gap in eval_harness.py where HF and Marin tokenizer APIs are still mixed.
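For readers who haven't opened the PRs, the shape of the abstraction is roughly the following — a minimal sketch assuming the Protocol exposes little more than encode_batch and as_hf_tokenizer (the two methods named in #4405/#4451); the repo's actual interface is richer:

```python
from typing import Protocol, Sequence

from tokenizers import Tokenizer  # Rust-backed, no torch pulled in at import time


class MarinTokenizer(Protocol):
    """Sketch of the tokenizer Protocol; only these two methods are named in the PRs."""

    def encode_batch(self, texts: Sequence[str]) -> list[list[int]]: ...

    def as_hf_tokenizer(self):
        """Bridge to a transformers tokenizer for padding, save_pretrained, and LoRA export."""
        ...


class RustBackedTokenizer:
    """Illustrative implementation satisfying the Protocol via the tokenizers library."""

    def __init__(self, tokenizer_json: str):
        self._tok = Tokenizer.from_file(tokenizer_json)

    def encode_batch(self, texts: Sequence[str]) -> list[list[int]]:
        return [enc.ids for enc in self._tok.encode_batch(list(texts))]

    def as_hf_tokenizer(self):
        # Only import transformers when the HF bridge is actually needed.
        from transformers import PreTrainedTokenizerFast

        return PreTrainedTokenizerFast(tokenizer_object=self._tok)
```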
Cleaning up the library/experiment boundary was a parallel thread. #4541 removed all lib/ → experiments/ import paths and deleted the obsolete speedrun system, completing the work tracked by #4469. A subtle Python 3.11+ regression was fixed by @yonromai in #4534: get_caller_path() was returning <frozen runpy> as the experiment name when scripts were launched via python -m, causing executor metadata to be written to paths like gs://…/<frozen runpy>-ee7bce.json. The fix walks the stack past frozen frames to find the actual caller file.
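A minimal sketch of the frozen-frame walk (not the repo's exact code):

```python
import inspect


def get_caller_path() -> str:
    # Sketch of the #4534 fix: scripts launched via `python -m` put "<frozen runpy>"
    # frames at the top of the stack, so skip frozen frames (and this module's own
    # frame) and return the first real caller file instead of writing executor
    # metadata under a "<frozen runpy>-*.json" path.
    for frame_info in inspect.stack()[1:]:
        filename = frame_info.filename
        if filename == __file__ or filename.startswith("<frozen"):
            continue
        return filename
    return "<unknown>"
```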
Config discovery and packaging received significant hardening. #4546 added a --cluster flag to the Iris CLI that resolves cluster configs by name (searching infra/, installed package resources, and ~/.config/marin/clusters/) so external repos no longer need the monorepo checkout on disk. #4607 went further and bundled cluster YAML configs inside the marin-iris wheel itself, so --cluster=marin works for any downstream consumer that installs the wheel. The tokenizer mirror path scheme was also cleaned up in #4555, switching from a flattened org--model key to versioned slash-separated paths (mirror://tokenizers/{org}/{model}/hf-hub-{version}/) so library upgrades force a fresh fetch rather than silently reusing stale cached files. A missing chex dependency that broke wheel-only installs of levanter was fixed in #4608.
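A sketch of the lookup order described above, with the package and directory names treated as assumptions rather than the CLI's actual identifiers:

```python
from importlib import resources
from pathlib import Path


def resolve_cluster_config(name: str) -> str:
    # Lookup order per #4546/#4607: a monorepo checkout's infra/, then configs
    # bundled inside the installed wheel, then the user's config directory.
    repo_path = Path("infra") / f"{name}.yaml"
    if repo_path.exists():
        return repo_path.read_text()
    bundled = resources.files("iris").joinpath(f"configs/{name}.yaml")  # wheel-bundled
    if bundled.is_file():
        return bundled.read_text()
    user_path = Path.home() / ".config" / "marin" / "clusters" / f"{name}.yaml"
    if user_path.exists():
        return user_path.read_text()
    raise FileNotFoundError(f"no cluster config named {name!r}")
```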
The packaging layer was put on a formal footing by #4609, which renamed the six internal library packages from their bare names to a marin-* prefix on PyPI (marin-fray, marin-rigging, marin-iris, marin-zephyr, marin-haliax, marin-levanter), leaving import names unchanged. Building on that, #4612 wired up CI to publish all seven marin-* wheels nightly, on tagged releases, and to a local vendor directory for fast local iteration. Nightly builds are tagged marin-<pkg>-YYYYMMDD with a rolling marin-<pkg>-latest alias; a coherent semver release can be cut by pushing a marin-libs-vX.Y.Z tag. The rename shipped with a small amount of user-facing friction: Russell flagged in #infra on Apr 10 that older uv clients could report iris as missing after pulling main, with uv sync --all-packages --reinstall-package marin-iris as the workaround. The long-standing tracking issue #2442 for Marin-as-a-library was closed, and a demo repo #4472 remains open as the final proof-of-concept step.
Summary: Tracking issue for April MFU work. Tasks/goals:
The week's most concrete GPU MoE throughput advance was the landing of #4297 by @chloechiaw, which adds a Triton kernel for the ragged_dot grouped matmul at the heart of Grug MoE GPU compute. The kernel, adapted from tokamax, runs the forward pass in Triton while falling back to XLA ragged_dot_general for the backward pass via custom_vjp (a limitation of JAX 0.8's lack of autodiff through pallas_call). Kernel-level forward benchmarks on a single H100 showed 5.2× speedup on uniform traffic (5.78 ms vs 29.98 ms) and 2.8× on skewed loads; at the 256M-parameter model level with 8 experts over 100 steps the improvement was a more modest ~20% in steps/sec. #4427, run by @yonromai and sealed the same week, then quantified the residual gap: with PR #4297 as the new baseline the H100×8 forward throughput sits at 26.12M tok/s versus Megatron's historical 33.09M tok/s anchor — a remaining 21.07% gap — and profiling shifts the blame from the old w13_ragged_dot compute bottleneck toward communication, synchronization, and overlap (roughly 29% compute vs 35% communication vs 36% host overhead in a fresh exact-cap Triton trace). The Megatron anchor itself is about to be re-measured: on April 12 @chloechiaw volunteered in #moe to run the Grug-vs-Megatron head-to-head called for by sub-issue #4311, confirming that grug_moe is the intended comparison point rather than one of the other Marin MoE variants.
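The forward/backward split follows JAX's standard custom_vjp pattern. A minimal sketch, using a dense einsum reference in place of both the Triton kernel and the XLA ragged_dot_general fallback so it stays self-contained and differentiable — the wiring, not the kernel, is the point:

```python
import jax
import jax.numpy as jnp
import numpy as np


def ragged_dot_reference(lhs, rhs, group_sizes):
    # Dense, differentiable stand-in for the grouped matmul: row i of lhs is
    # multiplied by rhs[g], where g is i's (contiguous) expert group. Wasteful
    # O(m*g*k*n) compute — exactly what the grouped Triton kernel avoids.
    group_ids = jnp.repeat(
        jnp.arange(rhs.shape[0]), group_sizes, total_repeat_length=lhs.shape[0]
    )
    per_expert = jnp.einsum("mk,gkn->mgn", lhs, rhs)
    return jnp.take_along_axis(per_expert, group_ids[:, None, None], axis=1)[:, 0, :]


@jax.custom_vjp
def ragged_dot(lhs, rhs, group_sizes):
    # The PR runs this forward through a Triton kernel; the reference stands in here.
    return ragged_dot_reference(lhs, rhs, group_sizes)


def _fwd(lhs, rhs, group_sizes):
    return ragged_dot(lhs, rhs, group_sizes), (lhs, rhs, group_sizes)


def _bwd(residuals, cotangent):
    lhs, rhs, group_sizes = residuals
    # JAX 0.8 cannot autodiff through pallas_call, so gradients are taken through
    # the XLA path rather than the Triton kernel.
    _, vjp_fn = jax.vjp(lambda l, r: ragged_dot_reference(l, r, group_sizes), lhs, rhs)
    d_lhs, d_rhs = vjp_fn(cotangent)
    # group_sizes is integer-valued, so its cotangent has dtype float0.
    return d_lhs, d_rhs, np.zeros(group_sizes.shape, dtype=jax.dtypes.float0)


ragged_dot.defvjp(_fwd, _bwd)
```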
A companion experiment, #4406 by @yonromai, tested whether the expert-padded w13 lowering from PR #3821 could stack on top of the new Triton path. The answer was a clear negative: across three seeds the combined approach was 3.28% slower than plain Triton, even though the same padded lowering remains a meaningful win on the XLA-only path. This rules out that direction as a follow-up for the current EP=8 configuration. Separately, #4359 by @dlwh fixed a correctness issue in ragged expert parallelism: receive buffers were previously sized for worst-case traffic rather than the configured capacity factor, and this PR clips receiver group sizes before the ragged all-to-all and preserves kept-token ordering on the return path.
On the TPU side, #4455 by @yonromai investigated JAX 0.8.0 vs 0.9.2 compatibility for Grug MoE on TPU. The initial smoke run on JAX 0.9.2 failed immediately due to stricter shard_map sharding validation — the expert weight arrays in grug_moe.py and lm_head in loss.py had mismatched in-specs. After explicit jax.sharding.reshard fixes, training runs cleanly on both stacks, and the final steady-state benchmark (7 steps, 2 warmup) shows JAX 0.9.2 slightly ahead of 0.8.0 on both v5p-8 and v4-8, clearing the path for a future JAX upgrade. The result landed in a broader context: @yonromai opened a coordination thread in #infra with @ahmeda14960 to scope a full codebase migration to JAX 0.9.2 — noting that the tpu-dep-hell branch had so far been exercised mainly for vLLM inference, not training — and posted the research/grug-moe-jax-regression branch with the fixes needed to reproduce the benchmark, which @ahmeda14960's agent then cross-filed as #4506 tracking the broader Levanter, chex, flax, and datasets ripple effects. @pranshu28 framed the acceptance criterion from the training side: “if we maintain similar MFU on grug MoE on the new jax that should be fine for training folks” — a reminder that an earlier Mixtral 8x7b regression in Levanter is the reason this gate exists.
Meanwhile, #4636 by @ClassicLarry (open at week's end) ports the MoeAdamHHeuristic from the moe_isoflop_apr_2026 branch, removes the initial-dense-layer path from Grug MoE to match the isoflop architecture, and fixes a jnp.repeat crash in align_kv_heads under abstract-mesh training — part of the preparation for the April isoflop scaling runs. Larry pitched the PR in #moe as bringing main back in sync and pointed anyone (“or anyone's agents”) to the recipe README as a “set of metrics to climb”, followed by a “moe looks 🔥” from @dlwh on return from a week off — informal validation that the combined picture (Triton kernel + isoflop heuristic + clean JAX 0.9.2 path + Iris fully operational) has the epic moving on all three fronts at once.
Summary: Split from #4266.
The Nightshift automated experimentation system saw key infrastructure fixes this week, enabling its scout agents to reliably produce results for the first time. PR #4581 from @rjpower diagnosed two bugs that had been silently killing most scouts each night: git worktree add calls were racing on repo metadata when dispatched in parallel (causing exit-255 failures for two of four scouts), and the surviving scouts produced no output because the runner was passing a nonexistent --cwd flag to the Claude CLI instead of the correct cwd= argument to subprocess.run. With worktree creation now sequentialized before parallel dispatch, scouts began delivering findings consistently.
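A sketch of the repaired dispatch shape, with the Scout record and the Claude CLI invocation as illustrative placeholders rather than the runner's actual code:

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Scout:
    branch: str
    worktree_dir: str
    prompt: str


def dispatch_scouts(scouts: list[Scout]) -> None:
    # Sequentialize `git worktree add` before dispatch: concurrent calls were
    # racing on repo metadata and exiting 255 for two of the four nightly scouts.
    for s in scouts:
        subprocess.run(["git", "worktree", "add", s.worktree_dir, s.branch], check=True)
    # Pass the working directory through subprocess.run's cwd= argument; the CLI
    # has no --cwd flag, which is why the surviving scouts produced no output.
    for s in scouts:
        subprocess.run(["claude", "-p", s.prompt], cwd=s.worktree_dir, check=True)
```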
The repaired pipeline immediately bore fruit. PR #4586 (April 9) combined four scout findings across subprojects: iris’s _job_state_name() and _task_state_name() helpers were deduplicated into canonical functions in rpc/proto_utils.py; levanter’s near-identical parquet row-group seeking blocks (one even carrying a TODO: fix this duplication comment) were unified into a shared _iter_parquet_from_row helper for a net −25 lines; three dead utility files in marin were removed; and a latent bug in zephyr’s load_parquet filter-column detection was fixed, replacing fragile substring matching on PyArrow expression strings with a proper AST walker that extracts field names directly.
PR #4621 (April 10) continued the streak with another round of automated housekeeping: three dead functions and their unused psutil/subprocess imports were removed from marin’s evaluation/utils.py; iris’s ControllerTransitions pruning logic was refactored into shared helpers (_prune_per_worker_history and _batch_delete) across two previously copy-pasted methods; four dead exports and legacy-compat methods were cleared from zephyr’s execution.py; and levanter shed 165 lines across six files, including deletion of a backward-compatibility shim (models/rotary.py) with zero callers.
Discord made clear that agent-driven work has quietly become the default mode of operating on the codebase, not just the nightly scout loop. Russell routinely delegated triage to Claude mid-conversation — pasting its analysis of memray’s spurious “Failed to compress input file” errors to show they weren’t causing task failures, offloading the BATCH-priority request recipe because he couldn’t remember it but “Claude will”, and responding to an OOM-looping tokenization job by telling willheld “ill file an issue and see if our friend Claude does an okay job” — which became #4575 minutes later, followed by #4577 and #4578. Romain took a similar path on the JAX 0.9.2 upgrade, planning to “just dipatch a codex” from Pranshu’s tpu-dep-hell branch to check for training regressions rather than doing the benchmark sweep by hand.
The pattern extended beyond the core infra team. rohithck filed #4494 and noted it was his “first time using an agent to file an issue; I am feeling the agi”; Ahmed had Codex file #4495 for a TPU-in-use error in the RL migration, and later credited “claude / codex” with making a stale-GCS-cache vllm bug palatable to hunt down. Willheld flagged that “Claude has its own CronCreate tool now”, retiring the sleep 570 loops that had been scaffolding scheduled agent runs. Eric raised the harder open question — how people are organizing agent-executed jobs rather than just the babysit/recover skills — suggesting the next frontier for this epic is moving agents from diagnosis and cleanup into actually launching and steering experiments, a direction Larry echoed when he opened the MoE recipe leaderboard to “anyone, or anyone’s agents”.
Summary: Split from #4266.
The preregistered 1e22 MoE run tracked in #3800 completed this week, training a 34.6B-total / 4.7B-active parameter model on 326B tokens of Nemotron mix at 1e22 non-embedding FLOPs. The run reached a paloma/macro_loss of 2.432 at capacity factor 4.0, substantially outperforming the 1e22 dense baseline, though slightly above the earlier 2.3887 prediction — @ClassicLarry attributed the gap primarily to an unrealistically low irreducible-loss asymptote in the three-point fit used to make that prediction, and notes that the 1e20 anchor point was not from a proper isoflop curve. A checkpoint-resume crash tied to trainable weights inside tuple[Block, ...] structures temporarily blocked the run; a workaround was found by loading from the latest checkpoint into a new GCS folder and W&B log, and a fix via #4458 resolved the immediate blocker. Capacity factor comparisons at step 77724 showed that higher capacity factors consistently improve macro loss (cf=4.0 giving 2.432 vs cf=1.0 giving 2.459), with the uncheatable eval improving by 3.1% from cf=1.0 to cf=4.0.
To size the upcoming 1e23 MoE run, @ClassicLarry completed a v16 isoflop sweep under the new learning-rate recipe #4447, running GQA 4:1 MoE models with 64 experts and k=4 across budgets from 1e18 to 1e21. The 1e21 prediction of 2.598 matched the actual d2560 result of 2.599, validating the near-term fit. Because the isoflop curves are flat and the model-size optimum ambiguous, a middle-ground sizing was chosen for 1e23: 48 layers, d5120 (40 heads), ~131B total / 16B active parameters, and ~1.02T tokens. Leave-future-out cross-validation of the scaling law showed that fixing the irreducible loss at L∞ = 1.6 yields stable predictions for 1e23 — estimated at roughly 2.25 Paloma macro loss — as long as empirical data exists at 1e18–3e19. A companion beta2 sweep #4567 confirmed that the existing clip(base^(B/32), 0.95, 0.9999) formula undershoots the optimal beta2 at larger batch sizes: at d=1024, the best beta2 shifted from 0.995 at bs=64–128 down to 0.97 at bs=256 and 0.95 at bs=512. The recipe will be updated to use a base of 0.998 instead of 0.999 — a near-free win at small batch sizes and directionally correct at large ones, where all large-scale runs will hit the 0.95 floor regardless.
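In code, the heuristic and the proposed base change look like this (a direct transcription of the formula above; the printed values illustrate why large-batch runs bottom out at the 0.95 floor):

```python
import numpy as np


def beta2_for_batch(batch_size: int, base: float = 0.998, lo: float = 0.95, hi: float = 0.9999) -> float:
    # Recipe heuristic: beta2 = clip(base ** (batch_size / 32), lo, hi).
    # The #4567 sweep motivates base=0.998 instead of 0.999; large-scale runs
    # still hit the 0.95 floor either way.
    return float(np.clip(base ** (batch_size / 32), lo, hi))


for bs in (64, 128, 256, 512, 2048):
    print(bs, round(beta2_for_batch(bs), 4))
# 64 -> 0.996, 128 -> 0.992, 256 -> 0.9841, 512 -> 0.9685, 2048 -> 0.95 (floor)
```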
A critical-batch-size sweep, #4432, mapped how eval BPB degrades as batch size grows at isoflop-optimal budgets from d512 through d1280. The result is that there is no sharp cliff: smaller batches are consistently a bit better, but degradation stays mild until roughly bs=64–128 at d512, bs=128–256 at d768 and d1024, and approximately bs=512 at d1280. A power-law fit (CBS = 1.87e-05 · C^0.388) projects the critical batch size to ~6,361 at 1e22 and ~15,536 at 1e23, indicating that the 4M-token batch used in the d3200 run was comfortably below the harmful regime. The practical upshot is that future optimizer tuning of beta1/beta2 scaling is the higher-leverage follow-up rather than refining the CBS fit. Separately, #4569 was opened to investigate the embedding norm growth seen in the 1e22 run — the embed (trained with AdamW rather than AdamH) grew from norm 176 at initialization to 3700 over the course of the run — and is exploring approaches including weight decay, cautious weight decay, per-token normalization, and token-level LR modulation as a function of recent gradient magnitude.
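Plugging numbers into the #4432 fit directly (the quoted ~6,361 and ~15,536 come from the sweep's higher-precision coefficients; the rounded constants here land slightly higher):

```python
def critical_batch_size(compute_flops: float, coef: float = 1.87e-5, exponent: float = 0.388) -> float:
    # Power-law fit from #4432: CBS = coef * C ** exponent, in sequences.
    return coef * compute_flops**exponent


print(round(critical_batch_size(1e22)))  # ~6.4e3 sequences at 1e22
print(round(critical_batch_size(1e23)))  # ~1.6e4 sequences at 1e23
```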
On the architecture front, latent-only MoE loss quality was assessed as part of the Great 10T gate in #4032. A clean 1e19 A/B across d768, d1024, and d1536 showed that the answer is width-dependent: latent MoE is a clear no at d768 (2.9% BPB regression with no throughput benefit), a plausible tradeoff at d1024 (1.7% regression, throughput gain), and the strongest current candidate at d1536 (approximately 0.8% regression, much better throughput). @pc0618 pushed in #moe for moving to larger scales given the shrinking gap, arguing the BPB gap should further close at the 100s:1 token-to-parameter ratios used in full hero runs rather than compute-optimal conditions; @ClassicLarry pushed back that the d1536 point is heavily undertrained and a 2% BPB hit at d1024 could cost ~15% of training time to make up, while still allowing that "most good ideas look bad until we get the execution right." The call is to keep latent MoE on the bench rather than promote it yet.
Isoflop infrastructure landed on main in #4636, which ports MoeAdamHHeuristic from the moe_isoflop_apr_2026 branch into experiments/grug/moe, drops the initial-dense-layer path from the MoE model to match the isoflop architecture, defaults EP capacity factor back to 1.0, and rewires launch.py to derive (model, optimizer, batch, steps) from a compute budget + hidden_dim. @ClassicLarry announced the PR as a "metrics to climb" leaderboard for anyone — or anyone's agents — wanting to improve the MoE recipe, with @dlwh noting that the combined week's progress — Iris fully operational alongside the MoE work — made for an unusually strong landing. The beta1=0.9062 literal in the new heuristic raised an eyebrow; @ClassicLarry explained it came from @Helw150's Vizier sweep and the decimals don't matter, prompting @Helw150's joke in a spin-off thread that his "Vizier" is a local wizard he consults rather than an actual sweep.
Forward-looking coordination started firming up in adjacent channels. On the MFU side of the April goal #4283, @chloew7 offered to run grug vs Megatron on an 8×H100 node for #4311, derisking the H100 leg of the 1e23 plan before committing. @rjpower's GrugMoE fixes on the research/grug-moe-jax-regression branch (see #4455) benchmarked JAX 0.9.2 against 0.8.0 and are being landed into main; @pc0618 signed off that maintaining MFU on grug MoE under the new JAX is the bar training folks care about. And in #midtraining, @Helw150 sketched that midtraining the 1e21 and 1e22 isoflop checkpoints is "very doable over a weekend when preemptible capacity is high," teeing up the natural next step once the isoflop sweep is complete. Infrastructure improvements for the Vizier reference sweep landed in #4563 from @eric-czech, making sweep jobs preemptible and preventing divergent (NaN loss) trials from crashing the entire sweep by marking them infeasible in Vizier instead.
Summary: Lower priority / slack-time workstream covering workqueue, dev-tpu replacement, and observability.
The week saw targeted improvements to Iris's operator experience, closing several gaps that surface in day-to-day use. @rjpower added a guard in #4585 that rejects --tpu, --gpu, and oversized memory/disk requests on the entrypoint job unless --enable-extra-resources is also passed — a common mistake where users assume the coordinator needs accelerators when it's only dispatching to worker tasks. The error message now explains the coordinator pattern explicitly. Separately, @rjpower landed #4597, a script that detects cross-region GCS reads for Iris jobs, giving operators a concrete diagnostic tool for latency and cost anomalies.
@ravwojdyla-agent added an iris job summary <job_id> command in #4592, surfacing per-task state, exit code, duration, peak memory, and current memory — data the controller already collected but had no CLI exposure. The same peak-memory column was added to the job detail dashboard page. The feature was designed with two concrete use cases in mind: agents babysitting long runs can fetch a structured summary at completion rather than scraping logs, and OOM postmortems can immediately see which shards hit cgroup limits. The --json flag makes the output machine-readable for scripted workflows.
On the dashboard side, @Helw150 opened #4660 adding dark mode, a clearer autoscaling overview, and revised colors and fonts — a follow-up to #4647, which called out that slice-state badges, colored dots, and progress-bar segments had no in-UI legend and no visual distinction between idle-ready and occupied-ready slices. The dashboard polish landed against a backdrop of genuine user enthusiasm: rohithck called the Iris dashboard “truly a breath of fresh air coming from ray”, and @dlwh's broader push to move workloads off Ray announced in #general drove first-time Iris users into the cluster all week. The load also surfaced the dashboard's current scaling limits — Michael Ryan noted that his large-resource job was “slowing down the iris dashboard and testing the limits of the auto-scaler,” and yurusankyo reported the controller being unresponsive on April 9th, prompting @rjpower to restart it.
Several smaller UX papercuts were identified and in some cases fixed in-week. Eric Czech hit a job stuck in a silent crash loop with max_task_failures > 1 left over from Ray defaults; @rjpower acknowledged that “it's confusing for users if the job takes 10x longer to report failure,” shipped the fix in #4615, and noted the job page itself needs better crash-loop surfacing. Ahmed M Ahmed raised a co-scheduling gap in #infra: RL jobs frequently have the trainer spin up while the worker sits unscheduled (or vice versa), holding compute that could go to someone else — @rjpower suggested combined reservations as a workaround and invited an issue for proper priority-boost support. The week also produced a set of new scoping issues for the longer-term documentation gap: #4463 audits stale Ray-centric docs, #4464 calls for a proper Iris getting-started guide, #4465 tracks removing Ray references from docs/ entirely, #4466 targets making Iris docs agent-parseable with structured CLI and SDK references, and #4467, now closed, proposed simplifying the --cluster default so users in the marin repo can omit the flag entirely.
Summary: Define canonical data pipelines for all data ingestion: download -> normalize -> dedup/quality -> tokenize.
The canonical data pipeline took a major step toward operational maturity this week with the introduction of a daily smoke ferry by @ravwojdyla that runs the full datakit stack — download → normalize → fuzzy dedup → consolidate → tokenize — end-to-end on a fineweb-edu 10BT sample every morning via GitHub Actions. The ferry uses a temporary GCS bucket with a 1-day TTL so per-run outputs auto-expire, provides per-run isolation, and is wired to Slack alerting on scheduled failure; a first live run completed in 1h10m on the production cluster and was used to tune worker resource provisioning. Alongside this, a normalize step was merged, establishing a general building block that converts raw downloads into standard Parquet with deterministic xxh3_128 IDs, exact content dedup within a shard, and configurable target partition sizes — the missing formal stage between download and tokenize described in the epic's sub-issues #4483 and #4484.
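A minimal sketch of that normalize contract — deterministic xxh3_128 IDs plus in-shard exact dedup before writing Parquet — with field names as assumptions rather than the stage's actual schema:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import xxhash


def normalize_shard(records: list[dict], out_path: str) -> None:
    # Assign each document a deterministic xxh3_128 id from its text, drop exact
    # duplicates within the shard, and write standard Parquet.
    seen: set[str] = set()
    ids, texts = [], []
    for rec in records:
        doc_id = xxhash.xxh3_128_hexdigest(rec["text"].encode("utf-8"))
        if doc_id in seen:  # exact content dedup within the shard
            continue
        seen.add(doc_id)
        ids.append(doc_id)
        texts.append(rec["text"])
    pq.write_table(pa.table({"id": ids, "text": texts}), out_path)
```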
A focused vortex-to-parquet migration landed across two PRs. #4596 switched fuzzy document dedup and connected-components intermediates from vortex to parquet, and #4610 completed the migration by also converting the final outputs of exact-paragraph, exact-document, and fuzzy-document dedup. As @ravwojdyla noted, early experience with vortex in this pipeline was not smooth, and parquet is the simpler path forward. An open PR from @rjpower, #4658, further tightens the tokenization step by replacing per-file stat RPCs with a single bulk glob(detail=True) call, eliminating the dedicated tokenize-filescan Zephyr job that previously launched 32 distributed workers just to retrieve file sizes — on a 2,755-file nemotron dataset the new approach takes ~2 seconds.
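The bulk-listing idea in #4658 amounts to one fsspec call instead of one stat RPC per file; a sketch with a placeholder path:

```python
import fsspec

fs = fsspec.filesystem("gs")
# One bulk listing instead of per-file stat RPCs; with detail=True, glob returns
# {path: info-dict}, so file sizes come back from a single call. The bucket and
# prefix here are placeholders, not the actual dataset path.
infos = fs.glob("gs://some-bucket/nemotron/**/*.parquet", detail=True)
sizes = {path: info["size"] for path, info in infos.items()}
```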
The pipeline's rough edges surfaced loudly on Thursday when @Helw150's common_corpus_english tokenization job stalled at 4999/5000 shards with 3268 workers still alive; @rjpower traced it to a single bad parquet shard OOM-looping on pathologically malformed content, and noted that Zephyr "only discover[s] the succeeded shards during the work phase," so a restart of a mostly-done job still fans out thousands of workers that finish almost immediately. Those workers then pin the CPU resources on v5p slices and prevent Iris from freeing the machine for training jobs asking for smaller slices, which blocked other users for hours. The incident generated three issues in quick succession — #4575, #4577, #4578 — and became the motivating example for #4600, which ports the native llama whitespace protection and complements #4603's max_whitespace_run_chars guard on the normalize side. #4588 documents the underlying class of failure — multi-megabyte whitespace runs from broken HTML extraction that trigger OOMs and can leave latent bad data in training corpora — and #4503 continues to track the _consolidate_metadata OOM on large datasets where the coordinator materializes all shard offset arrays at once.
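A sketch of what a normalize-side whitespace guard of the #4603 flavor can look like — the threshold and function name are illustrative, not the PR's values:

```python
import re

MAX_WHITESPACE_RUN_CHARS = 1024  # illustrative threshold, not the value used in #4603


def cap_whitespace_runs(text: str, limit: int = MAX_WHITESPACE_RUN_CHARS) -> str:
    # Guard against the failure class in #4588: multi-megabyte whitespace runs
    # from broken HTML extraction get truncated to at most `limit` characters
    # instead of OOM-looping the tokenizer worker.
    return re.sub(rf"\s{{{limit},}}", lambda m: m.group(0)[:limit], text)
```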
Beneath the operational pain is a design conversation that @ravwojdyla opened with a new contributor, Neville, about the tokenize producer itself. In a long pointers thread, rav framed the core issue: "the producer code/pipeline is overly complicated and slow, especially the levanter store consolidation. I'm not even entirely sure whether that consolidation is needed." The ensuing discussion with Neville and @dlwh/@Helw150 worked through TensorStore chunk semantics, why producer-side document ordering can't be aligned with consumer chunk boundaries (sequence lengths expand from 4K up to 128K during training), and the pretraining-vs-SFT split in read patterns — groundwork for a potential rewrite of the producer once Iris-native batch execution is fully online. A separate @rohithck-filed issue in code-talk about actor-name clashes on tokenize retries points at the same family of problems from the other side.
New datasets continued to flow into the pipeline. @Helw150 added the PleIAs/common_corpus English open-access subset — filtering to Open Science, Open Government, and Open Culture documents — and NSF grant abstracts (~170M tokens of public-domain scientific text). @ravwojdyla landed full download → normalize → tokenize support for all six starcoder2data-extras subsets (IR for C++, Python, Rust, low-resource languages, documentation, and Kaggle), also exposing levanter_batch_size through the tokenization write pipeline to control memory pressure for large-document datasets. Integration test coverage for the pipeline is expanding incrementally: PRs #4492 and #4493 add document-level dedup and consolidate edge-case tests respectively, working toward the sub-issues tracking end-to-end test coverage under the epic.
Summary: Measurable: canary ferry pass rate consistently above 90%.
This week’s work on canary reliability focused on hunting down the concrete failure modes degrading ferry pass rates. @rjpower fixed a cluster of CI failures in PR #4654: the Claude triage agent was hitting a 50-turn budget wall and getting killed mid-run, so the limit was raised to 500 turns; the CoreWeave canary was missing the controller extra needed for CloudK8sService; the dev-restart cron was colliding with the TPU canary ferry at 06:00 UTC and was staggered to 05:00; and SSH tunnel establishment was made resilient to transient connection resets and refusals via retry_with_backoff.
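The retry shape is the usual exponential-backoff-with-jitter pattern; a generic sketch (the repo's retry_with_backoff helper may differ in signature):

```python
import random
import time


def retry_with_backoff(fn, *, attempts: int = 5, base_delay: float = 1.0,
                       retry_on: tuple = (ConnectionResetError, ConnectionRefusedError, OSError)):
    # Retry transient connection resets/refusals with exponential backoff plus
    # jitter, re-raising on the final attempt.
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt + random.random())
```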
On the datakit smoke ferry, @ravwojdyla-agent fixed two separate failure paths. PR #4616 addressed an OOM in the consolidate step — the worker resource allocation was hardcoded at defaults sized for vortex rather than parquet, and was bumped to 8 GB via a new worker_resources field on ConsolidateConfig; fuzzy dedup iterations were also capped at 3 (down from 10) to keep runtime bounded. The same PR replaced the ad-hoc validation script with a data-driven checker that enforces exact file counts, schema, and row-count invariants across the full pipeline chain (download → normalize → dedup → consolidate → tokens), verified against a real end-to-end ferry run.
A subtler validation failure was fixed in PR #4627: validate_ferry_outputs.py was resolving the temp bucket using the GHA runner’s environment (which has no GCP metadata and falls back to file:///tmp/), while the Iris worker had already written outputs to gs://marin-tmp-{region}/ttl=1d. The fix has the ferry write a ferry_run_status.json containing the resolved marin_prefix to a path set by FERRY_STATUS_PATH, which the GHA workflow reads back and passes explicitly to the validation step — eliminating both the fallback and any hardcoded region assumptions. Finally, PR #4624 reduced the canary ferry coordinator’s memory request from 16 GB to 2 GB; at 16 GB it was tripping the --enable-extra-resources guard and breaking both TPU and CoreWeave ferries.
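A sketch of the handshake described above, assuming the status file carries only the resolved prefix:

```python
import json
import os


def write_ferry_status(marin_prefix: str) -> None:
    # The ferry records the prefix it actually resolved (e.g. gs://marin-tmp-{region}/ttl=1d)
    # at the path named by FERRY_STATUS_PATH, so the GHA workflow can read it back and pass
    # it to validate_ferry_outputs.py explicitly instead of re-resolving on a runner that
    # has no GCP metadata.
    with open(os.environ["FERRY_STATUS_PATH"], "w") as f:
        json.dump({"marin_prefix": marin_prefix}, f)


def read_ferry_status() -> str:
    with open(os.environ["FERRY_STATUS_PATH"]) as f:
        return json.load(f)["marin_prefix"]
```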
Discord made clear that the underlying cluster the TPU canary depends on was itself unreliable this week: central2 Ray had to be manually restarted at least three times between Apr 6 and Apr 8 — Eric Czech reported a SIGABRT in the GCS server logs with the head container up but no Ray services running, rohithck restarted a borked scheduler on Apr 7, and Tony restarted again on Apr 8 when nothing was running. Any canary ferry that fired during those windows would have failed for reasons orthogonal to the CI fixes above, which makes interpreting week-over-week pass-rate numbers noisier than the PR stream alone suggests and reinforces the case for the ongoing Iris migration as the real path to sustained 90%+.
Summary: All jobs run through Fray+Iris.
The Ray migration reached its decisive cutover this week, moving from a gradual two-system coexistence into an actively managed wind-down. @yonromai announced in #general that Ray will be fully deprecated in favor of Iris by end of month, and #4453 catalogued the remaining Ray-dependent workloads across 12 library files, spawning a coordinated migration with reach-outs to Rohith for logprob jobs #4640, Ahmed and Kevin for RL helpers #4639, and David for Levanter cache and vLLM infra #4641. @yonromai capped all remaining Ray TPU pools with tiered max_workers limits in #4604 and #4623, dropped min_workers to zero across all migrated clusters, deleted Ray auth/secret targets from the Makefile in #4562, and deleted Marin's classification inference and FastText training stack — the largest remaining direct `import ray` users — in #4642. Working with @Helw150, @yonromai also removed Levanter's Ray-only cache_dataset entrypoint and the vLLM Ray TPU fallback in #4648. @rjpower consolidated integration tests onto Iris in #4601 and removed Ray backend support from test fixtures in #4595. The single-host RL pipeline preregistered in #3959 completed its migration: @ahmeda14960 landed #3960, replacing the Ray-era client-side launcher with an in-cluster Iris coordinator topology that has preemption-safe resume semantics.
The cutover was not friction-free. Ray went down in central2 on Apr 6 — Eric Czech found the container running but the GCS server dead from a SIGABRT and no Ray services alive, and Tony was unable to connect at the same time; Eric restarted the cluster after confirming his dna branch was too stale to move onto Iris yet. Rohith restarted central2 again the next morning after jobs sat waiting on resources for hours, and Tony restarted it again that night. After #4604 reduced Ray capacity, Tony — the last remaining Ray user — found his running jobs killed by the cluster restart and pushed back hard on the timeline in #infra, noting he had been told the end of the month was the Iris migration deadline. @yonromai apologized for the disruption and explained that the pre-#4604 state effectively allowed one user to saturate TPUs meant to be shared, while @Helw150 pointed out that preemption pressure from Google paying customers was consuming ~90% of TRC capacity in east5-a and that Ray's only lever was max_workers, which itself appeared not to be enforced reliably for multi-host slices. The episode was a concrete reminder that Ray's autoscaler has no fair-share notion — hence the push to Iris, which does.
Iris absorbed substantial reliability and performance work. @rjpower replaced all kubectl subprocess calls in CloudK8sService with the Python kubernetes client in #4532, dropping per-call latency from ~1 s to connection-pooled milliseconds, and addressed 2,900+ orphaned pods and 17,800+ stale ConfigMaps accumulating in etcd via periodic GC in #4508. Controller responsiveness was fixed in #4531 by removing an unnecessary DuckDB log fetch on every GetProcessStatus poll, and the ListJobs call that was taking ~3 s due to eager descendant fetching was replaced with a cheap DISTINCT parent_job_id query in #4533. The SQLite controller store had proto BLOB columns replaced with native SQL across migrations 0024–0028 in #4644, and profiling moved to a dedicated profiles.sqlite3 in #4496. @rjpower added historical task resource usage tracking in #4629, split the autoscaler into a structured package in #4572, replaced the simple one-at-a-time worker restart with adaptive batch sizing in #4635, and changed the autoscaler from a hard min_slices floor to an additive buffer_slices warm-pool model in #4544. These landed against live load: after Michael Ryan's large v5e/v5-litepod sweep started flapping the cluster on Apr 8, @rjpower performed a controller restart midday to clear the autoscaler and deployed a scheduling-glitch fix the next morning after users with many tasks reported unresponsiveness. Fray's actor RPC was split into a direct remote() path for short-lived calls and a long-poll submit() path for multi-minute operations in #4500, halving RPC overhead, and cluster.proto was split into three focused files in #4452.
The Iris dashboard received a concentrated round of usability improvements and became newcomer-friendly enough that rohithck called it "a breath of fresh air coming from ray". @rjpower overhauled the scheduler tab, added paginated task lists, and surfaced scheduling failure reasons in #4651. Failed task attempt callouts were added to the job status page in #4614, filter and sort state is now encoded in URL query params in #4633, and the endpoint panel was fixed to show the job name rather than the user in #4671. Job request details (command, env, pip packages, named ports) appear on the job detail page via #4668. The dashboard became accessible without SSH tunnels via a Cloud Run IAP proxy deployed in #4630 by @ravwojdyla-agent. @AlienKevin extended flexible TPU scheduling to the CLI by adding comma-separated variant lists to iris job run --tpu in #4619 and expanded the v4-reserved pool to cover the full size range in #4662. New --priority and --fresh flags were added to iris job run in #4646 and #4655, with @Helw150 and @rjpower coordinating to make BATCH priority the ingredient that lets small CPU jobs stop pinning large TPU slices for other users. A separate in-progress infra status dashboard aggregating CI, build health, Iris reachability, and job state was opened in #4649. @dlwh summarized the week in #moe: "Iris is fully operational… I should take more weeks off." Operational papercuts still showed up: after #4609 renamed the lib packages to a uniform marin-* prefix, @rjpower warned in #infra that updated branches would trip uv with a spurious missing-iris error and supplied the uv sync --all-packages --reinstall-package marin-iris workaround; users hitting google.protobuf.json_format.ParseError on changed proto schemas were pointed to uv run python lib/iris/scripts/generate_protos.py as the manual regen step; and a vestigial Ray-era max_task_failures default that caused Iris jobs to crash-loop 10x before reporting failure was fixed in #4615. Docs lag the blessed path — #4443 called out that ~20 doc files still reference Ray, and the Iris README, OPS.md, and 22 design docs need an accuracy audit before agents and new users can self-serve.
Zephyr received stability and isolation work driven by CoreWeave production issues. @ravwojdyla-agent switched to running each shard task in a fresh Python subprocess in #4522, eliminating the Arrow memory pool, page cache, and leaked file descriptor accumulation that was OOM-killing long-lived workers. OOM-killed subprocesses now surface as MemoryError rather than a generic returncode -9 crash in #4580, and subprocess workers exit via os._exit in a try/finally to dodge PyArrow's GCS shutdown abort in #4576 and #4582. Per-shard failure tracking with a configurable MAX_SHARD_FAILURES abort threshold replaced the previous behavior in #4579, and idle workers on the last pipeline stage now exit immediately in #4583. The value of this hardening was visible in real time: on Apr 9 Will's tokenization job sat at 4999/5000 shards with one shard death-looping on OOM while Zephyr kept discovering succeeded shards during the work phase and spawning thousands of short-lived workers against them, filing follow-ups #4575, #4577, and #4578. On multi-cloud storage, @ravwojdyla fixed the Malformed StorageGeneration TensorStore bug on R2/CoreWeave in #4441, @ahmeda14960 fixed single-shard cache consolidation on R2/S3 in #4436, and the S3 distributed lock race causing SignatureDoesNotMatch errors was fixed in #4440.
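A sketch of the isolation and OOM-surfacing pattern from #4522/#4580/#4576 — names and structure here are illustrative, not Zephyr's actual worker code:

```python
import os
import signal
import subprocess


def run_shard_in_subprocess(cmd: list[str]) -> None:
    # Each shard runs in a fresh interpreter so Arrow memory pools, page cache,
    # and leaked file descriptors die with the process. An OOM kill shows up as
    # the negative signal number; surface it as MemoryError rather than a
    # generic returncode -9 crash.
    proc = subprocess.run(cmd)
    if proc.returncode == -signal.SIGKILL:
        raise MemoryError(f"shard subprocess OOM-killed: {cmd}")
    proc.check_returncode()


def shard_subprocess_main(run_shard) -> None:
    try:
        run_shard()
    finally:
        # Exit without interpreter teardown so PyArrow's GCS shutdown abort
        # cannot fire on the way out.
        os._exit(0)
```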
A parallel thread ran on GCS cost. @rjpower flagged in #infra that Marin's GCS bill is closer to $60k/month than previously thought, most of it in the STANDARD class where Autoclass can't help; a single code-resili-14b-* sweep accounts for ~315 TiB and ~$4,500/mo across 16 hyperparameter variants, with math-14b-resili-* and medical-14b-resili-* adding another ~$2,700/mo. The plan that landed in discussion: enable soft-delete for 3 days, then delete anything not on @dlwh's protect list or the project spreadsheet; @rjpower ran the one-time purge on Apr 8 afternoon. @Helw150 changed the default training path to save only the final permanent checkpoint — intermediate checkpoints now require explicit opt-in — noting that the previous behavior was writing 50 permanent model copies per 50k-step job. @ahmeda14960 proposed a longer-term split into shared storage (HF models, common datasets) and per-user storage tagged at executor-step write time now that Iris makes user attribution tractable. @rjpower also noted that class-B operation costs were being driven by users running full random shuffles during parameter sweeps and asked everyone to move to the new block-shuffle.
The SWE-ZERO data generation pipeline had its most active week yet. #4561, the execution-free agentic rollout MVP, closed this week after @AlienKevin completed the full six-step plan: the final 32K-context run on a v6e-8 worker with TP=4 produced 972/1000 unique rollouts, 486 clean submissions, and 4.52M completion tokens in 38 minutes — 3.16x more clean submissions and 2.46x more usable tokens than the 8K baseline. The dataset is published at AlienKevin/SWE-ZERO-1k-trajectories-32k, with an LLM-judged resolve rate of ~13% pass@1 and ~20% pass@10 on a 10-PR sample. Building on the MVP, #4653 (multi-language scaling, preregistered) also closed: 300 rollouts across all 20 SWE-rebench V2 languages (5 PRs x 3 rollouts per language) reached 31.7% aggregate submission rate and a 3.7% LLM-judged pass@1, with trajectories at AlienKevin/SWE-ZERO-multilang-300-trajectories. The 1B-token full-corpus scale-out #4666 (preregistered) is now 55% complete — 53,826 of 96,237 rollouts (~671M of ~1.2B tokens) across 19 shards. A separate experiment, #4683 (preregistered), validated execution-based evaluation via ConTree: the SDK integration is working end-to-end and will replace the LLM-as-judge with actual test-suite pass/fail verdicts. PR #4611 from @rjpower added a gVisor-sandboxed Python execution tracer for SWE-rebench instances, capturing function calls, returns, and line-level events via sys.monitoring (Python 3.12+) with a sys.settrace fallback. Kevin announced the milestone in #data-curation — a crisp three-point progress report covering the 1000-trajectory MVP, 87.4% unique-bash-command diversity verification, and the mini-coder-1.7b teacher scoring >50% pass@100 on SWE-B — with a next-step pointer to the 100B-token target.
Discord also sharpened how the team is thinking about synthetic data generators more broadly. In #data-curation, @cs2716 flagged the BeyondWeb paper's finding that rephraser size shows diminishing returns past 3B parameters — "effective synthetic data generation doesn't necessarily require massive computational resources" — and noted Percy's own earlier paper used Llama 3.1 8B Instruct as a generator with good results. Percy replied that Marin will likely adopt some of these ideas but wants to adapt them away from the "infinite compute" framing of the source paper ("we don't have infinite compute in Marin 😂"). @elie separately surfaced a long-context / high-quality-document-reuse paper Percy co-authored as a candidate direction. The Kevin-initiated framing from earlier in the week explicitly positioned SWE-ZERO as "a cheap way to scale agentic traces for pre/mid-training" that could complement Code World Model-style rollouts — a bet the team is now visibly making.
On the agentic SFT front, both the 8B and 32B NemotronTerminal reproduction runs are in flight. The 32B run #4307 is at 30.8% of training (step 1761/5,721) on v5p-256, and an intermediate eval at step 1500 showed 17.6% on Terminal-Bench 2 (versus the released model's 27.4%), suggesting the run is tracking well at roughly 65% of target performance at 26% through training. A second trial on v4-512 with tensor parallelism was submitted this week via Ray on the big-run cluster to compare hardware paths. For Marin-8B, #4420 confirmed a baseline of 0% on TB2 before SFT (as expected), and SFT training on the full 366K-example Nemotron-Terminal-Corpus is at 87% completion (step 4962/5,721) on v5p-64; intermediate checkpoints show 1% TBLite at step 1500 and 5% at step 3000. A notable debugging episode uncovered that Levanter's HF checkpoint export writes the wrong eos_token_id (128001 instead of 128009), causing generation to hang — fixing this unblocked the evals. A parallel v4-128 trial for Marin-8B was also launched this week. #4510 added projections for reproducing OpenSWE on Marin 32B: ~10,500 chip-hours on v5p-512 (~1.7 days), roughly 80% of the NemotronTerminal 32B SFT cost. In #sft, @willheld declared "online distillation is actually my #1 desired RL feature for Marin," triggering a threaded exchange with @natolambert disambiguating two open problems — distilling from an open-weights model in a fully open fashion (e.g. a Flash-sized Kimi) versus distilling from a fully open source model (e.g. a Flash-sized OLMo 3 32B) — with willheld noting Tinker currently supports neither since it only targets open-weight finetuning. Separately, in the #sft-agents benchmark-hacking thread, Kevin surfaced his recent SWE-B test-poisoning fix and argued that continuous trace monitoring and benchmark hardening are unavoidable — a position that dovetails with the ConTree execution-based evaluation work in #4683.
Research into soft proxies and predictive scaling continued. #4389 saw negative results on logprob-based proxies: @RohithKuditipudi evaluated top-25 Qwen3-8B SWE-bench fine-tunes using trajectory cross-logprob scoring (both full-sequence loss and success/failure gap), and found the relationship broke down even within a controlled same-base-model family — suggesting fine-tuning itself disrupts logprob comparability. Masking to only bash command tokens inside tool calls did not substantially improve signal. @Helw150 proposed two follow-on directions: adapting Charlie Snell's emergence-prediction method, or shifting to a post-training-controlled experiment design. A cluster of new planning issues — #4547, #4548, #4549, #4550, #4551 — was opened to structure the mid/post-training prediction program: identifying recipe candidates, predicting outcomes from intermediate checkpoints, designing smooth pass@k proxies, and training isoflop/Delphi ladders across mixes. PR #4539 from @RohithKuditipudi added soft proxies for the OT-Agent leaderboard as part of this effort. willheld sharpened the prediction conversation in #midtraining, drawing a firm distinction between predicting the effects of LR annealing (which he believes Marin already has evidence for) and predicting the effects of midtraining with a different data mix (which he does not think is shown); his concrete recommendation was to run midtraining on all isoflop models first, noting that "doing midtraining for like the 1e21 and 1e22 is very doable over a weekend when preemptible capacity is high" — a cheap, well-posed experiment that would directly feed the prediction program.
The RL infrastructure gained new building blocks this week. @taivu1998 opened PR #4661 with a first-pass OpenReward integration — manifest prep, typed Qwen/vLLM tool-calling, a single-turn OpenRewardEnv, and a smoke launcher — alongside PR #4628, which adds telemetry event shards and a provenance substrate for durable RL logging. PR #4620 made KL configuration explicit by replacing bare kl_coef plumbing with a dedicated KLConfig on RLOOLoss, and added k2 loss support alongside the existing k3 path. PR #4524 made the Iris RL config resource-aware for GPU rollout and trainer workers. On the DPO side, @ahmeda14960 opened PR #4637 unifying DPO and LoRA-DPO under a single train_dpo.py entrypoint with config-driven adapter types, fixing an HF export axis-order bug, and achieving a 2x eval speedup via durable reference caching — a cleaned-up successor to PR #4634. @eric-czech opened PR #4677 fixing tokenizer loading for lm-eval on HF checkpoints when a custom Marin tokenizer is specified alongside an HF checkpoint path. PR #4398 from @RohithKuditipudi migrated logprob evals to Fray v2 and added a tracker callback to save eval results to file in addition to W&B.
Summary: We will need 20T of high-quality (including / in particular code) tokens for our large MoE runs in Q2/Q3; this is the work in March that we will do to enable that.
The Nemotron data tokenization pipeline saw a focused round of reliability and observability fixes this week, all from @ravwojdyla. PR #4446 gave each Nemotron split its own Fray job, making individual stages easier to inspect in the dashboard. Memory allocation was addressed in two places: PR #4450 raised worker memory for CommonCrawl download workers to 4 GB (each decompresses ~350 MB zstd files to 1.5–2 GB in memory, causing OOMKills at the previous 1 GB default), and PR #4448 raised the levanter-cache-copy stage to 10 GB. PR #4449 distributed the cache shard-size probing step into a Zephyr pipeline so thousands of S3 connections no longer pile up in the coordinator process, and PR #4454 exposed cache_copy_max_workers as a tunable parameter on tokenize_nemotron.
On the data-mixing research front, issue #2345 received new results from @Calvin-Xu on the multi-domain swarm run. Updated plots show that the functional form fitting can improve over the search frontier in as few as 20 runs when soft regularization constrains optima to a convex combination of the top observed runs. A complementary signal-to-noise analysis at 60M and 1.2B token scales quantifies how much useful mixing signal is recoverable at each budget, using the 240-run swarm for signal and matched-seed repeats for noise. Follow-up notes in #data-mixing report that the functional form is now mostly locked in, with predicted optima becoming sparser than earlier fits (often dropping low-quality splits entirely); the raw optima are good enough now that the earlier convex-combination regularizer is no longer necessary, and noise has been pinned at std ~0.0014 BPB from matched-seed repeats.
A question came up around whole-document packing for non-natural-language domains. @Helw150 noted in issue #4535 that protein and DNA documents suffer acutely from mid-sequence splits (the latter half of a document is meaningless without the start), and asked for the partial-packing behavior available in chat formats to be surfaced for plain text formats. It turned out that DatasetComponent already supports this via its pack field; PR #4536 further exposes a partial_pack option directly on TextLmDatasetFormat so callers can enable whole-document packing inline without reaching into the surrounding component. Alongside this, @eric-czech landed PR #4622 migrating DNABatchTokenizer and DNALmDatasetFormat to the new MarinTokenizer protocol, using as_hf_tokenizer() internally where HF-specific APIs are still required. On the GitHub code dataset front, issue #3332—which proposes cloning permissively-licensed repos from CommonPile and extracting full commit histories as training documents—received a question from @percyliang about expected token yield and crawl throughput.
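For intuition, whole-document ("partial") packing semantics can be sketched as follows — an illustration of the behavior, not the DatasetComponent/TextLmDatasetFormat implementation:

```python
def pack_whole_documents(docs: list[list[int]], seq_len: int) -> list[list[int]]:
    # A document is never split across sequences (for protein/DNA text the latter
    # half is meaningless without the start), so start a new sequence whenever the
    # next document would not fit. Documents longer than seq_len are truncated.
    sequences: list[list[int]] = []
    current: list[int] = []
    for doc in docs:
        doc = doc[:seq_len]
        if current and len(current) + len(doc) > seq_len:
            sequences.append(current)
            current = []
        current = current + doc
    if current:
        sequences.append(current)
    return sequences
```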
Discord discussion kept circling the synthetic and agentic branches of the data strategy. In #data-curation, @AlienKevin flagged issue #4435 (SWE-ZERO) as a cheap path to scale agentic traces for pre/mid-training alongside the Code World Model rollouts in issue #4383, and by the end of the week reported an MVP with 1000 trajectories across 10 SWE-rebench repos (87.4% unique bash commands at Jaccard < 0.5), a teacher scoring >50% pass@100 on SWE-B via issue #4561, and an explicit next step of scaling toward the 100B-token target called out on this epic. A separate thread on rephrasing-style synthetic data surfaced external input from @elie pointing at a recent Liang-coauthored paper on reusing high-quality long-context documents; @cs2716 noted the BeyondWeb finding that rephrasers plateau beyond ~3B parameters, and @percyliang cautioned that the cited work assumes an “infinite compute” regime Marin does not have, so any adoption will need adaptation. In #midtraining, @Helw150 and @ahmeda14960 clarified that “midtraining” in Marin’s usage specifically means swapping the data mix during annealing (not just LR annealing on the pretraining mix as in the Mantis Stack v2 edu tail), framing how the upcoming isoflop midtraining sweeps will be set up.
@Helw150 landed two training infrastructure fixes: PR #4458 simplifies the QB (queue-based) trainer by consolidating work that was split across the main thread and post-step synchronization — a cleanup motivated by suspected checkpoint resumption issues — and PR #4457 makes default checkpoint retention conservative across default_train, default_sft, and default_dpo, so only the final checkpoint (plus rolling temporaries for resumption) is kept by default rather than accumulating all intermediate ones. On the agentic tooling side, @rjpower added a TDD bug-fix skill that walks agents through four phases — root cause analysis, writing a minimal failing test, applying the smallest sufficient fix, then lint and commit — keeping them from jumping straight to a patch without a reproducing test. @yonromai added missing YAML frontmatter to the fix-docs skill so Codex stops flagging it as invalid in PR #4659, and @ravwojdyla re-synced a drifted uv.lock in PR #4617.
@eric-czech rebased the long-running dna branch onto current main in PR #4247, replaying 11 of 15 branch-only commits after it had fallen 381 commits behind; four commits touching PermutationDataset/EraShufflingDataset were skipped because main now depends on finite dataset semantics that conflict with the branch’s infinite-length approach. Separately, @redagavin merged a speedrun entry PR #2185 validating the Muon optimizer on a 50M-parameter Llama at 1× Chinchilla scale (~1B tokens on a single H200), achieving a Paloma BPB of 1.3989 as a baseline for optimizer comparisons.
On packaging, @rjpower renamed the six marin-owned lib packages to the marin-* prefix (marin-fray, marin-rigging, marin-iris, marin-zephyr, marin-haliax, marin-levanter) in PR #4609 so wheels publish under a marin-owned namespace; import names are unchanged, but the rename tripped up old uv clients and required a uv sync --all-packages --reinstall-package marin-iris to recover, per a heads-up Russell posted in #infra.
Beyond the external contributions already woven into the epics above, two threads stand on their own. @RohithKuditipudi, alongside the Fray v2 logprob-eval migration in #4398, filed a five-issue post-training predictability roadmap — #4547, #4548, #4549, #4550, #4551 — covering intermediate-checkpoint forecasting, pass@k prediction, and isoflop/Delphi ladders on mid- and post-training mixes, a scaling-law program parallel to the pretraining isoflops. @nevillelyh wrote a detailed design review on #4445 (levanter store/cache) with write-throughput benchmarks and a Parquet-mapped layout sketch, and weighed in on Iris startup profiling in #4571. @redagavin added a Muon-at-1×-Chinchilla speedrun in #2185.
Four new members posted introductions: Vivien Cheng, a Stanford MS / incoming PhD taking CS336 who works on ML systems, kernel optimization, and linear attention at Hazy Research and wants to contribute on infra/systems; Ty Feng, an ML engineer with an RL-infra background on a TRC grant who found Marin via the JAX docs and wants to contribute to data and engineering while learning scaling on JAX/TPU; Chris (cs2716), an Imperial maths grad coming from robotics "to learn how the sausage gets made" as the field moves toward foundation models; and Sri, with theoretical background in scaling and distributed systems plus JAX tinkering on a local Mac M1, planning to file small PRs against under-documented edges while ramping up. Vivien's kernel background intersects with this week's ragged_dot Triton landing and the communication-bound gap it exposed; Ty brings context on the JAX/TPU RL substrate @taivu1998 is building in #4524, #4620, #4628, #4661.
Kevin Xiang Li's SWE-ZERO status post in #data-curation summarized 1000 rolled-out trajectories across 10 SWE-rebench repos, 87.4% unique bash commands, and mini-coder-1.7b above 50% on SWE-B pass@100, pointing at the 100B-token target in #3100 next. In #moe, Larry's announcement of #4636 invited humans and agents to climb the published metrics ladder in experiments/grug/moe/README.md. And Percy Liang's reply to a question about a long-context data-reuse paper he co-authored — "we will probably want to take some of these ideas, but might want to adapt it since that paper is still in the 'infinite compute' setting, but we don't have infinite compute in Marin" — captured the project's compute-constrained framing on the isoflop and data-mixing work.
Reading in #news tracked the project's own active arcs — synthetic data, agentic-benchmark integrity, and MoE routing — via Meta's Muse Spark / MSL announcement, Tristan Thrush's Dataset Policy Gradient paper, a report of widespread cheating on agent benchmarks including Terminal-Bench 2's top three, and a PathMoE writeup on sharing router weights across consecutive layers.
The week's defining event was the launch of the preregistered 1e23 MoE run — the project milestone is literally titled "Kick-off pre-trained 100B-A13B 1.2T token MoE (pregistered)" — backed by a completed scaling law sweep that produced a sharp prediction before the run started. The preparation work closed three interlocking experiments and the big run itself kicked off Friday on v4-512. Larry tied a bow on it Thursday with #4636, catching the grug/moe recipe up to main and publishing a metrics ladder in experiments/grug/moe/README.md for "anyone, or anyone's agents" to try to climb — announced in #moe and met with dlwh's "Iris is fully operational, moe looks 🔥. i should take more weeks off."
MoE v7 1e22 closes out. Issue #3800 (Test MoE Arch at 1e21 and 1e22 Flop Scales) reached its conclusion this week with three reruns of the 1e22 d3200 run by @ClassicLarry (v2, v3, v4), totaling 1.08e22 model FLOPs each on v4-256. The key finding was the effect of capacity factor: the v4 run with cf=4.0 reached 38% MFU and Paloma c4_en/bpb 0.742, macro loss 2.290 on Paloma and 1.968 on uncheatable eval — all notably better than cf=1.0 (v2 at 19% MFU, macro loss 2.459). The v2 resume ran at just 19% MFU due to a checkpoint-loading path that was bisected to a crash in broadcast_one_to_all when trainable weights appear inside tuple[Block, ...]; a workaround was found that allowed all seven affected jobs to restart cleanly. The final actual macro loss of 2.432 came in above the preregistered prediction of 2.389 — @ClassicLarry attributed this to a too-optimistic irreducible asymptote in the three-point fit (asymptote of -0.1 vs a realistic 1.7) and to using a non-isoflop datapoint for the 1e20 anchor.
Isoflop v16 sweep produces the 1e23 preregistered prediction. Issue #4447 ran MoE v16 isoflop sweeps from 1e18 to 3e20 FLOPs across model widths d512 through d2560 on v4-256, filling in the scaling law needed to size the big run. The curves were smooth and the near-term fit was validated: the law predicted Paloma c4_en/bpb 2.598 at 1e21, and the actual d2560 run came in at 2.599. Optimal sizing for 1e23 was ambiguous due to flat isoflop curves, so the team split the difference between two fits: 131B total / ~16B active, 48 layers, d5120, ~1.02T tokens. For the preregistered Paloma macro loss prediction at 1e23, free-asymptote fits appeared too optimistic due to systematic underestimation at smaller scales; fixing L∞ = 1.6 gave stable cross-validated predictions of ~2.25 across all training-set sizes from 1e18–1e19 onward (scaling law: macro(C) = 1.6 + 95.18 · C^-0.094). A parallel critical batch size sweep in #4432 fit CBS ~ 1.87e-5 · C^0.388, projecting a critical batch size of ~15,500 (63M tokens/batch) at 1e23; the planned batch size of ~8M tokens sits well below that threshold. The preregistration discipline is itself part of the house style — when Meta announced Muse Spark in #news, willheld's first reply was "not preregistered... doesn't count 🤣".
The 1e23 run launches, with an early crash and a live restart. Two attempts at the full-scale run appeared this week on v4-512. moe_1e23_apr10_bs2048_ep8_ragged crashed after 7B tokens with loss diverging to 8.89 at 17% MFU; the ragged all-to-all capacity clipping bug fixed in #4359 was one contributing factor. The successor, moe_1e23_d5120_bs2048_ep4_ring, launched Friday on v4-512 with ring-style expert parallelism and is live as of this summary, having processed ~20B of the target ~1.2T tokens at 14% MFU (still in early warmup). Both runs use the d5120 geometry (131B total / ~16B active) from the isoflop sizing recommendation and train on the Nemotron mix. MFU at this scale remains under scrutiny — #4283 is the April tracking issue for the v4-1024/2048 target, and on Sunday @chloe offered to run grug MoE vs. Megatron on H100 to pin down the cross-stack comparison.
NemotronTerminal SFT reproduction in progress. Two SFT runs are underway aiming to reproduce NemotronTerminal-32B (target: 27.4% on TerminalBench 2). Issue #4307 tracks exp4307, a Qwen3-32B SFT on the full NemotronTerminal corpus running on v5p-256 at 34% MFU; it appears as crashed in W&B at step ~1500 (train loss 0.361) and was resubmitted this weekend on v4-512 from scratch with tensor parallelism. At step 1500 (26% through training), @AlienKevin reported 17.6% on TB2 (13/74 tasks solved), putting it at ~65% of the released model's performance. The 8B companion, #4420 / exp4420, is running on v5p-32 at 42% MFU; an early EOS token bug (Levanter exported eos_token_id: 128001 instead of the Marin tokenizer's 128009) caused all step-1500 TB2 evals to time out — once fixed, re-eval returned 1% on TBLite and step 3000 reached 5%. TerminalBench validity itself was live context this week: Kevin Xiang Li reflected in #sft-agents on the Steinl et al. finding that the top three TB2 submissions are all cheating, noting "we have to constantly monitor agent traces to detect reward hacks and continuously harden these benchmarks."
| Run | User | Hardware(?) | Hours(?) | FLOP Budget(?) | Loss | BPB(?) |
|---|---|---|---|---|---|---|
| moe-v7-1e22-d3200-resume-2 | Larry Dial | TPU v4 (256 chips) | 0.2h | 1.08e22 model / 5.59e22 HW (19%) | — | 0.750 |
| #3800 moe-v7-1e22-d3200-v3 | Larry Dial | TPU v4 (256 chips) | 0.3h | 1.08e22 model / 2.88e22 HW (37%) | — | 0.738 |
| #3800 moe-v7-1e22-d3200-v4 | Larry Dial | TPU v4 (256 chips) | 0.3h | 1.08e22 model / 2.84e22 HW (38%) | — | 0.733 |
| #4447 moe_1e23_d5120_bs2048_ep4_ring | Larry Dial | TPU v4 (512 chips) | 1.2d | 2.06e21 model / 1.45e22 HW (14%) | — | 1.079 |
| #4447 isoflop-moe-v16-1e+21-d2560 | Larry Dial | TPU v4 (256 chips) | 1.0d | 1.13e21 model / 7.52e21 HW (15%) | — | 0.805 |
| exp4307_nemotron_terminal_qwen3_32b_32768tok_v5p256-ec2c31 | Kevin Li | TPU v5 (128 chips) | 2.6d | 2.30e21 model / 6.84e21 HW (34%) | — | — |
| #4447 isoflop-moe-v16-1e+21-d2048 | Larry Dial | TPU v4 (256 chips) | 1.0d | 1.19e21 model / 5.35e21 HW (22%) | — | 0.817 |
| exp4420_sft_marin_8b_instruct_terminal_corpus_full_32768tokens_v-b20d73 | Kevin Li | TPU v5 (32 chips) | 4.3d | 2.20e21 model / 5.28e21 HW (42%) | — | — |
| #4447 isoflop-moe-v16-1e+21-d2048-v2 | Larry Dial | TPU v4 (256 chips) | 0.2h | 1.19e21 model / 5.03e21 HW (24%) | — | 0.793 |
| #4447 #4359 moe_1e23_apr10_bs2048_ep8_ragged | Larry Dial | TPU v4 (512 chips) | 8.9h | 7.66e20 model / 4.49e21 HW (17%) | — | 4.032 |
| #4447 isoflop-moe-v16-1e+21-d2560-v2 | Larry Dial | TPU v4 (256 chips) | 0.3h | 1.13e21 model / 4.21e21 HW (27%) | — | 0.792 |
| #4447 isoflop-moe-v16-3e+20-d1280-v2 | Larry Dial | TPU v4 (64 chips) | 1.3d | 4.29e20 model / 4.14e21 HW (10%) | — | 0.852 |
| #4447 isoflop-moe-v16-3e+20-d1536 | Larry Dial | TPU v4 (64 chips) | 1.5d | 3.93e20 model / 1.94e21 HW (20%) | — | 0.837 |
| #4447 isoflop-moe-v16-3e+20-d2304-v2 | Larry Dial | TPU v4 (64 chips) | 2.9h | 3.49e20 model / 1.85e21 HW (19%) | — | 0.843 |
| #4447 isoflop-moe-v16-3e+20-d1792 | Larry Dial | TPU v4 (64 chips) | 1.4d | 3.75e20 model / 1.78e21 HW (21%) | — | 0.835 |