The preregistered 1e23 MoE run launched Friday on v4-512 — the arc the project has been building toward since the 1e18–1e19 isoflop sweeps in early March and the 1e21 and 1e22 runs in the weeks since (the milestone is literally named for it). The v16 isoflop sweep that sized the run predicted Paloma c4_en/bpb 2.598 at 1e21 and the actual came in at 2.599; the macro-loss prediction at 1e23 is ~2.25, with the run training a 131B-total / ~16B-active model on ~1.2T tokens of Nemotron mix. The 1e22 v7 run that preceded it closed out with a clear capacity-factor finding (cf=4.0 beats cf=1.0 by 0.027 macro loss), and the actual 2.432 came in above the earlier 2.389 prediction — @ClassicLarry traced this to a too-optimistic irreducible-loss asymptote in the prior three-point fit, a calibration that fed directly into the 1e23 prediction.
The Ray-to-Iris migration reached its endgame: classification inference, FastText training, and Levanter’s Ray cache_dataset entrypoint were all deleted, capping the multi-month Iris architecture push that ran through SQLite and JWT (week of March 9), DuckDB and multi-host GPU (March 16), and the controller-performance work that cut lock hold time from 80ms to under 5ms (March 23). Genuinely new this week: the marin-* library packages began publishing nightly to PyPI, the canonical datakit pipeline gained a daily end-to-end smoke ferry (now visible in the cluster-stability strip above), and SWE-ZERO synthetic data crossed 55% of its preregistered 1B-token target after the execution-free MVP and multi-language scaling experiments both closed. Nightshift’s automated scout pipeline began reliably delivering findings after a git worktree race fix that had been silently killing most jobs.
The migration’s endgame was not a clean cutover. The central2 Ray cluster took three manual restarts between Apr 6 and Apr 8 (SIGABRT in the GCS server, resource starvation), and @yonromai’s mid-week cap on remaining Ray workers in #4604 ran headlong into a paper-deadline SFT branch that depended on the old pools — a conversation between @Helw150, @yonromai, and Tony that sharpened why Iris’s fair-share scheduling is worth the cost of the forced deprecation. Alongside the migration, @rjpower surfaced a ~$60k/month GCS bill driven mostly by checkpoint storage: a one-time purge against a 3-day soft-delete window ran Apr 8, default training paths now keep only final checkpoints permanently, and a block-shuffle recommendation went out to cut class-B operation costs on hyperparameter sweeps. Outside the core team, four new members introduced themselves — framed in the Community Pulse below — alongside substantive GitHub contributions from @eric-czech, @RohithKuditipudi, @chloechiaw, @taivu1998, @redagavin, and @nevillelyh.
Summary: Measurable: Bolinas can `import marin` and use it as a library.
This week saw a concentrated push to make Marin's library packages genuinely usable as standalone dependencies. @rjpower introduced a MarinTokenizer Protocol in #4405 that decouples all tokenizer usage from HuggingFace's transformers package — the new abstraction uses the Rust-backed tokenizers library directly, which eliminates the torch-at-import problem and enables process-isolated encode_batch calls to contain memory leaks. Every runtime tokenizer path (data processing, RL training, eval, inference, visualization) was migrated in that PR. Follow-on work in #4451 added as_hf_tokenizer() to the Protocol so that padding, save_pretrained, and LoRA export can bridge back to HF APIs when needed, and fixed kitoken resolution in levanter's pyproject.toml. A residual tokenizer consistency issue tracked in #1753 was closed as resolved by the unified pipeline, though #4678 documents a remaining gap in eval_harness.py where HF and Marin tokenizer APIs are still mixed.
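For readers who haven't opened the PRs, the shape of the abstraction is roughly the following — a minimal sketch assuming the Protocol exposes little more than encode_batch and as_hf_tokenizer (the two methods named in #4405/#4451); the repo's actual interface is richer:

```python
from typing import Protocol, Sequence

from tokenizers import Tokenizer  # Rust-backed, no torch pulled in at import time


class MarinTokenizer(Protocol):
    """Sketch of the tokenizer Protocol; only these two methods are named in the PRs."""

    def encode_batch(self, texts: Sequence[str]) -> list[list[int]]: ...

    def as_hf_tokenizer(self):
        """Bridge to a transformers tokenizer for padding, save_pretrained, and LoRA export."""
        ...


class RustBackedTokenizer:
    """Illustrative implementation satisfying the Protocol via the tokenizers library."""

    def __init__(self, tokenizer_json: str):
        self._tok = Tokenizer.from_file(tokenizer_json)

    def encode_batch(self, texts: Sequence[str]) -> list[list[int]]:
        return [enc.ids for enc in self._tok.encode_batch(list(texts))]

    def as_hf_tokenizer(self):
        # Only import transformers when the HF bridge is actually needed.
        from transformers import PreTrainedTokenizerFast

        return PreTrainedTokenizerFast(tokenizer_object=self._tok)
```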
Cleaning up the library/experiment boundary was a parallel thread. #4541 removed all lib/ → experiments/ import paths and deleted the obsolete speedrun system, completing the work tracked by #4469. A subtle Python 3.11+ regression was fixed by @yonromai in #4534: get_caller_path() was returning <frozen runpy> as the experiment name when scripts were launched via python -m, causing executor metadata to be written to paths like gs://…/<frozen runpy>-ee7bce.json. The fix walks the stack past frozen frames to find the actual caller file.
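A minimal sketch of the frozen-frame walk (not the repo's exact code):

```python
import inspect


def get_caller_path() -> str:
    # Sketch of the #4534 fix: scripts launched via `python -m` put "<frozen runpy>"
    # frames at the top of the stack, so skip frozen frames (and this module's own
    # frame) and return the first real caller file instead of writing executor
    # metadata under a "<frozen runpy>-*.json" path.
    for frame_info in inspect.stack()[1:]:
        filename = frame_info.filename
        if filename == __file__ or filename.startswith("<frozen"):
            continue
        return filename
    return "<unknown>"
```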
Config discovery and packaging received significant hardening. #4546 added a --cluster flag to the Iris CLI that resolves cluster configs by name (searching infra/, installed package resources, and ~/.config/marin/clusters/) so external repos no longer need the monorepo checkout on disk. #4607 went further and bundled cluster YAML configs inside the marin-iris wheel itself, so --cluster=marin works for any downstream consumer that installs the wheel. The tokenizer mirror path scheme was also cleaned up in #4555, switching from a flattened org--model key to versioned slash-separated paths (mirror://tokenizers/{org}/{model}/hf-hub-{version}/) so library upgrades force a fresh fetch rather than silently reusing stale cached files. A missing chex dependency that broke wheel-only installs of levanter was fixed in #4608.
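A sketch of the lookup order described above, with the package and directory names treated as assumptions rather than the CLI's actual identifiers:

```python
from importlib import resources
from pathlib import Path


def resolve_cluster_config(name: str) -> str:
    # Lookup order per #4546/#4607: a monorepo checkout's infra/, then configs
    # bundled inside the installed wheel, then the user's config directory.
    repo_path = Path("infra") / f"{name}.yaml"
    if repo_path.exists():
        return repo_path.read_text()
    bundled = resources.files("iris").joinpath(f"configs/{name}.yaml")  # wheel-bundled
    if bundled.is_file():
        return bundled.read_text()
    user_path = Path.home() / ".config" / "marin" / "clusters" / f"{name}.yaml"
    if user_path.exists():
        return user_path.read_text()
    raise FileNotFoundError(f"no cluster config named {name!r}")
```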
The packaging layer was put on a formal footing by #4609, which renamed the six internal library packages from their bare names to a marin-* prefix on PyPI (marin-fray, marin-rigging, marin-iris, marin-zephyr, marin-haliax, marin-levanter), leaving import names unchanged. Building on that, #4612 wired up CI to publish all seven marin-* wheels nightly, on tagged releases, and to a local vendor directory for fast local iteration. Nightly builds are tagged marin-<pkg>-YYYYMMDD with a rolling marin-<pkg>-latest alias; a coherent semver release can be cut by pushing a marin-libs-vX.Y.Z tag. The rename shipped with a small amount of user-facing friction: Russell flagged in #infra on Apr 10 that older uv clients could report iris as missing after pulling main, with uv sync --all-packages --reinstall-package marin-iris as the workaround. The long-standing tracking issue #2442 for Marin-as-a-library was closed, and a demo repo #4472 remains open as the final proof-of-concept step.
Summary: Tracking issue for April MFU work. Tasks/goals:
The week's most concrete GPU MoE throughput advance was the landing of #4297 by @chloechiaw, which adds a Triton kernel for the ragged_dot grouped matmul at the heart of Grug MoE GPU compute. The kernel, adapted from tokamax, runs the forward pass in Triton while falling back to XLA ragged_dot_general for the backward pass via custom_vjp (a limitation of JAX 0.8's lack of autodiff through pallas_call). Kernel-level forward benchmarks on a single H100 showed 5.2× speedup on uniform traffic (5.78 ms vs 29.98 ms) and 2.8× on skewed loads; at the 256M-parameter model level with 8 experts over 100 steps the improvement was a more modest ~20% in steps/sec. #4427, run by @yonromai and sealed the same week, then quantified the residual gap: with PR #4297 as the new baseline the H100×8 forward throughput sits at 26.12M tok/s versus Megatron's historical 33.09M tok/s anchor — a remaining 21.07% gap — and profiling shifts the blame from the old w13_ragged_dot compute bottleneck toward communication, synchronization, and overlap (roughly 29% compute vs 35% communication vs 36% host overhead in a fresh exact-cap Triton trace). The Megatron anchor itself is about to be re-measured: on April 12 @chloechiaw volunteered in #moe to run the Grug-vs-Megatron head-to-head called for by sub-issue #4311, confirming that grug_moe is the intended comparison point rather than one of the other Marin MoE variants.
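The forward/backward split follows JAX's standard custom_vjp pattern. A minimal sketch, using a dense einsum reference in place of both the Triton kernel and the XLA ragged_dot_general fallback so it stays self-contained and differentiable — the wiring, not the kernel, is the point:

```python
import jax
import jax.numpy as jnp
import numpy as np


def ragged_dot_reference(lhs, rhs, group_sizes):
    # Dense, differentiable stand-in for the grouped matmul: row i of lhs is
    # multiplied by rhs[g], where g is i's (contiguous) expert group. Wasteful
    # O(m*g*k*n) compute — exactly what the grouped Triton kernel avoids.
    group_ids = jnp.repeat(
        jnp.arange(rhs.shape[0]), group_sizes, total_repeat_length=lhs.shape[0]
    )
    per_expert = jnp.einsum("mk,gkn->mgn", lhs, rhs)
    return jnp.take_along_axis(per_expert, group_ids[:, None, None], axis=1)[:, 0, :]


@jax.custom_vjp
def ragged_dot(lhs, rhs, group_sizes):
    # The PR runs this forward through a Triton kernel; the reference stands in here.
    return ragged_dot_reference(lhs, rhs, group_sizes)


def _fwd(lhs, rhs, group_sizes):
    return ragged_dot(lhs, rhs, group_sizes), (lhs, rhs, group_sizes)


def _bwd(residuals, cotangent):
    lhs, rhs, group_sizes = residuals
    # JAX 0.8 cannot autodiff through pallas_call, so gradients are taken through
    # the XLA path rather than the Triton kernel.
    _, vjp_fn = jax.vjp(lambda l, r: ragged_dot_reference(l, r, group_sizes), lhs, rhs)
    d_lhs, d_rhs = vjp_fn(cotangent)
    # group_sizes is integer-valued, so its cotangent has dtype float0.
    return d_lhs, d_rhs, np.zeros(group_sizes.shape, dtype=jax.dtypes.float0)


ragged_dot.defvjp(_fwd, _bwd)
```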
A companion experiment, #4406 by @yonromai, tested whether the expert-padded w13 lowering from PR #3821 could stack on top of the new Triton path. The answer was a clear negative: across three seeds the combined approach was 3.28% slower than plain Triton, even though the same padded lowering remains a meaningful win on the XLA-only path. This rules out that direction as a follow-up for the current EP=8 configuration. Separately, #4359 by @dlwh fixed a correctness issue in ragged expert parallelism: receive buffers were previously sized for worst-case traffic rather than the configured capacity factor, and this PR clips receiver group sizes before the ragged all-to-all and preserves kept-token ordering on the return path.
On the TPU side, #4455 by @yonromai investigated JAX 0.8.0 vs 0.9.2 compatibility for Grug MoE on TPU. The initial smoke run on JAX 0.9.2 failed immediately due to stricter shard_map sharding validation — the expert weight arrays in grug_moe.py and lm_head in loss.py had mismatched in-specs. After explicit jax.sharding.reshard fixes, training runs cleanly on both stacks, and the final steady-state benchmark (7 steps, 2 warmup) shows JAX 0.9.2 slightly ahead of 0.8.0 on both v5p-8 and v4-8, clearing the path for a future JAX upgrade. The result landed in a broader context: @yonromai opened a coordination thread in #infra with @ahmeda14960 to scope a full codebase migration to JAX 0.9.2 — noting that the tpu-dep-hell branch had so far been exercised mainly for vLLM inference, not training — and posted the research/grug-moe-jax-regression branch with the fixes needed to reproduce the benchmark, which @ahmeda14960's agent then cross-filed as #4506 tracking the broader Levanter, chex, flax, and datasets ripple effects. @pranshu28 framed the acceptance criterion from the training side: “if we maintain similar MFU on grug MoE on the new jax that should be fine for training folks” — a reminder that an earlier Mixtral 8x7b regression in Levanter is the reason this gate exists.
Meanwhile, #4636 by @ClassicLarry (open at week's end) ports the MoeAdamHHeuristic from the moe_isoflop_apr_2026 branch, removes the initial-dense-layer path from Grug MoE to match the isoflop architecture, and fixes a jnp.repeat crash in align_kv_heads under abstract-mesh training — part of the preparation for the April isoflop scaling runs. Larry pitched the PR in #moe as bringing main back in sync and pointed anyone (“or anyone's agents”) to the recipe README as a “set of metrics to climb”, followed by a “moe looks 🔥” from @dlwh on return from a week off — informal validation that the combined picture (Triton kernel + isoflop heuristic + clean JAX 0.9.2 path + Iris fully operational) has the epic moving on all three fronts at once.
Summary: Split from #4266.
The Nightshift automated experimentation system saw key infrastructure fixes this week, enabling its scout agents to reliably produce results for the first time. PR #4581 from @rjpower diagnosed two bugs that had been silently killing most scouts each night: git worktree add calls were racing on repo metadata when dispatched in parallel (causing exit-255 failures for two of four scouts), and the surviving scouts produced no output because the runner was passing a nonexistent --cwd flag to the Claude CLI instead of the correct cwd= argument to subprocess.run. With worktree creation now sequentialized before parallel dispatch, scouts began delivering findings consistently.
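A sketch of the repaired dispatch shape, with the Scout record and the Claude CLI invocation as illustrative placeholders rather than the runner's actual code:

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Scout:
    branch: str
    worktree_dir: str
    prompt: str


def dispatch_scouts(scouts: list[Scout]) -> None:
    # Sequentialize `git worktree add` before dispatch: concurrent calls were
    # racing on repo metadata and exiting 255 for two of the four nightly scouts.
    for s in scouts:
        subprocess.run(["git", "worktree", "add", s.worktree_dir, s.branch], check=True)
    # Pass the working directory through subprocess.run's cwd= argument; the CLI
    # has no --cwd flag, which is why the surviving scouts produced no output.
    for s in scouts:
        subprocess.run(["claude", "-p", s.prompt], cwd=s.worktree_dir, check=True)
```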
The repaired pipeline immediately bore fruit. PR #4586 (April 9) combined four scout findings across subprojects: iris’s _job_state_name() and _task_state_name() helpers were deduplicated into canonical functions in rpc/proto_utils.py; levanter’s near-identical parquet row-group seeking blocks (one even carrying a TODO: fix this duplication comment) were unified into a shared _iter_parquet_from_row helper for a net −25 lines; three dead utility files in marin were removed; and a latent bug in zephyr’s load_parquet filter-column detection was fixed, replacing fragile substring matching on PyArrow expression strings with a proper AST walker that extracts field names directly.
PR #4621 (April 10) continued the streak with another round of automated housekeeping: three dead functions and their unused psutil/subprocess imports were removed from marin’s evaluation/utils.py; iris’s ControllerTransitions pruning logic was refactored into shared helpers (_prune_per_worker_history and _batch_delete) across two previously copy-pasted methods; four dead exports and legacy-compat methods were cleared from zephyr’s execution.py; and levanter shed 165 lines across six files, including deletion of a backward-compatibility shim (models/rotary.py) with zero callers.
Discord made clear that agent-driven work has quietly become the default mode of operating on the codebase, not just the nightly scout loop. Russell routinely delegated triage to Claude mid-conversation — pasting its analysis of memray’s spurious “Failed to compress input file” errors to show they weren’t causing task failures, offloading the BATCH-priority request recipe because he couldn’t remember it but “Claude will”, and responding to an OOM-looping tokenization job by telling willheld “ill file an issue and see if our friend Claude does an okay job” — which became #4575 minutes later, followed by #4577 and #4578. Romain took a similar path on the JAX 0.9.2 upgrade, planning to “just dipatch a codex” from Pranshu’s tpu-dep-hell branch to check for training regressions rather than doing the benchmark sweep by hand.
The pattern extended beyond the core infra team. rohithck filed #4494 and noted it was his “first time using an agent to file an issue; I am feeling the agi”; Ahmed had Codex file #4495 for a TPU-in-use error in the RL migration, and later credited “claude / codex” with making a stale-GCS-cache vllm bug palatable to hunt down. Willheld flagged that “Claude has its own CronCreate tool now”, retiring the sleep 570 loops that had been scaffolding scheduled agent runs. Eric raised the harder open question — how people are organizing agent-executed jobs rather than just the babysit/recover skills — suggesting the next frontier for this epic is moving agents from diagnosis and cleanup into actually launching and steering experiments, a direction Larry echoed when he opened the MoE recipe leaderboard to “anyone, or anyone’s agents”.
Summary: Split from #4266.
The preregistered 1e22 MoE run tracked in #3800 completed this week, training a 34.6B-total / 4.7B-active parameter model on 326B tokens of Nemotron mix at 1e22 non-embedding FLOPs. The run reached a paloma/macro_loss of 2.432 at capacity factor 4.0, substantially outperforming the 1e22 dense baseline, though slightly above the earlier 2.3887 prediction — @ClassicLarry attributed the gap primarily to an unrealistically low irreducible-loss asymptote in the three-point fit used to make that prediction, and notes that the 1e20 anchor point was not from a proper isoflop curve. A checkpoint-resume crash tied to trainable weights inside tuple[Block, ...] structures temporarily blocked the run; a workaround was found by loading from the latest checkpoint into a new GCS folder and W&B log, and a fix via #4458 resolved the immediate blocker. Capacity factor comparisons at step 77724 showed that higher capacity factors consistently improve macro loss (cf=4.0 giving 2.432 vs cf=1.0 giving 2.459), with the uncheatable eval improving by 3.1% from cf=1.0 to cf=4.0.
To size the upcoming 1e23 MoE run, @ClassicLarry completed a v16 isoflop sweep under the new learning-rate recipe #4447, running GQA 4:1 MoE models with 64 experts and k=4 across budgets from 1e18 to 1e21. The 1e21 prediction of 2.598 matched the actual d2560 result of 2.599, validating the near-term fit. Because the isoflop curves are flat and the model-size optimum ambiguous, a middle-ground sizing was chosen for 1e23: 48 layers, d5120 (40 heads), ~131B total / 16B active parameters, and ~1.02T tokens. Leave-future-out cross-validation of the scaling law showed that fixing the irreducible loss at L∞ = 1.6 yields stable predictions for 1e23 — estimated at roughly 2.25 Paloma macro loss — as long as empirical data exists at 1e18–3e19. A companion beta2 sweep #4567 confirmed that the existing clip(base^(B/32), 0.95, 0.9999) formula undershoots the optimal beta2 at larger batch sizes: at d=1024, the best beta2 shifted from 0.995 at bs=64–128 down to 0.97 at bs=256 and 0.95 at bs=512. The recipe will be updated to use a base of 0.998 instead of 0.999 — a near-free win at small batch sizes and directionally correct at large ones, where all large-scale runs will hit the 0.95 floor regardless.
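In code, the heuristic and the proposed base change look like this (a direct transcription of the formula above; the printed values illustrate why large-batch runs bottom out at the 0.95 floor):

```python
import numpy as np


def beta2_for_batch(batch_size: int, base: float = 0.998, lo: float = 0.95, hi: float = 0.9999) -> float:
    # Recipe heuristic: beta2 = clip(base ** (batch_size / 32), lo, hi).
    # The #4567 sweep motivates base=0.998 instead of 0.999; large-scale runs
    # still hit the 0.95 floor either way.
    return float(np.clip(base ** (batch_size / 32), lo, hi))


for bs in (64, 128, 256, 512, 2048):
    print(bs, round(beta2_for_batch(bs), 4))
# 64 -> 0.996, 128 -> 0.992, 256 -> 0.9841, 512 -> 0.9685, 2048 -> 0.95 (floor)
```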
A critical-batch-size sweep, #4432, mapped how eval BPB degrades as batch size grows at isoflop-optimal budgets from d512 through d1280. The result is that there is no sharp cliff: smaller batches are consistently a bit better, but degradation stays mild until roughly bs=64–128 at d512, bs=128–256 at d768 and d1024, and approximately bs=512 at d1280. A power-law fit (CBS = 1.87e-05 · C^0.388) projects the critical batch size to ~6,361 at 1e22 and ~15,536 at 1e23, indicating that the 4M-token batch used in the d3200 run was comfortably below the harmful regime. The practical upshot is that future optimizer tuning of beta1/beta2 scaling is the higher-leverage follow-up rather than refining the CBS fit. Separately, #4569 was opened to investigate the embedding norm growth seen in the 1e22 run — the embed (trained with AdamW rather than AdamH) grew from norm 176 at initialization to 3700 over the course of the run — and is exploring approaches including weight decay, cautious weight decay, per-token normalization, and token-level LR modulation as a function of recent gradient magnitude.
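Plugging numbers into the #4432 fit directly (the quoted ~6,361 and ~15,536 come from the sweep's higher-precision coefficients; the rounded constants here land slightly higher):

```python
def critical_batch_size(compute_flops: float, coef: float = 1.87e-5, exponent: float = 0.388) -> float:
    # Power-law fit from #4432: CBS = coef * C ** exponent, in sequences.
    return coef * compute_flops**exponent


print(round(critical_batch_size(1e22)))  # ~6.4e3 sequences at 1e22
print(round(critical_batch_size(1e23)))  # ~1.6e4 sequences at 1e23
```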
On the architecture front, latent-only MoE loss quality was assessed as part of the Great 10T gate in #4032. A clean 1e19 A/B across d768, d1024, and d1536 showed that the answer is width-dependent: latent MoE is a clear no at d768 (2.9% BPB regression with no throughput benefit), a plausible tradeoff at d1024 (1.7% regression, throughput gain), and the strongest current candidate at d1536 (approximately 0.8% regression, much better throughput). @pc0618 pushed in #moe for moving to larger scales given the shrinking gap, arguing the BPB gap should further close at the 100s:1 token-to-parameter ratios used in full hero runs rather than compute-optimal conditions; @ClassicLarry pushed back that the d1536 point is heavily undertrained and a 2% BPB hit at d1024 could cost ~15% of training time to make up, while still allowing that "most good ideas look bad until we get the execution right." The call is to keep latent MoE on the bench rather than promote it yet.
Isoflop infrastructure landed on main in #4636, which ports MoeAdamHHeuristic from the moe_isoflop_apr_2026 branch into experiments/grug/moe, drops the initial-dense-layer path from the MoE model to match the isoflop architecture, defaults EP capacity factor back to 1.0, and rewires launch.py to derive (model, optimizer, batch, steps) from a compute budget + hidden_dim. @ClassicLarry announced the PR as a "metrics to climb" leaderboard for anyone — or anyone's agents — wanting to improve the MoE recipe, with @dlwh noting that the combined week's progress — Iris fully operational alongside the MoE work — made for an unusually strong landing. The beta1=0.9062 literal in the new heuristic raised an eyebrow; @ClassicLarry explained it came from @Helw150's Vizier sweep and the decimals don't matter, prompting @Helw150's joke in a spin-off thread that his "Vizier" is a local wizard he consults rather than an actual sweep.
Forward-looking coordination started firming up in adjacent channels. On the MFU side of the April goal #4283, @chloew7 offered to run grug vs Megatron on an 8×H100 node for #4311, derisking the H100 leg of the 1e23 plan before committing. @rjpower's GrugMoE fixes on the research/grug-moe-jax-regression branch (see #4455) benchmarked JAX 0.9.2 against 0.8.0 and are being landed into main; @pc0618 signed off that maintaining MFU on grug MoE under the new JAX is the bar training folks care about. And in #midtraining, @Helw150 sketched that midtraining the 1e21 and 1e22 isoflop checkpoints is "very doable over a weekend when preemptible capacity is high," teeing up the natural next step once the isoflop sweep is complete. Infrastructure improvements for the Vizier reference sweep landed in #4563 from @eric-czech, making sweep jobs preemptible and preventing divergent (NaN loss) trials from crashing the entire sweep by marking them infeasible in Vizier instead.
Summary: Lower priority / slack-time workstream covering workqueue, dev-tpu replacement, and observability.
The week saw targeted improvements to Iris's operator experience, closing several gaps that surface in day-to-day use. @rjpower added a guard in #4585 that rejects --tpu, --gpu, and oversized memory/disk requests on the entrypoint job unless --enable-extra-resources is also passed — a common mistake where users assume the coordinator needs accelerators when it's only dispatching to worker tasks. The error message now explains the coordinator pattern explicitly. Separately, @rjpower landed #4597, a script that detects cross-region GCS reads for Iris jobs, giving operators a concrete diagnostic tool for latency and cost anomalies.
@ravwojdyla-agent added an iris job summary <job_id> command in #4592, surfacing per-task state, exit code, duration, peak memory, and current memory — data the controller already collected but had no CLI exposure. The same peak-memory column was added to the job detail dashboard page. The feature was designed with two concrete use cases in mind: agents babysitting long runs can fetch a structured summary at completion rather than scraping logs, and OOM postmortems can immediately see which shards hit cgroup limits. The --json flag makes the output machine-readable for scripted workflows.
On the dashboard side, @Helw150 opened #4660 adding dark mode, a clearer autoscaling overview, and revised colors and fonts — a follow-up to #4647, which called out that slice-state badges, colored dots, and progress-bar segments had no in-UI legend and no visual distinction between idle-ready and occupied-ready slices. The dashboard polish landed against a backdrop of genuine user enthusiasm: rohithck called the Iris dashboard “truly a breath of fresh air coming from ray”, and @dlwh's broader push to move workloads off Ray announced in #general drove first-time Iris users into the cluster all week. The load also surfaced the dashboard's current scaling limits — Michael Ryan noted that his large-resource job was “slowing down the iris dashboard and testing the limits of the auto-scaler,” and yurusankyo reported the controller being unresponsive on April 9th, prompting @rjpower to restart it.
Several smaller UX papercuts were identified and in some cases fixed in-week. Eric Czech hit a job stuck in a silent crash loop with max_task_failures > 1 left over from Ray defaults; @rjpower acknowledged that “it's confusing for users if the job takes 10x longer to report failure,” shipped the fix in #4615, and noted the job page itself needs better crash-loop surfacing. Ahmed M Ahmed raised a co-scheduling gap in #infra: RL jobs frequently have the trainer spin up while the worker sits unscheduled (or vice versa), holding compute that could go to someone else — @rjpower suggested combined reservations as a workaround and invited an issue for proper priority-boost support. The week also produced a set of new scoping issues for the longer-term documentation gap: #4463 audits stale Ray-centric docs, #4464 calls for a proper Iris getting-started guide, #4465 tracks removing Ray references from docs/ entirely, #4466 targets making Iris docs agent-parseable with structured CLI and SDK references, and #4467, now closed, proposed simplifying the --cluster default so users in the marin repo can omit the flag entirely.
Summary: Define canonical data pipelines for all data ingestion: download -> normalize -> dedup/quality -> tokenize.
The canonical data pipeline took a major step toward operational maturity this week with the introduction of a daily smoke ferry by @ravwojdyla that runs the full datakit stack — download → normalize → fuzzy dedup → consolidate → tokenize — end-to-end on a fineweb-edu 10BT sample every morning via GitHub Actions. The ferry uses a temporary GCS bucket with a 1-day TTL so per-run outputs auto-expire, provides per-run isolation, and is wired to Slack alerting on scheduled failure; a first live run completed in 1h10m on the production cluster and was used to tune worker resource provisioning. Alongside this, a normalize step was merged, establishing a general building block that converts raw downloads into standard Parquet with deterministic xxh3_128 IDs, exact content dedup within a shard, and configurable target partition sizes — the missing formal stage between download and tokenize described in the epic's sub-issues #4483 and #4484.
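A minimal sketch of that normalize contract — deterministic xxh3_128 IDs plus in-shard exact dedup before writing Parquet — with field names as assumptions rather than the stage's actual schema:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import xxhash


def normalize_shard(records: list[dict], out_path: str) -> None:
    # Assign each document a deterministic xxh3_128 id from its text, drop exact
    # duplicates within the shard, and write standard Parquet.
    seen: set[str] = set()
    ids, texts = [], []
    for rec in records:
        doc_id = xxhash.xxh3_128_hexdigest(rec["text"].encode("utf-8"))
        if doc_id in seen:  # exact content dedup within the shard
            continue
        seen.add(doc_id)
        ids.append(doc_id)
        texts.append(rec["text"])
    pq.write_table(pa.table({"id": ids, "text": texts}), out_path)
```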
A focused vortex-to-parquet migration landed across two PRs. #4596 switched fuzzy document dedup and connected-components intermediates from vortex to parquet, and #4610 completed the migration by also converting the final outputs of exact-paragraph, exact-document, and fuzzy-document dedup. As @ravwojdyla noted, early experience with vortex in this pipeline was not smooth, and parquet is the simpler path forward. An open PR from @rjpower, #4658, further tightens the tokenization step by replacing per-file stat RPCs with a single bulk glob(detail=True) call, eliminating the dedicated tokenize-filescan Zephyr job that previously launched 32 distributed workers just to retrieve file sizes — on a 2,755-file nemotron dataset the new approach takes ~2 seconds.
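The bulk-listing idea in #4658 amounts to one fsspec call instead of one stat RPC per file; a sketch with a placeholder path:

```python
import fsspec

fs = fsspec.filesystem("gs")
# One bulk listing instead of per-file stat RPCs; with detail=True, glob returns
# {path: info-dict}, so file sizes come back from a single call. The bucket and
# prefix here are placeholders, not the actual dataset path.
infos = fs.glob("gs://some-bucket/nemotron/**/*.parquet", detail=True)
sizes = {path: info["size"] for path, info in infos.items()}
```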
The pipeline's rough edges surfaced loudly on Thursday when @Helw150's common_corpus_english tokenization job stalled at 4999/5000 shards with 3268 workers still alive; @rjpower traced it to a single bad parquet shard OOM-looping on pathologically malformed content, and noted that Zephyr "only discover[s] the succeeded shards during the work phase," so a restart of a mostly-done job still fans out thousands of workers that finish almost immediately. Those workers then pin the CPU resources on v5p slices and prevent Iris from freeing the machine for training jobs asking for smaller slices, which blocked other users for hours. The incident generated three issues in quick succession — #4575, #4577, #4578 — and became the motivating example for #4600, which ports the native llama whitespace protection and complements #4603's max_whitespace_run_chars guard on the normalize side. #4588 documents the underlying class of failure — multi-megabyte whitespace runs from broken HTML extraction that trigger OOMs and can leave latent bad data in training corpora — and #4503 continues to track the _consolidate_metadata OOM on large datasets where the coordinator materializes all shard offset arrays at once.
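A sketch of what a normalize-side whitespace guard of the #4603 flavor can look like — the threshold and function name are illustrative, not the PR's values:

```python
import re

MAX_WHITESPACE_RUN_CHARS = 1024  # illustrative threshold, not the value used in #4603


def cap_whitespace_runs(text: str, limit: int = MAX_WHITESPACE_RUN_CHARS) -> str:
    # Guard against the failure class in #4588: multi-megabyte whitespace runs
    # from broken HTML extraction get truncated to at most `limit` characters
    # instead of OOM-looping the tokenizer worker.
    return re.sub(rf"\s{{{limit},}}", lambda m: m.group(0)[:limit], text)
```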
Beneath the operational pain is a design conversation that @ravwojdyla opened with a new contributor, Neville, about the tokenize producer itself. In a long pointers thread, rav framed the core issue: "the producer code/pipeline is overly complicated and slow, especially the levanter store consolidation. I'm not even entirely sure whether that consolidation is needed." The ensuing discussion with Neville and @dlwh/@Helw150 worked through TensorStore chunk semantics, why producer-side document ordering can't be aligned with consumer chunk boundaries (sequence lengths expand from 4K up to 128K during training), and the pretraining-vs-SFT split in read patterns — groundwork for a potential rewrite of the producer once Iris-native batch execution is fully online. A separate @rohithck-filed issue in code-talk about actor-name clashes on tokenize retries points at the same family of problems from the other side.
New datasets continued to flow into the pipeline. @Helw150 added the PleIAs/common_corpus English open-access subset — filtering to Open Science, Open Government, and Open Culture documents — and NSF grant abstracts (~170M tokens of public-domain scientific text). @ravwojdyla landed full download → normalize → tokenize support for all six starcoder2data-extras subsets (IR for C++, Python, Rust, low-resource languages, documentation, and Kaggle), also exposing levanter_batch_size through the tokenization write pipeline to control memory pressure for large-document datasets. Integration test coverage for the pipeline is expanding incrementally: PRs #4492 and #4493 add document-level dedup and consolidate edge-case tests respectively, working toward the sub-issues tracking end-to-end test coverage under the epic.
Summary: Measurable: canary ferry pass rate consistently above 90%.
This week’s work on canary reliability focused on hunting down the concrete failure modes degrading ferry pass rates. @rjpower fixed a cluster of CI failures in PR #4654: the Claude triage agent was hitting a 50-turn budget wall and getting killed mid-run, so the limit was raised to 500 turns; the CoreWeave canary was missing the controller extra needed for CloudK8sService; the dev-restart cron was colliding with the TPU canary ferry at 06:00 UTC and was staggered to 05:00; and SSH tunnel establishment was made resilient to transient connection resets and refusals via retry_with_backoff.
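The retry shape is the usual exponential-backoff-with-jitter pattern; a generic sketch (the repo's retry_with_backoff helper may differ in signature):

```python
import random
import time


def retry_with_backoff(fn, *, attempts: int = 5, base_delay: float = 1.0,
                       retry_on: tuple = (ConnectionResetError, ConnectionRefusedError, OSError)):
    # Retry transient connection resets/refusals with exponential backoff plus
    # jitter, re-raising on the final attempt.
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt + random.random())
```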
On the datakit smoke ferry, @ravwojdyla-agent fixed two separate failure paths. PR #4616 addressed an OOM in the consolidate step — the worker resource allocation was hardcoded at defaults sized for vortex rather than parquet, and was bumped to 8 GB via a new worker_resources field on ConsolidateConfig; fuzzy dedup iterations were also capped at 3 (down from 10) to keep runtime bounded. The same PR replaced the ad-hoc validation script with a data-driven checker that enforces exact file counts, schema, and row-count invariants across the full pipeline chain (download → normalize → dedup → consolidate → tokens), verified against a real end-to-end ferry run.
A subtler validation failure was fixed in PR #4627: validate_ferry_outputs.py was resolving the temp bucket using the GHA runner’s environment (which has no GCP metadata and falls back to file:///tmp/), while the Iris worker had already written outputs to gs://marin-tmp-{region}/ttl=1d. The fix has the ferry write a ferry_run_status.json containing the resolved marin_prefix to a path set by FERRY_STATUS_PATH, which the GHA workflow reads back and passes explicitly to the validation step — eliminating both the fallback and any hardcoded region assumptions. Finally, PR #4624 reduced the canary ferry coordinator’s memory request from 16 GB to 2 GB; at 16 GB it was tripping the --enable-extra-resources guard and breaking both TPU and CoreWeave ferries.
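A sketch of the handshake described above, assuming the status file carries only the resolved prefix:

```python
import json
import os


def write_ferry_status(marin_prefix: str) -> None:
    # The ferry records the prefix it actually resolved (e.g. gs://marin-tmp-{region}/ttl=1d)
    # at the path named by FERRY_STATUS_PATH, so the GHA workflow can read it back and pass
    # it to validate_ferry_outputs.py explicitly instead of re-resolving on a runner that
    # has no GCP metadata.
    with open(os.environ["FERRY_STATUS_PATH"], "w") as f:
        json.dump({"marin_prefix": marin_prefix}, f)


def read_ferry_status() -> str:
    with open(os.environ["FERRY_STATUS_PATH"]) as f:
        return json.load(f)["marin_prefix"]
```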
Discord made clear that the underlying cluster the TPU canary depends on was itself unreliable this week: central2 Ray had to be manually restarted at least three times between Apr 6 and Apr 8 — Eric Czech reported a SIGABRT in the GCS server logs with the head container up but no Ray services running, rohithck restarted a borked scheduler on Apr 7, and Tony restarted again on Apr 8 when nothing was running. Any canary ferry that fired during those windows would have failed for reasons orthogonal to the CI fixes above, which makes interpreting week-over-week pass-rate numbers noisier than the PR stream alone suggests and reinforces the case for the ongoing Iris migration as the real path to sustained 90%+.
Summary: All jobs run through Fray+Iris.
The Ray migration reached its decisive cutover this week, moving from a gradual two-system coexistence into an actively managed wind-down. @yonromai announced in #general that Ray will be fully deprecated in favor of Iris by end of month, and #4453 catalogued the remaining Ray-dependent workloads across 12 library files, spawning a coordinated migration with reach-outs to Rohith for logprob jobs #4640, Ahmed and Kevin for RL helpers #4639, and David for Levanter cache and vLLM infra #4641. @yonromai capped all remaining Ray TPU pools with tiered max_workers limits in #4604 and #4623, dropped min_workers to zero across all migrated clusters, deleted Ray auth/secret targets from the Makefile in #4562, and deleted Marin's classification inference and FastText training stack — the largest remaining direct `import ray` users — in #4642. Working with @Helw150, @yonromai also removed Levanter's Ray-only cache_dataset entrypoint and the vLLM Ray TPU fallback in #4648. @rjpower consolidated integration tests onto Iris in #4601 and removed Ray backend support from test fixtures in #4595. The single-host RL pipeline preregistered in #3959 completed its migration: @ahmeda14960 landed #3960, replacing the Ray-era client-side launcher with an in-cluster Iris coordinator topology that has preemption-safe resume semantics.
The cutover was not friction-free. Ray went down in central2 on Apr 6 — Eric Czech found the container running but the GCS server dead from a SIGABRT and no Ray services alive, and Tony was unable to connect at the same time; Eric restarted the cluster after confirming his dna branch was too stale to move onto Iris yet. Rohith restarted central2 again the next morning after jobs sat waiting on resources for hours, and Tony restarted it again that night. After #4604 reduced Ray capacity, Tony — the last remaining Ray user — found his running jobs killed by the cluster restart and pushed back hard on the timeline in #infra, noting he had been told the end of the month was the Iris migration deadline. @yonromai apologized for the disruption and explained that the pre-#4604 state effectively allowed one user to saturate TPUs meant to be shared, while @Helw150 pointed out that preemption pressure from Google paying customers was consuming ~90% of TRC capacity in east5-a and that Ray's only lever was max_workers, which itself appeared not to be enforced reliably for multi-host slices. The episode was a concrete reminder that Ray's autoscaler has no fair-share notion — hence the push to Iris, which does.
Iris absorbed substantial reliability and performance work. @rjpower replaced all kubectl subprocess calls in CloudK8sService with the Python kubernetes client in #4532, dropping per-call latency from ~1 s to connection-pooled milliseconds, and addressed 2,900+ orphaned pods and 17,800+ stale ConfigMaps accumulating in etcd via periodic GC in #4508. Controller responsiveness was fixed in #4531 by removing an unnecessary DuckDB log fetch on every GetProcessStatus poll, and the ListJobs call that was taking ~3 s due to eager descendant fetching was replaced with a cheap DISTINCT parent_job_id query in #4533. The SQLite controller store had proto BLOB columns replaced with native SQL across migrations 0024–0028 in #4644, and profiling moved to a dedicated profiles.sqlite3 in #4496. @rjpower added historical task resource usage tracking in #4629, split the autoscaler into a structured package in #4572, replaced the simple one-at-a-time worker restart with adaptive batch sizing in #4635, and changed the autoscaler from a hard min_slices floor to an additive buffer_slices warm-pool model in #4544. These landed against live load: after Michael Ryan's large v5e/v5-litepod sweep started flapping the cluster on Apr 8, @rjpower performed a controller restart midday to clear the autoscaler and deployed a scheduling-glitch fix the next morning after users with many tasks reported unresponsiveness. Fray's actor RPC was split into a direct remote() path for short-lived calls and a long-poll submit() path for multi-minute operations in #4500, halving RPC overhead, and cluster.proto was split into three focused files in #4452.
The Iris dashboard received a concentrated round of usability improvements and became newcomer-friendly enough that rohithck called it "a breath of fresh air coming from ray". @rjpower overhauled the scheduler tab, added paginated task lists, and surfaced scheduling failure reasons in #4651. Failed task attempt callouts were added to the job status page in #4614, filter and sort state is now encoded in URL query params in #4633, and the endpoint panel was fixed to show the job name rather than the user in #4671. Job request details (command, env, pip packages, named ports) appear on the job detail page via #4668. The dashboard became accessible without SSH tunnels via a Cloud Run IAP proxy deployed in #4630 by @ravwojdyla-agent. @AlienKevin extended flexible TPU scheduling to the CLI by adding comma-separated variant lists to iris job run --tpu in #4619 and expanded the v4-reserved pool to cover the full size range in #4662. New --priority and --fresh flags were added to iris job run in #4646 and #4655, with @Helw150 and @rjpower coordinating to make BATCH priority the ingredient that lets small CPU jobs stop pinning large TPU slices for other users. A separate in-progress infra status dashboard aggregating CI, build health, Iris reachability, and job state was opened in #4649. @dlwh summarized the week in #moe: "Iris is fully operational… I should take more weeks off." Operational papercuts still showed up: after #4609 renamed the lib packages to a uniform marin-* prefix, @rjpower warned in #infra that updated branches would trip uv with a spurious missing-iris error and supplied the uv sync --all-packages --reinstall-package marin-iris workaround; users hitting google.protobuf.json_format.ParseError on changed proto schemas were pointed to uv run python lib/iris/scripts/generate_protos.py as the manual regen step; and a vestigial Ray-era max_task_failures default that caused Iris jobs to crash-loop 10x before reporting failure was fixed in #4615. Docs lag the blessed path — #4443 called out that ~20 doc files still reference Ray, and the Iris README, OPS.md, and 22 design docs need an accuracy audit before agents and new users can self-serve.
Zephyr received stability and isolation work driven by CoreWeave production issues. @ravwojdyla-agent switched to running each shard task in a fresh Python subprocess in #4522, eliminating the Arrow memory pool, page cache, and leaked file descriptor accumulation that was OOM-killing long-lived workers. OOM-killed subprocesses now surface as MemoryError rather than a generic returncode -9 crash in #4580, and subprocess workers exit via os._exit in a try/finally to dodge PyArrow's GCS shutdown abort in #4576 and #4582. Per-shard failure tracking with a configurable MAX_SHARD_FAILURES abort threshold replaced the previous behavior in #4579, and idle workers on the last pipeline stage now exit immediately in #4583. The value of this hardening was visible in real time: on Apr 9 Will's tokenization job sat at 4999/5000 shards with one shard death-looping on OOM while Zephyr kept discovering succeeded shards during the work phase and spawning thousands of short-lived workers against them, filing follow-ups #4575, #4577, and #4578. On multi-cloud storage, @ravwojdyla fixed the Malformed StorageGeneration TensorStore bug on R2/CoreWeave in #4441, @ahmeda14960 fixed single-shard cache consolidation on R2/S3 in #4436, and the S3 distributed lock race causing SignatureDoesNotMatch errors was fixed in #4440.
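A sketch of the isolation and OOM-surfacing pattern from #4522/#4580/#4576 — names and structure here are illustrative, not Zephyr's actual worker code:

```python
import os
import signal
import subprocess


def run_shard_in_subprocess(cmd: list[str]) -> None:
    # Each shard runs in a fresh interpreter so Arrow memory pools, page cache,
    # and leaked file descriptors die with the process. An OOM kill shows up as
    # the negative signal number; surface it as MemoryError rather than a
    # generic returncode -9 crash.
    proc = subprocess.run(cmd)
    if proc.returncode == -signal.SIGKILL:
        raise MemoryError(f"shard subprocess OOM-killed: {cmd}")
    proc.check_returncode()


def shard_subprocess_main(run_shard) -> None:
    try:
        run_shard()
    finally:
        # Exit without interpreter teardown so PyArrow's GCS shutdown abort
        # cannot fire on the way out.
        os._exit(0)
```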
A parallel thread ran on GCS cost. @rjpower flagged in #infra that Marin's GCS bill is closer to $60k/month than previously thought, most of it in the STANDARD class where Autoclass can't help; a single code-resili-14b-* sweep accounts for ~315 TiB and ~$4,500/mo across 16 hyperparameter variants, with math-14b-resili-* and medical-14b-resili-* adding another ~$2,700/mo. The plan that landed in discussion: enable soft-delete for 3 days, then delete anything not on @dlwh's protect list or the project spreadsheet; @rjpower ran the one-time purge on Apr 8 afternoon. @Helw150 changed the default training path to save only the final permanent checkpoint — intermediate checkpoints now require explicit opt-in — noting that the previous behavior was writing 50 permanent model copies per 50k-step job. @ahmeda14960 proposed a longer-term split into shared storage (HF models, common datasets) and per-user storage tagged at executor-step write time now that Iris makes user attribution tractable. @rjpower also noted that class-B operation costs were being driven by users running full random shuffles during parameter sweeps and asked everyone to move to the new block-shuffle.
The SWE-ZERO data generation pipeline had its most active week yet. #4561, the execution-free agentic rollout MVP, closed this week after @AlienKevin completed the full six-step plan: the final 32K-context run on a v6e-8 worker with TP=4 produced 972/1000 unique rollouts, 486 clean submissions, and 4.52M completion tokens in 38 minutes — 3.16x more clean submissions and 2.46x more usable tokens than the 8K baseline. The dataset is published at AlienKevin/SWE-ZERO-1k-trajectories-32k, with an LLM-judged resolve rate of ~13% pass@1 and ~20% pass@10 on a 10-PR sample. Building on the MVP, #4653 (multi-language scaling, preregistered) also closed: 300 rollouts across all 20 SWE-rebench V2 languages (5 PRs x 3 rollouts per language) reached 31.7% aggregate submission rate and a 3.7% LLM-judged pass@1, with trajectories at AlienKevin/SWE-ZERO-multilang-300-trajectories. The 1B-token full-corpus scale-out #4666 (preregistered) is now 55% complete — 53,826 of 96,237 rollouts (~671M of ~1.2B tokens) across 19 shards. A separate experiment, #4683 (preregistered), validated execution-based evaluation via ConTree: the SDK integration is working end-to-end and will replace the LLM-as-judge with actual test-suite pass/fail verdicts. PR #4611 from @rjpower added a gVisor-sandboxed Python execution tracer for SWE-rebench instances, capturing function calls, returns, and line-level events via sys.monitoring (Python 3.12+) with a sys.settrace fallback. Kevin announced the milestone in #data-curation — a crisp three-point progress report covering the 1000-trajectory MVP, 87.4% unique-bash-command diversity verification, and the mini-coder-1.7b teacher scoring >50% pass@100 on SWE-B — with a next-step pointer to the 100B-token target.
Discord also sharpened how the team is thinking about synthetic data generators more broadly. In #data-curation, @cs2716 flagged the BeyondWeb paper's finding that rephraser size shows diminishing returns past 3B parameters — "effective synthetic data generation doesn't necessarily require massive computational resources" — and noted Percy's own earlier paper used Llama 3.1 8B Instruct as a generator with good results. Percy replied that Marin will likely adopt some of these ideas but wants to adapt them away from the "infinite compute" framing of the source paper ("we don't have infinite compute in Marin 😂"). @elie separately surfaced a long-context / high-quality-document-reuse paper Percy co-authored as a candidate direction. The Kevin-initiated framing from earlier in the week explicitly positioned SWE-ZERO as "a cheap way to scale agentic traces for pre/mid-training" that could complement Code World Model-style rollouts — a bet the team is now visibly making.
On the agentic SFT front, both the 8B and 32B NemotronTerminal reproduction runs are in flight. The 32B run #4307 is at 30.8% of training (step 1761/5,721) on v5p-256, and an intermediate eval at step 1500 showed 17.6% on Terminal-Bench 2 (versus the released model's 27.4%), suggesting the run is tracking well at roughly 65% of target performance at 26% through training. A second trial on v4-512 with tensor parallelism was submitted this week via Ray on the big-run cluster to compare hardware paths. For Marin-8B, #4420 confirmed a baseline of 0% on TB2 before SFT (as expected), and SFT training on the full 366K-example Nemotron-Terminal-Corpus is at 87% completion (step 4962/5,721) on v5p-64; intermediate checkpoints show 1% TBLite at step 1500 and 5% at step 3000. A notable debugging episode uncovered that Levanter's HF checkpoint export writes the wrong eos_token_id (128001 instead of 128009), causing generation to hang — fixing this unblocked the evals. A parallel v4-128 trial for Marin-8B was also launched this week. #4510 added projections for reproducing OpenSWE on Marin 32B: ~10,500 chip-hours on v5p-512 (~1.7 days), roughly 80% of the NemotronTerminal 32B SFT cost. In #sft, @willheld declared "online distillation is actually my #1 desired RL feature for Marin," triggering a threaded exchange with @natolambert disambiguating two open problems — distilling from an open-weights model in a fully open fashion (e.g. a Flash-sized Kimi) versus distilling from a fully open source model (e.g. a Flash-sized OLMo 3 32B) — with willheld noting Tinker currently supports neither since it only targets open-weight finetuning. Separately, in the #sft-agents benchmark-hacking thread, Kevin surfaced his recent SWE-B test-poisoning fix and argued that continuous trace monitoring and benchmark hardening are unavoidable — a position that dovetails with the ConTree execution-based evaluation work in #4683.
Research into soft proxies and predictive scaling continued. #4389 saw negative results on logprob-based proxies: @RohithKuditipudi evaluated top-25 Qwen3-8B SWE-bench fine-tunes using trajectory cross-logprob scoring (both full-sequence loss and success/failure gap), and found the relationship broke down even within a controlled same-base-model family — suggesting fine-tuning itself disrupts logprob comparability. Masking to only bash command tokens inside tool calls did not substantially improve signal. @Helw150 proposed two follow-on directions: adapting Charlie Snell's emergence-prediction method, or shifting to a post-training-controlled experiment design. A cluster of new planning issues — #4547, #4548, #4549, #4550, #4551 — was opened to structure the mid/post-training prediction program: identifying recipe candidates, predicting outcomes from intermediate checkpoints, designing smooth pass@k proxies, and training isoflop/Delphi ladders across mixes. PR #4539 from @RohithKuditipudi added soft proxies for the OT-Agent leaderboard as part of this effort. willheld sharpened the prediction conversation in #midtraining, drawing a firm distinction between predicting the effects of LR annealing (which he believes Marin already has evidence for) and predicting the effects of midtraining with a different data mix (which he does not think is shown); his concrete recommendation was to run midtraining on all isoflop models first, noting that "doing midtraining for like the 1e21 and 1e22 is very doable over a weekend when preemptible capacity is high" — a cheap, well-posed experiment that would directly feed the prediction program.
The RL infrastructure gained new building blocks this week. @taivu1998 opened PR #4661 with a first-pass OpenReward integration — manifest prep, typed Qwen/vLLM tool-calling, a single-turn OpenRewardEnv, and a smoke launcher — alongside PR #4628, which adds telemetry event shards and a provenance substrate for durable RL logging. PR #4620 made KL configuration explicit by replacing bare kl_coef plumbing with a dedicated KLConfig on RLOOLoss, and added k2 loss support alongside the existing k3 path. PR #4524 made the Iris RL config resource-aware for GPU rollout and trainer workers. On the DPO side, @ahmeda14960 opened PR #4637 unifying DPO and LoRA-DPO under a single train_dpo.py entrypoint with config-driven adapter types, fixing an HF export axis-order bug, and achieving a 2x eval speedup via durable reference caching — a cleaned-up successor to PR #4634. @eric-czech opened PR #4677 fixing tokenizer loading for lm-eval on HF checkpoints when a custom Marin tokenizer is specified alongside an HF checkpoint path. PR #4398 from @RohithKuditipudi migrated logprob evals to Fray v2 and added a tracker callback to save eval results to file in addition to W&B.
Summary: We will need 20T of high-quality (including / in particular code) tokens for our large MoE runs in Q2/Q3; this is the work in March that we will do to enable that.
The Nemotron data tokenization pipeline saw a focused round of reliability and observability fixes this week, all from @ravwojdyla. PR #4446 gave each Nemotron split its own Fray job, making individual stages easier to inspect in the dashboard. Memory allocation was addressed in two places: PR #4450 raised worker memory for CommonCrawl download workers to 4 GB (each decompresses ~350 MB zstd files to 1.5–2 GB in memory, causing OOMKills at the previous 1 GB default), and PR #4448 raised the levanter-cache-copy stage to 10 GB. PR #4449 distributed the cache shard-size probing step into a Zephyr pipeline so thousands of S3 connections no longer pile up in the coordinator process, and PR #4454 exposed cache_copy_max_workers as a tunable parameter on tokenize_nemotron.
On the data-mixing research front, issue #2345 received new results from @Calvin-Xu on the multi-domain swarm run. Updated plots show that the functional form fitting can improve over the search frontier in as few as 20 runs when soft regularization constrains optima to a convex combination of the top observed runs. A complementary signal-to-noise analysis at 60M and 1.2B token scales quantifies how much useful mixing signal is recoverable at each budget, using the 240-run swarm for signal and matched-seed repeats for noise. Follow-up notes in #data-mixing report that the functional form is now mostly locked in, with predicted optima becoming sparser than earlier fits (often dropping low-quality splits entirely); the raw optima are good enough now that the earlier convex-combination regularizer is no longer necessary, and noise has been pinned at std ~0.0014 BPB from matched-seed repeats.
A question came up around whole-document packing for non-natural-language domains. @Helw150 noted in issue #4535 that protein and DNA documents suffer acutely from mid-sequence splits (the latter half of a document is meaningless without the start), and asked for the partial-packing behavior available in chat formats to be surfaced for plain text formats. It turned out that DatasetComponent already supports this via its pack field; PR #4536 further exposes a partial_pack option directly on TextLmDatasetFormat so callers can enable whole-document packing inline without reaching into the surrounding component. Alongside this, @eric-czech landed PR #4622 migrating DNABatchTokenizer and DNALmDatasetFormat to the new MarinTokenizer protocol, using as_hf_tokenizer() internally where HF-specific APIs are still required. On the GitHub code dataset front, issue #3332—which proposes cloning permissively-licensed repos from CommonPile and extracting full commit histories as training documents—received a question from @percyliang about expected token yield and crawl throughput.
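For intuition, whole-document ("partial") packing semantics can be sketched as follows — an illustration of the behavior, not the DatasetComponent/TextLmDatasetFormat implementation:

```python
def pack_whole_documents(docs: list[list[int]], seq_len: int) -> list[list[int]]:
    # A document is never split across sequences (for protein/DNA text the latter
    # half is meaningless without the start), so start a new sequence whenever the
    # next document would not fit. Documents longer than seq_len are truncated.
    sequences: list[list[int]] = []
    current: list[int] = []
    for doc in docs:
        doc = doc[:seq_len]
        if current and len(current) + len(doc) > seq_len:
            sequences.append(current)
            current = []
        current = current + doc
    if current:
        sequences.append(current)
    return sequences
```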
Discord discussion kept circling the synthetic and agentic branches of the data strategy. In #data-curation, @AlienKevin flagged issue #4435 (SWE-ZERO) as a cheap path to scale agentic traces for pre/mid-training alongside the Code World Model rollouts in issue #4383, and by the end of the week reported an MVP with 1000 trajectories across 10 SWE-rebench repos (87.4% unique bash commands at Jaccard < 0.5), a teacher scoring >50% pass@100 on SWE-B via issue #4561, and an explicit next step of scaling toward the 100B-token target called out on this epic. A separate thread on rephrasing-style synthetic data surfaced external input from @elie pointing at a recent Liang-coauthored paper on reusing high-quality long-context documents; @cs2716 noted the BeyondWeb finding that rephrasers plateau beyond ~3B parameters, and @percyliang cautioned that the cited work assumes an “infinite compute” regime Marin does not have, so any adoption will need adaptation. In #midtraining, @Helw150 and @ahmeda14960 clarified that “midtraining” in Marin’s usage specifically means swapping the data mix during annealing (not just LR annealing on the pretraining mix as in the Mantis Stack v2 edu tail), framing how the upcoming isoflop midtraining sweeps will be set up.
@Helw150 landed two training infrastructure fixes: PR #4458 simplifies the QB (queue-based) trainer by consolidating work that was split across the main thread and post-step synchronization — a cleanup motivated by suspected checkpoint resumption issues — and PR #4457 makes default checkpoint retention conservative across default_train, default_sft, and default_dpo, so only the final checkpoint (plus rolling temporaries for resumption) is kept by default rather than accumulating all intermediate ones. On the agentic tooling side, @rjpower added a TDD bug-fix skill that walks agents through four phases — root cause analysis, writing a minimal failing test, applying the smallest sufficient fix, then lint and commit — keeping them from jumping straight to a patch without a reproducing test. @yonromai added missing YAML frontmatter to the fix-docs skill so Codex stops flagging it as invalid in PR #4659, and @ravwojdyla re-synced a drifted uv.lock in PR #4617.
@eric-czech rebased the long-running dna branch onto current main in PR #4247, replaying 11 of 15 branch-only commits after it had fallen 381 commits behind; four commits touching PermutationDataset/EraShufflingDataset were skipped because main now depends on finite dataset semantics that conflict with the branch’s infinite-length approach. Separately, @redagavin merged a speedrun entry PR #2185 validating the Muon optimizer on a 50M-parameter Llama at 1× Chinchilla scale (~1B tokens on a single H200), achieving a Paloma BPB of 1.3989 as a baseline for optimizer comparisons.
On packaging, @rjpower renamed the six marin-owned lib packages to the marin-* prefix (marin-fray, marin-rigging, marin-iris, marin-zephyr, marin-haliax, marin-levanter) in PR #4609 so wheels publish under a marin-owned namespace; import names are unchanged, but the rename tripped up old uv clients and required a uv sync --all-packages --reinstall-package marin-iris to recover, per a heads-up Russell posted in #infra.
Beyond the external contributions already woven into the epics above, two threads stand on their own. @RohithKuditipudi, alongside the Fray v2 logprob-eval migration in #4398, filed a five-issue post-training predictability roadmap — #4547, #4548, #4549, #4550, #4551 — covering intermediate-checkpoint forecasting, pass@k prediction, and isoflop/Delphi ladders on mid- and post-training mixes, a scaling-law program parallel to the pretraining isoflops. @nevillelyh wrote a detailed design review on #4445 (levanter store/cache) with write-throughput benchmarks and a Parquet-mapped layout sketch, and weighed in on Iris startup profiling in #4571. @redagavin added a Muon-at-1×-Chinchilla speedrun in #2185.
Four new members posted introductions: Vivien Cheng, a Stanford MS / incoming PhD taking CS336 who works on ML systems, kernel optimization, and linear attention at Hazy Research and wants to contribute on infra/systems; Ty Feng, an ML engineer with an RL-infra background on a TRC grant who found Marin via the JAX docs and wants to contribute to data and engineering while learning scaling on JAX/TPU; Chris (cs2716), an Imperial maths grad coming from robotics "to learn how the sausage gets made" as the field moves toward foundation models; and Sri, with theoretical background in scaling and distributed systems plus JAX tinkering on a local Mac M1, planning to file small PRs against under-documented edges while ramping up. Vivien's kernel background intersects with this week's ragged_dot Triton landing and the communication-bound gap it exposed; Ty brings context on the JAX/TPU RL substrate @taivu1998 is building in #4524, #4620, #4628, #4661.
Kevin Xiang Li's SWE-ZERO status post in #data-curation summarized 1000 rolled-out trajectories across 10 SWE-rebench repos, 87.4% unique bash commands, and mini-coder-1.7b above 50% on SWE-B pass@100, pointing at the 100B-token target in #3100 next. In #moe, Larry's announcement of #4636 invited humans and agents to climb the published metrics ladder in experiments/grug/moe/README.md. And Percy Liang's reply to a question about a long-context data-reuse paper he co-authored — "we will probably want to take some of these ideas, but might want to adapt it since that paper is still in the 'infinite compute' setting, but we don't have infinite compute in Marin" — captured the project's compute-constrained framing on the isoflop and data-mixing work.
Reading in #news tracked the project's own active arcs — synthetic data, agentic-benchmark integrity, and MoE routing — via Meta's Muse Spark / MSL announcement, Tristan Thrush's Dataset Policy Gradient paper, a report of widespread cheating on agent benchmarks including Terminal-Bench 2's top three, and a PathMoE writeup on sharing router weights across consecutive layers.
The week's defining event was the launch of the preregistered 1e23 MoE run — the project milestone is literally titled "Kick-off pre-trained 100B-A13B 1.2T token MoE (pregistered)" — backed by a completed scaling law sweep that produced a sharp prediction before the run started. The preparation work closed three interlocking experiments and the big run itself kicked off Friday on v4-512. Larry tied a bow on it Thursday with #4636, catching the grug/moe recipe up to main and publishing a metrics ladder in experiments/grug/moe/README.md for "anyone, or anyone's agents" to try to climb — announced in #moe and met with dlwh's "Iris is fully operational, moe looks 🔥. i should take more weeks off."
MoE v7 1e22 closes out. Issue #3800 (Test MoE Arch at 1e21 and 1e22 Flop Scales) reached its conclusion this week with three reruns of the 1e22 d3200 run by @ClassicLarry (v2, v3, v4), totaling 1.08e22 model FLOPs each on v4-256. The key finding was the effect of capacity factor: the v4 run with cf=4.0 reached 38% MFU and Paloma c4_en/bpb 0.742, macro loss 2.290 on Paloma and 1.968 on uncheatable eval — all notably better than cf=1.0 (v2 at 19% MFU, macro loss 2.459). The v2 resume ran at just 19% MFU due to a checkpoint-loading path that was bisected to a crash in broadcast_one_to_all when trainable weights appear inside tuple[Block, ...]; a workaround was found that allowed all seven affected jobs to restart cleanly. The final actual macro loss of 2.432 came in above the preregistered prediction of 2.389 — @ClassicLarry attributed this to a too-optimistic irreducible asymptote in the three-point fit (asymptote of -0.1 vs a realistic 1.7) and to using a non-isoflop datapoint for the 1e20 anchor.
Isoflop v16 sweep produces the 1e23 preregistered prediction. Issue #4447 ran MoE v16 isoflop sweeps from 1e18 to 3e20 FLOPs across model widths d512 through d2560 on v4-256, filling in the scaling law needed to size the big run. The curves were smooth and the near-term fit was validated: the law predicted Paloma c4_en/bpb 2.598 at 1e21, and the actual d2560 run came in at 2.599. Optimal sizing for 1e23 was ambiguous due to flat isoflop curves, so the team split the difference between two fits: 131B total / ~16B active, 48 layers, d5120, ~1.02T tokens. For the preregistered Paloma macro loss prediction at 1e23, free-asymptote fits appeared too optimistic due to systematic underestimation at smaller scales; fixing L∞ = 1.6 gave stable cross-validated predictions of ~2.25 across all training-set sizes from 1e18–1e19 onward (scaling law: macro(C) = 1.6 + 95.18 · C^-0.094). A parallel critical batch size sweep in #4432 fit CBS ~ 1.87e-5 · C^0.388, projecting a critical batch size of ~15,500 (63M tokens/batch) at 1e23; the planned batch size of ~8M tokens sits well below that threshold. The preregistration discipline is itself part of the house style — when Meta announced Muse Spark in #news, willheld's first reply was "not preregistered... doesn't count 🤣".
The 1e23 run launches, with an early crash and a live restart. Two attempts at the full-scale run appeared this week on v4-512. moe_1e23_apr10_bs2048_ep8_ragged crashed after 7B tokens with loss diverging to 8.89 at 17% MFU; the ragged all-to-all capacity clipping bug fixed in #4359 was one contributing factor. The successor, moe_1e23_d5120_bs2048_ep4_ring, launched Friday on v4-512 with ring-style expert parallelism and is live as of this summary, having processed ~20B of the target ~1.2T tokens at 14% MFU (still in early warmup). Both runs use the d5120 geometry (131B total / ~16B active) from the isoflop sizing recommendation and train on the Nemotron mix. MFU at this scale remains under scrutiny — #4283 is the April tracking issue for the v4-1024/2048 target, and on Sunday @chloe offered to run grug MoE vs. Megatron on H100 to pin down the cross-stack comparison.
NemotronTerminal SFT reproduction in progress. Two SFT runs are underway aiming to reproduce NemotronTerminal-32B (target: 27.4% on TerminalBench 2). Issue #4307 tracks exp4307, a Qwen3-32B SFT on the full NemotronTerminal corpus running on v5p-256 at 34% MFU; it appears as crashed in W&B at step ~1500 (train loss 0.361) and was resubmitted this weekend on v4-512 from scratch with tensor parallelism. At step 1500 (26% through training), @AlienKevin reported 17.6% on TB2 (13/74 tasks solved), putting it at ~65% of the released model's performance. The 8B companion, #4420 / exp4420, is running on v5p-32 at 42% MFU; an early EOS token bug (Levanter exported eos_token_id: 128001 instead of the Marin tokenizer's 128009) caused all step-1500 TB2 evals to time out — once fixed, re-eval returned 1% on TBLite and step 3000 reached 5%. TerminalBench validity itself was live context this week: Kevin Xiang Li reflected in #sft-agents on the Steinl et al. finding that the top three TB2 submissions are all cheating, noting "we have to constantly monitor agent traces to detect reward hacks and continuously harden these benchmarks."
| Run | User | Hardware(?) | Hours(?) | FLOP Budget(?) | Loss | BPB(?) |
|---|---|---|---|---|---|---|
| moe-v7-1e22-d3200-resume-2 | Larry Dial | TPU v4 (256 chips) | 0.2h | 1.08e22 model / 5.59e22 HW (19%) | — | 0.750 |
| #3800 moe-v7-1e22-d3200-v3 | Larry Dial | TPU v4 (256 chips) | 0.3h | 1.08e22 model / 2.88e22 HW (37%) | — | 0.738 |
| #3800 moe-v7-1e22-d3200-v4 | Larry Dial | TPU v4 (256 chips) | 0.3h | 1.08e22 model / 2.84e22 HW (38%) | — | 0.733 |
| #4447 moe_1e23_d5120_bs2048_ep4_ring | Larry Dial | TPU v4 (512 chips) | 1.2d | 2.06e21 model / 1.45e22 HW (14%) | — | 1.079 |
| #4447 isoflop-moe-v16-1e+21-d2560 | Larry Dial | TPU v4 (256 chips) | 1.0d | 1.13e21 model / 7.52e21 HW (15%) | — | 0.805 |
| exp4307_nemotron_terminal_qwen3_32b_32768tok_v5p256-ec2c31 | Kevin Li | TPU v5 (128 chips) | 2.6d | 2.30e21 model / 6.84e21 HW (34%) | — | — |
| #4447 isoflop-moe-v16-1e+21-d2048 | Larry Dial | TPU v4 (256 chips) | 1.0d | 1.19e21 model / 5.35e21 HW (22%) | — | 0.817 |
| exp4420_sft_marin_8b_instruct_terminal_corpus_full_32768tokens_v-b20d73 | Kevin Li | TPU v5 (32 chips) | 4.3d | 2.20e21 model / 5.28e21 HW (42%) | — | — |
| #4447 isoflop-moe-v16-1e+21-d2048-v2 | Larry Dial | TPU v4 (256 chips) | 0.2h | 1.19e21 model / 5.03e21 HW (24%) | — | 0.793 |
| #4447 #4359 moe_1e23_apr10_bs2048_ep8_ragged | Larry Dial | TPU v4 (512 chips) | 8.9h | 7.66e20 model / 4.49e21 HW (17%) | — | 4.032 |
| #4447 isoflop-moe-v16-1e+21-d2560-v2 | Larry Dial | TPU v4 (256 chips) | 0.3h | 1.13e21 model / 4.21e21 HW (27%) | — | 0.792 |
| #4447 isoflop-moe-v16-3e+20-d1280-v2 | Larry Dial | TPU v4 (64 chips) | 1.3d | 4.29e20 model / 4.14e21 HW (10%) | — | 0.852 |
| #4447 isoflop-moe-v16-3e+20-d1536 | Larry Dial | TPU v4 (64 chips) | 1.5d | 3.93e20 model / 1.94e21 HW (20%) | — | 0.837 |
| #4447 isoflop-moe-v16-3e+20-d2304-v2 | Larry Dial | TPU v4 (64 chips) | 2.9h | 3.49e20 model / 1.85e21 HW (19%) | — | 0.843 |
| #4447 isoflop-moe-v16-3e+20-d1792 | Larry Dial | TPU v4 (64 chips) | 1.4d | 3.75e20 model / 1.78e21 HW (21%) | — | 0.835 |