The preregistered 1e23 MoE run that launched last Friday moved mid-flight from v4-1024 with ring dispatch to v4-2048 with ragged expert parallelism at ep=8, running at roughly 20% better MFU on twice the hardware while tracking train loss step-for-step against the ring baseline. Getting there meant chasing a step-0 gradient explosion under ragged ep≥8 — a real bug the ring path had been quietly masking, fixed in #4867 — plus a small raft of Iris fixes that stopped reserved slices from being reaped out from under the run. As of this writing the run has consumed ~105B of ~1T budgeted tokens at Paloma macro_loss 2.73, roughly 10% through the preregistered #4697 budget against a target of 2.25. The one unresolved wrinkle @dlwh flagged: perplexity evals are meaningfully worse at matched train loss on the ragged path, particularly on github, which he suspects is an eval-time mismatch or overflow under imbalanced routing rather than training corruption.
The Iris log pipeline collapsed on the 16th: per-segment (min_key, max_key) aggregates pruned nothing because every segment spans every user, and the _MAX_PARQUETS_PER_READ=25 cap started silently hiding rows. Three consecutive PRs rebuilt it — lifted the cap with newest-first early stop, then dropped it entirely in favor of DuckDB row-group stats — and a Cloud Run status page aggregating ferry CI, Iris reachability, and job state finally lit the cluster's health up from outside. The Ray-to-Iris migration quietly reached "Tony is the only user left" status — @willheld told him to patch and reboot the old cluster himself — while Levanter's RayConfig was deleted and Evalchemy migrated over. SWE-ZERO pivoted from the completed 1B-token preregistered MVP to a 140B scale-out after prefix caching landed a 3.4× per-worker speedup; ~2M of 12.3M target rollouts done by Sunday, with a go/no-go midtraining validation on Marin-8B pending Monday. Outside the critical path, Nightshift landed four cleanup PRs unattended after a GitHub App token fix restored its CI checks, the canonical datakit pipeline's normalize stage was reshaped end-to-end around the Nemotron v1/v2 corpora, and the TPU and CW GPU canary lanes moved off their dedicated clusters onto warm production infrastructure, producing the first green TPU canary run in over a week.
Summary: Split from #4266.
The preregistered 1e23 MoE run #4697 that kicked off last Friday is now on the board. The initial v4-1024 ring-dispatch configuration — d5120, 129B total / 15.9B active, 120k steps on the Nemotron mix, predicting final Paloma macro loss of 2.25 per the #4447 isoflop extrapolation — was posted as a full job summary and moe_1e23_apr branch on the 13th. @dlwh spent the week migrating it to the intended v4-2048 / ragged-all-to-all path. As of Saturday, the reconstituted run is training on twice the hardware at roughly 20% better MFU, with train loss curves essentially identical to the ring-dispatch control. Perplexity evals come out measurably worse at matched train loss, which @dlwh suspects is an eval-time mismatch or overflow in the imbalanced-batch path rather than a training corruption — worth tracking, probably not a reset.
The migration surfaced a real bug: ragged expert parallelism at ep>=8 was exploding grad/norm/output_proj on step 0/1 while the ring path happily masked the failure #4746. Root cause turned out to be incorrect sender-side output offsets in _shard_a2a_params, fixed in #4867 with a ring-vs-ragged EP parity regression. A second finding: MoeAdamHHeuristic._compute_num_layers() was throwing away the computed depth and returning a hardcoded 48, so the smoke config wasn't actually the intended d5120/l51 shape — restoring the formula yields 49 layers for d=5120, matching the job-summary figure.
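The bug shape is compact enough to sketch. The helper below is illustrative (the width-to-depth formula is a stand-in, not the real MoeAdamHHeuristic internals); the only facts carried over from the PR are that the computed depth was discarded in favor of a hardcoded 48 and that the restored formula gives 49 layers at d=5120.

```python
def _depth_from_width(hidden_dim: int) -> int:
    # Hypothetical stand-in for the real width->depth heuristic, chosen so d=5120 -> 49.
    return round(hidden_dim / 104)

def compute_num_layers_buggy(hidden_dim: int) -> int:
    num_layers = _depth_from_width(hidden_dim)  # computed...
    return 48                                   # ...then discarded for a constant

def compute_num_layers_fixed(hidden_dim: int) -> int:
    # The fix simply returns the computed depth; at d=5120 this is 49,
    # matching the job-summary figure the smoke config was meant to use.
    return _depth_from_width(hidden_dim)

assert compute_num_layers_fixed(5120) == 49
```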
The 1e23 run also stressed Iris hard enough to produce three infra fixes in its wake: #4823 keeps one v4-reserved/2048 slice warm (the autoscaler had been force-deleting reserved slices ten minutes after worker registration via the 600s idle path), #4793 raises worker-heartbeat RPC timeouts to cover the registered-but-never-advanced failure mode, and #4792 scales TPU bootstrap waits by pod size to handle the 255/256-healthy provisioning path. A follow-up issue #4822 tracks the deeper fix of teaching Iris to treat assigned/building work as slice activity so reserved slices aren't reaped out from under live jobs.
In parallel, the agent-driven architecture sweep kept closing out: GatedNorm scale factor, 2x expert granularity, slim-layer, router-combine-activation, x0 skip connections, pseudogram, backout, and router-bias-nonzero-init all landed their write-ups this week, with a handful more — paired-head attention #4907, partial RoPE #4946, and the AdamH-preserved GatedNorm init test #4904 — still open. Separately, #3902 finally merged, teaching Grug Muon to batch Newton-Schulz over stacked MoE expert weights and restore update sharding exactly once at the optimizer boundary. @Ahmed M Ahmed also flagged the Nemotron math split as a likely midtraining add once the main run is underway.
Summary: Improve Levanter's data store, fix K8s logging at scale, and address infrastructure gaps in the Iris dashboard and profiling.
The week opened with #4682 lifting Iris's log service out of the controller process and into a dedicated subprocess on port+1, closing #4673. That single structural change unleashed a cascade. #4708 restored a PushLogs proxy on the controller after older workers were caught silently failing to push (their cached /system/log-server resolution still pointed at controller:10000). #4795 moved worker log-handler attachment ahead of _register() so bootstrap failures — the kind that produced the 255/256 workers healthy dead end @dlwh filed as #4794 — actually ship searchable remote logs. #4711 taught DuckDBLogStore to reconcile its in-memory segment index against disk, so a vanished Parquet file no longer renders an entire key range permanently unreadable.
Then on the 16th the log store collapsed. #4833 diagnosed the root cause: _LocalSegment tracked only scalar (min_key, max_key) aggregates, but every segment on the cluster spans every user (/alice/... through /zoe/...), so file-level pruning narrowed nothing and the _MAX_PARQUETS_PER_READ = 25 cap was load-bearing — keys whose rows happened to sit outside the newest twenty-five segments became invisible. Users felt it immediately. Tony asked in #infra whether the dashboard was hanging for others, Larry reported jobs failing at step 1,615 / 12,649 with empty logs, and @ravwojdyla opened #4860 on a restarted job with no logs at all. The fix landed in two pieces: #4834 raised the per-read cap to twenty-five segments / 2.5 GB and added a newest-first early stop (p95 held under 500 ms on the 4.5 GB, 46-segment production corpus), then #4881 dropped the cap entirely by turning on DuckDB's enable_object_cache and letting per-row-group stats do the pruning. #4889 cleaned up two durability regressions flagged in review — a close() path that never uploaded when exactly one tmp was present, and a crash window between rename and unlink that could double-count rows — and #4939 gave compaction its own DuckDB connection after a 67-second compaction on Sunday stalled the dashboard with 504s.
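In sketch form, the pruning failure looks like this (names are illustrative, not the real _LocalSegment or DuckDBLogStore API): when every segment's key range covers every user, the overlap test keeps every file, so any fixed per-read cap silently hides whichever rows happen to live outside the segments that survive the cut.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str
    min_key: str
    max_key: str

def candidate_segments(segments, lo, hi):
    # File-level pruning: keep any segment whose [min_key, max_key] overlaps [lo, hi].
    return [s for s in segments if s.max_key >= lo and s.min_key <= hi]

# Every segment on the cluster spans the full user range...
segments = [Segment(f"seg-{i}.parquet", "/alice/", "/zoe/") for i in range(46)]
# ...so pruning keeps all 46 files for any query, and a 25-file read cap then drops
# the remaining segments, including the ones that actually hold the matching rows.
assert len(candidate_segments(segments, "/tony/job-7/", "/tony/job-7/~")) == 46
```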
Around the log-store core, @rjpower rewrote LogPusher in #4866 with a dedicated drain thread, failure-driven re-resolution, and a 10k head-of-queue buffer that only drops oldest-first on overflow; #4864 unifies push and fetch behind a single IrisLogClient. #4717 pushed dashboard log filtering and level to the server so rare-string searches at small maxLines actually return matches. #4863 dropped the dead SQLite logs table that had been silently returning zero rows to every iris query. #4935 shipped an RpcStatsCollector and a StatsService endpoint — per-method counters, a fixed-bucket latency histogram, and ring-buffered slow-tail plus discovery samples — closing out the last of the observability holes that made the week's outages hard to pin down from outside. By Sunday night Russell posted in #infra that logs were back up: "it turns out that you guys can write >10m logs a minute which was... not quite accounted for."
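The overflow policy in the new pusher is worth a one-screen sketch (illustrative class, not the real LogPusher): a bounded buffer that the drain thread empties, discarding the oldest records only when producers outrun the push path.

```python
import collections
import threading

class BoundedLogBuffer:
    """Illustrative sketch of a 10k buffer that drops oldest-first on overflow."""

    def __init__(self, capacity: int = 10_000):
        # deque(maxlen=...) discards from the head (oldest) when a new record
        # arrives at capacity, which is exactly the described overflow behavior.
        self._buf = collections.deque(maxlen=capacity)
        self._lock = threading.Lock()

    def append(self, record: str) -> None:
        with self._lock:
            self._buf.append(record)

    def drain(self) -> list[str]:
        # Called by the dedicated drain thread; returns and clears the backlog.
        with self._lock:
            records = list(self._buf)
            self._buf.clear()
        return records
```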
On the dashboards side, @ravwojdyla-agent stood up #4649, a new IAP-gated Cloud Run status page aggregating ferry CI, main-branch build health, Iris reachability, worker counts, and job state — roughly 1.3k lines of Node/Hono plus a React/Jotai frontend. #4699 fixed a /api/workers 504 traced to inline AbortSignal.timeout() objects being GC'd before firing under Node 20 undici. Follow-ups added a CW ferry card, auto-deploy, amber flags for ferry runs slower than mean + 1σ of the preceding seven successes, and a distinct stripe for cancelled runs so they stop reading as "unknown." @Helw150 landed dark mode and a clearer autoscaler overview in #4660, then fixed a band-percentage bug in #4688 where slices running multiple bands were double-counted so shares could sum past 100%. The Levanter store/cache tracking issue #4445 remains open — no code changes this week — but the logging foundation the epic shares with it is visibly sturdier than it was seven days ago.
Summary: Tracking issue for April MFU work.
The week's MFU story converged on a single number. On Sunday @dlwh reported that the reconstituted 1e23 run, moved from v4-1024 ring to twice as much hardware on v4-2048 with the ragged_all_to_all dispatch at ep=8, is now pulling roughly 20% better MFU than the ring baseline, with train loss nearly identical across the two trajectories. The caveat, worth flagging: perplexity evals are meaningfully worse under the ragged path even at matched loss, with GitHub perplexity particularly degraded; @dlwh suspects an eval-condition mismatch, an overflow, or a subtle bug under imbalanced routing rather than anything that warrants a run reset.
Getting the ragged path to that point took the week's code work. #4746 caught the 1e23 smoke exploding on step 0/1 whenever ragged expert parallelism was exercised at ep>=8 — grad/norm/output_proj jumping from ~0.6 to the 8k-330k range while ring stayed stable, which meant the branch could silently mask the failure instead of smoking it out. @dlwh's follow-up #4752 against moe_1e23_apr restores the heuristic layer formula (d=5120 gives 49 layers, not the hardcoded 48), adds an explicit MoeImplementation knob so the smoke actually defaults to ragged on v4-2048 in us-central2, and plumbs Iris production priority through grug dispatch and Fray. Upstream of that, #4636 landed earlier in the week: it ported MoeAdamHHeuristic from the isoflop branch so launch.py can derive (model, optimizer, batch, steps) from a compute budget plus hidden_dim, dropped the initial-dense-layer path to match the isoflop architecture, and fixed a sharding crash in align_kv_heads by replacing jnp.repeat with a reshape+broadcast under abstract-mesh contexts.
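The align_kv_heads change reduces to a small JAX idiom. The shapes and function name below are illustrative, but the technique is the one described: express per-head replication as a broadcast plus reshape instead of jnp.repeat, whose gather the partitioner cannot reason about under an abstract mesh.

```python
import jax.numpy as jnp

def repeat_kv_heads(kv, n_rep: int):
    # kv: [batch, seq, kv_heads, head_dim] -> [batch, seq, kv_heads * n_rep, head_dim]
    # Equivalent to jnp.repeat(kv, n_rep, axis=2), but written as broadcast + reshape
    # so each KV head is replicated without materializing a gather.
    b, s, h, d = kv.shape
    kv = jnp.broadcast_to(kv[:, :, :, None, :], (b, s, h, n_rep, d))
    return kv.reshape(b, s, h * n_rep, d)

x = jnp.ones((2, 16, 4, 8))
assert repeat_kv_heads(x, 4).shape == (2, 16, 16, 8)
```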
The H100 side of the epic got its first real numbers too. @chloechiaw posted a head-to-head of Grug ring EP against Megatron DeepEP on 8xH100 across four MoE shapes: the Qwen3-30B anchor (128 experts, topk=8, h=2048, ffn=768) lands at 13.1 ms vs Megatron's 12.41 ms, a 6% gap; the 32- and 64-expert shapes at the same hidden/ffn actually run faster on Grug (0.23x and 0.57x of Megatron's per-step time); and the Qwen3-235B anchor is the outlier at 3.48x slower (40.9 ms vs 11.75 ms), which is the shape that now needs a closer look before the H100 track can claim parity at the geometries that matter for the 1e23 milestone.
Summary: Split from #4266.
With last week's git worktree race behind it, the Nightshift scout pipeline settled into a daily cadence: cleanup PRs landed on the 13th, 14th, 15th, and 17th, each one a cross-subproject harvest of dead code, duplicated helpers, and convention violations found by parallel scout worktrees. #4691 pruned four unused utilities from levanter and unified the TPU slice-selection logic in scaling_laws/tpu_utils.py behind a new TpuSpec dataclass. #4729 fixed a quietly broken _is_local_leader() where atexit.register calls sat after return statements and so never cleaned up temp lock files, plus an os.system error branch that could never fire. #4777 deleted roughly 700 lines of RL math-grading code that had already been re-homed under tinker_environments, and #4861 tidied up after the shuffle-format rewrite by removing the ScatterShard alias and its thin wrapper. Four days, four merges, no reverts.
Two infrastructure PRs tightened the bot's operating envelope. @rjpower discovered that Nightshift PRs were merging with no checks at all — the default GITHUB_TOKEN does not trigger workflow runs — and in #4781 swapped both Nightshift workflows over to a minted claude-nightshift GitHub App token so pull_request workflows fire normally. The effect was immediately visible: #4861 is the first multi-cleanup authored by claude-nightshift rather than github-actions. Separately, @dlwh's #4763 taught the pull-request skill to prefer pushing branches onto the main repo when permissions allow, falling back to a fork only when necessary — one fewer source of friction when a subagent has something to say.
Agents showed up elsewhere in the week's chatter, though under the Codex and Claude Code banners rather than Nightshift's. Eric Czech reported that a Claude-supervised babysitter reading only iris job state kept an eight-run sweep alive for roughly twelve hours unattended; letting the same agent debug around iris outages did not go nearly as well, and sporadic iris latency still trips up agents that poll frequently. On the levanter side, rohithck noted that Codex one-shotted a vLLM fix, and Ahmed M Ahmed spent a multi-day session narrowing a v5p/v6e LoRA divergence with Codex as the scribe — it turned out that step-0 LoRA B gradients differ elementwise between matched bf16 runs. The supporting cast of workflow irritations was also visible: Claude Code now blocks sleep loops, and dlwh watched Codex reboot a host without rebuilding the image, twice.
Summary: Lower priority / slack-time workstream covering workqueue, dev-tpu replacement, and observability.
The observability surface took its lumps this week, which generated the usual crop of wishlist items. iris.oa.dev is now the shareable production dashboard — @rav reminded folks they no longer need to port-forward, and links to jobs and logs can simply be pasted into Discord (url). But the dashboard also spent stretches of the week hanging, returning 504s on FetchLogs, or dropping log tails entirely while the controller was under load; @Russell Power traced Saturday's missing-logs episode to users writing more than 10M log lines a minute, which was "not quite accounted for" (url). That operational pain is what motivated most of the new issues this week.
On the persistence and auditability axis, @ihodes filed #4839 to keep historical cluster state — jobs, runtimes, retries, accelerators, disk, CPU — rather than only the live snapshot the dashboards render, and @rjpower filed #4895 to capture a full audit log of controller actions (task assignments, job ingestion, scaling changes), on the premise that offline log replay would make the intermittent overload incidents far easier to load-test against. @rjpower also opened #4840, proposing that the CreateJob API reject Iris clients more than two weeks stale so backwards-compatible shims can be retired on a predictable cadence. None of the three have discussion yet.
Two smaller usability items rounded out the week. @Helw150's #4747 — documenting the intended semantics of the PRODUCTION, BATCH, and INTERACTIVE priority queues, including a CLI warning when users reach for PRODUCTION — was filed and closed in short order. And @hammer's #4820 asks whether Marin should track Batch Heterogeneity (the max-microbatch-loss vs. mean-loss gap reported in the Arcee Trinity tech report) as a training-instability canary, motivated by the Delphi loss spikes earlier in the cycle; it is open, without comments.
Summary: Define canonical data pipelines for all data ingestion: download -> normalize -> dedup/quality -> tokenize.
The normalize stage of the canonical datakit pipeline was reshaped end-to-end this week. #4886 collapsed the per-subdirectory fanout into a single zephyr pipeline that walks everything under input_path and writes a flat outputs/main/ layout, and #4876 split main and duplicate records into separate parquet streams so downstream steps read unique records without post-filtering. #4761 made in-normalize exact dedup optional via a DedupMode enum (leaving room for minhash later), and #4893 decomposed fuzzy dedup into two datakit-shaped jobs — compute_minhash_attrs emitting per-source minhash buckets and compute_fuzzy_dups_attrs producing per-source cluster-annotation attribute trees. Smaller refinements rounded out the shape: #4884 stamped a schema version on NormalizeResult, #4890 exposed max_workers on normalize and consolidate, #4698 silently filtered empty and whitespace-only text (including \xa0) instead of crashing, and #4603 began compacting pathological whitespace runs at 128 chars to survive the multi-MB space runs that HTML-to-text extraction occasionally produces. @ravwojdyla-agent drove essentially all of this.
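The whitespace compaction is the easiest of these to show in miniature. The 128-char threshold is from #4603; the regex and function name below are illustrative, not the normalize code.

```python
import re

_WS_RUN = re.compile(r"\s{128,}")

def compact_whitespace(text: str, keep: int = 128) -> str:
    # Truncate any whitespace run longer than `keep` characters down to `keep`,
    # so a multi-MB run of spaces from HTML-to-text extraction cannot balloon a record.
    return _WS_RUN.sub(lambda m: m.group(0)[:keep], text)

doc = "header" + " " * 3_000_000 + "body"
assert len(compact_whitespace(doc)) == len("header") + 128 + len("body")
```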
Several of these changes were forced by what @ravwojdyla-agent found running a 1.3B-record nemotron-v1 normalize at 6307-way shuffle. #4818 traced ~11% empty output shards to a double-hash in scatter routing, switched deterministic_hash from adler32 to xxh3_64, and added a [0, N) int passthrough; #4819 stopped pre-hashing hex ids to ints inside normalize entirely. #4853 deleted the coordinator-side scatter-manifest consolidation that had been a single point of failure (reducers now read per-mapper sidecars through a 32-thread pool), and #4887 pulled per-shard slicing into the sidecar worker thread so reducer RSS fell from roughly 16 GB to 200 MB. Underneath, #4782 replaced the parquet-based shuffle with flat zstd frames plus byte-range sidecars (Arrow is now out of the shuffle data plane) and @rjpower's #4695 swapped pickle+zstd spill files for parquet with a background I/O thread. #4579 gave the shuffle finite patience — tasks now abort after three shard-level failures instead of re-queueing forever on a deterministically OOMing shard — and #4784 added a 10 GB shuffle integration lane on marin-dev with uniform-vs-skewed matrix coverage.
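The routing-hash shape after #4818/#4819 is worth pinning down in a sketch (names are illustrative, not the real deterministic_hash): the key is hashed exactly once with xxh3_64 and reduced mod the shard count, ints already in [0, N) pass straight through, and nothing upstream pre-hashes ids, so no value is ever hashed twice.

```python
import xxhash  # xxh3_64 in place of the weaker adler32 checksum

def shard_for(key, num_shards: int) -> int:
    # [0, N) int passthrough: an already-routed index maps to itself.
    if isinstance(key, int) and 0 <= key < num_shards:
        return key
    # Hash the original key once. Hashing an already-hashed value (the double-hash)
    # is what skewed routing and left ~11% of the output shards empty.
    return xxhash.xxh3_64_intdigest(str(key)) % num_shards

assert shard_for(17, 6307) == 17
assert 0 <= shard_for("a3f9c0d2b4e1", 6307) < 6307
```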
On the tokenize side, @rjpower's #4658 deleted the tokenize-filescan zephyr job entirely: fsspec.glob(detail=True) returns file sizes from the same list-objects call, so the 32 distributed workers that used to stat files individually are gone and a 2,755-file, 1 TB nemotron shard now resolves in about two seconds. @nevillelyh's in-flight #4814 is an early sketch toward removing the consolidate step from tokenize in favor of a ShardedTreeCache virtual tree for downstream readers. #4758 rewrote consolidate itself as a chain of sorted_merge_join ops — map-side, no shuffle, leaning on the datakit invariant that attribute files share input partitioning 1:1 and are id-sorted. @rjpower's #4721 fixed a 3-8x row inflation in distributed_scan where staging directories were never truncated between runs, which had been making storage reports look wildly more expensive than the billing dashboard.
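The filescan deletion rests on one fsspec behavior, sketched here with a placeholder bucket and pattern: glob(detail=True) returns per-path info dicts, sizes included, from the same listing pass, so no per-file stat workers are needed.

```python
import fsspec

fs = fsspec.filesystem("gs")
# detail=True returns {path: info_dict}; "size" comes from the list-objects call
# itself, so a multi-thousand-file shard resolves in one pass rather than one
# stat per file across 32 distributed workers.
listing = fs.glob("example-bucket/nemotron/**/*.parquet", detail=True)
sizes = {path: info["size"] for path, info in listing.items()}
total_bytes = sum(sizes.values())
```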
The daily smoke ferry kept finding the rough edges introduced by the reshape. #4909 rewired the validator after #4886 and #4893 moved the output layout, #4910 de-flaked the multi-source fuzzy-dup test (two records with identical text collapse into one main row, so CI now accepts either survivor), and #4701 added per-step wall-clock timing to StepRunner after smoke-ferry profiling showed dedup eating 39% of the 80-minute run. @wmoss's #4731 stopped the fineweb exact-dedup experiment from downloading the full corpus when only the 10BT sample was processed. Two short operator runbooks landed — #4771 for the datakit smoke ferry and #4762 for ad-hoc ferry runs on the marin Iris cluster — capturing the SSH-tunnel and coordinator-preemption gotchas from recent manual runs. With the reshape nearly complete, #4892 is wiring nsf_awards and nemotron v1/v2 through download-normalize-tokenize; in #data-curation, @willheld and @ahmeda14960 were already discussing whether nemotron v2's math split is enough on its own for mid-training. And in #code-talk, the mirror-FS-vs-executor interaction surfaced as a real operational wrinkle: a cache copied east1 to east5 gets redone from scratch in eu-west because the executor checks step markers in its local region only, not whether the output exists somewhere.
Summary: Measurable: Bolinas can `import marin` and use it as a library.
Nothing landed in this epic this week; last week's nightly PyPI publish of the marin-* packages is still the current state of the world. The only stir was on the long-open prototype #2477, where @rjpower pinged @ryan-williams after a design-doc update to say he has unpushed wiring changes and that the initial attempt looks promising but needs cleanup and trial usage. The demo-repo tracker #4472 remains closed with no new work, and no import marin consumer surfaced in Discord — the Bolinas-labelled jobs on the Iris cluster this week are unrelated DNA scaling runs, not the downstream project this epic is named for.
Summary: Measurable: canary ferry pass rate consistently above 90%.
The canary fleet ended the week decisively greener than it started, thanks to two structural fixes that pulled the smoke runs out of the flaky marin-dev / dedicated-canary-cluster lanes and onto the warm production infrastructure. @rjpower's #4739 moved the TPU canary ferry to the production marin cluster with --priority production, after controller-SQLite forensics on run 24332480667 showed the previous day's "hung" canary had actually been sitting in a 6h 5m reservation queue on marin-dev before GHA's 240-minute timeout killed it mid-wait. The companion #4744 retired the separate CoreWeave canary cluster entirely and reused the warm iris-ci controller + H100 nodepool, unblocking a run of 15 consecutive failures since March 31 caused by a single US-WEST-04A H100 quota being pinned by the permanent CI nodepool.
The numbers moved accordingly. The TPU canary went 0-for-5 (all cancelled on GHA timeout) from Apr 13–17 and then landed two clean greens on Apr 18 and Apr 19, the first scheduled successes on this lane in over a week. CoreWeave GPU canary flipped from 0/7 the prior week to 5/7 this week (success on every run from Apr 15 onward). The datakit smoke lane was the outlier: @ravwojdyla's #4787 pointed it at the prod cluster at production priority on Apr 15, but it then regressed Apr 18–19 when #4876's normalize output-layout change (normalize/outputs/main/ instead of normalize/*.parquet) silently broke validate_ferry_outputs.py — root-caused and filed as #4908, closed within hours on Apr 19.
Supporting plumbing filled in around the edges. #4831 added cancelled() alerting across all three canary workflows and split Slack-notify into its own job with a fresh timeout budget, so the next GHA-timeout cancel will actually page instead of vanishing; #4854 set MARIN_PREFIX=gs://marin-us-central1 on the TPU canary's off-GCP validate step to stop tracker_metrics.jsonl 404s; and #4856 cleaned up SSH auth for the prod marin workflows. On the gating side, #3506 (conservative metric thresholds on both MoE canaries) closed. Two cluster-stability items remain live — #4776 (@dlwh's RCA traced an Iris cloud-smoke restart_worker flake to a bookkeeping gap where workers register before the TPU LRO settles, not to the reserved-TPU timeout change as first suspected) is still open, and Russell's manual iris controller restart cadence continued in #infra with restarts on Apr 13, 14, 16, 17, and 19. The prod canary surface is green; the dev cluster under it still is not.
Summary: All jobs run through Fray+Iris.
The Ray endgame kept grinding forward. #4815 (@yonromai) deleted RayConfig from Levanter's TrainerConfig entirely — every call site already passed auto_start_cluster=False, so the field was inert — and ported marin/evaluation/visualize.py off fray v1 onto current_client().submit(JobRequest(...)), dropping the subprocess/ExceptionInfo wrapper the v1 flow needed. That single PR closed #4639 ("do RL helpers need porting?") and #4640 ("does evaluation: logprob jobs?"), both of which had been open questions about whether the leftover code was worth migrating or safe to delete. #4742 followed up on the documentation side, scrubbing Ray callouts from the babysit-job, dev-tpu, and agent-research skills and deleting the legacy dev-tpu-ray skill outright, closing #4740. Off the epic proper but in the same spirit, @teetone landed #4894, migrating Evalchemy to Iris.
Iris itself continued to absorb the load with visible bruising. #4725 (@rjpower) fixed a fray v2 bug where create_actor_group was calling self._iris.submit() without an environment argument, so device defaults like JAX_PLATFORMS="" never reached actor jobs; combined with Iris's parent→child env inheritance, a CPU coordinator spawning a TPU actor would produce a process with JAX_PLATFORMS=cpu and JAX would refuse to initialize the TPU backend. The fix closed #4714 (@RohithKuditipudi), who also filed #4728 after noticing that the paths-filter on fray-unit-tests.yaml only triggers on lib/fray/** changes — an iris commit on 2026-04-12 had silently broken five fray assertions for two days before surfacing on #4725. The controller needed multiple reboots over the week (url, url, url), and Russell Power noted that worker/job storms still overwhelm it — load-balancing and client-side retry tweaks are slated for next week.
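The env-inheritance failure is easy to reproduce in miniature. The helper below is illustrative, not the fray or Iris API; it only shows why a missing environment argument plus parent-to-child inheritance leaves a TPU actor stuck on the coordinator's JAX_PLATFORMS=cpu.

```python
def child_env(parent_env: dict, overrides: dict | None) -> dict:
    # Parent->child inheritance: the child starts from the parent's environment,
    # then applies any explicit overrides from the job request.
    env = dict(parent_env)
    env.update(overrides or {})
    return env

coordinator_env = {"JAX_PLATFORMS": "cpu"}  # CPU coordinator process
# Bug shape: the actor-group submit passed no environment, so the TPU actor
# inherits cpu and JAX refuses to initialize the TPU backend.
broken = child_env(coordinator_env, None)                   # {"JAX_PLATFORMS": "cpu"}
# Fix shape: pass the device default through so the actor re-detects the TPU.
fixed = child_env(coordinator_env, {"JAX_PLATFORMS": ""})   # {"JAX_PLATFORMS": ""}
```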
On the Discord side, the "single user left" moment arrived. Tony asked if the Ray cluster could stay up through the end of the month to finish an SFT research project; willheld replied bluntly that the team no longer has the bandwidth to debug Ray's autoscaler ("Ray stopped supporting the autoscaler themselves in preference of their Kubernetes offering") and that Tony is the only user left, so he's free to patch and reboot the cluster himself without worrying about disrupting anyone. Tony traced the issue to max_workers semantics and unblocked himself. Ray is not yet literally deleted from the repo — #4453 still tracks six files with direct import ray (classification actors, executor, Levanter distributed, vLLM TPU detection) plus roughly two dozen fray-v1 consumers in RL, evaluation, and tests — but no human workflows depend on it anymore. The week's other new work, #4827 and #4828 (both @yonromai), stakes out the next phase: route all evaluators through an OpenAI-compatible HTTP contract so Levanter and vLLM become deployment details, which unblocks deleting levanter_lm_evaluation_harness and the last Ray-coupled evaluator paths.
The SWE-ZERO arc pivoted from preregistered MVP to pretraining-scale generation this week. The 1B-token run #4666 closed on Monday at 96,237 rollouts across 32,079 PRs and 20 languages, for roughly 1.1B content tokens and a 38.6% submission rate. @AlienKevin computed within-PR Jaccard similarity at 0.058 on a 19K-rollout sample, confirming the model explores diverse strategies rather than memorising one path per instance. The HuggingFace release was renamed AlienKevin/SWE-ZERO-96K-trajectories to match the actual count, and the 1K-PR pilot on the expanded SWE-rebench V2-PRs schema #4710 closed the same day after validating ingestion of the 122,910-PR corpus at a 25.4% submission rate.
With the MVP retired, the team launched the 140B-token scale-out #4719: 122,910 PRs times 100 rollouts each, targeting 12.3M trajectories. Progress moved from roughly 5% on Tuesday to 16.6% by Sunday morning - 2.04M rollouts and an estimated 23.3B tokens committed. Most of the movement came from two throughput fixes @AlienKevin landed mid-week, documented in c5a806cc9. Profiling 6,667 real rollouts showed vLLM prefill consuming 84.6% of per-rollout time because prefix caching was off and each of ~26 turns re-prefilled a growing ~194K-token context. Enabling --enable-prefix-caching yielded a measured 3.4x per-worker speedup (600 to 4,600 rollouts per batch-hour), and dropping MAX_TURNS from 30 to 15 doubled throughput again by finishing rollouts before preemption. Supporting work included a watchdog that aborts tasks when vLLM exits (earlier runs had leaked ~29% "Connection error" rollouts), a shard-lease disable that unblocked relaunches, a memory bump to 32GB, and a persistent 5-minute relaunch loop with interactive-priority anchor batches. The SWE-ZERO-12M-trajectories dataset now sits at 1.45M clean trajectories across 14,554 unique PRs after generation-time error filtering dropped the error rate from 15.5% to 6.0% between checkpoints.
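For reference, the prefix-caching switch is a one-liner in either the CLI (--enable-prefix-caching) or the Python engine; the model name and sampling settings below are placeholders, not the SWE-ZERO configuration.

```python
from vllm import LLM, SamplingParams

# With the agent re-sending a growing shared context every turn, cached prefill is
# where the reported 3.4x per-worker speedup came from: without it, each of ~26
# turns re-prefills the full ~194K-token prefix.
llm = LLM(model="Qwen/Qwen3-8B", enable_prefix_caching=True)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["<shared repo context>\n<turn 1 instructions>"], params)
```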
Quality signal arrived on two fronts. The ConTree execution-based evaluator #4683 completed on all 151 Python instances available as swerebench images, yielding pass@1=6.0% and pass@3=11.3% - stricter than the earlier 13% LLM-judged estimate, as expected when non-submitted rollouts count as failures. @AlienKevin also derived a submission-based pass@k curve: at k=100 rollouts per PR, 90.6% of PRs produce at least one submitted patch, with diminishing returns past k=20. And a go/no-go validation opened on Friday #4898 to midtrain Marin-8B base on a 100K-trajectory subset and measure SWE-bench Verified before and after, with results due Monday April 21. The config (exp4898_sft_marin_8b_swe_zero.py) picks a 1e-4 learning rate following the weight-decay interaction analysis from #4420 and sets context to 32K to match generation. By Sunday morning the 50K run was 39% through training with loss down from 2.86 to 0.286, and the baseline eval of Marin-8B on SWE-bench Verified landed at 0/16 resolved - a clean zero-floor from which any non-zero post-training result is informative. The baseline rollout also surfaced a plumbing fix: the InstalledAgent version of mini-swe-agent ran the CLI inside the Daytona sandbox and could not reach host-side vLLM, so @AlienKevin rewrote MiniSweAgentV1 as a BaseAgent that wraps the official DefaultAgent with litellm and Harbor adapters.
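The submission-based pass@k curve follows from a standard estimator, sketched here with illustrative counts rather than the SWE-ZERO data: for a PR with n rollouts of which c were submitted, it gives the probability that at least one of k sampled rollouts is a submission, and averaging across PRs traces the curve.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of P(at least one of k sampled rollouts is a submission)
    # given c submissions among n rollouts for a single PR.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# (rollouts, submissions) per PR; illustrative numbers only.
per_pr = [(100, 38), (100, 0), (100, 5), (100, 71)]
curve = {k: sum(pass_at_k(n, c, k) for n, c in per_pr) / len(per_pr)
         for k in (1, 5, 20, 100)}
```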
Around the SWE-ZERO core, the adjacent synthetic-data surfaces kept moving. The execution-trace pipeline from SWE-rebench-V2 #4383 was rewritten by @Helw150 onto ConTree's SDK, eliminating the Docker daemon and Iris-privilege path; outputs now land as per-test parquet rows. The soft-proxy investigation for agentic benchmarks #4389 updated with a mixed read: @RohithKuditipudi reported that positive-trace loss and success-failure gap gave weak or non-monotonic signal on a 15-model MATH study, and @dlwh pushed back with the "dumb" baseline of predicting CORRECT/INCORRECT on complete traces, which Rohith flagged as worth a filtered re-examination. @taivu1998 opened #4858 to add Hermes trace support to the SFT pipeline, introducing message_postprocess_fn and row_id_fn hooks on TransformAdapter and registering the glm-5.1 and kimi splits with a trace-focused pilot experiment. The umbrella issue for diverse agentic traces #4435 got the consolidated status update: MVP, multilang, 1B and V2-PRs pilot all closed; 140B in flight; ConTree eval live.
Summary: We will need 20T of high-quality (including / in particular code) tokens for our large MoE runs in Q2/Q3; this is the work in March that we will do to enable that.
With the 1e23 MoE run now leaning on the Nemotron mix, this week's data work split into three concurrent threads. On the midtraining side, @ahmeda14960 opened #4927 to add experiments/midtraining_data_buckets.py, a math-focused registry grouped by LLM-provenance so cooldown experiments can pick a teacher-model comfort level and pull a pre-grouped set of ExecutorSteps. BUCKET_1 is three filter-only corpora (ProofPile 2, FineMath-3-plus, MegaMath web); BUCKET_2 adds four non-Qwen rewrites (the three nemotron_cc_math_v1 Phi-4 splits plus MegaMath web-pro's Llama-3.3-70B refinement); and BUCKET_3 opens up the full 32-dataset Qwen/QwQ surface, including every Nemotron v2 family (CC v2/v2.1/Code, Pretraining-Code v1+v2, Specialized-v1 minus stem_sft) and the MegaMath synthetic QA/translated-code/text-code splits. The registry is a pure re-export over the already-tokenized steps in gs://marin-us-central2/; the open downstream is wiring BUCKET_1_PLUS_2 into a Mantis-style 8B cooldown. Alongside it, #4892 from @ravwojdyla-agent adds datakit normalize steps for the Nemotron v1 (seven splits) and v2 families so that pipeline can feed those buckets end-to-end.
In parallel, @dlwh pushed forward the long-context data audit for exp2062. #4735 asks how much usable long context actually lives in Longmino and FinePDFs/FinePDFs-edu by length bucket, and what filtering or capping should apply before those corpora are allowed to dominate the 8B pilot's long-context slice; the current exp2062 mix still depends on hand-entered token counts in longmino.py and finepdfs.py. The scaffolding landed in #4738, a reusable CLI that samples raw docs for repetition/OCR/formatting heuristics and reads exact totals from tokenized-cache .stats.json when available, emitting summary.json, summary.md, and a stratified review_sample.jsonl; the in-region corpus run and the keep/cap/defer recommendation are still open. #4736 adds a companion track for a small Marin-owned OSS long-doc QA manifest over existing Ar5iv, Wikipedia, and Stack Exchange corpora, since NVIDIA's long-context post-training data is not cleanly ungated. On the eval gating side, #4737 wires RULER/NIAH task configs through the vLLM lm-eval path for 4k/8k/16k/32k/64k, fixes the 4k-default truncation that likely sank the prior #2064 attempt, and adds an exp2062 runner for the warmstart and phase checkpoints; full RULER suites were launched on April 14 as Iris jobs for Qwen3 8B and Marin 8B base, with results still pending at end of week.
On the upstream web-data side, @XenonMolecule posted a substantial update on #2351, the small-model raw-WARCs-to-training-tokens effort. Running seven pipelines against the same 3000 DCLM WARCs, the LLM-extraction path (Qwen3-8B) yields roughly 58.9B tokens versus 2.7B for Nemotron-CC, 2.7B for DCLM, 1.9B for Nemotron-CC without rephrase, and 817M for FineWeb-Edu; raw HTML sits at 3.63T and Resiliparse at 142.7B for scale. Projecting to the full 7.9M-WARC crawl implies a 2,642x scale-up and an inference bill on the order of 6.60e22 FLOPs, with a closed-form accounting worked out from Levanter's lm_flops_per_token. Preliminary 20T-target simulated-epoch sweeps on Delphi-style isoflop curves show DCLM yielding better Paloma loss than plain Nemotron at small scale, but the sampling factor caps Experiment B at roughly 7.6B tokens and cuts off the largest compute candidates — the working fix is bumping to 10k WARCs (the minimum DCLM comparison size) rather than pushing the target to a "sci-fi" 200T. On Discord, @Ahmed M Ahmed and @willheld debated whether Marin needs more math sources beyond Nemotron-distilled or Qwen-rephrased data; Will's read was that Nemotron 2's math data is sufficient for a strong math reasoning model, and Ahmed flagged the already-tokenized Nemotron-Math subsets on central2 as untouched in Mantis so far. Finally, @hammer opened #4915 to audit number tokenization — right-to-left parsing, place-aligned chunking, digit splitting — citing the HuggingFace number-tokenization blog and Arcee Trinity Large's note on pathological backtracking in the standard number regex.
The week's uncategorized traffic is dominated by a long tail of Iris hardening from @rjpower: a refactored heartbeat protocol split into focused Ping/StartTasks/StopTasks/PollTasks RPCs #4638 with the monolithic path retired in #4843, async heartbeat dispatch at concurrency 128 #4842, a periodic checkpoint moved onto its own thread #4847, terminal-task history eviction chunked across transactions #4845 #4850 #4851, autoscaler pending-hint caching #4848, and a worker health score with a reaper thread for failure-aware termination #4883. Correctness fixes trickled in alongside: unsatisfiable routing constraints are now rejected at submit time #4681, TPU capacity errors classified as quota-exhausted and retried #4670, a preemption race in heartbeat task kill #4690, split-heartbeat loops no longer crash on undefined state #4880, and bad-node stderr now promotes the worker to WORKER_FAILED #4798. Smaller touches — cascading worker_task_history deletes #4838, pinning the TPU network to default #4857, caching ListJobs endpoints #4703, documenting priority bands #4749, keeping a v4-reserved/2048 slice warm #4832 — round out the Iris picture, as does a parallel push from @ravwojdyla-agent to consolidate the controller store layer around process-scoped stores #4836 and add a slice lifecycle state machine for autoscaler transitions #4816.
Outside Iris, @taivu1998 landed a sweep of RL plumbing: a neutral objective runtime and data plane for the trainer #4766, a native Reasoning Gym adapter #4684, an OpenAI-compatible Prime verifier client #4825, RL LoRA training with merged-rollout serving and final adapter export #4826, and a first marin.test_time_scaling slice for math reasoning #4774. @eric-czech cherry-picked two Levanter fixes for the eval harness — a missing resource mapping in broadcast_shard #4911 and an unsupported implementation kwarg #4912 — and @yonromai rewired the harness through the MarinTokenizer protocol #4944. @rjpower also bumped pyrefly from 0.42 to 0.61 and regenerated the baseline #4801, fixed the real bugs that surfaced (169 → 129 diagnostics) #4804, and worked through four further cleanup clusters #4808 #4809 #4810 #4811. Six archived-Qwen3 speedrun submissions from @WhenWen arrived as a set — AdamC/AdamH, MuonC/MuonH, MuonRemez, and PRISM-Berkeley #4928 #4929 #4930 #4931 #4932 #4933. Rounding out the week, @dlwh flipped train/eval/LoRA/viz defaults over to array-first Grug datasets as part of gruggification #3314, @teetone migrated Evalchemy onto Iris and added an OlympiadBench Physics reasoning eval #4894, integration tests were deselected by default and a gated tokenizer test now skips gracefully #4757, an orphan-draft-release cleanup script unblocks the nightly wheel uploads #4780, and a faulthandler now ships in every Iris/Marin/Zephyr/Fray process so SIGSEGV/SIGABRT/SIGBUS produce tracebacks instead of silent exits #4743.
External GitHub contributors converged on the RL, speedrun, and eval-harness surfaces. @taivu1998 opened a run of draft PRs reworking Marin's RL stack — a trainer refactor onto an objective runtime and neutral data plane #4766, LoRA training and export support #4826, a hardening pass on the Prime verifier integration #4825, a native Reasoning Gym environment adapter #4684, and plumbing for Hermes-trace SFT #4858 and sample-only TTS math #4774. @WhenWen archived six Qwen3 speedrun submissions in one sitting — AdamC, AdamH, MuonC, MuonH, MuonRemez, and PRISM-Berkeley (#4928–#4933). @eric-czech landed two levanter fixes for the eval harness — resource partitioning #4911 and a stray implementation arg in the lm_eval call #4912 — and drove most of the discussion on #4678 about mixed Marin/HF tokenizer paths. @teetone migrated Evalchemy to Iris and added OlympiadBench Physics coverage in #4894; @nevillelyh opened a WIP to drop consolidate from tokenize #4814; @wmoss trimmed the fineweb download to the 10BT sample that's actually processed #4731.
The week's most substantive Discord threads clustered around scheduler semantics and the MoE run. Eric Czech's self-preemption mishap turned into a design conversation in #infra, where Will Held proposed that a user's INTERACTIVE jobs should preempt their own BATCH jobs — directly tied to the eval-harness region plumbing he was also debugging on #4911. Russell Power and Eric then mapped out multi-region sweep behavior and the Executor/MirrorFS interaction in a long infra thread, concluding the Executor abstraction needs a rethink in the Iris+MirrorFS world. On the MoE side, Will Held's writeup of quantile balancing on the Open Athena blog summarized the ongoing experiments in #moe; Larry posted the first 5% of the 130B/A29B loss curve, and dlwh noted at week's end that the reconstituted 1e23 run is now on twice the hardware at ~20% better MFU on the ragged_all_to_all path.
Seven new members arrived with self-introductions, spanning industry infra and student researchers: an ML research engineer from programmatic-advertising DLRM systems; a Scale AI engineer on agent systems (previously Two Sigma) focused on reasoning and interpretability; a Senior MLE at Bill with use-case GRPO and multimodal post-training experience; an MLE at a Tokyo fintech LLM startup; a graduating senior at Fayetteville State with NASA JPL and UNC Chapel Hill research on aerospace LLMs and radiotherapy VLMs; a fractional CTO now running Kagi's index team; and a second-year undergraduate focused on ML systems and performance. Jason Grey (greyleader77) brings context on large-scale crawl and index infrastructure from Common Crawl's 10PB pipeline, which intersects with the data-pipeline and retrieval work in the datasets epic. Kartik_S (Senior MLE at Bill) intersects with the active RL and post-training threads through his use-case-specific GRPO background. Tri Dao also joined the server on April 18 without a self-intro; his work on FlashAttention and Mamba intersects directly with this week's ragged-all-to-all expert-parallelism debugging and the broader MoE kernel surface.
News and research links in #news and #reinforcement-learning skewed toward tokenization and numerical handling — Julie Kallini's tokenizer thread, HuggingFace's number-tokenization blog and finephrase space, a digits-to-decisions paper that Jeff H turned into issue #4915 — alongside an arXiv trickle on depth-adaptive architectures, experience replay for RL, and a Nature piece on subliminal trait transmission during distillation.
The preregistered 1e23 MoE arc — a 129B-total / ~16B-active ragged mixture of experts predicted to land at paloma macro_loss 2.25 after ~1T Nemotron tokens (#4697, isoflop prediction from #4447) — dominated the week. The original moe_1e23_d5120_bs2048_ep4_ring launch on v4-1024 with ring expert dispatch clocked ~49B tokens at 14.2% MFU before being torn down mid-week so the training could migrate to a v4-2048 slice with ragged all-to-all. That migration required chasing through a step-0 gradient explosion with ep≥8 ragged dispatch #4746, a sender-side offset bug in _shard_a2a_params fixed in #4867, and a raft of Iris fixes (#4821 to keep a reserved v4-2048 slice warm, #4793/#4792 to raise heartbeat and bootstrap timeouts on the 2048-node pod).
The reconstituted moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_124933 came online Apr 17 on v4-2048 at ep=8 and, as of this writing, is still running — it has consumed 105B tokens at 16.4% MFU (roughly 20% better, in relative terms, than the ring baseline, which ran on half the hardware), with train loss tracking the ring run essentially step-for-step. The paloma macro_loss sits at 2.7283 and overall macro_loss at 2.6186 — nowhere near the 2.25 preregistered target yet, but only ~10% of the way through the 1T-token budget. @dlwh flagged a curious wrinkle on Discord: even at identical train loss the perplexity evals are a good bit worse than the ring baseline, "much, much worse on github," which he suspects is an eval-condition mismatch or overflow behavior under imbalanced routing rather than a run-resetting bug.
On the agentic SFT side, #4420 delivered the week's most sobering result: two epochs of SFT on the full 366K-example Nemotron-Terminal-Corpus brought exp4420-8b-v4-128-20260413 (Marin-8B Instruct, v4-128) to final loss 0.442 — a stubborn +0.082 gap to the Qwen3-8B reference that never closed — and only 1/89 = 1.1% on Terminal-Bench 2.0 versus 15.9% for the Qwen3-8B baseline at matched recipe. The v5p-64 twin crashed, the v4-128 run finished; the takeaway is that Marin-8B Instruct absorbs terminal-skill SFT roughly an order of magnitude worse than Qwen3-8B on identical data, plausibly tokenizer- or chat-template-bound. The 32B analog exp4307-32b-v4-512-tp2-20260413 (#4307, Qwen3-32B on v4-512 TP=2) crashed at step 892 / 15.6% after burning 28.5h, and the Marin-32B-base comparator exp4760_sft_marin_32b_base_terminal_corpus_15pct_32768tok_v5p64 #4760 is still running at step 210/859 on preemptible v5p-64 after its 128-chip sibling crashed off the OOM cliff.
A sweep of Michael Ryan's FM-natural curation ablations landed at the ~3e20 FLOP scale on v5p/v4-32 slices, putting four text-extraction and filtering pipelines head-to-head at the same compute. At the ~1B d1536 shape, curation-resiliparse-expFM_natural-3e+20-d1536-L16-B256 finished best with paloma macro_bpb 1.1234 and uncheatable macro_bpb 0.9494 — github_cpp and github_python bpb dropping to 0.83 and 0.92 is the clean tell that resiliparse is pulling code through where the other pipelines mangle it (FineWeb-Edu sits at cpp=5.56, python=5.29 at the small d512 shape). curation-dclm-expFM_natural-3e+20-d1536-L16-B256 finished at macro_bpb 1.1797, the Nemotron-full variant at macro_bpb 1.3325, and the tiny FineWeb-Edu d512 smoke run at 2.2024 — a reminder that programming-language coverage, not just English prose quality, is driving the curation gap at 1e23 scale.
| Run | User | Hardware(?) | Hours(?) | FLOP Budget(?) | Loss | BPB(?) |
|---|---|---|---|---|---|---|
| #4697 #4867 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_124933 | David Leo Wright Hall | TPU v4 (1024 chips) | 2.8d | 1.08e22 model / 6.61e22 HW (16%) | | 0.848 |
| #4697 pre-reg moe_1e23_d5120_bs2048_ep4_ring | Larry Dial | TPU v4 (512 chips) | 3.0d | 5.08e21 model / 3.57e22 HW (14%) | | 0.884 |
| #4307 exp4307-32b-v4-512-tp2-20260413 | Kevin Li | TPU v4 (256 chips) | 1.2d | 1.15e21 model / 7.00e21 HW (16%) | | 0.808 |
| moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_011404 | David Leo Wright Hall | TPU v4 (1024 chips) | 6.8h | 1.09e21 model / 6.55e21 HW (17%) | | — |
| #4420 exp4420-8b-v4-128-20260413 | Kevin Li | TPU v4 (64 chips) | 1.6d | 9.30e20 model / 2.33e21 HW (40%) | | 0.170 |
| #4760 exp4760_sft_marin_32b_base_terminal_corpus_15pct_32768tok_v5p64-775003 | Kevin Li | TPU v5 (32 chips) | 1.9d | 7.37e20 model / 1.58e21 HW (47%) | | — |
| curation-fineweb_edu-expFM_natural-3e+20-d512-L6-B2048 | Michael Ryan | TPU v5 (32 chips) | 1.4d | 3.00e20 model / 1.49e21 HW (20%) | | 2.029 |
| curation-dclm-expFM_natural-3e+20-d512-L6-B2048 | Michael Ryan | TPU v5 (32 chips) | 1.2d | 2.67e20 model / 1.32e21 HW (20%) | | 1.283 |
| curation-nemotron_full-expFM_natural-2e+20-d512-L6-B2048 | Michael Ryan | TPU v5 (32 chips) | 18.4h | 1.96e20 model / 1.24e21 HW (16%) | | 1.510 |
| curation-nemotron_full-expFM_natural-3e+20-d512-L6-B2048 | Michael Ryan | TPU v4 (32 chips) | 1.1d | 2.39e20 model / 1.18e21 HW (20%) | | 1.496 |
| curation-nemotron_full-expFM_natural-3e+20-d1536-L16-B256 | Michael Ryan | TPU v4 (32 chips) | 1.2d | 3.00e20 model / 1.12e21 HW (27%) | | 1.205 |
| curation-dclm-expFM_natural-2e+20-d512-L6-B2048 | Michael Ryan | TPU v5 (32 chips) | 20.4h | 2.00e20 model / 9.94e20 HW (20%) | | 1.265 |
| curation-dclm-expFM_natural-3e+20-d1536-L16-B256 | Michael Ryan | TPU v5 (32 chips) | 20.1h | 3.00e20 model / 9.55e20 HW (31%) | | 1.039 |
| curation-resiliparse-expFM_natural-3e+20-d1536-L16-B256 | Michael Ryan | TPU v5 (32 chips) | 20.1h | 3.00e20 model / 9.31e20 HW (32%) | | 0.998 |
| exp4760_sft_marin_32b_base_terminal_corpus_15pct_32768tok_v5p256-d7b7a8 | Kevin Li | TPU v5 (128 chips) | 7.3h | 4.13e20 model / 8.67e20 HW (48%) | | — |