The preregistered 1e23 MoE run that launched last Friday moved mid-flight from v4-1024 with ring dispatch to v4-2048 with ragged expert parallelism at ep=8, running at roughly 20% better MFU on twice the hardware while tracking train loss step-for-step against the ring baseline. Getting there meant chasing a step-0 gradient explosion under ragged ep≥8 — a real bug the ring path had been quietly masking, fixed in #4867 — plus a small raft of Iris fixes that stopped reserved slices from being reaped out from under the run. As of this writing the run has consumed ~105B of ~1T budgeted tokens at Paloma macro_loss 2.73, roughly 10% through the preregistered #4697 budget against a target of 2.25. The one unresolved wrinkle @dlwh flagged: perplexity evals are meaningfully worse at matched train loss on the ragged path, particularly on github, which he suspects is an eval-time mismatch or overflow under imbalanced routing rather than training corruption.
The Iris log pipeline collapsed on the 16th: per-segment (min_key, max_key) aggregates pruned nothing because every segment spans every user, and the _MAX_PARQUETS_PER_READ=25 cap started silently hiding rows. Three consecutive PRs rebuilt it — lifted the cap with newest-first early stop, then dropped it entirely in favor of DuckDB row-group stats — and a Cloud Run status page aggregating ferry CI, Iris reachability, and job state finally lit the cluster's health up from outside. The Ray-to-Iris migration quietly reached "Tony is the only user left" status — @willheld told him to patch and reboot the old cluster himself — while Levanter's RayConfig was deleted and Evalchemy migrated over. SWE-ZERO pivoted from the completed 1B-token preregistered MVP to a 140B scale-out after prefix caching landed a 3.4× per-worker speedup; ~2M of 12.3M target rollouts done by Sunday, with a go/no-go midtraining validation on Marin-8B pending Monday. Outside the critical path, Nightshift landed four cleanup PRs unattended after a GitHub App token fix restored its CI checks, the canonical datakit pipeline's normalize stage was reshaped end-to-end around the Nemotron v1/v2 corpora, and the TPU and CW GPU canary lanes moved off their dedicated clusters onto warm production infrastructure, producing the first green TPU canary run in over a week.
Summary: Split from #4266.
The preregistered 1e23 MoE run #4697 that kicked off last Friday is now on the board. The initial v4-1024 ring-dispatch configuration — d5120, 129B total / 15.9B active, 120k steps on the Nemotron mix, predicting final Paloma macro loss of 2.25 per the #4447 isoflop extrapolation — was posted as a full job summary and moe_1e23_apr branch on the 13th. @dlwh spent the week migrating it to the intended v4-2048 / ragged-all-to-all path. As of Saturday, the reconstituted run is training on twice the hardware at roughly 20% better MFU, with train loss curves essentially identical to the ring-dispatch control. Perplexity evals come out measurably worse at matched train loss, which @dlwh suspects is an eval-time mismatch or overflow in the imbalanced-batch path rather than a training corruption — worth tracking, probably not a reset.
The migration surfaced a real bug: ragged expert parallelism at ep>=8 was exploding grad/norm/output_proj on step 0/1 while the ring path happily masked the failure #4746. Root cause turned out to be incorrect sender-side output offsets in _shard_a2a_params, fixed in #4867 with a ring-vs-ragged EP parity regression. A second finding: MoeAdamHHeuristic._compute_num_layers() was throwing away the computed depth and returning a hardcoded 48, so the smoke config wasn't actually the intended d5120/l51 shape — restoring the formula yields 49 layers for d=5120, matching the job-summary figure.
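The bug shape is compact enough to sketch. The helper below is illustrative (the width-to-depth formula is a stand-in, not the real MoeAdamHHeuristic internals); the only facts carried over from the PR are that the computed depth was discarded in favor of a hardcoded 48 and that the restored formula gives 49 layers at d=5120.

```python
def _depth_from_width(hidden_dim: int) -> int:
    # Hypothetical stand-in for the real width->depth heuristic, chosen so d=5120 -> 49.
    return round(hidden_dim / 104)

def compute_num_layers_buggy(hidden_dim: int) -> int:
    num_layers = _depth_from_width(hidden_dim)  # computed...
    return 48                                   # ...then discarded for a constant

def compute_num_layers_fixed(hidden_dim: int) -> int:
    # The fix simply returns the computed depth; at d=5120 this is 49,
    # matching the job-summary figure the smoke config was meant to use.
    return _depth_from_width(hidden_dim)

assert compute_num_layers_fixed(5120) == 49
```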
The 1e23 run also stressed Iris hard enough to produce three infra fixes in its wake: #4823 keeps one v4-reserved/2048 slice warm (the autoscaler had been force-deleting reserved slices ten minutes after worker registration via the 600s idle path), #4793 raises worker-heartbeat RPC timeouts to cover the registered-but-never-advanced failure mode, and #4792 scales TPU bootstrap waits by pod size to handle the 255/256-healthy provisioning path. A follow-up issue #4822 tracks the deeper fix of teaching Iris to treat assigned/building work as slice activity so reserved slices aren't reaped out from under live jobs.
In parallel, the agent-driven architecture sweep kept closing out: GatedNorm scale factor, 2x expert granularity, slim-layer, router-combine-activation, x0 skip connections, pseudogram, backout, and router-bias-nonzero-init all landed their write-ups this week, with a handful more — paired-head attention #4907, partial RoPE #4946, and the AdamH-preserved GatedNorm init test #4904 — still open. Separately, #3902 finally merged, teaching Grug Muon to batch Newton-Schulz over stacked MoE expert weights and restore update sharding exactly once at the optimizer boundary. @Ahmed M Ahmed also flagged the Nemotron math split as a likely midtraining add once the main run is underway.
Summary: Improve Levanter's data store, fix K8s logging at scale, and address infrastructure gaps in the Iris dashboard and profiling.
The week opened with #4682 lifting Iris's log service out of the controller process and into a dedicated subprocess on port+1, closing #4673. That single structural change unleashed a cascade. #4708 restored a PushLogs proxy on the controller after older workers were caught silently failing to push (their cached /system/log-server resolution still pointed at controller:10000). #4795 moved worker log-handler attachment ahead of _register() so bootstrap failures — the kind that produced the 255/256 workers healthy dead end @dlwh filed as #4794 — actually ship searchable remote logs. #4711 taught DuckDBLogStore to reconcile its in-memory segment index against disk, so a vanished Parquet file no longer renders an entire key range permanently unreadable.
Then on the 16th the log store collapsed. #4833 diagnosed the root cause: _LocalSegment tracked only scalar (min_key, max_key) aggregates, but every segment on the cluster spans every user (/alice/... through /zoe/...), so file-level pruning narrowed nothing and the _MAX_PARQUETS_PER_READ = 25 cap was load-bearing — keys whose rows happened to sit outside the newest twenty-five segments became invisible. Users felt it immediately. Tony asked in #infra whether the dashboard was hanging for others, Larry reported jobs failing at step 1,615 / 12,649 with empty logs, and @ravwojdyla opened #4860 on a restarted job with no logs at all. The fix landed in two pieces: #4834 raised the per-read cap to twenty-five segments / 2.5 GB and added a newest-first early stop (p95 held under 500 ms on the 4.5 GB, 46-segment production corpus), then #4881 dropped the cap entirely by turning on DuckDB's enable_object_cache and letting per-row-group stats do the pruning. #4889 cleaned up two durability regressions flagged in review — a close() path that never uploaded when exactly one tmp was present, and a crash window between rename and unlink that could double-count rows — and #4939 gave compaction its own DuckDB connection after a 67-second compaction on Sunday stalled the dashboard with 504s.
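In sketch form, the pruning failure looks like this (names are illustrative, not the real _LocalSegment or DuckDBLogStore API): when every segment's key range covers every user, the overlap test keeps every file, so any fixed per-read cap silently hides whichever rows happen to live outside the segments that survive the cut.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str
    min_key: str
    max_key: str

def candidate_segments(segments, lo, hi):
    # File-level pruning: keep any segment whose [min_key, max_key] overlaps [lo, hi].
    return [s for s in segments if s.max_key >= lo and s.min_key <= hi]

# Every segment on the cluster spans the full user range...
segments = [Segment(f"seg-{i}.parquet", "/alice/", "/zoe/") for i in range(46)]
# ...so pruning keeps all 46 files for any query, and a 25-file read cap then drops
# the remaining segments, including the ones that actually hold the matching rows.
assert len(candidate_segments(segments, "/tony/job-7/", "/tony/job-7/~")) == 46
```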
Around the log-store core, @rjpower rewrote LogPusher in #4866 with a dedicated drain thread, failure-driven re-resolution, and a 10k head-of-queue buffer that only drops oldest-first on overflow; #4864 unifies push and fetch behind a single IrisLogClient. #4717 pushed dashboard log filtering and level to the server so rare-string searches at small maxLines actually return matches. #4863 dropped the dead SQLite logs table that had been silently returning zero rows to every iris query. #4935 shipped an RpcStatsCollector and a StatsService endpoint — per-method counters, a fixed-bucket latency histogram, and ring-buffered slow-tail plus discovery samples — closing out the last of the observability holes that made the week's outages hard to pin down from outside. By Sunday night Russell posted in #infra that logs were back up: "it turns out that you guys can write >10m logs a minute which was... not quite accounted for."
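The overflow policy in the new pusher is worth a one-screen sketch (illustrative class, not the real LogPusher): a bounded buffer that the drain thread empties, discarding the oldest records only when producers outrun the push path.

```python
import collections
import threading

class BoundedLogBuffer:
    """Illustrative sketch of a 10k buffer that drops oldest-first on overflow."""

    def __init__(self, capacity: int = 10_000):
        # deque(maxlen=...) discards from the head (oldest) when a new record
        # arrives at capacity, which is exactly the described overflow behavior.
        self._buf = collections.deque(maxlen=capacity)
        self._lock = threading.Lock()

    def append(self, record: str) -> None:
        with self._lock:
            self._buf.append(record)

    def drain(self) -> list[str]:
        # Called by the dedicated drain thread; returns and clears the backlog.
        with self._lock:
            records = list(self._buf)
            self._buf.clear()
        return records
```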
On the dashboards side, @ravwojdyla-agent stood up #4649, a new IAP-gated Cloud Run status page aggregating ferry CI, main-branch build health, Iris reachability, worker counts, and job state — roughly 1.3k lines of Node/Hono plus a React/Jotai frontend. #4699 fixed a /api/workers 504 traced to inline AbortSignal.timeout() objects being GC'd before firing under Node 20 undici. Follow-ups added a CW ferry card, auto-deploy, amber flags for ferry runs slower than mean + 1σ of the preceding seven successes, and a distinct stripe for cancelled runs so they stop reading as "unknown." @Helw150 landed dark mode and a clearer autoscaler overview in #4660, then fixed a band-percentage bug in #4688 where slices running multiple bands were double-counted so shares could sum past 100%. The Levanter store/cache tracking issue #4445 remains open — no code changes this week — but the logging foundation the epic shares with it is visibly sturdier than it was seven days ago.
Summary: Tracking issue for April MFU work.
The week's MFU story converged on a single number. On Sunday @dlwh reported that the reconstituted 1e23 run, moved from v4-1024 ring to twice as much hardware on v4-2048 with the ragged_all_to_all dispatch at ep=8, is now pulling roughly 20% better MFU than the ring baseline, with train loss nearly identical across the two trajectories. The caveat, worth flagging: perplexity evals are meaningfully worse under the ragged path even at matched loss, with GitHub perplexity particularly degraded; @dlwh suspects an eval-condition mismatch, an overflow, or a subtle bug under imbalanced routing rather than anything that warrants a run reset.
Getting the ragged path to that point took the week's code work. #4746 caught the 1e23 smoke exploding on step 0/1 whenever ragged expert parallelism was exercised at ep>=8 — grad/norm/output_proj jumping from ~0.6 to the 8k-330k range while ring stayed stable, which meant the branch could silently mask the failure instead of smoking it out. @dlwh's follow-up #4752 against moe_1e23_apr restores the heuristic layer formula (d=5120 gives 49 layers, not the hardcoded 48), adds an explicit MoeImplementation knob so the smoke actually defaults to ragged on v4-2048 in us-central2, and plumbs Iris production priority through grug dispatch and Fray. Upstream of that, #4636 landed earlier in the week: it ported MoeAdamHHeuristic from the isoflop branch so launch.py can derive (model, optimizer, batch, steps) from a compute budget plus hidden_dim, dropped the initial-dense-layer path to match the isoflop architecture, and fixed a sharding crash in align_kv_heads by replacing jnp.repeat with a reshape+broadcast under abstract-mesh contexts.
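The align_kv_heads change reduces to a small JAX idiom. The shapes and function name below are illustrative, but the technique is the one described: express per-head replication as a broadcast plus reshape instead of jnp.repeat, whose gather the partitioner cannot reason about under an abstract mesh.

```python
import jax.numpy as jnp

def repeat_kv_heads(kv, n_rep: int):
    # kv: [batch, seq, kv_heads, head_dim] -> [batch, seq, kv_heads * n_rep, head_dim]
    # Equivalent to jnp.repeat(kv, n_rep, axis=2), but written as broadcast + reshape
    # so each KV head is replicated without materializing a gather.
    b, s, h, d = kv.shape
    kv = jnp.broadcast_to(kv[:, :, :, None, :], (b, s, h, n_rep, d))
    return kv.reshape(b, s, h * n_rep, d)

x = jnp.ones((2, 16, 4, 8))
assert repeat_kv_heads(x, 4).shape == (2, 16, 16, 8)
```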
The H100 side of the epic got its first real numbers too. @chloechiaw posted a head-to-head of Grug ring EP against Megatron DeepEP on 8xH100 across four MoE shapes: the Qwen3-30B anchor (128 experts, topk=8, h=2048, ffn=768) lands at 13.1 ms vs Megatron's 12.41 ms, a 6% gap; the 32- and 64-expert shapes at the same hidden/ffn actually run faster on Grug (0.23x and 0.57x of Megatron's per-step time); and the Qwen3-235B anchor is the outlier at 3.48x slower (40.9 ms vs 11.75 ms), which is the shape that now needs a closer look before the H100 track can claim parity at the geometries that matter for the 1e23 milestone.
Summary: Split from #4266.
With last week's git worktree race behind it, the Nightshift scout pipeline settled into a daily cadence: cleanup PRs landed on the 13th, 14th, 15th, and 17th, each one a cross-subproject harvest of dead code, duplicated helpers, and convention violations found by parallel scout worktrees. #4691 pruned four unused utilities from levanter and unified the TPU slice-selection logic in scaling_laws/tpu_utils.py behind a new TpuSpec dataclass. #4729 fixed a quietly broken _is_local_leader() where atexit.register calls sat after return statements and so never cleaned up temp lock files, plus an os.system error branch that could never fire. #4777 deleted roughly 700 lines of RL math-grading code that had already been re-homed under tinker_environments, and #4861 tidied up after the shuffle-format rewrite by removing the ScatterShard alias and its thin wrapper. Four days, four merges, no reverts.
Two infrastructure PRs tightened the bot's operating envelope. @rjpower discovered that Nightshift PRs were merging with no checks at all — the default GITHUB_TOKEN does not trigger workflow runs — and in #4781 swapped both Nightshift workflows over to a minted claude-nightshift GitHub App token so pull_request workflows fire normally. The effect was immediately visible: #4861 is the first multi-cleanup authored by claude-nightshift rather than github-actions. Separately, @dlwh's #4763 taught the pull-request skill to prefer pushing branches onto the main repo when permissions allow, falling back to a fork only when necessary — one fewer source of friction when a subagent has something to say.
Agents showed up elsewhere in the week's chatter, though under the Codex and Claude Code banners rather than Nightshift's. Eric Czech reported that a Claude-supervised babysitter reading only iris job state kept an eight-run sweep alive for roughly twelve hours unattended; letting the same agent debug around iris outages did not go nearly as well, and sporadic iris latency still trips up agents that poll frequently. On the levanter side, rohithck noted that Codex one-shotted a vLLM fix, and Ahmed M Ahmed spent a multi-day session narrowing a v5p/v6e LoRA divergence with Codex as the scribe — it turned out that step-0 LoRA B gradients differ elementwise between matched bf16 runs. The supporting cast of workflow irritations was also visible: Claude Code now blocks sleep loops, and dlwh watched Codex reboot a host without rebuilding the image, twice.
Summary: Lower priority / slack-time workstream covering workqueue, dev-tpu replacement, and observability.
The observability surface took its lumps this week, which generated the usual crop of wishlist items. iris.oa.dev is now the shareable production dashboard — @rav reminded folks they no longer need to port-forward, and links to jobs and logs can simply be pasted into Discord (url). But the dashboard also spent stretches of the week hanging, returning 504s on FetchLogs, or dropping log tails entirely while the controller was under load; @Russell Power traced Saturday's missing-logs episode to users writing more than 10M log lines a minute, which was "not quite accounted for" (url). That operational pain is what motivated most of the new issues this week.
On the persistence and auditability axis, @ihodes filed #4839 to keep historical cluster state — jobs, runtimes, retries, accelerators, disk, CPU — rather than only the live snapshot the dashboards render, and @rjpower filed #4895 to capture a full audit log of controller actions (task assignments, job ingestion, scaling changes), on the premise that offline log replay would make the intermittent overload incidents far easier to load-test against. @rjpower also opened #4840, proposing that the CreateJob API reject Iris clients more than two weeks stale so backwards-compatible shims can be retired on a predictable cadence. None of the three have discussion yet.
Two smaller usability items rounded out the week. @Helw150's #4747 — documenting the intended semantics of the PRODUCTION, BATCH, and INTERACTIVE priority queues, including a CLI warning when users reach for PRODUCTION — was filed and closed in short order. And @hammer's #4820 asks whether Marin should track Batch Heterogeneity (the max-microbatch-loss vs. mean-loss gap reported in the Arcee Trinity tech report) as a training-instability canary, motivated by the Delphi loss spikes earlier in the cycle; it is open, without comments.
Summary: Define canonical data pipelines for all data ingestion: download -> normalize -> dedup/quality -> tokenize.
The normalize stage of the canonical datakit pipeline was reshaped end-to-end this week. #4886 collapsed the per-subdirectory fanout into a single zephyr pipeline that walks everything under input_path and writes a flat outputs/main/ layout, and #4876 split main and duplicate records into separate parquet streams so downstream steps read unique records without post-filtering. #4761 made in-normalize exact dedup optional via a DedupMode enum (leaving room for minhash later), and #4893 decomposed fuzzy dedup into two datakit-shaped jobs — compute_minhash_attrs emitting per-source minhash buckets and compute_fuzzy_dups_attrs producing per-source cluster-annotation attribute trees. Smaller refinements rounded out the shape: #4884 stamped a schema version on NormalizeResult, #4890 exposed max_workers on normalize and consolidate, #4698 silently filtered empty and whitespace-only text (including \xa0) instead of crashing, and #4603 began compacting pathological whitespace runs at 128 chars to survive the multi-MB space runs that HTML-to-text extraction occasionally produces. @ravwojdyla-agent drove essentially all of this.
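The whitespace compaction is the easiest of these to show in miniature. The 128-char threshold is from #4603; the regex and function name below are illustrative, not the normalize code.

```python
import re

_WS_RUN = re.compile(r"\s{128,}")

def compact_whitespace(text: str, keep: int = 128) -> str:
    # Truncate any whitespace run longer than `keep` characters down to `keep`,
    # so a multi-MB run of spaces from HTML-to-text extraction cannot balloon a record.
    return _WS_RUN.sub(lambda m: m.group(0)[:keep], text)

doc = "header" + " " * 3_000_000 + "body"
assert len(compact_whitespace(doc)) == len("header") + 128 + len("body")
```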
Several of these changes were forced by what @ravwojdyla-agent found running a 1.3B-record nemotron-v1 normalize at 6307-way shuffle. #4818 traced ~11% empty output shards to a double-hash in scatter routing, switched deterministic_hash from adler32 to xxh3_64, and added a [0, N) int passthrough; #4819 stopped pre-hashing hex ids to ints inside normalize entirely. #4853 deleted the coordinator-side scatter-manifest consolidation that had been a single point of failure (reducers now read per-mapper sidecars through a 32-thread pool), and #4887 pulled per-shard slicing into the sidecar worker thread so reducer RSS fell from roughly 16 GB to 200 MB. Underneath, #4782 replaced the parquet-based shuffle with flat zstd frames plus byte-range sidecars (Arrow is now out of the shuffle data plane) and @rjpower's #4695 swapped pickle+zstd spill files for parquet with a background I/O thread. #4579 gave the shuffle finite patience — tasks now abort after three shard-level failures instead of re-queueing forever on a deterministically OOMing shard — and #4784 added a 10 GB shuffle integration lane on marin-dev with uniform-vs-skewed matrix coverage.
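The routing-hash shape after #4818/#4819 is worth pinning down in a sketch (names are illustrative, not the real deterministic_hash): the key is hashed exactly once with xxh3_64 and reduced mod the shard count, ints already in [0, N) pass straight through, and nothing upstream pre-hashes ids, so no value is ever hashed twice.

```python
import xxhash  # xxh3_64 in place of the weaker adler32 checksum

def shard_for(key, num_shards: int) -> int:
    # [0, N) int passthrough: an already-routed index maps to itself.
    if isinstance(key, int) and 0 <= key < num_shards:
        return key
    # Hash the original key once. Hashing an already-hashed value (the double-hash)
    # is what skewed routing and left ~11% of the output shards empty.
    return xxhash.xxh3_64_intdigest(str(key)) % num_shards

assert shard_for(17, 6307) == 17
assert 0 <= shard_for("a3f9c0d2b4e1", 6307) < 6307
```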
On the tokenize side, @rjpower's #4658 deleted the tokenize-filescan zephyr job entirely: fsspec.glob(detail=True) returns file sizes from the same list-objects call, so the 32 distributed workers that used to stat files individually are gone and a 2,755-file, 1 TB nemotron shard now resolves in about two seconds. @nevillelyh's in-flight #4814 is an early sketch toward removing the consolidate step from tokenize in favor of a ShardedTreeCache virtual tree for downstream readers. #4758 rewrote consolidate itself as a chain of sorted_merge_join ops — map-side, no shuffle, leaning on the datakit invariant that attribute files share input partitioning 1:1 and are id-sorted. @rjpower's #4721 fixed a 3-8x row inflation in distributed_scan where staging directories were never truncated between runs, which had been making storage reports look wildly more expensive than the billing dashboard.
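The filescan deletion rests on one fsspec behavior, sketched here with a placeholder bucket and pattern: glob(detail=True) returns per-path info dicts, sizes included, from the same listing pass, so no per-file stat workers are needed.

```python
import fsspec

fs = fsspec.filesystem("gs")
# detail=True returns {path: info_dict}; "size" comes from the list-objects call
# itself, so a multi-thousand-file shard resolves in one pass rather than one
# stat per file across 32 distributed workers.
listing = fs.glob("example-bucket/nemotron/**/*.parquet", detail=True)
sizes = {path: info["size"] for path, info in listing.items()}
total_bytes = sum(sizes.values())
```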
The daily smoke ferry kept finding the rough edges introduced by the reshape. #4909 rewired the validator after #4886 and #4893 moved the output layout, #4910 de-flaked the multi-source fuzzy-dup test (two records with identical text collapse into one main row, so CI now accepts either survivor), and #4701 added per-step wall-clock timing to StepRunner after smoke-ferry profiling showed dedup eating 39% of the 80-minute run. @wmoss's #4731 stopped the fineweb exact-dedup experiment from downloading the full corpus when only the 10BT sample was processed. Two short operator runbooks landed — #4771 for the datakit smoke ferry and #4762 for ad-hoc ferry runs on the marin Iris cluster — capturing the SSH-tunnel and coordinator-preemption gotchas from recent manual runs. With the reshape nearly complete, #4892 is wiring nsf_awards and nemotron v1/v2 through download-normalize-tokenize; in #data-curation, @willheld and @ahmeda14960 were already discussing whether nemotron v2's math split is enough on its own for mid-training. And in #code-talk, the mirror-FS-vs-executor interaction surfaced as a real operational wrinkle: a cache copied east1 to east5 gets redone from scratch in eu-west because the executor checks step markers in its local region only, not whether the output exists somewhere.
Summary: Measurable: Bolinas can `import marin` and use it as a library.
Nothing landed in this epic this week; last week's nightly PyPI publish of the marin-* packages is still the current state of the world. The only stir was on the long-open prototype #2477, where @rjpower pinged @ryan-williams after a design-doc update to say he has unpushed wiring changes and that the initial attempt looks promising but needs cleanup and trial usage. The demo-repo tracker #4472 remains closed with no new work, and no import marin consumer surfaced in Discord — the Bolinas-labelled jobs on the Iris cluster this week are unrelated DNA scaling runs, not the downstream project this epic is named for.
Summary: Measurable: canary ferry pass rate consistently above 90%.
The canary fleet ended the week decisively greener than it started, thanks to two structural fixes that pulled the smoke runs out of the flaky marin-dev / dedicated-canary-cluster lanes and onto the warm production infrastructure. @rjpower's #4739 moved the TPU canary ferry to the production marin cluster with --priority production, after controller-SQLite forensics on run 24332480667 showed the previous day's "hung" canary had actually been sitting in a 6h 5m reservation queue on marin-dev before GHA's 240-minute timeout killed it mid-wait. The companion #4744 retired the separate CoreWeave canary cluster entirely and reused the warm iris-ci controller + H100 nodepool, unblocking a run of 15 consecutive failures since March 31 caused by a single US-WEST-04A H100 quota being pinned by the permanent CI nodepool.
The numbers moved accordingly. The TPU canary went 0-for-5 (all cancelled on GHA timeout) from Apr 13–17 and then landed two clean greens on Apr 18 and Apr 19, the first scheduled successes on this lane in over a week. CoreWeave GPU canary flipped from 0/7 the prior week to 5/7 this week (success on every run from Apr 15 onward). The datakit smoke lane was the outlier: @ravwojdyla's #4787 pointed it at the prod cluster at production priority on Apr 15, but it then regressed Apr 18–19 when #4876's normalize output-layout change (normalize/outputs/main/ instead of normalize/*.parquet) silently broke validate_ferry_outputs.py — root-caused and filed as #4908, closed within hours on Apr 19.
Supporting plumbing filled in around the edges. #4831 added cancelled() alerting across all three canary workflows and split Slack-notify into its own job with a fresh timeout budget, so the next GHA-timeout cancel will actually page instead of vanishing; #4854 set MARIN_PREFIX=gs://marin-us-central1 on the TPU canary's off-GCP validate step to stop tracker_metrics.jsonl 404s; and #4856 cleaned up SSH auth for the prod marin workflows. On the gating side, #3506 (conservative metric thresholds on both MoE canaries) closed. Two cluster-stability items remain live — #4776 (@dlwh's RCA traced an Iris cloud-smoke restart_worker flake to a bookkeeping gap where workers register before the TPU LRO settles, not to the reserved-TPU timeout change as first suspected) is still open, and Russell's manual iris controller restart cadence continued in #infra with restarts on Apr 13, 14, 16, 17, and 19. The prod canary surface is green; the dev cluster under it still is not.
Summary: All jobs run through Fray+Iris.
The Ray endgame kept grinding forward. #4815 (@yonromai) deleted RayConfig from Levanter's TrainerConfig entirely — every call site already passed auto_start_cluster=False, so the field was inert — and ported marin/evaluation/visualize.py off fray v1 onto current_client().submit(JobRequest(...)), dropping the subprocess/ExceptionInfo wrapper the v1 flow needed. That single PR closed #4639 ("do RL helpers need porting?") and #4640 ("does evaluation: logprob jobs?"), both of which had been open questions about whether the leftover code was worth migrating or safe to delete. #4742 followed up on the documentation side, scrubbing Ray callouts from the babysit-job, dev-tpu, and agent-research skills and deleting the legacy dev-tpu-ray skill outright, closing #4740. Off the epic proper but in the same spirit, @teetone landed #4894, migrating Evalchemy to Iris.
Iris itself continued to absorb the load with visible bruising. #4725 (@rjpower) fixed a fray v2 bug where create_actor_group was calling self._iris.submit() without an environment argument, so device defaults like JAX_PLATFORMS="" never reached actor jobs; combined with Iris's parent→child env inheritance, a CPU coordinator spawning a TPU actor would produce a process with JAX_PLATFORMS=cpu and JAX would refuse to initialize the TPU backend. The fix closed #4714 (@RohithKuditipudi), who also filed #4728 after noticing that the paths-filter on fray-unit-tests.yaml only triggers on lib/fray/** changes — an iris commit on 2026-04-12 had silently broken five fray assertions for two days before surfacing on #4725. The controller needed multiple reboots over the week (url, url, url), and Russell Power noted that worker/job storms still overwhelm it — load-balancing and client-side retry tweaks are slated for next week.
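The env-inheritance failure is easy to reproduce in miniature. The helper below is illustrative, not the fray or Iris API; it only shows why a missing environment argument plus parent-to-child inheritance leaves a TPU actor stuck on the coordinator's JAX_PLATFORMS=cpu.

```python
def child_env(parent_env: dict, overrides: dict | None) -> dict:
    # Parent->child inheritance: the child starts from the parent's environment,
    # then applies any explicit overrides from the job request.
    env = dict(parent_env)
    env.update(overrides or {})
    return env

coordinator_env = {"JAX_PLATFORMS": "cpu"}  # CPU coordinator process
# Bug shape: the actor-group submit passed no environment, so the TPU actor
# inherits cpu and JAX refuses to initialize the TPU backend.
broken = child_env(coordinator_env, None)                   # {"JAX_PLATFORMS": "cpu"}
# Fix shape: pass the device default through so the actor re-detects the TPU.
fixed = child_env(coordinator_env, {"JAX_PLATFORMS": ""})   # {"JAX_PLATFORMS": ""}
```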
On the Discord side, the "single user left" moment arrived. Tony asked if the Ray cluster could stay up through the end of the month to finish an SFT research project; willheld replied bluntly that the team no longer has the bandwidth to debug Ray's autoscaler ("Ray stopped supporting the autoscaler themselves in preference of their Kubernetes offering") and that Tony is the only user left, so he's free to patch and reboot the cluster himself without worrying about disrupting anyone. Tony traced the issue to max_workers semantics and unblocked himself. Ray is not yet literally deleted from the repo — #4453 still tracks six files with direct import ray (classification actors, executor, Levanter distributed, vLLM TPU detection) plus roughly two dozen fray-v1 consumers in RL, evaluation, and tests — but no human workflows depend on it anymore. The week's other new work, #4827 and #4828 (both @yonromai), stakes out the next phase: route all evaluators through an OpenAI-compatible HTTP contract so Levanter and vLLM become deployment details, which unblocks deleting levanter_lm_evaluation_harness and the last Ray-coupled evaluator paths.
The SWE-ZERO arc pivoted from preregistered MVP to pretraining-scale generation this week. The 1B-token run #4666 closed on Monday at 96,237 rollouts across 32,079 PRs and 20 languages, for roughly 1.1B content tokens and a 38.6% submission rate. @AlienKevin computed within-PR Jaccard similarity at 0.058 on a 19K-rollout sample, confirming the model explores diverse strategies rather than memorising one path per instance. The HuggingFace release was renamed AlienKevin/SWE-ZERO-96K-trajectories to match the actual count, and the 1K-PR pilot on the expanded SWE-rebench V2-PRs schema #4710 closed the same day after validating ingestion of the 122,910-PR corpus at a 25.4% submission rate.
With the MVP retired, the team launched the 140B-token scale-out #4719: 122,910 PRs times 100 rollouts each, targeting 12.3M trajectories. Progress moved from roughly 5% on Tuesday to 16.6% by Sunday morning - 2.04M rollouts and an estimated 23.3B tokens committed. Most of the movement came from two throughput fixes @AlienKevin landed mid-week, documented in c5a806cc9. Profiling 6,667 real rollouts showed vLLM prefill consuming 84.6% of per-rollout time because prefix caching was off and each of ~26 turns re-prefilled a growing ~194K-token context. Enabling --enable-prefix-caching yielded a measured 3.4x per-worker speedup (600 to 4,600 rollouts per batch-hour), and dropping MAX_TURNS from 30 to 15 doubled throughput again by finishing rollouts before preemption. Supporting work included a watchdog that aborts tasks when vLLM exits (earlier runs had leaked ~29% "Connection error" rollouts), a shard-lease disable that unblocked relaunches, a memory bump to 32GB, and a persistent 5-minute relaunch loop with interactive-priority anchor batches. The SWE-ZERO-12M-trajectories dataset now sits at 1.45M clean trajectories across 14,554 unique PRs after generation-time error filtering dropped the error rate from 15.5% to 6.0% between checkpoints.
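For reference, the prefix-caching switch is a one-liner in either the CLI (--enable-prefix-caching) or the Python engine; the model name and sampling settings below are placeholders, not the SWE-ZERO configuration.

```python
from vllm import LLM, SamplingParams

# With the agent re-sending a growing shared context every turn, cached prefill is
# where the reported 3.4x per-worker speedup came from: without it, each of ~26
# turns re-prefills the full ~194K-token prefix.
llm = LLM(model="Qwen/Qwen3-8B", enable_prefix_caching=True)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["<shared repo context>\n<turn 1 instructions>"], params)
```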
Quality signal arrived on two fronts. The ConTree execution-based evaluator #4683 completed on all 151 Python instances available as swerebench images, yielding pass@1=6.0% and pass@3=11.3% - stricter than the earlier 13% LLM-judged estimate, as expected when non-submitted rollouts count as failures. @AlienKevin also derived a submission-based pass@k curve: at k=100 rollouts per PR, 90.6% of PRs produce at least one submitted patch, with diminishing returns past k=20. And a go/no-go validation opened on Friday #4898 to midtrain Marin-8B base on a 100K-trajectory subset and measure SWE-bench Verified before and after, with results due Monday April 21. The config (exp4898_sft_marin_8b_swe_zero.py) picks a 1e-4 learning rate following the weight-decay interaction analysis from #4420 and sets context to 32K to match generation. By Sunday morning the 50K run was 39% through training with loss down from 2.86 to 0.286, and the baseline eval of Marin-8B on SWE-bench Verified landed at 0/16 resolved - a clean zero-floor from which any non-zero post-training result is informative. The baseline rollout also surfaced a plumbing fix: the InstalledAgent version of mini-swe-agent ran the CLI inside the Daytona sandbox and could not reach host-side vLLM, so @AlienKevin rewrote MiniSweAgentV1 as a BaseAgent that wraps the official DefaultAgent with litellm and Harbor adapters.
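The submission-based pass@k curve follows from a standard estimator, sketched here with illustrative counts rather than the SWE-ZERO data: for a PR with n rollouts of which c were submitted, it gives the probability that at least one of k sampled rollouts is a submission, and averaging across PRs traces the curve.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of P(at least one of k sampled rollouts is a submission)
    # given c submissions among n rollouts for a single PR.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# (rollouts, submissions) per PR; illustrative numbers only.
per_pr = [(100, 38), (100, 0), (100, 5), (100, 71)]
curve = {k: sum(pass_at_k(n, c, k) for n, c in per_pr) / len(per_pr)
         for k in (1, 5, 20, 100)}
```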
Around the SWE-ZERO core, the adjacent synthetic-data surfaces kept moving. The execution-trace pipeline from SWE-rebench-V2 #4383 was rewritten by @Helw150 onto ConTree's SDK, eliminating the Docker daemon and Iris-privilege path; outputs now land as per-test parquet rows. The soft-proxy investigation for agentic benchmarks #4389 updated with a mixed read: @RohithKuditipudi reported that positive-trace loss and success-failure gap gave weak or non-monotonic signal on a 15-model MATH study, and @dlwh pushed back with the "dumb" baseline of predicting CORRECT/INCORRECT on complete traces, which Rohith flagged as worth a filtered re-examination. @taivu1998 opened #4858 to add Hermes trace support to the SFT pipeline, introducing message_postprocess_fn and row_id_fn hooks on TransformAdapter and registering the glm-5.1 and kimi splits with a trace-focused pilot experiment. The umbrella issue for diverse agentic traces #4435 got the consolidated status update: MVP, multilang, 1B and V2-PRs pilot all closed; 140B in flight; ConTree eval live.
Summary: We will need 20T of high-quality (including / in particular code) tokens for our large MoE runs in Q2/Q3; this is the work in March that we will do to enable that.
With the 1e23 MoE run now leaning on the Nemotron mix, this week's data work split into three concurrent threads. On the midtraining side, @ahmeda14960 opened #4927 to add experiments/midtraining_data_buckets.py, a math-focused registry grouped by LLM-provenance so cooldown experiments can pick a teacher-model comfort level and pull a pre-grouped set of ExecutorSteps. BUCKET_1 is three filter-only corpora (ProofPile 2, FineMath-3-plus, MegaMath web); BUCKET_2 adds four non-Qwen rewrites (the three nemotron_cc_math_v1 Phi-4 splits plus MegaMath web-pro's Llama-3.3-70B refinement); and BUCKET_3 opens up the full 32-dataset Qwen/QwQ surface, including every Nemotron v2 family (CC v2/v2.1/Code, Pretraining-Code v1+v2, Specialized-v1 minus stem_sft) and the MegaMath synthetic QA/translated-code/text-code splits. The registry is a pure re-export over the already-tokenized steps in gs://marin-us-central2/; the open downstream is wiring BUCKET_1_PLUS_2 into a Mantis-style 8B cooldown. Alongside it, #4892 from @ravwojdyla-agent adds datakit normalize steps for the Nemotron v1 (seven splits) and v2 families so that pipeline can feed those buckets end-to-end.
In parallel, @dlwh pushed forward the long-context data audit for exp2062. #4735 asks how much usable long context actually lives in Longmino and FinePDFs/FinePDFs-edu by length bucket, and what filtering or capping should apply before those corpora are allowed to dominate the 8B pilot's long-context slice; the current exp2062 mix still depends on hand-entered token counts in longmino.py and finepdfs.py. The scaffolding landed in #4738, a reusable CLI that samples raw docs for repetition/OCR/formatting heuristics and reads exact totals from tokenized-cache .stats.json when available, emitting summary.json, summary.md, and a stratified review_sample.jsonl; the in-region corpus run and the keep/cap/defer recommendation are still open. #4736 adds a companion track for a small Marin-owned OSS long-doc QA manifest over existing Ar5iv, Wikipedia, and Stack Exchange corpora, since NVIDIA's long-context post-training data is not cleanly ungated. On the eval gating side, #4737 wires RULER/NIAH task configs through the vLLM lm-eval path for 4k/8k/16k/32k/64k, fixes the 4k-default truncation that likely sank the prior #2064 attempt, and adds an exp2062 runner for the warmstart and phase checkpoints; full RULER suites were launched on April 14 as Iris jobs for Qwen3 8B and Marin 8B base, with results still pending at end of week.
On the upstream web-data side, @XenonMolecule posted a substantial update on #2351, the small-model raw-WARCs-to-training-tokens effort. Running seven pipelines against the same 3000 DCLM WARCs, the LLM-extraction path (Qwen3-8B) yields roughly 58.9B tokens versus 2.7B for Nemotron-CC, 2.7B for DCLM, 1.9B for Nemotron-CC without rephrase, and 817M for FineWeb-Edu; raw HTML sits at 3.63T and Resiliparse at 142.7B for scale. Projecting to the full 7.9M-WARC crawl implies a 2,642x scale-up and an inference bill on the order of 6.60e22 FLOPs, with a closed-form accounting worked out from Levanter's lm_flops_per_token. Preliminary 20T-target simulated-epoch sweeps on Delphi-style isoflop curves show DCLM yielding better Paloma loss than plain Nemotron at small scale, but the sampling factor caps Experiment B at roughly 7.6B tokens and cuts off the largest compute candidates — the working fix is bumping to 10k WARCs (the minimum DCLM comparison size) rather than pushing the target to a "sci-fi" 200T. On Discord, @Ahmed M Ahmed and @willheld debated whether Marin needs more math sources beyond Nemotron-distilled or Qwen-rephrased data; Will's read was that Nemotron 2's math data is sufficient for a strong math reasoning model, and Ahmed flagged the already-tokenized Nemotron-Math subsets on central2 as untouched in Mantis so far. Finally, @hammer opened #4915 to audit number tokenization — right-to-left parsing, place-aligned chunking, digit splitting — citing the HuggingFace number-tokenization blog and Arcee Trinity Large's note on pathological backtracking in the standard number regex.
The week's uncategorized traffic is dominated by a long tail of Iris hardening from @rjpower: a refactored heartbeat protocol split into focused Ping/StartTasks/StopTasks/PollTasks RPCs #4638 with the monolithic path retired in #4843, async heartbeat dispatch at concurrency 128 #4842, a periodic checkpoint moved onto its own thread #4847, terminal-task history eviction chunked across transactions #4845 #4850 #4851, autoscaler pending-hint caching #4848, and a worker health score with a reaper thread for failure-aware termination #4883. Correctness fixes trickled in alongside: unsatisfiable routing constraints are now rejected at submit time #4681, TPU capacity errors classified as quota-exhausted and retried #4670, a preemption race in heartbeat task kill #4690, split-heartbeat loops no longer crash on undefined state #4880, and bad-node stderr now promotes the worker to WORKER_FAILED #4798. Smaller touches — cascading worker_task_history deletes #4838, pinning the TPU network to default #4857, caching ListJobs endpoints #4703, documenting priority bands #4749, keeping a v4-reserved/2048 slice warm #4832 — round out the Iris picture, as does a parallel push from @ravwojdyla-agent to consolidate the controller store layer around process-scoped stores #4836 and add a slice lifecycle state machine for autoscaler transitions #4816.
Outside Iris, @taivu1998 landed a sweep of RL plumbing: a neutral objective runtime and data plane for the trainer #4766, a native Reasoning Gym adapter #4684, an OpenAI-compatible Prime verifier client #4825, RL LoRA training with merged-rollout serving and final adapter export #4826, and a first marin.test_time_scaling slice for math reasoning #4774. @eric-czech cherry-picked two Levanter fixes for the eval harness — a missing resource mapping in broadcast_shard #4911 and an unsupported implementation kwarg #4912 — and @yonromai rewired the harness through the MarinTokenizer protocol #4944. @rjpower also bumped pyrefly from 0.42 to 0.61 and regenerated the baseline #4801, fixed the real bugs that surfaced (169 → 129 diagnostics) #4804, and worked through four further cleanup clusters #4808 #4809 #4810 #4811. Six archived-Qwen3 speedrun submissions from @WhenWen arrived as a set — AdamC/AdamH, MuonC/MuonH, MuonRemez, and PRISM-Berkeley #4928 #4929 #4930 #4931 #4932 #4933. Rounding out the week, @dlwh flipped train/eval/LoRA/viz defaults over to array-first Grug datasets as part of gruggification #3314, @teetone migrated Evalchemy onto Iris and added an OlympiadBench Physics reasoning eval #4894, integration tests were deselected by default and a gated tokenizer test now skips gracefully #4757, an orphan-draft-release cleanup script unblocks the nightly wheel uploads #4780, and a faulthandler now ships in every Iris/Marin/Zephyr/Fray process so SIGSEGV/SIGABRT/SIGBUS produce tracebacks instead of silent exits #4743.
External GitHub contributors converged on the RL, speedrun, and eval-harness surfaces. @taivu1998 opened a run of draft PRs reworking Marin's RL stack — a trainer refactor onto an objective runtime and neutral data plane #4766, LoRA training and export support #4826, a hardening pass on the Prime verifier integration #4825, a native Reasoning Gym environment adapter #4684, and plumbing for Hermes-trace SFT #4858 and sample-only TTS math #4774. @WhenWen archived six Qwen3 speedrun submissions in one sitting — AdamC, AdamH, MuonC, MuonH, MuonRemez, and PRISM-Berkeley (#4928–#4933). @eric-czech landed two levanter fixes for the eval harness — resource partitioning #4911 and a stray implementation arg in the lm_eval call #4912 — and drove most of the discussion on #4678 about mixed Marin/HF tokenizer paths. @teetone migrated Evalchemy to Iris and added OlympiadBench Physics coverage in #4894; @nevillelyh opened a WIP to drop consolidate from tokenize #4814; @wmoss trimmed the fineweb download to the 10BT sample that's actually processed #4731.
The week's most substantive Discord threads clustered around scheduler semantics and the MoE run. Eric Czech's self-preemption mishap turned into a design conversation in #infra, where Will Held proposed that a user's INTERACTIVE jobs should preempt their own BATCH jobs — directly tied to the eval-harness region plumbing he was also debugging on #4911. Russell Power and Eric then mapped out multi-region sweep behavior and the Executor/MirrorFS interaction in a long infra thread, concluding the Executor abstraction needs a rethink in the Iris+MirrorFS world. On the MoE side, Will Held's writeup of quantile balancing on the Open Athena blog summarized the ongoing experiments in #moe; Larry posted the first 5% of the 130B/A29B loss curve, and dlwh noted at week's end that the reconstituted 1e23 run is now on twice the hardware at ~20% better MFU on the ragged_all_to_all path.
Seven new members arrived with self-introductions, spanning industry infra and student researchers: an ML research engineer from programmatic-advertising DLRM systems; a Scale AI engineer on agent systems (previously Two Sigma) focused on reasoning and interpretability; a Senior MLE at Bill with use-case GRPO and multimodal post-training experience; an MLE at a Tokyo fintech LLM startup; a graduating senior at Fayetteville State with NASA JPL and UNC Chapel Hill research on aerospace LLMs and radiotherapy VLMs; a fractional CTO now running Kagi's index team; and a second-year undergraduate focused on ML systems and performance. Jason Grey (greyleader77) brings context on large-scale crawl and index infrastructure from Common Crawl's 10PB pipeline, which intersects with the data-pipeline and retrieval work in the datasets epic. Kartik_S (Senior MLE at Bill) intersects with the active RL and post-training threads through his use-case-specific GRPO background. Tri Dao also joined the server on April 18 without a self-intro; his work on FlashAttention and Mamba intersects directly with this week's ragged-all-to-all expert-parallelism debugging and the broader MoE kernel surface.
News and research links in #news and #reinforcement-learning skewed toward tokenization and numerical handling — Julie Kallini's tokenizer thread, HuggingFace's number-tokenization blog and finephrase space, a digits-to-decisions paper that Jeff H turned into issue #4915 — alongside an arXiv trickle on depth-adaptive architectures, experience replay for RL, and a Nature piece on subliminal trait transmission during distillation.
The preregistered 1e23 MoE arc — a 129B-total / ~16B-active ragged mixture of experts predicted to land at paloma macro_loss 2.25 after ~1T Nemotron tokens (#4697, isoflop prediction from #4447) — dominated the week. The original moe_1e23_d5120_bs2048_ep4_ring launch on v4-1024 with ring expert dispatch clocked ~49B tokens at 14.2% MFU before being torn down mid-week so the training could migrate to a v4-2048 slice with ragged all-to-all. That migration required chasing through a step-0 gradient explosion with ep≥8 ragged dispatch #4746, a sender-side offset bug in _shard_a2a_params fixed in #4867, and a raft of Iris fixes (#4821 to keep a reserved v4-2048 slice warm, #4793/#4792 to raise heartbeat and bootstrap timeouts on the 2048-node pod).
The reconstituted moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_124933 came online Apr 17 on v4-2048 at ep=8 and, as of this writing, is still running — it has consumed 105B tokens at 16.4% MFU (roughly 20% better, in relative terms, than the ring baseline, which ran on half the hardware), with train loss tracking the ring run essentially step-for-step. The paloma macro_loss sits at 2.7283 and overall macro_loss at 2.6186 — nowhere near the 2.25 preregistered target yet, but only ~10% of the way through the 1T-token budget. @dlwh flagged a curious wrinkle on Discord: even at identical train loss the perplexity evals are a good bit worse than the ring baseline, "much, much worse on github," which he suspects is an eval-condition mismatch or overflow behavior under imbalanced routing rather than a run-resetting bug.
On the agentic SFT side, #4420 delivered the week's most sobering result: two epochs of SFT on the full 366K-example Nemotron-Terminal-Corpus brought exp4420-8b-v4-128-20260413 (Marin-8B Instruct, v4-128) to final loss 0.442 — a stubborn +0.082 gap to the Qwen3-8B reference that never closed — and only 1/89 = 1.1% on Terminal-Bench 2.0 versus 15.9% for the Qwen3-8B baseline at matched recipe. The v5p-64 twin crashed, the v4-128 run finished; the takeaway is that Marin-8B Instruct absorbs terminal-skill SFT roughly an order of magnitude worse than Qwen3-8B on identical data, plausibly tokenizer- or chat-template-bound. The 32B analog exp4307-32b-v4-512-tp2-20260413 (#4307, Qwen3-32B on v4-512 TP=2) crashed at step 892 / 15.6% after burning 28.5h, and the Marin-32B-base comparator exp4760_sft_marin_32b_base_terminal_corpus_15pct_32768tok_v5p64 #4760 is still running at step 210/859 on preemptible v5p-64 after its 128-chip sibling crashed off the OOM cliff.
A sweep of Michael Ryan's FM-natural curation ablations landed at the ~3e20 FLOP scale on v5p/v4-32 slices, putting four text-extraction and filtering pipelines head-to-head at the same compute. At the ~1B d1536 shape, curation-resiliparse-expFM_natural-3e+20-d1536-L16-B256 finished best with paloma macro_bpb 1.1234 and uncheatable macro_bpb 0.9494 — github_cpp and github_python bpb dropping to 0.83 and 0.92 is the clean tell that resiliparse is pulling code through where the other pipelines mangle it (FineWeb-Edu sits at cpp=5.56, python=5.29 at the small d512 shape). curation-dclm-expFM_natural-3e+20-d1536-L16-B256 finished at macro_bpb 1.1797, the Nemotron-full variant at macro_bpb 1.3325, and the tiny FineWeb-Edu d512 smoke run at 2.2024 — a reminder that programming-language coverage, not just English prose quality, is driving the curation gap at 1e23 scale.
| Run | User | Hardware(?) | Hours(?) | FLOP Budget(?) | Loss | BPB(?) |
|---|---|---|---|---|---|---|
| #4697 #4867 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_124933 | David Leo Wright Hall | TPU v4 (1024 chips) | 2.8d | 1.08e22 model / 6.61e22 HW (16%) | | 0.848 |
| #4697 pre-reg moe_1e23_d5120_bs2048_ep4_ring | Larry Dial | TPU v4 (512 chips) | 3.0d | 5.08e21 model / 3.57e22 HW (14%) | | 0.884 |
| #4307 exp4307-32b-v4-512-tp2-20260413 | Kevin Li | TPU v4 (256 chips) | 1.2d | 1.15e21 model / 7.00e21 HW (16%) | | 0.808 |
| moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_011404 | David Leo Wright Hall | TPU v4 (1024 chips) | 6.8h | 1.09e21 model / 6.55e21 HW (17%) | | — |
| #4420 exp4420-8b-v4-128-20260413 | Kevin Li | TPU v4 (64 chips) | 1.6d | 9.30e20 model / 2.33e21 HW (40%) | | 0.170 |
| #4760 exp4760_sft_marin_32b_base_terminal_corpus_15pct_32768tok_v5p64-775003 | Kevin Li | TPU v5 (32 chips) | 1.9d | 7.37e20 model / 1.58e21 HW (47%) | | — |
| curation-fineweb_edu-expFM_natural-3e+20-d512-L6-B2048 | Michael Ryan | TPU v5 (32 chips) | 1.4d | 3.00e20 model / 1.49e21 HW (20%) | | 2.029 |
| curation-dclm-expFM_natural-3e+20-d512-L6-B2048 | Michael Ryan | TPU v5 (32 chips) | 1.2d | 2.67e20 model / 1.32e21 HW (20%) | | 1.283 |
| curation-nemotron_full-expFM_natural-2e+20-d512-L6-B2048 | Michael Ryan | TPU v5 (32 chips) | 18.4h | 1.96e20 model / 1.24e21 HW (16%) | | 1.510 |
| curation-nemotron_full-expFM_natural-3e+20-d512-L6-B2048 | Michael Ryan | TPU v4 (32 chips) | 1.1d | 2.39e20 model / 1.18e21 HW (20%) | | 1.496 |
| curation-nemotron_full-expFM_natural-3e+20-d1536-L16-B256 | Michael Ryan | TPU v4 (32 chips) | 1.2d | 3.00e20 model / 1.12e21 HW (27%) | | 1.205 |
| curation-dclm-expFM_natural-2e+20-d512-L6-B2048 | Michael Ryan | TPU v5 (32 chips) | 20.4h | 2.00e20 model / 9.94e20 HW (20%) | | 1.265 |
| curation-dclm-expFM_natural-3e+20-d1536-L16-B256 | Michael Ryan | TPU v5 (32 chips) | 20.1h | 3.00e20 model / 9.55e20 HW (31%) | | 1.039 |
| curation-resiliparse-expFM_natural-3e+20-d1536-L16-B256 | Michael Ryan | TPU v5 (32 chips) | 20.1h | 3.00e20 model / 9.31e20 HW (32%) | | 0.998 |
| exp4760_sft_marin_32b_base_terminal_corpus_15pct_32768tok_v5p256-d7b7a8 | Kevin Li | TPU v5 (128 chips) | 7.3h | 4.13e20 model / 8.67e20 HW (48%) | | — |