Marin: Week of April 20th summary

Milestone: Kick off pre-training of the 100B-A13B 1.2T-token MoE (preregistered)
Contents
  1. Data
  2. Summary
  3. MoE Scaling up to April goal
  4. Single way of running jobs — off Ray completely
  5. Levanter Store, K8s Logging, and Infrastructure Improvements
  6. MoE MFU at scale
  7. Agentify experimentation
  8. Improve Usability & Observability
  9. Canonical pipeline (download → norm → dedup/quality → tokenize)
  10. Marin-as-a-library (Bolinas can import marin)
  11. Canary pass rate to 90%+
  12. Synthetic data (research + critical path for post-training)
  13. Data sources for pre-training / mid-training
  14. Community Pulse
  15. Runs
GitHub
119 merged 43 opened 83 issues closed 18 contributors 11 epics 396 comments this week
Compute
GCP TPU 1.11e24 HW FLOPs (0 reserved) W&B 2.57e23 HW FLOPs (4.97e22 model FLOPs)
Infra
Discord
276 messages 55 authors 10 new members 17 channels active 11 threads
Tokens
18.0T tokens 33.1% synthetic 102 datasets 🤗 collection
web 12.3T (68.2%) multilingual 3.7T (20.5%) code 1.3T (7.4%) math 377.1B (2.1%) specialized 347.6B (1.9%)

The preregistered 1e23 MoE run that launched on Friday last week is now mid-training and absorbing the bulk of the week’s compute — roughly 87% of the cluster’s HW FLOPs (2.23e23 of 2.57e23 total), ~229k chip-hours, and at 355B tokens its train_loss is 2.17 with Paloma macro 2.545 against the preregistered 2.25 target in #4697. @Helw150’s VPNLS revisit updated the forecast to 2.295 for the ICLR window, with @eric-czech flagging the no-irreducible-loss fit as an upper bound. In parallel, the agent-driven recipe ladder closed 21 MoE architecture issues in a week and converged on a combined-best stack #4999 measuring 1.21–1.52× effective speedup across scales — partial RoPE + PKO + last-layer PKO + cached attention — with a late MHA+PKO result in #5152 at 1.28–1.37× setting up the next refresh.

The Ray sunset reached the finale that’s been telegraphed since early March: @yonromai shipped 17 stacked Ray-removal PRs deleting ~22k lines of legacy operator, cluster-template, and `fray.v2` rename code, and on Apr 23 nine of ten Ray head VMs were torn down — only marin-big-run remains. Iris is now the sole scheduler: budgets went live in #5081, preemptible-priority in #5083, client-version gating in #5108, and the monolithic Heartbeat RPC was retired. The week also opened a new first-class epic, #4474 (Levanter Store, K8s Logging, and Infrastructure Improvements), separating out the long-running rav/Neville store-design conversation, @rjpower’s log-server extraction in #4947, and a costly BOS-token incident: a kitoken backend migration silently dropped the BOS during retokenization, corrupting roughly 63 TB of tokenized cache data before #5149 inventoried the blast radius and the validation caches were rebuilt.

Outside the core path, the canonical pipeline epic gained a testbed + 99-source registry as its new spine (#5159, #5105) and the data-sources epic reorganized end-to-end around perplexity-gap diagnostics, with @dlwh shipping the gap-report machinery in #4962 and a programmatic plan of ~24 long-tail PPL eval slices and matching data-sourcing tasks. The first 32B-scale comparison #5123 showed Marin 32B trailing Qwen3 32B by 0.0142 bpb overall, with the gap concentrated in URLs, numbers, whitespace, and structured programmer text. On the agent front, @dlwh’s MCP-babysitter #5042 and four Nightshift cleanup nights landed daily, and ten new members introduced themselves — including a Falcon-team contributor from TII and an LFAI&Data Foundation engineer building on the modded-nanogpt + Marin speedrun — framed in the Community Pulse below.

#4281 MoE Scaling up to April goal


Summary: Split from #4266.

40/52 sub-issues closed

The 1e23 MoE run #4697 moved from launch into ongoing operation this week. @dlwh posted a tracking update on Apr 20 cataloging three concurrent W&B runs for the d5120 / 48-layer / 129B-total / 16B-active model: an earlier divergent EP8 ragged relaunch, the current EP8 ragged relaunch (moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_124933), and the known-good EP4 ring baseline. The run-specific launch artifacts were kept on the long-lived codex/grug-moe-debug-artifacts branch with the freeform debug log split off to codex/grug-moe-debug-log; the reusable config plumbing landed cleanly via #4964, while #4959 was closed as experiment-only. Mid-week, @Helw150 regenerated the 1e23 forecast for Percy's ICLR presentation using @eric-czech's VPNLS approach, getting a slightly more conservative paloma/macro_loss projection of 2.295 against the original 2.25 prediction (asymptote pinned at 1.6, following Larry's setup). @eric-czech warned that VPNLS came in too conservative on the Delphi forecasts unless fitted without E; @ClassicLarry noted the 1e22 MoE point in the comparison plot was from the prior WSD less-tuned LR recipe and not in the fit, leaving the curve only mildly off-trend. By the end of the week @ClassicLarry was examining grad norm on the active 1e23 against the 1e21 reference, observing that nearly all the gradient comes from output_proj which sits in AdamH at constant step size, so the existing >1.0 clip mostly does an awkward downscale of the embed and other non-AdamH params; @Kaiyue-Wen argued clipping is essentially never triggered now and even if it eventually triggers it should be benign for Adam/Muon, raising module-wise gradient normalization as a way to ignore grad norm entirely the way parameter norm is now ignored.

The dominant story of the week was the agent-driven sweep of architecture and recipe ablations against the post-isoflop compute-optimal baseline. The shape that emerged: a handful of small individually-passing changes turned out to be cleanly additive. Partial RoPE on every layer #4946 dropped macro_loss by 0.020 at d512 and d768 with essentially no throughput cost, giving 1.09–1.12× effective speedup; partial key offset (PKO) on the every-fourth long-window layers #4802 matched or exceeded that, with the every-4th variant capturing the full benefit at 1/4 the layers, mirroring the nanogpt approach. Combining PKO every-4th + partial rope every-layer #4951 hit 1.226× at d768. #4976 then forced the last layer to be long+PKO so the model always ends on an induction-capable layer, lifting effective speedup to 1.19–1.23× across all four scales with a projected 1e23 macro_loss delta of −0.012. Doubling routed experts to E=128 / K=4 #4767 contributed another 1.08–1.20× (improving with scale), and reusing the third-to-last layer's attention input for the last two layers #4987 added 1.02–1.10× essentially for free. Stacked together as the "combined best" run #4999, the configuration delivered 1.21–1.52× effective speedup across d512–d1280 with macro_loss drops of −0.054 to −0.067, and the fitted scaling law projects −0.035 at 1e21 and −0.023 at 1e23 (pinned exponent), or roughly 1.4–2.1× speedup against baseline at hero scales depending on whether the exponent is held or fit. @ClassicLarry flagged in #moe that the 52% gain at 3e19 was largely driven by partial key offset, an idea originally invented in nanogpt — @Kaiyue-Wen called it "crazily good." Late in the week #5154 opened a barebones-transformer probe (no XSA, no GatedNorm, no MoE, MHA, dense MLP) specifically to chase down why PKO shows ~20% benefit on the Marin MoE recipe but only 0–2% on the speedrun nanogpt/nanochat recipe, hunting for any tokenization artifact.
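
Effective speedup here is the leaderboard's currency for a loss delta: how much extra baseline compute the same macro_loss improvement would have cost. A minimal sketch of that conversion, assuming a power-law fit of the form L(C) = L_inf + A·C^(−α); the constants are placeholders, not the actual Marin fit:

```python
def effective_speedup(delta_loss: float, compute: float,
                      A: float, alpha: float) -> float:
    """Equivalent-compute multiplier for a loss improvement `delta_loss`
    at budget `compute`, assuming L(C) = L_inf + A * C**(-alpha).
    A and alpha are hypothetical fit constants, not Marin's."""
    reducible = A * compute ** (-alpha)      # baseline reducible loss at C
    target = reducible - delta_loss          # reducible loss after the change
    c_equiv = (A / target) ** (1.0 / alpha)  # compute the baseline would need
    return c_equiv / compute                 # 1.23 -> "1.23x effective speedup"
```

With the exponent pinned, the same delta buys a larger multiplier at bigger C (the reducible loss shrinks), which is consistent with the projected −0.023 at 1e23 still mapping to the 1.4–2.1× range depending on whether the exponent is held or refit.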

The MHA thread that opened on Apr 24 stands as the most consequential late-breaking result. Removing GQA entirely #5151 dropped macro_loss by 0.024 at d512 and 0.034 at d768 for ~2–4% throughput cost, giving 1.10–1.18× effective speedup — confirming the prior #4371 GQA estimate at the new compute budgets. Stacking MHA with PKO every-4th + last #5152 compounded almost additively: 1.28× at d512 and 1.37× at d768 from a single experiment, suggesting MHA is providing the KV capacity that PKO's induction heads then put to work. This sets up an obvious next combined-best refresh that swaps GQA out of the recipe entirely.

The negative-result column is equally informative for narrowing the recipe. Min LR ratio sweeps #4972 and the constant-final-LR variant #4981 both confirmed that fully decaying LR to zero is optimal at these run lengths — any non-zero floor strictly hurt. Doubling expert granularity to E=128 / K=8 with half expert dim #4899 improved quality slightly but cost ~30% throughput, failing decisively. Sandwich GatedNorm (post-MLP) #4993 hurt both quality and throughput. Shifting the AdamH:Adam LR ratio off 13/3 #5000 failed at gate 2 — the existing ratio is well-calibrated. Router/shared norm splits #4973, layer-grain prediction #4807, full attention residuals #5113 (good quality but throughput too expensive — the bs=4 block-residual variant #5110 won instead, at 1.06–1.12× across all scales), GatedNorm init scale sweeps #4904, value embeddings #4986, and RoPE-before-QK-norm #5114 all failed or were neutral. The GatedNorm position ablation #4952 isolated the MLP GatedNorm as load-bearing while embed/final/attn positions are ~neutral, and the slim-layer probe #4900 identified that the long-window layer is the one MoE position that cannot be removed — pointing to a future experiment of beefing up the MLP at every-4th layers via a wider skip path. Depth-width shifts #5002 showed −1 layer passing gate 1 at 1.07–1.12× by trading negligible quality for ~10% throughput, suggesting the L = round(d / (64 + 4·log₂(d) − 9)) heuristic is slightly too deep; @ClassicLarry deferred a base=80 retune behind the other recipe changes since it would invalidate the isoflop sweeps.
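
For reference, the heuristic under suspicion is cheap to evaluate at the sweep's widths (a quick check, not project code):

```python
import math

def heuristic_layers(d: int) -> int:
    # L = round(d / (64 + 4*log2(d) - 9)); the -1-layer gate-1 pass above
    # suggests this rule runs slightly deep at these scales.
    return round(d / (64 + 4 * math.log2(d) - 9))

print([(d, heuristic_layers(d)) for d in (512, 768, 1024, 1280, 5120)])
# [(512, 6), (768, 8), (1024, 11), (1280, 13), (5120, 49)]
```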

The optimizer side picked up new contributors. @pc0618 launched a Muon AOL-coefficient sweep #5115 at gate 1 and a MuonH swap with 2× batch ablation #5134 after @Kaiyue-Wen argued in #moe that MuonH should show a clearer step-wise advantage if the AdamH LR is held fixed, and that Muon should benefit disproportionately from a 2× batch increase since orthogonalization overhead amortizes better at larger batches. The first MuonH launches hit a JAX ShardingTypeError on the ('data', None, None) × ('expert', None, None) broadcast and were patched with a reshard before the scale-invariant multiply path. By Apr 25 the early read was that AdamH baseline, Muon AOL, MuonH base-batch, and MuonH 2× batch all looked indistinguishable, with @Kaiyue-Wen noting the y-range made the comparison hard to read at a final-difference scale of ~0.02. @WhenWen opened parallel work on AdamH gradient normalization: per-module gradient RMS-1 normalization #5180 via #5181 regressed d512 enough to fail gate 1, prompting a global-tree variant #5182 via #5183 that preserves relative module scales — gate 1 was still in flight at week's end, with one relaunch needed after a parent preemption moved the executor prefix between regions and broke checkpoint resumption. @dlwh also landed sampled backward-flow probes for Grug #5036 with a residual-stream DAG renderer logged to W&B every 50 steps, giving the modeling group a new lens on activation and cotangent flow during these recipe sweeps. Looking past the hero run, @ClassicLarry noted in #moe that the 1e22 run needed its capacity factor pushed to 4 post-hoc to get an honest eval — c4_en BPB 0.4882 at cf=1 vs 0.4259 at cf=4 — because OOD eval distributions flood single experts and cause drops; @Kaiyue-Wen suggested a small DSv3-style sequence-level auxiliary loss to discourage the imbalance, while open questions remain about how to handle this at inference without giving up the EP throughput. The DeepSeek-V4 paper that dropped on Apr 24 prompted Larry to register intent to test their gating change in Marin — sigmoid is "a little odd" given gate values are always sampled from the right tail with no real 0–1 bound.
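
The two normalization ablations differ only in where the RMS is computed. A hedged sketch of the contrast on a gradient pytree (illustrative, not the #5181/#5183 implementations):

```python
import jax
import jax.numpy as jnp

def per_module_rms1(grads):
    # The #5181-style variant: each module's gradient is rescaled to RMS 1
    # independently, discarding relative scale between modules (failed gate 1).
    return jax.tree.map(lambda g: g / (jnp.sqrt(jnp.mean(g**2)) + 1e-8), grads)

def global_tree_rms1(grads):
    # The #5183-style follow-up: a single RMS over the whole tree, so ratios
    # between modules (e.g. output_proj vs embed) are preserved.
    leaves = jax.tree.leaves(grads)
    total_sq = sum(jnp.sum(jnp.square(g)) for g in leaves)
    count = sum(g.size for g in leaves)
    rms = jnp.sqrt(total_sq / count)
    return jax.tree.map(lambda g: g / (rms + 1e-8), grads)
```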

2 PRs this week, 80 new comments, and 22 new issues (52 total)
12 autocategorized

#4269 Single way of running jobs — off Ray completely


Summary: All jobs run through Fray+Iris.

1/1 sub-issues closed

The Ray sunset arrived. @yonromai drove the long-tail migration from #4453 through to its terminal state in a sequence of stacked deletes: trivial v1→v2 import swaps in #4970, ports of export/levanter_checkpoint.py and inference/vllm_smoke_test.py to fray v2 in #4980 and #4983, removal of dead RL code (evaluate_environment.py in #4975, JAX_TRANSFER_SERVER weight transfer in #4979 — confirmed unused by @rjpower in #reinforcement-learning), the eval-tree v1 scaffolding in #4953, the Ray+Iris hybrid integration test in #5068, and then the heavy excisions: cluster/ray.py and the executor's lazy import ray branches in #5028, Levanter's ray_tpu.py/launch_on_ray.py in #5031, the Marin Ray glue in marin.cluster in #5131, the 16 Ray cluster YAML templates and operator docs in #5132, the Ray operator tooling and its CI in #5089, ray_run.py itself in #5087, the docs sweep retiring ray up/ray down/launch_on_ray from tutorials and runbooks in #5076, the fray.v1 package in #5137, fray.v2.ray_backend in #5138, and finally the fray.v2.* → fray.* rename in #5140 once nothing else needed the version suffix. @Helw150 tore the Ray references out of the internal guidelines in #4985. @yonromai's opening salvo on Apr 23 in #infra — "Is it okay if I delete all the Ray resources on GCP, except for big run?" — was the operational counterpart: 9 of 10 head VMs and 34 firewall rules went down on the 23rd, leaving only marin-big-run alive for the in-flight v4-2048 production job. Net deletion across the epic is on the order of 22k LOC; fray is now Iris-only with a LocalClient fallback for tests. Docs lag is tracked in #5029 (still open).

Iris absorbed the consequences of being the sole scheduler. @Helw150's #5081 turned on the budget system: migration 0037 set budget_limit=75000 with max_band=PRODUCTION for seven admins, the same limit at INTERACTIVE for thirteen named researchers, and forced new users to budget_limit=0/max_band=BATCH so unlisted submitters get opportunistic scheduling only. @rjpower followed with #5083, exposing --preemptible/--no-preemptible on iris job run so callers can override the small-CPU heuristic explicitly (closing #4540), and #5108 added a client_revision_date stamped at wheel build that lets the controller reject root submissions from clients more than a configurable interval old, with a 2-week grace for in-flight installs (closing #4840). The legacy monolithic Heartbeat RPC and its 2k-line surface — HeartbeatRequest/Response, use_split_heartbeat flag, HeartbeatAction enum, DispatchBatch, all begin/complete/fail_heartbeat paths — were removed in #5092 once #4984 flipped use_split_heartbeat=True on prod marin and the split Ping+StartTasks+StopTasks+PollTasks path had burned in elsewhere. The audit story moved from a SQLite txn_log/txn_actions pair to structured log lines (event=… entity=… trigger=…) in #5082, picking up API-key create/revoke instrumentation along the way. The deprecated GetTaskLogs RPC was removed in #4231 in favor of FetchLogs with regex source patterns. Underneath all of this, @rjpower began a major refactor of the controller's storage layer: #5147 introduced a typed ControllerStore bundling per-entity stores with a temporary self._db escape hatch, and #5164 then migrated ~30 inline SQL sites in ControllerTransitions behind TaskStore/WorkerStore/JobStore methods, lifted transaction scope to the entrypoint, added a TaskScope ADT, on-commit hooks for the attribute cache, and closed the submit-with-replace TOCTOU. Behavioral equivalence is proved by the 13-scenario replay-golden harness in #5165, which produces byte-identical DB state on the refactor branch versus main.
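
The resulting budget tiers are easiest to read side by side; a pseudo-config restating migration 0037's effect (not the migration's actual format):

```python
# Pseudo-config mirroring the 0037 budget seed described above.
BUDGET_TIERS = {
    "admins":            {"count": 7,    "budget_limit": 75_000, "max_band": "PRODUCTION"},
    "named_researchers": {"count": 13,   "budget_limit": 75_000, "max_band": "INTERACTIVE"},
    "everyone_else":     {"count": None, "budget_limit": 0,      "max_band": "BATCH"},
    # budget_limit=0 at BATCH means unlisted submitters get opportunistic
    # scheduling only.
}
```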

The scheduler had a bad Wednesday and a productive recovery. On Apr 22 around 01:15 UTC, Michael Ryan flagged "I think something might be up with the iris scheduler" in #infra; @rjpower traced it to a crashed scheduler loop, narrowed the cause to "reservation hold job preemptions had special casing which broke the scheduler", and noted that "likely 50% of the complexity of the iris scheduler is due to how I hacked those in… which I need to delete with great passion (hopefully by next week)". The fix landed as #5032 within the hour. A more pernicious class of bug surfaced over the rest of the week — a StartTasks/PollTasks race in which a poll's DB snapshot, taken before a concurrent StartTasks assignment commit, would omit the new task from expected_tasks, the worker would kill it as unexpected, the controller would promote that to JOB_STATE_KILLED, and the kill would cascade across the entire pool, surfacing in zephyr as the misleading "Worker job terminated permanently… Workers likely crashed" abort. @ravwojdyla-agent attacked the worker side first in #5043 with a 30s submission grace window in _reconcile_expected_tasks; @rjpower followed with #5046 (a wire-level TASK_STATE_MISSING signal that maps to WORKER_FAILED in the controller so the task retries via its preemption budget), #5054 (the controller-side companion to #5043), and ultimately #5090, which moved per-worker PollTasks out of its dedicated thread and runs it inline at the end of each scheduling iteration so the same thread that commits assignments owns the running-tasks snapshot. The autoscaler got its own structural rewrite in #4816, which lifted the slice lifecycle into an explicit (from_state, type(event)) → to_state transition table with a sum-type outcome consumed via a single match; failed slices are now atomically removed from _slices and surfaced through BecameFailed.handle rather than left in place. @rjpower's #5078 added a manual-slice CLI (iris cluster create-slice/delete-slice) that the autoscaler ignores, useful for pinning capacity outside the demand model.
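
The #5090 fix is about ordering rather than locking: once the thread that commits assignments also takes the poll snapshot, the snapshot can never predate a commit. A schematic of the loop shape, with hypothetical helper names:

```python
def scheduling_iteration(db, workers):
    # 1. Commit this iteration's task assignments.
    assignments = compute_assignments(db)      # hypothetical helper
    db.commit(assignments)
    # 2. Poll inline, on the same thread. The expected_tasks snapshot now
    #    necessarily includes everything committed above, so a worker can no
    #    longer see a freshly assigned task as "unexpected" and kill it.
    for worker in workers:
        expected = db.expected_tasks(worker)   # hypothetical accessor
        poll_tasks(worker, expected)           # hypothetical helper
```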

Worker reliability moved on several fronts. #4883 replaced the time-decay health score with two independent ping-based termination conditions — 10 consecutive ping failures, or 10 monotonic BUILDING→FAILED transitions — and removed the heartbeat_failure_threshold config knob; FAILED from RUNNING deliberately doesn't bump either counter to keep poison-pill jobs from reaping every worker that ran them. #4940 wired Connect deadline checks into the RPC interceptor on entry and after semaphore release, so FetchLogs and heartbeats whose clients have already given up shed their work as DEADLINE_EXCEEDED instead of compounding the pile-up. #5021 cut the FrayActorJob polling load by switching wait_ready/is_done from the heavy GetJobStatus RPC to lightweight GetJobState with backoff out to 5s; #5024 restored the ConnectError(NOT_FOUND) public-API contract that #5021 inadvertently regressed; #4917 capped wait_for_job's state-poll backoff at 30s and moved wait_for_job_with_streaming off its fixed time.sleep. SIGSEGV exits in container processes — almost always libtpu/glibc/C-extension faults rather than user logic — now route to preemption_count (default budget 100) rather than the terminal failure_count=0 path in #5013, and #5038 reaps TPU hosts that keep failing launches (e.g. iommu/vfio group already held), excludes virtual reservation-holder tasks from expected_tasks, and fires on_stop on natural return. #5017 disabled the periodic memray loop after it began segfaulting the processes it was profiling; on-demand memory profiles still work, and #5033 stopped passing --subprocesses to py-spy when the controller and worker profile themselves, since they spawn non-Python children that py-spy can't fingerprint. @dhidary's in-flight #5150 adds enable_external_ip=False to the GCP configs and opens an IAP tunnel for projects (TRC grants, security-locked enterprise orgs) where constraints/compute.vmExternalIpAccess bans public IPs.
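
A sketch of the two-condition shape #4883 describes (the structure is illustrative; the thresholds are the ones named above):

```python
class WorkerHealth:
    """Replaces the time-decay health score with two independent counters."""

    def __init__(self):
        self.consecutive_ping_failures = 0
        self.building_failures = 0

    def on_ping(self, ok: bool) -> None:
        self.consecutive_ping_failures = 0 if ok else self.consecutive_ping_failures + 1

    def on_task_failed(self, prior_state: str) -> None:
        if prior_state == "BUILDING":
            self.building_failures += 1
        # FAILED from RUNNING deliberately bumps nothing: a poison-pill job
        # should not reap every worker that ran it.

    @property
    def should_terminate(self) -> bool:
        return self.consecutive_ping_failures >= 10 or self.building_failures >= 10
```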

The week's operational narrative widened to compute economics and agent ergonomics. Eric Czech opened a long thread in #infra noting that practical cross-region sweeps are getting compressed into shorter and shorter windows as v4/v5/v6 availability bounces, and asking whether OA could offload checkpoint storage to R2 or a separate GCP project to escape hai-gcp-models egress. @rjpower pushed back on the storage-split path — egress prevents external projects from helping much for checkpoints — and argued that "in a sense, we may need a rewrite of Executor for the modern era", since the executor still resolves output paths in the root launcher's region and reuses them for child jobs in other regions (filed as #4969, still open). The recommended path remains mirror:// for checkpoints; the durable fix is fair-share priority bumping inside Iris, which works only if everyone submits to the same queue rather than letting agents place jobs themselves. The agent-babysitting pattern crystallized: by Apr 25 Eric reported a sweep of 8 training jobs running for "a little over a week with Claude babysitting it to keep things going", with anecdotal job-completion rates close to what manual checks would have produced, and @rjpower noted that optional retries on executor launches would close the remaining transient-bug gap. @Helw150 announced the deprecation of stanford-crfm/marin-tokenizer and the migration to marin-community/marin-tokenizer (#4977, accidentally deleted upstream), and an unrelated kitokenizer regression — add_special_tokens=False dropping per-document BOS — landed in production between Apr 8 and Apr 22 15:50 UTC and corrupted ~63 TB of tokenized/* data per the GCS audit cross-referenced in #5149 (@rjpower's gist of recent writes is the source of truth; @Helw150 and @ahmeda14960 are doing the affected-cache triage). The wandb>0.24.0 floor went in via #5011 to dodge the silent-upload bug from #5010. A handful of papercuts closed alongside: #4988--reserve ignoring --region when claiming reservation workers — was fixed by #4989; iris --cluster=<typo> now fails fast with the cluster list rather than a generic missing-controller error in #5016; iris --cluster=NAME with auto-tunnel became the documented primary CoreWeave connection pattern in #5001; step_runner's unmet-deps message no longer crashes with unhashable type: list per #4991. Larry's "iris giving us a lot of compute right now wow" on Apr 21 captured the underlying reality: with Ray gone and the budget system live, the cluster is finally doing the job it was redesigned to do, and by Apr 25 @ahmeda14960 could report 4,888 v5p chips up — enough headroom to consider a Qwen3 235B/A22B 3T-token training run in a week.

1 PR this week, and 1 new issue (1 total)
83 autocategorized

#4474 Levanter Store, K8s Logging, and Infrastructure Improvements


Summary: Improve Levanter's data store, fix K8s logging at scale, and address infrastructure gaps in the Iris dashboard and profiling.

This week introduces a new infrastructure epic, #4474, that bundles three streams of plumbing work that had been running in parallel: rethinking the Levanter token store/cache, hardening Iris's K8s logging path so it can survive cluster-scale ingest, and closing a set of correctness and ergonomics gaps in the broader Iris/Levanter stack. The epic carries forward the design conversation that @ravwojdyla opened with new contributor Neville two weeks ago in #4445 — whether the producer-side consolidate_shard_caches dance the tokenize pipeline does to keep TensorStore happy can be replaced by something simpler (and ideally Parquet-shaped, "due to gravity"), without giving up the random-access-from-thousands-of-readers properties training depends on.

The store-side work this week was Neville's. #5070 landed a focused optimization to JaggedArrayStore's write path — a PreparedBatch._from_sequences() fast path that pre-allocates a single flat array for Python-list inputs and skips per-item np.asarray calls — taking a 1B-token write benchmark from 44.4M to 57.5M tokens/sec on local disk, and removing the unused TreeBatchPreparer while there. The larger swing, #4814, remains in draft: it removes consolidate from tokenize entirely and replaces it with an in-memory virtual ShardedTreeCache for downstream readers. Local tests pass and a tutorial CPU train works; @ravwojdyla granted Neville production Iris-cluster access so the next validation step is a 10BT tokenize run from the datakit-smoke workflow, with Iris dashboard and ops docs as the entry point. Together with #5023, which moves token-boundary rendering for perplexity-gap literal examples out of the document-scan hot path and behind a lazy contract, these three PRs are the first concrete steps from the rav/Neville design discussion into landing code.
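
The gist of that fast path: size one flat buffer and an offsets array up front instead of converting each Python list with np.asarray. An illustrative numpy sketch, not the PreparedBatch code:

```python
import numpy as np

def pack_jagged(seqs: list[list[int]], dtype=np.int32):
    """Pack variable-length token lists into (flat, offsets) with two
    allocations and no per-item np.asarray round-trips."""
    lengths = np.fromiter((len(s) for s in seqs), dtype=np.int64, count=len(seqs))
    offsets = np.zeros(len(seqs) + 1, dtype=np.int64)
    np.cumsum(lengths, out=offsets[1:])
    flat = np.empty(offsets[-1], dtype=dtype)
    for i, s in enumerate(seqs):
        flat[offsets[i]:offsets[i + 1]] = s  # bulk copy straight from the list
    return flat, offsets
```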

The week's most visible incident was a tokenizer-correctness bug rather than a store-design issue, and it pulled the same surface area into focus. #5034, opened by @ahmeda14960 after he and Michael Ryan noticed mismatched bpb/perplexity between caches built three weeks apart, traced to Levanter's BatchTokenizer: after the early-April migration to the kitoken backend, the init probe used add_special_tokens=True but the hot path did not, so kitoken's explicit BOS gate stayed off and every text cache built between roughly 2026-04-08 and the fix landing on 2026-04-22 was missing the leading <|begin_of_text|> token (id 128000) on every document. The shared marin-community/marin-tokenizer was unchanged; the bug was purely on Levanter's side. #5040 from Ahmed dropped the probe, prepends BOS manually after encode_batch, and threads append_bos/enforce_bos through cache metadata so post-migration caches will now rebuild on next use. "yikes, no not intentional," Russell replied; "i hate HF tokenizers so so much." The blast-radius accounting moved to #5149, where @rjpower's three-week GCS audit gist identified ~63 TB of tokenized/* data inside the bug window; an empirical scan of the actual TensorStore arrays then confirmed the major training datasets were safe — primarily the validation/eval caches Ahmed and Michael had built were affected, and Russell posted the all-clear later in the week.

On the K8s/Iris-logging side, @rjpower is in the middle of decoupling the log server from the controller process so it can ship as its own image. #4947 relocates iris.cluster.log_store to iris.log_server.store, lifts JwtTokenManager out of the controller, deletes the subprocess-based log-server auto-spawn path, and adds an opt-in enable_log_server_sidecar mode that runs iris-log-server as a second container on the controller VM. The supporting plumbing for this — a JwtVerifier stateless enough to live in non-controller processes, plus a ServiceURL resolver that lets gcp://finelog-server just work — landed as the +2696/-126 #5161. Independently, two operational fixes: #5020 raises the controller's local log-store retention from 50 segments / 5 GB (less than a day at marin's ~6–7 GB/day ingest) to 1000 / 100 GB (~2 weeks), since the read path has no GCS fallback once a parquet ages out, and #5019 parallelizes cross-region log downloads with a 16-thread ThreadPoolExecutor that had previously been single-threaded. #5174, from @ravwojdyla, makes the log UI itself more useful: each line of a job-aggregated log view now starts with a T<N> link to the originating task's detail page, parsed from the existing LogEntry.key. The longstanding #4509 — moving K8s workdir files from ConfigMaps (which had accumulated to 17,800+ orphans on the CW cluster, degrading the API server) to bundle-store blobs — closed this week.

The remaining surface of the epic was a set of Levanter checkpointing and harness fixes. #4386 closed via #4387, adding a temporary_base_path to CheckpointerConfig so time-policy checkpoints route to region-local Marin temp buckets with TTL while step-policy/permanent saves stay durable; the still-open #5066 wires the new temp-checkpoint roots into the Grug and Marin Levanter launch paths, deriving them from the configured output path via marin_temp_bucket(..., source_prefix=...) and keeping forced checkpointer saves permanent. #5022 from Calvin-Xu fixes a more insidious bug — train_lm.py never waited for the final async checkpoint write before exiting and train_dpo.py waited on a freshly-constructed Checkpointer instance with nothing pending — by storing the checkpointer on the trainer and calling wait_until_finished() after training. @eric-czech cherry-picked two upstream Levanter eval-harness fixes (#4912 for the unsupported implementation kwarg closing #4852, #4911 for missing resource_mapping in broadcast_shard); #4977 finished the rename to marin-community/marin-tokenizer across configs, docs, and tests; and #5171 closed #5139 by dropping Levanter's protobuf<7 upper bound now that Ray (the original reason for it) is gone and TB/XProf compatibility was reverified. Two open items round out the picture: #5026, a step_lock self-race where successful executor steps report LeaseLostError on the terminal write_status because the lock is released before the lease's finalizer runs, and #5030, dlwh's request for backward-flow probe support on real ArrayStacked/scanned Grug models so logging doesn't fall back to stack.unstacked() and defeat the point — a baseline non-stacked run is up as the reference behavior.
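
The #5022 fix amounts to one ordering constraint; a minimal sketch of the corrected shutdown path (names follow the PR description, simplified):

```python
# Before: train_lm.py exited without waiting, and train_dpo.py waited on a
# freshly constructed Checkpointer with nothing pending. After: keep the
# live instance on the trainer and drain it once training returns.
trainer.checkpointer = checkpointer
try:
    trainer.train()
finally:
    trainer.checkpointer.wait_until_finished()  # block on the async final save
```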

29 autocategorized

#4283 MoE MFU at scale


Summary: Tracking issue for April MFU work.

0/6 sub-issues closed

A quiet week for the MFU epic itself. With the Triton ragged_dot kernel from #4297 and the JAX 0.9.2 compatibility result from #4455 both landed the prior week, no new throughput PRs or comments arrived against the open sub-issues — #4300 (TPU v4 25%-30% MFU sustained), #4302 (H100×8 MoE MFU), #4311 (Megatron throughput on 8×H100), #4312 (close end-to-end MFU gap), #4313 (2×8×H100) — and none gained activity in the date range. The Grug-vs-Megatron head-to-head that @chloechiaw volunteered to run on April 12 has not yet produced a follow-up in the visible record. With the 1e23 v4-512 MoE run now mid-flight and absorbing most of the modeling team's attention, the epic appears to be in a holding pattern between the kernel work that closed in mid-April and a likely renewal of the H100 measurement push once the big run finishes or stabilizes.

0 PRs this week, and 0 new issues (6 total)

#4282 Agentify experimentation


Summary: Split from #4266.

The headline this week was teaching agents to babysit jobs without leaning on a developer’s laptop. #5018 framed the problem precisely: Codex heartbeat sessions babysitting long Iris jobs were failing because they shelled out to uv run iris, depended on local GCP/W&B token caches, and broke whenever oauth2.googleapis.com auth refresh stalled — the recent perplexity-gap-post4962-rerun watch died exactly that way. #5042 from @dlwh landed the answer: a marin-mcp-babysitter entry point with resident Iris controller and log clients, structured tools for jobs, tasks, logs, workers, processes, and profiles, plus Zephyr progress parsing and a diagnosis layer that classifies stuck-assigned, OOM/exit 137, TPU/XLA bad-node, quota-backoff, dead-worker, and zombie-coordinator states. A follow-up the same day #5071 switched worker metadata to protobuf json_format and added Zephyr coordinator thread-liveness so zombie coordinators surface in the structured output. dlwh flagged in #code-review that he’d “sworn never to learn MCP” but it was the only path to giving Codex automations the network access babysitting needs — an admission Russell Power immediately tried to parlay into handing him k8s next. By Saturday Eric Czech reported in #infra that he’d had Claude babysitting an 8-job sweep for “a little over a week,” and used the same MCP surface to ask the agent for parent-job lifetime stats — about a day on average.

The Nightshift loop ran four nights in a row this week, each producing a small multi-scout cleanup PR with a haiku for a commit poem and a real verification trail attached. #4949 opened the week with three mechanical removals (an unused BuildResult dataclass in iris, twelve lines of commented-out slicing code dead since Sep 2024 in levanter’s JaggedArrayStore, two orphan helpers in marin), and the cadence held through #4998, #5044, #5122, and #5153. The wins are starting to extend beyond dead code: #5122 caught a real correctness bug in levanter.schedule.value_at_step — the forward iteration returned the first segment whose start <= step, so any multi-segment IntSchedule (including train_batch_size in the trainer) silently used the earliest value forever after the first boundary. The scout reversed the iteration, added regression tests covering scalar values, multi-segment boundaries, mid-segment lookups, and the pre-first-segment error case, and shipped it. The pattern of scouts filing no_change findings when the sandbox blocked their writes is also now load-bearing: #5044 — the same PR that landed two scouts’ cleanups — recorded the value_at_step bug as a finding-only note one day before #5122 picked it up and fixed it.
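
The value_at_step bug and fix both fit in a few lines. With segments sorted ascending by start step, a forward scan returning the first start <= step always matches the first segment; scanning in reverse returns the last one, which is the intended semantics (a minimal reconstruction, not levanter's exact code):

```python
def value_at_step(segments: list[tuple[int, int]], step: int) -> int:
    """segments: (start_step, value) pairs sorted ascending by start_step."""
    # Buggy version: `for start, v in segments: if start <= step: return v`
    # always matched segments[0], so multi-segment schedules (including
    # train_batch_size) silently kept the earliest value forever.
    for start, value in reversed(segments):
        if start <= step:
            return value
    raise ValueError(f"step {step} precedes the first segment")

assert value_at_step([(0, 1024), (1000, 2048)], 1500) == 2048
```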

Two issues this week sharpened the meta-loop around how agents do research and maintain themselves. #5117 split out of this epic to codify a forage skill: a dedicated phase before the first hypothesis lands in an agent-research logbook, with explicit Marin-specific search surfaces (long-lived research/<topic> branches, the experiment issue template, .agents/logbooks/<topic>.md, snapshot tags, prior runs in experiments/ and W&B). dlwh’s thread quote in the issue body sets the tone — agents already know how to do lit search, so the skill’s value is “Marin’s spin,” not generic search heuristics — and the issue is explicit that no new MCP servers, paper-ingestion tools, or curated paper library should fall out of it. #5155 opened a parallel question for documentation: build a skill (or skills) that either generate docs/ on demand or flag staleness, so the canonical reference doesn’t silently rot while the agent-authored code around it churns.

Outside the formal epic items, the “I asked Claude” pattern kept widening its surface area. Ahmed used Claude to diagnose a tokenization regression where caches generated yesterday and a month ago produced different bpb on Paloma — “Claude is claiming our caches no longer have BOS by default”, filed as #5034, with an offer to “have claude make a quick PR” as the obvious next step — and later did “a very scientific analysis with claude” to produce #5149, the cleanup list of who needs to delete or re-tokenize which BOS-affected caches, even pinging dlwh that “Claude thinks u own” proteindocs. willheld posted an “Agent MoE Experiment” combining the best gates the agent had found into the leaderboard, and Larry’s “the agent.md file has formulas to convert a difference in macro_loss to a percent speedup” made clear the MoE recipe leaderboard now treats agent-readable instructions as a primary interface. rjpower landed #5039 to fix a silent ops-log loss (the global logs/ gitignore had been swallowing every postmortem written under .agents/ops/logs/) and to write the first ops log entry under the new path; yonromai’s #5073 mopped up the singular-vs-plural .agents/project/ path mistake from #4545 and updated the three AGENTS.md references that pointed at the old path. Both are small, but they are the kind of plumbing that decides whether the agent substrate compounds or quietly leaks.

11 autocategorized

#4273 Improve Usability & Observability


Summary: Lower priority / slack-time workstream covering workqueue, dev-tpu replacement, and observability.

The dashboard kept shipping polish and gained a real RPC observability story this week. @rjpower's #4950 tightened the RPC stats histogram from the coarse 1/2/5/10 schedule to three buckets per octave from 1ms to ~60s, redacted request previews server-side, and redesigned RpcStatsPanel.vue with inline per-method sparkline histograms, p50/p95/p99 ticks, expandable per-method sample scopes, and a Recent / Slow-and-errors split — closing #4706's ask for a real RPC status surface and giving operators something concrete to point at when "the controller feels slow." The same week resolved #4707 (silent log-pusher RPC failures), #4895 (controller audit logging), and #4564 (CPU profile hanging the controller) — longstanding "we can't reconstruct what happened" pain points. @dlwh's #3297 — Iris reporting a job RUNNING while W&B marked the underlying training run crashed — was also closed. #5072 remains open as the next-step proposal: structured logging on the log service so worker/task resource history can be offloaded from the controller's sqlite and queried analytically.
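
Three buckets per octave puts boundaries at successive cube roots of 2, roughly 26% apart, which is fine-grained enough to separate a p95 shift from noise. A quick generator for the schedule described (illustrative, not the PR's code):

```python
import math

def rpc_buckets(lo: float = 0.001, hi: float = 60.0, per_octave: int = 3) -> list[float]:
    """Histogram edges at 2**(1/per_octave) spacing from `lo` to just past `hi`."""
    n = math.ceil(math.log2(hi / lo) * per_octave)  # ~48 edges over ~16 octaves
    return [lo * 2 ** (i / per_octave) for i in range(n + 1)]

edges = rpc_buckets()  # 0.001, 0.00126, 0.00159, 0.002, ... out past 60s
```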

The smaller dashboard papercuts piled up too. @wmoss gave Iris a tab icon in #5077 ("felt like Iris needed more personality"); @ravwojdyla-agent made the jobs view mobile-friendly in #5170 with a clean sm-breakpoint switch between mobile and desktop layouts, added a star-and-filter affordance for top-level jobs in #5007 (URL-synced via ?starred=1, persisted in localStorage), and scoped the controller's state filter to top-level jobs only in #4994 so expanding a parent stops hiding its non-matching children. @yonromai fixed the infra status page history strips to render oldest-to-newest in #5172, and a Claude-authored fix in #5186 made the dashboard sparkline charts fill their container, closing @wmoss's #5185. #4942 swapped the opaque (peer, user_agent) tuple in stats extraction for a small dataclass, and #4896 repaired last week's cross-region ops detector so checkpoint writes from resubmitted jobs no longer get false-flagged. On the CLI side, #5025 caught that IrisClient.list_jobs was ignoring its states/prefix arguments on the wire and page-walking the entire jobs table on every iris job list — filters now push into JobQuery server-side. #5037 rewired bug_report.py off the deprecated GetTaskLogs RPC, and #5045 stopped a RemoteLogHandler leak across the controller test session that was producing flaky CI.

The other big strand was a Zephyr-to-Iris status pipeline. @wmoss's #5063 added stage-level throughput stats (items and bytes flowing through scatter ops) and a stack of follow-ups — #5141, #5143, #5144, and finally #5187 — ferries those stats to the controller, persists them in sqlite, and surfaces a markdown task-status field on the Iris dashboard that Zephyr populates with current-stage and progress info. @ravwojdyla's #5175 caught that map-only stages were emitting items=0 (0.0/s), bytes_processed=0.0MiB for the entire shard runtime — "noisy and misleading" — and #5176 suppresses those when nothing has been counted. @ravwojdyla-agent's #5136 made Zephyr re-raise parquet/vortex schema-mismatch failures with both the expected and inferred schemas, so the diverging field is visible without spelunking. On the testing side, @rjpower landed #5165 — a pytest-driven event-replay framework with 13 curated scenarios and committed golden DB dumps — explicitly as a behavioral fingerprint of main ahead of the in-flight controller SQL-store refactor.

Discord made clear the dashboard work is landing on real users. Larry on Apr 21: "iris giving us a lot of compute right now wow." Two hours after Russell's scheduler-restart for new optimizations, Michael Ryan flagged "I think something might be up with the iris scheduler?" — the scheduler thread had crashed on a reservation-preemption corner case; Russell noted it "need[s] some better monitoring around that or make the sql guards more robust" and shipped a fix the same hour. After Friday's restart he posted an unprompted change-log of the included Iris fixes — the kind of operator-facing comms the dashboard work is meant to support. Eric Czech, who has had Claude babysitting an 8-job sweep for over a week, used iris job summary to pull stats showing top-level parent jobs typically last about a day before something kills them — "anecdotally I think this is at least fairly close to what I would have guessed… a big improvement over the week or two prior." On the friction side, Eric and Tim O'Donnell are re-running gcloud auth login constantly; rav uses a service account, Russell suspects the openathena.ai org's session policy. Eric also caught another stuck Iris worker; Russell pulled the worker log and pointed at #5038 as the fix. The week's ugliest discovery was Ahmed's BOS-token regression in tokenized caches — same model on a month-old cache and a fresh one produced different bpb on Paloma — which became #5149, a multi-day blast-radius scan, and Russell's wishlist note that "I hate HF tokenizers so so much" plus a Versioned("tokenizer=20260423") proposal for eval tokenization steps. None of that is observability per se, but it underscores why #5072's structured logs and #4895's controller audit log matter: when something silently changes, the only durable defense is being able to query history.

31 autocategorized

#4272 Canonical pipeline (download → norm → dedup/quality → tokenize)


Summary: Define canonical data pipelines for all data ingestion: download -> normalize -> dedup/quality -> tokenize.

The week's headline for the canonical pipeline was @ravwojdyla's landing of the datakit testbed baseline, an end-to-end ferry → tokenize → train arm wired off a single CLI entrypoint: per-source sample → tokenize → Grug-MoE training on a v5p-8, with a STAGING_PREFIX and TARGET_TOTAL_TOKENS_B at the top of the file and proportional byte-fair sampling that whole-shard-copies when the budget allows and row-samples the tail when it doesn't. Underneath sits the canonical source registry, a frozen DatakitSource dataclass per entry mirroring the 102 datasets in the marin-community/token-counts HuggingFace repo (99 active, 97 pinned with both revision and repo set — the subset the ferry can materialize); every staged path was cross-checked against gs://marin-us-central1/raw/. A companion datakit-smoke staging validator now runs alongside the daily ferry and exits non-zero with a per-path <status>: <url> report when any of the 70 unique staged_path prefixes is not SUCCESS — turning the registry into an executable contract rather than a lookup table.
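
The sampling rule is simple to state: give each source a byte budget proportional to its share of the target, copy whole shards while they fit, and row-sample the single shard that straddles the boundary. A compact sketch under those assumptions (names hypothetical):

```python
def plan_source(shards: list[tuple[str, int]], byte_budget: int):
    """shards: (path, size_bytes) pairs. Returns (whole_copies, tail_sample)."""
    whole, used = [], 0
    for path, size in shards:
        if used + size <= byte_budget:
            whole.append(path)                  # cheap whole-shard copy
            used += size
        else:
            frac = (byte_budget - used) / size  # row-sample the straddler
            return whole, (path, frac)
    return whole, None                          # budget covers the whole source
```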

The pipeline expanded source coverage in the same motion. #4892 added normalize_*_step factories for nsf_awards, nemotron_v1 (one step per quality/kind split, seven in NEMOTRON_V1_SPLITS), and nemotron_v2 (one per family/subset), validating nemotron_v1 normalize end-to-end on the quality=medium-low/kind=actual slice (1.24B records, 6,299 shards, 14.47 GB peak on 16 GB workers). The weekly Nemotron ferry wired the full normalize → minhash → fuzzy_dups → consolidate → tokenize chain on quality=medium against a verify-only download step that asserts the pre-staged gs://marin-eu-west4/raw/nemotro-cc-eeb783 dump exists and has .jsonl.* files, reusing the canonical override_output_path so StepRunner's cache check short-circuits on the existing STATUS_SUCCESS marker. #5158 tightened the tokenize boundary further by sizing the zephyr window and Levanter cache batch_size from parquet row-group metadata so each unit of work aligns with roughly half a row group end-to-end.
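
The #5158 sizing rule reads straight off parquet metadata; a sketch of the idea (the half-row-group target is from the PR, the code is illustrative):

```python
import pyarrow.parquet as pq

def cache_batch_size(path: str) -> int:
    md = pq.ParquetFile(path).metadata
    rows_per_group = md.num_rows / max(md.num_row_groups, 1)
    # Target ~half a row group per unit of work, so a zephyr window or
    # Levanter cache batch never straddles many groups or re-reads one.
    return max(1, int(rows_per_group // 2))
```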

Running the full registry surfaced two correctness bugs that are easy to read as the price of finally exercising the long tail. #5162 reported that 5 of ~100 sources were producing mostly-empty parquet shards — coderforge was the worst case, with 643 of 672 shards carrying only the 176-byte footer. Root cause: scatter routes records into exactly num_output_shards buckets via hash(key) % num_output_shards, but the reduce stage was spawning max(input_shards, num_output_shards) tasks, so when the input had more shards than the output the surplus reduce tasks ran on indices no record hashes to and emitted empty files. #5166 fixed it with a regression test. Separately, #5142 hardened the parquet writer's schema inference: _accumulate_tables was inferring the schema from the first 8 records, so an optional field that happened to be None in the prologue got pinned to pa.null() and later real values crashed with ArrowInvalid — common-pile/stackv2's nested metadata.gha_language (959 null / 1041 str across ~2,000 records) was deterministically failing. Worse, pa.Table.from_pylist was silently dropping top-level keys missing from the pinned schema, a latent data-loss bug. The fix routes both cases through pa.unify_schemas with permissive promotion and reconciles prior chunks on yield.
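
The empty-shard bug was a mismatch between the routing arithmetic and the task-planning arithmetic; distilled, the fix makes the reduce fan-out equal the bucket count:

```python
def route(key, num_output_shards: int) -> int:
    # Records only ever land in buckets [0, num_output_shards).
    return hash(key) % num_output_shards

def reduce_task_indices(num_input_shards: int, num_output_shards: int) -> range:
    # Buggy version: range(max(num_input_shards, num_output_shards)) spawned
    # surplus tasks on indices no record can hash to, each emitting a
    # footer-only (176-byte) parquet shard.
    return range(num_output_shards)
```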

The most consequential pipeline failure of the week, though, was tokenization-side and pre-existing: #5034 caught that Levanter's BatchTokenizer had silently stopped prepending the BOS token to each document when the tokenizer backend migrated from HF's PreTrainedTokenizerFast to the new KitokenMarinTokenizer in early April. Ahmed Ahmed flagged it after seeing different bpb/pplx on the same model evaluated against caches built a day earlier versus a month earlier — "Claude is claiming our caches no longer have BOS by default? was this intentional" — and Russell's reply, "i hate HF tokenizers so so much", captured the mood. The fix landed mid-week; #5149 tracks the cleanup blast radius across every cache built between roughly 2026-04-08 and 2026-04-22 15:50 UTC. Russell scanned for affected dirs with changes in the last three weeks and noted that the big training caches were spared — most of the damage landed on eval sets and on Rafal's nemotron experiments — but proposed a Versioned("tokenizer=20260423") notation on the eval tokenization steps as the safest forward path. In a quieter parallel thread, romain noticed the stanford-crfm/marin-tokenizer HF repo had been deleted out from under CI (#4974); #4977 renamed everything to marin-community/marin-tokenizer, and #4971 opened the longer-horizon roadmap for derived 32K/64K tokenizers now and a trained v2 family later.

Beneath the data work, Zephyr itself absorbed a sustained polish pass. @hsuhanooi opened a sequence of focused PRs on the scatter/shuffle hot path: a byte-budgeted scatter write buffer that bounds write-side RSS regardless of item size or output-shard count (replacing the fixed 100K-row-per-shard threshold that could accumulate unbounded memory before close()), Arrow IPC and msgspec msgpack codecs as opt-in alternatives to cloudpickle for JSON-shaped chunks (each prefixed with a one-byte format tag so both formats coexist in the same scatter file), a /dev/shm-backed sidecar bytes cache that turns ~5ms GCS reads into microseconds on the second access, and a worker idle-poll backoff cap raised from 1.0s to 5.0s because subprocess-per-shard isolation now makes single-second polling pure busy-waiting. @ravwojdyla in parallel landed preemption-requeue exemption from MAX_SHARD_FAILURES — three clean preemptions had been aborting pipelines because the 3-attempt task budget was being consumed by infra requeues — split the attempt counter into TASK and INFRA kinds, and made INFRA requeues unbounded. #4968 (still open) folds M mapper sidecars into R per-reducer manifests in a new CombineMeta stage so each reducer does one manifest read instead of M; #5135 teaches connected-components to resume from the last complete iteration on opt-in. #5146, closed this week, made StepRunner.run() walk transitive deps internally so the per-script visit() boilerplate in scripts/datakit/ can go away. The cumulative direction is unmistakable: the pipeline is being shaken down by being run, end to end, every day.
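
The requeue change is a small accounting split: infra requeues stop spending the task's failure budget. A minimal sketch, assuming a per-task counter keyed by attempt kind:

```python
from collections import Counter
from enum import Enum

class AttemptKind(Enum):
    TASK = "task"    # genuine task failure: bounded by MAX_SHARD_FAILURES
    INFRA = "infra"  # preemption/requeue: unbounded

MAX_SHARD_FAILURES = 3

def may_retry(attempts: Counter, kind: AttemptKind) -> bool:
    attempts[kind] += 1
    if kind is AttemptKind.INFRA:
        return True  # three clean preemptions no longer abort a pipeline
    return attempts[AttemptKind.TASK] <= MAX_SHARD_FAILURES
```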

31 autocategorized

#4271 Marin-as-a-library (Bolinas can import marin)


Summary: Measurable: Bolinas can `import marin` and use it as a library.

With the marin-* wheels now publishing nightly, this week's work shifted from packaging mechanics to the API surface those wheels expose. The headline change was @rjpower's #5156, which sketches out service-design guidelines and, in the process, deletes roughly 2.9k lines while adding 416 — a substantial net pruning of the surface area an external importer has to reckon with. The PR sparked a small design discussion with @wmoss and @ravwojdyla about whether Tailscale or Cloudflare could replace the SSH/proxy stitching that currently fronts the cluster, with Will demoing the Tailscale UX (Google auth, taskbar toggle) and Rav noting that ngrok would equally suffice for the tunneling pieces.

The StepRunner got two fixes that move it closer to being something a Bolinas-style consumer can call with confidence. #5148 rewrites scheduling to do a deduped post-order walk of the dep graph so callers can pass only terminal steps and the runner resolves the rest, deleting an unreachable iterable-exhausted branch in the process. #4992 fixes a more embarrassing bug: the unmet-deps error path was passing a whole StepSpec (a frozen dataclass with a list[StepSpec] deps field) into a dict.get, which crashed with TypeError: unhashable type: 'list' instead of producing the intended RuntimeError naming the waiting step. A regression test now pins the right error message. On top of those, @wmoss's #5157 proposes an as_step_fn decorator that lets users skip the lambda op: ... boilerplate when building StepSpecs; Rav pushed back that the convenience introduces magic ("not good for humans, nor agents") and the PR is still open pending that discussion.
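
The #5148 scheduling change is a classic deduped post-order traversal, and keying the seen-set on object identity also sidesteps the #4992 hashability trap, since a frozen dataclass with a list field cannot be hashed (an illustrative sketch, not the StepRunner source):

```python
def resolve_order(terminal_steps):
    """Return every transitive dependency before its dependents, each once."""
    order, seen = [], set()

    def visit(step):
        if id(step) in seen:   # StepSpec carries a list field, so it isn't hashable
            return
        seen.add(id(step))
        for dep in step.deps:  # list[StepSpec]
            visit(dep)
        order.append(step)     # post-order: deps land first

    for step in terminal_steps:
        visit(step)
    return order
```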

Three older tracking issues from the original Executor-as-tracing-model thread were closed during the week — #2287 (typed outputs from a tracing executor), #2349 (HF model download/convert step), and #2408 (download steps for models and tokenizers) — effectively retiring the Executor design discussion now that StepRunner + the marin-* packages are how the library ships. @wmoss's Claude-generated architecture docs PR #4960 was also closed this week in favor of tracking the broader docs-via-skill outcome in #5155; the discussion (Will, dlwh, Rav, @yonromai) converged on the view that hand-written architecture docs go stale fast on a project this young and that an "architecture-overview" skill run on a nightshift schedule is a better fit. Russell also opened #5067 proposing a consistent type|domain naming scheme for the GitHub workflows and a "minimal YAML" policy that pushes logic into independently runnable Python scripts — another step toward making the repo legible from outside.

Two smaller items rounded out the week. #4801 bumped pyrefly from 0.42.0 to 0.61.0 and regenerated the baseline (live errors stayed at 0; suppressions trimmed 173→169), keeping the type-checker honest as the package boundaries continue to shift. And on Discord, romain in #infra flagged that huggingface.co/stanford-crfm/marin-tokenizer appears to have moved to marin-community/marin-tokenizer without an announcement — a small but telling reminder that the HF-org migration that accompanied the marin-* rename last week is still surfacing rough edges for downstream consumers.

10 autocategorized

#4270 Canary pass rate to 90%+


Summary: Measurable: canary ferry pass rate consistently above 90%.

Pass rates landed where the epic name has been pointing all month. The TPU canary ferry was 7-for-7 this week, the CoreWeave GPU canary 7-of-8 (~87%), and the datakit smoke ferry 6-of-7 (~86%) — all three above or within rounding of the 90% target, and a clean inversion of the picture two weeks ago when central2 Ray restarts were shredding interpretability of the numbers. The single CoreWeave miss on Apr 21 was run 24718701520, which is the exact failure @rjpower diagnosed with #5004: bash does not expand ~ inside the double-quoted KCI assignment in the GPU canary diagnostics step, so every kubectl call had been stat’ing a literal ~ path and emitting zero-byte logs that hid the real failure. The PR moves everything to $HOME, uploads the claude-code-action transcript as an artifact (the GHA step UI was hiding it), and broadens the triage tool allowlist so the agent can run common shell helpers and write slack_message.md. @yonromai’s #5003 bumped actions/checkout, actions/setup-python, and the docker actions to Node-24-compatible majors ahead of GitHub’s June 2 forced migration, removing a wall of deprecation warnings that had been noise on every marin-canary-ferry-cw manual trigger.

The CoreWeave integration pipeline got a two-step timeout headroom fix after a green-but-tight regime started biting. @yonromai’s #5111 bumped hf_save_steps from 1 to 2 in the CW integration test, removing a duplicate end-of-train HF checkpoint save: with hf_save_steps=1 and num_train_steps=2, the per-step run_hooks(info) and the end-of-train run_hooks(info, force=True) both fired on step 1 and uploaded the model twice to the same destination, costing 3–4s of duplicated S3 I/O against a 600s budget that was already at 96% utilization on green runs. #5112 then raised the outer shell timeout from 600s to 900s — the prior 24806172401 run had failed with exit 124 about 42s after the final checkpoint save, before pytest could clean up. @rjpower’s #5125 bumped further, to an 1800s shell wrapper and 900s pytest-timeout, after observing that the full marin pipeline runs 8 sequential Iris sub-jobs each paying ~60–90s of pod startup plus dep sync, and that on the last CoreWeave CI run train_lm finished only 34s after pytest-timeout fired at 600s. The combined effect is enough headroom for cold-pod variance without losing the inner pytest deadlines that surface real hangs.

The cross-layer test-skipping problem from prior weeks closed out: @rjpower’s #5086 expanded paths-filter in the fray, zephyr, levanter, and marin workflows to include all upstream layers each project imports from, fixing #4728. The concrete motivating instance was iris commit cf05d94ee on Apr 12, which changed Constraint to store values: tuple[AttributeValue, ...] with no .value attribute and silently broke five fray assertions for two days because nothing fray-touching landed in the interim. The HF tokenizer 401 issue #4974 — stanford-crfm/marin-tokenizer was deleted out from under levanter-ray-tests on Apr 20, surfaced by @yonromai’s agent-generated repro showing anonymous 401 / authenticated 404 (HF’s “hide existence” pattern) — was traced in #infra to @dlwh’s recollection that “Yifan deleted it,” and resolved by migrating in-tree references to marin-community/marin-tokenizer via #4977.

Two open items frame the next phase. The GPU canary cutover from SlimPajama to Nemotron in #3704 — bundling the Nemotron CC upload fix, larger download-worker memory, the romain-nt CoreWeave Iris config, and the Zephyr/Iris startup-wait mitigation discovered while debugging — is sitting in draft awaiting a final ready-to-merge from @ravwojdyla. And @rjpower opened #5065, an Infra Stability Epic asking what testing gates should be required before considering an Iris/Zephyr change “done,” sketching a small/medium/large playbook with seconds/minutes/hours/overnight gate tiers and asking that the tests be easily accessible to agents with builtin reporting — positioning canary pass rate as one signal in a larger release-gating story rather than the only one. The cluster restarts that historically dragged pass rate down were less of a factor this week, though Russell restarted the Iris scheduler twice on Apr 22 after a reservation-hold preemption special-case crashed the scheduler loop (fixed in #5032) and again on Apr 24 to pick up new budget code — both quick enough not to dent the ferries.

11 autocategorized

#3192 Synthetic data (research + critical path for post-training)


0/4 sub-issues closed

SWE-ZERO crossed from pipeline validation into measurable downstream signal this week. The 140B-token scale-out #4719 is now ~20% complete (2.44M of 12.3M rollouts, 27.9B of 140B tokens) running ~9 v6e-4 jobs after @AlienKevin consolidated the fleet to US-only to eliminate trans-Atlantic egress. More consequentially, the quality-validation experiment #4898 shipped: SFT on Marin-8B base with 10K, 50K, and 100K SWE-ZERO trajectories, evaluated three times each on a fixed 100-task SWE-bench Verified subset under mini-swe-agent v1. Resolve rate goes from 0.0% baseline to 3.3% ± 0.6% (10K), 4.0% ± 2.0% (50K), and 5.3% ± 1.5% (100K) — a positive scaling trend, but the headline is that purely synthetic, execution-free rollouts produce real SWE-bench capability on an 8B model whose training data contains essentially none of the eval repos. The three published checkpoints (10K, 50K, 100K) and the eval-trace dataset are the first artifacts that turn last week's "the pipeline works" into "the data teaches the model something." A 500K SFT is in flight to extend the curve.

The flip side surfaced a sharp generalization-failure mode worth flagging. At 100K trajectories the model degenerates into immediate !!! repetition on 60% of trials versus ~20% for 10K/50K, and Kevin's root-cause analysis traces it to markdown code fences in user prompts: 100K degenerates on 77% of code-block prompts vs 15% on prompts without — a U-shaped sensitivity curve where 10K is too undertrained to disambiguate input vs output code blocks, 50K resolves the distinction, and 100K erodes it again via catastrophic narrowing on the SWE-ZERO format distribution. The mean resolve rate is still highest at 100K because gains on non-degenerate trials outweigh the degeneration, but it's a clean reminder that single-run SWE-bench numbers at small scale are unreliable (a 4% single-run result was initially read as a regression before the multi-run rerun corrected it to 5.3% ± 1.5%) and a concrete signal for the eventual 140B SFT design — likely lower LR for longer training, or augmentation with code-fence-heavy prompts.

The agentic-SFT NemotronTerminal arc hit a less encouraging milestone. The Marin-32B 15% Terminal-Corpus run #4760 finished all 859 steps at loss ~0.49, but TB2 evaluation produced 1/89 = 1.1% across three trials — versus 17.2% for the Qwen3-32B SFT at step 500 of #4307 on the same harness, and 27.4% for the published model. Debugging the gap consumed most of the week: an early hypothesis was vLLM TPU architecture incompatibility (Marin-32B differs from Qwen3-32B in attention heads, vocab size, intermediate size, and uses llama3 RoPE scaling), then XLA compilation timeouts, then chat-template thinking blocks. The actual root cause, found by logging the raw vLLM response, is that the model emits ~2K characters of coherent reasoning and then falls into `zorazorazora...` repetition for the remaining 124K characters until hitting the 24K-token output cap. Sweeping inference settings on the git-leak-recovery task showed the degeneration is mitigable: temperature ≥0.8 (with or without repetition penalty) produces coherent output, while repetition/frequency penalties at temp=0.6 do not. The Qwen3-32B SFT itself improved to 17.2% with n_concurrent=10 sharded eval (up from 14.5%) and remains the team's strongest TB2 baseline at only 8.7% of full training. Whether the Marin-32B underperformance is fundamentally architectural or a fixable inference-config artifact is now the open question.
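A minimal sketch of that kind of sampling sweep using the vLLM offline API; the checkpoint path, prompt, and degeneration probe are placeholders, not the team's actual harness:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/marin-32b-terminal-sft")  # placeholder checkpoint

grid = [
    SamplingParams(temperature=0.6, repetition_penalty=1.1, max_tokens=24_576),
    SamplingParams(temperature=0.6, frequency_penalty=0.5, max_tokens=24_576),
    SamplingParams(temperature=0.8, max_tokens=24_576),
    SamplingParams(temperature=0.8, repetition_penalty=1.1, max_tokens=24_576),
]

prompt = "..."  # the git-leak-recovery task prompt

for params in grid:
    text = llm.generate([prompt], params)[0].outputs[0].text
    # Crude collapse probe: a short substring from deep in the output that
    # repeats thousands of times indicates the zora-style degeneration.
    probe = text[2000:2008]
    degenerate = len(text) > 4000 and text.count(probe) > 500
    print(f"temp={params.temperature} rep_pen={params.repetition_penalty} "
          f"degenerate={degenerate}")
```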

The instruction-data surface area expanded along several axes. PR #4329 from @Helw150 merged, adding download-and-transform pipelines for six HuggingFace rollout/reasoning datasets — CoderForge-Preview, GPT-OSS-20B rollouts, Nemotron-Terminal-Corpus, Principia-Collection, Superior-Reasoning-SFT, and SYNTHETIC-1 — each with pinned revisions and an experiment script under experiments/rollout_data/. @taivu1998 opened three SFT/RL PRs: PR #4996 registers FineProofs-SFT as both a raw-messages and proof-only view for long-context reasoning, PR #4997 adds NuminaMath CoT and TIR with a staged CoT→TIR experiment that warm-starts Stage 2 from the Stage 1 HF export, and PR #4684 wires native Reasoning Gym environment support into Marin RL with a minimal curriculum example and tests for environment loading and rollout statistics. PR #4620 (the explicit KL config refactor) gained additional review activity and remains open. PR #5035 unifies RL rollout decoding across vLLM and Levanter around a shared configuration, records applied decoding on rollouts, and adds real Levanter top_p support through the native inference engine — a substrate change directly relevant to how SWE-ZERO-style rollouts get produced consistently across backends. In #code-review, @taivu1998 requested review on the FineProofs and NuminaMath PRs together, and in #data-curation @cs2716 surfaced core.ac.uk's 2.7TB open-access research-papers corpus as a candidate source.
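The pinning pattern those pipelines follow, sketched below; the repo ids and revision hashes are hypothetical (only the dataset names appear in #4329):

```python
from datasets import load_dataset

# Pinning `revision` to a commit hash keeps the download reproducible even if
# the upstream dataset repo is later force-pushed or reorganized.
ROLLOUT_SETS = {
    "nemotron-terminal-corpus": ("example-org/Nemotron-Terminal-Corpus", "0123abc"),
    "synthetic-1": ("example-org/SYNTHETIC-1", "4567def"),
}


def load_pinned(name: str, split: str = "train"):
    repo_id, revision = ROLLOUT_SETS[name]
    ds = load_dataset(repo_id, split=split, revision=revision)
    # Downstream, an experiment script under experiments/rollout_data/ would
    # map each row into the training message schema.
    return ds
```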

The online-distillation thread from last week continued to develop in the background — not as a new PR, but as a steady drumbeat of relevant references. @willheld shared AI2's BAR post in #news on Apr 20 — a recipe for training domain experts independently and merging them into a single MoE — to which @dlwh replied "oh man i wanted to do this at some point," and willheld noted that BAR plus FlexOlmo are good companion reads to the prior week's distillation conversation. On Apr 24 willheld also posted the TIP paper on token-importance in on-policy distillation (which trains a student on its own rollouts under token-level teacher supervision), explicitly cc-ing for "PPL gap sets" — a direct connection to the proxy/prediction work in the downstream-scaling thread, where @RohithKuditipudi is soliciting mid/post-training mixes from @ahmeda14960 and willheld to run across the scaling ladder for issue #4547. The synthetic-data epic and the prediction/proxy epic are visibly converging — SWE-ZERO produces the trajectories, the scaling ladder will tell us whether smaller-model SFT outcomes predict larger-model behavior, and the BAR/TIP/OPD literature is shaping how the team thinks about the eventual recipe.

0 PRs this week, 2 new comments, and 0 new issues (4 total)

#3100 Data sources for pre-training / mid-training


Summary: We will need 20T of high-quality tokens (code in particular) for our large MoE runs in Q2/Q3; this is the March work that will enable that.

0/5 sub-issues closed

The data conversation this week was reorganized end-to-end around a single diagnostic question: where, exactly, do Marin base models lose bits to Llama and Qwen, and what data should fix it? @dlwh landed the gap-report machinery in #4962 — pairwise Levanter-loadable LM comparisons that attribute bits-per-byte gaps to datasets, segments, and surface-form literals — and used it to file a confidence-portfolio epic in #5005. The framing in that issue is explicit: don't define one scalar for "good base model"; build a portfolio of evidence that catches broad failures before expensive post-training and RL evals. The first 8B reruns from #5023, summarized on #4961, located the largest losses in code-shaped surfaces (github_cpp +0.0345, github_python +0.0264, c4_en +0.0279, redpajama +0.0494 bpb against Llama 3.1 8B) and in byte buckets that survive cleaning — text/url, text/non_ascii_word, whitespace/newline, text/number. The conclusion is that the gap is a raw-technical-surface gap, not a GitHub-only crawl gap, and the proposed fix is a new raw_technical_text tranche that unions surface-preserving sources (StarCoder, StarCoder2 extras, Common Pile stackv2/github_archive/ubuntu_irc/python_enhancement_proposals/stackexchange, and raw Dolma algebraic-stack/arxiv/open-web-math/stackexchange) with a follow-on raw-HTML tranche tracked separately on #5012.
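The bookkeeping behind a bits-per-byte gap is simple enough to sketch: sum per-document negative log-likelihood in nats, divide by UTF-8 bytes, convert to bits, then difference per dataset. The real machinery in #4962 additionally attributes down to segments and surface-form literals; this is only the outer loop:

```python
import math


def bits_per_byte(nll_nats: list[float], texts: list[str]) -> float:
    """Total NLL (nats) over total UTF-8 bytes, converted to bits."""
    total_nats = sum(nll_nats)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return total_nats / (math.log(2) * total_bytes)


def bpb_gap(model_a: dict[str, float], model_b: dict[str, float]) -> dict[str, float]:
    """Per-dataset bpb(A) - bpb(B); positive means A loses bits to B there."""
    return {k: model_a[k] - model_b[k] for k in model_a.keys() & model_b.keys()}


# e.g. bpb_gap(marin_8b, llama31_8b) would surface entries like
# {"github_cpp": +0.0345, "redpajama": +0.0494, ...}
```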

From there @dlwh opened a coverage map of "unknown-unknown" PPL slices and matching data-sourcing tasks that ran the length of the week — #5056 raw web/markup/image-text, #5057 binary/network/security artifacts, #5058 bio/chem notation, #5059 time-series/tabular/geospatial, #5060 formal-methods/RTL, #5061 package metadata, #5062 game/music notation, #5093/#5094 diagnostic logs, #5095/#5100 diff/patch corpora, #5096/#5101 paraphrase/translation robustness, #5097/#5102 ASR/OCR-noisy text, and #5098/#5099 GH Archive structured output. Sister coordination issues #4963 (chat/agent/numeracy PPL), #5006 (strong-model sampling as gap-discovery), and #5053 (LM-Eval dev splits as PPL datasets) round out the portfolio. The matching wave of slice PRs landed against the long-tail registry: SVG-backed raw markup #5130, first-wave tabular #5129, formal-methods/RTL #5128, bio/chem #5127, UWF Zeek security #5126, package metadata #5124, capped diagnostic-log builders #5104, diff/patch builders #5103, GH Archive #5119, ASR/OCR #5118, paired robustness #5120, game/music #5080, synthetic reasoning #5133, and a tokenizer-axis diagnostic #5085 for whitespace-sensitive formats split off from #5079. Two cross-cutting infrastructure PRs underpin the wave: #5169 caches per-model PPL scores once per bundle so the cost scales with models rather than model pairs, and #5084 adds a typed registration helper that keeps default validation sets untouched while the long-tail bundle is opt-in.
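The #5169 change is easiest to see as a memoization boundary: score each (model, slice) pair once and derive every pairwise report from the cache, so a bundle of M models costs M evaluations rather than one per pair. A sketch with a stubbed eval call; names here are illustrative:

```python
from functools import lru_cache
from itertools import combinations

MODELS = ("marin-8b", "llama-3.1-8b", "qwen3-8b")
SLICES = ("github_cpp", "text_url", "whitespace_newline")


def run_ppl_eval(model: str, slice_name: str) -> float:
    # Stub standing in for the real per-slice Levanter eval run.
    raise NotImplementedError


@lru_cache(maxsize=None)
def score(model: str, slice_name: str) -> float:
    # The expensive eval happens once per (model, slice), not once per pair.
    return run_ppl_eval(model, slice_name)


def pairwise_gaps():
    for a, b in combinations(MODELS, 2):
        yield a, b, {s: score(a, s) - score(b, s) for s in SLICES}
```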

The first 32B-scale validation came at week's end. #5123 wired a capped Marin 32B vs Qwen3 32B run across the first-wave log/diff/robustness/ASR-OCR/GH-Archive slices, and the result on #5005 was sharp: across 3,520 docs Marin is +0.0142 bpb worse overall, but the loss is concentrated in GitHub-shaped structured programmer text — GH Archive +0.0211 bpb, GHALogs +0.0233, LogHub Apache +0.2895, CommitPack +0.0691 — while Marin actually wins on PAWS paraphrase (-0.1878) and FLORES eng-deu translation (-0.0700) and on both ASR and OCR slices. Pattern attribution put text/url at the top (+103,954 lost bits across 4.13 MB at +0.0252 bpb), with text/number and whitespace/mixed next. The signal is consistent with the 8B reruns and gives the long-tail program a real result to point at, though @dlwh explicitly raised the worry of benchmaxxing on PPL proxies, asking whether agentic-RL suitability can be approximated this way at all without RL'ing.

Multilingual coverage moved in the same week. #5008 added a reusable FineWeb2 multilingual eval bundle (top-50 languages by row count plus native-script Indic configs, pinned to HF parquet test splits and tagged for aggregate/script/language/Indic Levanter metrics), and #5074 wired the matching gap-rerun runner that compares Marin 8B against Llama and Qwen with Paloma and uncheatable held in the same report so multilingual regressions are visible next to broad raw PPL. On the data-source side, @Helw150's #4326 — a download/filter pipeline for HPLT v3.0 English keeping only non-CommonCrawl sources (WIDE, survey crawls) and applying register-based quality filters validated against a Haiku oracle at ~99% agreement, projected ~450B unique English tokens — picked back up after @ravwojdyla apologized for missing it and offered to rebase onto the new datakit registry from #5159. @Helw150 also opened #5009, a DFS stitching pipeline for Common Pile Stack v2 that groups file-level records by owner/repo@commit and concatenates each repo's files in depth-first path order to produce one document per repo, targeting long-context code pretraining and closing #4978. Long-context audit work continued separately on #4735, where the FinePDFs English slice review from #4738 ran against `finepdfs_by_language["eng_Latn"]` in us-east5 to produce per-source bucket statistics for the exp2062 long-context pilot.
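The core of the #5009 stitching is a group-then-walk: collect file records by owner/repo@commit, order paths as a depth-first walk of the path tree, and emit one concatenated document per repo. A sketch with assumed record fields and a hypothetical separator format:

```python
from collections import defaultdict


def stitch_repos(records: list[dict]) -> dict[str, str]:
    """records are file-level rows with 'owner', 'repo', 'commit', 'path', 'text'."""
    repos: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        repos[f"{r['owner']}/{r['repo']}@{r['commit']}"].append(r)

    docs = {}
    for key, files in repos.items():
        # Sorting on path components yields a depth-first pre-order of the
        # path tree, so files that share a directory stay adjacent.
        files.sort(key=lambda f: f["path"].split("/"))
        docs[key] = "\n\n".join(f"### {f['path']}\n{f['text']}" for f in files)
    return docs
```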

The Luxical-as-data-integration thread on #3049 came to a close. The N=1000 capacity-ladder run (Luxical-192d / Arctic-L-1024d / BGE-large-1024d plus Dolma-3 fasttext baselines and oracle test-retest ceilings) put Luxical at Spearman 0.707 on Claude-rubric quality, Arctic at 0.771 — decisively above the fasttext baseline at 0.258 on the Claude rubric, but below fasttext on Nemotron-bucket quality (0.415 vs 0.588) and on 24-class topic classification. Capacity is not the bottleneck (BGE at 1024d ties Luxical at 192d on quality), and oracle noise is not either (Claude self-agrees at 0.964). @ravwojdyla's tldr: "we can certainly make the embeddings work, but the question is how do we try to make those decisions going forward" — the proposal is to formalize the run's artifacts into a reusable "datakit testbed" for benchmarking dedup strategies, candidate embedders, quality filters, and topic classifiers against a canonical set of probes. In Discord, @cs2716 flagged CORE (2.7 TB of open-access research papers with extracted text or download links) as a possible source for the technical-text tranche, and a new MIT MathNet release (30k problems across 17 languages) showed up in #news as a candidate for the multilingual-reasoning side of the #4963 coverage map.
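The capacity-ladder comparison above boils down to a rank correlation between each candidate scorer and the oracle rubric over the same documents; a sketch with scipy, where the score arrays are placeholders:

```python
from scipy.stats import spearmanr


def probe(candidate_scores: list[float], oracle_scores: list[float]) -> float:
    """Spearman rank correlation of a candidate quality scorer against the
    Claude-rubric oracle over the same N=1000 documents."""
    rho, _pvalue = spearmanr(candidate_scores, oracle_scores)
    return float(rho)


# Under this probe the week's numbers read as: Luxical-192d ~0.707,
# Arctic-L-1024d ~0.771, fasttext baseline ~0.258, oracle test-retest ~0.964.
```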

0 PRs this week, 2 new comments, and 0 new issues (5 total)

Community Pulse


External contributors carried significant chunks of infra and modeling work. @wmoss landed Zephyr/Iris PRs #5145, #5077, #5064, and #5063 on stage-level throughput, Arrow Flight, and dashboards. @hsuhanooi's scatter-codec investigations included an end-to-end msgpack-vs-cloudpickle benchmark on #5091 that closed after pickle won on Marin's string-heavy GCS workloads, plus the OOM-proof scatter buffer in #5055. @WhenWen drove an agent-run depth-MuP LR sweep across d512–d1280 on #5178 (no promotable speedup) and an AdamH global-grad-norm variant #5183 that cleared gate 1. @nevillelyh validated the consolidate-less tokenize path in #4814 with end-to-end Iris canary jobs and merged JaggedArrayStore optimization #5070. @eric-czech shipped levanter eval-harness fixes #4911 and #4912, and @MaxiBoether chimed in on #4445 from DatologyAI to share their in-progress open-source data loader.

Ten new members posted introductions: uwu1, RE/founder on multimodality and kernels; David Hidary, Columbia AI undergrad on a TRC grant; Iheb from TII Abu Dhabi, main contributor on Falcon-H1R / H1 / Falcon 3; Howard (zhipeng) from LFAI&Data, announcing a multi-modal speedrun under the Open Model Initiative built on modded-nanogpt and Marin speedrun; Yishun Lu, Oxford post-doc in optimization/HPC; Jeremi Nuer, UCSB student in robot learning and mech interp with a NeurIPS workshop paper on MoE superposition; Taksch Dube, Kent State PhD on multi-agent systems and applied category theory; Timur Kharisov, DL-theory PhD on optimal LR schedules and edge-of-stability; Nick DePalma, self-driving researcher with JAX-based robotics infra; and Wasiu "Truth", maths grad and OnlyDust/Protocol Labs OSS contributor. David's TRC context intersects with his first PR #5150 adding internal-IP-only iris deployments via IAP tunneling for TRC-bound TPUs; Timur brings context on the LR-schedule and stability questions running through grug MoE.

Eric Czech's #infra thread on resumable cross-region training turned into a long discussion with @rjpower on TPU availability, the cost model of mirror://, and whether the Executor needs a "modern era" rewrite that decouples reservations from data-locality — followed by Eric's report on a week-long Claude-babysat 8-job sweep that mostly held without manual intervention. Ahmed M Ahmed's BOS-token regression report opened a multi-day cleanup in #5149: caches built without per-doc BOS were silently degrading runs, and the impacted tokenization outputs were rebuilt. In #moe, Larry's 1e22 capfactor finding — OOD evals flood a single expert and drop tokens; raising the eval-time capacity factor improves github_cpp from 0.4882 to 0.4259 bpb — fed into EP=8 1e23 planning and Kaiyue-Wen's pitch for a sequence-level auxiliary loss in the dpsk-v3 spirit.
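For context on the capfactor mechanics, the standard expert-capacity arithmetic, sketched below; whether grug computes capacity exactly this way is an assumption:

```python
import math


def expert_capacity(tokens: int, num_experts: int, cap_factor: float) -> int:
    # Each expert accepts at most cap_factor * (tokens / num_experts) tokens;
    # tokens routed past that budget are dropped.
    return math.ceil(cap_factor * tokens / num_experts)


# If an OOD eval batch floods one of 64 experts with 4096 of 8192 tokens:
cap = expert_capacity(8192, 64, cap_factor=1.25)  # 160 slots
dropped = max(0, 4096 - cap)                      # 3936 tokens dropped
# Raising the eval-time cap_factor recovers those tokens, which is the
# mechanism behind the 0.4882 -> 0.4259 bpb improvement on github_cpp.
```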

Reading in #news clustered around modular post-training and MoE routing — AI2's BAR recipe for merging independently-trained experts, the DeepSeek-V4 PDF with skeptical commentary from Larry on its sigmoid-gated unbounded right-tail, "TIP: Token Importance in On-Policy Distillation", and a layer-residual rethinking thread proposing alternatives to the ResNet-style depth highway untouched since 2015.

News & research shared

GitHub activity from 10 external contributors

Will Moss · @airbnb · San Francisco, CA 12 PRs, 14 comments

  • #5145 [Zephyr] Remove legacy start_stage method +16 −20
  • #5077 Iris needs an icon! +55 −23
  • #5064 Fix Arrow Flight host resolution and skip PlantCAD dry-run +4 −2
  • #5063 Add stage-level throughput stats +161 −12
  • #4954 [Zephyr] Fix bug where counter flushing thread was given dummy context 💬3 +62 −11
  • #5187 [iris] Add markdown status text to tasks UI and using it for Zephyr tasks +587 −110
  • #5176 [zephyr] skip status log when no counters recorded 💬2 +133 −15
  • #5157 Create convenience decorator `as_step_fn` for building `StepSpec`s 💬2 +21 −3
  • #5144 [Iris] Add sections to the UI for task stats 💬1 +1126 −109
  • #5143 [Iris] Store the task stats in sqlite +752 −17
  • #5141 [Zephyr] Send task stats from Zephyr to Iris 💬1 +369 −17
  • #4960 Add Claude-generated architecture documentation to docs 💬5 +2091 −0
14 comments on 11 threads
  • #5176 [zephyr] skip status log when no counters recorded ×2
  • #5048 [zephyr] Fix dead threading.Event in _wait_for_stage ×2
  • #4960 Add Claude-generated architecture documentation to docs ×2
  • #5186 [iris] Make dashboard sparkline charts fill their container
  • #5156 Sketch out service design guidelines.
  • #4954 [Zephyr] Fix bug where counter flushing thread was given dummy context
  • #5157 Create convenience decorator `as_step_fn` for building `StepSpec`s
  • #5141 [Zephyr] Send task stats from Zephyr to Iris
  • #5155 Find a better way to keep `docs` are up-to-date
  • #5175 Optional counter/perf stats logs
  • #5185 [Iris] The CPU and Memory spark charts don't fill the container in the dashboard

Hsu Han Ooi · Seattle, WA · I enjoy long robotic walks on the beach and making smalltalk with my chatbot friends. 9 PRs, 16 comments

  • #5051 [zephyr] Raise worker idle poll backoff cap from 1.0s to 5.0s +1 −1
  • #5116 [zephyr] Add ArrowIpcCodec as opt-in scatter codec for JSON-shaped pipelines 💬2 +216 −59
  • #5055 [zephyr] OOM-proof scatter write buffer with byte-based flush budget 💬7 +234 −12
  • #5048 [zephyr] Fix dead threading.Event in _wait_for_stage 💬4 +29 −8
  • #5049 [zephyr] (Organization) Add _set_worker_state helper and O(1) _alive_workers counter 💬2 +32 −18
  • #5050 [zephyr] Raise coordinator heartbeat-check interval from 0.5s to 2.0s 💬1 +1 −1
  • #5088 [zephyr] Use Arrow IPC for scatter chunks, fall back to pickle 💬1 +88 −33
  • #5091 [zephyr] Use msgspec msgpack for scatter chunks, fall back to pickle 💬6 +107 −37
  • #5107 [zephyr] Cache scatter sidecar bytes and fix subprocess flusher context 💬5 +107 −18
16 comments on 7 threads
  • #5091 [zephyr] Use msgspec msgpack for scatter chunks, fall back to pickle ×5
  • #5107 [zephyr] Cache scatter sidecar bytes and fix subprocess flusher context ×4
  • #5055 [zephyr] OOM-proof scatter write buffer with byte-based flush budget ×2
  • #5049 [zephyr] (Organization) Add _set_worker_state helper and O(1) _alive_workers counter ×2
  • #5048 [zephyr] Fix dead threading.Event in _wait_for_stage
  • #5050 [zephyr] Raise coordinator heartbeat-check interval from 0.5s to 2.0s
  • #5088 [zephyr] Use Arrow IPC for scatter chunks, fall back to pickle

whenwen · A learner, interested in machine learning, language models, and mathematics. 4 PRs, 14 comments

  • #5183 [grug] Add MoE AdamH global gradient normalization +712 −33
  • #5181 [grug] Add MoE AdamH gradient normalization +432 −5
  • #5179 Add MoE depth MuP LR sweep +1474 −6
  • #4930 [speedrun] Add MuonC archived Qwen3 submission 💬1 +2612 −0
14 comments on 5 threads
  • #5182 Agent MoE Experiment: AdamH global gradient normalization ×8
  • #5180 Agent MoE Experiment: AdamH gradient normalization ×3
  • #5036 Add Grug backward-flow logging
  • #4930 [speedrun] Add MuonC archived Qwen3 submission
  • #4923 [speedrun][MuonC] Publish archived Qwen3 submission

Neville Li · NY · Recovering "AI" engineer 2 PRs, 5 comments

  • #5070 Optimize JaggedArrayStore write path and remove dead code 💬3 +64 −56
  • #4814 Remove consolidate from tokenize 💬6 +526 −30
5 comments on 2 threads
  • #4814 Remove consolidate from tokenize ×3
  • #5070 Optimize JaggedArrayStore write path and remove dead code ×2

Tai Vu · San Francisco, CA, US 6 PRs

  • #5035 [rl] Refactor rollout decoding across vLLM and Levanter +1261 −334
  • #4997 [SFT] Add NuminaMath CoT/TIR datasets and staged experiment 💬1 +208 −0
  • #4996 [SFT] Add FineProofs-SFT raw and proof-only instruction datasets +180 −0
  • #4620 [rl] Make KL configuration explicit and add k2 support 💬2 +312 −56
  • #4249 [Eval] Add long-context evaluation lane for exp2062 💬1 +890 −2
  • #4684 [rl] Add native Reasoning Gym environment support 💬1 +940 −1

claude-nightshift 5 PRs

  • #5153 [nightshift] 20260424 multi-cleanup +10 −67
  • #5122 [nightshift] 20260423 multi-cleanup 💬1 +40 −89
  • #5044 [nightshift] 20260422 multi-cleanup +22 −28
  • #4998 [nightshift] 20260421 multi-cleanup 💬1 +5 −127
  • #4949 [nightshift] 20260420 multi-cleanup +2 −66

Eric Czech 3 PRs, 1 comment

  • #4967 Merge main into dna-dev +46191 −12845
  • #4912 [levanter] Fix lm_eval use of unsupported implementation arg +6 −0
  • #4911 [levanter] Fix eval harness resource partitioning +11 −2
1 comment on 1 thread
  • #4697 Experiment: 1e23 MoE Run

dhidary 1 PR, 1 comment

  • #5150 [iris] Support internal-IP-only deployments via IAP tunneling 💬3 +92 −74
1 comment on 1 thread
  • #5150 [iris] Support internal-IP-only deployments via IAP tunneling

Rohith Kuditipudi · cs phd @ stanford 2 PRs

  • #5015 Add paged-decode interface to Qwen, matching llama/apertus +119 −0
  • #5014 Add teacher-forced scoring engine over paged KV cache +624 −0

Maximilian Böther · Switzerland · Ph.D. Student at ETH Zurich 0 PRs, 1 comment

1 comment on 1 thread
  • #4445 Improve levanter store/cache

Top 15 runs (by FLOPs) this week (completed, running, crashed)


The week was dominated by two parallel storylines on #4281: the preregistered 1e23 MoE burning down its 1024 v4 chips in the background, and a flood of agent-driven architecture ablations cashing in against the same baseline. The big run, moe_1e23_d5120_bs2048_ep8_ragged_48l tracked under #4697, accumulated 224 hours and 229k chip-hours this week — by itself ~87% of the week's HW FLOPs (2.23e23 of 2.57e23). It is the EP8 ragged relaunch off the v4-2048 known-good ring control; @dlwh's #4959 captured the relaunch artifacts, and #4964 threaded the ring/ragged dispatch choice through the grug MoE config and merged Thursday. As of week-end the run sat at train_loss 2.17 / Paloma macro 2.545 / uncheatable macro 2.211 on 355B tokens, MFU 16.4%, on track against the preregistered ~2.25 macro target.

Forecast bookkeeping for the 1e23. @Helw150 updated the #4697 forecast for Percy's ICLR talk using @eric-czech's VPNLS approach, registering a slightly more conservative 2.295 macro-loss prediction (vs. the original 2.25 from the pinned-asymptote three-point fit). @eric-czech cautioned that VPNLS came in too conservative on Delphi unless fit without the irreducible E term and recommended reporting three decimals of precision; @ClassicLarry noted the off-trend 1e22 datapoint in the side-by-side came from the prior WSD/less-tuned LR recipe and was not in the fit. Larry also flagged a grad-norm pathology in #moe: clipping at 1.0 is a no-op for the AdamH matrix path because output_proj sits in AdamH and takes a constant step size, so the clip currently only awkwardly downscales the embed/non-AdamH params; @Kaiyue-Wen pushed back that we essentially never trigger it, and that a same-direction downscale wouldn't matter for adam/muon either way.

The agent recipe ladder added 21 ablations and a 1.5x architecture jump. The experiments/grug/moe/README.md ladder Larry published last week ran red-hot: 21 closed agent-MoE issues this week (#4904, #4946, #4951, #4952, #4972, #4973, #4976, #4986, #4987, #4993, #5000, #5002, #5110, #5113, #5114, #5152, #4802, #4803, #4807, #4899, #4906) plus several still in flight, mostly authored by Claude through the agent.md protocol and adjudicated by Larry. The headline result is #4999: Combined Best stacks E=128 experts + PKO (every 4th) + partial RoPE (every layer) + last-layer forced long+PKO + cached attention on the last 3 layers. At all four gate-2 scales the macro loss drops 0.054-0.067, giving 1.21x effective speedup at d512 climbing to 1.51x at d1280. The pinned-exponent scaling-law projection puts this at -0.035 at 1e21 and -0.023 at 1e23; the free-exponent fit (which gives Combined a steeper -0.097 vs. -0.093 baseline) projects up to 2.09x speedup at the 1e23 scale. Larry called it out conservatively in #moe as a 52% speedup on the MoE recipe at 3e19 scales (the impact grows at each higher scale), and @Kaiyue-Wen replied "wow this is crazily good."
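For readers converting the loss deltas above into speedups: under a power-law fit L(C) = A·C^(-α), a loss drop at fixed compute is worth however much extra compute the baseline would need to match it. A sketch with placeholder numbers (the ladder's fitted α and loss values are not reproduced here):

```python
def effective_speedup(baseline_loss: float, delta: float, alpha: float) -> float:
    """With L(C) = A * C**(-alpha), matching a loss L takes C = (A / L)**(1 / alpha),
    so a drop of `delta` at fixed compute equals a compute multiplier of
    (L / (L - delta))**(1 / alpha) for the baseline."""
    return (baseline_loss / (baseline_loss - delta)) ** (1.0 / alpha)


# Illustrative only: a 0.06 macro-loss drop from a 0.60 baseline at alpha = 0.5
print(effective_speedup(0.60, 0.06, alpha=0.5))  # ~1.23x
```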

What earned its way into Combined Best vs. what didn't. The clearest standalone winners were #4951 PKO+partial-RoPE (1.09-1.23x, projecting -0.010 at 1e23) and #4946 partial-RoPE-every-layer (1.04-1.12x). The follow-on #5152 MHA + PKO (no GQA, partial key offset every 4th + last) hit 1.28-1.37x at gate 1, nearly additive with the standalone effects of MHA (#5151, 1.10-1.18x) and PKO alone, and prompted Larry to file #5154 — a barebones-transformer control to track down why PKO shows a 20% benefit on the MoE recipe but only 0-2% on the standard speedrun nanogpt/nanochat recipe, in case the tokenized dataset is doing some of the work. Negative or neutral: #4904 GatedNorm init scale (1.04x at gate 1 with the 2x setting, but not holding consistently past d1024), #4986 value embeddings (helps d768 but regresses d512; held pending GPU activation-memory cost), #4972 non-zero min LR ratio, #4981 constant final LR, #5000 Adam LR shift 0.7x/1.3x, and #5114 RoPE-before-QK-norm. #5110 block attention residuals (bs=4) closed Sunday with a 1.06-1.12x gate-2 pass.

MuonH probe on the MoE recipe. @pc0618's #5134 swaps the AdamH matrix/expert path to MuonH while pinning the AdamH-derived LR schedule, and tests whether MuonH benefits from a 2x batch — motivated by @Kaiyue-Wen's suggestion in #moe that Muon should have larger step-wise speedup compared to Adam with large batch size. The first launch hit a JAX ShardingTypeError in the MuonH update path on the (data, None, None) vs (expert, None, None) broadcast; pc patched the reshard back to parameter sharding and relaunched the same four-run gate-1 matrix. By week-end pc reported in #moe that there isn't any real difference between AdamH baseline, Muon AOL, MuonH base batch, and MuonH 2x batch; @Kaiyue-Wen noted the y-axis in the comparison was likely too coarse — the differences should be within 0.02 macro loss.

Curation extractor bake-off (998M, 2e+21 model FLOPs). @Michael-Ryan ran a four-way data-curation comparison on a fixed FM_natural recipe across three model geometries (d512/L6, d1536/L16/B1024, d1536/L16/B2048) and four extractor pipelines: resiliparse, llm_curated_bos_fixed, nemotron_full_bos_fixed, and dclm. At the 2e+21 d1536 scale, resiliparse finished at eval_bpb 0.954, edging llm_curated at 0.958; the 9e+20 d2432 nemotron_full run is plainly broken (eval_bpb 1.232, code-domain bpb 1.86 on github_cpp vs 1.22 for dclm) — a casualty of the #5034 BOS-token regression that flooded a swath of caches built after the early-April kitoken backend migration. @Michael-Ryan first surfaced the discrepancy in #infra; @ahmeda14960 independently caught it via paloma BPB drift between caches three weeks apart, and Russell scanned the tensorstore outputs and confirmed most training caches are safe — Ahmed's validation sets and runs that picked up the bos-fixed re-tokenization are the ones to watch (see #5149).

Data mixture proxy run for the swarm. @Calvin-Xu's baseline_stratified on Saturday was the first proxy run for the Many Domains swarm #2345: a 1.17B model on 24B tokens of Dolma 3 Pool / Dolmino, joint-fitting domain and quality buckets. Calvin posted updated joint-fit-on-mixture-and-scale plots in #2345 alongside the closeout of #2404 Many Domains One Phase, the RegMix/Olmo 3 replication. MFU was a brutal 8.3% — likely the small-batch v5-32 layout — but throughput was not the point.

Distillation pilot reasoning SFTs. @moojink ran four 8B Marin-base SFT distillation runs on OpenThoughts4 math at 32k context, all on v5-32 and all hitting ~48% MFU, against three different teachers — Qwen3-32B (exp3956w and the "pt3" variant exp3956w_pt3), Kimi K2.5 (exp3956k_pt3), and Qwen3-30B-A3B-Thinking (exp3956x_pt2) — feeding the #3956 teacher-pilot scaling curves and the closely related #2262 teacher comparison, whose updated TL;DR landed Wednesday: per-prompt response count beats teacher size, and Qwen3-32B remains the practical default. Final teacher-bpb came in lowest for Qwen3-30B-A3B-Thinking (0.191) and highest for Qwen3-32B-pt3 (0.276) — but as #2262's TL;DR notes, fit-to-teacher is not the same as downstream eval.

NemotronTerminal-32B SFT reproduction continues. #4307 has the Qwen3-32B SFT past step 500: @AlienKevin's full TB2 eval with Terminus-2 hit a 14.5% solve rate (12/83 completed) at n_concurrent=1, then improved to 17.2% (15/87) at n_concurrent=10 once the intermittent `AttributeError: coords` mesh error was worked around and the Daytona sandbox setup cost amortized; agent timeout (60+ tasks) remains the dominant failure mode. The checkpoint is at 8.7% of training; the paper target is 27.4%.

| Run | User | Hardware | Hours | Model FLOPs | HW FLOPs (MFU) | BPB |
|---|---|---|---|---|---|---|
| #4697 #4959 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_124933 | David Leo Wright Hall | TPU v4 (1024 chips) | 9.3d | 3.66e22 | 2.23e23 (16%) | 0.793 |
| curation-resiliparse-expFM_natural-2e+21-d1536-L16-B2048 | Michael Ryan | TPU v5 (256 chips) | 13.8h | 1.80e21 | 4.61e21 (39%) | 0.954 |
| curation-llm_curated_bos_fixed-expFM_natural-2e+21-d1536-L16-B2048 | Michael Ryan | TPU v5 (256 chips) | 13.5h | 1.80e21 | 4.58e21 (39%) | 0.958 |
| curation-llm_curated_bos_fixed-expFM_natural-9e+20-d512-L6-B4096 | Michael Ryan | TPU v5 (64 chips) | 1.4d | 9.00e20 | 3.30e21 (27%) | 1.201 |
| curation-resiliparse-expFM_natural-9e+20-d512-L6-B4096 | Michael Ryan | TPU v5 (64 chips) | 1.4d | 9.00e20 | 3.27e21 (28%) | 1.198 |
| pinlin_calvin_xu/data_mixture/ngd3d~5b98e67a/baseline_stratified | Calvin Xu | TPU v5 (32 chips) | 15.2h | 1.88e20 | 2.26e21 (8%) | 0.953 |
| curation-resiliparse-expFM_natural-9e+20-d1536-L16-B1024 | Michael Ryan | TPU v5 (128 chips) | 11.8h | 9.00e20 | 2.12e21 (43%) | 0.965 |
| curation-llm_curated_bos_fixed-expFM_natural-9e+20-d1536-L16-B1024 | Michael Ryan | TPU v5 (128 chips) | 11.7h | 9.00e20 | 2.11e21 (43%) | 0.973 |
| #5034 curation-nemotron_full_bos_fixed-expFM_natural-9e+20-d2432-L24-B256 | Michael Ryan | TPU v5 (64 chips) | 20.1h | 9.00e20 | 1.92e21 (47%) | 1.232 |
| curation-dclm-expFM_natural-9e+20-d2432-L24-B256 | Michael Ryan | TPU v5 (64 chips) | 20.2h | 9.00e20 | 1.91e21 (47%) | 1.038 |
| #3956 exp3956w_50k_marin_8b_base_ot4_math_qwen3_32b_32768tokens-56ae0d | Moo Jin Kim | TPU v5 (32 chips) | 1.5d | 7.94e20 | 1.67e21 (48%) | 0.135 |
| #3956 exp3956w_50k_pt3_marin_8b_base_ot4_math_qwen3_32b_32768tokens-9bd4a4 | Moo Jin Kim | TPU v5 (32 chips) | 1.6d | 7.94e20 | 1.66e21 (48%) | 0.276 |
| #3956 exp3956k_50k_pt2_marin_8b_base_ot4_math_kimi_k2pt5_32768tokens-6a2bc8 | Moo Jin Kim | TPU v5 (32 chips) | 1.6d | 7.94e20 | 1.66e21 (48%) | 0.276 |
| #3956 exp3956k_50k_pt3_marin_8b_base_ot4_math_kimi_k2pt5_32768tokens-6f6035 | Moo Jin Kim | TPU v5 (32 chips) | 1.6d | 7.94e20 | 1.66e21 (48%) | 0.424 |
| #3956 exp3956x_50k_pt2_marin_8b_base_ot4_math_qwen3_30b_a3b_thinking_3-ee0026 | Moo Jin Kim | TPU v5 (32 chips) | 1.5d | 7.94e20 | 1.66e21 (48%) | 0.191 |