The big news this week was that the 1e23 MoE preregistration run #4697 still hasn't crossed the finish line. Three attempts at moe_1e23_d5120_bs2048_ep8_ragged_48l together burned 319k chip-hours — 94.5% of the week's TPU spend and 96.5% of HW FLOPs — and all three crashed. The best on-record state is resume45207_clip15 at train_loss 2.1265 / paloma macro 2.4986 / uncheatable macro 2.169 on 487B tokens, still aimed at the preregistered 2.25 paloma macro target from the isoflop fit at #4447. The recurring crash signature was painstakingly bisected to wandb.init(resume="allow") on worker 0 creating host-state divergence that surfaces as a TPU launch-id mismatch one save-step after restart (#5319); fresh-id-per-attempt with resume="never" is the validated fix.
Around the still-running 1e23, the agent-driven ablation sweep produced its first cleanly promotable architecture change in #5184 — AdamH on the token-embedding table — passing Gate 2 across all four scales with a 1e23 paloma projection of 2.249 vs the 2.251 baseline, closing out the long-running embed-norm-growth investigation #4569. The JAX 0.9.2 upgrade rippled across the cluster: it broke the Triton ragged_dot kernel on H100 (fixed same day in #5347), exposed an NCCL alltoall pin in CoreWeave torch builds #5379, and dragged the GPU canary pass rate down to 60% — but it also unlocked a full Triton fwd+bwd MoE path measuring 1.91x faster than XLA on H100x8 in #5330, with production follow-up #5350 stacked and ready. The H100 track also got new structure: #5328 opens a SonicMoE-style local-compute epic and #5356 sets the June 16B-A2B GPU-host target.
The post-Ray-sunset shakeout dominated infra. #5212 lifted the log store and server out of the controller into a new lib/finelog package — three latent failure modes (path rewrites, container perms, log-pusher init order) surfaced and were patched within days. #5290 stood up the sibling stats_service and #5370 moves per-tick worker and per-attempt task time-series out of the controller's sqlite — directly closing last week's #5072 ask. Coscheduling preemption finally became real for multi-host TPU jobs in #5240; cross-region correctness fixes in #5223 and #5225 stop MARIN_PREFIX from leaking into executor hash IDs. On the data side, the canonical pipeline absorbed a wave of new sources and stood up a deliberate Zephyr-perf push — #5282's pluggable shard runner gives 17-26x test speedups — and @dlwh followed last week's gap report with the broader "mineshaft gap" perplexity sweep against Qwen3 and Llama. Synthetic data turned in mixed news: SWE-ZERO's swarm reached ~52% PR coverage and the 500K SFT was stuck for ~80 hours, but the TerminalCorpus Marin-32B midtrain #4760 exposed a real ~10x agentic-prior gap to Qwen3-32B that survived an EOS-token mismatch fix — a base-prior weakness that more SFT alone may not close.
Summary: Define canonical data pipelines for all data ingestion: download -> normalize -> dedup/quality -> tokenize.
With the testbed baseline landed last week, the canonical pipeline split this week into two parallel motions: source-coverage expansion on the front end and a Zephyr performance push on the back end. @Helw150 drove the source push, registering allenai/Molmo2-Cap as a text-rendering of the long-form video-captioning corpus released with Molmo 2 #5299 (104K videos rendered as a merged paragraph plus timestamp-tagged per-frame lines), GAIR/daVinci-Dev with both a ctx-native PR-row renderer and an env-native SWE-Agent trajectory renderer mirroring swe_rebench_openhands #5252, and nyuuzyou/svgfind's ~3.6M Creative Commons icons rendered as Title/Data Pack/Tags-prefixed SVG markup for SFT #5304. Three more sources sit open for review at week's end: #5300 wraps lambda/hermes-agent-reasoning-traces (14,701 multi-turn tool-calling trajectories), #5305 registers TeraflopAI/SEC-EDGAR (43.7B tokens, ~8M filings across 10 form types), and #5339 stages Amazon's MASSIVE multilingual tool-use dataset for 11.39B tokens with a per-locale fan-out zephyr pipeline. #5276 goes further afield with a SWE-rebench v2 ConTree tracing pipeline that runs Python test suites in Nebius sandboxes and writes annotated execution-trace rows for code-world modeling.
Running gated downloads at this volume immediately surfaced an auth footgun. #5280 reported that download_hf_step against a gated dataset silently retried forever when no HF_TOKEN reached the worker — Iris auto-injects from the submitter's os.getenv("HF_TOKEN"), but submitters who logged in via huggingface-cli login only have the token at ~/.cache/huggingface/token, so the mismatch produced no warning at submit time and no surfaced error during the run. #5281 closed it by classifying HfHubHTTPError with status 401/403 as non-retryable and raising a RuntimeError that distinguishes "no token" from "token lacks access," replacing the 20-attempt exponential-backoff loop with a fail-fast. The testbed itself sprouted three formal experiment arms on the issue tracker: #5308 (no-dedup baseline), #5309 (fuzzy-dedup with num_perms=286, num_bands=26, ngram_size=5, cc_max_iterations=10), and #5310 (negative-control duplication at 50%, keeping the first ceil((1 − dup_rate) · N) rows as the unique pool and replaying them) — the three legs that let every subsequent ranking-protocol comparison subtract against a fixed reference at the same compute-optimal point on Paloma macro and uncheatable_eval. Higher up the stack, @ihodes opened #5360 to scope a quality-and-dedup parameter-selection workstream against a mid-May launch, with dedup params and contamination detection as p0 and quality scores as p1.
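For the record, a minimal sketch of the fail-fast shape #5281 describes, where a 401/403 from the Hub becomes a hard error that says which of the two auth failures was hit; the helper name and wording here are illustrative, not the actual marin code:

```python
# Minimal sketch of the fail-fast classification described in #5281; the helper name
# and wording are illustrative, not the actual marin code.
import os

from huggingface_hub.utils import HfHubHTTPError


def raise_if_auth_error(err: HfHubHTTPError) -> None:
    """Turn gated-dataset auth failures into hard errors instead of endless retries."""
    status = err.response.status_code if err.response is not None else None
    if status not in (401, 403):
        raise err  # anything else stays on the normal retry path
    if not os.getenv("HF_TOKEN"):
        raise RuntimeError(
            "Gated dataset requires HF_TOKEN but none reached the worker. Note that "
            "`huggingface-cli login` only writes ~/.cache/huggingface/token; export "
            "HF_TOKEN before submitting so Iris can inject it."
        ) from err
    raise RuntimeError(
        "HF_TOKEN is present but lacks access to this gated dataset; request access "
        "on the Hub before retrying."
    ) from err
```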
The bigger story underneath was the launch of a deliberate Zephyr performance push. @wmoss's #5333 design proposal frames the goal bluntly: a standard run currently takes ~10 hours and an order-of-magnitude speedup would change what experimentation looks like. The plan is to capture and analyze the CPU profiles already taken every five minutes for every Iris job, ship the low-hanging-fruit optimizations alongside the tooling that validates them, and use the tooling to define the next round. #5352 tracks the work as a sub-epic, with #5353 the first concrete leaf — using multithreading on CPU-heavy stages to amortize library-import cost and assigning different worker types per stage. @rjpower's #5282 is the headline early win: shard execution becomes pluggable via a StageRunner protocol, with a new InlineRunner running shards in the worker actor's own process for LocalClient while distributed (Iris) clients keep the old SubprocessRunner for crash isolation. The two slow tests called out as motivation drop dramatically: test_connected_components_happy_path from 39.2s to 1.5s (~26×), test_fuzzy_dups_multi_source_per_source from 48.7s to 2.9s (~17×). #5311 follows with a regression guard that asserts subprocess parametrization is real — each shard records its PID, the test asserts ≤max_workers distinct PIDs across 5 shards, and the inline case is marked xfail(strict=True) so a silent fallback would flip to XPASS and fail the suite. #5265 migrates stale ctx.execute() callers to the ZephyrExecutionResult dataclass, and #5286 from @wmoss removes group_files from dedup_commons.py and pushes the sort into _collect_input_files. @ravwojdyla's #5348 playbook (a re-open of the corrupted #5199) wires the perf process to a fineweb_edu ferry running 10 times in a row to surface variance against Iris preemption.
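A simplified sketch of the pluggable-runner shape from #5282 (class and method names below are illustrative; the real zephyr protocol differs in detail):

```python
# Simplified sketch of pluggable shard execution (#5282). The real zephyr StageRunner
# protocol differs, but the split is the same: LocalClient runs shards inline in the
# worker's own process, while distributed clients keep a process-isolated runner so a
# crashing shard cannot take the worker down with it.
from concurrent.futures import ProcessPoolExecutor
from typing import Any, Callable, Iterable, Protocol


class StageRunner(Protocol):
    def run_shard(self, stage_fn: Callable[[Any], Any], shard: Iterable[Any]) -> list[Any]: ...


class InlineRunner:
    """Run the shard in-process: no fork, no re-import of heavy libraries per shard."""

    def run_shard(self, stage_fn, shard):
        return [stage_fn(item) for item in shard]


class IsolatedRunner:
    """Run the shard in a child process for crash isolation (stand-in for SubprocessRunner)."""

    def run_shard(self, stage_fn, shard):
        with ProcessPoolExecutor(max_workers=1) as pool:
            return list(pool.map(stage_fn, shard))
```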
Two memory-hardening fixes landed in parallel. #5231 from @ravwojdyla flips Levanter's _long_string_workaround on unconditionally so a 64M-character outlier never reaches the underlying Rust tokenizer as one giant string — the rewrite encodes per record so an outlier's pieces never coexist with the rest of the batch's encodings, sub-batches the per-outlier encode_batch calls in groups of 256, and accumulates ids in place via ids.extend(...) rather than building a fresh concatenated list. #5340 rebases @hsuhanooi's byte-budgeted scatter from last week's #5055 and lands it, but reviewing it surfaced #5344: under key skew on large-memory workers a single shard's buffer can grow to the full ~25%-of-cgroup global budget before the global gate fires, producing one chunk that the reducer later loads in full via fs.cat_file — a real OOM risk, with a writer-side hard cap and a reducer-side streaming read sketched as candidate fixes. Separately, #5334 documents a hard PyArrow limitation that bites the read path: PyArrow's parquet C++ reader has a ~8 MiB cap on the thrift page-header size, so a single value larger than ~8 MiB makes the writer's per-page column statistics overflow the cap and PyArrow refuses to decode the page header (OSError: Couldn't deserialize thrift: No more data to read.); DuckDB, arrow-rs, and arro3 read the same bytes correctly. #5335 proposes a DuckDB fallback in zephyr.readers.load_parquet while apache/arrow#47758's read-side max_page_header_size fix moves through upstream review.
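The proposed fallback in #5335 is easy to picture; a hedged sketch, with the function name and error matching being illustrative rather than the actual zephyr.readers code:

```python
# Hedged sketch of the DuckDB fallback proposed in #5335 for parquet files whose page
# headers overflow PyArrow's ~8 MiB thrift cap; names and error matching are illustrative.
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq


def load_parquet_with_fallback(path: str) -> pa.Table:
    try:
        return pq.read_table(path)
    except OSError as err:
        # PyArrow surfaces the overflow as "Couldn't deserialize thrift: No more data
        # to read." when a single >~8 MiB value inflates the per-page statistics.
        if "deserialize thrift" not in str(err):
            raise
        # DuckDB's parquet reader copes with the same bytes and returns Arrow directly.
        return duckdb.sql(f"SELECT * FROM read_parquet('{path}')").arrow()
```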
Summary: Improve Levanter's data store, fix K8s logging at scale, and address infrastructure gaps in the Iris dashboard and profiling.
The Neville/@ravwojdyla thread on dropping consolidate_shard_caches from tokenize moved from "merge it and see" to a dual stress test this week. Neville's #4814 replaces the producer-side consolidation step with an in-memory ShardedTreeCache that downstream readers see as a single virtual tree; rav was +1 to merge but then ran a real-world load and found that opening the sharded cache for nemotron_cc_v2/medium_quality (1113 shards) was slow — about five minutes between attempting to load the shard ledger and the first downstream read in the log. He pointed Neville at the same data tokenized two ways — sharded under marin-us-central1/data/datakit/tokenized/... and consolidated under marin-tmp-us-central1/ttl=7d/tokenize/... — for a head-to-head, noting that some of his ~100 tokenized datasets have shards in the thousands. @dlwh: Thanks for taking this on! I should have thought to do this.
Late on May 1 Neville kicked off two smoke training jobs from his PR branch — one against the sharded cache, one against the consolidated — and the comparison is pending.
The other Levanter-store work this week was robustness, not format. @rjpower opened #5329 after a W&B storage-quota exhaustion took down a large run, and his #5332 wraps W&B and Trackio in a new BackgroundTracker that serializes calls onto a daemon thread and catches/logs exceptions instead of letting them propagate, with CompositeTracker hardened so a failing member cannot drag the others down — auth/init failures still propagate so a misconfigured run refuses to start. @ahmeda14960's #5259 finished off era shuffle now that @dlwh's #5246 made BlockShuffleConfig(io_block_size=256, window_blocks=512, perm_type="feistel") the LM-data default — era's zero-cross-era mixing within an epoch was, per the issue, redundant and a footgun on temporally-segmented physical layouts, and block shuffle's global Feistel permutation strictly dominates it (a toy sketch of the Feistel idea follows this paragraph); #5303 swept the remaining stale shuffle references and migrated the OLMo config field. The temp-checkpoint roots from last week's #5066 merged on April 27 — and immediately surfaced a follow-on bug Ahmed filed as #5374: a v5p-256 Levanter job initialized from a cold mirror:// checkpoint in another region had every rank call latest_checkpoint_path() and eagerly stage the full TensorStore tree into the local Marin bucket, with rank 24 tripping rigging's 10 GB cross-region transfer budget and JAX distributed aborting the pod. The fix shape is either single-process staging or a TensorStore source URL it can read directly.
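To make the Feistel point concrete, a toy index permutation (not Levanter's implementation) showing why a keyed Feistel network gives a deterministic, invertible shuffle of block indices without materializing the full order:

```python
# Toy Feistel permutation over [0, n): deterministic, keyed, and O(1) memory, which is
# what lets block shuffle permute blocks globally without a materialized index array.
# This is an illustration, not Levanter's BlockShuffleConfig implementation.
import hashlib


def _round_fn(x: int, key: int, rnd: int, half_bits: int) -> int:
    digest = hashlib.blake2b(f"{key}:{rnd}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % (1 << half_bits)


def feistel_permute(i: int, n: int, key: int, rounds: int = 4) -> int:
    half_bits = (max(n - 1, 1).bit_length() + 1) // 2
    mask = (1 << half_bits) - 1
    while True:  # cycle-walk: re-apply until the image lands back inside [0, n)
        left, right = i >> half_bits, i & mask
        for rnd in range(rounds):
            left, right = right, left ^ _round_fn(right, key, rnd, half_bits)
        i = (left << half_bits) | right
        if i < n:
            return i


# Sanity check: the map is a bijection on [0, n).
assert sorted(feistel_permute(i, 1000, key=42) for i in range(1000)) == list(range(1000))
```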
On the cluster-observability and K8s side, @ravwojdyla opened #5198 as a forward-looking stand-up of fleet-wide observability for the TPU and CW GPU pools — bird's-eye health, hardware/fabric monitoring, topology-aware rollups, drill-down to logs and traces, a bad-node lifecycle (detect → quarantine → drain → return), and an explicit ask to resist squeezing the work into Iris and instead build above it. @hammer chimed in with the OpenObserve Parquet-on-object-store option and the Flare paper as worth a closer read; @rjpower: I agree better observability would be good, and splitting it from Iris is the right call.
Rav also opened #5175 against the brand-new tokenize perf-counter logs from #5063 — map-only stages were emitting per-second items=0 (0.0/s) heartbeats with nothing to report — and #5220 proposes a daily ops job that runs the cross-region egress-checking script and posts a Discord report with usage tagged to the responsible users, to catch egress blow-ups before they rack up charges. Inside the controller itself, the visible churn this week was Russell's series of restarts in #infra as the lifted-and-shifted finelog log server grew up: a permissions issue racked up 10M lines of pending logs, then a controller restart later in the week lost the log-server configuration outright before being restored a minute later.
The remaining surface is housekeeping that nonetheless touches a lot of files. @yonromai landed the served-model eval RFC #5285 — a ModelDeployment → ModelLauncher → RunningModel → eval adapter handoff so eval code stops instantiating vLLM, Iris, or Fray objects directly — followed by #5322 renaming the modules to drop the served_ prefix, @dlwh's #5325 repairing the import grouping ruff broke on main, and #5331 adding echo=true prompt-plus-completion logprobs to Levanter's OpenAI-compatible /v1/completions server so stock lm_eval local-completions scoring works against Levanter without coupling eval logic to model construction; @ihodes's #5368 defines done for that broader project as MMLU-SL-Verb-5shot and HumanEval-5shot of the 1e22 MoE running pre-emption-resilient on a v5p-8 via vLLM. @dlwh's #5314 dropped I001 from the ruff ignore list and swept 489 import-order violations to stop the churn from leaking into unrelated PRs; @rjpower's #4808 finished cluster C of the pyrefly tightening (1551→1203 diagnostic lines, 129→100 suppressed) and re-enabled unbound-name, not-iterable, bad-index, and bad-context-manager, picking up real bugs along the way. Nightshift, the new auto-CI auditor from #5294, started filing its own follow-ups — #5343 against the 40-second exp1457_multilingual_cpt_eval.py dry-run that re-walks the same DAG hundreds of times, and #5378 recording eight more slow tests that fall into already-tracked categories. @wmoss's #5289 bumped the Marin-tests timeout and parallelism to -n 4 as a stopgap and his #5194 sharded-Zephyr CI proposal was closed — the slowness turned out not to be consistent enough to chase yet — and @dlwh's #5293 replaced the multi-source fuzzy-dedup test's normalize+MinHash setup with a synthetic MinHashAttrData fixture, preserving cross-source regression coverage while skipping the slow end-to-end pipeline. Finally, @rjpower's #5288 opened the design discussion for moving Marin's GitHub Actions onto repo-owned Python workflow scripts so YAML stays a thin shell of triggers, runners, and matrices.
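To illustrate what the #5331 addition enables, a hedged example of scoring a fixed prompt against an OpenAI-compatible /v1/completions endpoint with echo'd logprobs (host, port, and model name are placeholders):

```python
# Hedged example of the echo'd-logprobs pattern #5331 adds for stock lm_eval
# local-completions scoring; endpoint and model name below are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "marin-dev",
        "prompt": "The capital of France is Paris.",
        "max_tokens": 0,   # score only, generate nothing
        "echo": True,      # return logprobs for the prompt tokens themselves
        "logprobs": 1,
    },
    timeout=60,
).json()

token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
# The first token has no conditioning context, so its entry is null in the OpenAI schema.
total = sum(lp for lp in token_logprobs if lp is not None)
print(f"sum log-prob over echoed tokens: {total:.3f}")
```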
Summary: Tracking issue for April MFU work. Tasks/goals:
The holding pattern broke decisively. The JAX 0.9.2 upgrade from #5278 regressed the existing Triton ragged_dot kernel — jax.experimental.pallas no longer exports load/store, and the GPU canary crashed at step 0 every retry, with the auto fallback failing to catch the AttributeError #5341. @yonromai turned the fix around the same day in #5347, replacing pl.load/pl.store with the new plgpu.load/store(ref.at[...]) API and broadening the auto-fallback exception list. The patched Triton path measures within rounding distance of the JAX 0.8.0 baseline (1.007× and 1.011× on the two probe shapes) and stays 24.31× and 13.83× faster than XLA for the same M=32768/E=64 ragged-dot calls on H100×8.
More consequentially, @yonromai's #5330 experiment validated the hypothesis from #4297 that JAX 0.9 would let the backward pass move onto Triton too. Raw autodiff through pallas_call still does not work on 0.9.2, but an explicit Tokamax-shaped custom-VJP — Triton kernel for dlhs with the grouped RHS transposed, separate ragged-contracting-dimension Triton kernel for drhs — does, and is materially faster: paired H100×8 Grug MoE microbench runs (tokens=4096, hidden=1024, intermediate=2048, 64 experts, top-4, bf16) dropped median steady latency from 12.69 ms on XLA to 6.61 ms on Triton fwd+bwd, a 1.91× speedup, with compile-plus-first-step shrinking from 8.4 s to 2.4 s. Expert-weight gradient diffs were exactly zero; input-gradient max abs diff was 3.7e-4 on the bf16 path. Production follow-up is draft #5350, stacked on #5347.
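The structural trick is worth sketching. This is not the #5330 code (the reference implementations below stand in for the real Triton kernels), but it shows how an explicit custom VJP hands both directions of the grouped matmul to hand-written kernels when raw autodiff through pallas_call isn't available:

```python
# Structural sketch, not the actual #5330 implementation: an explicit custom VJP routes
# the forward ragged dot, the dlhs pass (grouped RHS transposed), and the drhs pass
# (ragged contracting dimension) to dedicated kernels. The *_kernel functions below are
# plain reference implementations standing in for the Triton kernels.
import jax
import jax.numpy as jnp
import numpy as np


def _fwd_kernel(lhs, rhs, group_sizes):
    # lhs: [m, k], rhs: [num_groups, k, n], group_sizes: [num_groups] summing to m.
    return jax.lax.ragged_dot(lhs, rhs, group_sizes)


def _dlhs_kernel(g, rhs, group_sizes):
    # dlhs_g = g_g @ rhs_g^T: another ragged dot, with the grouped RHS transposed.
    return jax.lax.ragged_dot(g, jnp.swapaxes(rhs, -1, -2), group_sizes)


def _drhs_kernel(lhs, g, group_sizes):
    # drhs_g = lhs_g^T @ g_g: ragged along the contracting dimension. Reference version
    # via a one-hot group mask; the real thing is a dedicated Triton kernel.
    group_ids = jnp.repeat(jnp.arange(group_sizes.shape[0]), group_sizes,
                           total_repeat_length=lhs.shape[0])
    onehot = jax.nn.one_hot(group_ids, group_sizes.shape[0], dtype=lhs.dtype)
    return jnp.einsum("mg,mk,mn->gkn", onehot, lhs, g)


@jax.custom_vjp
def ragged_dot_fwd_bwd(lhs, rhs, group_sizes):
    return _fwd_kernel(lhs, rhs, group_sizes)


def _vjp_fwd(lhs, rhs, group_sizes):
    return _fwd_kernel(lhs, rhs, group_sizes), (lhs, rhs, group_sizes)


def _vjp_bwd(residuals, g):
    lhs, rhs, group_sizes = residuals
    dlhs = _dlhs_kernel(g, rhs, group_sizes)
    drhs = _drhs_kernel(lhs, g, group_sizes)
    # Integer-typed inputs take float0 cotangents rather than real gradients.
    dgroups = np.zeros(group_sizes.shape, dtype=jax.dtypes.float0)
    return dlhs, drhs, dgroups


ragged_dot_fwd_bwd.defvjp(_vjp_fwd, _vjp_bwd)
```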
The H100 track also got new structure. @dlwh opened #5328 as a fresh epic for making the local Grug MoE GMM/MLP path fast on H100 via SonicMoE-style ideas — gather-fused grouped GEMM, smarter activation-memory bookkeeping that avoids the large O(T·K·D) saved intermediates, w13/w2 layout work — explicitly scoped as the local-compute half of the stack, with the DeepEP-equivalent transport half deferred to a separate epic. @ihodes opened #5356 (train a June 16B-A2B MoE for ~1k steps on 2+ GPU hosts) and #5357 (get to Nemotron ±ε MFU on H100s) as the near-term goal posts. The first concrete blocker showed up almost immediately: #5377, an NCCL alltoall crash on a CoreWeave H100 node that initially looked like bad hardware but @yonromai root-caused to torch==2.10.0+cu128 pinning nvidia-nccl-cu12==2.27.5; bumping to torch==2.11.0 with nvidia-nccl-cu12==2.28.9 passes a manual 8×H100 lax.all_to_all probe, draft fix in #5379. In the moe Discord channel, @dlwh circulated the Megatron-Core MoE paper (arxiv.org/abs/2603.07685) with the read that the sharding and fusion takeaways “seem like they follow from first principles” and the hope that XLA will do most of the work; the older Grug-vs-Megatron head-to-head sub-issues #4311, #4312, and #4313 still have not seen a measurement update.
Summary: Split from #4266.
The headline this week was a process upgrade: rjpower opened #5210 as an RFC arguing that since agents make it cheap to start large changes, the team has been skipping design work and paying for it at code review with sprawling, near-sighted, novel-bug-shaped PRs. The proposal is deliberately lightweight — open an issue, paste a 1-pager under 500 words, ping Discord, begin work in parallel — with area owners expected to provide prompt feedback rather than gate. yonromai endorsed the framing ("reviewing the inputs of coding agents is more time efficient than their outputs"), hammer pointed at Compound Engineering / RPI as comparable structured workflows, and dlwh signed on with a caveat against over-meta-engineering. #5236 landed the same day, rewriting the design-doc skill as an interactive Frame → Research → Interrogate → Draft → Stress-test → Publish flow with .agents/projects/design-template.md as the fillable template. #5209 deleted the four GitHub issue templates so "New issue" goes straight to a blank form, moved the experiment template body into .agents/skills/agent-research/SKILL.md, and taught the Claude triage workflow to refuse vague issues until the author supplies reproduction steps or a definition of done.
The skill got a real-world stress test almost immediately. rjpower volunteered the in-flight stats service as the first worked example and shipped #5241 the next day — a typed, schema-registered stats_service co-hosted in finelog that replaces three uncoordinated places Iris emits operational stats and gives the dashboard worker pane a queryable history. #5243 followed with a second design pass on inverting executor-in-training-job launches to drop implicit cross-region egress, and #5285 from yonromai used the new template to land the served-model eval handoff (ModelDeployment → ModelLauncher → RunningModel → eval adapter). yonromai called the side-markdown-files pattern out as working well — "hyperlinking makes it easier for humans to gather context (and check data support)." ravwojdyla flagged a paper cut (the relative file refs in the PR body resolved to dead links instead of perm-links), which rjpower fixed in a follow-up skill update. #5229 wired up the supporting plumbing: scripts/ops/discord.py resolves webhooks from DISCORD_WEBHOOK_* env vars or gcloud secret manager so the same script announces new designs to internal-discuss and code-review from local shells and GH Actions alike — marin-bot used it this week to post both #5241 and #5243 into code-review for feedback.
The other half of the epic this week was the nightly autonomous-cleanup pipeline, which kept producing daily multi-cleanup PRs across all four library trees. #5201 deduped LogStore.append across the DuckDB and in-memory iris stores and collapsed dead HfHubHTTPError.status_code branches in _hf_should_retry; rjpower asked the bot to walk back an over-aggressive removal of a try/except around os.makedirs. #5234 used the canonical is_job_finished helper in IrisClient.terminate_prefix after finding the inlined terminal-states set was missing JOB_STATE_WORKER_FAILED. #5263 caught a real silent bug in the slice_cache HuggingFace README generator. #5307 rewrote the 20-line two-phase BundleStore._evict_if_needed_locked as a 5-line oldest-first loop and deleted a Wikipedia helper that hadn't been called since July 2024. #5342 hoisted a shared format_resources into iris/rpc/proto_utils.py after finding the bug-report path dropped disk entirely and rendered sub-1-GiB memory as 0 GiB, and excised dead control flow in levanter's BackgroundIterator.__next__. Five days of small wins, all by parallel scout agents in their own worktrees.
Summary: Split from #4266.
The 1e23 MoE preregistration run #4697 continued through the week without crossing the finish line. @dlwh stood up a higher-grad-clip relaunch (moe_1e23_d5120_bs2048_ep8_ragged_48l_resume45207_clip15) on 2026-04-27 and posted the only on-record status this week: global_step=45337, train/loss=2.1537, grad/norm/output_proj=0.7798, MFU 16.36%, no exploding-gradient signature in the resume — see the grad-clip-1.5 announcement in #moe and the tracking comment on #4697. The preregistered 2.25 paloma macro target from the isoflop fit in #4447 remains an open bet — final macro_loss against that target has not yet been called.
Around the still-running 1e23 run, the agent-driven ablation sweep produced its first cleanly promotable architecture change: AdamH on the token-embedding table. @ClassicLarry's #5184 took embed off Adam and onto AdamH at lr_mult=1.0, passed Gate 1 with a 1.05x speedup at d512/d768, and survived Gate 2 across all four scales (1.048 / 1.040 / 1.032 / 1.013) with scaling-law projections of 2.249 at 1e23 vs the 2.251 baseline. Larry's diagnosis: the baseline embed param-norm grows from 176 toward arbitrary scales while grad-norm collapses to 0.001 on the 1e23 run, whereas AdamH pins the embed norm at 176 and keeps grad-norm in the 0.04–0.12 range. The follow-up #5203 bumping embed init std to 1.0 (so the first RMSNorm is a no-op) ties on loss but produces vanishing grad-norms below 0.001, so default-init AdamH-embed wins. This closes out the long-running embed-norm-growth investigation #4569.
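Mechanically this is a per-parameter optimizer split; a minimal optax sketch (optax ships no AdamH, so the stand-in below only marks where the project's real transformation would plug in):

```python
# Minimal optax sketch of the #5184 change: Adam everywhere except the token-embedding
# table, which gets AdamH. Params are assumed to be a flat name -> array dict here.
import optax

adamh = optax.adam  # stand-in for the real AdamH gradient transformation


def label_params(params):
    return {name: ("embed" if "embed" in name else "default") for name in params}


optimizer = optax.multi_transform(
    {"default": optax.adam(3e-4), "embed": adamh(3e-4)},
    param_labels=label_params,
)
```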
Most of the rest of the agent-MoE sweep landed as negative results worth keeping. AdamH on the router weights #5211 failed Gate 1 outright; removing the MoE router z-loss #5214 passed Gate 1 but failed Gate 2 at d1024/d1280, with cosine-similarity heatmaps showing the no-z-loss router develops anti-correlated expert pairs while logits grow ~5x larger; LM-head init scale 2x/4x #5222 failed Gate 1; removing QK norm #5230 failed three of four scales. @Kaiyue-Wen's global-gradient-normalization variant #5182 (PR #5183) passed Gate 1 with ~1.04x but lost Gate 2 at the bigger scales — confirming the small-scale 4% speedup but ~4% slowdown at scale she flagged in #moe. The MHA-vs-GQA ablation #5151 showed the recipe is leaving ~15% on the table for the 4x KV-cache savings — explicitly a concession to be reclaimed later via MLA — and the barebones-transformer ablation #5154 attributes ~0.14 macro_loss to the combined MoE + GatedNorm + XSA stack.
On the LR-scaling side, @WhenWen's depth-MuP residual-scaling sweep #5178 finished d512/d768/d1024 all preferring the 1x LR multiplier, matching the existing v16 fit, but d1280 is currently centering around 0.5x–0.707x at the ~5k checkpoint — so it's a central-basin stability signal, not strong scale-invariance yet. @pc0618's Muon Vizier r2 search #5167 found a tiny d512 win for Muon (3.8073 vs AdamH 3.8104) that did not transfer to d768, with AdamH-2x-batch controls also underperforming AdamH baseline; the consensus in #moe is that 2x batch + heavy hyper tuning is needed before declaring on Muon, and a split gate/up + warmup follow-up search has been launched. Looking forward, the recipe-integration work for the June model has now been opened as #5358 (with combined-best variants like #5371 stacking PKO + AdamH-embed + k6e256s5), and @ClassicLarry shipped a multi-host checkpoint-resume fix in #5319 after the wider-expert k6e256s5 d1024 jobs hit a broadcast_one_to_all launch-group drift on resume.
Summary: Lower priority / slack-time workstream covering workqueue, dev-tpu replacement, and observability.
The lower-priority observability workstream was anything but lower-priority this week: the log-service/stats-service split that #5072 proposed last week landed end-to-end. @rjpower's #5212 lifted the log store and log server out of the iris controller into a new lib/finelog package, deliberately scoped as a forcing function for the service-extraction template before tackling stats. #5241 then proposed the sibling stats_service — typed schema-registered tables that callers write rows into and query with SQL, with the iris dashboard worker pane as the MVP consumer replacing the current sqlite read path. #5290 built the per-namespace DuckDB backend, Vue dashboard, and deploy plumbing on top of finelog, and #5370 is now the cutover: per-tick worker and per-attempt task resource time-series move out of the controller's sqlite into iris.worker / iris.task stats namespaces, dropping worker_resource_history / task_resource_history and the snapshot_* columns entirely. The controller stays canonical for liveness, scheduling, and roster; only measurements move.
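The shape of the service is simple to picture; a hedged DuckDB sketch, with table name, columns, and query being illustrative rather than the real iris.worker schema:

```python
# Hedged sketch of the stats_service idea from #5241/#5290: a schema-registered table
# that callers append rows to and the dashboard queries with SQL. Table and column
# names are illustrative, not the real iris.worker namespace.
import duckdb

con = duckdb.connect("stats.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS iris_worker (
        ts TIMESTAMP, worker_id VARCHAR, cpu_pct DOUBLE, mem_gib DOUBLE
    )
""")


def record_worker_tick(ts, worker_id, cpu_pct, mem_gib):
    con.execute("INSERT INTO iris_worker VALUES (?, ?, ?, ?)", [ts, worker_id, cpu_pct, mem_gib])


# Dashboard-side read: last hour of per-worker CPU, newest first.
rows = con.execute("""
    SELECT worker_id, ts, cpu_pct FROM iris_worker
    WHERE ts > now() - INTERVAL 1 HOUR
    ORDER BY ts DESC
""").fetchall()
```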
The package rename produced predictable fallout that @ravwojdyla chased down: #5271 added a server-side path rewrite from /iris.logging.LogService/* to /finelog.logging.LogService/* after pre-#5212 worker images started 404'ing into UNIMPLEMENTED errors that weren't retryable, so cached clients never recycled and buffers overflowed; #5297 fixed /var/cache/finelog permissions on marin-dev; and #5262 hoisted LogPusher construction in Worker.start() above adopt_running_containers() so adopted attempts after a restart-worker capture a live pusher instead of silently None. Russell Power narrated some of the migration drama in #infra in real time — "the poor log server had a permissions issue, racked up 10M lines of pending logs and got sad" — and #5321 committed the log server entry that was missing from a previous deploy, the proximate cause of disappearing logs during a ~5-minute window.
Around the cutover, a string of dashboard and resilience polish: #5336 added a generic /proxy/<name>/<sub-path> endpoint on the controller so per-task Ray / JAX / TensorBoard dashboards are reachable through controller auth without each user touching worker IPs, and #5349 immediately followed when the proxy started leaking 10.x bind addresses to browsers behind GCP IAP through three independent sources of bad URLs (uvicorn forwarded-header handling, Starlette slash redirects, upstream-emitted Location headers — all three patched). #5332 wraps W&B and Trackio in a BackgroundTracker that serialises calls onto a daemon thread and catches exceptions instead of propagating, so long runs survive transient quota/network/5xx hiccups. #5207 closed out a regression from last week's RPC stats redesign — the global slow_samples and discovery_samples deques were aging quiet-method errors out under chatty traffic, now keyed per-method. #5187 added markdown status text to the tasks UI driven by Zephyr's items_count / bytes_processed counters, #5186 made the dashboard sparklines actually fill their grid cells, and #5284 added scripts/job_profile_summary.py for offline CPU-profile inspection — downloads the latest controller checkpoint, joins task_profiles with descendant tasks, normalises py-spy noise across siblings, and prints per-task and top-leaf summary tables plus a flamegraph SVG. The auto-mapped #5337 threads per-dataset runtime and token metrics through model-perplexity summaries and prevents eval batches from crossing dataset boundaries, so eval timing numbers are actually attributable.
Summary: Measurable: canary ferry pass rate consistently above 90%.
The TPU ferry stayed perfect at 9-for-9, up from last week's 7-for-7, but the GPU canary regressed sharply: 6 successes against 4 failures and 10 cancelled reruns, or 60% pass on completed runs versus last week's 87%. The arithmetic understates the calendar — Apr 26 through Apr 30 went five-for-five clean, then May 1–2 turned into a debugging marathon. The trigger was the JAX 0.9.2 upgrade exposing an NCCL all-to-all crash on CoreWeave H100, captured as #5377 by claude after four identical Iris retries pinned to node g13c908. Initial diagnosis suggested the node was bad, but @yonromai traced it to nvidia-nccl-cu12 being held at 2.27.5 by torch==2.10.0+cu128; #5379 bumped torch to 2.11.0 to pull NCCL 2.28.9 into the lock and merged Saturday afternoon, validated against an 8×H100 manual probe.
Fixing NCCL only got the ferry past the all-to-all and into Grug training, where it then OOMed on a separate post-NCCL diagnostics path. @yonromai's stacked #5380, still draft at week's end, disables the JAX profiler and Levanter watch-stat logging by default for GPU canaries; isolated 8×H100 probes confirmed the bare canary train step doesn't OOM, only the optional diagnostics around it do. The full CW workflow_dispatch validation at 15:45Z May 2 finally passed cleanly to 100/100 steps with final loss 6.53. @rjpower also filed #5382 documenting a separate slow-startup issue surfaced during the same debugging: cold tensorstore reads of zarr3-sharded SlimPajama tokens from R2 take 30+ minutes on first batch because the default prefetch_size=32 issues ~8192 concurrent reads against a high-latency object store. That one's pre-existing on main.
The datakit smoke ferry slipped to 5-for-7, ~71%, down from last week's 86%. Both failures are post-merge fallout from ravwojdyla-agent's bucket consolidation #5266, which folded marin-tmp-* into per-region marin-{region}/tmp on Apr 29. Issue #5376 captures the May 2 break: an Iris CPU job landed in us-west1-a, one of the zones in the cpu_vm_e2_highmem_2_ondemand scale group, where marin_temp_bucket() resolved to gs://marin-us-west1/tmp — but us-west1 was never in REGION_TO_DATA_BUCKET, so the bucket doesn't exist and the first _write_executor_info 404'd. Fix is either to drop us-west1-a from the CPU scale group or add the region to the bucket map; neither has landed, so datakit will keep flapping until one does.
Summary: All jobs run through Fray+Iris.
The week after the Ray sunset was a long shakeout of the logging plane. #5212 lifted the log store and log server out of the Iris controller into a standalone lib/finelog package — wire-compatible with the old iris.logging protos via a transcoding shim, with workers resolving the server through a generic cluster_config.endpoints map. Standing it up exposed three latent failure modes in quick succession. Pre-#5212 worker images kept POSTing to /iris.logging.LogService/* after the rename, got 404 → UNIMPLEMENTED, and because that error wasn't classified as retryable the cached client never recycled and the buffer overflowed; #5271 added a server-side path rewrite to keep old workers alive (closing #5268). Fresh GCE finelog VMs left /var/cache/finelog owned by root so the in-container finelog uid 1000 couldn't write — the existing prod boxes had only worked by coincidence (#5296, fixed in #5297). And after a controller restart, adopted containers came up before the worker's LogPusher existed, so every line silently fell into the if not self._log_pusher: return short-circuit forever — #5262 hoists pusher construction above adoption, with a regression test #5261. @rjpower summed the week up bluntly on infra: "the poor log server had a permissions issue, racked up 10M lines of pending logs and got sad." A missing log-server endpoint commit caused another five-minute outage along the way #5321, and #5291 unbroke the finelog Docker build by adding config/ back to the build context.
Coscheduling preemption finally became real for multi-host TPU jobs. The old short-circuit at if victim.is_coscheduled: continue meant interactive priority couldn't evict batch on any v5p-N or v4-N pod — i.e. exactly where preemption mattered (#5237, filed by @ahmeda14960 with a concrete repro). #5240 closed it: a higher-band coscheduled preemptor can now evict an entire victim slice atomically when device-variant matches and the slice is at least as large as the request, with solo preemption now also gated on matching device-variant so a v5p-256 ask can't reclaim a v5p-8 slot. #5249 mopped up two adjacent bugs that #5240 made visible: a coscheduled task hitting transient failure was returning to PENDING alone while siblings stayed RUNNING, so the retry could land on a different tpu-name and SPMD collectives would hang; and cancel_job left task_attempts active forever, making the dashboard report killed tasks as still occupying their old workers. Migrations 0038/0039 healed existing orphans. Separately, @RohithKuditipudi hit single-VM coscheduling head-on while running inference shards across v5p-8 workers — fray was emitting group_by="tpu-name" for any multi-replica TPU job, which is unschedulable when each replica has its own tpu-name #5219; #5226 mirrored the Iris CLI contract and only coschedules when vm_count > 1. @rjpower also opened #5258 on a related sharp edge — Iris reuses TPU hosts within seconds of evicting the prior tenant, but the JAX coordinator PollForError masks the libtpu busy string the existing tpu_health detector keys on, so /dev/vfio-busy hosts don't get recycled and the next job hangs.
The other big thread was cross-region correctness. @ahmeda14960 lost a Delphi 1e20 midtraining run when Iris rescheduled it from us-central1 to us-east5: MirrorFS rewrote bucket prefixes into the executor identity hash, the run id flipped, and the checkpoint didn't resume — "the executor framework is starting to be a footgun." #5223 stops region prefixes from leaking into hashes by storing relative paths in hash_attrs and adds a regression test that asserts hash_id/name_with_hash stability across MARIN_PREFIX while output paths still differ correctly #5216. #5225 wires tensorstore checkpoint I/O — which bypasses fsspec — into the cross-region transfer budget by adding record_transfer() calls keyed off estimated array byte counts. The longer-term fix is in design: #5279 proposes moving the executor's DAG walk out of the launch entrypoint and into the training worker so a job preempted in us-central1 and rescheduled in us-east5 re-tokenizes locally instead of paying egress, with a sibling #5218 teaching MirrorFS to discover temp checkpoints across regions #5217. ravwojdyla-agent retired the parallel marin-tmp-* bucket family in #5266, folding TTL-prefixed paths into the canonical regional buckets so MirrorFS only has one namespace to scan. And @rjpower's #5273 design proposes a GAR-backed PyPI pull-through mirror to keep Iris task installs alive through pypi.org / github.com flakiness, with marin-dupekit published to PyPI to unblock the one wheel uv's find-links path can't proxy.
Iris's operational surface kept getting polished. @wmoss landed the Zephyr→Iris status pipeline #5187 — markdown status text on tasks, populated from Zephyr's items_count/bytes_processed counters — and then #5283 split that into status_text_summary_md for a new Status column on the job task table and status_text_detail_md for the task page; #5176 stopped map-only Zephyr stages from spamming items=0 (0.0/s) log lines. claude's #5207 made RPC stats sample rings per-method so errors on quiet methods don't age out behind chatty ones #5206, and #5384 server-side-paginates ListWorkers via a new WorkerQuery so the Fleet tab stops rebuilding every WorkerHealthStatus on every refresh #5383. @rjpower's endpoint-proxy stack — #5336 for /proxy/<name>/<sub-path> on the controller dashboard, then #5349 for three independent sources of internal-IP leakage behind GCP IAP — finally lets per-task Ray/JAX/TensorBoard dashboards live behind controller auth without each user reaching a worker IP directly. #5345 tolerates non-Python children in py-spy dump so thread dumps work for training jobs with a wandb-core child, #5327 adds a per-scale-group cache_dir override and a saner dual disk health threshold, and #5284 introduces an offline CPU profile inspector that joins task_profiles with task descendants and emits a flamegraph. @dhidary's #5150 unblocks TRC-grant and security-locked GCP orgs by routing controller and SSH access through IAP tunnels when external IPs are forbidden. @yonromai closed out the last Ray-era doc references in #5202 #5029 and fixed the terminal-status self-race that leaked LeaseLostError from successful steps in #5208 #5026. @Helw150's #5295 — preemptible CPU children couldn't land anywhere with massive idle CPU sitting on TPU host VMs — lit up another scheduling gap to fix on the roadmap, alongside ravwojdyla-agent's #5270 drain/cordon RPC and @ihodes's #5369 infra tune-up tracker.
Summary: After the Marin x00B MoE models are pretrained, the next step is to mid-train/post-train the model using high-quality datasets targeting different capabilities, such as math/code/science reasoning and agentic tasks (e.g. coding). Many such datasets that are open-sourced are generated...
SWE-ZERO generation absorbed most of the week. #4898's 500K-trajectory SFT — which the prior week left mid-flight as the next data point on the dose-response curve — never produced a real training step: @AlienKevin burned through v5–v10 of the launcher fighting region pinning, v5p-16 capacity that simply was not available cluster-wide, then a v5p-8 fallback that hung on Levanter dataset-cache build for 3+ hours before the executor died. As of week-end the run has been listed as down for ~80 hours with .executor_status=FAILED at step-38 (init only), so the 500K dose-response point that follows last week's 0.0% → 3.3% → 4.0% → 5.3% SWE-bench Verified curve is still missing.
The 140B-token generation pipeline for #4719 got rebuilt twice. Mid-week @AlienKevin migrated from a 13-batch shard model to a Michael-Ryan-style swarm with hard us-east5 region pinning and 1260 finer claims (~94 PRs each) instead of 126 coarse ones; a sampled audit of the swarm output against the legacy run found the two were statistically indistinguishable (1.7% vs 1.8% Submitted rate, within-PR Jaccard medians 0.27–0.30, 100% exact-distinct), so legacy and swarm rollouts can be mixed. The pipeline then crossed the 140B raw-token bucket-size target on May 1 — at which point AlienKevin caught a measurement bug and reverted: the real target is 12.6M rollouts (100 × 126K PRs), not bucket bytes, and zero shards had _done markers. Recalibrating against the right metric, the run ended the week at ~6.50M rollouts and 64.1K PRs at the 100-rollout target, i.e. ~52% rollout coverage and ~51% PR coverage. Bad-shard recovery worked well (210 → 1 over the week as workers revisited under-filled claims), and the swarm Submitted rate held steady at ~4.0%, roughly 2× the legacy 1.7%. Throughput collapsed near week-end as us-east5 v6e-4 capacity dried up — fleet flickered between 1 and 30 workers, and the most recent audits showed only ~28K rollouts/hr, prompting AlienKevin to recommend stopping at ~64K PRs and triggering Phase 2 (Qwen3-0.6B SFT + 4-checkpoint evals on what's already there) rather than waiting 3–4 more days for marginal data.
The week's most pointed quality signal came from a sibling experiment in the post-training neighborhood. #4760 midtrains Marin-32B on 15% of Nemotron-Terminal-Corpus to compare against Qwen3-32B SFT at the same training fraction. Early Terminal-Bench 2 evals had Marin solving 0/76 with 70% output-length-exceeded errors; @AlienKevin traced this to a save/serve mismatch — the chat template trained the model to emit <|eot_id|> (128009) while the saved config.json carried eos_token_id=128001 inherited from marin-32b-base and no generation_config.json was written, so vLLM never stopped on the trained terminator and the model degenerated into zorazora… repetition until it hit the output budget. The fix landed as patched checkpoints, a MARIN_GENERATION_CONFIG constant in experiments/marin_models.py, end-to-end plumbing of generation_config through Levanter's HFCheckpointConverter and the SFT pipeline, and a defensive eval-launcher validator that fails fast on EOS-mismatched checkpoints. With EOS fixed, the apples-to-apples re-eval (temp=0.6, top_k=20, top_p=0.95) put Marin-32B SFT step-858 at 2/87 (2.3%) against Qwen3-32B SFT step-1000 at 18/86 (20.9%) on comparable ~110K vs ~128K examples seen — a real ~10× gap. Drilling in, both Marin solves (modernize-scientific-stack, sqlite-with-gcov) are spec-driven recipe tasks that Qwen3 also gets at half its training budget; the 10 tasks Qwen3 unlocks between step-500 and step-1000 (Coq proofs, async-task fixes, building 30-year-old C, etc.) are exactly the ones requiring agentic exploration that Marin still misses. The trace dataset went up at AlienKevin/terminal-bench-2-sft-traces for direct inspection.
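The defensive check is the reusable part; a hedged sketch of a fail-fast EOS validator (terminator string and checkpoint path are illustrative, and it assumes a generation_config.json is present, which is exactly what the fix writes):

```python
# Hedged sketch of a fail-fast EOS-mismatch check like the eval-launcher validator
# described above; the expected terminator and checkpoint path are illustrative.
from transformers import AutoTokenizer, GenerationConfig


def validate_eos(checkpoint: str, expected_terminator: str = "<|eot_id|>") -> None:
    tok = AutoTokenizer.from_pretrained(checkpoint)
    gen = GenerationConfig.from_pretrained(checkpoint)  # assumes generation_config.json exists
    trained_eos = tok.convert_tokens_to_ids(expected_terminator)
    configured = gen.eos_token_id
    configured_ids = configured if isinstance(configured, list) else [configured]
    if trained_eos not in configured_ids:
        raise ValueError(
            f"EOS mismatch: chat template terminates with {expected_terminator} "
            f"(id {trained_eos}) but generation config stops on {configured_ids}; "
            "the server will never stop on the trained terminator."
        )
```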
The reading is consistent with what @dlwh opened the week with in #evals: Marin-8B and -32B base models look fine on prose-heavy held-out loss but are materially weaker than Qwen3 on patches, tool-call observations, and code-world-modeling slices, with patch-gain BPB of -1.43 vs Qwen3's +0.16. Synthetic data is the answer the epic is supposed to provide; the pieces in flight (SWE-ZERO trajectories, the planned Qwen3-0.6B SFT smoke test, the TerminalCorpus midtrain) are the channels for delivering it. The week's net is that the upstream data tap is now flowing reliably at half-target, the post-training validation on Marin-32B has surfaced a base-prior weakness that more SFT alone may not close, and the headline 500K dose-response point on Marin-8B is still owed.
Summary: We will need 20T of high-quality (including / in particular code) tokens for our large MoE runs in Q2/Q3; this is the work in March that we will do to enable that.
The week pivoted from diagnosing the bits-per-byte gap to actively widening the sourcing tranche. @dlwh followed last week's gap report with a much broader perplexity sweep against Qwen 3 and Llama, posted as the "mineshaft gap" analysis under the confidence-portfolio epic #5005. The sweep anchors on Paloma and Uncheatable Eval and adds long-tail surfaces — FineWeb2 multilingual, raw Common Crawl WARC/WAT, SVG-Stack, GitTables, Web Data Commons, npm registry metadata, UniProt/RNAcentral/PubChem/ChEMBL, plus formal text like SMT-LIB, CoqGym, VerilogEval and PGN — measured in bits per byte after a token-realignment step that lets Qwen's tokenizer be compared cross-family. Marin remains broadly fine on edited English prose but is "very bad" on non-English text, "messy" surfaces (raw HTML, WARC, SVG, package metadata), code, scientific notation, and structured tables; @dlwh also published a sheet of tokens not covered by either the new or existing eval sets, calling out missing C#, Java, and Objective-C coverage and long-tail multilingual as the next gaps to fill.
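Bits per byte is what makes the cross-family comparison legitimate: summed token negative log-likelihood is normalized by UTF-8 bytes rather than tokens, so tokenizer choice drops out of the denominator. For the record:

```python
import math


def bits_per_byte(sum_nll_nats: float, num_utf8_bytes: int) -> float:
    # Normalize summed negative log-likelihood (nats) by byte count so models with
    # different tokenizers are compared on the same denominator.
    return sum_nll_nats / (math.log(2) * num_utf8_bytes)
```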
On the ingestion side, @Helw150 finally landed #4326 after @ravwojdyla rebased it onto the new datakit layout — a download-and-filter pipeline for HPLT v3.0 English that keeps only non-Common-Crawl sources (WIDE and survey crawls, ~450B unique tokens) and applies register-based quality filtering with Haiku as the ground-truth classifier (~99% agreement on machine-translation score, web-register, and PII). The companion #4328, which would have added ~14T tokens of Common Pile, FinePDFs, FineTranslations, NuminaMath, and Institutional Books and centralized download definitions under marin.datakit.download, was closed as superseded by follow-on PRs but most of its surface — Common Pile's 30+ subsets, the StackV2 code-extension filter, the BHL page-to-book stitcher — has already shown up in the token-count snapshot. New tranches landing this week include 57.6B tokens of davinci-dev/ctx-native and 2.6B of davinci-dev/env-native synthetic code, an 8.9B SVG corpus that maps directly onto one of the gap categories, and smaller agent-reasoning and Molmo caption sets. Several agent-filed standalone ingestion tasks — public diagnostic logs #5094 and GH Archive JSON events #5099 — got their first concrete framings; on #5099 @dlwh separated the tiny held-out eval slice in #5119 from the real training-scale ingestion path, calling for explicit train/dev/test date splits and token-count estimates so the mixture weight can be set on real numbers.
The downstream side of the pipeline — turning these tranches into a production mixture — got a sober status report from @Calvin-Xu. After two weeks of joint mixture-and-scale modeling, he settled on an anchor-mixture regression with N, D terms that reduces to Chinchilla Approach 3 at fixed mixture and to a pure mixture regression at the 300M/6B anchor, fitted at α=0.155, β=0.146. The form does not give convincing predicted optima when directly optimized, so as a de-risk he validated the mixture optimized at 60M/1.2B at every scale and showed the perplexity advantage remains stable. The catch, posted on #2345 and #2404: the loss advantage on Uncheatable Eval does not translate strongly to the benchmark/task scores Marin actually cares about. The single-phase ablation of the GRP form on #2404 beat the swarm mean (1.059 vs 1.115 BPB) but still trailed the original two-phase optimum (1.028). Discussion in the data-mixing channel converged on a fix: stop optimizing for Uncheatable alone and add Paloma plus the MMLU-train and HumanEval/GSM8K canonical-solution BPB. @Helw150 shared a parallel attempt on Calvin's functional form, decomposing many tasks into core capability axes via Item Response Theory (inspired by arXiv:2503.13335); the resulting mixture looks sane and edges out proportional, with the next step being to drop individual datasets from the IRT metric and watch the mix respond.
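One consistent way to write the form he describes (my rendering, not necessarily Calvin's exact parameterization), with mixture weights $w$:

$$
\hat{L}(N, D, w) \;=\; E(w) + \frac{A(w)}{N^{\alpha}} + \frac{B(w)}{D^{\beta}}, \qquad \alpha = 0.155,\ \beta = 0.146,
$$

which collapses to Chinchilla Approach 3 when $w$ is held fixed, and to a pure mixture regression at the 300M/6B anchor where the $N$ and $D$ terms become constants.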
The May target then got formalized: @Helw150 filed #5359 requiring an active swarm over all sources in datakit/sources.py, de-risked on the existing #2345 swarm against UncheatableEval, HumanEval, MMLU, GPQA, and David's PPL sets, with a checked-in mixture file as the definition of done — must beat (or match) both proportional mixing across all sources and proportional mixing over a hand-picked high-quality subset. Sub-issues split out the open questions: SNR-clean metric selection on #5362, quality-and-topic bucket finalization on #5363 (gated on Rafal's quality classifier), and a 1e23-FLOP production swarm spec on #5364 that #5365 will then launch — all aimed at locking the mix by mid-June. On a parallel research track, @XenonMolecule posted week-of progress on #2351 (small model that converts raw WARCs into training tokens): on a fixed 3000-WARC budget Qwen3-8B extraction yields 56.0B tokens versus Resiliparse's 142.7B and DCLM's 2.66B, putting the LLM extractor in a regime where it beats high-quality filtering at scale and stays slightly above Resiliparse on quality, with a scaling-law fit pending. @hammer also weighed in on #5197 with a careful literature pass on metadata conditioning (MeCo and follow-ups), flagging the "fine-grained beats coarse-grained" finding from "URLs Help, Topics Guide" and the open question of whether MeCo's 90/10 prepended-vs-cooldown schedule beats uniform interleaving — a likely interaction with the Luxical embedding work on #3049 and the topic-bucket plan on #5363.
Routine fixes and maintenance, no broader threads to surface this week.
External GitHub activity centered on @wmoss, who put up roughly a dozen Iris and Zephyr PRs — task-status surfacing in the job UI #5283 and #5187, an in-flight stats-to-sqlite stack #5141/#5143/#5144, and a Zephyr performance design proposal #5333 spawning #5352 and #5353. @gonzalobenegas kept the DNA gLM thread moving with two enhancer-curation experiments #5242 and #5355; @dhidary landed IAP-tunneling for internal-IP Iris in #5150; @RohithKuditipudi filed the fray TPU coscheduling fix #5219; @leloykun opened the MuonHT-with-tangent-constraints design #2434; and @moojink staged OpenThoughts3 SFT baselines in #2199.
On Discord, Furkan picked up @dlwh's evals-gap framing and pushed it toward biology in a Bio Evals thread, asking for the LLM eval setup for proteins, molecules and RNA; dlwh pointed at the notation-prediction work in #5213 and Furkan flagged broken references on the spot, offering codex-time to help finish the framework. In a long infra thread Ahmed M Ahmed walked Russell Power through the interactive-vs-batch v5p-32 preemption case while Russell already had #5240 in flight enabling exact-slice preemption. In data-mixing, dlwh sharpened the swarm-vs-eval debate with "i am very against only picking datasets that correlate with downstream evals, but i am very much in favor of including datasets that correlate," teeing up @Helw150's IRT-decomposed extension of Calvin's functional form later that week.
Eleven new members arrived self-describing as a DeepMind research scientist on LLM-for-coding and agentic RL, an EPFL/pleias AI scientist on HPC and synthetic-data pipelines, a Göttingen Data-Infrastructure HPC + LLM-pretraining engineer, a mathlib-contributing geometric-group-theory PhD, an Apart Research director moving from physics into capability evaluations, a Futarchy Labs founder building forecasting markets, plus arrivals on ranking/retrieval, scaling laws, and end-of-life VLMs. Anjiang Wei's agentic-RL / code-RL work intersects with the SWE-ZERO and agent-trace BPB threads #4963 and #3093; Kelvin Santos (kas) proposed a forecasting tool to predict loss ahead of training runs — a fit for the preregistration machinery around #4697 and #4447. Among silent joiners, André Martins (IST Lisbon / Unbabel) brings context on sparse-attention and entmax-style routing, relevant as the agent-driven MoE sweep #5184 probes routing and lm-head behavior.
Reading skewed toward data quality and the engineering of agentic SFT / RL — AllenAI's olmpool, the SWE-chat and AgentTrove/TaskTrove rollouts, Tilde's Nitrobrew distillation post, the Megatron-Core MoE systems paper, and Rishabh Agarwal's ICLR RL-scaling talk flagged for trainer/inference numerics mismatches on MoEs.
Compute this week was almost entirely the preregistered 1e23 MoE: three attempts at moe_1e23_d5120_bs2048_ep8_ragged_48l #4697 on 1024 v4 chips burned 319k chip-hours — 94.5% of the week's TPU spend and 96.5% of HW FLOPs — and all three crashed. The original target from the isoflop fit at #4447 was 2.25 paloma macro; best result so far is resume45207_clip15 at train_loss 2.1265 / paloma macro 2.4986 / uncheatable macro 2.169 on 487B tokens (16.4% MFU), with the longer-running rayuvtpu_20260417 trailing at 2.5394 macro / 2.2024 uncheatable on 379B tokens after 240 hours.
The crashes were the story. @ClassicLarry spent the week bisecting a multi-host TPU resume failure #5319 that consistently halted ~one save-step after restart with the error "An unexpected peer shows up in the launch group." After ruling out asymmetric pre_jit transfers, XLA non-determinism, _last_temporary_checkpoint divergence, OCDBT/tensorstore reuse, and load_path==save_path, the culprit was finally pinned to wandb.init(resume="allow") on worker 0: the HTTPS resume-fetch creates per-worker host-side state divergence (sockets, asyncio, libtpu internals) that surfaces as a TPU launch-id mismatch at the first post-resume collective. WANDB_MODE=offline on all workers ran cleanly past the bug point; the validated fix is fresh-id-per-attempt with resume="never". @dlwh's RCA on the production resume confirmed the same signature, and #5386 documents a separate operational miss — sparse permanent checkpoints meant recovery rolled back ~8k steps to step-50000, the only confirmed durable fallback.
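A minimal sketch of the fix shape (the run-id scheme and grouping below are illustrative): only worker 0 touches W&B, and every attempt starts a fresh run rather than resuming over HTTPS:

```python
# Minimal sketch of the fix shape validated in #5319: worker 0 only, fresh run id per
# attempt, resume="never", so no worker performs the HTTPS resume-fetch that was
# desynchronizing host state before the first post-resume collective. The id/group
# naming below is illustrative.
import uuid

import jax
import wandb


def init_tracker(run_group: str):
    if jax.process_index() != 0:
        return None
    return wandb.init(
        id=f"{run_group}-{uuid.uuid4().hex[:8]}",  # fresh id for every attempt
        group=run_group,                           # attempts stay grouped in the UI
        resume="never",
    )
```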
The other story was @XenonMolecule's WARC-budget scaling sweep for #2351, which placed seven of the top fifteen runs. At the 1B-param / 9e20-FLOP isoflop point on 150B tokens, resiliparse extraction edged the LLM-curated pipeline on paloma macro: resiliparse_1000-d1536 finished at 3.0496 vs llm_curated_1000-d1536 at 3.0634 (uncheatable 2.8484 vs 2.8783). Pushing to 2.9B params on resiliparse FineMath at the same 9e20 budget — resiliparse-expFM-d2432 — knocked paloma macro down to 2.971 / uncheatable 2.7577. The 2e21 FLOP follow-up at d2432 (curation-llm_curated_bos_fixed-expFM-2e21) is still running at 3.1871 paloma macro / 48.4% MFU on 79B tokens of 16-chip v5.
Two more notable runs round out the board. @MooJinKim's e3956np_m32b_kimi_50k SFT of Marin-32B-base on TerminalCorpus #4760 is running on 32 v5 chips at train_loss 0.33 / 45.2% MFU — but @AlienKevin's parallel TB2 evaluation of an earlier checkpoint of the same SFT recipe found the model produces degenerate zorazora... repetition that solves 0/76 tasks regardless of decoding parameters, versus 17.2% for the Qwen3-32B SFT on the identical pipeline; at eval time that looked like broken fine-tuning, since traced to the EOS save/serve mismatch described above. @Calvin-Xu's baseline_stratified data-mix-optimization swarm anchor #2345 finished at 8.3% MFU, providing the validation-loss anchor for the joint mixture-and-scale regression Calvin posted on #2404 — the optimum found at 60M/1.2B holds its perplexity advantage at every probed scale.
| Run | User | Hardware(?) | Hours(?) | FLOP Budget(?) | Loss | BPB(?) |
|---|---|---|---|---|---|---|
| #4697 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_resume45207_clip15_20260427 | David Leo Wright Hall | TPU v4 (1024 chips) | 4.1h | 5.08e22 model / 3.15e23 HW (16%) | — | — |
| #4697 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_resume45207_clip15_20260427 | David Leo Wright Hall | TPU v4 (1024 chips) | 2.9d | 5.02e22 model / 3.07e23 HW (16%) | — | 0.779 |
| #4697 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_124933 | David Leo Wright Hall | TPU v4 (1024 chips) | 10.0d | 3.91e22 model / 2.38e23 HW (16%) | — | 0.791 |
| curation-llm_curated_1000-expWARC_natural-9e+20-d768-L8-B4096 | Michael Ryan | TPU v5 (128 chips) | 14.9h | 9.00e20 model / 3.66e21 HW (25%) | — | 1.118 |
| #2351 curation-llm_curated_1000-expWARC_natural-9e+20-d1536-L16-B1024 | Michael Ryan | TPU v5 (128 chips) | 11.9h | 9.00e20 model / 2.99e21 HW (30%) | — | 0.973 |
| #2351 curation-llm_curated_bos_fixed-expFM_natural-2e+21-d2432-L24-B512 | Michael Ryan | TPU v5 (16 chips) | 3.1d | 1.45e21 model / 2.99e21 HW (48%) | — | 1.027 |
| #2351 curation-resiliparse_1000-expWARC_natural-9e+20-d1536-L16-B1024 | Michael Ryan | TPU v5 (128 chips) | 11.9h | 9.00e20 model / 2.89e21 HW (31%) | — | 0.964 |
| #2351 curation-resiliparse-expFM_natural-9e+20-d2432-L24-B256 | Michael Ryan | TPU v5 (32 chips) | 1.8d | 9.00e20 model / 2.70e21 HW (33%) | — | 0.941 |
| #4760 e3956np_m32b_kimi_50k | Moo Jin Kim | TPU v5 (32 chips) | 2.4d | 1.20e21 model / 2.66e21 HW (45%) | — | 0.216 |
| curation-resiliparse_1000-expWARC_natural-9e+20-d768-L8-B4096 | Michael Ryan | TPU v5 (128 chips) | 15.0h | 9.00e20 model / 2.58e21 HW (35%) | — | 1.111 |
| curation-resiliparse_1000-expWARC_natural-9e+20-d512-L6-B4096 | Michael Ryan | TPU v5 (64 chips) | 1.1d | 6.54e20 model / 2.37e21 HW (28%) | — | 1.246 |
| pinlin_calvin_xu/data_mixture/ngd3d~5b98e67a/baseline_stratified | Calvin Xu | TPU v5 (32 chips) | 15.2h | 1.88e20 model / 2.26e21 HW (8%) | — | 0.953 |
| curation-resiliparse_2000-expWARC_natural-9e+20-d512-L6-B4096 | Michael Ryan | TPU v5 (64 chips) | 23.2h | 5.95e20 model / 2.16e21 HW (28%) | — | 1.235 |
| curation-resiliparse_2000-expWARC_natural-9e+20-d1280-L13-B1024 | Michael Ryan | TPU v5 (64 chips) | 22.6h | 8.79e20 model / 2.09e21 HW (42%) | — | 0.996 |
| curation-llm_curated_2000-expWARC_natural-9e+20-d1280-L13-B1024 | Michael Ryan | TPU v5 (64 chips) | 1.4d | 6.44e20 model / 2.07e21 HW (31%) | — | 1.064 |
16 comments on 13 threads