The big news this week was that the 1e23 MoE preregistration run #4697 still hasn't crossed the finish line. Three attempts at moe_1e23_d5120_bs2048_ep8_ragged_48l together burned 319k chip-hours — 94.5% of the week's TPU spend and 96.5% of HW FLOPs — and all three crashed. The best on-record state is resume45207_clip15 at train_loss 2.1265 / paloma macro 2.4986 / uncheatable macro 2.169 on 487B tokens, still aimed at the preregistered 2.25 paloma macro target from the isoflop fit at #4447. The recurring crash signature was painstakingly bisected to wandb.init(resume="allow") on worker 0 creating host-state divergence that surfaces as a TPU launch-id mismatch one save-step after restart (#5319); fresh-id-per-attempt with resume="never" is the validated fix.
Around the still-running 1e23, the agent-driven ablation sweep produced its first cleanly promotable architecture change in #5184 — AdamH on the token-embedding table — passing Gate 2 across all four scales with a 1e23 paloma projection of 2.249 vs the 2.251 baseline, closing out the long-running embed-norm-growth investigation #4569. The JAX 0.9.2 upgrade rippled across the cluster: it broke the Triton ragged_dot kernel on H100 (fixed same day in #5347), exposed an NCCL alltoall pin in CoreWeave torch builds #5379, and dragged the GPU canary pass rate down to 60% — but it also unlocked a full Triton fwd+bwd MoE path measuring 1.91x faster than XLA on H100x8 in #5330, with production follow-up #5350 stacked and ready. The H100 track also got new structure: #5328 opens a SonicMoE-style local-compute epic and #5356 sets the June 16B-A2B GPU-host target.
The post-Ray-sunset shakeout dominated infra. #5212 lifted the log store and server out of the controller into a new lib/finelog package — three latent failure modes (path rewrites, container perms, log-pusher init order) surfaced and were patched within days. #5290 stood up the sibling stats_service and #5370 moves per-tick worker and per-attempt task time-series out of the controller's sqlite — directly closing last week's #5072 ask. Coscheduling preemption finally became real for multi-host TPU jobs in #5240; cross-region correctness fixes in #5223 and #5225 stop MARIN_PREFIX from leaking into executor hash IDs. On the data side, the canonical pipeline absorbed a wave of new sources and stood up a deliberate Zephyr-perf push — #5282's pluggable shard runner gives 17-26x test speedups — and @dlwh followed last week's gap report with the broader "mineshaft gap" perplexity sweep against Qwen3 and Llama. Synthetic data turned in mixed news: SWE-ZERO's swarm reached ~52% PR coverage and the 500K SFT was stuck for ~80 hours, but the TerminalCorpus Marin-32B midtrain #4760 exposed a real ~10x agentic-prior gap to Qwen3-32B that survived an EOS-token mismatch fix — a base-prior weakness that more SFT alone may not close.
Summary: Define canonical data pipelines for all data ingestion: download -> normalize -> dedup/quality -> tokenize.
With the testbed baseline landed last week, the canonical pipeline split this week into two parallel motions: source-coverage expansion on the front end and a Zephyr performance push on the back end. @Helw150 drove the source push, registering allenai/Molmo2-Cap as a text-rendering of the long-form video-captioning corpus released with Molmo 2 #5299 (104K videos rendered as a merged paragraph plus timestamp-tagged per-frame lines), GAIR/daVinci-Dev with both a ctx-native PR-row renderer and an env-native SWE-Agent trajectory renderer mirroring swe_rebench_openhands #5252, and nyuuzyou/svgfind's ~3.6M Creative Commons icons rendered as Title/Data Pack/Tags-prefixed SVG markup for SFT #5304. Three more sources sit open for review at week's end: #5300 wraps lambda/hermes-agent-reasoning-traces (14,701 multi-turn tool-calling trajectories), #5305 registers TeraflopAI/SEC-EDGAR (43.7B tokens, ~8M filings across 10 form types), and #5339 stages Amazon's MASSIVE multilingual tool-use dataset for 11.39B tokens with a per-locale fan-out zephyr pipeline. #5276 goes further afield with a SWE-rebench v2 ConTree tracing pipeline that runs Python test suites in Nebius sandboxes and writes annotated execution-trace rows for code-world modeling.
Running gated downloads at this volume immediately surfaced an auth footgun. #5280 reported that download_hf_step against a gated dataset silently retried forever when no HF_TOKEN reached the worker — Iris auto-injects from the submitter's os.getenv("HF_TOKEN"), but submitters who logged in via huggingface-cli login only have the token at ~/.cache/huggingface/token, so the mismatch produced no warning at submit time and no surfaced error during the run. #5281 closed it by classifying HfHubHTTPError with status 401/403 as non-retryable and raising a RuntimeError that distinguishes "no token" from "token lacks access," replacing the 20-attempt exponential-backoff loop with a fail-fast. The testbed itself sprouted three formal experiment arms on the issue tracker: #5308 (no-dedup baseline), #5309 (fuzzy-dedup with num_perms=286, num_bands=26, ngram_size=5, cc_max_iterations=10), and #5310 (negative-control duplication at 50%, keeping the first ceil((1 − dup_rate) · N) rows as the unique pool and replaying them) — the three legs that let every subsequent ranking-protocol comparison subtract against a fixed reference at the same compute-optimal point on Paloma macro and uncheatable_eval. Higher up the stack, @ihodes opened #5360 to scope a quality-and-dedup parameter-selection workstream against a mid-May launch, with dedup params and contamination detection as p0 and quality scores as p1.
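For the record, a minimal sketch of the fail-fast shape #5281 describes, where a 401/403 from the Hub becomes a hard error that says which of the two auth failures was hit; the helper name and wording here are illustrative, not the actual marin code:

```python
# Minimal sketch of the fail-fast classification described in #5281; the helper name
# and wording are illustrative, not the actual marin code.
import os

from huggingface_hub.utils import HfHubHTTPError


def raise_if_auth_error(err: HfHubHTTPError) -> None:
    """Turn gated-dataset auth failures into hard errors instead of endless retries."""
    status = err.response.status_code if err.response is not None else None
    if status not in (401, 403):
        raise err  # anything else stays on the normal retry path
    if not os.getenv("HF_TOKEN"):
        raise RuntimeError(
            "Gated dataset requires HF_TOKEN but none reached the worker. Note that "
            "`huggingface-cli login` only writes ~/.cache/huggingface/token; export "
            "HF_TOKEN before submitting so Iris can inject it."
        ) from err
    raise RuntimeError(
        "HF_TOKEN is present but lacks access to this gated dataset; request access "
        "on the Hub before retrying."
    ) from err
```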
The bigger story underneath was the launch of a deliberate Zephyr performance push. @wmoss's #5333 design proposal frames the goal bluntly: a standard run currently takes ~10 hours and an order-of-magnitude speedup would change what experimentation looks like. The plan is to capture and analyze the CPU profiles already taken every five minutes for every Iris job, ship the low-hanging-fruit optimizations alongside the tooling that validates them, and use the tooling to define the next round. #5352 tracks the work as a sub-epic, with #5353 the first concrete leaf — using multithreading on CPU-heavy stages to amortize library-import cost and assigning different worker types per stage. @rjpower's #5282 is the headline early win: shard execution becomes pluggable via a StageRunner protocol, with a new InlineRunner running shards in the worker actor's own process for LocalClient while distributed (Iris) clients keep the old SubprocessRunner for crash isolation. The two slow tests called out as motivation drop dramatically: test_connected_components_happy_path from 39.2s to 1.5s (~26×), test_fuzzy_dups_multi_source_per_source from 48.7s to 2.9s (~17×). #5311 follows with a regression guard that asserts subprocess parametrization is real — each shard records its PID, the test asserts ≤max_workers distinct PIDs across 5 shards, and the inline case is marked xfail(strict=True) so a silent fallback would flip to XPASS and fail the suite. #5265 migrates stale ctx.execute() callers to the ZephyrExecutionResult dataclass, and #5286 from @wmoss removes group_files from dedup_commons.py and pushes the sort into _collect_input_files. @ravwojdyla's #5348 playbook (a re-open of the corrupted #5199) wires the perf process to a fineweb_edu ferry running 10 times in a row to surface variance against Iris preemption.
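A simplified sketch of the pluggable-runner shape from #5282 (class and method names below are illustrative; the real zephyr protocol differs in detail):

```python
# Simplified sketch of pluggable shard execution (#5282). The real zephyr StageRunner
# protocol differs, but the split is the same: LocalClient runs shards inline in the
# worker's own process, while distributed clients keep a process-isolated runner so a
# crashing shard cannot take the worker down with it.
from concurrent.futures import ProcessPoolExecutor
from typing import Any, Callable, Iterable, Protocol


class StageRunner(Protocol):
    def run_shard(self, stage_fn: Callable[[Any], Any], shard: Iterable[Any]) -> list[Any]: ...


class InlineRunner:
    """Run the shard in-process: no fork, no re-import of heavy libraries per shard."""

    def run_shard(self, stage_fn, shard):
        return [stage_fn(item) for item in shard]


class IsolatedRunner:
    """Run the shard in a child process for crash isolation (stand-in for SubprocessRunner)."""

    def run_shard(self, stage_fn, shard):
        with ProcessPoolExecutor(max_workers=1) as pool:
            return list(pool.map(stage_fn, shard))
```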
Two memory-hardening fixes landed in parallel. #5231 from @ravwojdyla flips Levanter's _long_string_workaround on unconditionally so a 64M-character outlier never reaches the underlying Rust tokenizer as one giant string — the rewrite encodes per record so an outlier's pieces never coexist with the rest of the batch's encodings, sub-batches the per-outlier encode_batch calls in groups of 256, and accumulates ids in place via ids.extend(...) rather than building a fresh concatenated list. #5340 rebases @hsuhanooi's byte-budgeted scatter from last week's #5055 and lands it, but reviewing it surfaced #5344: under key skew on large-memory workers a single shard's buffer can grow to the full ~25%-of-cgroup global budget before the global gate fires, producing one chunk that the reducer later loads in full via fs.cat_file — a real OOM risk, with a writer-side hard cap and a reducer-side streaming read sketched as candidate fixes. Separately, #5334 documents a hard PyArrow limitation that bites the read path: PyArrow's parquet C++ reader has a ~8 MiB cap on the thrift page-header size, so a single value larger than ~8 MiB makes the writer's per-page column statistics overflow the cap and PyArrow refuses to decode the page header (OSError: Couldn't deserialize thrift: No more data to read.); DuckDB, arrow-rs, and arro3 read the same bytes correctly. #5335 proposes a DuckDB fallback in zephyr.readers.load_parquet while apache/arrow#47758's read-side max_page_header_size fix moves through upstream review.
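The proposed fallback in #5335 is easy to picture; a hedged sketch, with the function name and error matching being illustrative rather than the actual zephyr.readers code:

```python
# Hedged sketch of the DuckDB fallback proposed in #5335 for parquet files whose page
# headers overflow PyArrow's ~8 MiB thrift cap; names and error matching are illustrative.
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq


def load_parquet_with_fallback(path: str) -> pa.Table:
    try:
        return pq.read_table(path)
    except OSError as err:
        # PyArrow surfaces the overflow as "Couldn't deserialize thrift: No more data
        # to read." when a single >~8 MiB value inflates the per-page statistics.
        if "deserialize thrift" not in str(err):
            raise
        # DuckDB's parquet reader copes with the same bytes and returns Arrow directly.
        return duckdb.sql(f"SELECT * FROM read_parquet('{path}')").arrow()
```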
Summary: Improve Levanter's data store, fix K8s logging at scale, and address infrastructure gaps in the Iris dashboard and profiling.
The Neville/@ravwojdyla thread on dropping consolidate_shard_caches from tokenize moved from "merge it and see" to a dual stress test this week. Neville's #4814 replaces the producer-side consolidation step with an in-memory ShardedTreeCache that downstream readers see as a single virtual tree; rav was +1 to merge but then ran a real-world load and found that opening the sharded cache for nemotron_cc_v2/medium_quality (1113 shards) was slow — about five minutes between attempting to load the shard ledger and the first downstream read in the log. He pointed Neville at the same data tokenized two ways — sharded under marin-us-central1/data/datakit/tokenized/... and consolidated under marin-tmp-us-central1/ttl=7d/tokenize/... — for a head-to-head, noting that some of his ~100 tokenized datasets have shards in the thousands. @dlwh: Thanks for taking this on! I should have thought to do this.
Late on May 1 Neville kicked off two smoke training jobs from his PR branch — one against the sharded cache, one against the consolidated — and the comparison is pending.
The other Levanter-store work this week was robustness, not format. @rjpower opened #5329 after a W&B storage-quota exhaustion took down a large run, and his #5332 wraps W&B and Trackio in a new BackgroundTracker that serializes calls onto a daemon thread and catches/logs exceptions instead of letting them propagate, with CompositeTracker hardened so a failing member cannot drag the others down — auth/init failures still propagate so a misconfigured run refuses to start. @ahmeda14960's #5259 finished off era shuffle now that @dlwh's #5246 made BlockShuffleConfig(io_block_size=256, window_blocks=512, perm_type="feistel") the LM-data default — era's zero-cross-era mixing within an epoch was, per the issue, redundant and a footgun on temporally-segmented physical layouts, and block shuffle's global Feistel permutation strictly dominates it (a toy sketch of the Feistel idea follows this paragraph); #5303 swept the remaining stale shuffle references and migrated the OLMo config field. The temp-checkpoint roots from last week's #5066 merged on April 27 — and immediately surfaced a follow-on bug Ahmed filed as #5374: a v5p-256 Levanter job initialized from a cold mirror:// checkpoint in another region had every rank call latest_checkpoint_path() and eagerly stage the full TensorStore tree into the local Marin bucket, with rank 24 tripping rigging's 10 GB cross-region transfer budget and JAX distributed aborting the pod. The fix shape is either single-process staging or a TensorStore source URL it can read directly.
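To make the Feistel point concrete, a toy index permutation (not Levanter's implementation) showing why a keyed Feistel network gives a deterministic, invertible shuffle of block indices without materializing the full order:

```python
# Toy Feistel permutation over [0, n): deterministic, keyed, and O(1) memory, which is
# what lets block shuffle permute blocks globally without a materialized index array.
# This is an illustration, not Levanter's BlockShuffleConfig implementation.
import hashlib


def _round_fn(x: int, key: int, rnd: int, half_bits: int) -> int:
    digest = hashlib.blake2b(f"{key}:{rnd}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % (1 << half_bits)


def feistel_permute(i: int, n: int, key: int, rounds: int = 4) -> int:
    half_bits = (max(n - 1, 1).bit_length() + 1) // 2
    mask = (1 << half_bits) - 1
    while True:  # cycle-walk: re-apply until the image lands back inside [0, n)
        left, right = i >> half_bits, i & mask
        for rnd in range(rounds):
            left, right = right, left ^ _round_fn(right, key, rnd, half_bits)
        i = (left << half_bits) | right
        if i < n:
            return i


# Sanity check: the map is a bijection on [0, n).
assert sorted(feistel_permute(i, 1000, key=42) for i in range(1000)) == list(range(1000))
```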
On the cluster-observability and K8s side, @ravwojdyla opened #5198 as a forward-looking stand-up of fleet-wide observability for the TPU and CW GPU pools — bird's-eye health, hardware/fabric monitoring, topology-aware rollups, drill-down to logs and traces, a bad-node lifecycle (detect → quarantine → drain → return), and an explicit ask to resist squeezing the work into Iris and instead build above it. @hammer chimed in with the OpenObserve Parquet-on-object-store option and the Flare paper as worth a closer read; @rjpower: I agree better observability would be good, and splitting it from Iris is the right call.
Rav also opened #5175 against the brand-new tokenize perf-counter logs from #5063 — map-only stages were emitting per-second items=0 (0.0/s) heartbeats with nothing to report — and #5220 proposes a daily ops job that runs the cross-region egress-checking script and posts a Discord report with usage tagged to the responsible users, to catch egress blow-ups before they rack up charges. Inside the controller itself, the visible churn this week was Russell's series of restarts in #infra as the lifted-and-shifted finelog log server grew up: a permissions issue racked up 10M lines of pending logs, then a controller restart later in the week lost the log-server configuration outright before being restored a minute later.
The remaining surface is housekeeping that nonetheless touches a lot of files. @yonromai landed the served-model eval RFC #5285 — a ModelDeployment → ModelLauncher → RunningModel → eval adapter handoff so eval code stops instantiating vLLM, Iris, or Fray objects directly — followed by #5322 renaming the modules to drop the served_ prefix, @dlwh's #5325 repairing the import grouping ruff broke on main, and #5331 adding echo=true prompt-plus-completion logprobs to Levanter's OpenAI-compatible /v1/completions server so stock lm_eval local-completions scoring works against Levanter without coupling eval logic to model construction; @ihodes's #5368 defines done for that broader project as MMLU-SL-Verb-5shot and HumanEval-5shot of the 1e22 MoE running pre-emption-resilient on a v5p-8 via vLLM. @dlwh's #5314 dropped I001 from the ruff ignore list and swept 489 import-order violations to stop the churn from leaking into unrelated PRs; @rjpower's #4808 finished cluster C of the pyrefly tightening (1551→1203 diagnostic lines, 129→100 suppressed) and re-enabled unbound-name, not-iterable, bad-index, and bad-context-manager, picking up real bugs along the way. Nightshift, the new auto-CI auditor from #5294, started filing its own follow-ups — #5343 against the 40-second exp1457_multilingual_cpt_eval.py dry-run that re-walks the same DAG hundreds of times, and #5378 recording eight more slow tests that fall into already-tracked categories. @wmoss's #5289 bumped the Marin-tests timeout and parallelism to -n 4 as a stopgap and his #5194 sharded-Zephyr CI proposal was closed — the slowness turned out not to be consistent enough to chase yet — and @dlwh's #5293 replaced the multi-source fuzzy-dedup test's normalize+MinHash setup with a synthetic MinHashAttrData fixture, preserving cross-source regression coverage while skipping the slow end-to-end pipeline. Finally, @rjpower's #5288 opened the design discussion for moving Marin's GitHub Actions onto repo-owned Python workflow scripts so YAML stays a thin shell of triggers, runners, and matrices.
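To illustrate what the #5331 addition enables, a hedged example of scoring a fixed prompt against an OpenAI-compatible /v1/completions endpoint with echo'd logprobs (host, port, and model name are placeholders):

```python
# Hedged example of the echo'd-logprobs pattern #5331 adds for stock lm_eval
# local-completions scoring; endpoint and model name below are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "marin-dev",
        "prompt": "The capital of France is Paris.",
        "max_tokens": 0,   # score only, generate nothing
        "echo": True,      # return logprobs for the prompt tokens themselves
        "logprobs": 1,
    },
    timeout=60,
).json()

token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
# The first token has no conditioning context, so its entry is null in the OpenAI schema.
total = sum(lp for lp in token_logprobs if lp is not None)
print(f"sum log-prob over echoed tokens: {total:.3f}")
```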
Summary: Tracking issue for April MFU work. Tasks/goals:
The holding pattern broke decisively. The JAX 0.9.2 upgrade from #5278 regressed the existing Triton ragged_dot kernel — jax.experimental.pallas no longer exports load/store, and the GPU canary crashed at step 0 every retry, with the auto fallback failing to catch the AttributeError #5341. @yonromai turned the fix around the same day in #5347, replacing pl.load/pl.store with the new plgpu.load/store(ref.at[...]) API and broadening the auto-fallback exception list. The patched Triton path measures within rounding distance of the JAX 0.8.0 baseline (1.007× and 1.011× on the two probe shapes) and stays 24.31× and 13.83× faster than XLA for the same M=32768/E=64 ragged-dot calls on H100×8.
More consequentially, @yonromai's #5330 experiment validated the hypothesis from #4297 that JAX 0.9 would let the backward pass move onto Triton too. Raw autodiff through pallas_call still does not work on 0.9.2, but an explicit Tokamax-shaped custom-VJP — Triton kernel for dlhs with the grouped RHS transposed, separate ragged-contracting-dimension Triton kernel for drhs — does, and is materially faster: paired H100×8 Grug MoE microbench runs (tokens=4096, hidden=1024, intermediate=2048, 64 experts, top-4, bf16) dropped median steady latency from 12.69 ms on XLA to 6.61 ms on Triton fwd+bwd, a 1.91× speedup, with compile-plus-first-step shrinking from 8.4 s to 2.4 s. Expert-weight gradient diffs were exactly zero; input-gradient max abs diff was 3.7e-4 on the bf16 path. Production follow-up is draft #5350, stacked on #5347.
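The structural trick is worth sketching. This is not the #5330 code (the reference implementations below stand in for the real Triton kernels), but it shows how an explicit custom VJP hands both directions of the grouped matmul to hand-written kernels when raw autodiff through pallas_call isn't available:

```python
# Structural sketch, not the actual #5330 implementation: an explicit custom VJP routes
# the forward ragged dot, the dlhs pass (grouped RHS transposed), and the drhs pass
# (ragged contracting dimension) to dedicated kernels. The *_kernel functions below are
# plain reference implementations standing in for the Triton kernels.
import jax
import jax.numpy as jnp
import numpy as np


def _fwd_kernel(lhs, rhs, group_sizes):
    # lhs: [m, k], rhs: [num_groups, k, n], group_sizes: [num_groups] summing to m.
    return jax.lax.ragged_dot(lhs, rhs, group_sizes)


def _dlhs_kernel(g, rhs, group_sizes):
    # dlhs_g = g_g @ rhs_g^T: another ragged dot, with the grouped RHS transposed.
    return jax.lax.ragged_dot(g, jnp.swapaxes(rhs, -1, -2), group_sizes)


def _drhs_kernel(lhs, g, group_sizes):
    # drhs_g = lhs_g^T @ g_g: ragged along the contracting dimension. Reference version
    # via a one-hot group mask; the real thing is a dedicated Triton kernel.
    group_ids = jnp.repeat(jnp.arange(group_sizes.shape[0]), group_sizes,
                           total_repeat_length=lhs.shape[0])
    onehot = jax.nn.one_hot(group_ids, group_sizes.shape[0], dtype=lhs.dtype)
    return jnp.einsum("mg,mk,mn->gkn", onehot, lhs, g)


@jax.custom_vjp
def ragged_dot_fwd_bwd(lhs, rhs, group_sizes):
    return _fwd_kernel(lhs, rhs, group_sizes)


def _vjp_fwd(lhs, rhs, group_sizes):
    return _fwd_kernel(lhs, rhs, group_sizes), (lhs, rhs, group_sizes)


def _vjp_bwd(residuals, g):
    lhs, rhs, group_sizes = residuals
    dlhs = _dlhs_kernel(g, rhs, group_sizes)
    drhs = _drhs_kernel(lhs, g, group_sizes)
    # Integer-typed inputs take float0 cotangents rather than real gradients.
    dgroups = np.zeros(group_sizes.shape, dtype=jax.dtypes.float0)
    return dlhs, drhs, dgroups


ragged_dot_fwd_bwd.defvjp(_vjp_fwd, _vjp_bwd)
```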
The H100 track also got new structure. @dlwh opened #5328 as a fresh epic for making the local Grug MoE GMM/MLP path fast on H100 via SonicMoE-style ideas — gather-fused grouped GEMM, smarter activation-memory bookkeeping that avoids the large O(T·K·D) saved intermediates, w13/w2 layout work — explicitly scoped as the local-compute half of the stack, with the DeepEP-equivalent transport half deferred to a separate epic. @ihodes opened #5356 (train a June 16B-A2B MoE for ~1k steps on 2+ GPU hosts) and #5357 (get to Nemotron ±ε MFU on H100s) as the near-term goal posts. The first concrete blocker showed up almost immediately: #5377, an NCCL alltoall crash on a CoreWeave H100 node that initially looked like bad hardware but @yonromai root-caused to torch==2.10.0+cu128 pinning nvidia-nccl-cu12==2.27.5; bumping to torch==2.11.0 with nvidia-nccl-cu12==2.28.9 passes a manual 8×H100 lax.all_to_all probe, draft fix in #5379. In the moe Discord channel, @dlwh circulated the Megatron-Core MoE paper (arxiv.org/abs/2603.07685) with the read that the sharding and fusion takeaways “seem like they follow from first principles” and the hope that XLA will do most of the work; the older Grug-vs-Megatron head-to-head sub-issues #4311, #4312, and #4313 still have not seen a measurement update.
Summary: Split from #4266.
The headline this week was a process upgrade: rjpower opened #5210 as an RFC arguing that since agents make it cheap to start large changes, the team has been skipping design work and paying for it at code review with sprawling, near-sighted, novel-bug-shaped PRs. The proposal is deliberately lightweight — open an issue, paste a 1-pager under 500 words, ping Discord, begin work in parallel — with area owners expected to provide prompt feedback rather than gate. yonromai endorsed the framing ("reviewing the inputs of coding agents is more time efficient than their outputs"), hammer pointed at Compound Engineering / RPI as comparable structured workflows, and dlwh signed on with a caveat against over-meta-engineering. #5236 landed the same day, rewriting the design-doc skill as an interactive Frame → Research → Interrogate → Draft → Stress-test → Publish flow with .agents/projects/design-template.md as the fillable template. #5209 deleted the four GitHub issue templates so "New issue" goes straight to a blank form, moved the experiment template body into .agents/skills/agent-research/SKILL.md, and taught the Claude triage workflow to refuse vague issues until the author supplies reproduction steps or a definition of done.
The skill got a real-world stress test almost immediately. rjpower volunteered the in-flight stats service as the first worked example and shipped #5241 the next day — a typed, schema-registered stats_service co-hosted in finelog that replaces three uncoordinated places Iris emits operational stats and gives the dashboard worker pane a queryable history. #5243 followed with a second design pass on inverting executor-in-training-job launches to drop implicit cross-region egress, and #5285 from yonromai used the new template to land the served-model eval handoff (ModelDeployment → ModelLauncher → RunningModel → eval adapter). yonromai called the side-markdown-files pattern out as working well — "hyperlinking makes it easier for humans to gather context (and check data support)." ravwojdyla flagged a paper cut (the relative file refs in the PR body resolved to dead links instead of perm-links), which rjpower fixed in a follow-up skill update. #5229 wired up the supporting plumbing: scripts/ops/discord.py resolves webhooks from DISCORD_WEBHOOK_* env vars or gcloud secret manager so the same script announces new designs to internal-discuss and code-review from local shells and GH Actions alike — marin-bot used it this week to post both #5241 and #5243 into code-review for feedback.
The other half of the epic this week was the nightly autonomous-cleanup pipeline, which kept producing daily multi-cleanup PRs across all four library trees. #5201 deduped LogStore.append across the DuckDB and in-memory iris stores and collapsed dead HfHubHTTPError.status_code branches in _hf_should_retry; rjpower asked the bot to walk back an over-aggressive removal of a try/except around os.makedirs. #5234 used the canonical is_job_finished helper in IrisClient.terminate_prefix after finding the inlined terminal-states set was missing JOB_STATE_WORKER_FAILED. #5263 caught a real silent bug in the slice_cache HuggingFace README generator. #5307 rewrote the 20-line two-phase BundleStore._evict_if_needed_locked as a 5-line oldest-first loop and deleted a Wikipedia helper that hadn't been called since July 2024. #5342 hoisted a shared format_resources into iris/rpc/proto_utils.py after finding the bug-report path dropped disk entirely and rendered sub-1-GiB memory as 0 GiB, and excised dead control flow in levanter's BackgroundIterator.__next__. Five days of small wins, all by parallel scout agents in their own worktrees.
Summary: Split from #4266.
The 1e23 MoE preregistration run #4697 continued through the week without crossing the finish line. @dlwh stood up a higher-grad-clip relaunch (moe_1e23_d5120_bs2048_ep8_ragged_48l_resume45207_clip15) on 2026-04-27 and posted the only on-record status this week: global_step=45337, train/loss=2.1537, grad/norm/output_proj=0.7798, MFU 16.36%, no exploding-gradient signature in the resume — see the grad-clip-1.5 announcement in #moe and the tracking comment on #4697. The preregistered 2.25 paloma macro target from the isoflop fit in #4447 remains an open bet — final macro_loss against that target has not yet been called.
Around the still-running 1e23 run, the agent-driven ablation sweep produced its first cleanly promotable architecture change: AdamH on the token-embedding table. @ClassicLarry's #5184 took embed off Adam and onto AdamH at lr_mult=1.0, passed Gate 1 with a 1.05x speedup at d512/d768, and survived Gate 2 across all four scales (1.048 / 1.040 / 1.032 / 1.013) with scaling-law projections of 2.249 at 1e23 vs the 2.251 baseline. Larry's diagnosis: the baseline embed param-norm grows from 176 toward arbitrary scales while grad-norm collapses to 0.001 on the 1e23 run, whereas AdamH pins the embed norm at 176 and keeps grad-norm in the 0.04–0.12 range. The follow-up #5203 bumping embed init std to 1.0 (so the first RMSNorm is a no-op) ties on loss but produces vanishing grad-norms below 0.001, so default-init AdamH-embed wins. This closes out the long-running embed-norm-growth investigation #4569.
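Mechanically this is a per-parameter optimizer split; a minimal optax sketch (optax ships no AdamH, so the stand-in below only marks where the project's real transformation would plug in):

```python
# Minimal optax sketch of the #5184 change: Adam everywhere except the token-embedding
# table, which gets AdamH. Params are assumed to be a flat name -> array dict here.
import optax

adamh = optax.adam  # stand-in for the real AdamH gradient transformation


def label_params(params):
    return {name: ("embed" if "embed" in name else "default") for name in params}


optimizer = optax.multi_transform(
    {"default": optax.adam(3e-4), "embed": adamh(3e-4)},
    param_labels=label_params,
)
```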
Most of the rest of the agent-MoE sweep landed as negative results worth keeping. AdamH on the router weights #5211 failed Gate 1 outright; removing the MoE router z-loss #5214 passed Gate 1 but failed Gate 2 at d1024/d1280, with cosine-similarity heatmaps showing the no-z-loss router develops anti-correlated expert pairs while logits grow ~5x larger; LM-head init scale 2x/4x #5222 failed Gate 1; removing QK norm #5230 failed three of four scales. @Kaiyue-Wen's global-gradient-normalization variant #5182 (PR #5183) passed Gate 1 with ~1.04x but lost Gate 2 at the bigger scales — confirming the small-scale 4% speedup but ~4% slowdown at scale she flagged in #moe. The MHA-vs-GQA ablation #5151 showed the recipe is leaving ~15% on the table for the 4x KV-cache savings — explicitly a concession to be reclaimed later via MLA — and the barebones-transformer ablation #5154 attributes ~0.14 macro_loss to the combined MoE + GatedNorm + XSA stack.
On the LR-scaling side, @WhenWen's depth-MuP residual-scaling sweep #5178 finished d512/d768/d1024 all preferring the 1x LR multiplier, matching the existing v16 fit, but d1280 is currently centering around 0.5x–0.707x at the ~5k checkpoint — so it's a central-basin stability signal, not strong scale-invariance yet. @pc0618's Muon Vizier r2 search #5167 found a tiny d512 win for Muon (3.8073 vs AdamH 3.8104) that did not transfer to d768, with AdamH-2x-batch controls also underperforming AdamH baseline; the consensus in #moe is that 2x batch + heavy hyper tuning is needed before declaring on Muon, and a split gate/up + warmup follow-up search has been launched. Looking forward, the recipe-integration work for the June model has now been opened as #5358 (with combined-best variants like #5371 stacking PKO + AdamH-embed + k6e256s5), and @ClassicLarry shipped a multi-host checkpoint-resume fix in #5319 after the wider-expert k6e256s5 d1024 jobs hit a broadcast_one_to_all launch-group drift on resume.
Summary: Lower priority / slack-time workstream covering workqueue, dev-tpu replacement, and observability.
The lower-priority observability workstream was anything but lower-priority this week: the log-service/stats-service split that #5072 proposed last week landed end-to-end. @rjpower's #5212 lifted the log store and log server out of the iris controller into a new lib/finelog package, deliberately scoped as a forcing function for the service-extraction template before tackling stats. #5241 then proposed the sibling stats_service — typed schema-registered tables that callers write rows into and query with SQL, with the iris dashboard worker pane as the MVP consumer replacing the current sqlite read path. #5290 built the per-namespace DuckDB backend, Vue dashboard, and deploy plumbing on top of finelog, and #5370 is now the cutover: per-tick worker and per-attempt task resource time-series move out of the controller's sqlite into iris.worker / iris.task stats namespaces, dropping worker_resource_history / task_resource_history and the snapshot_* columns entirely. The controller stays canonical for liveness, scheduling, and roster; only measurements move.
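The shape of the service is simple to picture; a hedged DuckDB sketch, with table name, columns, and query being illustrative rather than the real iris.worker schema:

```python
# Hedged sketch of the stats_service idea from #5241/#5290: a schema-registered table
# that callers append rows to and the dashboard queries with SQL. Table and column
# names are illustrative, not the real iris.worker namespace.
import duckdb

con = duckdb.connect("stats.duckdb")
con.execute("""
    CREATE TABLE IF NOT EXISTS iris_worker (
        ts TIMESTAMP, worker_id VARCHAR, cpu_pct DOUBLE, mem_gib DOUBLE
    )
""")


def record_worker_tick(ts, worker_id, cpu_pct, mem_gib):
    con.execute("INSERT INTO iris_worker VALUES (?, ?, ?, ?)", [ts, worker_id, cpu_pct, mem_gib])


# Dashboard-side read: last hour of per-worker CPU, newest first.
rows = con.execute("""
    SELECT worker_id, ts, cpu_pct FROM iris_worker
    WHERE ts > now() - INTERVAL 1 HOUR
    ORDER BY ts DESC
""").fetchall()
```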
The package rename produced predictable fallout that @ravwojdyla chased down: #5271 added a server-side path rewrite from /iris.logging.LogService/* to /finelog.logging.LogService/* after pre-#5212 worker images started 404'ing into UNIMPLEMENTED errors that weren't retryable, so cached clients never recycled and buffers overflowed; #5297 fixed /var/cache/finelog permissions on marin-dev; and #5262 hoisted LogPusher construction in Worker.start() above adopt_running_containers() so adopted attempts after a restart-worker capture a live pusher instead of silently None. Russell Power narrated some of the migration drama in #infra in real time — "the poor log server had a permissions issue, racked up 10M lines of pending logs and got sad" — and #5321 committed the log server entry that was missing from a previous deploy, the proximate cause of disappearing logs during a ~5-minute window.
Around the cutover, a string of dashboard and resilience polish: #5336 added a generic /proxy/<name>/<sub-path> endpoint on the controller so per-task Ray / JAX / TensorBoard dashboards are reachable through controller auth without each user touching worker IPs, and #5349 immediately followed when the proxy started leaking 10.x bind addresses to browsers behind GCP IAP through three independent sources of bad URLs (uvicorn forwarded-header handling, Starlette slash redirects, upstream-emitted Location headers — all three patched). #5332 wraps W&B and Trackio in a BackgroundTracker that serialises calls onto a daemon thread and catches exceptions instead of propagating, so long runs survive transient quota/network/5xx hiccups. #5207 closed out a regression from last week's RPC stats redesign — the global slow_samples and discovery_samples deques were aging quiet-method errors out under chatty traffic, now keyed per-method. #5187 added markdown status text to the tasks UI driven by Zephyr's items_count / bytes_processed counters, #5186 made the dashboard sparklines actually fill their grid cells, and #5284 added scripts/job_profile_summary.py for offline CPU-profile inspection — downloads the latest controller checkpoint, joins task_profiles with descendant tasks, normalises py-spy noise across siblings, and prints per-task and top-leaf summary tables plus a flamegraph SVG. The auto-mapped #5337 threads per-dataset runtime and token metrics through model-perplexity summaries and prevents eval batches from crossing dataset boundaries, so eval timing numbers are actually attributable.
Summary: Measurable: canary ferry pass rate consistently above 90%.
The TPU ferry stayed perfect at 9-for-9, up from last week's 7-for-7, but the GPU canary regressed sharply: 6 successes against 4 failures and 10 cancelled reruns, or 60% pass on completed runs versus last week's 87%. The arithmetic understates the calendar — Apr 26 through Apr 30 went five-for-five clean, then May 1–2 turned into a debugging marathon. The trigger was the JAX 0.9.2 upgrade exposing an NCCL all-to-all crash on CoreWeave H100, captured as #5377 by claude after four identical Iris retries pinned to node g13c908. Initial diagnosis suggested the node was bad, but @yonromai traced it to nvidia-nccl-cu12 being held at 2.27.5 by torch==2.10.0+cu128; #5379 bumped torch to 2.11.0 to pull NCCL 2.28.9 into the lock and merged Saturday afternoon, validated against an 8×H100 manual probe.
Fixing NCCL only got the ferry past the all-to-all and into Grug training, where it then OOMed on a separate post-NCCL diagnostics path. @yonromai's stacked #5380, still draft at week's end, disables the JAX profiler and Levanter watch-stat logging by default for GPU canaries; isolated 8×H100 probes confirmed the bare canary train step doesn't OOM, only the optional diagnostics around it do. The full CW workflow_dispatch validation at 15:45Z May 2 finally passed cleanly to 100/100 steps with final loss 6.53. @rjpower also filed #5382 documenting a separate slow-startup issue surfaced during the same debugging: cold tensorstore reads of zarr3-sharded SlimPajama tokens from R2 take 30+ minutes on first batch because the default prefetch_size=32 issues ~8192 concurrent reads against a high-latency object store. That one's pre-existing on main.
The datakit smoke ferry slipped to 5-for-7, ~71%, down from last week's 86%. Both failures are post-merge fallout from ravwojdyla-agent's bucket consolidation #5266, which folded marin-tmp-* into per-region marin-{region}/tmp on Apr 29. Issue #5376 captures the May 2 break: an Iris CPU job landed in us-west1-a, one of the zones in the cpu_vm_e2_highmem_2_ondemand scale group, where marin_temp_bucket() resolved to gs://marin-us-west1/tmp — but us-west1 was never in REGION_TO_DATA_BUCKET, so the bucket doesn't exist and the first _write_executor_info 404'd. Fix is either to drop us-west1-a from the CPU scale group or add the region to the bucket map; neither has landed, so datakit will keep flapping until one does.
Summary: All jobs run through Fray+Iris.
The week after the Ray sunset was a long shakeout of the logging plane. #5212 lifted the log store and log server out of the Iris controller into a standalone lib/finelog package — wire-compatible with the old iris.logging protos via a transcoding shim, with workers resolving the server through a generic cluster_config.endpoints map. Standing it up exposed three latent failure modes in quick succession. Pre-#5212 worker images kept POSTing to /iris.logging.LogService/* after the rename, got 404 → UNIMPLEMENTED, and because that error wasn't classified as retryable the cached client never recycled and the buffer overflowed; #5271 added a server-side path rewrite to keep old workers alive (closing #5268). Fresh GCE finelog VMs left /var/cache/finelog owned by root so the in-container finelog uid 1000 couldn't write — the existing prod boxes had only worked by coincidence (#5296, fixed in #5297). And after a controller restart, adopted containers came up before the worker's LogPusher existed, so every line silently fell into the if not self._log_pusher: return short-circuit forever — #5262 hoists pusher construction above adoption, with a regression test #5261. @rjpower summed the week up bluntly on infra: "the poor log server had a permissions issue, racked up 10M lines of pending logs and got sad." A missing log-server endpoint commit caused another five-minute outage along the way #5321, and #5291 unbroke the finelog Docker build by adding config/ back to the build context.
Coscheduling preemption finally became real for multi-host TPU jobs. The old short-circuit at if victim.is_coscheduled: continue meant interactive priority couldn't evict batch on any v5p-N or v4-N pod — i.e. exactly where preemption mattered (#5237, filed by @ahmeda14960 with a concrete repro). #5240 closed it: a higher-band coscheduled preemptor can now evict an entire victim slice atomically when device-variant matches and the slice is at least as large as the request, with solo preemption now also gated on matching device-variant so a v5p-256 ask can't reclaim a v5p-8 slot. #5249 mopped up two adjacent bugs that #5240 made visible: a coscheduled task hitting transient failure was returning to PENDING alone while siblings stayed RUNNING, so the retry could land on a different tpu-name and SPMD collectives would hang; and cancel_job left task_attempts active forever, making the dashboard report killed tasks as still occupying their old workers. Migrations 0038/0039 healed existing orphans. Separately, @RohithKuditipudi hit single-VM coscheduling head-on while running inference shards across v5p-8 workers — fray was emitting group_by="tpu-name" for any multi-replica TPU job, which is unschedulable when each replica has its own tpu-name #5219; #5226 mirrored the Iris CLI contract and only coschedules when vm_count > 1. @rjpower also opened #5258 on a related sharp edge — Iris reuses TPU hosts within seconds of evicting the prior tenant, but the JAX coordinator PollForError masks the libtpu busy string the existing tpu_health detector keys on, so /dev/vfio-busy hosts don't get recycled and the next job hangs.
The other big thread was cross-region correctness. @ahmeda14960 lost a Delphi 1e20 midtraining run when Iris rescheduled it from us-central1 to us-east5: MirrorFS rewrote bucket prefixes into the executor identity hash, the run id flipped, and the checkpoint didn't resume — "the executor framework is starting to be a footgun." #5223 stops region prefixes from leaking into hashes by storing relative paths in hash_attrs and adds a regression test that asserts hash_id/name_with_hash stability across MARIN_PREFIX while output paths still differ correctly #5216. #5225 wires tensorstore checkpoint I/O — which bypasses fsspec — into the cross-region transfer budget by adding record_transfer() calls keyed off estimated array byte counts. The longer-term fix is in design: #5279 proposes moving the executor's DAG walk out of the launch entrypoint and into the training worker so a job preempted in us-central1 and rescheduled in us-east5 re-tokenizes locally instead of paying egress, with a sibling #5218 teaching MirrorFS to discover temp checkpoints across regions #5217. ravwojdyla-agent retired the parallel marin-tmp-* bucket family in #5266, folding TTL-prefixed paths into the canonical regional buckets so MirrorFS only has one namespace to scan. And @rjpower's #5273 design proposes a GAR-backed PyPI pull-through mirror to keep Iris task installs alive through pypi.org / github.com flakiness, with marin-dupekit published to PyPI to unblock the one wheel uv's find-links path can't proxy.
Iris's operational surface kept getting polished. @wmoss landed the Zephyr→Iris status pipeline #5187 — markdown status text on tasks, populated from Zephyr's items_count/bytes_processed counters — and then #5283 split that into status_text_summary_md for a new Status column on the job task table and status_text_detail_md for the task page; #5176 stopped map-only Zephyr stages from spamming items=0 (0.0/s) log lines. claude's #5207 made RPC stats sample rings per-method so errors on quiet methods don't age out behind chatty ones #5206, and #5384 server-side-paginates ListWorkers via a new WorkerQuery so the Fleet tab stops rebuilding every WorkerHealthStatus on every refresh #5383. @rjpower's endpoint-proxy stack — #5336 for /proxy/<name>/<sub-path> on the controller dashboard, then #5349 for three independent sources of internal-IP leakage behind GCP IAP — finally lets per-task Ray/JAX/TensorBoard dashboards live behind controller auth without each user reaching a worker IP directly. #5345 tolerates non-Python children in py-spy dump so thread dumps work for training jobs with a wandb-core child, #5327 adds a per-scale-group cache_dir override and a saner dual disk health threshold, and #5284 introduces an offline CPU profile inspector that joins task_profiles with task descendants and emits a flamegraph. @dhidary's #5150 unblocks TRC-grant and security-locked GCP orgs by routing controller and SSH access through IAP tunnels when external IPs are forbidden. @yonromai closed out the last Ray-era doc references in #5202 #5029 and fixed the terminal-status self-race that leaked LeaseLostError from successful steps in #5208 #5026. @Helw150's #5295 — preemptible CPU children couldn't land anywhere with massive idle CPU sitting on TPU host VMs — lit up another scheduling gap to fix on the roadmap, alongside ravwojdyla-agent's #5270 drain/cordon RPC and @ihodes's #5369 infra tune-up tracker.
Summary: After the Marin x00B MoE models are pretrained, the next step is to mid-train/post-train the model using high-quality datasets targeting different capabilities, such as math/code/science reasoning and agentic tasks (e.g. coding). Many such datasets that are open-sourced are generated...
SWE-ZERO generation absorbed most of the week. #4898's 500K-trajectory SFT — which the prior week left mid-flight as the next data point on the dose-response curve — never produced a real training step: @AlienKevin burned through v5–v10 of the launcher fighting region pinning, v5p-16 capacity that simply was not available cluster-wide, then a v5p-8 fallback that hung on Levanter dataset-cache build for 3+ hours before the executor died. As of week-end the run has been listed as down for ~80 hours with .executor_status=FAILED at step-38 (init only), so the 500K dose-response point that follows last week's 0.0% → 3.3% → 4.0% → 5.3% SWE-bench Verified curve is still missing.
The 140B-token generation pipeline for #4719 got rebuilt twice. Mid-week @AlienKevin migrated from a 13-batch shard model to a Michael-Ryan-style swarm with hard us-east5 region pinning and 1260 finer claims (~94 PRs each) instead of 126 coarse ones; a sampled audit of the swarm output against the legacy run found the two were statistically indistinguishable (1.7% vs 1.8% Submitted rate, within-PR Jaccard medians 0.27–0.30, 100% exact-distinct), so legacy and swarm rollouts can be mixed. The pipeline then crossed the 140B raw-token bucket-size target on May 1 — at which point AlienKevin caught a measurement bug and reverted: the real target is 12.6M rollouts (100 × 126K PRs), not bucket bytes, and zero shards had _done markers. Recalibrating against the right metric, the run ended the week at ~6.50M rollouts and 64.1K PRs at the 100-rollout target, i.e. ~52% rollout coverage and ~51% PR coverage. Bad-shard recovery worked well (210 → 1 over the week as workers revisited under-filled claims), and the swarm Submitted rate held steady at ~4.0%, roughly 2× the legacy 1.7%. Throughput collapsed near week-end as us-east5 v6e-4 capacity dried up — fleet flickered between 1 and 30 workers, and the most recent audits showed only ~28K rollouts/hr, prompting AlienKevin to recommend stopping at ~64K PRs and triggering Phase 2 (Qwen3-0.6B SFT + 4-checkpoint evals on what's already there) rather than waiting 3–4 more days for marginal data.
The week's most pointed quality signal came from a sibling experiment in the post-training neighborhood. #4760 midtrains Marin-32B on 15% of Nemotron-Terminal-Corpus to compare against Qwen3-32B SFT at the same training fraction. Early Terminal-Bench 2 evals had Marin solving 0/76 with 70% output-length-exceeded errors; @AlienKevin traced this to a save/serve mismatch — the chat template trained the model to emit <|eot_id|> (128009) while the saved config.json carried eos_token_id=128001 inherited from marin-32b-base and no generation_config.json was written, so vLLM never stopped on the trained terminator and the model degenerated into zorazora… repetition until it hit the output budget. The fix landed as patched checkpoints, a MARIN_GENERATION_CONFIG constant in experiments/marin_models.py, end-to-end plumbing of generation_config through Levanter's HFCheckpointConverter and the SFT pipeline, and a defensive eval-launcher validator that fails fast on EOS-mismatched checkpoints. With EOS fixed, the apples-to-apples re-eval (temp=0.6, top_k=20, top_p=0.95) put Marin-32B SFT step-858 at 2/87 (2.3%) against Qwen3-32B SFT step-1000 at 18/86 (20.9%) on comparable ~110K vs ~128K examples seen — a real ~10× gap. Drilling in, both Marin solves (modernize-scientific-stack, sqlite-with-gcov) are spec-driven recipe tasks that Qwen3 also gets at half its training budget; the 10 tasks Qwen3 unlocks between step-500 and step-1000 (Coq proofs, async-task fixes, building 30-year-old C, etc.) are exactly the ones requiring agentic exploration that Marin still misses. The trace dataset went up at AlienKevin/terminal-bench-2-sft-traces for direct inspection.
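The defensive check is the reusable part; a hedged sketch of a fail-fast EOS validator (terminator string and checkpoint path are illustrative, and it assumes a generation_config.json is present, which is exactly what the fix writes):

```python
# Hedged sketch of a fail-fast EOS-mismatch check like the eval-launcher validator
# described above; the expected terminator and checkpoint path are illustrative.
from transformers import AutoTokenizer, GenerationConfig


def validate_eos(checkpoint: str, expected_terminator: str = "<|eot_id|>") -> None:
    tok = AutoTokenizer.from_pretrained(checkpoint)
    gen = GenerationConfig.from_pretrained(checkpoint)  # assumes generation_config.json exists
    trained_eos = tok.convert_tokens_to_ids(expected_terminator)
    configured = gen.eos_token_id
    configured_ids = configured if isinstance(configured, list) else [configured]
    if trained_eos not in configured_ids:
        raise ValueError(
            f"EOS mismatch: chat template terminates with {expected_terminator} "
            f"(id {trained_eos}) but generation config stops on {configured_ids}; "
            "the server will never stop on the trained terminator."
        )
```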
The reading is consistent with what @dlwh opened the week with in #evals: Marin-8B and -32B base models look fine on prose-heavy held-out loss but are materially weaker than Qwen3 on patches, tool-call observations, and code-world-modeling slices, with patch-gain BPB of -1.43 vs Qwen3's +0.16. Synthetic data is the answer the epic is supposed to provide; the pieces in flight (SWE-ZERO trajectories, the planned Qwen3-0.6B SFT smoke test, the TerminalCorpus midtrain) are the channels for delivering it. The week's net is that the upstream data tap is now flowing reliably at half-target, the post-training validation on Marin-32B has surfaced a base-prior weakness that more SFT alone may not close, and the headline 500K dose-response point on Marin-8B is still owed.
Summary: We will need 20T of high-quality (including / in particular code) tokens for our large MoE runs in Q2/Q3; this is the work in March that we will do to enable that.
The week pivoted from diagnosing the bits-per-byte gap to actively widening the sourcing tranche. @dlwh followed last week's gap report with a much broader perplexity sweep against Qwen 3 and Llama, posted as the "mineshaft gap" analysis under the confidence-portfolio epic #5005. The sweep anchors on Paloma and Uncheatable Eval and adds long-tail surfaces — FineWeb2 multilingual, raw Common Crawl WARC/WAT, SVG-Stack, GitTables, Web Data Commons, npm registry metadata, UniProt/RNAcentral/PubChem/ChEMBL, plus formal text like SMT-LIB, CoqGym, VerilogEval and PGN — measured in bits per byte after a token-realignment step that lets Qwen's tokenizer be compared cross-family. Marin remains broadly fine on edited English prose but is "very bad" on non-English text, "messy" surfaces (raw HTML, WARC, SVG, package metadata), code, scientific notation, and structured tables; @dlwh also published a sheet of tokens not covered by either the new or existing eval sets, calling out missing C#, Java, and Objective-C coverage and long-tail multilingual as the next gaps to fill.
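Bits per byte is what makes the cross-family comparison legitimate: summed token negative log-likelihood is normalized by UTF-8 bytes rather than tokens, so tokenizer choice drops out of the denominator. For the record:

```python
import math


def bits_per_byte(sum_nll_nats: float, num_utf8_bytes: int) -> float:
    # Normalize summed negative log-likelihood (nats) by byte count so models with
    # different tokenizers are compared on the same denominator.
    return sum_nll_nats / (math.log(2) * num_utf8_bytes)
```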
On the ingestion side, @Helw150 finally landed #4326 after @ravwojdyla rebased it onto the new datakit layout — a download-and-filter pipeline for HPLT v3.0 English that keeps only non-Common-Crawl sources (WIDE and survey crawls, ~450B unique tokens) and applies register-based quality filtering with Haiku as the ground-truth classifier (~99% agreement on machine-translation score, web-register, and PII). The companion #4328, which would have added ~14T tokens of Common Pile, FinePDFs, FineTranslations, NuminaMath, and Institutional Books and centralized download definitions under marin.datakit.download, was closed as superseded by follow-on PRs but most of its surface — Common Pile's 30+ subsets, the StackV2 code-extension filter, the BHL page-to-book stitcher — has already shown up in the token-count snapshot. New tranches landing this week include 57.6B tokens of davinci-dev/ctx-native and 2.6B of davinci-dev/env-native synthetic code, an 8.9B SVG corpus that maps directly onto one of the gap categories, and smaller agent-reasoning and Molmo caption sets. Several agent-filed standalone ingestion tasks — public diagnostic logs #5094 and GH Archive JSON events #5099 — got their first concrete framings; on #5099 @dlwh separated the tiny held-out eval slice in #5119 from the real training-scale ingestion path, calling for explicit train/dev/test date splits and token-count estimates so the mixture weight can be set on real numbers.
The downstream side of the pipeline — turning these tranches into a production mixture — got a sober status report from @Calvin-Xu. After two weeks of joint mixture-and-scale modeling, he settled on an anchor-mixture regression with N, D terms that reduces to Chinchilla Approach 3 at fixed mixture and to a pure mixture regression at the 300M/6B anchor, fitted at α=0.155, β=0.146. The form does not give convincing predicted optima when directly optimized, so as a de-risk he validated the mixture optimized at 60M/1.2B at every scale and showed the perplexity advantage remains stable. The catch, posted on #2345 and #2404: the loss advantage on Uncheatable Eval does not translate strongly to the benchmark/task scores Marin actually cares about. The single-phase ablation of the GRP form on #2404 beat the swarm mean (1.059 vs 1.115 BPB) but still trailed the original two-phase optimum (1.028). Discussion in the data-mixing channel converged on a fix: stop optimizing for Uncheatable alone and add Paloma plus the MMLU-train and HumanEval/GSM8K canonical-solution BPB. @Helw150 shared a parallel attempt on Calvin's functional form, decomposing many tasks into core capability axes via Item Response Theory (inspired by arXiv:2503.13335); the resulting mixture looks sane and edges out proportional, with the next step being to drop individual datasets from the IRT metric and watch the mix respond.
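One consistent way to write the form he describes (my rendering, not necessarily Calvin's exact parameterization), with mixture weights $w$:

$$
\hat{L}(N, D, w) \;=\; E(w) + \frac{A(w)}{N^{\alpha}} + \frac{B(w)}{D^{\beta}}, \qquad \alpha = 0.155,\ \beta = 0.146,
$$

which collapses to Chinchilla Approach 3 when $w$ is held fixed, and to a pure mixture regression at the 300M/6B anchor where the $N$ and $D$ terms become constants.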
The May target then got formalized: @Helw150 filed #5359 requiring an active swarm over all sources in datakit/sources.py, de-risked on the existing #2345 swarm against UncheatableEval, HumanEval, MMLU, GPQA, and David's PPL sets, with a checked-in mixture file as the definition of done — must beat (or match) both proportional mixing across all sources and proportional mixing over a hand-picked high-quality subset. Sub-issues split out the open questions: SNR-clean metric selection on #5362, quality-and-topic bucket finalization on #5363 (gated on Rafal's quality classifier), and a 1e23-FLOP production swarm spec on #5364 that #5365 will then launch — all aimed at locking the mix by mid-June. On a parallel research track, @XenonMolecule posted week-of progress on #2351 (small model that converts raw WARCs into training tokens): on a fixed 3000-WARC budget Qwen3-8B extraction yields 56.0B tokens versus Resiliparse's 142.7B and DCLM's 2.66B, putting the LLM extractor in a regime where it beats high-quality filtering at scale and stays slightly above Resiliparse on quality, with a scaling-law fit pending. @hammer also weighed in on #5197 with a careful literature pass on metadata conditioning (MeCo and follow-ups), flagging the "fine-grained beats coarse-grained" finding from "URLs Help, Topics Guide" and the open question of whether MeCo's 90/10 prepended-vs-cooldown schedule beats uniform interleaving — a likely interaction with the Luxical embedding work on #3049 and the topic-bucket plan on #5363.
Routine fixes and maintenance, no broader threads to surface this week.
External GitHub activity centered on @wmoss, who put up roughly a dozen Iris and Zephyr PRs — task-status surfacing in the job UI #5283 and #5187, an in-flight stats-to-sqlite stack #5141/#5143/#5144, and a Zephyr performance design proposal #5333 spawning #5352 and #5353. @gonzalobenegas kept the DNA gLM thread moving with two enhancer-curation experiments #5242 and #5355; @dhidary landed IAP-tunneling for internal-IP Iris in #5150; @RohithKuditipudi filed the fray TPU coscheduling fix #5219; @leloykun opened the MuonHT-with-tangent-constraints design #2434; and @moojink staged OpenThoughts3 SFT baselines in #2199.
On Discord, Furkan picked up @dlwh's evals-gap framing and pushed it toward biology in a Bio Evals thread, asking for the LLM eval setup for proteins, molecules and RNA; dlwh pointed at the notation-prediction work in #5213 and Furkan flagged broken references on the spot, offering codex-time to help finish the framework. In a long infra thread Ahmed M Ahmed walked Russell Power through the interactive-vs-batch v5p-32 preemption case while Russell already had #5240 in flight enabling exact-slice preemption. In data-mixing, dlwh sharpened the swarm-vs-eval debate with "i am very against only picking datasets that correlate with downstream evals, but i am very much in favor of including datasets that correlate," teeing up @Helw150's IRT-decomposed extension of Calvin's functional form later that week.
Eleven new members arrived self-describing as a DeepMind research scientist on LLM-for-coding and agentic RL, an EPFL/pleias AI scientist on HPC and synthetic-data pipelines, a Göttingen Data-Infrastructure HPC + LLM-pretraining engineer, a mathlib-contributing geometric-group-theory PhD, an Apart Research director moving from physics into capability evaluations, a Futarchy Labs founder building forecasting markets, plus arrivals on ranking/retrieval, scaling laws, and end-of-life VLMs. Anjiang Wei's agentic-RL / code-RL work intersects with the SWE-ZERO and agent-trace BPB threads #4963 and #3093; Kelvin Santos (kas) proposed a forecasting tool to predict loss ahead of training runs — a fit for the preregistration machinery around #4697 and #4447. Among silent joiners, André Martins (IST Lisbon / Unbabel) brings context on sparse-attention and entmax-style routing, relevant as the agent-driven MoE sweep #5184 probes routing and lm-head behavior.
Reading skewed toward data quality and the engineering of agentic SFT / RL — AllenAI's olmpool, the SWE-chat and AgentTrove/TaskTrove rollouts, Tilde's Nitrobrew distillation post, the Megatron-Core MoE systems paper, and Rishabh Agarwal's ICLR RL-scaling talk flagged for trainer/inference numerics mismatches on MoEs.
Compute this week was almost entirely the preregistered 1e23 MoE: three attempts at moe_1e23_d5120_bs2048_ep8_ragged_48l #4697 on 1024 v4 chips burned 319k chip-hours — 94.5% of the week's TPU spend and 96.5% of HW FLOPs — and all three crashed. The original target from the isoflop fit at #4447 was 2.25 paloma macro; best result so far is resume45207_clip15 at train_loss 2.1265 / paloma macro 2.4986 / uncheatable macro 2.169 on 487B tokens (16.4% MFU), with the longer-running rayuvtpu_20260417 trailing at 2.5394 macro / 2.2024 uncheatable on 379B tokens after 240 hours.
The crashes were the story. @ClassicLarry spent the week bisecting a multi-host TPU resume failure #5319 that consistently halted ~one save-step after restart with the error "An unexpected peer shows up in the launch group." After ruling out asymmetric pre_jit transfers, XLA non-determinism, _last_temporary_checkpoint divergence, OCDBT/tensorstore reuse, and load_path==save_path, the culprit was finally pinned to wandb.init(resume="allow") on worker 0: the HTTPS resume-fetch creates per-worker host-side state divergence (sockets, asyncio, libtpu internals) that surfaces as a TPU launch-id mismatch at the first post-resume collective. WANDB_MODE=offline on all workers ran cleanly past the bug point; the validated fix is fresh-id-per-attempt with resume="never". @dlwh's RCA on the production resume confirmed the same signature, and #5386 documents a separate operational miss — sparse permanent checkpoints meant recovery rolled back ~8k steps to step-50000, the only confirmed durable fallback.
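A minimal sketch of the fix shape (the run-id scheme and grouping below are illustrative): only worker 0 touches W&B, and every attempt starts a fresh run rather than resuming over HTTPS:

```python
# Minimal sketch of the fix shape validated in #5319: worker 0 only, fresh run id per
# attempt, resume="never", so no worker performs the HTTPS resume-fetch that was
# desynchronizing host state before the first post-resume collective. The id/group
# naming below is illustrative.
import uuid

import jax
import wandb


def init_tracker(run_group: str):
    if jax.process_index() != 0:
        return None
    return wandb.init(
        id=f"{run_group}-{uuid.uuid4().hex[:8]}",  # fresh id for every attempt
        group=run_group,                           # attempts stay grouped in the UI
        resume="never",
    )
```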
The other story was @XenonMolecule's WARC-budget scaling sweep for #2351, which placed seven of the top fifteen runs. At the 1B-param / 9e20-FLOP isoflop point on 150B tokens, resiliparse extraction edged the LLM-curated pipeline on paloma macro: resiliparse_1000-d1536 finished at 3.0496 vs llm_curated_1000-d1536 at 3.0634 (uncheatable 2.8484 vs 2.8783). Pushing to 2.9B params on resiliparse FineMath at the same 9e20 budget — resiliparse-expFM-d2432 — knocked paloma macro down to 2.971 / uncheatable 2.7577. The 2e21 FLOP follow-up at d2432 (curation-llm_curated_bos_fixed-expFM-2e21) is still running at 3.1871 paloma macro / 48.4% MFU on 79B tokens of 16-chip v5.
Two more notable runs round out the board. @MooJinKim's e3956np_m32b_kimi_50k SFT of Marin-32B-base on TerminalCorpus #4760 is running on 32 v5 chips at train_loss 0.33 / 45.2% MFU — but @AlienKevin's parallel TB2 evaluation of an earlier checkpoint of the same SFT recipe found the model produces degenerate zorazora... repetition that solves 0/76 tasks regardless of decoding parameters, versus 17.2% for the Qwen3-32B SFT on the identical pipeline; at eval time that looked like broken fine-tuning, since traced to the EOS save/serve mismatch described above. @Calvin-Xu's baseline_stratified data-mix-optimization swarm anchor #2345 finished at 8.3% MFU, providing the validation-loss anchor for the joint mixture-and-scale regression Calvin posted on #2404 — the optimum found at 60M/1.2B holds its perplexity advantage at every probed scale.
| Run | User | Hardware(?) | Hours(?) | FLOP Budget(?) | Loss | BPB(?) |
|---|---|---|---|---|---|---|
| #4697 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_resume45207_clip15_20260427 | David Leo Wright Hall | TPU v4 (1024 chips) | 4.1h | 5.08e22 model / 3.15e23 HW (16%) | — | — |
| #4697 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_resume45207_clip15_20260427 | David Leo Wright Hall | TPU v4 (1024 chips) | 2.9d | 5.02e22 model / 3.07e23 HW (16%) | — | 0.779 |
| #4697 pre-reg moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_124933 | David Leo Wright Hall | TPU v4 (1024 chips) | 10.0d | 3.91e22 model / 2.38e23 HW (16%) | — | 0.791 |
| curation-llm_curated_1000-expWARC_natural-9e+20-d768-L8-B4096 | Michael Ryan | TPU v5 (128 chips) | 14.9h | 9.00e20 model / 3.66e21 HW (25%) | — | 1.118 |
| #2351 curation-llm_curated_1000-expWARC_natural-9e+20-d1536-L16-B1024 | Michael Ryan | TPU v5 (128 chips) | 11.9h | 9.00e20 model / 2.99e21 HW (30%) | — | 0.973 |
| #2351 curation-llm_curated_bos_fixed-expFM_natural-2e+21-d2432-L24-B512 | Michael Ryan | TPU v5 (16 chips) | 3.1d | 1.45e21 model / 2.99e21 HW (48%) | — | 1.027 |
| #2351 curation-resiliparse_1000-expWARC_natural-9e+20-d1536-L16-B1024 | Michael Ryan | TPU v5 (128 chips) | 11.9h | 9.00e20 model / 2.89e21 HW (31%) | — | 0.964 |
| #2351 curation-resiliparse-expFM_natural-9e+20-d2432-L24-B256 | Michael Ryan | TPU v5 (32 chips) | 1.8d | 9.00e20 model / 2.70e21 HW (33%) | — | 0.941 |
| #4760 e3956np_m32b_kimi_50k | Moo Jin Kim | TPU v5 (32 chips) | 2.4d | 1.20e21 model / 2.66e21 HW (45%) | — | 0.216 |
| curation-resiliparse_1000-expWARC_natural-9e+20-d768-L8-B4096 | Michael Ryan | TPU v5 (128 chips) | 15.0h | 9.00e20 model / 2.58e21 HW (35%) | — | 1.111 |
| curation-resiliparse_1000-expWARC_natural-9e+20-d512-L6-B4096 | Michael Ryan | TPU v5 (64 chips) | 1.1d | 6.54e20 model / 2.37e21 HW (28%) | — | 1.246 |
| pinlin_calvin_xu/data_mixture/ngd3d~5b98e67a/baseline_stratified | Calvin Xu | TPU v5 (32 chips) | 15.2h | 1.88e20 model / 2.26e21 HW (8%) | — | 0.953 |
| curation-resiliparse_2000-expWARC_natural-9e+20-d512-L6-B4096 | Michael Ryan | TPU v5 (64 chips) | 23.2h | 5.95e20 model / 2.16e21 HW (28%) | — | 1.235 |
| curation-resiliparse_2000-expWARC_natural-9e+20-d1280-L13-B1024 | Michael Ryan | TPU v5 (64 chips) | 22.6h | 8.79e20 model / 2.09e21 HW (42%) | — | 0.996 |
| curation-llm_curated_2000-expWARC_natural-9e+20-d1280-L13-B1024 | Michael Ryan | TPU v5 (64 chips) | 1.4d | 6.44e20 model / 2.07e21 HW (31%) | — | 1.064 |
16 comments on 13 threads