Week of March 30th summary for marin-community/marin

Milestone: Kick off a 32B-A4B 10T-token MoE training run, advance scaling-laws work, and get ~15T+ tokens ready
88 PRs merged · 24 opened · 36 issues closed · 16 contributors · 2 epics · 268 comments this week
W&B: 3.37e23 HW FLOPs this week (1.10e23 model FLOPs)

The 1e22 MoE run launched on v4-512 while capacity-factor and LR-decay ablations refined the recipe, and the Delphi 1e23 dense scaling ladder continued on v4-1024 at 30% MFU. Iris performance work cut controller lock hold time from 80ms to under 5ms and delivered 4x faster heartbeat batching. Agentic SFT v2 runs fixed critical gradient-clipping and think-token issues, reaching 13% SWE-bench, in line with the released OT-Agent 32K's 14%.

#3096 Pre-training: MoE Scaling Laws


2/6 sub-issues closed
@ClassicLarry launched the 1e22 MoE run #3800, a 34.6B-total / 5.1B-active model on v4-512 with capacity factor 1.0, running at 488k tok/s and 23% MFU (W&B). The capacity factor was set to 1.0 based on ablation results #4016 showing cap=1.0 is 8.3% faster than cap=1.25 with negligible loss impact at 1e20 scale. The v7 isoflop sweep #4225 mapped LR-schedule and decay interactions with AdamH across model dimensions, finding that removing linear decay at small step counts (d2048 @ 1e18) improved BPB by 0.03, which suggests the decay fraction needs to scale with training length. @Helw150 resolved the QB overhead problem #3972: the async CPU-overlap approach turned out to be a dead end (it had looked promising only due to a misleading benchmark), but sharded microbatch QB eliminated the 1.2 MFU-point overhead entirely. He also updated the MoE baseline #4084 and validated moe-sharded-qb-gn-xsa as the best iter-04 configuration, with a Vizier hyperparameter sweep running on us-east5-a #2167. On the Delphi dense scaling ladder #1337, the 1e23 run on v4-1024 continued at 30% MFU with Paloma macro BPB 0.79, and the 1e22 seed42 run finished at 47% MFU (macro BPB 0.84). @dlwh landed XLA-first Mamba-3 SISO and MIMO TPU kernels #3961 (+5,352 lines) with a sharding-safe ranked public API #4149, and @msclar merged the AdaMuon optimizer implementation #3300.
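The reported throughput and MFU are roughly self-consistent. Here is a back-of-envelope check, assuming the standard ~6·N_active FLOPs/token estimate for forward+backward and 275 bf16 TFLOP/s peak per TPU v4 chip (a v4-512 slice has 256 chips); these constants are assumptions, not marin's own accounting:

```python
# Sanity-check the 1e22 MoE run's reported 23% MFU from its throughput.
active_params = 5.1e9      # active parameters per token
tokens_per_s = 488e3       # reported throughput on v4-512
chips, peak = 256, 275e12  # chips in a v4-512; assumed per-chip bf16 peak FLOP/s

model_flops_per_s = 6 * active_params * tokens_per_s  # ~1.5e16 FLOP/s
mfu = model_flops_per_s / (chips * peak)
print(f"MFU ~= {mfu:.1%}")  # ~21%, close to the reported 23%
```

The small residual is expected, since the 6N rule ignores attention and router FLOPs.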
1 PR this week, 9 new comments, and 0 new issues (6 total)

#2836 Infrastructure: MoE Training Support


Summary: Train a 50B MoE model reliably on GPU hardware, from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks the infrastructure, data-pipeline, and training work needed to get there by March 31.

35/49 sub-issues closed
@rjpower drove a major Iris controller performance push: lock hold time in drain_dispatch_all dropped from 80ms to under 5ms #4222, two-pass heartbeat batching delivered 4x faster provider loops #4210, lightweight job-state polling reduced controller load #4209, the ORM query builder was replaced with raw SQL #4181, and log fetching was unified under FetchLogs with LIKE patterns #4202 (-598 lines). Checkpoint management gained zstd compression and old-checkpoint pruning #4143. @ravwojdyla shipped the actor proxy service for external access to cluster actors #4126, refactored Zephyr chunking for improved shuffle scalability #3839 (+1,357/-539 lines), and allowed specifying coordinator resources #4095. User-defined counters (MapReduce-style per-job stats) were added #4085, with records_in/records_out counters for readers and writers #4189 and per-worker counter queries #4164. @yonromai fixed a TensorStore handle leak in cache-copy (~14 MiB/shard) #4198, fixed stale coordinators killing new workers on retry #4199, and added Slack alerts and Claude triage to the canary ferry #4158 #4177. @dlwh landed the region-aware executor on Iris #3824 (+1,075 lines). The integration test suite was redesigned #4009, and optional auth mode landed for gradual adoption #3937.
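The lock-hold and heartbeat numbers follow a common concurrency pattern: take the lock only long enough to snapshot pending work, then do the slow per-item processing with the lock released. A minimal sketch of that two-pass shape; this is illustrative, not Iris's actual code, and Dispatcher/enqueue/_dispatch are hypothetical names:

```python
import threading

class Dispatcher:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []

    def enqueue(self, item):
        with self._lock:
            self._pending.append(item)

    def drain_dispatch_all(self):
        # Pass 1: O(1) swap under the lock, so hold time stays tiny.
        with self._lock:
            batch, self._pending = self._pending, []
        # Pass 2: slow per-item work (RPCs, DB writes) runs lock-free.
        for item in batch:
            self._dispatch(item)

    def _dispatch(self, item):
        ...  # hypothetical per-item RPC / DB write
```

Hold time then scales with the list swap rather than with per-item latency, which is consistent with the 80ms-to-under-5ms drop.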
0 PRs this week and 0 new issues (49 total)

Other Changes


@dlwh drafted a staged modeling experiment skill #4166 to make modeling experiments Grug-first with W&B view tagging and optional stage gates.
112 PRs this week, 205 new comments, and 36 issues closed (36 total)

External Contributions


Chloe Chia (San Francisco): 1 PR, 2 comments

2 comments on 2 threads
  • #4297 Add GPU Triton kernel for ragged_dot MoE grouped matmul
  • #2828 Port MoE training to GPU: kernel experiments and performance validation

eramis73: 2 PRs

Rohith Kuditipudi (CS PhD @ Stanford): 1 PR, 1 comment

1 comment on 1 thread
  • #4389 Identify a soft proxy for agentic benchmarks to support data-mixture studies

Gavin Yang (NY, USA; first-year CS PhD student at Northeastern University, B.S. in CS & DS from NYU): 1 PR, 1 comment

1 comment on 1 thread
  • #2185 speedrun submission: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale

Top 15 runs by FLOPs this week (completed, running, or crashed)


The largest active run is the Delphi 1e23 dense scaling ladder (W&B), a 25B-parameter model on v4-1024 now at 608B tokens with Paloma macro BPB 0.79 and train loss 2.08; @Helw150 noted the loss trend during cooldown is tracking between the pessimistic and optimistic forecasts #1337. The 1e22 MoE v7 run (W&B) launched mid-week on v4-512 at 23% MFU, with an estimated 7.7 days to completion #3800. Two finished v7 isoflop runs at 1e20 validated capacity factor 1.0 as 8.3% faster with near-identical loss #4016. On the post-training side, @AlienKevin's 32K v2 OT-Agent SFT (W&B) reached 13% SWE-bench, matching the released model's 14% #3896, while the 131K v2 run on v5p-256 (W&B) revealed that batch-size-scaled gradient clipping is needed at 131K context #3897. Three Qwen3-14B resilience SFT runs for math and medical domains completed on v5p-16 at 56-57% MFU.
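One plausible reading of the 131K finding is that a clip threshold tuned at shorter context clips relatively harder as the tokens contributing to each optimizer step grow. A minimal sketch of batch-size-scaled clipping in optax; this is an assumption about the shape of the fix, not the actual #3897 patch, and BASE_CLIP / BASE_TOKENS are illustrative constants:

```python
import optax

BASE_CLIP = 1.0             # clip norm tuned at the reference setting (hypothetical)
BASE_TOKENS = 256 * 32_768  # reference tokens per optimizer step (hypothetical)

def make_clipper(tokens_per_step: int) -> optax.GradientTransformation:
    # Scale the global-norm threshold with tokens per step, so a 131K-context
    # batch is not clipped against a constant tuned for 32K sequences.
    return optax.clip_by_global_norm(BASE_CLIP * tokens_per_step / BASE_TOKENS)
```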
| Run | User | Hardware | Time | Model FLOPs | HW FLOPs (MFU) | BPB |
|---|---|---|---|---|---|---|
| #1337 adamh-scaling-ladder-nemotron-optimal-1e+23-v5-27f2fb | Will Held | TPU v4 (512 chips) | 22.0d | 8.16e22 | 2.64e23 (31%) | 0.796 |
| #1337 adamh-scaling-ladder-nemotron-optimal-1e+22-v5-seed42-deeff4 | Will Held | TPU v4 (256 chips) | 3.4d | 1.00e22 | 2.12e22 (47%) | 0.769 |
| exp3490b_sft_nemotron_terminal_corpus_full_qwen3_8b_32768tokens_-3da6c1 | Kevin Li | TPU v5 (32 chips) | 5.0d | 2.49e21 | 6.08e21 (41%) | — |
| moe-d2304-1e21 | Larry Dial | TPU v4 (128 chips) | 2.1d | 1.00e21 | 5.59e21 (18%) | 0.823 |
| exp2262pt3h_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-f3ec95 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.124 |
| exp2262pt3i_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-eeb1d6 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.181 |
| exp2262pt3h_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_4b_32768tok-007999 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.085 |
| exp2262pt3l_240k_pt3_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-5eb7b0 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.167 |
| exp2262pt3i_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_32b_32768to-84f4d3 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.243 |
| exp2262pt3l_240k_pt4_qwen3_1pt7b_base_ot4_math_qwen3_30b_a3b_thi-77d4d8 | Moo Jin Kim | TPU v4 (128 chips) | 1.2d | 1.30e21 | 3.73e21 (35%) | 0.209 |
| exp3897_sft_ota_131k_qwen3_8b_131072tokens_v5p256-f7d21a | Kevin Li | TPU v5 (128 chips) | 16.1h | 1.16e21 | 3.16e21 (37%) | — |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-seed42-e251d0 | Will Held | TPU v4 (64 chips) | 1.3d | 1.00e21 | 2.05e21 (49%) | 0.844 |
| adamh-scaling-ladder-nemotron-optimal-1e+21-v5-seed62746-659a1b | Will Held | TPU v4 (64 chips) | 1.3d | 1.00e21 | 1.89e21 (53%) | 0.845 |
| isoflop-moe-v2-1e+20-d1536-bs128 | Larry Dial | TPU v4 (64 chips) | 12.5h | 1.00e20 | 5.73e20 (17%) | 0.918 |
| #2167 isoflop-moe-adamh-gatednorm-v5p64-r2-1e20-d1536-retry25 | Kaiyue Wen | TPU v5 (32 chips) | 11.7h | 9.36e19 | 5.13e20 (18%) | 0.913 |