Week of March 23rd summary for marin-community/marin
119 PRs merged
59 PRs opened
75 issues closed
15 contributors
4 epics
282 comments this week
GCP TPU 1.02e24 HW FLOPs (0 reserved)
W&B 3.77e23 HW FLOPs (1.18e23 model FLOPs)
web 7.7T (57.1%)
multilingual 3.9T (28.4%)
code 1.3T (9.3%)
math 377.1B (2.8%)
specialized 330.0B (2.4%)
The 1e22 MoE run launched on v4-512 while capacity-factor and LR-decay ablations refined the recipe, and the Delphi 1e23 dense scaling ladder continued on v4-1024 at 30% MFU. Iris performance work cut controller lock hold time from 80ms to under 5ms and delivered 4x faster heartbeat batching. Agentic SFT v2 runs fixed critical gradient clipping and think-token issues, matching released OT-Agent 32K benchmarks at 13% SWE-bench.
#3192
Synthetic Data & Post-training
The agentic SFT reproduction made significant progress this week. @AlienKevin identified four root causes in the v1 32K run #3896, including catastrophic overfitting (loss dropping to 0.00003) caused by max_grad_norm=1.0 instead of OT-Agent's 1e-4, and generation of <|start_think|> instead of native <think> tokens. The v2 32K SFT run with these fixes reached 13% SWE-bench (matching the released 14%) and 7.9% TB2 (matching the released ~8.1%), though TB-Lite averaged 12% vs the released 18%. For 131K context #3897, the v2 run on v5p-256 with max_grad_norm=1e-4 regressed to 15% SWE-bench (from v1's 30%) because 1e-4 clipping with only 248 steps at batch=128 was too aggressive: gradients were clipped by ~1300x every step. A v2a run with sqrt-scaled hyperparams (LR × sqrt(8), grad_norm × 8) showed much healthier loss curves before preemption. @AlienKevin also reproduced NemotronTerminal #3490, reaching 15.9% on TB2 (vs released 13.0%) and 29.0% TB-Lite (vs released 23.0%). @eramis73 merged DspyEvaluator improvements with ToonAdapter support #4213, and @taivu1998 drafted a long-context evaluation lane for the exp2062 long-context plan #4249.
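For intuition, here is a minimal sketch of that sqrt-scaling adjustment. The base values are hypothetical; only the ×sqrt(8) LR factor and ×8 grad-norm factor come from the run notes.

```python
import math

# Hedged sketch of the v2a-style hyperparameter rescaling: when the effective
# batch grows by `batch_ratio`, scale the learning rate by sqrt(batch_ratio)
# and relax the grad-norm clip linearly, so per-step clipping stays comparable.
# The base values below are illustrative, not the run's actual config.
def rescale_for_batch(base_lr: float, base_grad_norm: float, batch_ratio: float):
    return base_lr * math.sqrt(batch_ratio), base_grad_norm * batch_ratio

lr, max_grad_norm = rescale_for_batch(base_lr=1e-5, base_grad_norm=1e-4, batch_ratio=8)
print(f"lr={lr:.2e}, max_grad_norm={max_grad_norm:.2e}")  # lr=2.83e-05, max_grad_norm=8.00e-04
```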
0 PRs this week, and 0 new issues (4 total)
Activity
#2956
[Agentic SFT] SFT Qwen3-8B on 5K SWE-smith trajectories and show improvement on SWE-bench
#2905
[Agentic SFT] Generate 30K Coding Trajectories across 6 Languages
#3093
[Agentic SFT] Tracking SFT datasets
#2262
Experiment: OpenThoughts4 Teacher Model Comparison - Qwen3-32B vs. Qwen3-235B-A22B
13 autocategorized
#4249
[Eval] Add long-context evaluation lane for exp2062
+890 −2
@taivu1998
#3948
docs: collocated RL project doc and Tunix RL reference
+16948 −190
@ahmeda14960
#4213
Add DspyEvaluator improvements and ToonAdapter
+514 −34
@eramis73
#2759
Add multihost inference support for TPU
+29653 −132
@ahmeda14960
#4153
[levanter] HF checkpoint save overwrites eos_token_id from source model with tokenizer default
#4159
[levanter] Add generation_config.json support for chat model checkpoints
#3826
[levanter] Multi-host TPU inference for GRPO sampling
#3490
[Agentic SFT] Reproduce NemotronTerminal
#3896
[Agentic SFT] Reproduce OT-Agent's best 32K context length SFT
#3897
[Agentic SFT] Reproduce OT-Agent's best 131K context length SFT
#2062
[Epic] Long Context Plan
#1747
RL: Investigate slow copies off of TPU in training worker (~1GB/s)
#3792
Ray Multi-Host vLLM on TPU: Serve 235B+ Models Across Multiple Hosts
#3100
Data Sources for Pre-training
Summary: We will need 20T of high-quality tokens (in particular code) for our large MoE runs in Q2/Q3; this is the March work that will enable that.
The synthetic reasoning bootstrap corpus #4148 expanded rapidly with 24 comments this week from @dlwh, adding deterministic generators for BFS shortest path, binary search, coin change DP, connected components, Dijkstra, and more algorithmic reasoning domains, all producing step-by-step traces without LLM labels. @ravwojdyla shipped major data infrastructure work: fuzzy dedup was updated to scale to Nemotron #3750 (+465/-514 lines), and the new datakit tool consolidated and cleaned up data downloads #4142 (+878/-1,331 lines). FlatMixture was added for virtual dataset concatenation with global shuffle, without re-tokenizing #4133. @yonromai fixed a TensorStore handle leak that was accumulating ~14 MiB per shard during cache-copy #4198, and a shared ts.Transaction for metadata consolidation cut serial copy time #4105. @ahmeda14960 added generation_config.json support for chat model checkpoints #4160 and shipped data browser QOL updates #3954.
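To make the "deterministic traces without LLM labels" idea concrete, here is a minimal sketch in the spirit of #4148; the trace format is illustrative, and the actual corpus format may differ.

```python
# Hedged sketch of a deterministic reasoning-trace generator: a binary search
# whose every probe is emitted as a labeled step, no LLM in the loop.
def binary_search_trace(xs: list[int], target: int) -> list[str]:
    lo, hi, steps = 0, len(xs) - 1, []
    while lo <= hi:
        mid = (lo + hi) // 2
        steps.append(f"check index {mid}: xs[{mid}]={xs[mid]} vs target {target}")
        if xs[mid] == target:
            steps.append(f"found target at index {mid}")
            return steps
        if xs[mid] < target:
            lo = mid + 1
            steps.append(f"too small, search right half [{lo}, {hi}]")
        else:
            hi = mid - 1
            steps.append(f"too large, search left half [{lo}, {hi}]")
    steps.append("target not present")
    return steps

print("\n".join(binary_search_trace([1, 3, 5, 7, 9, 11], 7)))
```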
0 PRs this week, 24 new comments, and 1 new issue (5 total)
Activity
#3049
Test Luxical as a General Tool for Data Integration Pipelines
#3101
Ensure we have 20T deduped tokens of data
#3183
Software Heritage Foundation license
#3194
Gather code environments
#4148
Experiment: synthetic reasoning bootstrap corpus
27 autocategorized
#4219
Use hardcoded path for nemotron download to stabilize version hash
+7 −3
@rjpower
#4198
Fix TensorStore handle leak in cache-copy (~14 MiB/shard)
+45 −28
@yonromai
#4163
fix(fray): eliminate ray.kill() and __ray_terminate__ from actor shutdown
+79 −154
@rjpower
#4162
[dedup] Replace DupCounters with zephyr.counters for live monitoring
+138 −103
@ravwojdyla-agent
#4160
[levanter] Add generation_config.json support for chat model checkpoints
+365 −4
@ahmeda14960
#4142
Bootstrap `datakit` - consolidate and cleanup downloads
+878 −1331
@ravwojdyla-agent
#4133
Add FlatMixture: virtual dataset concatenation with global shuffle
+154 −3
@claude
#4105
Use shared ts.Transaction for metadata consolidation
+157 −60
@yonromai
#3999
chore: update dupekit wheels
+29 −10
@github-actions
#3955
chore: update dupekit wheels
+1 −1
@github-actions
#3954
QOL updates for data browser
+484 −87
@ahmeda14960
#3938
[dedup] Replace side-effect mutation with reduce in fuzzy dedup map
+9 −11
@rjpower
#3750
Update fuzzy dedup to scale to nemotron
+465 −514
@ravwojdyla
#4247
Rebase dna branch onto main
+89848 −240257
@eric-czech
#4214
Skip tokenizer tests that require HuggingFace downloads in CI
+22 −31
@rjpower
#4188
datakit: add normalize step (download → standard Parquet)
+849 −1
@ravwojdyla-agent
#4184
Fix fasttext classifier for distributed/S3 environments
+22 −28
@claude
#4171
[download] Replace HfFileSystem.find() with HfApi.list_repo_files() to fix silent truncation
+42 −5
@claude
#4077
Remove unused modules from download and transform
+0 −1122
@ravwojdyla
#4154
[levanter] Preserve source model token IDs in HF checkpoint saves
+23 −0
@ahmeda14960
#4100
Serial metadata copy in consolidate_shard_caches takes hours for large tokenization jobs
#4104
Experiment: Ising tokenizer rollouts and critical interpolation
#4132
FlatMixture: virtual dataset concatenation without re-tokenizing
#4170
HfFileSystem.find() silently truncates file lists on large repos, causing incomplete downloads
#4204
changes to get_nemotron_split_paths causes downstream steps to rerun
#2383
Initial VLM Training Experiment
#2345
Swarm Runs Proposal
#3096
Pre-training: MoE Scaling Laws
@ClassicLarry launched the 1e22 MoE run #3800: a 34.6B-total / 5.1B-active model on v4-512 with capacity factor 1.0, running at 488k tok/s and 23% MFU (W&B). The capacity factor was set to 1.0 based on ablation results #4016 showing cap=1.0 is 8.3% faster than 1.25 with negligible loss impact at 1e20 scale. The v7 isoflop sweep #4225 mapped LR schedule and decay interactions with AdamH across dimensions, finding that removing linear decay at small step counts (d2048 @ 1e18) improved BPB by 0.03, suggesting the decay fraction needs to scale with training length. @Helw150 resolved the QB overhead problem #3972: the async CPU overlap approach was a dead end (a misleading benchmark), but sharded microbatch QB eliminated the 1.2 MFU-point overhead entirely. He also updated the MoE baseline #4084 and validated moe-sharded-qb-gn-xsa as the best iter-04 configuration, with a Vizier hyperparameter sweep running on us-east5-a #2167. On the Delphi dense scaling ladder #1337, the 1e23 run on v4-1024 continued at 30% MFU with Paloma macro BPB 0.79, and the 1e22 seed42 run finished at 47% MFU (macro BPB 0.84). @dlwh landed XLA-first Mamba-3 SISO and MIMO TPU kernels #3961 (+5,352 lines) with a sharding-safe ranked public API #4149, and @msclar merged the AdaMuon optimizer implementation #3300.
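For context on the capacity-factor result, here is the standard expert-capacity computation: a hedged sketch with illustrative numbers, since the Grug MoE internals may differ.

```python
import math

# Standard top-k MoE expert capacity: each expert gets a fixed-size buffer,
# and tokens routed past it overflow (dropped or re-routed). A lower
# capacity_factor means smaller padded buffers, hence faster steps; that is
# the trade-off measured in #4016. Numbers below are illustrative.
def expert_capacity(tokens_per_batch: int, num_experts: int,
                    top_k: int, capacity_factor: float) -> int:
    return math.ceil(capacity_factor * tokens_per_batch * top_k / num_experts)

for cf in (1.0, 1.25):
    cap = expert_capacity(tokens_per_batch=4096, num_experts=64, top_k=2, capacity_factor=cf)
    print(f"capacity_factor={cf}: {cap} slots/expert")
# cf=1.0 -> 128 slots/expert; cf=1.25 -> 160, i.e. 25% more buffer to pad and move.
```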
0 PRs this week, 13 new comments, and 1 new issue (6 total)
Activity
#2371
Grug MoE
#2167
Add a version of isoflop_sweep for MoE's
#3182
Determine optimal scaling parameters for MoE
#2828
Port MoE training to GPU: kernel experiments and performance validation
#3800
Test MoE Arch at 1e21 and 1e22 Flop Scales
#4012
[moe] Experiment: compute-optimal 32B-A4B (~1e22 FLOPs) on TPU
104 autocategorized
#4149
[mamba3] Make ranked public API sharding-safe
+214 −196
@dlwh
#4130
[levanter] Share Pallas autotune helpers and restore compile offload
+437 −183
@dlwh
#4118
Bring back `profiler` to `GrugMoeLaunchConfig`
+2 −1
@ravwojdyla
#4114
Use `run_grug_moe_trial` instead of `run_grug_moe`
+2 −2
@ravwojdyla
#3961
[Levanter] Add XLA-first Mamba-3 SISO and MIMO TPU kernels
+5352 −0
@dlwh
#3300
AdaMuon implementation
+525 −0
@msclar
#4244
[levanter] Add attention instability metrics to pretraining
+415 −0
@claude
#4243
[levanter] Add NaN/Inf/spike counters to trainer metrics
+230 −7
@claude
#4242
[levanter] Add gradient zero-count and zero-fraction metrics
+113 −5
@claude
#4241
[levanter] Add rank-level straggler reporting callback
+326 −0
@claude
#4240
[levanter] Add rank-level straggler reporting callback
+265 −0
@claude
#4234
[Grug] Upcast MoE router logits to fp32
+2 −1
@dlwh
#4233
[grug] Add activation logging on heavy watch steps
+460 −34
@claude
#4092
[grug] Add hybrid Mamba-3 sweep and split-LR ladder
+15013 −46
@dlwh
#4075
[moe] Add great 10T expert count sweep (E in {128,256,512})
+234 −0
@claude
#4074
[moe] Add great 10T K sweep in {1,2,4,8}
+231 −0
@claude
#4073
[moe] Add multi-budget SWA ablation for great 10T gate
+246 −1
@claude
#4072
[moe] Add great 10T gated norm ablation experiment
+510 −2
@claude
#4071
[moe] Add great 10T inv-sqrt LR schedule ablation
+289 −0
@claude
#4069
[moe] Add multi-scale AdamH vs Adam isoflop experiment
+664 −0
@claude
#4068
[moe] Add great 10T first-k-dense isoflop ablation sweep
+264 −0
@claude
#4067
[moe] Add K sweep experiment (1,2,4,8) for good 10T gate
+273 −0
@claude
#4066
[moe] Add attention gate support to grug MoE model
+177 −0
@claude
#4065
[moe] Add Great 10T router z-loss ablation experiment
+282 −0
@claude
#4064
[moe] Add headwise attention gate and ablation launch script
+190 −2
@claude
#4063
[moe] Add sigmoid gating option for great-gate 10T comparison
+166 −1
@claude
#4062
[moe] Add multi-budget shared expert ablation for great 10T gate
+247 −0
@claude
#4061
[moe] Add residual bottleneck variant for 10T gate experiments
+1278 −0
@claude
#4060
[moe] Add good-10T E=128 isoflop experiment script
+200 −0
@claude
#4059
[moe] Add AdamH vs Adam comparison experiment at 1e19 FLOPs
+325 −0
@claude
#4058
[moe] Add num_dense_layers to Grug MoE for first-k-dense ablation
+151 −4
@claude
#4057
[moe] Add gated norm support and ablation launch script
+225 −2
@claude
#4056
[moe] Make capacity_factor configurable and add sweep script
+144 −1
@claude
#4055
[moe] Add sliding_window to GrugModelConfig and SWA ablation experiment
+186 −1
@claude
#4054
[moe] Add 3e18 FLOP hparam sweep for MoE recipe search
+297 −0
@claude
#4053
[moe] Add Muon throughput experiment for grug MoE
+88 −0
@claude
#4052
[moe] Wire capacity overflow reporting into Grug MoE training metrics
+39 −1
@claude
#4051
[moe] Add shared-expert ablation experiment at ~1e19 FLOPs
+194 −0
@claude
#4050
[moe] Add inv-sqrt LR schedule experiment for Good 10T gate
+68 −0
@claude
#4049
[moe] Add expert count sweep for E in {128,256,512}
+129 −0
@claude
#3978
Fix fused CE autotune under shard_map
+227 −16
@dlwh
#2185
speedrun submission: Add llama_50m_muon_1x - Muon optimizer at 1× Chinchilla scale
+221 −0
@redagavin
#4119
Trigger levanter and marin tests for grug changes
+1 −0
@ravwojdyla
#4084
Updated MoE Baseline
+441 −139
@Helw150
#3661
[levanter] Add tree memory-kind helper and HBM guide usage
+47 −2
@dlwh
#4083
[optim] Add PolynomialLrSchedule and InvSqrtDecayLrSchedule
+179 −0
@claude
#3290
Default trainer meshes to explicit axis types with explicit-sharding fixes
+110 −38
@dlwh
#3314
main: default train/eval/lora/viz to array-first Grug datasets
+143 −84
@dlwh
#4110
Experiment: MuonHR vs MuonH at 3e18 FLOPs on Nemotron
#4122
[grug] Pallas TPU load-balancing loss kernel crashes on v5p-8: Mosaic auto-partition not supported
#4129
[levanter] Extract shared Pallas autotune sharding helpers
#4131
lint is failing in main
#4225
Experiment: Map out LR schedule and tuned value interactions with AdamH
#4232
[grug] Add activation logging on heavy watch steps
#4235
[levanter] Add rank-level straggler reporting to training
#4236
[levanter] Add rank-level straggler reporting to training
#4237
[levanter] Add gradient zero-count and zero-fraction metrics
#4238
[levanter] Add NaN/Inf/spike counters to trainer metrics
#4239
[levanter] Add attention-instability metrics to pretraining
#4251
Experiment: Bolinas DNA transferred scaling sweep
#3930
Experiment: direct compare es3r2 vs Qwen3 30B-A3B / Qwen3.5 35B-A3B full-attention baselines
#3964
Experiment: AdamHR vs AdamH at 3e18 FLOPs on Nemotron
#3972
Make QB Fast Enough to Use
#3993
Compare finished 1e20 d1536 runs in dial_moe
#4013
[moe] Good 10T gate for #3469
#4014
[moe] Great 10T gate for #3469
#4015
[moe] Good 10T: E=128
#4016
[moe] Good 10T: measure capacity overflow
#4017
[moe] Good 10T: sweep capacity factor
#4018
[moe] Good 10T: MoE hparams at 3e18 FLOPs
#4019
[moe] Good 10T: scaling hparams for MoE
#4020
[moe] Good 10T: ablate attention gate
#4021
[moe] Good 10T: ablate shared expert
#4022
[moe] Good 10T: ablate first-k dense
#4023
[moe] Good 10T: compare QB vs sign+aux
#4024
[moe] Good 10T: compare AdamH vs Adam
#4025
[moe] Good 10T: ablate exclusive self-attention
#4026
[moe] Good 10T: ablate gated norms
#4027
[moe] Good 10T: ablate sliding-window attention
#4028
[moe] Good 10T: inv-sqrt LR schedule
#4029
[moe] Good 10T: sweep K in {1,2,4,8}
#4030
[moe] Good 10T: sweep E in {128,256,512}
#4031
[moe] Great 10T: latent MoE perf
#4032
[moe] Great 10T: latent MoE loss
#4033
[moe] Great 10T: Muon perf on MoE
#4034
[moe] Great 10T: Muon loss on MoE
#4035
[moe] Great 10T: residual bottleneck experiments
#4036
[moe] Great 10T: router z-loss
#4037
[moe] Great 10T: sigmoid vs softmax gating
#4038
[moe] Great 10T: ablate attention gate
#4039
[moe] Great 10T: ablate shared expert
#4040
[moe] Great 10T: ablate first-k dense
#4041
[moe] Great 10T: compare QB vs sign+aux
#4042
[moe] Great 10T: compare AdamH vs Adam
#4043
[moe] Great 10T: ablate exclusive self-attention
#4044
[moe] Great 10T: ablate gated norms
#4045
[moe] Great 10T: ablate sliding-window attention
#4046
[moe] Great 10T: inv-sqrt LR schedule
#4047
[moe] Great 10T: sweep K in {1,2,4,8}
#4048
[moe] Great 10T: sweep E in {128,256,512}
#1337
Delphi: Create a modern scaling suite ("modernized Pythia")
#3868
Experiment: ship Mamba-3 XLA SISO/MIMO TPU kernel
#4147
Summarize current scaling heuristics for run planning
#3660
Add tree memory-kind helper and document HBM offload usage
#2836
Infrastructure: MoE Training Support
Summary: Train a 50B MoE model on GPU hardware reliably, from data preparation through sustained multi-node training with automatic fault recovery. This epic tracks all the infrastructure, data pipeline, and training work needed to get there by March 31.
@rjpower drove a major Iris controller performance push: lock hold time in drain_dispatch_all dropped from 80ms to under 5ms #4222, a two-pass heartbeat batch delivered 4x faster provider loops #4210, lightweight job state polling reduced controller load #4209, the ORM query builder was replaced with raw SQL #4181, and log fetching was unified under FetchLogs with LIKE patterns #4202 (-598 lines). Checkpoint management got zstd compression and old-checkpoint pruning #4143. @ravwojdyla shipped the actor proxy service for external access to cluster actors #4126, refactored Zephyr chunking for improved shuffle scalability #3839 (+1,357/-539 lines), and allowed specifying coordinator resources #4095. User-defined counters (MapReduce-style per-job stats) were added #4085, with records_in/records_out counters for readers and writers #4189 and per-worker counter queries #4164. @yonromai fixed a TensorStore handle leak in cache-copy (~14 MiB/shard) #4198, fixed stale coordinators killing new workers on retry #4199, and added Slack alerts and Claude triage to the canary ferry #4158 #4177. @dlwh landed the region-aware executor on Iris #3824 (+1,075 lines). The integration test suite was redesigned #4009, and optional auth mode landed for gradual adoption #3937.
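As a rough illustration of the MapReduce-style counters in #4085/#4189: a minimal sketch whose API names are assumptions, not the actual zephyr interface.

```python
from collections import Counter

# Hedged sketch of per-job user-defined counters (#4085). In the real system,
# each worker accumulates locally and totals are merged by the coordinator
# (shipped with heartbeats / report_result); this only shows the shape.
class JobCounters:
    def __init__(self) -> None:
        self._local: Counter[str] = Counter()

    def inc(self, name: str, amount: int = 1) -> None:
        self._local[name] += amount

    def merge(self, other: "JobCounters") -> None:
        self._local.update(other._local)  # coordinator-side aggregation

    def snapshot(self) -> dict[str, int]:
        return dict(self._local)

reader, writer = JobCounters(), JobCounters()
for _ in range(3):
    reader.inc("records_in")
    writer.inc("records_out")
reader.merge(writer)
print(reader.snapshot())  # {'records_in': 3, 'records_out': 3}
```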
0 PRs this week, and 0 new issues (49 total)
Activity
#2822
Iris: Implement CoreWeave platform
#2823
Iris: Improve worker/process status visibility and post-mortem log access
#2824
Iris: Multi-region support with per-scaling-group environment configuration
#2825
Iris: Quota-aware scheduling and cross-zone fallback
#2826
Iris: Richer profiling and worker-level observability
#3699
[iris] Add task-level profiling to CLI
#2827
Iris: Proactive unhealthy/degraded node identification
#2829
Data processing pipeline: validate end-to-end tokenization for all target datasets
#3004
Corrupted Levanter cache
#2927
Zephyr coordinator death causes unrecoverable pipeline failure
#2943
Zephyr coordinator stuck with workers
#2982
Replace Zephyr shared data broadcast with disk_cache-based approach
#3066
Handle GCP auth/credentials error
#2830
Training monitoring: alerting on stalled/diverging loss and health dashboard
#2929
Canary regression gate: alert on metric thresholds after canary ferry
#2930
Live monitoring and stall detection for training jobs
#2831
Validate fault tolerance: checkpoint resume and preemption recovery on CoreWeave
#2832
Agent can run a small model E2E without human intervention
#2833
Establish daily canary training runs
#2834
Executor v2: split out caching module and simplify step API
#2835
Standardize on Vortex format with typed dataset schemas
#2629
Iris: bootstrap script templates are too fragile
#2377
Jobs do not tolerate preemption of the node where `self._run_steps` is running.
#2651
Iris: Resolver/Actor system should always auto-resolve on transient errors
#2809
Iris: Survey threading and timeouts for the controller
#2810
Iris: benchmark test for controller performance
#2424
Iris - initial resource observability
#2710
Experiment: MoE EP benchmark milestone
#2418
Add AdamC, fp32 router compute, router_topk_then_softmax, qk-norm option for MoE stability sweeps
#2414
Experiment: OLMoE size sweep with MoE stability measures
#2804
fsspec should reject cross region reads (or those over X MB)
#2744
Iris: bootstrap should probably live on the scaling group
#2745
Iris: Add attributes to ScaleGroupConfig for scheduling-level metadata
#2642
Iris: preemptible should be a taint, not an attribute
#2735
Iris: Zone-aware scheduler
#2762
Iris: fair scheduler
#2625
Iris: Users and Priorities
#2749
iris: Migrate GCP platform from gcloud CLI to Python API
#2772
Iris: add proxy for worker view
#2803
iris-controller: add built-in py-spy profiling endpoint to dashboard
#2754
Embed speedscope in Iris dashboard for one-click profile viewing
#2413
SwiGLU vs Bilinear MLP layers for MoE Experts
#2708
Zephyr: auto-scale worker groups up to match demand
#2535
Iris: Integrate chronos virtual time into chaos test suite
#2849
Iris: add smoke test into CI
#2926
Iris: Add Levanter health check in Iris
#3035
StepRunner shouldn't launch tasks with Fray by default
#3098
Evaluate (first few steps) x00B MoE on TPU and GPU
#3164
Iris: allow controller restarts without resetting tasks
180 autocategorized
#4229
[iris] Add job state filter to dashboard JobsTab
+28 −5
@claude
#4228
[iris] Add sorting to child jobs table in dashboard
+59 −6
@claude
#4222
perf(iris): reduce drain_dispatch_all lock hold time from 80ms to <5ms
+193 −65
@rjpower
#4217
[iris] Fix bundle exclude patterns for docs and test snapshots
+5 −5
@rjpower
#4216
fix(iris): remove --nonblocking from py-spy to fix native CPU profiles
+1 −2
@rjpower
#4215
[iris] Skip per-task detail in GetJobStatus by default
+32 −16
@rjpower
#4212
[zephyr] Fix counter loss and double-counting via generation tags
+112 −64
@ravwojdyla-agent
#4210
[iris] Two-pass heartbeat batch: 4x faster provider loop phase 3
+267 −57
@rjpower
#4209
Add lightweight job state polling to reduce controller load
+94 −12
@rjpower
#4207
[iris] Build native extensions for Rust crates in rust-dev mode
+12 −0
@rjpower
#4206
fix(iris): strip trailing slash from MirrorFileSystem prefix
+1 −1
@yonromai
#4205
fix(zephyr): don't SHUTDOWN idle workers while shards are in-flight
+36 −2
@yonromai
#4203
[iris] Fix flaky test_smoke_gcp_config_boots_locally timeout
+35 −41
@rjpower
#4202
[iris] Unify log fetching: FetchLogs with LIKE patterns, deprecate GetTaskLogs
+416 −598
@rjpower
#4199
[zephyr/iris] Fix stale coordinator killing new workers on retry
+48 −38
@rjpower
#4197
[iris] Remove obsolete CoreWeave CI workflows and configs
+62 −1267
@rjpower
#4195
fix(canary): lower GPU canary batch size to 16 to fix OOM
+1 −1
@yonromai
#4194
[iris] Capture periodic memory profiles alongside CPU profiles
+127 −32
@claude
#4192
Add --frozen flag to uv sync and optimize PYTHONDONTWRITEBYTECODE
+9 −3
@rjpower
#4191
[nightshift] Deduplicate vLLM server polling loop and env defaults
+122 −107
@github-actions
#4189
[zephyr] Add records_in / records_out counters to readers and writers
+22 −1
@ravwojdyla-agent
#4186
Enable native frame collection in CPU profiling by default
+4 −1
@rjpower
#4185
Switch SQLite backups from WAL to DELETE journal mode
+17 −2
@rjpower
#4182
Use iris bundle for Ray working_dir to include pb2 files
+213 −145
@rjpower
#4181
[iris] Remove ORM query builder, replace with raw SQL
+430 −987
@rjpower
#4180
[iris] Fix VM bootstrap monitor killing healthy workers
+242 −23
@rjpower
#4179
[iris] Fix scheduling loop: filter reservation jobs at SQL level
+81 −7
@rjpower
#4177
Add Claude triage step to canary ferry workflows
+146 −12
@yonromai
#4176
Tolerate controller unavailability for up to 1h in job monitoring
+276 −32
@rjpower
#4175
Halve Ray cluster minimums, boost Iris capacity
+28 −28
@rjpower
#4174
Add CoreWeave CI workflow for Iris PRs
+645 −155
@rjpower
#4173
Fix canary validation artifact replication
+1 −0
@yonromai
#4172
[nightshift] Extract duplicated dedup aggregator and finalization
+100 −121
@github-actions
#4165
docs: add actor RPC section to debug-zephyr-job skill
+23 −0
@ravwojdyla-agent
#4164
[zephyr] Per-worker counter queries, accumulate across stages
+14 −4
@ravwojdyla-agent
#4161
feat(iris): route actor RPC calls through proxy and decode responses
+72 −1
@ravwojdyla-agent
#4158
Add Slack alerts for canary workflow failures
+18 −0
@yonromai
#4157
fix(iris): handle scheme in actor proxy upstream URL
+45 −27
@ravwojdyla-agent
#4155
fix(iris): fixup Starlette `on_shutdown`
+26 −5
@ravwojdyla
#4152
fix(iris): add Docker Buildx setup to GPU canary workflow
+3 −0
@yonromai
#4150
[infra] Revert pull_request_target to pull_request and restore id-token: write
+9 −5
@rjpower
#4145
iris: add --dry-run mode to controller startup
+542 −157
@rjpower
#4144
fix(iris): add Docker Buildx setup to smoke test workflows
+6 −0
@rjpower
#4143
[iris] Compress checkpoints with zstd, prune old ones, improve restart UX
+250 −101
@rjpower
#4140
[zephyr] Fix _check_worker_group false abort after completed stage
+101 −0
@rjpower
#4139
[iris] Use buildx --push for atomic build+push+cache-write
+35 −19
@rjpower
#4138
[infra] Remove OIDC auth from Claude workflows, use OAuth token
+2 −6
@rjpower
#4137
[iris] Use npx for Playwright system deps on CI cache hit
+2 −2
@rjpower
#4136
fix(iris): hash workdir tokens to prevent ENAMETOOLONG for nested jobs
+23 −3
@rjpower
#4135
Fix ResourceConfig serialization and remove asia-northeast1 buckets
+4 −4
@rjpower
#4128
Migrate dev TPU guides into skills
+277 −292
@dlwh
#4126
feat(iris): add actor proxy service for external access to cluster actors
+359 −8
@ravwojdyla
#4125
[Iris] Remove caps from marin-dev cluster config
+10 −10
@ravwojdyla-agent
#4124
Update iris skill, ref config
+14 −1
@ravwojdyla
#4123
[canary] Forward R2/S3 credentials to GPU canary task pod
+3 −0
@rjpower
#4116
Parallelize data_size computation in consolidate_shard_caches
+58 −5
@yonromai
#4112
[ferry] Use mirror:// for canary validation, drop hardcoded MARIN_PREFIX
+9 −8
@rjpower
#4108
[Iris] Fix stale detail state and nested child jobs in dashboard
+230 −119
@ahmeda14960
#4106
[fray] Use graceful actor termination to avoid Ray task_manager assertion
+30 −4
@rjpower
#4101
[canary] add logging for task pods, `kind` test
+441 −0
@rjpower
#4099
fix: worker processes exit cleanly after SHUTDOWN
+136 −5
@yonromai
#4097
`StepSpec` -> `ExecutorStep` and use `StepSpec` in integration test
+141 −29
@ravwojdyla
#4095
zephyr: allow to specify coordinator resources (#4005)
+5 −1
@ravwojdyla
#4094
[tests] Fix tests that hit HuggingFace during CI
+55 −36
@rjpower
#4091
[iris] Increase GCP smoke test shell poll timeout
+1 −1
@rjpower
#4089
[iris] Fix dev cluster restart ghcr.io auth failure
+16 −0
@rjpower
#4087
fix: address iris integration test review feedback
+129 −7
@claude
#4086
[nightshift] Remove dead code and deduplicate helpers in decon.py
+9 −59
@github-actions
#4085
[iris] Add user-defined counters (MapReduce-style per-job stats)
+167 −6
@ravwojdyla-agent
#4081
Fix worker pod over-provisioning and sequential cleanup
+156 −13
@yonromai
#4078
Fix hatch build proto staleness detection
+31 −4
@rjpower
#4076
[zephyr] Fix tests that relied on closure mutation for call counting
+39 −53
@yonromai
#4011
[iris] Label always-on CoreWeave nodes as system-critical
+3 −0
@rjpower
#4010
[CI] Fix required checks hanging + rename to project-scoped names
+38 −80
@rjpower
#4009
[iris] Redesign integration test suite
+1069 −2
@claude
#4008
[zephyr] Fix coordinator loop crash causing silent pipeline hangs
+41 −9
@yonromai
#4005
zephyr: allow to specify coordinator resources
+5 −1
@ravwojdyla
#4002
[iris] CoreWeave dev cluster: always-on CPU, scale-to-zero GPU, nightly restart
+136 −353
@rjpower
#3997
fix: regenerate uv.lock in dupekit wheels workflow
+6 −1
@rjpower
#3991
[iris] Make controller restart idempotent
+23 −4
@rjpower
#3990
[executor] Add mirrored() for region-agnostic data references
+326 −25
@rjpower
#3986
[CI] Fix required checks hanging due to workflow path filters
+229 −47
@rjpower
#3984
[iris] Add dev cluster config with daily automated restart
+210 −0
@rjpower
#3982
[iris] Add events permission to CoreWeave RBAC ClusterRole
+5 −0
@rjpower
#3980
[nightshift] Deduplicate distributed lock + heartbeat logic
+49 −81
@github-actions
#3966
[nightshift] fix documentation drift in tutorials
+12 −9
@github-actions
#3946
Babysit/debug Iris job skills
+290 −188
@ravwojdyla
#3937
[iris] Add optional auth mode for gradual adoption
+269 −35
@claude
#3910
[zephyr/iris] Fix ConfigMap size limit for large pipelines
+66 −6
@yonromai
#3900
iris: eliminate Platform abstraction, reorganize into Service + Provider layers
+14497 −15020
@rjpower
#3849
[nightshift] Remove duplicated vLLM model resolution from EvalchemyEvaluator
+3 −41
@github-actions
#3839
zephyr: refactor chunking, improve shuffle scalability
+1357 −539
@ravwojdyla
#3675
Fix Claude PR review for fork PR OIDC
+3 −3
@dlwh
#4252
[RL] Add GPU smoke probes and resource-aware config
+1361 −203
@taivu1998
#4246
[iris] Offload large workdir files to blob store in launch_job
+224 −1
@rjpower
#4231
Remove deprecated GetTaskLogs RPC
+7 −99
@claude
#4224
[executor] Add .mirrored() to ExecutorStep and InputName
+79 −36
@claude
#4218
Fix raw dataset tokenize input paths
+12 −6
@yonromai
#4178
[zephyr] Arrow-native scatter/reduce: 1.1x reduce speedup
+816 −405
@rjpower
#4169
[iris] Fix get_job_info crash on constraint JSON with unknown fields
+27 −1
@claude
#4168
[iris] Ignore unknown inherited constraint fields
+5103 −16
@dlwh
#4121
[iris] Auto-quarantine workers with consecutive SIGSEGV task crashes
+195 −6
@claude
#4096
Add user budgets, priority bands and preemption to Iris
+3706 −29
@rjpower
#4090
[classification] Migrate inference pipeline from Ray to Fray v2
+210 −286
@claude
#4080
[iris] Add GCP impersonation based on logged-in user identity
+237 −0
@rjpower
#4070
Add Cloudflare Tunnel integration for public controller access
+803 −15
@rjpower
#3983
Skip region validation for cached executor steps
+150 −0
@claude
#3824
Make executor region-aware on Iris from GCS dependencies
+1075 −34
@dlwh
#3691
docs: refresh Ray cluster setup tutorial
+15 −17
@dlwh-golem
#4208
Add make rust-package: unified dupekit wheel build, publish, and pin
+406 −99
@rjpower
#4221
[iris] Reduce controller write lock contention in drain_dispatch_all and prune_old_data
+154 −26
@claude
#4245
[iris] Add blob storage to BundleStore for externalized workdir files
+93 −39
@rjpower
#4187
Log final counters on coordinator shutdown
+5 −0
@rjpower
#4102
Handle empty string 'WANDB_API_KEY' in init_wandb
+2 −1
@ravwojdyla
#3988
Soften issue-linking rule in pull-request skill
+6 −4
@claude
#4230
Defer workspace bundle creation until job submission
+11 −11
@rjpower
#4151
Prefer origin pushes for Marin PR branches
+137 −14
@dlwh-golem
#4098
Worker processes never exit after SHUTDOWN: _host_actor() blocks forever on anonymous Event
#4103
[iris] Truncate workdir tokens for deeply nested job names
#4107
[Iris] Fix stale detail state and nested child jobs in dashboard
#4109
[iris] Actor proxy service for external access to cluster actors
#4113
[executor] Retokenize data when it doesn't exist in the local region
#4117
[iris] Flaky test_marin_pipeline_on_iris: ZephyrWorkerError in dedup_fuzzy_document
#4120
[iris] Auto-quarantine and delete TPU nodes that crash with SIGSEGV
#4127
Migrate dev TPU docs into skills
#3103
iris: worker kubectl saturation causes heartbeat timeouts and task pod disappearances
#4141
iris: add a --dry-run to controller startup
#3643
zephyr: batch instead of hard coded `chunk_size`
#4156
RayActorGroup.shutdown() should disable retries for __ray_terminate__
#4167
[iris] get_job_info crashes on constraint JSON with mode field
#2639
Iris: Zephyr v2 context is unclear
#4183
Fasttext classifier code doesn't work in distributed/S3 environments
#3674
Claude PR Review fails on fork PRs: missing OIDC env in pull_request runs
#4190
[zephyr] Fix counter loss and double-counting between heartbeat and report_result
#4193
Iris: capture periodic memory profiles alongside CPU profiles
#4196
Memory leak in cache-copy: TensorStore handles accumulate ~14 MiB per shard
#3172
Iris: fix memray, needs LLDB in the task container
#3174
Iris: mark non-preemptible slices with a autoscaling taint
#4200
Zephyr coordinator stalls when last-shard worker dies or hangs
#4201
Remove deprecated GetTaskLogs RPC
#3185
iris: consolidate controller to single-thread event loop model
#3706
[iris] Separate auth and log tables into attached SQLite databases
#4220
perf(iris): Controller write lock contention causes 2s+ RegisterEndpoint latency at scale
#4223
Add .mirrored() to ExecutorStep and InputName for cross-region reads
#3201
iris: controller optimization for heartbeat loop
#4226
Subtasks in the Iris dashboard should be sorted by the main sort criteria
#4227
Filter and Reset Buttons Don't Work in the Iris Dashboard
#3266
smoke-test full mode: controller restart fails (VM not discoverable) + logging format bug
#2244
Consider replacing fsspec with obstore
#2257
Update docs for GPU Docker images
#3798
[zephyr] Coordinator OOM from quadratic scatter metadata growth
#3359
iris: track task/job state in DB for O(1) child status lookups in get_task_logs
#2364
Iris - Backup/Restore and post-mortem visibility
#3404
iris: missing TPUs should be marked as FAILED by autoscaler
#3926
iris: extend cloud smoke test github workflow to test against a CPU CW cluster
#3932
[iris] test_iris_run_cli_simple_job times out since #3899 preemptible heuristic
#3936
iris: add OptionalAuth mechanism to allow gradual adoption
#3947
Attempt to get collocated RL on single TPU worker
#3953
QOL Updates for Data Browser
#3968
iris: dev vs prod clusters
#3969
iris: create dev cluster config with daily automated restart
#2944
Iris: introduce provider abstraction
#3467
Iris: dashboard should fetch top-level jobs, zoom in as needed
#3981
[canary] TPU ferry fails: inherited Iris region pin conflicts with MARIN_PREFIX
#3985
Required checks hang when workflow path filters skip execution
#3989
Hardcoded gs:// paths in experiment configs pin execution to specific regions
#3992
[iris] Unify TaskProvider and K8sTaskProvider behind a single protocol
#3994
[zephyr] External sort shuffle
#3995
[iris] Add user-defined counters (MapReduce-style per-job stats)
#3996
[zephyr] Coordinator thread crashes silently, causing pipelines to hang at N-1/N
#3998
iris: decompose K8sControllerProvider into focused modules
#4001
[iris] CoreWeave dev cluster: always-on CPU, scale-to-zero GPU, nightly restart
#4003
[zephyr] On-demand workers with heterogeneous resource specs
#4004
[zephyr] Coordinator thread has no lifecycle logging; hangs are undiagnosable
#4007
iris: split smoke test into separate explicit commands
#3538
iris: merge LogStore tables into ControllerDB
#4079
Iris launches and cleans up thousands of worker pods after coordinator exits
#3058
Iris: add "thread dump" link for tasks
#4088
iris: migrate classification/inference to Fray & Iris
#4093
zephyr: workers restart race condition?
Other Changes
@dlwh drafted a staged modeling experiment skill #4166 to make modeling experiments Grug-first with W&B view tagging and optional stage gates.
1 PR this week, 11 new comments, and 75 issues closed (75 total)
Activity
#4166
Add staged modeling experiment skill
+254 −1
@dlwh
Top 15 runs (by FLOPs) this week (completed, running, crashed)
The largest active run is the Delphi 1e23 dense scaling ladder (W&B), a 25B parameter model on v4-1024 now at 608B tokens with Paloma macro BPB 0.79 and train loss 2.08; @Helw150 noted the loss trend during cooldown is tracking between the pessimistic and optimistic forecasts #1337. The 1e22 MoE v7 run (W&B) launched mid-week on v4-512 at 23% MFU, with an estimated 7.7 days to completion #3800. Two finished v7 isoflop runs at 1e20 validated capacity factor 1.0 as 8.3% faster with near-identical loss #4016. On the post-training side, @AlienKevin's 32K v2 OT-Agent SFT (W&B) reached 13% SWE-bench (matching the released 14%) #3896, while the 131K v2 run on v5p-256 (W&B) revealed that batch-size-scaled gradient clipping is needed at 131K context #3897. Three Qwen3-14B resilience SFT runs for math and medical domains completed on v5p-16 at 56-57% MFU.
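As a back-of-envelope check on the MFU figures, here is the standard approximation; the v4 chip count and per-chip peak are assumptions, not run telemetry.

```python
# Hedged MFU back-of-envelope: achieved model FLOP/s over hardware peak, using
# the standard ~6*N FLOPs-per-token rule for a dense N-parameter model.
# Assumes v4-1024 = 512 chips at ~275e12 bf16 FLOP/s each (both assumptions).
def tokens_per_sec_at_mfu(mfu: float, params: float, chips: int,
                          peak_flops_per_chip: float = 275e12) -> float:
    return mfu * chips * peak_flops_per_chip / (6 * params)

# What throughput would 30% MFU imply for a 25B dense model on 512 chips?
print(f"{tokens_per_sec_at_mfu(0.30, 25e9, 512):,.0f} tok/s")  # ~281,600 tok/s
```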