This update covers Iris reliability hardening, Grug's module-first API refactor, and early MoE training experiments on TPU. The first CoreWeave GPU canary ferry was stood up, and @ClassicLarry got Grug MoE running with replicated weights on v4 and v5p.
@gonzalobenegas added DNA experiments covering promoters, genomic regions, and k-mer tokenization #2992, plus auto-detection of BOS/EOS tokens in the DNA batch tokenizer #3055. @teetone updated the Evalchemy non-math evaluation domains #3128. @dlwh #3129 and @rjpower #3056 landed agent recipe and scrub skill improvements.
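For background on the k-mer tokenization used in those DNA experiments: a k-mer tokenizer slides a fixed-length window over the nucleotide sequence and maps each window to a vocabulary id, optionally wrapping the result in the BOS/EOS tokens whose auto-detection #3055 addresses. A minimal illustrative sketch, not the actual tokenizer from #2992; all names here are hypothetical:

```python
from itertools import product

def build_kmer_vocab(k: int, alphabet: str = "ACGT") -> dict[str, int]:
    """Enumerate all 4^k DNA k-mers, reserving ids 0/1 for BOS/EOS."""
    vocab = {"<bos>": 0, "<eos>": 1}
    for kmer in product(alphabet, repeat=k):
        vocab["".join(kmer)] = len(vocab)
    return vocab

def tokenize_kmers(seq: str, k: int, vocab: dict[str, int],
                   stride: int = 1) -> list[int]:
    """Slide a length-k window over the sequence (overlapping when
    stride < k) and wrap the resulting ids in BOS/EOS."""
    ids = [vocab[seq[i:i + k]] for i in range(0, len(seq) - k + 1, stride)]
    return [vocab["<bos>"], *ids, vocab["<eos>"]]

vocab = build_kmer_vocab(k=3)
print(tokenize_kmers("ACGTAC", k=3, vocab=vocab))  # BOS, 4 overlapping 3-mers, EOS
```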
Dense isoflop scaling experiments continued with a full v8 rerun of the scaling_v3 sweeps at 3e20 and 2e20 FLOP budgets, showing slightly improved losses vs. v3 (e.g., 2.576 vs. 2.577 at the 2.5B/3e20 point). The first Grug MoE runs appeared: @dlwh ran a 286M MoE with replicated weights on v5p-8 to 11.3B tokens at 21.6% MFU as a flopmatch baseline against the dense canary (a sketch of the replicated-weights layout follows the table), and attempted a 2-host v5p-16 trial that crashed. @ClassicLarry submitted a 300M speedrun on v5p-16. The first CoreWeave GPU canary ferry launched on H100x8 but crashed early at 3.7% MFU, beginning the GPU platform bring-up. The AdamH hyperparameter mega-sweep continued with 50+ additional 5B-token runs.
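For reading the table below: the FLOP Budget column lists both model FLOPs and the implied hardware FLOPs at the stated MFU (HW = model / MFU), and the isoflop sweeps fix a compute budget C while varying model size N, which pins the token budget at D = C / (6N). A minimal sketch of that accounting, assuming the standard 6ND approximation for dense transformer training FLOPs (the approximation is our assumption, not something the table states):

```python
def model_flops(n_params: float, n_tokens: float) -> float:
    """Training FLOPs under the standard 6*N*D approximation
    (~6 FLOPs per parameter per token, forward + backward)."""
    return 6.0 * n_params * n_tokens

def hardware_flops(model: float, mfu: float) -> float:
    """Total accelerator FLOPs implied by a model-FLOP utilization:
    the table's 'HW' figure is model FLOPs divided by MFU."""
    return model / mfu

# Isoflop sweeps fix the compute budget C and vary model size N,
# which pins the token budget at D = C / (6 * N).
C, N = 3.0e20, 2.5e9           # the 2.5B point of the 3e20 sweep
D = C / (6.0 * N)              # => 2e10, i.e. ~20B tokens
assert abs(model_flops(N, D) - C) < 1e6

# Cross-check a table row: 3.00e20 model FLOPs at 22% MFU implies
# ~1.36e21 hardware FLOPs, matching the v4 isoflop rows below.
print(f"{hardware_flops(3.00e20, 0.22):.3g}")  # 1.36e+21
```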
| Run | Owner | Hardware | FLOP Budget | Wall Time | Loss | Evals | Links |
|---|---|---|---|---|---|---|---|
| (done) exp2262pt2f_sft_qwen2pt5_ot4_30k_math_qwen3_235b_a22b_32768token-ad1022 | Moo Jin Kim | TPU v6 lite (128 chips) | 6.45e20 model / 2.61e21 HW (25% MFU) | 23.3h | BPB: 0.500 | W&B | |
| (done) exp2262pt2g_sft_qwen2pt5_ot4_30k_math_n1_rejsamp_qwen3_32b_32768-689abd | Moo Jin Kim | TPU v6 lite (128 chips) | 6.45e20 model / 2.61e21 HW (25% MFU) | 17.6h | BPB: 0.279 | W&B | |
| (done) exp2262pt3d_qwen3_1pt7b_base_ot4_240k_math_qwen3_32b_32768tokens-dec321 | Moo Jin Kim | TPU v5 (16 chips) | 1.30e21 model / 2.60e21 HW (50% MFU) | 2.8d | BPB: 0.129 | W&B | |
| (done) exp2262pt3g_qwen3_1pt7b_base_ot4_30k_math_n8_rejsamp_soft_qwen3_-63d27f | Moo Jin Kim | TPU v5 (32 chips) | 1.01e21 model / 2.19e21 HW (46% MFU) | 2.0d | BPB: 0.062 | W&B | |
| (done) exp2262pt3c_qwen3_1pt7b_base_ot4_240k_math_qwen3_4b_32768tokens-557d96 | Moo Jin Kim | TPU v5 lite (128 chips) | 1.30e21 model / 2.08e21 HW (62% MFU) | 2.3d | BPB: 0.056 | W&B | |
| isoflop-3e+20-d4096-L40-B16-adamh_scaling_v6 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.36e21 HW (22% MFU) | 2.4d | BPB: 0.955 | W&B | |
| isoflop-3e+20-d4096-L40-B16-adamh_scaling_v3 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.36e21 HW (22% MFU) | 2.3d | BPB: 0.955 | W&B | |
| (done) exp2262pt2g_3_llama3pt1_ot4_30k_math_n1_rejsamp_qwen3_32b_32768t-662abd | Moo Jin Kim | TPU v5 lite (256 chips) | 4.76e20 model / 1.30e21 HW (36% MFU) | 8.9h | BPB: 0.173 | W&B | |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v5 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.005 | W&B | |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v8 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.002 | W&B | |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v3 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.013 | W&B | |
| isoflop-3e+20-d768-L8-B1024-adamh_scaling_v6 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.29e21 HW (23% MFU) | 1.8d | BPB: 1.004 | W&B | |
| isoflop-3e+20-d1024-L11-B512-adamh_scaling_v7 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.23e21 HW (24% MFU) | 1.7d | BPB: 0.947 | W&B | |
| isoflop-3e+20-d1024-L11-B512-adamh_scaling_v8 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.23e21 HW (24% MFU) | 1.7d | BPB: 0.946 | W&B | |
| isoflop-3e+20-d1024-L11-B512-adamh_scaling_v6 | Will Held | TPU v4 (32 chips) | 3.00e20 model / 1.23e21 HW (24% MFU) | 1.7d | BPB: 0.947 | W&B | |
| Dense isoflop v8 sweep (3e20 budget, 273M-4.3B, 7 sizes) | @Helw150 | v4-32 (16 chips) | (23-44% MFU) | 30-48h per run | loss=2.576-2.937, 11-223B tokens per run | W&B W&B W&B | |
| Dense isoflop v8 sweep (2e20 budget, 273M-2.5B, 6 sizes) | @Helw150 | v4-16 (8 chips) | (30-42% MFU) | 25-39h per run | loss=2.605-2.911, 11-134B tokens per run | W&B | |
| Grug MoE flopmatch daily (286M MoE, replicated weights) | @dlwh | v5p-8 (4 chips) | (21.6% MFU) | 12.5h | loss=3.122, 11.3B tokens | #2710 W&B W&B | |
| Grug MoE v5p-16 trial (286M, 2-host), crashed | @dlwh | v5p-16 (8 chips) | (21.5% MFU) | 2.9h | loss=N/A, 4.2B tokens (crashed) | #2710 W&B | |
| 300M speedrun (stdattn, 4096 ctx) | @ClassicLarry | v5p-16 (8 chips) | (20.9% MFU) | 2.2h | loss=2.926, 6.0B tokens | #2184 W&B | |
| CoreWeave GPU canary ferry (H100x8), crashed | @rjpower | H100x8 (8 GPUs) | (3.7% MFU) | 0.3h | 20M tokens (crashed early) | #3022 W&B | |
| AdamH hyperparameter mega-sweep v3 (loop 3-9, 157M) | @Helw150 | v5p-8 (4 chips) | (23% MFU) | 5h per run | loss=3.48-3.75, 5B tokens each, 50+ runs | W&B | |
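The Grug MoE flopmatch runs above keep the expert weights replicated on every device and shard only the batch, i.e., pure data parallelism. A minimal JAX sketch of that layout with top-1 routing, assuming nothing about Grug's actual API; all names are illustrative:

```python
import functools
import jax
import jax.numpy as jnp

def init_params(key, d_model=64, n_experts=8, d_ff=128):
    k1, k2, k3 = jax.random.split(key, 3)
    scale = 0.02
    return {
        "router": scale * jax.random.normal(k1, (d_model, n_experts)),
        "w_in": scale * jax.random.normal(k2, (n_experts, d_model, d_ff)),
        "w_out": scale * jax.random.normal(k3, (n_experts, d_ff, d_model)),
    }

def moe_layer(params, x):
    # x: [tokens, d_model]; top-1 routing for simplicity.
    logits = x @ params["router"]                      # [tokens, n_experts]
    expert = jnp.argmax(logits, axis=-1)               # [tokens]
    gate = jnp.take_along_axis(
        jax.nn.softmax(logits, axis=-1), expert[:, None], axis=-1
    )                                                  # [tokens, 1]
    # Gather each token's expert weights (fine at toy scale).
    w_in = params["w_in"][expert]                      # [tokens, d_model, d_ff]
    w_out = params["w_out"][expert]                    # [tokens, d_ff, d_model]
    h = jax.nn.gelu(jnp.einsum("td,tdf->tf", x, w_in))
    return gate * jnp.einsum("tf,tfd->td", h, w_out)

def loss_fn(params, x):
    return jnp.mean((moe_layer(params, x) - x) ** 2)   # toy objective

@functools.partial(jax.pmap, axis_name="dp")
def train_step(params, x):
    loss, grads = jax.value_and_grad(loss_fn)(params, x)
    # All-reduce the gradients so every replica applies the identical
    # update and the replicated weight copies stay in sync.
    grads = jax.lax.pmean(grads, axis_name="dp")
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, jax.lax.pmean(loss, axis_name="dp")

# Replicate the full weight tree onto every local device; only the
# leading batch dimension is sharded across the data-parallel axis.
devices = jax.local_devices()
params = jax.device_put_replicated(init_params(jax.random.PRNGKey(0)), devices)
batch = jax.random.normal(jax.random.PRNGKey(1), (len(devices), 16, 64))
params, loss = train_step(params, batch)
```

With this layout each device must hold the entire expert stack, so memory, not compute, bounds the model size; the `pmean` all-reduce is what keeps the replicated copies bit-identical across steps.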