Kernel-Style Machine Learning
Rapid prototyping and automation for open source ML R&D using Linux kernel development methodologies
Fused INT4 KV quantization across 14 models and 4 GPU architectures. Batch-driven saturation, fusion mechanism, KV precision asymmetry, and runtime calibration.
Fisher Information as a universal signal for pruning, quantization, and layer allocation. Coming soon.
All built on the same signal: diagonal Fisher Information (E[g²] ≈ Adam exp_avg_sq). Squisher (2025) proves the equivalence. Higher FIM trace = parameters doing more work.
bitter7 achieves 15.6% better perplexity than magnitude baseline (37.28 vs 44.15 PPL), leveraging Adam's exp_avg_sq (≈ FIM diagonal) for importance scoring.
importance = |w| × (exp_avg_sq + ε)^0.25
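As a sketch of how this scoring works in practice, assuming NumPy arrays standing in for a weight tensor and Adam's `exp_avg_sq` buffer (`fim_importance` and `prune_mask` are illustrative names, not part of bitter7's API):

```python
import numpy as np

def fim_importance(w: np.ndarray, exp_avg_sq: np.ndarray,
                   eps: float = 1e-8) -> np.ndarray:
    """Importance score: |w| * (exp_avg_sq + eps)^0.25.

    exp_avg_sq is Adam's second-moment estimate, which approximates
    the diagonal Fisher Information E[g^2].
    """
    return np.abs(w) * np.power(exp_avg_sq + eps, 0.25)

def prune_mask(w: np.ndarray, exp_avg_sq: np.ndarray,
               sparsity: float = 0.5) -> np.ndarray:
    """Keep the top-(1 - sparsity) fraction of weights by importance."""
    score = fim_importance(w, exp_avg_sq)
    threshold = np.quantile(score, sparsity)
    return score >= threshold
```

The 0.25 exponent tempers the FIM term, so magnitude still dominates the score while high-Fisher parameters get a boost over a pure magnitude criterion.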
Diagonal Fisher identifies critical tensors for precision allocation. Upgrading 4 layers from Q3_K to Q6_K achieves 1.26% better perplexity at only 1.8% size increase.
FIM-guided compression adds ~20% extra compression on top of MLA (7.2× vs 6×), achieving 11% better perplexity (63 vs 71 PPL) and +1 HellaSwag vs MLA baseline.
Learned Q@K.T ↔ K@Q.T alternation achieves 82% better perplexity (50.5 vs 282 PPL) vs baseline. Outperforms Qwen's SDPA G1 gate by 77%. FIM trace guides layer selection.
FIM-ranked parameter placement across GPU, CPU, and storage tiers. exp_avg_sq ranking determines which tensors stay in fast memory.
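A minimal greedy-placement sketch, assuming per-tensor mean `exp_avg_sq` stats have already been collected (the tensor names, byte sizes, and the `tier_tensors` helper are all hypothetical):

```python
# Hypothetical per-tensor stats: (name, size in bytes, mean exp_avg_sq).
tensors = [
    ("embed",   400, 3e-4),
    ("attn.0",  200, 9e-4),
    ("mlp.0",   300, 1e-5),
    ("lm_head", 400, 5e-4),
]

def tier_tensors(tensors, gpu_budget, cpu_budget):
    """Greedy placement: highest-FIM tensors fill GPU first, then CPU.

    Whatever exceeds both budgets spills to storage, so the parameters
    doing the most work (highest E[g^2]) stay in the fastest memory.
    """
    placement = {}
    ranked = sorted(tensors, key=lambda t: t[2], reverse=True)
    gpu_used = cpu_used = 0
    for name, size, _ in ranked:
        if gpu_used + size <= gpu_budget:
            placement[name] = "gpu"
            gpu_used += size
        elif cpu_used + size <= cpu_budget:
            placement[name] = "cpu"
            cpu_used += size
        else:
            placement[name] = "storage"
    return placement
```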
Reciprocal Attention transfers from transformers to GNNs. FIM-guided RA achieves +7% F1 on DGraphFin by applying RA selectively to uncertain nodes (bottom 33% by FIM trace). Page-aware batching provides 4× better I/O locality.
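The node-selection step reduces to a quantile cut. A sketch, assuming per-node FIM-trace scores are available and tie-free (`uncertain_nodes` is an illustrative name):

```python
import numpy as np

def uncertain_nodes(fim_trace: np.ndarray, frac: float = 1 / 3) -> np.ndarray:
    """Boolean mask for the lowest-`frac` fraction of nodes by FIM trace.

    Low FIM trace means the gradients carry little signal for these nodes,
    so Reciprocal Attention is applied only to them.
    """
    cutoff = np.quantile(fim_trace, frac)
    return fim_trace <= cutoff
```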
All applications leverage the same underlying signal: E[g²]
| Application | FIM Signal | Result |
|---|---|---|
| bitter7 pruning | exp_avg_sq^0.25 | 15.6% better PPL |
| Mobile quantization | Σg² per tensor | 1.26% better PPL |
| KVSplice layers | FIM trace | 11% better PPL |
| Reciprocal Attention | FIM trace | 82% better PPL |
| Memory Tiering | exp_avg_sq rank | Optimal placement |
| GNN Fraud Detection | FIM trace → nodes | +7% F1 |
Autoregressive decode is dominated by memory traffic. These explainers establish the structural and empirical foundations behind the BPA line of work: RGSA → BPA → fused KV quantization.
Structural explanation of why autoregressive decode rereads KV state every step and why this is an inherent memory-traffic problem.
Empirical cross-GPU decode throughput, latency, and bandwidth measurements across W7900, A100, H100, and B200.
Interactive roofline model. Decode attention operates 18–148× below the compute/memory ridge on H100 (∼295 FLOP/byte).
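A back-of-envelope reconstruction of those numbers, assuming H100 SXM peaks of ~989 TFLOP/s dense BF16 and ~3.35 TB/s HBM3, and an effective decode-attention intensity that scales with how many queries reuse each loaded KV element (the `decode_gap` helper is illustrative):

```python
# Assumed H100 SXM specs: ~989 TFLOP/s dense BF16, ~3.35 TB/s HBM3.
PEAK_FLOPS = 989e12
PEAK_BW = 3.35e12
ridge = PEAK_FLOPS / PEAK_BW  # FLOP/byte at the compute/memory ridge

def decode_gap(queries_per_kv_head, bytes_per_elem=2, flops_per_elem=2):
    """How far decode attention sits below the ridge.

    Each FP16 KV element (2 bytes) loaded contributes ~2 FLOPs (one
    multiply-add) per query that shares it; batching and GQA raise
    this reuse, and hence arithmetic intensity, linearly.
    """
    intensity = queries_per_kv_head * flops_per_elem / bytes_per_elem
    return ridge / intensity
```

Under these assumptions the ridge lands near 295 FLOP/byte, and 2–16 effective queries per KV head gives a gap of roughly 148× down to 18×, consistent with the range quoted above.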
Statistical methods used in the paper. Spearman rank correlation tests whether architecture predicts KV sensitivity (ρ < 0.2; it does not).
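A self-contained Spearman sketch (rank the data, then take the Pearson correlation of the ranks; assumes no ties — `scipy.stats.spearmanr` handles ties and p-values):

```python
import numpy as np

def spearman_rho(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks.

    A |rho| near 0 (as in the paper's < 0.2) means no monotone
    relationship between the two variables.
    """
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x (no ties)
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y (no ties)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```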
Pre-trained KV caches that skip prefill. 10.6× speedup at 16K tokens. vLLM integration via KVConnectorBase_V1 with zero core patches.