Kernel-Style Machine Learning
Rapid prototyping and automation for open source ML R&D using Linux kernel development methodologies
Fused INT4 KV quantization across 14 models and 4 GPU architectures. Batch-driven saturation, fusion mechanism, KV precision asymmetry, and runtime calibration.
Fisher Information as a universal signal for pruning, quantization, and layer allocation. Coming soon.
All built on the same signal: diagonal Fisher Information (E[g²] ≈ Adam exp_avg_sq). Squisher (2025) proves the equivalence. Higher FIM trace = parameters doing more work.
bitter7 achieves 15.6% better perplexity than magnitude baseline (37.28 vs 44.15 PPL), leveraging Adam's exp_avg_sq (≈ FIM diagonal) for importance scoring.
importance = |w| × (exp_avg_sq + ε)^0.25
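As a sketch of how this scoring works in practice, assuming NumPy arrays standing in for a weight tensor and Adam's `exp_avg_sq` buffer (`fim_importance` and `prune_mask` are illustrative names, not part of bitter7's API):

```python
import numpy as np

def fim_importance(w: np.ndarray, exp_avg_sq: np.ndarray,
                   eps: float = 1e-8) -> np.ndarray:
    """Importance score: |w| * (exp_avg_sq + eps)^0.25.

    exp_avg_sq is Adam's second-moment estimate, which approximates
    the diagonal Fisher Information E[g^2].
    """
    return np.abs(w) * np.power(exp_avg_sq + eps, 0.25)

def prune_mask(w: np.ndarray, exp_avg_sq: np.ndarray,
               sparsity: float = 0.5) -> np.ndarray:
    """Keep the top-(1 - sparsity) fraction of weights by importance."""
    score = fim_importance(w, exp_avg_sq)
    threshold = np.quantile(score, sparsity)
    return score >= threshold
```

The 0.25 exponent tempers the FIM term, so magnitude still dominates the score while high-Fisher parameters get a boost over a pure magnitude criterion.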
Diagonal Fisher identifies critical tensors for precision allocation. Upgrading 4 layers from Q3_K to Q6_K achieves 1.26% better perplexity at only 1.8% size increase.
FIM-guided compression adds ~20% extra compression on top of MLA (7.2× vs 6×), achieving 11% better perplexity (63 vs 71 PPL) and +1 HellaSwag vs MLA baseline.
Learned Q@K.T ↔ K@Q.T alternation achieves 82% better perplexity (50.5 vs 282 PPL) vs baseline. Outperforms Qwen's SDPA G1 gate by 77%. FIM trace guides layer selection.
FIM-ranked parameter placement across GPU, CPU, and storage tiers. exp_avg_sq ranking determines which tensors stay in fast memory.
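A minimal greedy-placement sketch, assuming per-tensor mean `exp_avg_sq` stats have already been collected (the tensor names, byte sizes, and the `tier_tensors` helper are all hypothetical):

```python
# Hypothetical per-tensor stats: (name, size in bytes, mean exp_avg_sq).
tensors = [
    ("embed",   400, 3e-4),
    ("attn.0",  200, 9e-4),
    ("mlp.0",   300, 1e-5),
    ("lm_head", 400, 5e-4),
]

def tier_tensors(tensors, gpu_budget, cpu_budget):
    """Greedy placement: highest-FIM tensors fill GPU first, then CPU.

    Whatever exceeds both budgets spills to storage, so the parameters
    doing the most work (highest E[g^2]) stay in the fastest memory.
    """
    placement = {}
    ranked = sorted(tensors, key=lambda t: t[2], reverse=True)
    gpu_used = cpu_used = 0
    for name, size, _ in ranked:
        if gpu_used + size <= gpu_budget:
            placement[name] = "gpu"
            gpu_used += size
        elif cpu_used + size <= cpu_budget:
            placement[name] = "cpu"
            cpu_used += size
        else:
            placement[name] = "storage"
    return placement
```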
Reciprocal Attention transfers from transformers to GNNs. FIM-guided RA achieves +7% F1 on DGraphFin by applying RA selectively to uncertain nodes (bottom 33% by FIM trace). Page-aware batching provides 4× better I/O locality.
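The node-selection step reduces to a quantile cut. A sketch, assuming per-node FIM-trace scores are available and tie-free (`uncertain_nodes` is an illustrative name):

```python
import numpy as np

def uncertain_nodes(fim_trace: np.ndarray, frac: float = 1 / 3) -> np.ndarray:
    """Boolean mask for the lowest-`frac` fraction of nodes by FIM trace.

    Low FIM trace means the gradients carry little signal for these nodes,
    so Reciprocal Attention is applied only to them.
    """
    cutoff = np.quantile(fim_trace, frac)
    return fim_trace <= cutoff
```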
All applications leverage the same underlying signal: E[g²]
| Application | FIM Signal | Result |
|---|---|---|
| bitter7 pruning | exp_avg_sq^0.25 | 15.6% better PPL |
| Mobile quantization | Σg² per tensor | 1.26% better PPL |
| KVSplice layers | FIM trace | 11% better PPL |
| Reciprocal Attention | FIM trace | 82% better PPL |
| Memory Tiering | exp_avg_sq rank | Optimal placement |
| GNN Fraud Detection | FIM trace → nodes | +7% F1 |
Autoregressive decode is dominated by memory traffic. These explainers establish the structural and empirical foundations behind the BPA line of work: RGSA → BPA → fused KV quantization.
Structural explanation of why autoregressive decode rereads KV state every step and why this is an inherent memory-traffic problem.
Empirical cross-GPU decode throughput, latency, and bandwidth measurements across W7900, A100, H100, and B200.
Interactive roofline model. Decode attention operates 18–148× below the compute/memory ridge on H100 (∼295 FLOP/byte).
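A back-of-envelope reconstruction of those numbers, assuming H100 SXM peaks of ~989 TFLOP/s dense BF16 and ~3.35 TB/s HBM3, and an effective decode-attention intensity that scales with how many queries reuse each loaded KV element (the `decode_gap` helper is illustrative):

```python
# Assumed H100 SXM specs: ~989 TFLOP/s dense BF16, ~3.35 TB/s HBM3.
PEAK_FLOPS = 989e12
PEAK_BW = 3.35e12
ridge = PEAK_FLOPS / PEAK_BW  # FLOP/byte at the compute/memory ridge

def decode_gap(queries_per_kv_head, bytes_per_elem=2, flops_per_elem=2):
    """How far decode attention sits below the ridge.

    Each FP16 KV element (2 bytes) loaded contributes ~2 FLOPs (one
    multiply-add) per query that shares it; batching and GQA raise
    this reuse, and hence arithmetic intensity, linearly.
    """
    intensity = queries_per_kv_head * flops_per_elem / bytes_per_elem
    return ridge / intensity
```

Under these assumptions the ridge lands near 295 FLOP/byte, and 2–16 effective queries per KV head gives a gap of roughly 148× down to 18×, consistent with the range quoted above.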
Statistical methods used in the paper. Spearman rank correlation tests whether architecture predicts KV sensitivity (ρ < 0.2; it does not).
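A self-contained Spearman sketch (rank the data, then take the Pearson correlation of the ranks; assumes no ties — `scipy.stats.spearmanr` handles ties and p-values):

```python
import numpy as np

def spearman_rho(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks.

    A |rho| near 0 (as in the paper's < 0.2) means no monotone
    relationship between the two variables.
    """
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x (no ties)
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y (no ties)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```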
Pre-trained KV caches that skip prefill. 10.6× speedup at 16K tokens. vLLM integration via KVConnectorBase_V1 with zero core patches.