GGUF: optional Metal dequant fast path via kernels-community #45975
ArthurZucker wants to merge 3 commits
Conversation
When the `kernels` package is installed and `kernels-community/gguf-dequant` has a build for the current torch/Apple-Silicon combo, route Q4_0 / Q8_0 / Q4_K / Q5_K / Q6_K / IQ4_NL / IQ4_XS dequant on MPS through one Metal compute kernel per tensor — 2.3–7.5× faster than the chained PyTorch path on M3 Max at real-world tensor sizes (apples-to-apples, with the input already on device). Lazy import on first dequant; opt out with `TRANSFORMERS_GGUF_USE_METAL_KERNELS=0`. Falls back to the existing pure-torch path for CPU/CUDA, unsupported quant types, or when the Hub doesn't have a build variant for the current env. No behavioural change for environments without the package installed.
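For illustration, a minimal sketch of what that dispatch could look like; `dequantize`, `_torch_dequant`, and the kernel's `dequantize` attribute are placeholder names, not the PR's actual API:

```python
import os

import torch

_METAL_SUPPORTED = {"Q4_0", "Q8_0", "Q4_K", "Q5_K", "Q6_K", "IQ4_NL", "IQ4_XS"}
_metal_kernel = None  # resolved lazily on the first MPS dequant


def _get_metal_kernel():
    """Lazy import: `kernels` is only touched when an MPS dequant is requested."""
    global _metal_kernel
    if _metal_kernel is None:
        from kernels import get_kernel  # ImportError if `kernels` isn't installed

        repo = os.environ.get(
            "TRANSFORMERS_GGUF_METAL_KERNELS_REPO", "kernels-community/gguf-dequant"
        )
        _metal_kernel = get_kernel(repo)  # raises if no build variant for this env
    return _metal_kernel


def _torch_dequant(blocks: torch.Tensor, quant_type: str) -> torch.Tensor:
    raise NotImplementedError("stand-in for the existing chained pure-torch path")


def dequantize(blocks: torch.Tensor, quant_type: str) -> torch.Tensor:
    use_metal = (
        blocks.device.type == "mps"
        and quant_type in _METAL_SUPPORTED
        and os.environ.get("TRANSFORMERS_GGUF_USE_METAL_KERNELS", "1") != "0"
    )
    if use_metal:
        try:
            # One Metal compute kernel launch per tensor; the `dequantize`
            # entry point on the loaded kernel is an assumption.
            return _get_metal_kernel().dequantize(blocks, quant_type)
        except Exception:
            pass  # missing package / no build variant -> pure-torch fallback
    return _torch_dequant(blocks, quant_type)
```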
When `transformers serve` runs on Apple Silicon (`--device auto` or `mps`) with `kernels` installed and no explicit `--attn-implementation` flag, default the attention to `kernels-community/metal-flash-sdpa` instead of plain SDPA. On the 100-sample gsm8k benchmark (Qwen2.5-0.5B-Instruct, MPS fp16, `generate_batch`) it's a 1.66× throughput improvement (158 → 256 tok/s) with token-for-token parity under greedy decoding. Users who don't want it can opt out with `--attn-implementation sdpa`. The help text for the `--attn-implementation` flag now also lists the kernels-hub syntax explicitly.
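A sketch of that default-selection logic under assumed names (the actual `transformers serve` wiring may differ):

```python
import importlib.util


def resolve_attn_implementation(user_flag: str | None, device: str) -> str:
    """Hypothetical helper: pick the default attention implementation."""
    # An explicit --attn-implementation always wins (opt out with "sdpa").
    if user_flag is not None:
        return user_flag
    # kernels-hub syntax: an "<org>/<repo>" string resolved via the `kernels` package.
    if device == "mps" and importlib.util.find_spec("kernels") is not None:
        return "kernels-community/metal-flash-sdpa"
    return "sdpa"
```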
Test plan results (M3 Max, torch 2.11, kernel loaded via `get_local_kernel`):
| Model | baseline (s) | Metal kernel (s) | speedup |
|---|---|---|---|
| TinyLlama-1.1B Q4_K_M | 3.96 | 3.44 | 1.15× |
| Qwen2.5-0.5B Q4_0 | 9.99 | 9.96 | 1.00× |
| Llama-3.2-3B Q4_K_M | 13.22 | 12.79 | 1.03× |
Modest at the load-wall level (1.0–1.15×) — dequant is one phase of model load; the I/O, converter chain, MPS allocator, and module init dominate at these sizes. The kernel's microbench-level 2.3–7.5× speedup compresses through the rest of the pipeline. Expect a bigger relative win on larger models where dequant is a larger share of total time.
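The load-wall numbers can be reproduced with a harness along these lines (assumed script, not shipped in the PR; the GGUF filename is assumed to be the Q4_K_M file from the TheBloke repo):

```python
import os
import time

os.environ["TRANSFORMERS_GGUF_USE_METAL_KERNELS"] = "1"  # "0" for the baseline run

from transformers import AutoModelForCausalLM

start = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
    gguf_file="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # assumed filename
    device_map="mps",
)
print(f"load wall time: {time.perf_counter() - start:.2f}s")
```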
Fallback (`TRANSFORMERS_GGUF_USE_METAL_KERNELS=0`)
The "baseline" column above is exactly this case — the existing pure-torch path. No regression introduced by the patch.
Caveats reproducing locally
- `kernel-builder` emits Metal builds for torch 2.8 / 2.9 / 2.10 only. On torch 2.11 nightly the Hub-fetch path 404s on the variant lookup, so the test used `get_local_kernel` to load the torch 2.10 build directly (see the sketch after this list). A Hub-side rebuild (or staging the 2.10 build as 2.11) is needed before merge for the default install path to work on the current nightly.
- Kernel repo still at `ArthurZ/gguf-dequant`; transfer to `kernels-community` in flight. Override via `TRANSFORMERS_GGUF_METAL_KERNELS_REPO=…`.
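A workaround sketch under stated assumptions: the checkout path and package name are illustrative, and `get_local_kernel`'s exact signature may differ across `kernels` versions:

```python
from pathlib import Path

from kernels import get_local_kernel

# Load the torch-2.10 build straight from a local clone of the kernel repo,
# bypassing the Hub variant lookup that 404s on torch 2.11 nightly.
kernel = get_local_kernel(
    Path("~/kernels/gguf-dequant").expanduser(),  # assumed local checkout path
    package_name="gguf_dequant",                  # assumed package name
)
```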
Wall-time bench, full set (median of 2 runs):
| Model | baseline (s) | Metal kernel (s) | speedup |
|---|---|---|---|
| TinyLlama-1.1B Q4_K_M | 3.27 | 3.15 | 1.04× |
| Qwen2.5-0.5B Q4_0 | 8.62 | 8.80 | 0.98× |
| Llama-3.2-3B Q4_K_M | 11.16 | 11.55 | 0.97× |
| Gemma-2-2B Q4_K_M | 13.31 | 13.18 | 1.01× |
| Gemma-3-4B Q4_0 | 12.04 | 12.23 | 0.98× |
| Qwen1.5-MoE-A2.7B Q4_K_M | 14.15 | 10.63 | 1.33× 🚀 |
The MoE case is the headline — 60 experts × multiple Q4_K projections per layer, so dequant is a meaningful chunk of total load wall time. 3.5 seconds saved on a 14 s load.
Dense models sit at 0.97–1.04× — within run-to-run noise. At these sizes file I/O, the converter chain, and module init dominate over per-tensor dequant. Where the kernel pays off in absolute terms is models with lots of expert tensors (Mixtral, Qwen-MoE, DeepSeek-MoE).
Summary
Adds an optional Metal dequant fast path in `GGUFDequantize`. When the `kernels` package is installed and `kernels-community/gguf-dequant` is reachable, MPS GGUF loads (Q4_0 / Q8_0 / Q4_K / Q5_K / Q6_K / IQ4_NL / IQ4_XS) route through one Metal compute kernel per tensor — 2.3–7.5× faster than the chained PyTorch path on M3 Max at real-world tensor sizes. Falls back to the existing pure-torch path on CPU/CUDA, for unsupported quant types, or when the Hub doesn't have a build variant for the current env. Lazy import + opt-out via `TRANSFORMERS_GGUF_USE_METAL_KERNELS=0`. Stacked on #44794.

Kernel currently published at `ArthurZ/gguf-dequant` while the transfer to `kernels-community` is in flight; override with `TRANSFORMERS_GGUF_METAL_KERNELS_REPO=<repo-id>`.

Bench (M3 Max, 8 M elems per case, input already on GPU)
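A measurement harness matching that setup could look like this (assumed sketch; the timed `fn` would be either dequant path, shown here with a placeholder op so the snippet runs standalone):

```python
import time

import torch


def bench_mps(fn, *args, iters: int = 10) -> float:
    """Mean wall time per call, with MPS sync around the timed region."""
    fn(*args)  # warm-up: the first call pays pipeline-compile cost
    torch.mps.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.mps.synchronize()  # MPS launches are async; sync before reading the clock
    return (time.perf_counter() - t0) / iters


# 8M elements, already resident on the GPU (matches the bench setup above).
x = torch.randint(0, 255, (8_000_000,), dtype=torch.uint8, device="mps")
ms = bench_mps(lambda t: t.to(torch.float16) * 0.5, x) * 1e3  # placeholder op
print(f"{ms:.2f} ms/call")
```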
Test plan
- `RUN_SLOW=1 pytest tests/quantization/ggml/test_gguf_load_completeness.py -k Q4_K` with `pip install kernels` + `TRANSFORMERS_GGUF_METAL_KERNELS_REPO=ArthurZ/gguf-dequant`; all 7 cells pass with no missing/unexpected keys.
- Same run with `TRANSFORMERS_GGUF_USE_METAL_KERNELS=0` — confirms the fallback path still works.
- `AutoModelForCausalLM.from_pretrained("TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF", gguf_file="…Q4_K_M.gguf", device_map="mps")` — end-to-end load wall time before/after.