
Fix deepseek v4 #45892

Merged — ArthurZucker merged 16 commits into main from fix-deepseek-v4-csa-per-query-mask, May 12, 2026

Conversation

@ArthurZucker ArthurZucker commented May 11, 2026

What does this PR do?

Attention-mask layout per layer type

Tiny-config visualization (sliding_window=8, CSA m=4, HCA m'=8, index_topk=2, S=16) of the actual cat([sliding_mask, block_bias]) each DeepseekV4Attention layer feeds to eager_attention_forward. Green cells = attended-to, dim slate = masked, red = causally available but the indexer's top-k didn't pick. Wide green blocks in the compressor section bundle m source positions into one KV slot; dashed separators inside the block show which tokens were compressed together.

Sliding-only — plain sliding-window-causal [S, S], no compressor section:

sliding mask

CSA — sliding KV + entry-view of the Lightning Indexer's top-k picks; red cells = available but not picked:

CSA mask

HCA — sliding KV + every compressed entry (no indexer); each C_w summarises m' source positions:

HCA mask

Reproduce with `python docs/source/en/imgs/deepseek_v4/visualize_attention_masks.py --svg docs/source/en/imgs/deepseek_v4` from the repo root.
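For the sliding-only case described above, the [S, S] layout is just a causal mask intersected with a window constraint. A minimal standalone sketch (not the visualization script itself; names and the boolean convention are illustrative):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where query i may attend key j: causal (j <= i)
    # and within the sliding window (i - j < window)
    i = torch.arange(seq_len).unsqueeze(-1)
    j = torch.arange(seq_len)
    return (j <= i) & (i - j < window)

# tiny config from the figures: S=16, sliding_window=8
mask = sliding_window_causal_mask(16, 8)
```

In the CSA/HCA figures this [S, S] block is what gets concatenated with the compressor-section bias along the key axis.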

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Sawyer117 added a commit to Sawyer117/transformers that referenced this pull request May 11, 2026
@Sawyer117

This comment was marked as off-topic.

@Sawyer117

This comment was marked as off-topic.

@ArthurZucker
Collaborator Author

Hey! I don't get how you want to run sdpa when it does not have attention sinks baked into it? You won't ever get good results.

@ArthurZucker ArthurZucker marked this pull request as ready for review May 12, 2026 04:58
@I-hercules

@ArthurZucker Thanks for your hard work.

@ArthurZucker
Collaborator Author

Sorry everyone I trusted Claude too much

@I-hercules

> Sorry everyone I trusted Claude too much

That's only human. Keep going!

Member

@Cyrilvallez Cyrilvallez left a comment


Alright, I don't know the specifics of deepseek v4, so I just reviewed the general parts, not the exact mathematics.
I'm mostly a bit worried about all the dtype upcasting everywhere; it's quite expensive in general, both for speed and memory, and I believe some of it is actually useless. And when it's performed on weights directly, let's use keep_in_fp32_strict to always keep the weights in fp32 instead of upcasting them every forward!
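The cost concern here is the repeated cast, not the arithmetic itself. A toy sketch of the two patterns (plain torch, not the transformers mechanism the reviewer mentions):

```python
import torch

w = torch.randn(4, 4, dtype=torch.bfloat16)  # a "weight" stored in low precision
x = torch.randn(4, dtype=torch.float32)

# pattern 1: upcast inside every forward call (pays a cast each time)
y1 = x @ w.float().T

# pattern 2: upcast once at load time and keep the fp32 copy around
w32 = w.float()
y2 = x @ w32.T
```

Both produce identical results; the second just moves the cast out of the hot path, which is what keeping the weight in fp32 from the start achieves.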

Six review threads on src/transformers/models/deepseek_v4/modular_deepseek_v4.py (five marked outdated).
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: deepseek_v4

@ArthurZucker
Collaborator Author

Checked batch version!

@ArthurZucker ArthurZucker merged commit a1b77cc into main May 12, 2026
41 checks passed
@ArthurZucker ArthurZucker deleted the fix-deepseek-v4-csa-per-query-mask branch May 12, 2026 08:26
@Sawyer117
Contributor

> Hey! I don't get how you want to run sdpa when it does not have attention sinks baked into it? You won't ever get good results.

Hi, sorry: the sdpa test was just comparing the two mask forms apples-to-apples on a non-sink-aware backend. Anyway, please ignore sdpa; the main point I want to make is that the [S, S*top_k] mask is unnecessary. [S, compressed_len] is mathematically identical with better performance (both speed and memory).

Current (DeepseekV4CSACompressor.forward):

# gathers the top_k picks per query, then builds a dense
# [S, S, top_k] bias (mostly -inf) flattened to [S, S * top_k]
gathered = flat_kv.index_select(0, flat_indices).view(batch, 1, -1, self.head_dim)
block_bias = gathered.new_full((batch, 1, seq_len, seq_len, top_k), float("-inf"))
allowed = torch.where(valid, gathered.new_zeros(()), gathered.new_full((), float("-inf")))
block_bias[:, 0, query_indices, query_indices, :] = allowed
return gathered, block_bias.view(batch, 1, seq_len, seq_len * top_k)

Alt:

# clamp invalid picks to a sacrificial extra column, scatter zeros
# at the picked indices, then drop the extra column
safe_indices = torch.where(valid, top_k_indices, torch.full_like(top_k_indices, compressed_len))
block_bias = compressed_kv.new_full((batch, 1, seq_len, compressed_len + 1), float("-inf"))
block_bias.scatter_(-1, safe_indices.unsqueeze(1), 0.0)
return compressed_kv, block_bias[..., :compressed_len]

Each query's softmax has the same top_k non-(-inf) entries either way (the same indices into the compressed_kv axis, the same dot-product values). Only the row width differs: a dense S*top_k row mostly filled with -inf versus a compressed_len row with top_k zero (allowed) entries.
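A quick standalone way to convince oneself of the compact form's behavior (made-up shapes and logits, not the PR's actual tensors): after scattering zeros at the picked indices, all softmax mass lands exactly on the picks and nowhere else.

```python
import torch

torch.manual_seed(0)
S, compressed_len, top_k = 6, 4, 2

# made-up logits of each query against each compressed KV entry
scores = torch.randn(S, compressed_len)
top_k_indices = torch.randint(0, compressed_len, (S, top_k))
valid = torch.ones(S, top_k, dtype=torch.bool)

# compact bias: one row of width compressed_len per query,
# 0.0 at the indexer's picks, -inf everywhere else
safe_indices = torch.where(valid, top_k_indices,
                           torch.full_like(top_k_indices, compressed_len))
block_bias = scores.new_full((S, compressed_len + 1), float("-inf"))
block_bias.scatter_(-1, safe_indices, 0.0)
block_bias = block_bias[:, :compressed_len]

probs = torch.softmax(scores + block_bias, dim=-1)

# boolean view of the picks, to check where the mass can land
picked = torch.zeros(S, compressed_len, dtype=torch.bool)
picked.scatter_(1, top_k_indices, True)
```

Since exp(-inf) is exactly 0, the probability at every non-picked slot is exactly 0, matching what the dense [S, S*top_k] layout would produce at its top_k finite entries.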

ArthurZucker added a commit that referenced this pull request May 13, 2026
* up

* small updates

* a mistery how this got through

* updates

* update

* final fixes?

* update

* up

* yup

* last nit!

* up

* update

* revert shitty AI work

* up

* yyups
@ArjunSrivastava1

Regarding the issue I mentioned: it's confirmed. It wasn't opened by me (it was flagged by someone else), but I've been looking at it, the paper, and the current implementation.

A brief summary of where this occurs in the current forward cell block:

Block: [E, F, G, PAD]

1. Softmax runs over all 4 positions (including PAD)
2. PAD contributes to the compressed entry
3. Polluted KV entry → incorrect attention

Further details are in the issue itself. V4 is, after all, a model unlike others; it can't be helped that issues keep surfacing even after fixes.
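The pollution and the usual remedy are easy to see in isolation. A toy sketch of one compression block (hypothetical softmax-weighted pooling, invented names; the real compressor may pool differently):

```python
import torch

torch.manual_seed(0)
m, d = 4, 8  # block size and head dim, made up

block = torch.randn(m, d)  # rows stand for [E, F, G, PAD]
pad_mask = torch.tensor([False, False, False, True])
logits = torch.randn(m)    # made-up pooling logits over the block

# buggy: PAD participates in the pooling softmax,
# so its row leaks into the compressed KV entry
polluted = torch.softmax(logits, dim=0) @ block

# fixed: mask PAD to -inf before the softmax so its weight is exactly 0
masked_logits = logits.masked_fill(pad_mask, float("-inf"))
clean_weights = torch.softmax(masked_logits, dim=0)
clean = clean_weights @ block
```

With the mask applied, PAD's pooling weight is exactly 0 and the compressed entry is a convex combination of the real tokens only.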

@ArjunSrivastava1

Seeing as there's also a bit of a lack of diagrams in this very PR, which is focused on solving v4 issues, and I imagine we may need them, I'm again attaching a few here for everyone's reference, taken from the paper itself.

Fig 1: CSA architecture (screenshot from the paper)
Fig 2: HCA architecture (screenshot from the paper)

This provides a more complete picture for whoever opens and uses this PR.


Labels

for patch Tag issues / labels that should be included in the next patch


6 participants