Fix deepseek v4 #45892
Conversation
…eepseek-v4-csa-per-query-mask
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey! I don't get how you want to run sdpa when it doesn't have attention sinks baked into it? You won't ever get good results.
@ArthurZucker Thanks for your hard work.
Sorry everyone, I trusted Claude too much.
That's only human. Keep going!
Cyrilvallez left a comment
Alright, I don't know the specifics of deepseek v4, so I just reviewed the general parts, not the exact mathematics.
I'm mostly a bit worried about all the dtype upcasting everywhere; it's quite expensive in general, both for speed and memory, and I believe some of it is actually useless. And when it's performed on weights directly, let's use keep_in_fp32_strict to always keep the weights in fp32 instead of upcasting them on every forward!
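To make the trade-off concrete, a minimal sketch (hypothetical module names, not the PR's code): upcasting a bf16 parameter inside `forward` materializes a fresh fp32 copy on every call, whereas storing the weight in fp32 once, which is what a `keep_in_fp32`-style flag achieves, pays the memory cost a single time at load.

```python
import torch
from torch import nn

class UpcastEveryForward(nn.Module):
    """Weight stored in bf16; an fp32 copy is allocated on every call."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim, dtype=torch.bfloat16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.float() @ self.weight.float()  # per-forward fp32 materialization

class KeptInFp32(nn.Module):
    """Weight stored in fp32 once; no per-forward upcast of the parameter."""
    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim, dtype=torch.float32))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.float() @ self.weight
```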
[For maintainers] Suggested jobs to run (before merge): run-slow: deepseek_v4
Checked batch version!
Hi, sorry: the sdpa test was just comparing the two mask forms apples-to-apples on a non-sink-aware backend. But please ignore sdpa; the main point I want to make is about the two mask constructions.

Current:

```python
gathered = flat_kv.index_select(0, flat_indices).view(batch, 1, -1, self.head_dim)
block_bias = gathered.new_full((batch, 1, seq_len, seq_len, top_k), float("-inf"))
allowed = torch.where(valid, gathered.new_zeros(()), gathered.new_full((), float("-inf")))
block_bias[:, 0, query_indices, query_indices, :] = allowed
return gathered, block_bias.view(batch, 1, seq_len, seq_len * top_k)
```

Alt:

```python
safe_indices = torch.where(valid, top_k_indices, torch.full_like(top_k_indices, compressed_len))
block_bias = compressed_kv.new_full((batch, 1, seq_len, compressed_len + 1), float("-inf"))
block_bias.scatter_(-1, safe_indices.unsqueeze(1), 0.0)
return compressed_kv, block_bias[..., :compressed_len]
```

Each query's softmax has the same set of unmasked entries under either layout, so the Alt form yields identical attention while skipping the gather and the `seq_len * top_k` bias.
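Here is a minimal, self-contained check of that claim (toy shapes; `top_k_indices` and `valid` are hypothetical stand-ins for the indexer's real outputs): it builds the Alt scatter-based bias and compares it against a brute-force per-query reference.

```python
import torch

torch.manual_seed(0)
batch, seq_len, compressed_len, top_k = 2, 4, 6, 2

# Hypothetical indexer output: top-k compressed-KV slot ids per query, plus a
# validity mask for picks that must stay masked (e.g. beyond the causal frontier).
top_k_indices = torch.randint(0, compressed_len, (batch, seq_len, top_k))
valid = torch.rand(batch, seq_len, top_k) > 0.3

# Alt layout: scatter 0.0 over the full compressed axis, routing invalid picks
# to a sacrificial extra column that is sliced away afterwards.
safe_indices = torch.where(valid, top_k_indices, torch.full_like(top_k_indices, compressed_len))
block_bias = torch.full((batch, 1, seq_len, compressed_len + 1), float("-inf"))
block_bias.scatter_(-1, safe_indices.unsqueeze(1), 0.0)
block_bias = block_bias[..., :compressed_len]

# Brute-force reference: mark exactly the valid picks of each query as attendable.
reference = torch.full((batch, 1, seq_len, compressed_len), float("-inf"))
for b in range(batch):
    for s in range(seq_len):
        for t in range(top_k):
            if valid[b, s, t]:
                reference[b, 0, s, top_k_indices[b, s, t]] = 0.0

assert torch.equal(block_bias, reference)
print("scatter-based bias admits exactly the indexer's valid top-k picks")
```

The sacrificial `compressed_len` column means the scatter never has to branch on validity; slicing it off restores the `[batch, 1, seq_len, compressed_len]` shape the attention expects.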
* up
* small updates
* a mystery how this got through
* updates
* update
* final fixes?
* update
* up
* yup
* last nit!
* up
* update
* revert shitty AI work
* up
* yyups
Regarding the issue I mentioned: it's confirmed. It wasn't opened by me; someone else flagged it, but I've been looking into it alongside the paper and the current implementation. In brief, it occurs in the current forward block; further details are in the issue itself. V4 is, after all, a model unlike the others, so it can't be helped that issues keep surfacing even after fixes.


What does this PR do?
Attention-mask layout per layer type
Tiny-config visualization (`sliding_window=8`, CSA `m=4`, HCA `m'=8`, `index_topk=2`, `S=16`) of the actual `cat([sliding_mask, block_bias])` each `DeepseekV4Attention` layer feeds to `eager_attention_forward`. Green cells = attended-to, dim slate = masked, red = causally available but the indexer's top-k didn't pick. Wide green blocks in the compressor section bundle `m` source positions into one KV slot; dashed separators inside the block show which tokens were compressed together. A toy sketch of the concatenated layout follows at the end of this section.

- Sliding-only: plain sliding-window-causal `[S, S]`, no compressor section.
- CSA: sliding KV + entry view of the Lightning Indexer's top-`k` picks; red cells = available but not picked.
- HCA: sliding KV + every compressed entry (no indexer); each `C_w` summarises `m'` source positions.

Reproduce with `python docs/source/en/imgs/deepseek_v4/visualize_attention_masks.py --svg docs/source/en/imgs/deepseek_v4` from the repo root.
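For readers without the rendered SVGs, here is a toy 2-D sketch (batch and head dims dropped, hypothetical block size `m`, not the PR's code) of how such a concatenated mask can be assembled: a sliding-window-causal `[S, S]` section followed by an HCA-style compressor section where a compressed slot becomes attendable once its source block is fully in the past.

```python
import torch

S, sliding_window, m = 16, 8, 4   # toy config; m = positions bundled per compressed slot
compressed_len = S // m

q = torch.arange(S).unsqueeze(1)  # query positions, column vector
k = torch.arange(S).unsqueeze(0)  # key positions, row vector

# Sliding section: causal AND within the last `sliding_window` positions.
sliding_mask = torch.where(
    (k <= q) & (k > q - sliding_window),
    torch.tensor(0.0), torch.tensor(float("-inf")),
)

# Compressor section: slot w covers source positions [w*m, (w+1)*m - 1] and is
# attendable only once its whole block is at or before the query position.
blocks = torch.arange(compressed_len).unsqueeze(0)
block_end = (blocks + 1) * m - 1
block_bias = torch.where(block_end <= q, torch.tensor(0.0), torch.tensor(float("-inf")))

full_mask = torch.cat([sliding_mask, block_bias], dim=-1)
print(full_mask.shape)  # torch.Size([16, 20]) -> [S, S + compressed_len]
```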