Releases: huggingface/transformers
Patch release v5.8.1
This release mainly fixes the DeepSeek V4 integration!!!
- [fix] Add fatal_error to ContinuousBatchingManager so the serving... by @qgallouedec, @remi-or
- Fix WeightConverter regex incorrectly matching shared_experts as experts by @silencelamb, @claude
- Fix deepseek v4 by @ArthurZucker (#45892)
- Deepseek v4 csa mask collapse by @ArthurZucker, @Sawyer117 (#45928)
Release v5.8.0
New Model additions
DeepSeek-V4
DeepSeek-V4 is the next-generation MoE (Mixture of Experts) language model from DeepSeek that introduces several architectural innovations over DeepSeek-V3. The architecture replaces Multi-head Latent Attention (MLA) with a hybrid local + long-range attention design, swaps residual connections for Manifold-Constrained Hyper-Connections (mHC), and bootstraps the first few MoE layers with a static token-id → expert-id hash table. This implementation covers DeepSeek-V4-Flash, DeepSeek-V4-Pro, and their -Base pretrained variants, which share the same architecture but differ in width, depth, expert count and weights.
Links: Documentation | Paper
- Add DeepSeek V4 (#45643) by @ArthurZucker in #45643
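As with other text-generation models, DeepSeek-V4 is used through the standard auto classes. A minimal loading sketch, with the checkpoint id hypothetical (the released repo names may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint id -- substitute the actual DeepSeek-V4 repo name.
checkpoint = "deepseek-ai/DeepSeek-V4-Flash"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, dtype="auto", device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```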
Gemma 4 Assistant
Gemma 4 Assistant is a small, text-only model that enables speculative decoding for Gemma 4 models using the Multi-Token Prediction (MTP) method and associated candidate generator. The model shares the same Gemma4TextModel backbone as other Gemma 4 models but uses KV sharing throughout the entire model, allowing it to reuse the KV cache populated by the target model and skip the pre-fill phase entirely. This architecture includes cross-attention to make the most of the target model's context, allowing the assistant to accurately predict more drafted tokens per drafting round.
Links: Documentation
- First model (#45788) by @SindhuRaghuram97 in #45788
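The assistant plugs into the existing assisted-generation API: pass it as assistant_model to generate. A minimal sketch, with both checkpoint ids hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint ids -- substitute the released Gemma 4 repo names.
target_id = "google/gemma-4-27b-it"
assistant_id = "google/gemma-4-assistant"

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, dtype="auto", device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(assistant_id, dtype="auto", device_map="auto")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt").to(target.device)
# The assistant drafts candidate tokens; the target verifies them in one forward pass.
outputs = target.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```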
GraniteSpeechPlus
Granite Speech Plus is a variant of Granite Speech that enhances the projector by consuming the concatenation of the encoder's final hidden states with an arbitrary subset of its intermediate hidden states along the feature dimension. It is a multimodal speech-to-text model that can transcribe audio, provide speaker annotation and word level timestamps by responding to text prompts. The model inherits the same architecture components as Granite Speech including the speech encoder, query transformer projector, language model, and optional LoRA adapter.
Links: Documentation
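- Support for a new Granite-Speech-Plus model (#45695) by @zvik in #45695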
Granite4Vision
Granite Vision 4.1 is a vision-language model from IBM Research designed for enterprise-grade document data extraction. It specializes in chart extraction (Chart2CSV, Chart2Summary, Chart2Code), table extraction (JSON, HTML, OTSL), and semantic key-value pair extraction. The model builds on LLaVA-NeXT with architectural innovations including SigLIP2 Vision Encoder, Window Q-Former Projectors, and DeepStack Feature Injection with 8 vision-to-LLM injection points.
Links: Documentation
- Add Granite 4.1 Vision (granite4_vision) (#45597) by @artem-spector in #45597
EXAONE-4.5
EXAONE 4.5 is the first open-weight vision language model developed by LG AI Research, integrating a dedicated visual encoder into the existing EXAONE 4.0 framework to expand multimodal capabilities. The model features 33 billion parameters in total, including 1.2 billion parameters from the vision encoder, and achieves competitive performance in general benchmarks while outperforming similar-sized models in document understanding and Korean contextual reasoning. It builds on EXAONE 4.0 with key enhancements including an expanded vocabulary of 153,600 tokens, support for up to 256K token context windows, and a Multi-Token Prediction (MTP) mechanism.
Links: Documentation | Paper | Blog Post
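- Add EXAONE 4.5 implementations (#45471) by @nuxlear in #45471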
PP-FormulaNet
PP-FormulaNet-L and PP-FormulaNet_plus-L are lightweight models designed for formula recognition, focusing on accurately recognizing mathematical formulas in documents and natural scenes. The models were developed by the Baidu PaddlePaddle team and can be used for image-to-text tasks, specifically for detecting and converting mathematical formulas from images into structured text.
Links: Documentation
- [Model] Add PP-FormulaNet Model Support (#45626) by @zhang-prog in #45626
Breaking changes
Apex integration has been removed from the library (including RMSNorm usage in T5 and related models), so users relying on Apex for mixed precision or fused ops should migrate to PyTorch's native equivalents instead.
- 🚨 Get rid of most Apex references (#45723) by @Rocketknight1
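For RMSNorm specifically, the native equivalent is torch.nn.RMSNorm (available since PyTorch 2.4). A minimal migration sketch, assuming that PyTorch version:

```python
import torch

# Before (Apex): fused RMSNorm from apex.normalization
# from apex.normalization import FusedRMSNorm
# norm = FusedRMSNorm(hidden_size, eps=1e-6)

# After: PyTorch's native RMSNorm (available since torch 2.4)
hidden_size = 4096
norm = torch.nn.RMSNorm(hidden_size, eps=1e-6)

x = torch.randn(2, 8, hidden_size)
print(norm(x).shape)  # torch.Size([2, 8, 4096])
```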
Tokenization
Fixed tokenizer mapping issues for DeepSeek R1 distilled (Qwen2) and DeepSeek OCR models, and resolved a significant performance regression in PreTrainedTokenizer.convert_ids_to_tokens, where skip_special_tokens=True rebuilt the special token set on every iteration; the fix yields a ~300x speedup for that code path (a sketch of the pattern follows the list below).
- deepseek r1 distilled tokenizer fix for qwen2 mapping (#45741) by @itazap in [#45741]
- DeepSeek OCR specifies an incorrect tokenizer class on the Hub (#45739) by @hmellor in [#45739]
- PythonBackend slow tokenizer convert_ids_to_tokens fix (#45728) by @i3hz in [#45728]
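The convert_ids_to_tokens regression was a classic hoisting issue; a generic sketch of the pattern (not the library's exact code):

```python
# Slow: the special-token set is rebuilt on every iteration of the loop.
def convert_slow(ids, tokenizer):
    return [
        tokenizer._convert_id_to_token(i)
        for i in ids
        if i not in set(tokenizer.all_special_ids)  # O(n) set built per token
    ]

# Fast: build the set once, then reuse it -- same result, ~300x faster here.
def convert_fast(ids, tokenizer):
    special_ids = set(tokenizer.all_special_ids)
    return [tokenizer._convert_id_to_token(i) for i in ids if i not in special_ids]
```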
Bugfixes and improvements
- fix: correct spelling in continuous_api docstring (#45749) by @Dhruv908615 in [#45749]
- Fix link to modular transformers documentation (#45746) by @SangbumChoi in [#45746]
- Gemma4: fix failed test cases (#45568) by @kaixuanliu in [#45568]
- Fix CI: Allow more artifacts to be download in CI (#45785) by @ydshieh in [#45785]
- Add concurrency to PR CI workflow file (pr-ci-caller.yml) (#45786) by @ydshieh in [#45786]
- Reorder decorators for autodoc and dataclass (#45702) by @zucchini-nlp in [#45702]
- Unwrap text_config in AutoModelFor*.from_config (#45770) by @jamesbraza in [#45770]
- fix: Added Mps support in float fallback backends list (#45687) by @rigen1048 in [#45687]
- Github Actions PR CI (caller) (#45476) by @ydshieh in [#45476]
- make sure we call check_auto in CI (#45775) by @tarekziade in [#45775]
- Fix auto mapping script (#45774) by @Cyrilvallez in [#45774]
- [MINISTRAL3] Fix conversion script yarn's apply_scale support. (#45744) by @juliendenize in [#45744]
- [nemotron_h] respect _no_reinit flag on dt_bias and out_proj.weight (#45591) by @vai-minzhou in [#45591]
- fix(utils): Resolve backbone utils test regressions (#45594) by @harshaljanjani in [#45594]
- [CB] Better overall script and decode bucketting (#45653) by @remi-or in [#45653]
- [docs] model testing (#45152) by @stevhliu in [#45152]
- update dev (#45726) by @vasqu in [#45726]
- Doc translate to Persian(farsi) (#45664) by @zeoses in [#45664]
- [OAI Privacy Filter] Add integration test (#45725) by @vasqu in [#45725]
- Speedup Qwen2VLImageProcessor (#45719) by @lgeiger in [#45719]
- Remove dead beam-search dummies from dummy_pt_objects.py (#45722) by @jw9603 in [#45722]
- chore(typing): add ty type checking for 10 utility files (#45703) by @moonbogi in [#45703]
- Llama3 video fix (#45040) by @sywangyi in [#45040]
- Fix custom-module copies inheriting read-only permissions (#45686) by @nurpax in [#45686]
- Python code in model docs (#45608) by @zucchini-nlp in [#45608]
- fix failed test cases for blt model (#45596) by @kaixuanliu in [#45596]
- chore(typing): add ty type checking for 3 pipeline files (#45667) by @moonbogi in [#45667]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @artem-spector
- Add Granite 4.1 Vision (granite4_vision) (#45597)
- @SindhuRaghuram97
- First model (#45788)
- @nuxlear
- Add EXAONE 4.5 implementations (#45471)
- @ArthurZucker
- Add DeepSeek V4 (#45643)
- @remi-or
- [CB] Better overall script and decode bucketting (#45653)
- @zhang-prog
- [Model] Add PP-FormulaNet Model Support (#45626)
- @zvik
- Support for a new Granite-Speech-Plus model (#45695)
Release v5.7.0
New Model additions
Laguna
Laguna is Poolside's mixture-of-experts language model family that extends standard SwiGLU MoE transformers with two key innovations. It features per-layer head counts allowing different decoder layers to have different query-head counts while sharing the same KV cache shape, and implements a sigmoid MoE router with auxiliary-loss-free load balancing that uses element-wise sigmoid of gate logits plus learned per-expert bias for router scoring.
Links: Documentation
- Laguna XS.2 implementation (#45673) by @joerowell in #45673
DEIMv2
DEIMv2 (DETR with Improved Matching v2) is a real-time object detection model that extends DEIM with DINOv3 features and spans eight model sizes from X to Atto for diverse deployment scenarios. It uses a Spatial Tuning Adapter (STA) for larger variants to convert DINOv3's single-scale output into multi-scale features, while ultra-lightweight models employ pruned HGNetv2 backbones. The unified design achieves superior performance-cost trade-offs, with DEIMv2-X reaching 57.8 AP with only 50.3M parameters and DEIMv2-S being the first sub-10M model to exceed 50 AP on COCO.
Links: Documentation | Paper
- model: Add DEIMv2 to Transformers (#44339) by @harshaljanjani in #44339
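DEIMv2 works with the standard object-detection pipeline. A minimal sketch, with the checkpoint id hypothetical:

```python
from transformers import pipeline

# Hypothetical checkpoint id -- substitute the released DEIMv2 repo name.
detector = pipeline("object-detection", model="<org>/deimv2-s-coco")

results = detector("http://images.cocodataset.org/val2017/000000039769.jpg")
for result in results:
    print(result["label"], round(result["score"], 3), result["box"])
```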
Attention
Several attention-related bugs were fixed across multiple models, including a cross-attention cache type error in T5Gemma2 for long inputs, incorrect cached forward behavior in Qwen3.5's gated-delta-net linear attention, and a crash in GraniteMoeHybrid when no Mamba layers are present. Attention function dispatch was also updated to align with the latest model implementations; an example of selecting an implementation follows the list.
- Fix cross-attention cache layer type for T5Gemma2 long inputs (#45540) by @Beichen-Ma in [#45540]
- [Qwen3.5] Fix GDN linear attention multi-token cached forward (#45513) by @kashif in [#45513]
- Fix GraniteMoeHybrid _update_mamba_mask crash on attention-only models (#45514) by @tianhaocui in [#45514]
- Align latest model attention function dispatch (#45598) by @Cyrilvallez in [#45598]
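For reference, the dispatch in question routes to the attention backend requested at load time; selecting one explicitly looks like this (model id arbitrary):

```python
from transformers import AutoModelForCausalLM

# Request a specific attention backend; "sdpa" (PyTorch's
# scaled_dot_product_attention) is the usual default when available.
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    attn_implementation="sdpa",  # or "eager", or "flash_attention_2" where supported
)
```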
Tokenizers
A bug in AutoTokenizer caused the wrong tokenizer class to be initialized, leading to regressions in models like DeepSeek R1.
Generation
Continuous batching generation received several fixes and improvements, including correcting KV deduplication and memory estimation for long sequences (16K+), and removing misleading warnings about num_return_sequences and other unsupported features that were incorrectly firing even when functionality worked correctly. Documentation for per-request sampling parameters was also added.
- generate: drop stale num_return_sequences warning on continuous batching path (#45582) by @joaquinhuigomez in [#45582]
- Remove unnecessary generate warnings (#45619) by @Cyrilvallez in [#45619]
- [CB] Changes for long generation (#45530) by @remi-or in [#45530]
- [docs] per-request sampling params (#45553) by @stevhliu in [#45553]
Kernels
Improved kernel support by fixing configuration reading and error handling for FP8 checkpoints (e.g., Qwen3.5-35B-A3B-FP8), enabling custom expert kernels registered from the HF Hub to be properly loaded, and resolving an incompatibility that prevented Gemma3n and Gemma4 from using the rotary kernel.
- Fix configuration reading and error handling for kernels (#45610) by @hmellor in [#45610]
- Allow for registered experts from kernels hub (#45577) by @winglian in [#45577]
- Gemma3n and Gemma4 cannot use rotary kernel (#45564) by @Cyrilvallez in [#45564]
Bugfixes and improvements
- fixing more typos (#45689) by @vasqu in [#45689]
- [docs] cb memory management (#45587) by @stevhliu in [#45587]
- [docs] cpu offloading (#45660) by @stevhliu in [#45660]
- docs(README_zh-hans): clarify conditions for not using Transformers (#45688) by @GuaiZai233 in [#45688]
- fix padding side issue for fast_vlm tests (#45592) by @kaixuanliu in [#45592]
- Fix x_clip: 8 failed test cases (#45394) by @kaixuanliu in [#45394]
- zero_shot_object_detection ValueError fix for python 3.13 (#45669) by @AnkitAhlawat7742 in [#45669]
- Fix pageable H2D copies in Gated DeltaNet PyTorch fallback (#45665) by @ruixiang63 in [#45665]
- Fix UnboundLocalError in shard_and_distribute_module for replicated parameters (#45675) by @Abdennacer-Badaoui in [#45675]
- [MistralCommonBackend] Soften validation mode and apply_chat_template arguments check (#45628) by @juliendenize in [#45628]
- Fix NameError: PeftConfigLike triggered by PreTrainedModel.__init_subclass__ (#45658) by @qgallouedec in [#45658]
- chore(typing): added modeling_utils to ty (#45425) by @tarekziade in [#45425]
- [gemma4] infer from config instead of hardcoding (#45606) by @eustlb in [#45606]
- Update quants tests (#45480) by @SunMarc in [#45480]
- 🔴🔴🔴 fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast (#44915) by @maxsloef-goodfire in [#44915]
- Fix colmodernvbert tests (#45652) by @Cyrilvallez in [#45652]
- [CB] [Major] Add CPU request offloading (#45184) by @remi-or in [#45184]
- Fix peft constructors (#45622) by @Cyrilvallez in [#45622]
- chore: speedup modular converter (~30%) (#45046) by @tarekziade in [#45046]
- Fix whisper return language (#42227) by @FredHaa in [#42227]
- Add supports_gradient_checkpointing to NemotronHPreTrainedModel (#45625) by @sergiopaniego in [#45625]
- Raise clear error for problem_type="single_label_classification" with num_labels=1 (#45611) by @gaurav0107 in [#45611]
- CircleCI with torch 2.11 (#45633) by @ydshieh in [#45633]
- chore: bump doc-builder SHA for main doc build workflow (#45631) by @rtrompier in [#45631]
- Allow more artifacts to be download in CI (#45629) by @ydshieh in [#45629]
- chore(qa): split pipeline and add type checking (#45432) by @tarekziade in [#45432]
- Skip failing offloading tests (#45624) by @Cyrilvallez in [#45624]
- fix: compute auxiliary losses when denoising is disabled in D-FINE (#45601) by @Abineshabee in [#45601]
- qa: bumped mlinter and allow local override (#45585) by @tarekziade in [#45585]
- Processing Utils: continue when content is a string (#45605) by @RyanMullins in [#45605]
- SonicMoe (#45433) by @IlyasMoutawwakil in [#45433]
- fix transformers + torchao nvfp4 serialization (#45573) by @vkuzo in [#45573]
- [AMD CI] Fix expectations for Gemma3n (#45602) by @Abdennacer-Badaoui in [#45602]
- [docs] multi-turn tool calling (#45554) by @stevhliu in [#45554]
- Fix AttributeError on s_aux=None in flash_attention_forward (#45589) by @jamesbraza in [#45589]
- do not index past decoded chars with special tokens (#45435) by @itazap in [#45435]
- Update dev version (#45583) by @vasqu in [#45583]
- Update torchao usage for XPU and CPU (#45560) by @jiqing-feng in [#45560]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @vasqu
- @joerowell
- Laguna XS.2 implementation (#45673)
- @tarekziade
- @harshaljanjani
- model: Add DEIMv2 to Transformers (#44339)
- @remi-or
Patch release v5.6.2
Qwen 3.5 and 3.6 MoE (text-only) were broken when used with FP8. They should now work again with this 🫡
Full Changelog: v5.6.1...v5.6.2
Patch release v5.6.1
Flash attention path was broken! Sorry everyone for this one 🤗
- Fix AttributeError on s_aux=None in flash_attention_forward (#45589) by @jamesbraza
Release v5.6.0
New Model additions
OpenAI Privacy Filter
OpenAI Privacy Filter is a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text. It is intended for high-throughput data sanitization workflows where teams need a fast, context-aware, and tunable model that they can run on-premises. The model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure, predicting probability distributions over 8 privacy-related output categories for each input token.
Links: Documentation
QianfanOCR
Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by Baidu that performs direct image-to-text conversion without traditional multi-stage OCR pipelines. It supports a broad range of prompt-driven tasks including structured document parsing, table extraction, chart understanding, document question answering, and key information extraction, all within one unified model. The model features a unique "Layout-as-Thought" capability that generates structured layout representations before producing final outputs, making it particularly effective for complex documents with mixed element types.
Links: Documentation | Paper
SAM3-LiteText
SAM3-LiteText is a lightweight variant of SAM3 that replaces the heavy SAM3 text encoder (353M parameters) with a compact MobileCLIP-based text encoder optimized through knowledge distillation, while keeping the SAM3 ViT-H image encoder intact. This reduces text encoder parameters by up to 88% while maintaining segmentation performance comparable to the original model. The model enables efficient vision-language segmentation by addressing the redundancy found in text prompting for segmentation tasks.
Links: Documentation | Paper
- Add SAM3-LiteText (#44320) by @NielsRogge in #44320
SLANet
SLANet and SLANet_plus are lightweight models designed for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The model improves accuracy and inference speed by adopting a CPU-friendly lightweight backbone network PP-LCNet, a high-low-level feature fusion module CSP-PAN, and a feature decoding module SLA Head that aligns structural and positional information. SLANet was developed by Baidu PaddlePaddle Vision Team as part of their table structure recognition solutions.
Links: Documentation
- [Model] Add SLANet Model Support (#45532) by @zhang-prog in #45532
Breaking changes
The internal rotary_fn is no longer registered as a hidden kernel function, so any code referencing self.rotary_fn(...) within an Attention module will break and must be updated to call the function directly instead.
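A minimal sketch of the migration, using the Llama-style helper for illustration; the exact function and import path depend on the model:

```python
# Before: the rotary function was attached to the attention module
# as a hidden kernel function.
#   query, key = self.rotary_fn(query, key, cos, sin)

# After: import and call the model's rotary embedding function directly,
# e.g. for Llama-style models:
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb

#   query, key = apply_rotary_pos_emb(query, key, cos, sin)
```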
Serve
The transformers serve command received several enhancements, including a new /v1/completions endpoint for legacy text completion, multimodal support for audio and video inputs, improved tool-calling via parse_response, proper forwarding of tool_calls/tool_call_id fields, a 400 error on model mismatch when the server is pinned to a specific model, and fixes for the response API. Documentation was also updated to cover new serving options such as --compile and --model-timeout. A request example for the new endpoint follows the list.
- Add /v1/completions endpoint (OpenAI legacy completions API) to transformers serve (#44558) by @rain-1 in [#44558]
- Updated the image cache for Paddle models according to the latest API (#45562) by @zhang-prog in [#45562]
- Raise 400 on model mismatch when transformers serve is pinned (#45443) by @qgallouedec in [#45443]
- [serve] Update tool call to switch to parse_response (#45485) by @SunMarc in [#45485]
- Fix response api support (#45463) by @SunMarc in [#45463]
- [serve] Forward tool_calls/tool_call_id in processor inputs (#45418) by @qgallouedec in [#45418]
- refactor(qa): extend extras so ty can run on server modules (#45456) by @tarekziade in [#45456]
- Multimodal serve support (#45220) by @SunMarc in [#45220]
- [docs] transformers serve (#45174) by @stevhliu in [#45174]
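Calling the new legacy completions endpoint looks like any OpenAI-style completions request; a minimal sketch, assuming a server started locally with transformers serve (adjust host, port, and model to your setup):

```python
import requests

# Assumes `transformers serve` is running locally; adjust the base URL to your setup.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "Qwen/Qwen2.5-0.5B-Instruct",  # any model the server can load
        "prompt": "The legacy completions API returns",
        "max_tokens": 32,
    },
)
print(resp.json())
```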
Vision
Several vision-related bug fixes were applied in this release, including correcting Qwen2.5-VL temporal RoPE scaling for still images, fixing missing/mismatched image processor backends for Emu3 and BLIP, resolving modular image processor class duplication, and preventing accelerate from incorrectly splitting vision encoders in PeVideo/PeAudioVideo models. Image loading performance was also improved by leveraging torchvision's native decode_image in the torchvision backend, yielding up to ~17% speedup over PIL-based loading (a short sketch follows the list).
- Revert "Fix: modular image processors (#45492)" (#45531) by @tarekziade in [#45531]
- Fix: modular image processors (#45492) by @zucchini-nlp in [#45492]
- fix: prevent accelerate from splitting vision encoder by setting no… (#43047) in [#43047]
- Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330) by @Kash6 in [#45330]
- Use torchvision decode_image to load images in the torchvision backend (#45195) by @yonigozlan in [#45195]
- Fix missing image processors backends (#45165) by @zucchini-nlp in [#45165]
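Roughly, the faster path decodes the encoded bytes straight to a tensor instead of round-tripping through PIL; a minimal sketch of the idea:

```python
import torch
from torchvision.io import decode_image

# Decode straight from the encoded bytes to a (C, H, W) uint8 tensor,
# skipping the PIL round-trip entirely.
with open("image.jpg", "rb") as f:
    data = torch.frombuffer(bytearray(f.read()), dtype=torch.uint8)

img = decode_image(data)
print(img.shape, img.dtype)  # e.g. torch.Size([3, 480, 640]) torch.uint8
```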
Parallelization
Fixed several bugs affecting distributed training, including silently wrong results or NaN loss with Expert Parallelism, NaN weights on non-rank-0 FSDP processes, and a resize failure in PP-DocLayoutV3; additionally added support for loading adapters with Tensor Parallelism, added MoE to the Gemma4 TP plan, and published documentation for TP training. A minimal TP loading example follows the list.
- Fix EP: RouterParallel shape, tp_plan property, grouped_mm sentinels (#45473) by @AmineDiro in [#45473]
- Fix NaN weights on non-rank-0 FSDP processes (#45050) by @albertvillanova in [#45050]
- Load adapter with TP (#45155) by @michaelbenayoun in [#45155]
- [docs] tp training (#44613) by @stevhliu in [#44613]
- Fix resize failure caused by zero-sized masks in PP-DocLayoutV3 (#45281) by @zhang-prog in [#45281]
- Add MoE to Gemma4 TP plan (#45219) by @sywangyi in [#45219]
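For reference, tensor parallelism is requested at load time and run under torchrun; a minimal sketch, with the checkpoint id illustrative:

```python
# Launch with: torchrun --nproc-per-node 4 tp_example.py
from transformers import AutoModelForCausalLM

# tp_plan="auto" shards supported layers across the available ranks.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # illustrative checkpoint
    tp_plan="auto",
    dtype="auto",
)
```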
Tokenization
Fixed a docstring typo in streamer classes, resolved a Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError, and patched a streaming generation crash for Qwen3VLProcessor caused by incorrect _tokenizer attribute access. Additional housekeeping included moving the GPT-SW3 instruct tokenizer to an internal testing repo and fixing a global state leak in the tokenizer registry during tests.
- [Doc] Fix 'tokenized' -> 'tokenizer' typo in streamer docstrings (#45508) by @avasis-ai in [#45508]
- Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError (#45359) by @ArthurZucker in [#45359]
- fix(serving): resolve rust tokenizer from ProcessorMixin in streaming generation (#45368) by @sharziki in [#45368]
- [Tokenizers] Move gpt sw3 tokenizer out (#45404) by @vasqu in [#45404]
- fix: leak in tokenizer registry for test_processors (#45318) by @tarekziade in [#45318]
Cache
Cache handling was improved for Gemma4 and Gemma3n models by dissociating KV state sharing from the Cache class, ensuring KV states are always shared regardless of whether a Cache is used. Additionally, the image cache for Paddle models was updated to align with the latest API.
- Align gemma3n cache sharing to gemma4 (#45489) by @Cyrilvallez in [#45489]
- remove cache file from tree (#45392) by @tarekziade in [#45392]
- [gemma4] Dissociate kv states sharing from the Cache (#45312) by @Cyrilvallez in [#45312]
Audio
Audio models gained vLLM compatibility through targeted fixes across several model implementations. Reliability was also improved with exponential back-off retries for audio file downloads (sketched after the list), a crash fix in the text-to-speech pipeline when generation configs contain None values, and corrected test failures for Kyutai Speech-To-Text.
- feat[vLLM × v5]: Add vLLM compatibility for audio models (#45326) by @harshaljanjani in [#45326]
- http retries on audio file downloads (#45126) by @tarekziade in [#45126]
- fix(testing): Fix Kyutai Speech-To-Text and LongCatFlash test failures on main CI (#44695) by @harshaljanjani in [#44695]
- Fix text-to-speech pipeline crash when generation config contains None values (#45107) by @jiqing-feng in [#45107]
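The retry behavior follows the usual exponential back-off pattern; a generic sketch of the idea (not the library's exact code):

```python
import time
import urllib.request

def fetch_with_retries(url, max_retries=5, base_delay=1.0):
    """Retry a download with exponential back-off between attempts."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except OSError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```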
Bugfixes and improvements
- [Privacy Filter] Add model (#45580) by @vasqu in [#45580]
- Add ForSequenceClassification heads for the OLMo family (#45551) by @earino in [#45551]
- Add IndexCache support for GLM5 DSA (#45424) by @louzongzhi in [#45424]
- Fix redundant logic in video processing SmolVLM (#45272) by @yonigozlan in [#45272]
- Fix typos (#45574) by @vasqu in [#45574]
- [Model] Add SLANet Model Support (#45532) by @zhang-prog in [#45532]
- refactor(Dots1): drop Dots1MoE override to pass (inherits from DSV3 MoE) (#45572) by @casinca in [#45572]
- perf: avoid recomputing rotary_emb for each layer in some Google and ModernBERT models (#45555) by @casinca in [#45555]
- Gemma4 training with t...
Patch release v5.5.4
This release contains fixes that are good to have ASAP, mostly for tokenizers:
- Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex Attribute… (#45305) by @ArthurZucker
For training:
- Fix #45305 + add regression test GAS (#45349) by @florian6973, @SunMarc
- Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active (#…) by @ArthurZucker
And for Qwen2.5-VL:
- Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330) by @Kash6, @zucchini-nlp
Patch release: v5.5.3
Small patch release to fix device_map support for Gemma4! It contains the following commit:
- [gemma4] Fix device map auto (#45347) by @Cyrilvallez
Patch release: v5.5.2
Small patch dedicated to Gemma4: it fixes inference with use_cache=False (which was broken by k/v state sharing between layers) and repairs conversion mappings for some models that would inconsistently serialize their weight names. It contains the following PRs:
- Add MoE to Gemma4 TP plan (#45219) by @sywangyi and @Cyrilvallez
- [gemma4] Dissociate kv states sharing from the Cache (#45312) by @Cyrilvallez
- [gemma4] Remove all shared weights, and silently skip them during loading (#45336) by @Cyrilvallez
- Fix conversion mappings for vlms (#45340) by @Cyrilvallez
Patch release v5.5.1
This patch is very small and focuses on vLLM and Gemma4!
- Fix export for gemma4 and add Integration tests (#45285) by @Cyrilvallez
- Fix vllm cis (#45139) by @ArthurZucker