
fix: Added Mps support in float fallback backends list #45687

Merged
IlyasMoutawwakil merged 12 commits into huggingface:main from rigen1048:OSI
May 5, 2026

Conversation

@rigen1048 (Contributor) commented Apr 28, 2026

What does this PR do?

Fixes a NotImplementedError: "histogram_mps" not implemented for 'Int' raised when running Mixture-of-Experts (MoE) models on Apple Silicon (the MPS backend).
The error occurred in src/transformers/integrations/moe.py because torch.histc does not support integer dtypes on MPS, and the original condition did not account for that backend. This PR changes the logic to check specifically for CUDA, since float operations are supported more reliably across a wider range of hardware, including legacy devices.

Before:

  • Use float on CPU
  • Use int on all other backends (CUDA, MPS, XPU, TPU, etc.)

After:

  • Use int on CUDA (best performance)
  • Use float32 on all other backends (CPU, MPS, XPU, etc.)

Fixes #45685
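The before/after behavior described above can be sketched as a single dtype-selection helper. This is an illustrative sketch, not the actual code in src/transformers/integrations/moe.py, and choose_histc_dtype is a hypothetical name:

```python
def choose_histc_dtype(device_type: str) -> str:
    """Pick the dtype fed to torch.histc based on the device backend.

    Sketch of the logic described in this PR: int is kept on CUDA, where
    it is known to work and gives the best performance; every other
    backend gets float32, because torch.histc rejects integer inputs on
    MPS ("histogram_mps" not implemented for 'Int') and on some other
    backends.
    """
    return "int64" if device_type == "cuda" else "float32"
```

A caller would pass `tensor.device.type` (e.g. "cuda", "mps", "cpu") and cast the histogram input accordingly.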

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.
PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.
This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.
    • AI was used to understand the broader application, write a manual smoke-test script for regression checking, and bring clarity to the PR

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@IlyasMoutawwakil (Member)

I think it makes more sense to keep int as the default for any hardware we are not aware of, because it's better to have the best perf on as much hardware as possible and only use float where int breaks.

@IlyasMoutawwakil (Member)

Because int histograms are easier to compute than floating-point ones.

@rigen1048 (Contributor, Author)

@IlyasMoutawwakil I agree with you: using int by default makes sense for performance whenever it's reliably supported, since integer histograms are generally faster and cheaper to compute.
That said, a broader range of hardware (such as MPS on Apple Silicon, and potentially some other backends) currently doesn't support integer dtypes for torch.histc, which is what caused the NotImplementedError.
I also considered the "try int first, fallback to float" pattern. While it's more aggressive at giving good performance on unknown hardware, it has a couple of downsides:

  • It can hide bugs or lead to inconsistent behavior across devices.
  • The fallback path adds a small runtime cost and makes the code slightly more complex.

Because of that, I went with the explicit approach in this PR: int on CUDA (where we know it works well and gives the best perf) and float32 everywhere else for broad compatibility.
Happy to adjust if there's a cleaner way to handle this.

@rigen1048 (Contributor, Author)

Given that the large majority of users (especially for MoE models) run on CUDA, this approach gives them the best performance while ensuring reliable behavior on other backends like MPS. We can always expand the int64 support list later as more backends (ROCm, XPU, etc.) stabilize.

@Vincent08199

It feels like this is a classic performance vs portability tradeoff:

  • int histograms → better performance
  • float histograms → broader hardware support (MPS, etc.)

One question I had: do we know how significant the performance difference is in practice for MoE workloads?

If the gap is large, it might be worth keeping int as the default and falling back only when needed. But if the difference is small, the current approach (float for non-CUDA) seems safer.

Also wondering whether a "try int → fallback to float" approach could be acceptable if the failure is rare and predictable.

@rigen1048 (Contributor, Author)


Yes, that's correct. However, the main issue here is more about hardware reality than a classic performance vs portability tradeoff.

Integer histograms for torch.histc are still poorly supported outside of CUDA. They tend to fail on MPS, CPU, and several other backends (APU, TPU, etc.), and broader support may take a long time to mature.

My current approach uses int only on CUDA (where we get the best performance) and float32 everywhere else. This gives strong performance for the majority of MoE users while ensuring reliable behavior across hardware.

Regarding the "try int first, then fall back to float" idea: I considered it as well. While it could offer better performance on unknown hardware, I'm concerned it might introduce subtle issues, such as problems with torch.compile, Python exception handling interfering with the memory cache, or inconsistent states on backends like Metal (MPS). That said, I might be overthinking it, so I'd appreciate the maintainers' cautious feedback on this.
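For reference, the rejected "try int first, fall back to float" pattern would look roughly like this. It is a hedged sketch, not code from the PR; compute_int and compute_float stand in for the integer and float torch.histc calls:

```python
def histc_with_fallback(compute_int, compute_float):
    """Try the integer histogram first; fall back to float on failure.

    Sketch of the rejected pattern discussed above. The except branch is
    what an MPS device would hit ("histogram_mps" not implemented for
    'Int'). Downsides noted in the thread: the exception path can hide
    bugs, adds a small runtime cost, and may interact badly with
    torch.compile.
    """
    try:
        return compute_int()
    except NotImplementedError:
        return compute_float()
```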

@IlyasMoutawwakil (Member)

I wouldn't want to penalize XPU and XLA/TPU, for example, or any other custom torch backend that supports int histc,
so I'm more inclined towards expanding the float fallback backends list rather than defaulting to float.
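This is the approach the PR ultimately took (per its final title): int stays the default, and an explicit list of backends gets the float fallback. A minimal sketch, with illustrative names rather than the actual identifiers in the repo:

```python
# Backends where torch.histc is known to reject integer inputs.
# CPU was already handled before this PR; MPS is the addition here.
# The set can be extended as new backends report failures.
FLOAT_FALLBACK_BACKENDS = {"cpu", "mps"}


def histc_dtype_for(device_type: str) -> str:
    """Default to int64 everywhere; use float32 only on listed backends."""
    return "float32" if device_type in FLOAT_FALLBACK_BACKENDS else "int64"
```

Unknown or custom backends (XPU, XLA/TPU, etc.) keep the faster int path unless they are explicitly added to the fallback set.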

@rigen1048 (Contributor, Author)


Thank you for the response. I now see that this repo follows a performance-first, stability-second approach, which is a valid stance. I had initially gotten the priorities the other way around.
Upon reflection, your approach makes complete sense. Going with a float-compatible hardware list is smart because PyTorch and other serious AI backends are likely to expand their integer histogram support soon. This makes the current pattern nicely future-proof.
Sorry for dragging this out. I’ll keep my lessons in mind to improve my future PRs in the repo.

@IlyasMoutawwakil (Member)

Basically, we've had this experts implementation for a couple of months now, with no reports from XPU/HPU. So I'd say they probably already support int; otherwise they'd be the first to report it (we have a couple of contributors from Intel).

@rigen1048 (Contributor, Author)


That's reassuring.

@rigen1048 rigen1048 changed the title fix: Made histc_input robust for broader hardware fix: Added Mps support in float fallback backends list May 3, 2026
@ArthurZucker (Collaborator) left a comment

sounds good to me

Comment thread src/transformers/integrations/moe.py Outdated
@IlyasMoutawwakil (Member) left a comment

LGTM! thanks for the fix

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@IlyasMoutawwakil IlyasMoutawwakil added this pull request to the merge queue May 4, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 4, 2026
@rigen1048 (Contributor, Author)

I couldn't reproduce the failure locally with: uv run pytest tests/cli/test_download.py::test_cli_download_trust_remote. Perhaps the local cache and the automated test cache are handled differently?

@rigen1048 (Contributor, Author)

Hi @IlyasMoutawwakil @ArthurZucker @Cyrilvallez, could you check the latest change? CI was failing due to a distributed race condition with pytest-xdist.
When running with -n 8, workers have isolated sys.path. If worker A downloads a model and initializes the cache, worker B skips initialization because the files already exist on disk, but then fails the import because its own sys.path doesn't include the cache directory.
I've updated get_class_in_module to ensure every worker is initialized. I managed to reproduce this with a script; let me know if you'd prefer this in a separate PR!
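The gist of that race fix can be sketched as below. This is an illustrative reconstruction of the idea, not the actual change to get_class_in_module, and the function name is hypothetical:

```python
import sys


def ensure_worker_sees_cache(cache_dir: str) -> None:
    """Make the dynamic-module cache importable in *this* worker.

    Sketch of the pytest-xdist race described above: another worker may
    already have created the cache files on disk, but each worker process
    has its own sys.path, so the cache directory must be (re)inserted in
    every worker before importing modules from it, even when the on-disk
    initialization step is skipped.
    """
    if cache_dir not in sys.path:
        sys.path.insert(0, cache_dir)
```

The check keeps the call idempotent, so running it in every worker (including the one that did the download) is safe.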

@IlyasMoutawwakil (Member)

@rigen1048 thanks for investigating, that's very helpful! Let's keep it in a separate PR though, since it would require review from our CI/testing maintainers.

Comment thread src/transformers/dynamic_module_utils.py Outdated
@IlyasMoutawwakil IlyasMoutawwakil enabled auto-merge May 5, 2026 08:14
@IlyasMoutawwakil IlyasMoutawwakil added this pull request to the merge queue May 5, 2026
Merged via the queue into huggingface:main with commit d379ac1 May 5, 2026
29 checks passed
@rigen1048 rigen1048 deleted the OSI branch May 5, 2026 13:44
Exile333 pushed a commit to Exile333/transformers that referenced this pull request May 6, 2026
…5687)

* fix: Made histc_input robust for broader hardware

* Fix lint issues with Ruff

* fix mps support

* Apply suggestion from @IlyasMoutawwakil

* rerun ci

* potential fix: ModuleNotFoundError caused by distributed race condition

* Apply suggestions from code review

Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>

---------

Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>


Development

Successfully merging this pull request may close these issues.

[moe] mps interface has error "histogram_mps" not implemented for 'Int'

5 participants