
[inductor] Skip cudagraph capture when CUDA caching allocator is bypassed#183780

Open
jeffdaily wants to merge 1 commit into main from jeffdaily/inductor-skip-cudagraph-no-cache

Conversation

@jeffdaily
Collaborator

@jeffdaily jeffdaily commented May 14, 2026

`PYTORCH_NO_CUDA_MEMORY_CACHING=1` (and the equivalent `PYTORCH_NO_HIP_MEMORY_CACHING`) forces every allocation through `cudaMalloc` / `hipMalloc`, bypassing the caching allocator. Cudagraph capture pools allocations through that allocator; without it, capture appears to complete normally but the pool tracking diverges, and the first replay fires a cryptic `"storage data ptrs are not allocated in pool ..."` error.
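
For concreteness, a minimal repro sketch of the failure mode (illustrative, not taken from the PR):

```python
# Run with the bypass enabled, e.g.:
#   PYTORCH_NO_CUDA_MEMORY_CACHING=1 python repro.py
import torch

@torch.compile(mode="reduce-overhead")  # "reduce-overhead" enables inductor cudagraphs
def f(x):
    return x.sin() + x.cos()

x = torch.randn(8, device="cuda")
for _ in range(3):
    f(x)
# Before this change: the first cudagraph replay raised
#   "storage data ptrs are not allocated in pool ..."
# After this change: inductor skips capture with a clear reason and runs
# the eager triton kernels instead.
```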

Detect the bypass at the same layer as the existing device-topology check (`cudagraph_utils.check_lowering_disable_cudagraph`) and skip cudagraph capture with a clear reason. Tests and other code that opportunistically enable cudagraphs (`test_padding`, `test_callback`, inductor's `compile_fx` pipeline, etc.) still run — they just exercise the eager triton kernels instead of the cudagraph wrapper.
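
A minimal sketch of the shape of that check, assuming it sits next to the existing device-topology logic in `cudagraph_utils` (the helper name and reason string here are illustrative, not necessarily the PR's exact code):

```python
import os
from typing import Optional

def _caching_allocator_bypassed() -> bool:
    # Mirrors c10::utils::check_env truthiness: only the literal "1" counts.
    return (
        os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
        or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
    )

def check_lowering_disable_cudagraph(device_node_mapping) -> Optional[str]:
    if _caching_allocator_bypassed():
        return "skipping cudagraphs because the CUDA caching allocator is bypassed"
    # ... fall through to the existing multiple-device / CPU-node check ...
    return None
```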

Two existing tests in `test/inductor/test_cuda_repro.py` (`test_unused_cpu_input_cudagraphs` and `test_cpu_index`) explicitly assert `graph.disable_cudagraphs_reason is None`. Add `skipTest()` guards in those tests so they bow out cleanly when the env bypass is set, keeping the new inductor invariant strict.

This is a debugging-experience fix for anyone using `PYTORCH_NO_CUDA_MEMORY_CACHING` (memory-debug builds, sanitizer builds, etc.), not a ROCm-specific change.

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos

@pytorch-bot

pytorch-bot Bot commented May 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/183780

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 2 Unrelated Failures

As of commit c5270cc with merge base a1cc64b:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeffdaily jeffdaily added release notes: inductor and removed ciflow/torchtitan Run TorchTitan integration tests labels May 14, 2026
@jansel
Contributor

jansel commented May 15, 2026

@claude explain those CI failures, check if they are related, and review the changes in this PR

@claude

claude Bot commented May 15, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

Contributor

```python
Truthy match follows c10::utils::check_env: only the literal "1" counts."""
if (
    os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
    or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
```

It would be better to query the memory allocator state directly to get this info rather than re-inferring it from env.

Contributor

```python
    os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
    or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
):
    self.skipTest("cudagraphs disabled when caching allocator is bypassed")
```

Surely this is worth reifying into a decorator.
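
A hedged sketch of such a decorator (name and skip message are suggestions, not code from the PR):

```python
import functools
import os
import unittest

def skip_if_caching_allocator_bypassed(fn):
    """Skip a test when the CUDA/HIP caching allocator is bypassed via env."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if (
            os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
            or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
        ):
            raise unittest.SkipTest(
                "cudagraphs disabled when caching allocator is bypassed"
            )
        return fn(self, *args, **kwargs)
    return wrapper
```

This would replace the inline `skipTest()` guards in `test_unused_cpu_input_cudagraphs` and `test_cpu_index` with a single `@skip_if_caching_allocator_bypassed` on each test.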

@ezyang
Contributor

ezyang commented May 15, 2026

This seems fine, although I'm not sure if there's a more structural fix we should do for the direct mode / other consequences when the caching allocator is disabled.
