
[inductor] Skip cudagraph capture when CUDA caching allocator is bypassed#183780

Open
jeffdaily wants to merge 1 commit into main from jeffdaily/inductor-skip-cudagraph-no-cache

Conversation

@jeffdaily
Collaborator

@jeffdaily jeffdaily commented May 14, 2026

`PYTORCH_NO_CUDA_MEMORY_CACHING=1` (and the equivalent `PYTORCH_NO_HIP_MEMORY_CACHING`) forces every allocation through `cudaMalloc` / `hipMalloc`, bypassing the caching allocator. Cudagraph capture pools allocations through that allocator; without it, capture appears to complete normally but the pool tracking diverges, and the first replay fires a cryptic `"storage data ptrs are not allocated in pool ..."` error.
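
For concreteness, a minimal repro sketch of the failure mode (illustrative, not taken from the PR):

```python
# Run with the bypass enabled, e.g.:
#   PYTORCH_NO_CUDA_MEMORY_CACHING=1 python repro.py
import torch

@torch.compile(mode="reduce-overhead")  # "reduce-overhead" enables inductor cudagraphs
def f(x):
    return x.sin() + x.cos()

x = torch.randn(8, device="cuda")
for _ in range(3):
    f(x)
# Before this change: the first cudagraph replay raised
#   "storage data ptrs are not allocated in pool ..."
# After this change: inductor skips capture with a clear reason and runs
# the eager triton kernels instead.
```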

Detect the bypass at the same layer as the existing device-topology check (`cudagraph_utils.check_lowering_disable_cudagraph`) and skip cudagraph capture with a clear reason. Tests and other code that opportunistically enable cudagraphs (`test_padding`, `test_callback`, inductor's `compile_fx` pipeline, etc.) still run — they just exercise the eager triton kernels instead of the cudagraph wrapper.
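
A minimal sketch of the shape of that check, assuming it sits next to the existing device-topology logic in `cudagraph_utils` (the helper name and reason string here are illustrative, not necessarily the PR's exact code):

```python
import os
from typing import Optional

def _caching_allocator_bypassed() -> bool:
    # Mirrors c10::utils::check_env truthiness: only the literal "1" counts.
    return (
        os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
        or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
    )

def check_lowering_disable_cudagraph(device_node_mapping) -> Optional[str]:
    if _caching_allocator_bypassed():
        return "skipping cudagraphs because the CUDA caching allocator is bypassed"
    # ... fall through to the existing multiple-device / CPU-node check ...
    return None
```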

Two existing tests in `test/inductor/test_cuda_repro.py` (`test_unused_cpu_input_cudagraphs` and `test_cpu_index`) explicitly assert `graph.disable_cudagraphs_reason is None`. Add `skipTest()` guards in those tests so they bow out cleanly when the env bypass is set, keeping the new inductor invariant strict.

This is a debugging-experience fix for anyone using `PYTORCH_NO_CUDA_MEMORY_CACHING` (memory-debug builds, sanitizer builds, etc.), not a ROCm-specific change.

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos

@pytorch-bot

pytorch-bot Bot commented May 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/183780

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 2 Unrelated Failures

As of commit c5270cc with merge base a1cc64b:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeffdaily jeffdaily added release notes: inductor and removed ciflow/torchtitan Run TorchTitan integration tests labels May 14, 2026
@jansel
Contributor

jansel commented May 15, 2026

@claude explain those CI failures, check if they are related, and review the changes in this PR

@claude

claude Bot commented May 15, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

Contributor

```python
Truthy match follows c10::utils::check_env: only the literal "1" counts."""
if (
    os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
    or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
```

It would be better to query the memory allocator state directly to get this info rather than re-inferring it from env.

Contributor

```python
    os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
    or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
):
    self.skipTest("cudagraphs disabled when caching allocator is bypassed")
```

Surely this is worth reifying into a decorator.
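
A hedged sketch of such a decorator (name and skip message are suggestions, not code from the PR):

```python
import functools
import os
import unittest

def skip_if_caching_allocator_bypassed(fn):
    """Skip a test when the CUDA/HIP caching allocator is bypassed via env."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if (
            os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
            or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
        ):
            raise unittest.SkipTest(
                "cudagraphs disabled when caching allocator is bypassed"
            )
        return fn(self, *args, **kwargs)
    return wrapper
```

This would replace the inline `skipTest()` guards in `test_unused_cpu_input_cudagraphs` and `test_cpu_index` with a single `@skip_if_caching_allocator_bypassed` on each test.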

@ezyang
Contributor

ezyang commented May 15, 2026

This seems fine, although I'm not sure if there's a more structural fix we should do for the direct mode / other consequences when the caching allocator is disabled.
