[inductor] Skip cudagraph capture when CUDA caching allocator is bypassed#183780
jeffdaily wants to merge 1 commit into
Conversation
`PYTORCH_NO_CUDA_MEMORY_CACHING=1` (and the equivalent `PYTORCH_NO_HIP_MEMORY_CACHING`) forces every allocation through `cudaMalloc` / `hipMalloc`, bypassing the caching allocator. Cudagraph capture pools allocations through that allocator; without it, capture appears to complete normally but the pool tracking diverges, and the first replay fires a cryptic `"storage data ptrs are not allocated in pool ..."` error.

Detect the bypass at the same layer as the existing device-topology check (`cudagraph_utils.check_lowering_disable_cudagraph`) and skip cudagraph capture with a clear reason. Tests and other code that opportunistically enable cudagraphs (`test_padding`, `test_callback`, inductor's `compile_fx` pipeline, etc.) still run; they just exercise the eager triton kernels instead of the cudagraph wrapper.

Two existing tests in `test/inductor/test_cuda_repro.py` (`test_unused_cpu_input_cudagraphs` and `test_cpu_index`) explicitly assert `graph.disable_cudagraphs_reason is None`. Add `skipTest()` guards in those tests so they bow out cleanly when the env bypass is set, keeping the new inductor invariant strict.

This is a debugging-experience fix for anyone using `PYTORCH_NO_CUDA_MEMORY_CACHING` (memory-debug builds, sanitizer builds, etc.), not a ROCm-specific change.

Authored with Claude.
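The detection described above can be sketched as a small env-var helper. This is a minimal illustration of the approach, not the PR's exact code; the helper name `caching_allocator_bypassed` is hypothetical.

```python
import os


def caching_allocator_bypassed() -> bool:
    """Return True when the CUDA/HIP caching allocator is bypassed via env.

    Truthy match follows c10::utils::check_env semantics: only the
    literal string "1" counts; values like "true" or "yes" do not.
    """
    return (
        os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
        or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
    )
```

A check like this would sit alongside the existing device-topology check, so cudagraphs are disabled with an explicit reason before capture ever starts rather than failing on first replay.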
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/183780

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.

❌ 1 New Failure, 2 Unrelated Failures as of commit c5270cc with merge base a1cc64b:

- NEW FAILURE: the following job has failed:
- FLAKY: the following job failed but was likely due to flakiness present on trunk:
- UNSTABLE: the following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
@claude explain those CI failures, check if they are related, and review the changes in this PR

I'll analyze this and get back to you.
```python
    Truthy match follows c10::utils::check_env: only the literal "1" counts."""
    if (
        os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
        or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
```
It would be better to query the memory allocator state directly to get this info rather than re-inferring it from env
```python
        os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
        or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
    ):
        self.skipTest("cudagraphs disabled when caching allocator is bypassed")
```
Surely this is worth reifying into a decorator
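The suggested decorator could look something like the sketch below, using only stdlib `unittest`. The name `skip_if_caching_allocator_bypassed` and the exact structure are illustrative assumptions, not code from the PR.

```python
import os
import unittest


def skip_if_caching_allocator_bypassed(test_item):
    """Skip the wrapped test when the CUDA/HIP caching allocator is bypassed.

    Truthy match follows c10::utils::check_env: only the literal "1" counts.
    Note the env vars are read at decoration time, not at test run time.
    """
    bypassed = (
        os.environ.get("PYTORCH_NO_CUDA_MEMORY_CACHING") == "1"
        or os.environ.get("PYTORCH_NO_HIP_MEMORY_CACHING") == "1"
    )
    return unittest.skipIf(
        bypassed, "cudagraphs disabled when caching allocator is bypassed"
    )(test_item)
```

Reifying the check into a decorator keeps the two affected tests to a one-line annotation each, and gives future cudagraph-asserting tests a single place to opt in.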
This seems fine, although I'm not sure if there's a more structural fix we should do for the direct mode / other consequences when the caching allocator is disabled.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @mlazos