Add GKE TPU v7 runner and documentation #1775
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Reviewed by Cursor Bugbot for commit d35443b.
# Execute Ix standard processing script
cd "${SCRIPT_DIR}/.."
python3 utils/process_result.py
Benchmark JSON wrong directory
High Severity
kubectl cp writes the benchmark output to results/${JOB_NAME}.json, but process_result.py opens ${RESULT_FILENAME}.json from the repo root, which is the working directory after the cd. Post-processing therefore fails with a missing-file error even when the GKE job succeeded.
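One way to reconcile the two paths is to stage the copied file where post-processing looks for it. The sketch below is only an illustration: `stage_result` is a hypothetical helper, and it assumes that `results/` lives under the repo root and that process_result.py reads `${RESULT_FILENAME}.json` from the repo root, as the comment above describes.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not part of the PR): copy the file fetched by
# `kubectl cp` to the path that process_result.py is assumed to read.
#   $1 = repo root, $2 = job name, $3 = result filename (without .json)
stage_result() {
  local repo_root="$1" job_name="$2" result_filename="$3"
  local src="${repo_root}/results/${job_name}.json"
  local dst="${repo_root}/${result_filename}.json"

  # Fail early with a clear message instead of letting post-processing
  # hit a missing or empty file later.
  if [ ! -s "${src}" ]; then
    echo "missing or empty benchmark output: ${src}" >&2
    return 1
  fi

  cp -- "${src}" "${dst}"
  printf '%s\n' "${dst}"
}

# Possible usage inside the sweep loop (variable names taken from the comment above):
#   stage_result "${SCRIPT_DIR}/.." "${JOB_NAME}" "${RESULT_FILENAME}"
#   cd "${SCRIPT_DIR}/.." && python3 utils/process_result.py
```

An alternative with the same effect would be to point `kubectl cp` directly at `${RESULT_FILENAME}.json` in the repo root; which is preferable depends on whether the `results/` copy is used elsewhere.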
# Execute Ix standard processing script
cd "${SCRIPT_DIR}/.."
python3 utils/process_result.py
Missing process_result env vars
High Severity
After each sweep iteration the script runs process_result.py with DISAGG=false but never exports TP, EP_SIZE, or DP_ATTENTION. For single-node runs that script requires those variables and raises EnvironmentError before writing agg_*.json.
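A guard along the following lines could surface the problem before the Python script runs. This is a minimal sketch: `require_exported` is a hypothetical helper, and the variable names are the ones listed in the comment above. The values the runner should export are not stated here, so they are left as placeholders.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper (not part of the PR): succeed only if every named
# variable is present and non-empty in the environment that child
# processes will inherit. `printenv NAME` only sees exported variables,
# which matches what a Python subprocess can read.
require_exported() {
  local name missing=0
  for name in "$@"; do
    if [ -z "$(printenv "${name}" || true)" ]; then
      echo "required variable is not exported: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Possible usage before post-processing (values are placeholders):
#   export DISAGG=false
#   export TP=<tensor-parallel size> EP_SIZE=<expert-parallel size> DP_ATTENTION=<true|false>
#   require_exported TP EP_SIZE DP_ATTENTION
#   python3 utils/process_result.py
```

Checking with `printenv` rather than `${TP:-}` is deliberate: a variable that is assigned but never exported would pass a plain shell test and still be invisible to process_result.py.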
echo "Waiting for TPU Job to complete (this may take several minutes)..."
# Wait for the JobSet to be successful
kubectl wait --for=condition=completed jobset/${JOB_NAME} --timeout=1h
Wrong JobSet wait condition
High Severity
The script waits with kubectl wait --for=condition=completed, but JobSet status conditions use PascalCase types such as Completed or Complete, not lowercase completed. The wait never succeeds and the sweep stalls until timeout.
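A wait step written with the PascalCase condition type might look like the sketch below. It assumes that the installed JobSet controller reports a condition of type `Completed`, following the comment above; `wait_for_jobset` is a hypothetical wrapper. On timeout it prints the conditions the JobSet currently reports, which makes a mismatch in the condition name easier to spot.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical wrapper (not part of the PR).
#   $1 = JobSet name, $2 = timeout accepted by kubectl wait (default 1h)
wait_for_jobset() {
  local name="$1" timeout="${2:-1h}"

  # Assumption: the JobSet status carries a condition of type "Completed".
  if kubectl wait --for=condition=Completed "jobset/${name}" --timeout="${timeout}"; then
    return 0
  fi

  echo "JobSet ${name} did not reach Completed within ${timeout}; reported conditions:" >&2
  # Print each condition as TYPE=STATUS to help diagnose the stall.
  kubectl get "jobset/${name}" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' >&2 || true
  return 1
}

# Possible usage in the sweep loop:
#   wait_for_jobset "${JOB_NAME}" 1h || exit 1
```

This sketch still blocks for the full timeout when the JobSet fails outright; waiting on a failure condition in parallel would shorten that, at the cost of more process handling.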
…s, and enable persistent JAX compile cache on GKE
…gging-based result collection
…TODO in README_TPU.md
…g and missing process_result env vars


Note
Low Risk
New optional runner scripts and Kubernetes manifests only; no changes to core benchmark logic, CI matrices, or security-sensitive code paths.
Overview
Adds a TPU v7 on GKE path to InferenceX so TPU runs can use the same serving benchmark and process_result.py normalization as GPU runners.

launch_tpu-v7-gke.sh sweeps ISL/OSL/concurrency, renders tpu-v7-jobset.yaml.template into per-case JobSets, waits for completion, copies result.json via kubectl cp, deletes the job, and post-processes with RUNNER_TYPE=tpu-v7. Defaults target Qwen3.5-397B-FP8 and a configurable vLLM TPU image.

The JobSet template runs a vLLM server and sidecar on tpu7x/2x2x1 nodes; the sidecar calls benchmark_serving.py and signals shutdown through a shared volume. README_TPU.md documents cluster creation, JobSet controller install, credentials, env overrides, and troubleshooting.

Reviewed by Cursor Bugbot for commit d35443b. Bugbot is set up for automated code reviews on this repo.
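The sweep-and-render flow described in the overview can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of launch_tpu-v7-gke.sh: the `__JOB_NAME__`-style placeholders, the job naming scheme, and the helper names `job_name_for`, `render_jobset`, and `sweep_cases` are all hypothetical, and the real template may use a different substitution mechanism.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical naming scheme: lowercase and hyphenated so the result is a
# valid Kubernetes object name.
job_name_for() {
  printf 'tpu-v7-isl%s-osl%s-c%s' "$1" "$2" "$3"
}

# Render a template by replacing assumed placeholders with case values.
#   $1 = template path, $2 = job name, $3 = ISL, $4 = OSL, $5 = concurrency
render_jobset() {
  local template="$1" job_name="$2" isl="$3" osl="$4" conc="$5"
  sed -e "s|__JOB_NAME__|${job_name}|g" \
      -e "s|__ISL__|${isl}|g" \
      -e "s|__OSL__|${osl}|g" \
      -e "s|__CONC__|${conc}|g" \
      "${template}"
}

# Expand "ISL:OSL" pairs and concurrency levels into one case per line.
#   $1 = space-separated ISL:OSL pairs, $2 = space-separated concurrencies
sweep_cases() {
  local pair conc
  for pair in $1; do
    for conc in $2; do
      printf '%s %s %s\n' "${pair%%:*}" "${pair##*:}" "${conc}"
    done
  done
}

# Possible usage (sweep values are illustrative only):
#   sweep_cases "1024:1024 8192:1024" "4 16" | while read -r isl osl conc; do
#     name="$(job_name_for "${isl}" "${osl}" "${conc}")"
#     render_jobset tpu-v7-jobset.yaml.template "${name}" "${isl}" "${osl}" "${conc}" \
#       | kubectl apply -f -
#   done
```

Keeping the rendering in a function that writes to stdout makes each case easy to inspect before it is applied to the cluster.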