Add GKE TPU v7 runner and documentation by DRZDONG · Pull Request #1775 · SemiAnalysisAI/InferenceX

DRZDONG · 2026-06-15T07:30:21Z

Note

Low Risk
New optional runner scripts and Kubernetes manifests only; no changes to core benchmark logic, CI matrices, or security-sensitive code paths.

Overview
Adds a TPU v7 on GKE path to InferenceX so TPU runs can use the same serving benchmark and process_result.py normalization as GPU runners.

launch_tpu-v7-gke.sh sweeps ISL/OSL/concurrency, renders tpu-v7-jobset.yaml.template into per-case JobSets, waits for completion, copies result.json via kubectl cp, deletes the job, and post-processes with RUNNER_TYPE=tpu-v7. Defaults target Qwen3.5-397B-FP8 and a configurable vLLM TPU image.
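
The sweep-and-render step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the variable names (`ISL_LIST`, `OSL_LIST`, `CONC_LIST`), the default values, the job-name scheme, and the `__NAME__` placeholder syntax are all hypothetical and may not match what launch_tpu-v7-gke.sh or the template actually use.

```shell
# Hypothetical sweep axes; the real script's names and defaults may differ.
ISL_LIST="${ISL_LIST:-1024 8192}"
OSL_LIST="${OSL_LIST:-1024}"
CONC_LIST="${CONC_LIST:-4 16}"

# Build a Kubernetes-safe job name for one sweep case (lowercase, digits, dashes).
case_name() {
  printf 'tpu-v7-isl%s-osl%s-c%s' "$1" "$2" "$3"
}

# Render a template read from stdin by substituting placeholders.
# The __NAME__ placeholder syntax is an assumption, not the PR's template format.
render_case() {
  sed -e "s/__JOB_NAME__/$1/g" \
      -e "s/__ISL__/$2/g" \
      -e "s/__OSL__/$3/g" \
      -e "s/__CONC__/$4/g"
}

# Print one job name per line for every ISL/OSL/concurrency combination.
list_cases() {
  local isl osl conc
  for isl in $ISL_LIST; do
    for osl in $OSL_LIST; do
      for conc in $CONC_LIST; do
        case_name "$isl" "$osl" "$conc"
        printf '\n'
      done
    done
  done
}
```

Each rendered manifest would then be applied, waited on, harvested, and deleted before the next case starts, which keeps only one JobSet on the TPU nodes at a time.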

The JobSet template runs a vLLM server and sidecar on tpu7x / 2x2x1 nodes; the sidecar calls benchmark_serving.py and signals shutdown through a shared volume.
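
One common way to signal shutdown through a shared volume is a sentinel file: the sidecar creates it when the benchmark ends, and the server container polls for it. The sketch below illustrates that pattern only; the path `/shared/benchmark.done` and both function names are hypothetical, and the PR's template may implement the handshake differently.

```shell
# Hypothetical sentinel path on the volume mounted into both containers.
DONE_FILE="${DONE_FILE:-/shared/benchmark.done}"

# Sidecar side: call once the benchmark has finished to request shutdown.
signal_done() {
  mkdir -p "$(dirname "$DONE_FILE")"
  : > "$DONE_FILE"
}

# Server side: poll until the sentinel appears or the timeout elapses.
# Returns 0 if the sentinel was seen, 1 on timeout.
wait_for_done() {
  local timeout_s="${1:-3600}" interval_s="${2:-5}" waited=0
  while [ ! -e "$DONE_FILE" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  return 0
}
```

The server container could run the vLLM process in the background, call `wait_for_done`, and then stop the server so the pod exits cleanly.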

README_TPU.md documents cluster creation, JobSet controller install, credentials, env overrides, and troubleshooting.

Reviewed by Cursor Bugbot for commit d35443b. Bugbot is set up for automated code reviews on this repo.

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

cursor · 2026-06-15T07:34:26Z

+
+            # Execute Ix standard processing script
+            cd "${SCRIPT_DIR}/.."
+            python3 utils/process_result.py


Benchmark JSON wrong directory

High Severity

kubectl cp writes the benchmark output to results/${JOB_NAME}.json, but process_result.py opens ${RESULT_FILENAME}.json from the repo root after cd there. Post-processing always fails with a missing file even when the GKE job succeeded.
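
One possible fix is to stage the copied file where the post-processing step looks for it. The sketch below assumes, as the comment above states, that process_result.py reads `${RESULT_FILENAME}.json` from the repo root; the helper name `stage_result` is hypothetical and not part of the PR.

```shell
# Copy the benchmark output to <repo_root>/<result_filename>.json.
# Fails early if the source file is missing or empty, so a bad kubectl cp
# is reported here rather than inside process_result.py.
stage_result() {
  local src="$1" repo_root="$2" result_filename="$3"
  if [ ! -s "$src" ]; then
    echo "benchmark output not found or empty: $src" >&2
    return 1
  fi
  cp "$src" "${repo_root}/${result_filename}.json"
}
```

In the launch script this might be called as `stage_result "results/${JOB_NAME}.json" "${SCRIPT_DIR}/.." "${RESULT_FILENAME}"` just before the `cd`. Alternatively, `kubectl cp` could write directly to the expected path.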

cursor · 2026-06-15T07:34:26Z

+
+            # Execute Ix standard processing script
+            cd "${SCRIPT_DIR}/.."
+            python3 utils/process_result.py


Missing process_result env vars

High Severity

After each sweep iteration the script runs process_result.py with DISAGG=false but never exports TP, EP_SIZE, or DP_ATTENTION. For single-node runs that script requires those variables and raises EnvironmentError before writing agg_*.json.
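
A pre-flight check in the launch script would surface this before any TPU time is spent. The sketch below only checks the three variables named in the comment above; the helper name `missing_vars` is hypothetical, and the correct values for a given TPU topology are not specified in this review.

```shell
# Variables the review says process_result.py requires for single-node runs.
REQUIRED_VARS="TP EP_SIZE DP_ATTENTION"

# Print the names of required variables that are unset or empty.
# Returns 0 when all are set, 1 otherwise.
missing_vars() {
  local name value missing=""
  for name in $REQUIRED_VARS; do
    # Indirect lookup; safe here because the names come from a fixed list.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="${missing}${missing:+ }${name}"
    fi
  done
  if [ -n "$missing" ]; then
    echo "$missing"
    return 1
  fi
  return 0
}
```

The script could export the three variables once before the sweep loop and abort if `missing_vars` returns non-zero, for example `missing_vars >&2 || exit 1`.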

cursor · 2026-06-15T07:34:26Z

+
+            echo "Waiting for TPU Job to complete (this may take several minutes)..."
+            # Wait for the JobSet to be successful
+            kubectl wait --for=condition=completed jobset/${JOB_NAME} --timeout=1h


Wrong JobSet wait condition

High Severity

The script waits with kubectl wait --for=condition=completed, but JobSet status conditions use PascalCase types such as Completed or Complete, not lowercase completed. The wait never succeeds and the sweep stalls until timeout.
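
A minimal sketch of the wait step using the PascalCase spelling is shown below. It assumes the installed JobSet controller reports a `Completed` condition type, as the comment above suggests; the wrapper name and the `KUBECTL` override (useful for dry runs) are hypothetical.

```shell
# Allow the kubectl binary to be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Block until the JobSet reports the Completed condition or the timeout expires.
# kubectl wait exits non-zero on timeout, so callers can branch on the result.
wait_jobset_completed() {
  local job_name="$1" timeout="${2:-1h}"
  "$KUBECTL" wait --for=condition=Completed "jobset/${job_name}" --timeout="$timeout"
}
```

Waiting only on `Completed` means a JobSet that fails will still block until the timeout, so a production script may also want to check for a failure condition and exit early.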

Add GKE TPU v7 runner and documentation

d35443b

DRZDONG requested a review from a team June 15, 2026 07:30

github-project-automation Bot added this to InferenceMAX Board Jun 15, 2026

DRZDONG marked this pull request as draft June 15, 2026 07:30

cursor Bot reviewed Jun 15, 2026

Ruizi Dong added 10 commits June 15, 2026 00:38

Merge branch 'main' into tpu-v7-runner

595d063

Merge remote-tracking branch 'origin/main' into tpu-v7-runner

b0768c5

feat(tpu-v7): support consolidated sweep loop, optimize host RAM sizes, and enable persistent JAX compile cache on GKE

a927ab9

Merge remote-tracking branch 'origin/main' into tpu-v7-runner

e2349d7

docs(tpu-v7): update README_TPU.md to match the new sweep loop and logging-based result collection

7f0463a

docs(tpu-v7): add expected run timeline table to README_TPU.md

4e6f058

docs(tpu-v7): document database ingestion gap and add GHA automation TODO in README_TPU.md

df32780

docs(tpu-v7): simplify TODO in README_TPU.md

986fd04

fix(tpu-v7): address bugbot reviews on wait conditions, result copying and missing process_result env vars

18ba739

docs(tpu-v7): set single-node as default and document single vs multi-node choices

dbfa0f2

DRZDONG closed this Jun 17, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 17, 2026

DRZDONG wants to merge 11 commits into SemiAnalysisAI:main from DRZDONG:tpu-v7-runner