Add GKE TPU v7 runner and documentation #1775

Closed
DRZDONG wants to merge 11 commits into SemiAnalysisAI:main from DRZDONG:tpu-v7-runner

DRZDONG commented Jun 15, 2026

Note

Low Risk
New optional runner scripts and Kubernetes manifests only; no changes to core benchmark logic, CI matrices, or security-sensitive code paths.

Overview
Adds a TPU v7 on GKE path to InferenceX so TPU runs can use the same serving benchmark and process_result.py normalization as GPU runners.

launch_tpu-v7-gke.sh sweeps ISL/OSL/concurrency, renders tpu-v7-jobset.yaml.template into per-case JobSets, waits for completion, copies result.json via kubectl cp, deletes the job, and post-processes with RUNNER_TYPE=tpu-v7. Defaults target Qwen3.5-397B-FP8 and a configurable vLLM TPU image.
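
The sweep described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual script: the `__JOB_NAME__`-style placeholders, the `ISL_LIST`/`OSL_LIST`/`CONC_LIST` variables, the job-name format, and both helper names are assumptions made for the example.

```shell
# Hypothetical sketch of the sweep loop; variable names, job-name format, and
# placeholder syntax are assumptions, not taken from launch_tpu-v7-gke.sh.
ISL_LIST="${ISL_LIST:-1024}"
OSL_LIST="${OSL_LIST:-1024}"
CONC_LIST="${CONC_LIST:-4 8}"

# Print one "job-name isl osl concurrency" line per sweep case.
sweep_cases() {
  for isl in ${ISL_LIST}; do
    for osl in ${OSL_LIST}; do
      for conc in ${CONC_LIST}; do
        echo "tpu-v7-isl${isl}-osl${osl}-c${conc} ${isl} ${osl} ${conc}"
      done
    done
  done
}

# Render a template to stdout by substituting the assumed placeholders.
# Args: template path, job name, ISL, OSL, concurrency.
render_jobset() {
  sed -e "s/__JOB_NAME__/$2/g" \
      -e "s/__ISL__/$3/g" \
      -e "s/__OSL__/$4/g" \
      -e "s/__CONC__/$5/g" "$1"
}
```

In a real run, each line from `sweep_cases` would presumably feed `render_jobset`, with the rendered manifest piped to `kubectl apply -f -` before the wait, copy, delete, and post-processing steps.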

The JobSet template runs a vLLM server and sidecar on tpu7x / 2x2x1 nodes; the sidecar calls benchmark_serving.py and signals shutdown through a shared volume.
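
Signalling shutdown through a shared volume is commonly done with a sentinel file. The sketch below shows that pattern in isolation; the file name `benchmark.done` and the helper names are hypothetical, and the PR's template may use a different mechanism on the shared volume.

```shell
# Hypothetical sentinel-file handshake; the file name and helper names are
# assumptions, not taken from tpu-v7-jobset.yaml.template.
SENTINEL_NAME="benchmark.done"

# Sidecar side: mark the benchmark as finished.
# Args: shared directory.
signal_done() {
  touch "$1/${SENTINEL_NAME}"
}

# Server side: poll for the sentinel, giving up after max_polls checks.
# Args: shared directory, max_polls. Returns 0 once the sentinel exists.
wait_for_done() {
  polls=0
  while [ ! -f "$1/${SENTINEL_NAME}" ]; do
    polls=$((polls + 1))
    if [ "${polls}" -ge "$2" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}
```

The sidecar would call `signal_done` on the shared mount once `benchmark_serving.py` exits, and the server container would stop vLLM after `wait_for_done` returns.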

README_TPU.md documents cluster creation, JobSet controller install, credentials, env overrides, and troubleshooting.

Reviewed by Cursor Bugbot for commit d35443b. Bugbot is set up for automated code reviews on this repo.

DRZDONG requested a review from a team June 15, 2026 07:30
DRZDONG marked this pull request as draft June 15, 2026 07:30

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Comment thread runners/launch_tpu-v7-gke.sh Outdated

# Execute Ix standard processing script
cd "${SCRIPT_DIR}/.."
python3 utils/process_result.py

Benchmark JSON wrong directory

High Severity

kubectl cp writes the benchmark output to results/${JOB_NAME}.json, but process_result.py opens ${RESULT_FILENAME}.json from the repo root after cd there. Post-processing always fails with a missing file even when the GKE job succeeded.
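
One possible way to reconcile the two paths is to stage the copied file where the post-processing step looks for it. The helper below is a hedged sketch: `stage_result` is a hypothetical name, and it assumes, per the comment above, that the file lands at `results/${JOB_NAME}.json` and is read from `${RESULT_FILENAME}.json` in the repo root.

```shell
# Hypothetical helper: copy the benchmark JSON from results/<job>.json to
# <repo_root>/<result_filename>.json so the post-processing step can find it.
# Args: repo root, job name, result filename (without .json).
stage_result() {
  src="$1/results/$2.json"
  dst="$1/$3.json"
  # Fail early with a clear message if kubectl cp produced nothing usable.
  if [ ! -s "${src}" ]; then
    echo "missing or empty benchmark output: ${src}" >&2
    return 1
  fi
  cp -- "${src}" "${dst}"
}
```

In the launcher this might be called as `stage_result "${SCRIPT_DIR}/.." "${JOB_NAME}" "${RESULT_FILENAME}"` before `python3 utils/process_result.py`. Pointing `kubectl cp` directly at the repo-root path would be an equally reasonable alternative.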

Comment thread runners/launch_tpu-v7-gke.sh Outdated

# Execute Ix standard processing script
cd "${SCRIPT_DIR}/.."
python3 utils/process_result.py

Missing process_result env vars

High Severity

After each sweep iteration the script runs process_result.py with DISAGG=false but never exports TP, EP_SIZE, or DP_ATTENTION. For single-node runs that script requires those variables and raises EnvironmentError before writing agg_*.json.
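
A minimal sketch of a guard for this, assuming the three variables named above are the only ones missing. `require_env` is a hypothetical helper, and the default values shown are placeholders for illustration rather than settings taken from the PR.

```shell
# Hypothetical guard: return non-zero if any named variable is unset, empty,
# or not exported (printenv only sees exported variables).
require_env() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "${name}")" ]; then
      echo "required variable not exported: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Placeholder values for illustration only; pick ones that match the
# actual TPU deployment before relying on the aggregated results.
export TP="${TP:-4}"
export EP_SIZE="${EP_SIZE:-1}"
export DP_ATTENTION="${DP_ATTENTION:-false}"

# Check before post-processing so a missing variable fails fast in the launcher.
require_env TP EP_SIZE DP_ATTENTION
```

Running the check in the launcher surfaces a missing variable before `process_result.py` is invoked, rather than as an exception after the benchmark has already consumed TPU time.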

Comment thread runners/launch_tpu-v7-gke.sh Outdated

echo "Waiting for TPU Job to complete (this may take several minutes)..."
# Wait for the JobSet to be successful
kubectl wait --for=condition=completed jobset/${JOB_NAME} --timeout=1h

Wrong JobSet wait condition

High Severity

The script waits with kubectl wait --for=condition=completed, but JobSet status conditions use PascalCase types such as Completed or Complete, not lowercase completed. The wait never succeeds and the sweep stalls until timeout.
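
Following that suggestion, the wait could be wrapped as below. This is a sketch that assumes the JobSet controller reports a condition type named `Completed`; the `wait_for_jobset` helper and the `KUBECTL` override (handy for dry runs) are hypothetical additions, not part of the PR.

```shell
# Allow the kubectl binary to be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical wrapper: wait for a JobSet to report the Completed condition.
# Args: JobSet name, optional timeout (defaults to 1h, as in the original script).
wait_for_jobset() {
  "${KUBECTL}" wait --for=condition=Completed "jobset/$1" --timeout="${2:-1h}"
}
```

For example, `KUBECTL=echo wait_for_jobset demo 5m` prints the command that would be run without contacting a cluster. Note that this still blocks until the timeout if the JobSet fails instead of completing.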
