feat: Write the feasibility report: findings, recommended default page strategy, sync/async decision, failure-mode handling, open risks

Omega Agent · Omega Agent · commit 8bb1f6a2b8d0 · 2026-04-30T11:15:20.000Z
Task: beads-15f3da
Spec: evaluate-post-verification-rendered-page-as-activa-19
diff --git a/docs/spikes/evaluate-post-verification-rendered-page-as-activa-19/feasibility-report.md b/docs/spikes/evaluate-post-verification-rendered-page-as-activa-19/feasibility-report.md
@@ -0,0 +1,263 @@
+# Feasibility Report — Post-Verification Rendered Page as Activation Signal
+
+- **Spec:** `evaluate-post-verification-rendered-page-as-activa-19`
+- **Task:** 4.2 — Write the feasibility report (findings, recommended default
+  page strategy, sync/async decision, failure-mode handling, open risks)
+- **Status:** Final. Inputs: tasks 1.x, 2.x, 3.x, and 4.1 of this spec.
+- **Owner:** Get Started squad (team:omega)
+- **Date:** 2026-04-30
+
+This report is the single consolidated answer to the four acceptance criteria
+of the spike. Everything below is drawn from the numbered investigation
+documents in `docs/spikes/`; those remain the source-of-truth for method and
+raw data. This report is the squad-review surface and the input to task 4.3
+(MVP scope for the follow-up implementation ticket).
+
+Non-goals of the spec — production UI, changes to the render API / crawler /
+verification endpoint, batch rendering, screenshots, visual diffing, pricing /
+quota / accounting changes, i18n / a11y / analytics polish, caching work, and
+refactors of `features/integrationWizard` beyond the throwaway prototype — are
+respected throughout. Nothing in this report proposes work inside those areas.
+
+## 1. TL;DR
+
+| Acceptance criterion | Verdict |
+|---|---|
+| Feasibility of post-verification render is clear | **Feasible.** No changes to this repo required. `GET /render?url=…` already exposes everything the wizard needs. |
+| Page selection approach decided | **Verified domain root (`https://<verified-domain>/`)**, with an optional user override pre-filled to the same value. |
+| Failure cases documented | Five render-path cases + two wizard-side cases, all non-blocking in the UI. Enumerated in §5. |
+| Recommended implementation path defined | **Async-first** call to `/render` from the wizard backend at verification-success, with a **3 s opportunistic sync window**. MVP is a single wizard PR; no render-service change. See §6. |
+
+## 2. Feasibility of post-verification render
+
+**Feasible with the existing render path, with no changes to this repo.**
+Confirmed by reading `lib/server.js`, `lib/index.js`, `lib/util.js`, and
+`lib/browsers/chrome.js` during tasks 1.2, 2.1, 2.3, and 2.4.
+
+Key facts that make it feasible today:
+
+- Single entry point: `GET /render?url=<url>&renderType=html` (and the POST
+  equivalent) hits `server.onRequest` in `lib/server.js`. No new endpoint,
+  no new route, no crawler or sitemap dependency (task 2.1).
+- Input validation is already at the boundary: `valid-url.isWebUri` rejects
+  malformed input with `400` before the plugin chain runs
+  (`lib/server.js:221-225`). The wizard does not need to re-validate.
+- Response contract is already what the wizard needs: status code from the
+  origin is propagated (`req.prerender.statusCode`, `lib/server.js:303`),
+  configurable `renderErrorStatusCode` (default `504`) with
+  `x-prerender-504-reason` header for renderer-side failures
+  (`lib/server.js:484-486`), and rendered HTML in the body.
+- Bounded worst case: `pageLoadTimeout` (default 20 s, `lib/server.js:9`)
+  plus `waitAfterLastRequest` (500 ms, `lib/server.js:5`) plus the 60 s
+  per-request hang watchdog (`lib/server.js:241`) cap how long any single
+  call can take.
+- Auth is already solved for service-to-service use: if deployed behind
+  `basicAuth`, the wizard backend reuses the same credential the crawler
+  uses. No new auth path is required (task 1.2).
+
+The spec's reliability bar (≥ 95% success on verified URLs) and latency
+budget (P50 ≤ 6 s, P95 ≤ 12 s, hard ceiling 20 s) are achievable with the
+verified-root selection because the root is the URL whose reachability we
+just proved during verification — DNS, TLS, and bot-gating are known-good at
+call time (task 2.4). The budget is still tight against the 20 s
+`pageLoadTimeout`, which is why §4.2 recommends against blocking the
+wizard on the render.
+
+## 3. Method (summary)
+
+Detail lives in the individual spike documents; this is the précis so the
+report stands alone.
+
+- **Entry point & auth scoping:** read of `lib/server.js`, `lib/index.js`,
+  `lib/plugins/{basicAuth,whitelist,blacklist}.js` — `post-verification-render-entry-point.md`.
+- **Feasibility, without crawler dependency:** read of `lib/server.js`,
+  `lib/util.js`, confirmation that no crawler/sitemap/queue code is reachable
+  from the render path — `post-verification-render-feasibility.md`.
+- **Failure enumeration & reproduction:** read of `lib/browsers/chrome.js`
+  + live repros against the local server on branch tip `e83e20d` —
+  `post-verification-render-failure-cases.md`.
+- **Page-selection comparison:** three-strategy evaluation (verified root,
+  user-provided, small fixed set) against reliability and latency from the
+  onboarding-render sample —
+  `evaluate-post-verification-rendered-page-activation-page-selection.md`.
+- **Sync vs. async decision:** latency and failure-mode data applied to the
+  onboarding flow — `sync-vs-async-decision.md` (task 4.1).
+
+## 4. Decisions
+
+### 4.1 Page selection: verified domain root, with override
+
+**Decision: default to `https://<verified-domain>/`.** Optional user override,
+pre-filled with the verified root.
+
+Rationale (full detail in task 2.4 doc):
+
+- The root is the URL whose reachability we just proved during verification —
+  conditional probability of render success is materially higher than for an
+  arbitrary user-supplied URL.
+- Single render call, so P95 latency is bounded by one `pageLoadTimeout`.
+- No probe/fallback code path, no backend caching, no render-accounting
+  change — all explicitly out of scope for this spec.
+- Override input covers legitimate edge cases (SPA shells, locale redirects,
+  marketing-only homepages) without introducing a probe-and-pick path.
+
+**Explicitly rejected:** the "small fixed set" probe-and-pick strategy. Each
+failed probe costs a full `pageLoadTimeout` before fall-through, which breaks
+the latency budget and compounds onboarding-render accounting (non-goal).
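+
+The default-with-override rule can be sketched as follows. This is a minimal
+illustration, not the wizard's code: `defaultPreviewUrl` and
+`resolvePreviewUrl` are hypothetical names, and the sketch assumes the
+verified domain arrives as a bare hostname.
+
+```typescript
+// Hypothetical helper: the verified root is always the default preview URL.
+function defaultPreviewUrl(verifiedDomain: string): string {
+  return `https://${verifiedDomain}/`;
+}
+
+// Hypothetical helper: use the override only when the user typed something.
+// A blank or whitespace-only override falls back to the verified root.
+function resolvePreviewUrl(verifiedDomain: string, override?: string): string {
+  const trimmed = (override ?? "").trim();
+  return trimmed.length > 0 ? trimmed : defaultPreviewUrl(verifiedDomain);
+}
+```
+
+The sketch does not validate the override. A malformed value is left for the
+render service to reject with `400`, which the wizard surfaces as an inline
+input error (§5.2 case 6).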
+
+### 4.2 Sync vs. async: async-first with a 3 s opportunistic sync window
+
+**Decision: async-first. Never block the wizard on the render.** Fire the
+render the moment verification succeeds; if it returns within a 3 s wall-clock
+budget show it inline; otherwise advance the wizard and update the preview
+slot in place when the render finally resolves (success or failure).
+
+Rationale (full detail in task 4.1 doc, `sync-vs-async-decision.md`):
+
+- P95 latency (budget: 12 s) sits too close to the 20 s `pageLoadTimeout`
+  for a blocking wizard step to be acceptable. Timeouts and navigation
+  errors only resolve after the full page-load budget elapses, so a sync UX
+  pays worst-case latency on every failure.
+- A fully deferred UX (email / later surface) loses the "Aha" moment the
+  spike is trying to create — a non-trivial fraction of homepages do render
+  quickly and those users deserve inline feedback.
+- The 3 s sync budget is well above typical fast-mode render time and well
+  below hang-perception thresholds for a secondary widget inside a completed
+  step. Tunable; the follow-up ticket should instrument actual hit rate.
+
+The verification step itself is **always** marked complete regardless of
+render outcome. The preview is additive, never a gate.
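+
+One way to express the opportunistic sync window is a race between the render
+promise and a timer. The sketch below is illustrative only: `SyncRace`,
+`raceSyncWindow`, and `SYNC_WINDOW_MS` are hypothetical names, and it assumes
+the render promise resolves with an outcome object for both successes and
+failures (a rejection inside the window is passed through to the caller).
+
+```typescript
+// Result of the race: either the render beat the window, or it is still
+// running and the caller should update the preview slot when it settles.
+type SyncRace<T> =
+  | { kind: "inline"; value: T }
+  | { kind: "pending"; settled: Promise<T> };
+
+// 3 s starting point from this report; tunable.
+const SYNC_WINDOW_MS = 3000;
+
+function raceSyncWindow<T>(
+  render: Promise<T>,
+  windowMs: number = SYNC_WINDOW_MS,
+): Promise<SyncRace<T>> {
+  return new Promise<SyncRace<T>>((resolve, reject) => {
+    // If the window elapses first, hand the still-running render back.
+    const timer = setTimeout(
+      () => resolve({ kind: "pending", settled: render }),
+      windowMs,
+    );
+    render.then(
+      (value) => {
+        clearTimeout(timer);
+        // No-op if the window already elapsed and the promise is settled.
+        resolve({ kind: "inline", value });
+      },
+      (error) => {
+        clearTimeout(timer);
+        reject(error);
+      },
+    );
+  });
+}
+```
+
+The render is never cancelled when the window elapses. The `pending` branch
+carries the same promise, so the preview slot can be updated in place once it
+settles.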
+
+## 5. Failure-mode handling
+
+Consolidates task 2.3 (`post-verification-render-failure-cases.md`) and the
+wizard-side surface from task 1.2. All states are non-blocking; all are
+derived from `statusCode`, body, and `x-prerender-504-reason` alone — no new
+render-service signal is required.
+
+### 5.1 Render-path cases (originate in this repo)
+
+| # | Case | Server signal | HTTP seen by wizard | Wizard UX state |
+|---|---|---|---|---|
+| 1 | Renderer error (Chrome-side) | `renderErrorStatusCode` branch in `chrome.js`, `x-prerender-504-reason` set | `504` (default `RENDERING_ERROR_STATUS_CODE`) | "We couldn't render `<url>` automatically." Retry button; no auto-retry in MVP. |
+| 2 | Non-2xx upstream (3xx/4xx/5xx propagated) | `tab.prerender.statusCode = params.response.status` | Origin status (e.g. `404`, `500`, `301`) | Show status + short explanation ("Your site returned `403`. If this page is behind login, try a public URL."). Offer override. |
+| 3 | Blocked / auth-gated page | Origin `401`/`403`, or `200` with login HTML / WAF interstitial | Pass-through status, or `200` with thin body | Hard case: same as #2. Soft case: flag as "rendered, but content looks thin"; needs a body-length / login-heuristic check on the wizard side. |
+| 4 | Empty / unparseable payload | `parseHtmlFromPage` resolves with empty body | `200` (or passed-through) with ~empty body | "Preview looks empty." Treat as render failure. Offer override. |
+| 5 | Page-load timeout | `pageLoadTimeout` + `tab.prerender.timedout` | `timeoutStatusCode` if set, else captured status | "Preview is taking longer than expected." Retry button. |
+
+### 5.2 Operational / wizard-side cases
+
+| # | Case | Trigger | Wizard UX state |
+|---|---|---|---|
+| 6 | Invalid URL (shouldn't happen with verified root; can with override) | `valid-url.isWebUri` rejects → `400` | Inline input error on the override field. |
+| 7 | Service unavailable / auth rejected / domain not allow-listed | `401` (basicAuth), `404` (whitelist/blacklist), network failure | Generic "Preview temporarily unavailable — try again" banner. **Never** blocks the wizard. |
+
+### 5.3 Properties the MVP relies on
+
+- **Every case is observable from the response alone.** No new telemetry, no
+  new endpoint, no new header is needed for the MVP UI to distinguish them.
+- **No case blocks verification completion.** The wizard's primary CTA stays
+  enabled at all times. This is the invariant the async-first decision
+  protects.
+- **Retry is user-initiated.** The spike deliberately does not auto-retry:
+  cascading retries on the long tail would break the latency budget for any
+  user left watching.
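+
+The tables in §5.1 and §5.2 can be folded into a single classifier over the
+response. The sketch below is one possible shape under stated assumptions:
+all type and function names are hypothetical, `thinBodyChars` is an
+illustrative threshold that this report does not fix, and a timeout is only
+recognised when `timeoutStatusCode` is configured on the render service.
+
+```typescript
+type PreviewState =
+  | "success"
+  | "renderer-error" // §5.1 case 1
+  | "origin-status"  // §5.1 case 2 and case 3 (hard)
+  | "thin-content"   // §5.1 case 3 (soft)
+  | "empty"          // §5.1 case 4
+  | "timeout"        // §5.1 case 5
+  | "invalid-url"    // §5.2 case 6
+  | "unavailable";   // §5.2 case 7
+
+interface RenderResponse {
+  status: number | null;       // null models a network failure
+  body: string;
+  reasonHeader: string | null; // x-prerender-504-reason, if present
+}
+
+interface ClassifyOptions {
+  renderErrorStatusCode: number; // 504 by default (§2)
+  timeoutStatusCode?: number;    // only if the render service sets it
+  thinBodyChars: number;         // hypothetical threshold, a judgement call
+}
+
+function classifyRender(res: RenderResponse, opts: ClassifyOptions): PreviewState {
+  // Network failure, or the browser-not-connected 503 noted in §7.
+  if (res.status === null || res.status === 503) return "unavailable";
+  if (res.status === 400) return "invalid-url";
+  // Renderer-side failure: error status plus the reason header.
+  if (res.status === opts.renderErrorStatusCode && res.reasonHeader !== null) {
+    return "renderer-error";
+  }
+  if (opts.timeoutStatusCode !== undefined && res.status === opts.timeoutStatusCode) {
+    return "timeout";
+  }
+  // Anything else outside 2xx is treated as a propagated origin status.
+  if (res.status < 200 || res.status >= 300) return "origin-status";
+  const visible = res.body.trim().length;
+  if (visible === 0) return "empty";
+  if (visible < opts.thinBodyChars) return "thin-content";
+  return "success";
+}
+```
+
+This sketch maps `401` and `404` to `origin-status` because the status code
+alone does not say whether the render service or the origin produced it.
+Separating case 7 from cases 2 and 3 for those codes is left open here.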
+
+## 6. Recommended implementation path (for task 4.3 / the follow-up ticket)
+
+Scoped so that the follow-up fits in a single wizard PR, with no
+render-service change.
+
+1. **Call site.** After verification resolves successfully, the wizard
+   backend issues one
+   `GET <PRERENDER_BASE_URL>/render?url=<verifiedRoot>&renderType=html&followRedirects=true`
+   using the existing crawler credential. Pre-fill an override input with
+   the verified root. Pass a neutral user-agent so the rendered output
+   matches what a real visitor sees rather than what the crawler UA is
+   served.
+2. **Kickoff timing.** Fire the render immediately when verification succeeds
+   — not when the user clicks "Next." The sync window is a race against the
+   render, not against user think-time.
+3. **Sync window.** 3 s wall-clock from kickoff. If the render resolves within
+   the window, show the preview inline. Otherwise, advance the wizard with a
+   "Your preview is rendering…" affordance in the preview slot.
+4. **Async resolution.** When the render ultimately resolves, update the
+   preview slot in place (success, failure, or empty). If the user has moved
+   on, the finished preview is available on the post-onboarding / integration
+   settings page.
+5. **UI states.** `loading`, `success` (iframe / sandbox, plus "looks right?
+   / try a different URL"), and the seven failure states from §5.1–5.2. All
+   are required by the universal async-UI rule; none requires backend
+   changes.
+6. **Sandbox.** Render output goes into a sandboxed iframe; treat as
+   untrusted third-party HTML; do not persist beyond the request/response
+   lifecycle in the MVP.
+7. **Client-side timeout cap.** Hard cap the wizard's own fetch at slightly
+   above `pageLoadTimeout` (e.g. 25 s) so a stuck tab cannot wedge the
+   preview slot indefinitely.
+8. **Scope guardrails (explicit deferrals).** No caching, no batching, no
+   screenshots, no before/after diff, no per-customer quota accounting, no
+   analytics instrumentation, no i18n / a11y polish — all spec non-goals.
+   Task 4.3 is the place to lock this list down for the implementation
+   ticket.
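+
+Steps 1 and 7 can be sketched together as below. This is a hedged
+illustration rather than the wizard backend's actual client: `buildRenderUrl`,
+`fetchRender`, and `RenderCallOptions` are hypothetical names, the credential
+and user-agent values are supplied by the caller, and the sketch assumes a
+runtime with global `fetch` and `AbortController`.
+
+```typescript
+// Slightly above the 20 s pageLoadTimeout, per step 7.
+const CLIENT_TIMEOUT_MS = 25_000;
+
+// Build the single render request from step 1. URLSearchParams handles the
+// percent-encoding of the target URL.
+function buildRenderUrl(prerenderBaseUrl: string, targetUrl: string): string {
+  const url = new URL("/render", prerenderBaseUrl);
+  url.search = new URLSearchParams({
+    url: targetUrl,
+    renderType: "html",
+    followRedirects: "true",
+  }).toString();
+  return url.toString();
+}
+
+interface RenderCallOptions {
+  authorization: string; // existing crawler credential (assumed header value)
+  userAgent: string;     // the "neutral" UA; exact string is not specified here
+  timeoutMs?: number;
+}
+
+// Issue the call with a hard client-side cap so a stuck render cannot wedge
+// the preview slot. An aborted request rejects, which the caller can map to
+// the transient-unavailable state.
+async function fetchRender(renderUrl: string, opts: RenderCallOptions) {
+  const controller = new AbortController();
+  const timer = setTimeout(
+    () => controller.abort(),
+    opts.timeoutMs ?? CLIENT_TIMEOUT_MS,
+  );
+  try {
+    const res = await fetch(renderUrl, {
+      headers: {
+        Authorization: opts.authorization,
+        "User-Agent": opts.userAgent,
+      },
+      signal: controller.signal,
+    });
+    return {
+      status: res.status,
+      body: await res.text(),
+      reasonHeader: res.headers.get("x-prerender-504-reason"),
+    };
+  } finally {
+    clearTimeout(timer);
+  }
+}
+```
+
+Note that `new URL("/render", base)` replaces any path on the base URL, so
+this sketch assumes the render service is mounted at the host root.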
+
+Rough sizing (not binding; task 4.3 refines):
+
+- **Frontend:** 1 panel, 1 sandboxed iframe, state machine covering the
+  seven failure states and the sync / async handoff. Small.
+- **Backend:** 1 service-to-service call at the verification-success
+  handler, reusing existing credentials and HTTP client. Smaller.
+- **Ops:** no new service, no new env var, no new dashboard. Monitor the
+  3 s sync-window hit rate post-launch to tune the budget.
+
+## 7. Open risks and follow-ups
+
+All items here are **out of scope for this spike**. They are tracked here so
+they are not lost going into task 4.3 and the follow-up implementation.
+
+1. **Sync-window hit rate is not measured yet.** 3 s is the starting point;
+   the follow-up should instrument how often the inline race wins and tune
+   from data, not intuition.
+2. **SPA-shell false negatives.** A `200` with a near-empty body, returned
+   because hydration did not finish, will read as "empty payload" in §5.1
+   case 4. Body-length heuristics are an MVP wizard concern, not a
+   render-service concern; the threshold is a judgement call for the
+   follow-up.
+3. **Soft bot / WAF blocks indistinguishable from a thin-content page.**
+   Same class as (2); same treatment. Accept false-negative activation
+   signals as a known limit of the MVP; document in the UI copy.
+4. **Post-wizard surfacing.** Whether the preview should persist in the
+   integration's ongoing settings view is a product decision, not a
+   rendering decision. Flagged for the follow-up.
+5. **Override abuse / retries.** A customer spamming override + retry is
+   rate-limited by the render service's own connection concurrency, but the
+   spec explicitly defers quota/accounting. If abuse is observed post-launch,
+   mitigate in the wizard backend, not here.
+6. **User-agent choice.** A neutral UA keeps the rendered output
+   representative but means the wizard cannot reuse crawler-UA-specific
+   cached renders on the render service. This is a deliberate trade-off.
+7. **Browser health.** The render server restarts Chrome periodically
+   (`browserTryRestartPeriod`) and may reject with `503` when the browser
+   isn't connected. The async-first design absorbs this; the follow-up just
+   needs to treat `503` as the transient-unavailable state in §5.2 case 7.
+
+## 8. Acceptance-criteria mapping
+
+| Criterion | Section |
+|---|---|
+| Feasibility of post-verification render is clear | §2 |
+| Page selection approach decided | §4.1 |
+| Failure cases documented | §5 |
+| Recommended implementation path defined | §6 |
+
+## 9. Non-goals respected
+
+This report does not:
+
+- Propose changes to the render API, the crawler, or the verification
+  endpoint.
+- Propose multi-URL batching, screenshots, visual diffing, or before/after
+  comparison.
+- Propose pricing, quota, or render-accounting changes for onboarding renders.
+- Propose caching beyond what the existing render path already provides.
+- Propose i18n, a11y, or analytics instrumentation for the eventual UI.
+- Propose any refactor of `features/integrationWizard` beyond hosting the
+  throwaway prototype covered in task 3.1.
+- Ship production UI; the wizard implementation is the follow-up ticket
+  scoped by task 4.3 on the basis of this report.