Skip to content

pipeline-manager: dismiss deployment error and clear bootstrap policy#5663

Merged
snkas merged 1 commit intomainfrom
dismiss-deployment-error
Feb 23, 2026
Merged

pipeline-manager: dismiss deployment error and clear bootstrap policy#5663
snkas merged 1 commit intomainfrom
dismiss-deployment-error

Conversation

@snkas
Copy link
Contributor

@snkas snkas commented Feb 19, 2026

Add the /dismiss_error endpoint for the user to dismiss the content of the deployment_error field, which is set when the pipeline becomes Stopping/Stopped due to an fatal error occurring deployment. The functionality is also integrated into the /start endpoint with the parameter ?dismiss_error=false/true (default: true).

Python API changes:

  • pipeline.start() has the ignore_deployment_error parameter removed (not backward compatible)
  • pipeline.start() has the dismiss_error parameter added which is by default True
  • pipeline.dismiss_error() is added

CLI client fda changes:

  • fda start has the --no-dismiss-error parameter added
  • fda dismiss-error is added

The bootstrap_policy field is fixed to be cleared along with deployment_initial when the pipeline is stopping or has stopped.

PR information

  • Testing: database unit test, database fuzzing inclusion, Python integration test, simple fda test
  • Documentation updated
  • Changelog updated
  • Backward incompatible change: removal of the ignore_deployment_error parameter for the Python API pipeline.start() method

@lalithsuresh
Copy link
Contributor

lalithsuresh commented Feb 19, 2026

This sounds like a breaking change for all orchestration workflows (that is, all scripts need to call dismiss error before calling start to be idempotent, and that's a poor UX). Could you clarify?

What's wrong with start() always overriding deployment_errors?

@snkas
Copy link
Contributor Author

snkas commented Feb 19, 2026

Error dismissal is not integrated into the /start endpoint, as otherwise it would be impossible for clients to first do an error dismissal and then start without having the possibility that /start retries dismiss an error.

@snkas
Copy link
Contributor Author

snkas commented Feb 19, 2026

It's for the following use case:

  1. User calls HTTP POST to /start but doesn't get the HTTP response
  2. Pipeline starts but crashes with error
  3. User calls retry HTTP POST to /start
  4. Pipeline would start again dismissing the error

@snkas
Copy link
Contributor Author

snkas commented Feb 19, 2026

For the client (chiefly, Python, fda doesn't have retries at least currently), it needs to use the separate endpoint for this reason. Such that when it does start(dismiss_error=True/False) it does:

  1. Dismiss an error (if dismiss_error=True)
  2. Call /start with retries
  3. Await for the status to becoming matching initial

@lalithsuresh
Copy link
Contributor

It's for the following use case:

1. User calls HTTP POST to `/start` but doesn't get the HTTP response

2. Pipeline starts but crashes with error

3. User calls retry HTTP POST to `/start`

4. Pipeline would start again dismissing the error

And that looks looks fine to me. If they want the deployment error to block a start in orchestration code, they can call start(ignore_deployment_error=false). What am I missing?

@snkas
Copy link
Contributor Author

snkas commented Feb 19, 2026

The default behavior of the HTTP API /start should be to not dismiss error, rather than to dismiss it like it does currently indirectly.

@snkas
Copy link
Contributor Author

snkas commented Feb 19, 2026

I could add a /start?dismiss_error=false/true parameter, but dismiss_error=true shouldn't be used by the clients in their implementation because of the use case I described above.

Also, it's unclear what idempotency means when the pipeline is already running and one calls due to retry /start?dismiss_error=true.

@ryzhyk
Copy link
Contributor

ryzhyk commented Feb 19, 2026

I'm with @lalithsuresh . It's a footgun if an error from the previous run can cause /start to fail in the default case. I think the default behavior of /start should be dismiss_error=true.

@ryzhyk
Copy link
Contributor

ryzhyk commented Feb 19, 2026

I do like the new /dismiss_error endpoint - finally we'll be able to clear errors in the UI (@Karakatiza666 ).

@snkas snkas force-pushed the dismiss-deployment-error branch from e3e739a to 2f4c855 Compare February 19, 2026 20:58
@snkas
Copy link
Contributor Author

snkas commented Feb 19, 2026

@claude Review the code changes

@claude
Copy link

claude bot commented Feb 19, 2026

Claude finished @snkas's task in 2m 50s —— View job


Code Review

  • Gathering context and reading changed files
  • Reviewing backend changes (Rust)
  • Reviewing database layer changes
  • Reviewing Python API changes
  • Reviewing CLI changes
  • Reviewing tests
  • Final summary

Overall the implementation is well-structured and the bootstrap_policy fix is a good correction. A few issues worth addressing below.


Bugs

1. Busy-wait loop in test without sleep (python/tests/platform/test_pipeline_error.py:85-91)

while time.time() < deadline:
    if pipeline.status() == PipelineStatus.STOPPED:
        return
raise TimeoutError(...)

There is no time.sleep() inside the loop. This will spin the CPU at 100% for up to 60 seconds. Add time.sleep(0.1) (or similar) inside the loop body, consistent with how other wait loops in the codebase are written.

Fix this →

2. Docstring says "False by default" but actual default is True (python/feldera/rest/feldera_client.py:502)

def start_pipeline(
    self,
    pipeline_name: str,
    ...
    dismiss_error: bool = True,  # <-- actual default
) -> Optional[PipelineStatus]:
    """
    :param dismiss_error: Set True to dismiss any deployment error before starting. False by default.  # <-- wrong
    """

The docstring says "False by default" but the signature has dismiss_error: bool = True. Fix this →


Design Issues

3. start_pipeline_as_paused and start_pipeline_as_standby don't expose dismiss_error (python/feldera/rest/feldera_client.py:514-558)

start_pipeline has a dismiss_error parameter, but start_pipeline_as_paused and start_pipeline_as_standby don't expose it — they use the default True silently. A user who calls start_pipeline_as_paused(dismiss_error=False) gets a TypeError rather than the intended behavior. For consistency this should be added to both methods.

4. Two-step dismiss+start in Python client is intentionally non-atomic

The Python _inner_start_pipeline calls /dismiss_error then /start?dismiss_error=false as two separate HTTP requests (lines 465-476). The PR comments explain the reasoning correctly: this prevents a retry of /start from silently dismissing a new error that occurred during the retry window. This is a valid and deliberate design choice, and the comment in the code makes it clear. Worth preserving in a code comment in the method.

5. HTTP API default for dismiss_error is true, contradicting the original PR description

The PR description says "dismiss_error=false (default)" for the /start endpoint, but the code sets:

fn default_pipeline_start_dismiss_error() -> bool {
    true
}

The implementation matches what the reviewers (ryzhyk, lalithsuresh) requested. The PR description should be updated to reflect this.


Minor

6. Response body of post_pipeline_dismiss_error is a bare JSON string

Ok(HttpResponse::Ok().json(json!("Pipeline deployment error has been dismissed")))

This serializes as a JSON string "Pipeline deployment error has been dismissed" rather than a JSON object. Most other endpoints return {} or no body on success. This inconsistency is minor but worth aligning with the convention used by other lifecycle endpoints (e.g., /stop, /clear).


Positives

  • Bootstrap policy fix is correct. bootstrap_policy is now cleared on Stopping and Stopped transitions, and retained through Provisioning and Provisioned. This matches the intended lifecycle and the test in test_pipeline_lifecycle.py::test_pipeline_bootstrap_policy_is_removed validates it end-to-end.
  • dismiss_deployment_error is idempotent when no error is present — the condition && current.deployment_error.is_some() means a no-op dismiss always succeeds regardless of pipeline state.
  • DB-layer atomicity is preserved. The dismiss_error flag within set_deployment_resources_desired_status clears the error as part of the same transaction as the status change.
  • The proptest model is updated correctly alongside the real implementation, including the new DismissDeploymentError action variant.
  • PipelineBuilder.create_or_replace() correctly adds dismiss_error() so stale errors don't block re-creation.

@snkas
Copy link
Contributor Author

snkas commented Feb 19, 2026

@claude Review the code changes again

@claude
Copy link

claude bot commented Feb 19, 2026

Claude finished @snkas's task in 3m 54s —— View job


Code Review

  • Gathering context and reading changed files
  • Reviewing backend changes (Rust)
  • Reviewing database layer changes
  • Reviewing Python API changes
  • Reviewing CLI changes
  • Reviewing tests
  • Final summary

Previous review items have been addressed: the busy-wait loop now has time.sleep(0.2), the wrong docstring is fixed, and dismiss_error is exposed on start_pipeline_as_paused / start_pipeline_as_standby.


Bugs

1. Test: silent pass when expected exception is not raised (python/tests/platform/test_pipeline_error.py:50-54)

try:
    pipeline.start(dismiss_error=False)
except FelderaAPIError as e:
    assert e.status_code == 400
    assert e.error_code == "CannotStartWithUndismissedError"
# ← if the exception is NOT raised, the two assertions above never execute
assert pipeline.deployment_error() is not None

If start(dismiss_error=False) does not raise (e.g. due to a regression), the test silently passes without ever checking status_code or error_code. Replace with pytest.raises:

with pytest.raises(FelderaAPIError) as exc_info:
    pipeline.start(dismiss_error=False)
assert exc_info.value.status_code == 400
assert exc_info.value.error_code == "CannotStartWithUndismissedError"
assert pipeline.deployment_error() is not None

Fix this →


Still Present from Previous Review

2. Response body of post_pipeline_dismiss_error is a bare JSON string (crates/pipeline-manager/src/api/endpoints/pipeline_management.rs)

Ok(HttpResponse::Ok().json(json!("Pipeline deployment error has been dismissed")))

This serializes as the JSON string "Pipeline deployment error has been dismissed" (with quotes). Other lifecycle endpoints (/stop, /clear) return either an empty body or {}. This inconsistency is still present.


Design Observations

3. HTTP API dismiss_error=true default is never exercised by the Python client

_inner_start_pipeline unconditionally sets start_params["dismiss_error"] = "false" and handles dismissal as a separate pre-call. This is intentional and the comment in the code explains the rationale. Worth knowing that the HTTP API's dismiss_error=true default path is only reachable by direct HTTP consumers, not by the Python SDK.

4. create_or_replace dismiss_error before recreate

pipeline_builder.py calls p.dismiss_error() after stop(force=True) and clear_storage(). Since stop(force=True) is synchronous and guarantees a fully-stopped state before returning, this is safe. If any of the three calls raise FelderaAPIError, the except block catches it and proceeds with creation anyway — acceptable since the pipeline may not exist yet.


Positives

  • Bootstrap policy fix is correct and well-tested. bootstrap_policy is now retained through Provisioning/Provisioned and cleared on Stopping/Stopped. The new test_pipeline_bootstrap_policy_is_removed test validates this end-to-end across all three bootstrap policy variants.
  • Previous review's busy-wait loop is fixed. cause_error_and_wait_for_stopped now has time.sleep(0.2) inside the loop.
  • dismiss_deployment_error is idempotent. The condition && current.deployment_error.is_some() means calling it when there is no error always succeeds regardless of pipeline state.
  • DB-layer atomicity is preserved. The dismiss_error flag within set_deployment_resources_desired_status clears the error in the same transaction as the desired-status change.
  • Proptest model updated correctly. The new DismissDeploymentError action variant is added to the fuzzing model alongside the real implementation.
  • Both docstrings are now correct. dismiss_error: True by default is accurately documented across pipeline.py and feldera_client.py.
  • start_paused, start_standby, and restart all properly expose dismiss_error. The previous gap is fixed.

@snkas
Copy link
Contributor Author

snkas commented Feb 19, 2026

@claude Review the code changes again

@claude
Copy link

claude bot commented Feb 19, 2026

Claude finished @snkas's task in 3m 53s —— View job


Code Review

  • Gathering context and reading changed files
  • Reviewing backend changes (Rust)
  • Reviewing database layer changes
  • Reviewing Python API changes
  • Reviewing CLI changes
  • Reviewing tests
  • Final summary

All items from the previous two reviews have been addressed. One minor item remains outstanding; the other "still present" issue from the last review was a false positive.


Minor

1. pytest.raises is more idiomatic (carried from previous review)

python/tests/platform/test_pipeline_error.py:50-57

error_occurred = False
try:
    pipeline.start(dismiss_error=False)
except FelderaAPIError as e:
    error_occurred = True
    assert e.status_code == 400
    assert e.error_code == "CannotStartWithUndismissedError"
assert error_occurred

Clarification from previous review: The previous review stated this "silently passes" if the exception isn't raised — that's incorrect. assert error_occurred on line 57 will fail the test if the exception is never raised. The pattern is functionally correct.

That said, pytest.raises is still the preferred idiomatic form because it makes the intent clearer and avoids the extra state variable:

with pytest.raises(FelderaAPIError) as exc_info:
    pipeline.start(dismiss_error=False)
assert exc_info.value.status_code == 400
assert exc_info.value.error_code == "CannotStartWithUndismissedError"
assert pipeline.deployment_error() is not None

Fix this →


Correction from Previous Review

The post_pipeline_dismiss_error response body is consistent, not anomalous.

The previous review flagged this as inconsistent with other lifecycle endpoints:

Ok(HttpResponse::Ok().json(json!("Pipeline deployment error has been dismissed")))

On inspection, all lifecycle action endpoints in this codebase return bare JSON strings in the same style:

  • /start"Pipeline is starting"
  • /stop?force=true"Pipeline is forcefully stopping"
  • /clear"Pipeline storage is being cleared"

The dismiss_error response is fully consistent with this pattern. Not an issue.


Positives

  • Bootstrap policy fix is correct. set_deployment_resources_status_stopping and set_deployment_resources_status_stopped now both pass None, None for final_deployment_initial and final_bootstrap_policy, clearing them atomically on the Stopping and Stopped transitions. Provisioning and Provisioned transitions preserve these fields from current. The new test_pipeline_bootstrap_policy_is_removed test validates this end-to-end for all three policy variants.

  • dismiss_deployment_error idempotency is correct. The condition && current.deployment_error.is_some() means calling dismiss when there is no error is always a no-op regardless of pipeline state. Calling it when there is an error but the pipeline is not fully stopped correctly returns DismissErrorRestrictedToFullyStopped.

  • DB atomicity is preserved. When dismiss_error=true is passed to set_deployment_resources_desired_status, the error is cleared as part of the same SQL UPDATE that sets the desired status — no TOCTOU window.

  • Two-step Python dismiss+start is intentional and commented. The comment in _inner_start_pipeline (line 463) correctly explains why dismiss and start are separate requests: retry safety — a /start retry must not silently swallow a new crash that occurred in the retry window.

  • fda correctly mirrors the two-step pattern. Lines 819–847 dismiss first (when !no_dismiss_error), then call /start?dismiss_error=false.

  • Proptest model updated. DismissDeploymentError variant is added to the fuzz model alongside the real implementation.

  • All docstrings are correct. dismiss_error: True by default is accurately documented across pipeline.py and feldera_client.py. The previous "False by default" bug is fixed.

  • dismiss_error is exposed on start_paused, start_standby, and restart. The gap from the first review is closed.

  • Busy-wait loop has time.sleep(0.2). The CPU-spinning bug from the first review is fixed.


@snkas snkas requested review from gz and igorscs February 19, 2026 21:50
@igorscs
Copy link
Contributor

igorscs commented Feb 20, 2026

Probably we should document the new API in documentations docs.feldera.com dir pipelines/lifecycle.md and modifying.md, mention that /dismiss_error is a sync operation that guarantees error is cleared in tutorials/restapi. Also would be helpful to add an example how to use this api in python/docs/examples.rst and/or introduction.rst

Copy link
Contributor

@igorscs igorscs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python looks good

The `ignore_deployment_error` parameter has been removed from the Python
`pipeline.start()` method. Instead, make use of the newly added `dismiss_error` parameter.
A pipeline deployment error can now be separately dismissed using a dedicated endpoint
and the corresponding client functions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be good to add a code-snippet of how this was invoked before and how an invocation of start looks like now

Copy link
Contributor

@igorscs igorscs Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update the changelog with all api changes in my PR, if it works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edited Changelog to:

        The `ignore_deployment_error` parameter has been removed from the Python
        `pipeline.start()` method. Instead, make use of the newly added `dismiss_error` parameter.
        If you do not want the pipeline to start if there is a pre-existing deployment error,
        you should call `pipeline.start(dismiss_error=False)`. Otherwise call `pipeline.start()`
        which is by default equivalent to `pipeline.start(dismiss_error=True)` (preserving
        existing behavior). If the start results in an error occurring, the method will still
        throw an error as before. A pipeline deployment error can now also be separately dismissed
        using a dedicated endpoint and the corresponding client functions (e.g.,
        `pipeline.dismiss_error()` for the Python client).

@igorscs igorscs added this pull request to the merge queue Feb 20, 2026
@igorscs igorscs removed this pull request from the merge queue due to a manual request Feb 20, 2026
@snkas snkas force-pushed the dismiss-deployment-error branch from d435b9a to c86adee Compare February 20, 2026 16:21
@snkas
Copy link
Contributor Author

snkas commented Feb 20, 2026

Probably we should document the new API in documentations docs.feldera.com dir pipelines/lifecycle.md and modifying.md, mention that /dismiss_error is a sync operation that guarantees error is cleared in tutorials/restapi. Also would be helpful to add an example how to use this api in python/docs/examples.rst and/or introduction.rst

@igorscs I updated pipelines/lifecycle.md. pipelines/modifying.md is about backfill avoidance, so seems out of scope to add information about deployment errors there. The same for tutorials/rest_api, in that it right now at least just showcases a first tutorial on how to interact with the API. Yes, for python/docs/*.rst, it can be further expanded to be added, but at least right now it is backward compatible and there is no usage of ignore_deployment_error that needs to be removed.

@snkas
Copy link
Contributor Author

snkas commented Feb 20, 2026

@claude Please review the code changes again

@claude
Copy link

claude bot commented Feb 20, 2026

Claude finished @snkas's task in 3m 27s —— View job


Code Review

  • Gathering context and reading changed files
  • Reviewing backend changes (Rust)
  • Reviewing database layer changes
  • Reviewing Python API changes
  • Reviewing CLI changes
  • Reviewing tests and documentation
  • Final summary

All items from previous reviews have been addressed. One minor style nit remains; nothing blocking.


Minor (carried from review #3)

try/except with error_occurred flag (python/tests/platform/test_pipeline_error.py:50-57)

error_occurred = False
try:
    pipeline.start(dismiss_error=False)
except FelderaAPIError as e:
    error_occurred = True
    assert e.status_code == 400
    assert e.error_code == "CannotStartWithUndismissedError"
assert error_occurred

As noted in review #3, this is functionally correct (assert error_occurred catches the non-raise case). pytest.raises is still the idiomatic form but there's no bug here.


Correctness of New Items Since Last Review

--no-dismiss-error requires("start") in create-or-replace (crates/fda/src/cli.rs:598)

The requires("start") constraint is now in place for the CreateOrReplace variant. The Start and Restart variants correctly omit it since those actions always imply a start.

BootstrapPolicy.from_str implementation (python/feldera/enums.py)

@staticmethod
def from_str(value):
    for member in BootstrapPolicy:
        if member.name.lower() == value.lower():
            return member
    raise ValueError(...)

member.name.lower() produces "allow", "reject", "await_approval" — matching exactly what bootstrap_policy_to_string in Rust stores in the DB. The round-trip (start(bootstrap_policy=X)pipeline.bootstrap_policy() == X) is verified by the new integration test.

bootstrap_policy field cleared on Stopping/Stopped, retained on Provisioning/Provisioned

The DB fix is applied symmetrically in both the model (DbModel in test.rs) and the real implementation (set_deployment_resources_status). The new final_bootstrap_policy parameter is:

  • None in set_deployment_resources_status_stopping and set_deployment_resources_status_stopped — correct.
  • current.bootstrap_policy in set_deployment_resources_status_provisioning and set_deployment_resources_status_provisioned — correct.

dismiss_deployment_error precondition check in set_deployment_resources_desired_status

The check if new_deployment_error.is_some() in validate_resources_desired_status_transition is only reachable in the Stopped → Provisioned arm. For the Provisioned → Stopped and Stopped → Stopped arms, the error field is ignored by the validation logic. Passing deployment_error through to the stop-path validators is safe and adds no incorrect restrictions.

pipeline_builder.py dismiss after stop(force=True) and clear_storage()

stop(force=True) blocks until the pipeline is in Stopped/Stopped before returning, so dismiss_error() is guaranteed to see a fully-stopped pipeline. Correct.

fda test.bash sequence

fda dismiss-error p1       # no-op (no error present), idempotent
fda start p1 --no-dismiss-error  # start without dismissing, succeeds since no error
fda start p1               # normal start with dismiss (default)

This sequence correctly exercises both code paths.

Changelog (docs.feldera.com/docs/changelog.md)

The migration guidance is clear: calls using ignore_deployment_error=True should switch to dismiss_error=True (the new default), and users who want the blocking behavior should use dismiss_error=False.


Positives

  • All three previous reviews' actionable items are resolved.
  • bootstrap_policy fix is complete and correct across both DB model and real implementation.
  • dismiss_error parameter is consistently exposed on start, start_paused, start_standby, and restart.
  • DB-layer atomicity preserved: dismiss_error flag within set_deployment_resources_desired_status clears the error in the same SQL UPDATE as the desired-status change.
  • Proptest model updated with DismissDeploymentError variant.
  • Idempotency of /dismiss_error is correct: && current.deployment_error.is_some() means a no-op dismiss always succeeds regardless of pipeline state.
  • fda dismiss-error command correctly mirrors the Python pipeline.dismiss_error() API.

Copy link
Contributor

@igorscs igorscs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@igorscs igorscs added this pull request to the merge queue Feb 20, 2026
@snkas snkas removed this pull request from the merge queue due to a manual request Feb 20, 2026
Copy link
Collaborator

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is correct and the design is sound. The key design choices are well-reasoned:

  • as the HTTP API default matches what Leonid and Lalith asked for — old errors from a previous crash shouldn't silently block a retry
  • The two-step dismiss-then-start in the Python client (always calling first, then ) is the right call for idempotent retries: a retry must not silently eat a new crash error that occurred in the retry window
  • being cleared on Stopping/Stopped but preserved through Provisioning/Provisioned is the correct lifecycle semantics

The bot did four rounds of careful review, each catching real issues (busy-wait loop, wrong docstring, missing on //) — all were fixed. The result is a tight, well-tested PR. Happy to approve.

Add the `/dismiss_error` endpoint for the user to dismiss the content of
the `deployment_error` field, which is set when the pipeline becomes
`Stopping`/`Stopped` due to an fatal error occurring deployment. The
functionality is also integrated into the `/start` endpoint with the
parameter `?dismiss_error=false/true` (default: `true`).

Python API changes:
- `pipeline.start()` has the `ignore_deployment_error` parameter removed
  (not backward compatible)
- `pipeline.start()` has the `dismiss_error` parameter added which is by
   default `True`
- `pipeline.dismiss_error()` is added

CLI client `fda` changes:
- `fda start` has the `--no-dismiss-error` parameter added
- `fda dismiss-error` is added

The `bootstrap_policy` field is fixed to be cleared along with
`deployment_initial` when the pipeline is stopping or has stopped.

Signed-off-by: Simon Kassing <simon.kassing@feldera.com>
@snkas snkas force-pushed the dismiss-deployment-error branch from c86adee to 19ba231 Compare February 23, 2026 10:07
@snkas snkas added this pull request to the merge queue Feb 23, 2026
Merged via the queue into main with commit 13be1d4 Feb 23, 2026
1 check passed
@snkas snkas deleted the dismiss-deployment-error branch February 23, 2026 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[feldera client] Move async lifecycle wait logic from integration tests into the Feldera Python client

6 participants