Skip to content

[adapters] NATS: Fix FATAL error bug, by improving classify stream/sequence validation errors#5835

Merged
blp merged 1 commit intofeldera:mainfrom
fornybar:kfollesdal/nats-conector-state-bug
Mar 16, 2026
Merged

[adapters] NATS: Fix FATAL error bug, by improving classify stream/sequence validation errors#5835
blp merged 1 commit intofeldera:mainfrom
fornybar:kfollesdal/nats-conector-state-bug

Conversation

@kfollesdal
Copy link
Copy Markdown
Contributor

@kfollesdal kfollesdal commented Mar 16, 2026

This PR resulted from debugging problem:

When a pipeline is stopped and restarted without any configuration changes sometimes the
runtime incorrectly reports Pipeline has been modified since the last checkpoint
and flags one or more input connectors as newly added. This forces the pipeline
into awaiting_approval state, blocks automatic restart, and discards the replay
journal on approval.

The reason for the above "bug" was that connector wrongly had reported FATAL error and then died. This resulted in connector is removed form checkpoint. So then if pipeline is stopped and then started again the connector config looks like a new connector for Feldera and think it is a new connector.

Think this is the right behavior if connector quits with an aqual FATAL error?

The bug in our case was that the error was not fatal but retryable. Timeout/connection error in function fetch_stream_state got propagated to an FATAL error and NOT retryable. This PR fix this.

  • Treat JetStream state fetch failures during replay/resume bounds checks as retryable connector errors, while keeping logical out-of-range and empty-stream validation failures fatal.

Add NATS tests that cover retryable vs fatal classification for resume and replay validation paths, including server-down and out-of-bounds scenarios.

Describe Manual Test Plan

Rust test are run and pass. And have manually run pipeline-manager locally. But can not test Enterprise feature/FT manually.

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

Describe Incompatible Changes

@kfollesdal kfollesdal force-pushed the kfollesdal/nats-conector-state-bug branch from aadbcf9 to ea3e203 Compare March 16, 2026 17:40
Copy link
Copy Markdown
Member

@blp blp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

It sounds like Feldera might have a higher-level issue here to address, but this should help mitigate it.

@blp blp enabled auto-merge March 16, 2026 18:57
@blp blp added this pull request to the merge queue Mar 16, 2026
@kfollesdal
Copy link
Copy Markdown
Contributor Author

Yes @blp, this PR should help for the NATS connector by improving the classification of FATAL errors, which should reduce the chance of running into this issue.

I wasn’t sure whether the current behavior is intended or not. At the very least, the message “Pipeline has been modified since the last checkpoint” is a bit confusing in this context.

My understanding is that if a connector fails with a fatal error, it gets dropped from the checkpoint metadata. Then on the next pipeline restart it is discovered again as a new connector from the config.

Maybe it would make sense to keep in the checkpoint state that a connector has failed fatally, and use it to give better message or the option to reinitialize the connector?

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Mar 16, 2026
@blp
Copy link
Copy Markdown
Member

blp commented Mar 16, 2026

Yes @blp, this PR should help for the NATS connector by improving the classification of FATAL errors, which should reduce the chance of running into this issue.

I wasn’t sure whether the current behavior is intended or not. At the very least, the message “Pipeline has been modified since the last checkpoint” is a bit confusing in this context.

My understanding is that if a connector fails with a fatal error, it gets dropped from the checkpoint metadata. Then on the next pipeline restart it is discovered again as a new connector from the config.

Maybe it would make sense to keep in the checkpoint state that a connector has failed fatally, and use it to give better message or the option to reinitialize the connector?

Enterprise Feldera is supposed to be able to tolerate changes in the SQL, including changes to connector configuration, by rerunning only part of the pipeline (called "bootstrapping"). This only makes sense when the pipeline starts from a checkpoint (if it starts fresh, then there is nothing to rerun). Outside of the enterprise edition of Feldera, this doesn't make sense, since only enterprise Feldera supports checkpoints.

Supposing that enterprise Feldera is in use, if after resume from a checkpoint, Feldera treats a crashed input connector as a change to connector configuration and tries to adapt to a change in connector configuration, then that sounds like a bug. I don't think we've had that bug reported yet, so I'll report it if I've described it accurately.

@ryzhyk, does any of this (see the PR description as a starting point) give you thoughts?

@blp
Copy link
Copy Markdown
Member

blp commented Mar 16, 2026

@kfollesdal I think this needs a rebase because your other PR merged and is now a conflict.

Fix "bug" where timeout/connection error in function
`fetch_stream_state` got propagated to an FATAL error and not RETRYABLE,
resulting connector killed and get reiniziliaed at next pipeline
startup.

Treat JetStream state fetch failures during replay/resume bounds checks
as retryable connector errors, while keeping logical out-of-range and
empty-stream validation failures fatal.

Add focused NATS tests that cover retryable vs fatal classification for
resume and replay validation paths, including server-down and
out-of-bounds scenarios.

Signed-off-by: kfollesdal <kristoffer.follesdal@eviny.no>
@kfollesdal kfollesdal force-pushed the kfollesdal/nats-conector-state-bug branch from ea3e203 to 17644cd Compare March 16, 2026 21:27
@kfollesdal
Copy link
Copy Markdown
Contributor Author

@blp, I have now rebasee

@blp blp enabled auto-merge March 16, 2026 21:33
@blp blp added this pull request to the merge queue Mar 16, 2026
Merged via the queue into feldera:main with commit d94c34b Mar 16, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants