[adapters] NATS: Fix FATAL error bug, by improving classify stream/sequence validation errors by kfollesdal · Pull Request #5835 · feldera/feldera

kfollesdal · 2026-03-16T13:52:26Z

This PR resulted from debugging problem:

When a pipeline is stopped and restarted without any configuration changes sometimes the
runtime incorrectly reports Pipeline has been modified since the last checkpoint
and flags one or more input connectors as newly added. This forces the pipeline
into awaiting_approval state, blocks automatic restart, and discards the replay
journal on approval.

The reason for the above "bug" was that connector wrongly had reported FATAL error and then died. This resulted in connector is removed form checkpoint. So then if pipeline is stopped and then started again the connector config looks like a new connector for Feldera and think it is a new connector.

Think this is the right behavior if connector quits with an aqual FATAL error?

The bug in our case was that the error was not fatal but retryable. Timeout/connection error in function fetch_stream_state got propagated to an FATAL error and NOT retryable. This PR fix this.

Treat JetStream state fetch failures during replay/resume bounds checks as retryable connector errors, while keeping logical out-of-range and empty-stream validation failures fatal.

Add NATS tests that cover retryable vs fatal classification for resume and replay validation paths, including server-down and out-of-bounds scenarios.

Describe Manual Test Plan

Rust test are run and pass. And have manually run pipeline-manager locally. But can not test Enterprise feature/FT manually.

Checklist

Unit tests added/updated
Integration tests added/updated
Documentation updated
Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

OpenAPI / REST HTTP API / feldera-types / manager (What is a breaking change?)
Feldera SQL (Syntax, Semantics)
feldera-sqllib (incl. dependencies fxp, etc.) (What is a breaking change?)
Python SDK (What is a breaking change?)
fda (CLI arguments)
Adapters (including configuration)
Storage Format / Checkpoints
Others (specify)

Describe Incompatible Changes

crates/adapters/src/transport/nats/input/test/custom_tests.rs

blp

Thank you!

It sounds like Feldera might have a higher-level issue here to address, but this should help mitigate it.

kfollesdal · 2026-03-16T19:11:47Z

Yes @blp, this PR should help for the NATS connector by improving the classification of FATAL errors, which should reduce the chance of running into this issue.

I wasn’t sure whether the current behavior is intended or not. At the very least, the message “Pipeline has been modified since the last checkpoint” is a bit confusing in this context.

My understanding is that if a connector fails with a fatal error, it gets dropped from the checkpoint metadata. Then on the next pipeline restart it is discovered again as a new connector from the config.

Maybe it would make sense to keep in the checkpoint state that a connector has failed fatally, and use it to give better message or the option to reinitialize the connector?

blp · 2026-03-16T20:33:50Z

Yes @blp, this PR should help for the NATS connector by improving the classification of FATAL errors, which should reduce the chance of running into this issue.

I wasn’t sure whether the current behavior is intended or not. At the very least, the message “Pipeline has been modified since the last checkpoint” is a bit confusing in this context.

My understanding is that if a connector fails with a fatal error, it gets dropped from the checkpoint metadata. Then on the next pipeline restart it is discovered again as a new connector from the config.

Maybe it would make sense to keep in the checkpoint state that a connector has failed fatally, and use it to give better message or the option to reinitialize the connector?

Enterprise Feldera is supposed to be able to tolerate changes in the SQL, including changes to connector configuration, by rerunning only part of the pipeline (called "bootstrapping"). This only makes sense when the pipeline starts from a checkpoint (if it starts fresh, then there is nothing to rerun). Outside of the enterprise edition of Feldera, this doesn't make sense, since only enterprise Feldera supports checkpoints.

Supposing that enterprise Feldera is in use, if after resume from a checkpoint, Feldera treats a crashed input connector as a change to connector configuration and tries to adapt to a change in connector configuration, then that sounds like a bug. I don't think we've had that bug reported yet, so I'll report it if I've described it accurately.

@ryzhyk, does any of this (see the PR description as a starting point) give you thoughts?

blp · 2026-03-16T20:34:27Z

@kfollesdal I think this needs a rebase because your other PR merged and is now a conflict.

Fix "bug" where timeout/connection error in function `fetch_stream_state` got propagated to an FATAL error and not RETRYABLE, resulting connector killed and get reiniziliaed at next pipeline startup. Treat JetStream state fetch failures during replay/resume bounds checks as retryable connector errors, while keeping logical out-of-range and empty-stream validation failures fatal. Add focused NATS tests that cover retryable vs fatal classification for resume and replay validation paths, including server-down and out-of-bounds scenarios. Signed-off-by: kfollesdal <kristoffer.follesdal@eviny.no>

kfollesdal · 2026-03-16T21:29:00Z

@blp, I have now rebasee

mythical-fred approved these changes Mar 16, 2026

View reviewed changes

crates/adapters/src/transport/nats/input/test/custom_tests.rs Show resolved Hide resolved

kfollesdal force-pushed the kfollesdal/nats-conector-state-bug branch from aadbcf9 to ea3e203 Compare March 16, 2026 17:40

blp approved these changes Mar 16, 2026

View reviewed changes

blp enabled auto-merge March 16, 2026 18:57

blp added this pull request to the merge queue Mar 16, 2026

github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Mar 16, 2026

kfollesdal force-pushed the kfollesdal/nats-conector-state-bug branch from ea3e203 to 17644cd Compare March 16, 2026 21:27

blp enabled auto-merge March 16, 2026 21:33

blp added this pull request to the merge queue Mar 16, 2026

Merged via the queue into feldera:main with commit d94c34b Mar 16, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[adapters] NATS: Fix FATAL error bug, by improving classify stream/sequence validation errors#5835

[adapters] NATS: Fix FATAL error bug, by improving classify stream/sequence validation errors#5835
blp merged 1 commit intofeldera:mainfrom
fornybar:kfollesdal/nats-conector-state-bug

kfollesdal commented Mar 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

blp left a comment

Uh oh!

kfollesdal commented Mar 16, 2026

Uh oh!

blp commented Mar 16, 2026

Uh oh!

blp commented Mar 16, 2026

Uh oh!

kfollesdal commented Mar 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kfollesdal commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe Manual Test Plan

Checklist

Breaking Changes?

Describe Incompatible Changes

Uh oh!

Uh oh!

blp left a comment

Choose a reason for hiding this comment

Uh oh!

kfollesdal commented Mar 16, 2026

Uh oh!

blp commented Mar 16, 2026

Uh oh!

blp commented Mar 16, 2026

Uh oh!

kfollesdal commented Mar 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kfollesdal commented Mar 16, 2026 •

edited

Loading