[adapters] NATS: Fix FATAL error bug, by improving classify stream/sequence validation errors#5835
Conversation
aadbcf9 to
ea3e203
Compare
blp
left a comment
There was a problem hiding this comment.
Thank you!
It sounds like Feldera might have a higher-level issue here to address, but this should help mitigate it.
|
Yes @blp, this PR should help for the NATS connector by improving the classification of FATAL errors, which should reduce the chance of running into this issue. I wasn’t sure whether the current behavior is intended or not. At the very least, the message “Pipeline has been modified since the last checkpoint” is a bit confusing in this context. My understanding is that if a connector fails with a fatal error, it gets dropped from the checkpoint metadata. Then on the next pipeline restart it is discovered again as a new connector from the config. Maybe it would make sense to keep in the checkpoint state that a connector has failed fatally, and use it to give better message or the option to reinitialize the connector? |
Enterprise Feldera is supposed to be able to tolerate changes in the SQL, including changes to connector configuration, by rerunning only part of the pipeline (called "bootstrapping"). This only makes sense when the pipeline starts from a checkpoint (if it starts fresh, then there is nothing to rerun). Outside of the enterprise edition of Feldera, this doesn't make sense, since only enterprise Feldera supports checkpoints. Supposing that enterprise Feldera is in use, if after resume from a checkpoint, Feldera treats a crashed input connector as a change to connector configuration and tries to adapt to a change in connector configuration, then that sounds like a bug. I don't think we've had that bug reported yet, so I'll report it if I've described it accurately. @ryzhyk, does any of this (see the PR description as a starting point) give you thoughts? |
|
@kfollesdal I think this needs a rebase because your other PR merged and is now a conflict. |
Fix "bug" where timeout/connection error in function `fetch_stream_state` got propagated to an FATAL error and not RETRYABLE, resulting connector killed and get reiniziliaed at next pipeline startup. Treat JetStream state fetch failures during replay/resume bounds checks as retryable connector errors, while keeping logical out-of-range and empty-stream validation failures fatal. Add focused NATS tests that cover retryable vs fatal classification for resume and replay validation paths, including server-down and out-of-bounds scenarios. Signed-off-by: kfollesdal <kristoffer.follesdal@eviny.no>
ea3e203 to
17644cd
Compare
|
@blp, I have now rebasee |
This PR resulted from debugging problem:
The reason for the above "bug" was that connector wrongly had reported FATAL error and then died. This resulted in connector is removed form checkpoint. So then if pipeline is stopped and then started again the connector config looks like a new connector for Feldera and think it is a new connector.
Think this is the right behavior if connector quits with an aqual FATAL error?
The bug in our case was that the error was not fatal but retryable. Timeout/connection error in function
fetch_stream_stategot propagated to an FATAL error and NOT retryable. This PR fix this.Add NATS tests that cover retryable vs fatal classification for resume and replay validation paths, including server-down and out-of-bounds scenarios.
Describe Manual Test Plan
Rust test are run and pass. And have manually run pipeline-manager locally. But can not test Enterprise feature/FT manually.
Checklist
Breaking Changes?
Mark if you think the answer is yes for any of these components:
Describe Incompatible Changes