Conversation
When starting a pipeline from S3 checkpoint, if a local checkpoint is available, we prefer it if it has made more progress than the remote checkpoint. Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
mihaibudiu
left a comment
There was a problem hiding this comment.
Which line checks that the local one is preferred?
I think a comment explaining this would be nice.
Updates the description to mention that we will prefer the checkpoint that has made more progress between the latest remote and the latest local checkpoint. Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
| if chk_uuid is not None: | ||
| assert UUID(uuid) >= UUID(chk_uuid) | ||
|
|
||
| if not clear_storage: |
There was a problem hiding this comment.
Do we have tests covering a few cases here? E.g., local but no remote, local only, remote only, local ahead of remote, local == remote, and local behind remote?
There was a problem hiding this comment.
We test:
- remote only
- local only isn't covered as it is for the purposes of sync, the same as local + remote
We do not test for local checkpoint that has made less progress than remote. We will add new tests with the new sync implementation discussed as part of the rfc.
mythical-fred
left a comment
There was a problem hiding this comment.
The test addition looks correct and the doc update is accurate. One hard block.
Missing manual testing description.
The PR body describes the behaviour change ("if a local checkpoint is available, we prefer it if it has made more progress than the remote checkpoint") but says nothing about how the change was manually verified. Even a minimal sentence — e.g. "ran python -m pytest python/tests/platform/test_checkpoint_sync.py against a live environment and the new assertion passed" — satisfies the requirement.
Commit messages ("tests: sync: adapt test for issue 5609", "docs: sync: update start_from_checkpoint description") are clear and appropriate. The test logic itself is straightforward: insert a local-only record before stopping, then restart from the S3 checkpoint and verify that the local-only record is still present in the output. The assertCountEqual at the end correctly validates ordering-independent equality.
mythical-fred
left a comment
There was a problem hiding this comment.
Upgrading my vote. Since the PR has integration test coverage, the missing manual test description is a soft block — I was too strict. lalithsuresh's open question about the "local behind remote" case still stands.
When starting a pipeline from S3 checkpoint, if a local checkpoint is available, we prefer it if it has made more progress than the remote checkpoint.