Skip to content

Conversation

@JoaoJandre
Copy link
Contributor

Description

Using XenServer as the hypervisor, when deleting a snapshot that has a parent, that parent will also get erased on storage, causing data loss. This behavior was introduced with #7873, where the list of snapshot states that can be deleted was changed to add BackedUp snapshots.

This PR changes the states list back to the original list, and swaps the while loop for a do while loop to account for the changes in #7873.

Fixes #9446

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

  1. Created a VM
  2. Took a volume snapshot (S1).
  3. Accessed the VM and wrote some data with: dd if=/dev/zero of=test bs=1M count=1000
  4. Used sync to make sure data is written to disk
  5. Took a second volume snapshot (S2).
  6. Deleted the last volume snapshot (S2).

Before the change, S1 would be mistakenly deleted from storage as well, and an inconsistent snapshot record would remain.
After the change, S1 is untouched and works as intended.

@JoaoJandre JoaoJandre requested a review from shwstppr July 25, 2024 17:19
@JoaoJandre JoaoJandre changed the title fix-parent-being-deleted Fix snapshot chain being deleted on XenServer Jul 25, 2024
@JoaoJandre JoaoJandre linked an issue Jul 25, 2024 that may be closed by this pull request
@codecov
Copy link

codecov bot commented Jul 25, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 15.08%. Comparing base (49cd5ba) to head (62759e6).

Files Patch % Lines
...tack/storage/snapshot/DefaultSnapshotStrategy.java 33.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##               4.19    #9447     +/-   ##
===========================================
  Coverage     15.07%   15.08%             
+ Complexity    11170    11168      -2     
===========================================
  Files          5406     5406             
  Lines        472699   472699             
  Branches      61521    58998   -2523     
===========================================
+ Hits          71281    71287      +6     
+ Misses       393489   393484      -5     
+ Partials       7929     7928      -1     
Flag Coverage Δ
uitests 4.30% <ø> (ø)
unittests 15.79% <33.33%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@shwstppr
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 10476

@rohityadavcloud rohityadavcloud added this to the 4.19.2.0 milestone Jul 26, 2024
@rohityadavcloud
Copy link
Member

@blueorangutan test alma8 xcpng82

@blueorangutan
Copy link

@rohityadavcloud a [SL] Trillian-Jenkins test job (alma8 mgmt + xcpng82) has been kicked to run smoke tests

Copy link
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM. Will test with a multi-zone env

Copy link
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with a 2 zone (z1, z2) env with one XCP-ng82 host in each.

  1. Deployed VM
  2. Created ROOT volume snapshot(snap1) in both zones
  3. Made changes inside VM
  4. Again created ROOT volume snapshot(snap2) in both zones
  5. Deleted snap2 from z2. Successful. snap2 was intact in zone z1. snap1 was intact in both z1 and z2
  6. Deleted snap2 without passing zoneid. Successful. snap1 was intact in both z1 and z2

@blueorangutan
Copy link

[SF] Trillian test result (tid-10968)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server a8
Total time taken: 65809 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9447-t10968-xcpng82.zip
Smoke tests completed. 129 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_condensed_drs_algorithm Failure 169.43 test_cluster_drs.py
test_02_balanced_drs_algorithm Failure 176.50 test_cluster_drs.py
test_01_non_strict_host_anti_affinity Error 203.81 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 123.63 test_nonstrict_affinity_group.py
test_01_vpc_site2site_vpn_multiple_options Failure 758.47 test_vpc_vpn.py

@rohityadavcloud rohityadavcloud modified the milestones: 4.19.2.0, 4.19.1.1 Jul 29, 2024
@JoaoJandre JoaoJandre mentioned this pull request Jul 29, 2024
13 tasks
@shwstppr
Copy link
Contributor

@blueorangutan test matrix

@blueorangutan
Copy link

@shwstppr a [SL] Trillian-Jenkins matrix job (EL8 mgmt + EL8 KVM, Ubuntu22 mgmt + Ubuntu22 KVM, EL8 mgmt + VMware 7.0u3, EL9 mgmt + XCP-ng 8.2 ) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian Build Failed (tid-10988)

@blueorangutan
Copy link

[SF] Trillian Build Failed (tid-10989)

@blueorangutan
Copy link

[SF] Trillian test result (tid-10987)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 44278 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9447-t10987-kvm-ol8.zip
Smoke tests completed. 132 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link

[SF] Trillian test result (tid-10990)
Environment: xcpng82 (x2), Advanced Networking with Mgmt server a9
Total time taken: 63919 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9447-t10990-xcpng82.zip
Smoke tests completed. 129 look OK, 3 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
ContextSuite context=TestClusterDRS>:setup Error 0.00 test_cluster_drs.py
test_list_system_vms_metrics_history Failure 0.25 test_metrics_api.py
test_01_non_strict_host_anti_affinity Error 201.78 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 117.49 test_nonstrict_affinity_group.py

@borisstoyanov
Copy link
Contributor

@blueorangutan help

@blueorangutan
Copy link

@borisstoyanov [SL] I understand these words: "help", "hello", "thanks", "package", "test"
Test command usage: test [mgmt os] [hypervisor] [keepEnv] [qemuEv] [basicZone|securityGroups]
Mgmt OS options: ['ol8', 'ol9', 'alma8', 'alma9', 'suse15', 'centos7', 'centos6', 'rocky8', 'ubuntu18', 'ubuntu22', 'ubuntu20', 'ubuntu24']
Hypervisor options: ['kvm-centos6', 'kvm-centos7', 'kvm-rocky8', 'kvm-ol8', 'kvm-ol9', 'kvm-alma8', 'kvm-alma9', 'kvm-ubuntu18', 'kvm-ubuntu20', 'kvm-ubuntu22', 'kvm-ubuntu24', 'kvm-suse15', 'vmware-55u3', 'vmware-60u2', 'vmware-65u2', 'vmware-67u3', 'vmware-70u1', 'vmware-70u2', 'vmware-70u3', 'vmware-80', 'vmware-80u1', 'xenserver-65sp1', 'xenserver-71', 'xenserver-74', 'xcpng74', 'xcpng76', 'xcpng80', 'xcpng81', 'xcpng82']
Note: when keepEnv is passed, you need to specify mgmt server os and hypervisor or use the matrix command.
when qemuEv is passed, it will deploy KVM hyperviosr hosts with qemu-kvm-ev, else it will default to stock qemu.
When basicZone and/or securityGroups are passed it will create a zone of the last type specified (default is Advanced)
Package command usage: package [all(default value),kvm,xen,vmware,hyperv,ovm] - a comma separated list can be passed with package command to bundle the required hypervisor's systemVM templates. Not passing any argument will bundle all - kvm,xen and vmware templates.

Blessed contributors for kicking Trillian test jobs: ['rohityadavcloud', 'shwstppr', 'vishesh92', 'Pearl1594', 'harikrishna-patnala', 'nvazquez', 'DaanHoogland', 'weizhouapache', 'borisstoyanov', 'vladimirpetrov', 'kiranchavala', 'andrijapanicsb', 'NuxRo', 'rajujith', 'alexandremattioli', 'sureshanaparti', 'abh1sar']

@borisstoyanov
Copy link
Contributor

@blueorangutan test ol8 vmware-80u1

@blueorangutan
Copy link

@borisstoyanov a [SL] Trillian-Jenkins test job (ol8 mgmt + vmware-80u1) has been kicked to run smoke tests

@blueorangutan
Copy link

[SF] Trillian Build Failed (tid-11003)

Copy link
Contributor

@borisstoyanov borisstoyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, manually check the PR following the steps

@JoaoJandre
Copy link
Contributor Author

Are your concerns met @harikrishna-patnala ? if so, can we move on with this PR @shwstppr ?

Copy link
Contributor

@harikrishna-patnala harikrishna-patnala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@harikrishna-patnala
Copy link
Contributor

Are your concerns met @harikrishna-patnala ? if so, can we move on with this PR @shwstppr ?

Yes, thanks @JoaoJandre

@shwstppr
Copy link
Contributor

shwstppr commented Aug 1, 2024

Merging based on review and tests

@shwstppr shwstppr merged commit 9033ab7 into apache:4.19 Aug 1, 2024
@blueorangutan
Copy link

[SF] Trillian test result (tid-11002)
Environment: vmware-80u1 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 153663 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr9447-t11002-vmware-80u1.zip
Smoke tests completed. 126 look OK, 6 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_invalid_upgrade_kubernetes_cluster Failure 3658.25 test_kubernetes_clusters.py
test_02_upgrade_kubernetes_cluster Failure 3660.70 test_kubernetes_clusters.py
test_03_deploy_and_scale_kubernetes_cluster Failure 3676.16 test_kubernetes_clusters.py
test_04_autoscale_kubernetes_cluster Failure 3656.85 test_kubernetes_clusters.py
test_05_basic_lifecycle_kubernetes_cluster Failure 3655.30 test_kubernetes_clusters.py
test_06_delete_kubernetes_cluster Failure 3655.01 test_kubernetes_clusters.py
test_08_upgrade_kubernetes_ha_cluster Failure 3691.69 test_kubernetes_clusters.py
test_10_vpc_tier_kubernetes_cluster Failure 3710.79 test_kubernetes_clusters.py
test_list_vms_metrics_admin Error 3613.98 test_metrics_api.py
test_list_vms_metrics_history Error 3621.00 test_metrics_api.py
test_list_volumes_metrics_history Error 3625.72 test_metrics_api.py
test_01_restore_vm Error 3620.07 test_restore_vm.py
test_02_restore_vm_allocated_root Error 8.89 test_restore_vm.py
test_01_isolate_network_FW_PF_default_routes_egress_true Failure 172.91 test_routers_network_ops.py
test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true Failure 281.52 test_routers_network_ops.py
test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false Failure 236.34 test_routers_network_ops.py
test_01_snapshot_to_volume Error 9.40 test_snapshots.py
test_01_deploy_vm_on_specific_host Error 19.79 test_vm_deployment_planner.py
test_02_deploy_vm_on_specific_cluster Error 3602.31 test_vm_deployment_planner.py
test_03_deploy_vm_on_specific_pod Error 1.39 test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster Error 2.42 test_vm_deployment_planner.py
test_05_deploy_vm_on_cluster_override_pod Error 1.32 test_vm_deployment_planner.py

nvazquez added a commit that referenced this pull request Aug 6, 2024
* server, api, ui: access improvements and assorted fixes

Fixes domain-admin access check to prevent unauthorized access.

Co-authored-by: Fabricio Duarte <fabricio.duarte.jr@gmail.com>
Co-authored-by: nvazquez <nicovazquez90@gmail.com>
Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* Revert "server: refactor listNetworks api database retrievals (#9184)"

This reverts commit c7f1ba5.

* Fix snapshot chain being deleted on XenServer (#9447)

Using XenServer as the hypervisor, when deleting a snapshot that has a parent, that parent will also get erased on storage, causing data loss. This behavior was introduced with #7873, where the list of snapshot states that can be deleted was changed to add BackedUp snapshots.

This PR changes the states list back to the original list, and swaps the while loop for a do while loop to account for the changes in #7873.

Fixes #9446

* UI: Display Firewall, LB and Port Forwading rules tab for CKS clusters deployed on isolated networks (#9458)

---------

Co-authored-by: nvazquez <nicovazquez90@gmail.com>
Co-authored-by: Fabricio Duarte <fabricio.duarte.jr@gmail.com>
Co-authored-by: João Jandre <48719461+JoaoJandre@users.noreply.github.com>
Co-authored-by: Pearl Dsilva <pearl1594@gmail.com>
nvazquez added a commit that referenced this pull request Aug 6, 2024
* server, api, ui: access improvements and assorted fixes

Fixes domain-admin access check to prevent unauthorized access.

Co-authored-by: Fabricio Duarte <fabricio.duarte.jr@gmail.com>
Co-authored-by: nvazquez <nicovazquez90@gmail.com>
Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* Revert "server: refactor listNetworks api database retrievals (#9184)"

This reverts commit c7f1ba5.

* Fix snapshot chain being deleted on XenServer (#9447)

Using XenServer as the hypervisor, when deleting a snapshot that has a parent, that parent will also get erased on storage, causing data loss. This behavior was introduced with #7873, where the list of snapshot states that can be deleted was changed to add BackedUp snapshots.

This PR changes the states list back to the original list, and swaps the while loop for a do while loop to account for the changes in #7873.

Fixes #9446

* UI: Display Firewall, LB and Port Forwading rules tab for CKS clusters deployed on isolated networks (#9458)

---------

Co-authored-by: nvazquez <nicovazquez90@gmail.com>
Co-authored-by: Fabricio Duarte <fabricio.duarte.jr@gmail.com>
Co-authored-by: João Jandre <48719461+JoaoJandre@users.noreply.github.com>
Co-authored-by: Pearl Dsilva <pearl1594@gmail.com>
@rohityadavcloud rohityadavcloud modified the milestones: 4.19.1.1, 4.19.2.0 Aug 8, 2024
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Aug 22, 2024
Using XenServer as the hypervisor, when deleting a snapshot that has a parent, that parent will also get erased on storage, causing data loss. This behavior was introduced with apache#7873, where the list of snapshot states that can be deleted was changed to add BackedUp snapshots.

This PR changes the states list back to the original list, and swaps the while loop for a do while loop to account for the changes in apache#7873.

Fixes apache#9446
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request Aug 22, 2024
* server, api, ui: access improvements and assorted fixes

Fixes domain-admin access check to prevent unauthorized access.

Co-authored-by: Fabricio Duarte <fabricio.duarte.jr@gmail.com>
Co-authored-by: nvazquez <nicovazquez90@gmail.com>
Co-authored-by: Abhishek Kumar <abhishek.mrt22@gmail.com>

* Revert "server: refactor listNetworks api database retrievals (apache#9184)"

This reverts commit c7f1ba5.

* Fix snapshot chain being deleted on XenServer (apache#9447)

Using XenServer as the hypervisor, when deleting a snapshot that has a parent, that parent will also get erased on storage, causing data loss. This behavior was introduced with apache#7873, where the list of snapshot states that can be deleted was changed to add BackedUp snapshots.

This PR changes the states list back to the original list, and swaps the while loop for a do while loop to account for the changes in apache#7873.

Fixes apache#9446

* UI: Display Firewall, LB and Port Forwading rules tab for CKS clusters deployed on isolated networks (apache#9458)

---------

Co-authored-by: nvazquez <nicovazquez90@gmail.com>
Co-authored-by: Fabricio Duarte <fabricio.duarte.jr@gmail.com>
Co-authored-by: João Jandre <48719461+JoaoJandre@users.noreply.github.com>
Co-authored-by: Pearl Dsilva <pearl1594@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Whole snapshot chain is lost when the latest snapshot is deleted

6 participants