Skip to content

Conversation

@RodrigoDLopez
Copy link
Contributor

@RodrigoDLopez RodrigoDLopez commented Aug 24, 2020

Description

If a VM gets consecutive migrations, a null pointer exception is thrown because these VMs do not have host_id or last_host_id; ACS clears these fields when the first migration is over.
With that in mind, this PR prevents the respective null pointer. Additionally, it logs the right context and gives some information to the operator.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

To test this, I used the cloudmonkey to request migrateVirtualMachine commands

@RodrigoDLopez
Copy link
Contributor Author

@blueorangutan package

@blueorangutan
Copy link

@RodrigoDLopez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-1813

@shwstppr
Copy link
Contributor

@blueorangutan test centos7 vmware-67u3

@blueorangutan
Copy link

@shwstppr a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-2576)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 55420 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4282-t2576-vmware-67u3.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vm_root_resize.py
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_ssvm.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 78 look OK, 3 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_deploy_kubernetes_cluster Error 0.11 test_kubernetes_clusters.py
test_02_deploy_kubernetes_ha_cluster Error 0.05 test_kubernetes_clusters.py
test_04_deploy_and_upgrade_kubernetes_cluster Error 0.07 test_kubernetes_clusters.py
test_05_deploy_and_upgrade_kubernetes_ha_cluster Error 0.07 test_kubernetes_clusters.py
test_06_deploy_and_invalid_upgrade_kubernetes_cluster Error 0.06 test_kubernetes_clusters.py
test_07_deploy_and_scale_kubernetes_cluster Error 0.04 test_kubernetes_clusters.py
test_07_reboot_ssvm Failure 58.23 test_ssvm.py
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Failure 671.18 test_vpc_redundant.py
test_04_rvpc_network_garbage_collector_nics Error 3977.82 test_vpc_redundant.py
test_04_rvpc_network_garbage_collector_nics Error 3977.85 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 0.01 test_vpc_redundant.py
ContextSuite context=TestVPCRedundancy>:teardown Error 0.01 test_vpc_redundant.py

@RodrigoDLopez RodrigoDLopez deleted the prevent_null_pointer_on_consecutive_vm_migrates branch August 30, 2020 01:52
@RodrigoDLopez RodrigoDLopez restored the prevent_null_pointer_on_consecutive_vm_migrates branch August 30, 2020 14:42
@RodrigoDLopez RodrigoDLopez reopened this Aug 30, 2020
@rohityadavcloud
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✖centos7 ✖centos8 ✖debian. JID-1905

@rohityadavcloud
Copy link
Member

@blueorangutan package

@blueorangutan
Copy link

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos7 ✖centos8 ✖debian. JID-1945

@DaanHoogland
Copy link
Contributor

@blueorangutan package

@blueorangutan
Copy link

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2045

@DaanHoogland
Copy link
Contributor

@blueorangutan test matrix

Copy link
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good, starting tests on assorted platforms as I'm not sure this matters

@blueorangutan
Copy link

@DaanHoogland a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware67, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-2818)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 56851 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4282-t2818-vmware-67u3.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Smoke tests completed. 82 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@ravening
Copy link
Member

@RodrigoDLopez I tried migrating same vm like 10 times between 5 hypervisors and never got any NPE. Also the lost_host_id field was not cleared.

@RodrigoDLopez
Copy link
Contributor Author

Hi, @ravening can you report the methodology you used?

  • ACS version
  • How are you executing the tests? Are you migrating VMs running? Stopped VMs? Stopped VMs, and then starting them, and just then migrating again?

If you take a look at the code, a null pointer will for sure happen. Just look at the lines I am changing.

Before this PR, the first migration will have host_id or last_host_id, but during the migrate, ACS will set the field last_host_id to null using the afterStorageMigrationCleanup method, and on the next executions of the migrate command (without starting the VM), this field will be null and then on this line, a NPE will be launched

final HostVO srcHost = _hostDao.findById(srchostId);
final Long srcClusterId = srcHost.getClusterId();

Furthermore, some of these parameters were never used for some hypervisors, seems like VMWare needs it, that is why I moved it where they will be used, and did the necessary validations providing information about the process with an appropriate log message

@ravening
Copy link
Member

ravening commented Sep 23, 2020

Hi, @ravening can you report the methodology you used?

  • ACS version

  • How are you executing the tests? Are you migrating VMs running? Stopped VMs? Stopped VMs, and then starting them, and just then migrating again?

If you take a look at the code, a null pointer will for sure happen. Just look at the lines I am changing.

Before this PR, the first migration will have host_id or last_host_id, but during the migrate, ACS will set the field last_host_id to null using the afterStorageMigrationCleanup method, and on the next executions of the migrate command (without starting the VM), this field will be null and then on this line, a NPE will be launched

final HostVO srcHost = _hostDao.findById(srchostId);
final Long srcClusterId = srcHost.getClusterId();

Furthermore, some of these parameters were never used for some hypervisors, seems like VMWare needs it, that is why I moved it where they will be used, and did the necessary validations providing information about the process with an appropriate log message

@RodrigoDLopez im using ACS 4.14

I doing regular VM migration of running VM between multiple hosts. I didn't do migration with storage. Also I'm doing migration on KVM host and not VMware

@RodrigoDLopez
Copy link
Contributor Author

Hi @ravening, that was my fault.
You are doing a live migration, this NPE will occur when you:

  • stop an instance;
  • migrate it with volume
  • do not start the instance
  • try to migrate again to another pool

with these steps, after the first migrate host_id and last_host_id will be set to null. And then, when you try to migrate it again a NPE will be thrown.

@ravening
Copy link
Member

ravening commented Oct 1, 2020

Hi @ravening, that was my fault.

You are doing a live migration, this NPE will occur when you:

  • stop an instance;

  • migrate it with volume

  • do not start the instance

  • try to migrate again to another pool

with these steps, after the first migrate host_id and last_host_id will be set to null. And then, when you try to migrate it again a NPE will be thrown.

@RodrigoDLopez ah ok... So you are just doing volume migration of VM...I will try these steps and let you know the result

@RodrigoDLopez
Copy link
Contributor Author

@rhtyd, @Ravenin, @DaanHoogland

Hello guys, happy new year.

Is there something missing here? Everything seems to be ok with the patch. It is a good bug fix, it has no errors, and tests are passing

@RodrigoDLopez
Copy link
Contributor Author

@ravening @rhtyd @harikrishna-patnala @nvazquez

Hello everyone, this PR fixes a bug and is a simple change.
I believe that it is in the interest of the whole community that this bug found to be solved.
would it be possible to receive some reviews or tests in this PR?

@ravening
Copy link
Member

@ravening @rhtyd @harikrishna-patnala @nvazquez

Hello everyone, this PR fixes a bug and is a simple change.

I believe that it is in the interest of the whole community that this bug found to be solved.

would it be possible to receive some reviews or tests in this PR?

Will review and test it this week

Copy link
Contributor

@nvazquez nvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I've left a few comments

@DaanHoogland
Copy link
Contributor

@RodrigoDLopez do you want this on latest or on one of the release branches (4.14 or 4.15)?

@RodrigoDLopez
Copy link
Contributor Author

Looks good, I've left a few comments

I will make this changes. Thanks for the hint

@RodrigoDLopez
Copy link
Contributor Author

@RodrigoDLopez do you want this on latest or on one of the release branches (4.14 or 4.15)?

For me, it's fine to be merged just on the master, since it's a nonreported bug. But if you wanna I can do the necessary changes to merge this fix on 4.14 as well

@nvazquez @ravening @rhtyd @harikrishna-patnala
any update related to this PR? something that maybe I need to change?

@rohityadavcloud
Copy link
Member

@RodrigoDLopez can you fix the conflict?

@GabrielBrascher
Copy link
Member

I just reproduced this very issue, this PR would be nice to have (4.15.2+, I think hat for 4.15.1 it is not feasible).
@RodrigoDLopez can you please fix the conflicts?

@rohityadavcloud
Copy link
Member

I'm okay to get it in 4.15.1 if it's fixing the NPE. cc @shwstppr @Pearl1594

@shwstppr
Copy link
Contributor

shwstppr commented Jun 8, 2021

@GabrielBrascher were you able to reproduce this with 4.15.1 RC?
From the code changes, I feel this affects only VMware. Is that correct @RodrigoDLopez ?
Last time I was not able to reproduce this after #4385. Can check again

@GabrielBrascher
Copy link
Member

@rhtyd we got it on a 4.15.0.0, but it seems that the codebase at 4.15.1.0 RC1 did not change at that point.
@shwstppr I think that it can happen with any hypervisor as long as one migrate the VM, I saw this issue with KVM.

Steps to reproduce:

Case: migrate VM (e.g. for maintenance), keep VM stopped, try to migrate back the VM -> NullPointer

1. Migrate VM from host A to host B
2. VM stills stopped at host B
3. Try to migrate VM back to host A
4. Null pointer

Stack trace:

java.lang.NullPointerException
        at com.cloud.vm.VirtualMachineManagerImpl.migrateThroughHypervisorOrStorage(VirtualMachineManagerImpl.java:2302)
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStorageMigration(VirtualMachineManagerImpl.java:2194)
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStorageMigration(VirtualMachineManagerImpl.java:5625)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)

When checking the stack trace + DB it is possible to detect that the issue is indeed caused by the fact of last last_host_id being null, as reported by @RodrigoDLopez.

@shwstppr
Copy link
Contributor

shwstppr commented Jun 8, 2021

@GabrielBrascher I've not tested yet and will try it tomorrow but I feel there has been a change in code that would prevent NPE that you shared,
4.15.0.0 (failing to get clusterId for the source host as srcHost is null): https://github.com/apache/cloudstack/blob/4.15.0.0/engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java#L2302
4.15 branch (logic for finding source clusterId has been changed. It is now found from host_id or last_host_id or storage pool of the VMs volumes): https://github.com/apache/cloudstack/blob/4.15/engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java#L2318

@DaanHoogland
Copy link
Contributor

@RodrigoDLopez can you address the conflict?
@rhtyd @nvazquez are your comments addressed to satisfaction?

@GabrielBrascher
Copy link
Member

@shwstppr thanks for bringing those changes, I might have tested with 4.15.0.0 then instead of RC1. I will take another look at it.

@RodrigoDLopez RodrigoDLopez force-pushed the prevent_null_pointer_on_consecutive_vm_migrates branch from fdf5f9b to f22c5a9 Compare June 11, 2021 13:01
@GabrielBrascher
Copy link
Member

@rhtyd @RodrigoDLopez @shwstppr @DaanHoogland I have been running some tests on RC2 and I was not able to reproduce this issue with 4.15.1.0 RC2.

This one looks to be fixed.

@rohityadavcloud
Copy link
Member

Thanks for testing @GabrielBrascher, @RodrigoDLopez can you test 4.15.1.0 RC2?

@RodrigoDLopez
Copy link
Contributor Author

not needed anymore...

@RodrigoDLopez RodrigoDLopez deleted the prevent_null_pointer_on_consecutive_vm_migrates branch June 23, 2021 14:03
@rafaelweingartner
Copy link
Member

@RodrigoDLopez why is this one not needed anymore? Was it addressed somewhere else?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.