Prevents null pointers when doing consecutive VM migrates #4282

RodrigoDLopez · 2020-08-24T14:08:29Z

Description

If a VM is migrated several times in a row, a null pointer exception is thrown because the VM no longer has a host_id or last_host_id; ACS clears these fields when the first migration is over.
With that in mind, this PR prevents that null pointer exception. Additionally, it logs the right context and gives the operator some information about what happened.

Types of changes

Breaking change (fix or feature that would cause existing functionality to change)
New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Enhancement (improves an existing feature and functionality)
Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

To test this, I used CloudMonkey to issue migrateVirtualMachine commands.

RodrigoDLopez · 2020-08-24T15:11:06Z

@blueorangutan package

blueorangutan · 2020-08-24T15:12:33Z

@RodrigoDLopez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2020-08-24T16:13:31Z

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-1813

shwstppr · 2020-08-27T06:42:09Z

@blueorangutan test centos7 vmware-67u3

blueorangutan · 2020-08-27T06:42:49Z

@shwstppr a Trillian-Jenkins test job (centos7 mgmt + vmware-67u3) has been kicked to run smoke tests

blueorangutan · 2020-08-27T22:40:12Z

Trillian test result (tid-2576)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 55420 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4282-t2576-vmware-67u3.zip
Intermittent failure detected: /marvin/tests/smoke/test_deploy_vm_root_resize.py
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_ssvm.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 78 look OK, 3 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File
test_01_deploy_kubernetes_cluster	`Error`	0.11	test_kubernetes_clusters.py
test_02_deploy_kubernetes_ha_cluster	`Error`	0.05	test_kubernetes_clusters.py
test_04_deploy_and_upgrade_kubernetes_cluster	`Error`	0.07	test_kubernetes_clusters.py
test_05_deploy_and_upgrade_kubernetes_ha_cluster	`Error`	0.07	test_kubernetes_clusters.py
test_06_deploy_and_invalid_upgrade_kubernetes_cluster	`Error`	0.06	test_kubernetes_clusters.py
test_07_deploy_and_scale_kubernetes_cluster	`Error`	0.04	test_kubernetes_clusters.py
test_07_reboot_ssvm	`Failure`	58.23	test_ssvm.py
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL	`Failure`	671.18	test_vpc_redundant.py
test_04_rvpc_network_garbage_collector_nics	`Error`	3977.82	test_vpc_redundant.py
test_04_rvpc_network_garbage_collector_nics	`Error`	3977.85	test_vpc_redundant.py
test_05_rvpc_multi_tiers	`Error`	0.01	test_vpc_redundant.py
ContextSuite context=TestVPCRedundancy>:teardown	`Error`	0.01	test_vpc_redundant.py

rohityadavcloud · 2020-09-05T19:12:20Z

@blueorangutan package

blueorangutan · 2020-09-05T19:13:16Z

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2020-09-05T19:55:29Z

Packaging result: ✖centos7 ✖centos8 ✖debian. JID-1905

rohityadavcloud · 2020-09-09T23:10:16Z

@blueorangutan package

blueorangutan · 2020-09-09T23:11:51Z

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2020-09-10T00:14:02Z

Packaging result: ✔centos7 ✖centos8 ✖debian. JID-1945

DaanHoogland · 2020-09-18T15:06:45Z

@blueorangutan package

blueorangutan · 2020-09-18T15:07:31Z

@DaanHoogland a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

blueorangutan · 2020-09-18T15:58:41Z

Packaging result: ✔centos7 ✔centos8 ✔debian. JID-2045

DaanHoogland · 2020-09-21T10:07:26Z

@blueorangutan test matrix

DaanHoogland

changes look good, starting tests on assorted platforms as I'm not sure this matters

blueorangutan · 2020-09-21T10:08:06Z

@DaanHoogland a Trillian-Jenkins matrix job (centos7 mgmt + xs71, centos7 mgmt + vmware67, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

blueorangutan · 2020-09-22T02:41:47Z

Trillian test result (tid-2818)
Environment: vmware-67u3 (x2), Advanced Networking with Mgmt server 7
Total time taken: 56851 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr4282-t2818-vmware-67u3.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Smoke tests completed. 82 look OK, 0 have error(s)
Only failed tests results shown below:

Test	Result	Time (s)	Test File

engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java

ravening · 2020-09-23T10:01:34Z

@RodrigoDLopez I tried migrating the same VM about 10 times between 5 hypervisors and never got an NPE. Also, the last_host_id field was not cleared.

RodrigoDLopez · 2020-09-23T19:39:22Z

Hi @ravening, can you report the methodology you used?

Which ACS version are you using?
How are you executing the tests? Are you migrating running VMs? Stopped VMs? Or stopped VMs that you then start, and only then migrate again?

If you take a look at the code, a null pointer will certainly happen. Just look at the lines I am changing.

Before this PR, the first migration has a host_id or last_host_id, but during the migration ACS sets last_host_id to null through the afterStorageMigrationCleanup method. On the next execution of the migrate command (without starting the VM), this field is null, and an NPE is thrown on this line:

cloudstack/engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java

Lines 2282 to 2283 in d657ef7

 final HostVO srcHost = _hostDao.findById(srchostId); 

 final Long srcClusterId = srcHost.getClusterId();

Furthermore, some of these parameters were never used for some hypervisors; it seems that only VMware needs them. That is why I moved them to where they are used and added the necessary validations, with a log message that gives appropriate information about the process.
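The kind of guard described in this comment can be sketched roughly as below. This is only an illustration, not the code in the PR: MigrationGuardSketch, Host, HostLookup and resolveSourceClusterId are hypothetical stand-ins for HostVO, _hostDao and the surrounding method, and the log messages are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins only; none of these types are the real CloudStack classes.
final class MigrationGuardSketch {

    /** Minimal stand-in for a host record (not HostVO). */
    static final class Host {
        final long id;
        final Long clusterId;

        Host(long id, Long clusterId) {
            this.id = id;
            this.clusterId = clusterId;
        }
    }

    /** In-memory stand-in for a host DAO; findById returns null for null or unknown ids. */
    static final class HostLookup {
        private final Map<Long, Host> hosts = new HashMap<>();

        void add(Host host) {
            hosts.put(host.id, host);
        }

        Host findById(Long id) {
            return id == null ? null : hosts.get(id);
        }
    }

    /**
     * Resolves the source cluster id. Returns null, instead of throwing a
     * NullPointerException, when the VM has neither a host id nor a last host id,
     * or when the referenced host cannot be found.
     */
    static Long resolveSourceClusterId(HostLookup lookup, Long hostId, Long lastHostId) {
        Long srcHostId = hostId != null ? hostId : lastHostId;
        if (srcHostId == null) {
            System.out.println("VM has no host_id or last_host_id; skipping source cluster lookup");
            return null;
        }
        Host srcHost = lookup.findById(srcHostId);
        if (srcHost == null) {
            System.out.println("Source host " + srcHostId + " not found; skipping source cluster lookup");
            return null;
        }
        return srcHost.clusterId;
    }

    public static void main(String[] args) {
        HostLookup lookup = new HostLookup();
        lookup.add(new Host(1L, 10L));
        // The first migration still has a last host id; the second one has neither id.
        System.out.println(resolveSourceClusterId(lookup, null, 1L));
        System.out.println(resolveSourceClusterId(lookup, null, null));
    }
}
```

The point of the sketch is only that both lookups are checked before getClusterId() would be called, so the second migration of a stopped VM is reported in the log rather than failing with an NPE.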

ravening · 2020-09-23T20:12:28Z

@RodrigoDLopez I'm using ACS 4.14.

I'm doing a regular migration of a running VM between multiple hosts. I didn't do a migration with storage. Also, I'm migrating on KVM hosts, not VMware.

RodrigoDLopez · 2020-09-29T14:17:49Z

Hi @ravening, that was my fault.
You are doing a live migration. This NPE occurs when you:

1. stop an instance;
2. migrate it with its volume;
3. do not start the instance;
4. try to migrate it again to another pool.

With these steps, host_id and last_host_id are set to null after the first migration. Then, when you try to migrate the VM again, an NPE is thrown.

ravening · 2020-10-01T20:58:46Z

@RodrigoDLopez ah, OK. So you are just doing a volume migration of the VM. I will try these steps and let you know the result.

RodrigoDLopez · 2021-01-04T14:52:41Z

@rhtyd, @ravening, @DaanHoogland

Hello guys, happy new year.

Is there something missing here? Everything seems to be OK with the patch. It is a good bug fix, it has no errors, and the tests are passing.

RodrigoDLopez · 2021-01-18T17:15:19Z

@ravening @rhtyd @harikrishna-patnala @nvazquez

Hello everyone, this PR fixes a bug and is a simple change.
I believe it is in the interest of the whole community that this bug be fixed.
Would it be possible to get some reviews or tests on this PR?

ravening · 2021-01-18T17:18:21Z

Will review and test it this week

nvazquez

Looks good, I've left a few comments

engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java

DaanHoogland · 2021-01-19T10:00:14Z

@RodrigoDLopez do you want this on latest or on one of the release branches (4.14 or 4.15)?

RodrigoDLopez · 2021-01-19T23:22:39Z

I will make these changes. Thanks for the hint.

RodrigoDLopez · 2021-02-19T20:16:00Z

@DaanHoogland, regarding whether this should go on latest or on one of the release branches (4.14 or 4.15): for me, it is fine to merge it only into master, since it is an unreported bug. But if you want, I can make the necessary changes to merge this fix into 4.14 as well.

@nvazquez @ravening @rhtyd @harikrishna-patnala
Is there any update on this PR? Is there something I need to change?

rohityadavcloud · 2021-03-06T09:14:44Z

@RodrigoDLopez can you fix the conflict?

GabrielBrascher · 2021-06-08T12:07:16Z

I just reproduced this very issue; this PR would be nice to have (4.15.2+, I think that for 4.15.1 it is not feasible).
@RodrigoDLopez can you please fix the conflicts?

rohityadavcloud · 2021-06-08T12:23:15Z

I'm okay to get it in 4.15.1 if it's fixing the NPE. cc @shwstppr @Pearl1594

shwstppr · 2021-06-08T12:56:54Z

@GabrielBrascher were you able to reproduce this with 4.15.1 RC?
From the code changes, I feel this affects only VMware. Is that correct @RodrigoDLopez ?
Last time I was not able to reproduce this after #4385. Can check again

GabrielBrascher · 2021-06-08T16:57:55Z

@rhtyd we got it on 4.15.0.0, but it seems that the codebase at 4.15.1.0 RC1 did not change at that point.
@shwstppr I think it can happen with any hypervisor as long as one migrates the VM; I saw this issue with KVM.

Steps to reproduce:

Case: migrate VM (e.g. for maintenance), keep VM stopped, try to migrate back the VM -> NullPointer

1. Migrate VM from host A to host B
2. VM is still stopped at host B
3. Try to migrate VM back to host A
4. Null pointer

Stack trace:

java.lang.NullPointerException
        at com.cloud.vm.VirtualMachineManagerImpl.migrateThroughHypervisorOrStorage(VirtualMachineManagerImpl.java:2302)
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStorageMigration(VirtualMachineManagerImpl.java:2194)
        at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStorageMigration(VirtualMachineManagerImpl.java:5625)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)

Checking the stack trace together with the DB shows that the issue is indeed caused by last_host_id being null, as reported by @RodrigoDLopez.

shwstppr · 2021-06-08T17:46:33Z

@GabrielBrascher I've not tested yet and will try it tomorrow, but I feel there has been a code change that would prevent the NPE you shared:
4.15.0.0 (failing to get clusterId for the source host as srcHost is null): https://github.com/apache/cloudstack/blob/4.15.0.0/engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java#L2302
4.15 branch (logic for finding source clusterId has been changed. It is now found from host_id or last_host_id or storage pool of the VMs volumes): https://github.com/apache/cloudstack/blob/4.15/engine/orchestration/src/main/java/com/cloud/vm/VirtualMachineManagerImpl.java#L2318
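The fallback order described for the 4.15 branch (host_id, then last_host_id, then the storage pool of the VM's volumes) could look roughly like the sketch below. It is not the actual CloudStack code: SourceClusterFallbackSketch, findSourceClusterId and the two maps are hypothetical, with the maps standing in for database lookups.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: the maps stand in for database lookups and are not CloudStack APIs.
final class SourceClusterFallbackSketch {

    /** Looks up a key in a map, tolerating a null key (immutable maps reject null keys). */
    private static Optional<Long> lookup(Map<Long, Long> map, Long key) {
        return key == null ? Optional.empty() : Optional.ofNullable(map.get(key));
    }

    /**
     * Tries host_id first, then last_host_id, then the storage pools of the
     * VM's volumes, and returns the first cluster id that can be resolved.
     */
    static Optional<Long> findSourceClusterId(Long hostId, Long lastHostId, List<Long> volumePoolIds,
                                              Map<Long, Long> hostToCluster, Map<Long, Long> poolToCluster) {
        Optional<Long> fromHost = lookup(hostToCluster, hostId);
        if (fromHost.isPresent()) {
            return fromHost;
        }
        Optional<Long> fromLastHost = lookup(hostToCluster, lastHostId);
        if (fromLastHost.isPresent()) {
            return fromLastHost;
        }
        for (Long poolId : volumePoolIds) {
            Optional<Long> fromPool = lookup(poolToCluster, poolId);
            if (fromPool.isPresent()) {
                return fromPool;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<Long, Long> hostToCluster = Map.of(1L, 10L);
        Map<Long, Long> poolToCluster = Map.of(7L, 20L);
        // A stopped VM whose host ids were cleared still resolves through its volume's pool.
        System.out.println(findSourceClusterId(null, null, List.of(7L), hostToCluster, poolToCluster));
    }
}
```

With a chain like this, a stopped VM whose host_id and last_host_id were both cleared can still be resolved to a cluster through its volumes, which matches the later reports in this thread that the NPE no longer reproduces on 4.15.1.0 RC2.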

DaanHoogland · 2021-06-10T15:17:25Z

@RodrigoDLopez can you address the conflict?
@rhtyd @nvazquez are your comments addressed to satisfaction?

GabrielBrascher · 2021-06-10T21:39:46Z

@shwstppr thanks for bringing those changes, I might have tested with 4.15.0.0 then instead of RC1. I will take another look at it.

GabrielBrascher · 2021-06-17T17:25:56Z

@rhtyd @RodrigoDLopez @shwstppr @DaanHoogland I have been running some tests on RC2 and I was not able to reproduce this issue with 4.15.1.0 RC2.

This one looks to be fixed.

rohityadavcloud · 2021-06-17T17:31:29Z

Thanks for testing @GabrielBrascher, @RodrigoDLopez can you test 4.15.1.0 RC2?

RodrigoDLopez · 2021-06-23T14:02:58Z

not needed anymore...

rafaelweingartner · 2021-06-23T14:22:33Z

@RodrigoDLopez why is this one not needed anymore? Was it addressed somewhere else?

svenvogel added type:bug type:enhancement component:management-server labels Aug 25, 2020

rohityadavcloud requested review from DaanHoogland, harikrishna-patnala and nvazquez August 28, 2020 09:26

RodrigoDLopez closed this Aug 30, 2020

RodrigoDLopez deleted the prevent_null_pointer_on_consecutive_vm_migrates branch August 30, 2020 01:52

RodrigoDLopez restored the prevent_null_pointer_on_consecutive_vm_migrates branch August 30, 2020 14:42

RodrigoDLopez reopened this Aug 30, 2020

DaanHoogland approved these changes Sep 21, 2020

rohityadavcloud reviewed Sep 23, 2020

PaulAngus removed the type:enhancement label Nov 7, 2020

nvazquez reviewed Jan 18, 2021

fix conflicts

f22c5a9

RodrigoDLopez force-pushed the prevent_null_pointer_on_consecutive_vm_migrates branch from fdf5f9b to f22c5a9 Compare June 11, 2021 13:01

RodrigoDLopez closed this Jun 23, 2021

RodrigoDLopez deleted the prevent_null_pointer_on_consecutive_vm_migrates branch June 23, 2021 14:03

GabrielBrascher mentioned this pull request Jul 28, 2021

remove the unnecessary check for tags when migrating volumes #4257

Merged

5 tasks
