virt-who takes a long time because it sends all hosts-to-guests mappings, and cannot do two things at the same time
Using Satellite 6.2.2.
We have two data centers that we typically provision to in pairs (for redundancy). virt-who takes too long sending the hosts-to-guests mapping for the vSphere in data center 1 to Satellite, so when the VM in data center 2 is built, registers (successfully), and then tries to enable repos, it fails with "Status: Not Subscribed".
I see in the logs that as the vm in dc1 is started up we get the message
[INFO] @subscriptionmanager.py:195 - Sending update in hosts-to-guests mapping for config "vsphere-dc1": 88 hypervisors and 1606 guests found
In Satellite I see the task Actions::Katello::Host::Hypervisors start, and it takes 4 minutes!
In the meantime the vm in dc2 is started up and tries to run the following:
# add subscription manager
yum -t -y -e 0 install subscription-manager
rpm -ivh http://satellite.example.com/pub/katello-ca-consumer-latest.noarch.rpm
echo "Registering the System"
subscription-manager register --org="ACME" --name="deploytest-dc2" --activationkey="acme-rhel7"
echo "Enabling Satellite Tools Repo"
echo "DEPRECATED: This may be removed in a future version of Satellite, please add Satellite Tools to your activation key(s)."
subscription-manager repos --enable=rhel-*-satellite-tools-*-rpms
echo "Installing Katello Agent"
yum -t -y -e 0 install katello-agent
... and fails to enable the repos and install since it doesn't have a valid subscription.
A minute after the Katello task was finished I see virt-who log:
[INFO] @subscriptionmanager.py:195 - Sending update in hosts-to-guests mapping for config "vsphere-dc2": 34 hypervisors and 357 guests found
and a new Katello task is started, for the hosts-to-guests mapping from the vSphere in dc2.
Now it is possible to run the commands manually.
So, what I believe I see is this (after reading the code, running tcpdump, etc.):
- There are two constant TCP connections from virt-who, one to each vSphere,
- virt-who gets the mappings from the vSpheres when the VMs are provisioned (and queues them)
- There is only one connection set up from virt-who to the Satellite server, when there are updates.
- virt-who sends the mapping over (quickly), but then waits until Satellite has handled all the mappings before it disconnects (that takes 4 minutes for 88 hypervisors and 1606 guests).
- Then it waits another minute (there is a 60 sec timeout specified in virtwho/__init__.py)
- Only then does it send over the mappings for the next vSphere. Too late...
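The timing in the list above can be put into numbers with a small sketch. It is only a model of the behaviour I believe I am seeing, not virt-who's actual code: the `Report` class and `serialized_send_times` function are made-up names, the 240 s figure is the observed 4-minute Katello task, the 60 s wait is the timeout mentioned above, and the dc2 processing time is a placeholder because I have not measured it.

```python
from dataclasses import dataclass


@dataclass
class Report:
    config: str        # virt-who config name, as shown in the log lines
    processing_s: int  # seconds Satellite needs to process this mapping


def serialized_send_times(reports, wait_s=60):
    """Return {config: seconds after the first send} for one-at-a-time reporting.

    Models the observed behaviour: each report blocks until Satellite has
    finished processing it, then there is a pause of wait_s before the next.
    """
    send_times = {}
    clock = 0
    for report in reports:
        send_times[report.config] = clock
        clock += report.processing_s + wait_s
    return send_times


# 240 s is the observed task duration for dc1. The dc2 value is a placeholder
# (not measured) and does not change when the dc2 report is sent.
reports = [Report("vsphere-dc1", 240), Report("vsphere-dc2", 60)]
send_times = serialized_send_times(reports)
print(send_times)
```

Under these assumptions the dc2 mapping is not sent until 300 seconds after the dc1 mapping, which is well after the dc2 VM has already tried to enable its repos.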
Is there anything I can do to make this go quicker? Including getting a "Request for enhancement" made?
Obviously Satellite (Candlepin) has performance problems (e.g. the 60 sec timeout between updates seems to have been added to prevent it from being overloaded).
Is it really necessary to send the full hosts-to-guests mapping every time one of these more than a thousand hosts changes?
Responses
Terje,
Thank you for including so much information with this discussion. There is one aspect here I don't understand. You've noted that you typically deploy VMs in pairs, one VM in each pair being deployed to a different data centre (DC). What I don't understand is how virt-who's processing of the host/guest mappings on vSphere instance 1 (at DC 1) could be delaying the successful registration of the second VM's deployment. When a VM is deployed and its registration is attempted, it ought to be granted a temporary subscription until Satellite can confirm on which hypervisor it is hosted, and therefore what subscriptions are available. A temporary subscription is valid for 24 hours.
On the topic of virt-who's performance, I cannot explain just why it is performing as you've identified. I'm hoping that someone from Engineering may provide insight. In the meantime, perhaps you could run two instances of virt-who, one per data centre? Alternatively, are there VMs which could be excluded from virt-who's scope? This may also help reduce the processing being done by virt-who and so improve responsiveness.
For details of limiting virt-who's scope, see Limiting the scope of virt-who access.
I would suggest you raise a support case for this problem as it's a difficult one to diagnose and resolve via an online discussion. If you come to a conclusion, it would be great to have that recorded here, for the benefit of anyone in a similar situation.
Terje,
I'll contact Engineering and see if they can contribute to this discussion. I believe we have many other customers with similarly-sized fleets of virtual machines, and I'm curious about their experience with virt-who. If I understand correctly, the essential problem here is that because virt-who insists on reporting details on the entire fleet of VMs, it takes a long time to do so, which can result in timeouts.
Could you please share the support case number regarding the SSL timeout? I'd be curious to look at the case's history.
Terje,
Some follow-up questions...
- What version of virt-who is installed?
- How are you provisioning VMs? Could you insert a delay of 5 minutes between deployments? I'd be curious to know if you encountered the same problem when there was a delay between deployments. I'm not suggesting this as a solution but merely a test.
Terje,
Thank you for providing that additional information. It helps me understand your environment better, including sizing and workflows.
Reading about reversing the order of deployment, to DC2 first, then DC1 second, is interesting. Given virt-who's behaviour, it makes sense that this would work since there's less data to be processed from DC2.
I am certainly not proposing adding a delay as a permanent solution. Instead it would be at best a temporary workaround, at least a means of better understanding what's happening in virt-who and Satellite. I will raise the profile of this discussion with Engineering to try and get their views on what's happening here.
Until virt-who's behaviour can be modified, assuming it's possible, so that it is handling only changes in the virtual environment, there are two other options available.
The first is to run two instances of virt-who, one each for DC1 and DC2. Although the environments in each are not the same size, this would allow changes in each DC to be processed in parallel. At present, virt-who handles events in succession. While it is not possible to make a single virt-who instance's processing parallel, having multiple virt-who instances allows for some parallelisation of virt-who processing.
Another suggestion is to register the hosts with an activation key with no products associated with it. The product subscriptions would be contained in a second activation key, and the activation keys stacked in the host group - e.g. "acme-empty7,acme-custom7". In this way, when hosts are registered, they would immediately be granted a 24-hour grace period.
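As a rough illustration of the stacked-key suggestion, the sketch below builds the same `subscription-manager register` call used in the provisioning snippet earlier in this thread, with the two example keys joined by a comma. The `register_command` helper is hypothetical, and I am assuming the stacked keys from the host group end up in that `--activationkey` argument; the key names are only the examples given above.

```python
import shlex


def register_command(org, name, activation_keys):
    """Build a subscription-manager register command with stacked keys.

    Hypothetical helper: it only assembles the argument list and does not
    run anything. Multiple keys are passed as one comma-separated value.
    """
    keys = ",".join(activation_keys)
    return [
        "subscription-manager",
        "register",
        f"--org={org}",
        f"--name={name}",
        f"--activationkey={keys}",
    ]


# Org and host name are taken from the snippet earlier in the thread;
# the key names are the examples from the suggestion above.
cmd = register_command("ACME", "deploytest-dc2", ["acme-empty7", "acme-custom7"])
print(shlex.join(cmd))
```

Whether the host is then granted the grace period immediately is something to confirm on a test registration before relying on it.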
You have mentioned that VMs' state changes are also being reported. Details of subscription events should be reported in the RHSM log file - /var/log/rhsm/rhsm.log. The virt-who daemon can be configured to output debug logging, as described in the Virtual Instances Guide. That level of logging may provide more insight into the virt-who daemon's behaviour.
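When going through the logs, it may help to pull out just the mapping updates. The following is a minimal sketch that assumes the message format is exactly as quoted earlier in this thread; the `parse_mapping_updates` function and the pattern are illustrative only and not part of virt-who.

```python
import re

# Pattern based on the two log lines quoted in this thread; other virt-who
# versions may word the message differently.
MAPPING_UPDATE = re.compile(
    r'Sending update in hosts-to-guests mapping for config "(?P<config>[^"]+)": '
    r"(?P<hypervisors>\d+) hypervisors and (?P<guests>\d+) guests found"
)


def parse_mapping_updates(lines):
    """Return (config, hypervisors, guests) for each mapping-update line."""
    updates = []
    for line in lines:
        match = MAPPING_UPDATE.search(line)
        if match:
            updates.append(
                (match["config"], int(match["hypervisors"]), int(match["guests"]))
            )
    return updates


sample = [
    '[INFO] @subscriptionmanager.py:195 - Sending update in hosts-to-guests '
    'mapping for config "vsphere-dc1": 88 hypervisors and 1606 guests found',
    "a line that is not a mapping update",
    '[INFO] @subscriptionmanager.py:195 - Sending update in hosts-to-guests '
    'mapping for config "vsphere-dc2": 34 hypervisors and 357 guests found',
]
updates = parse_mapping_updates(sample)
print(updates)
```

To use it on a real file, pass an open file object instead of `sample`, for example `parse_mapping_updates(open("/var/log/rhsm/rhsm.log"))`.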
Terje,
I agree that the root cause of the issue you're facing should be resolved.
Running multiple instances of virt-who is not mentioned in the Virtual Instances Guide, but I will raise a ticket to get that added.
In the meantime, I recommend you try the split virt-who configuration. I expect this will avoid the long processing time on one DC from delaying the processing of hosts from the other DC.
