Virt-who takes a long time because it sends all hosts-to-guest mappings, and cannot do two things at the same time

Using Satellite 6.2.2.

We have two datacenters that we typically provision to in pairs (for redundancy). We have problems with virt-who taking too long to send the hosts-to-guests mapping for the vSphere in datacenter 1 to Satellite, so when the VM in datacenter 2 is built, registers (successfully), and then tries to enable repos, it fails with "Status: Not Subscribed".

I see in the logs that as the VM in dc1 is started up we get the message:

 [INFO] @subscriptionmanager.py:195 - Sending update in hosts-to-guests mapping for config "vsphere-dc1": 88 hypervisors and 1606 guests found

In Satellite I see the task Actions::Katello::Host::Hypervisors start, and it takes 4 minutes!
In the meantime the VM in dc2 is started up and tries to run the following:

  # add subscription manager
  yum -t -y -e 0 install subscription-manager
  rpm -ivh http://satellite.example.com/pub/katello-ca-consumer-latest.noarch.rpm

  echo "Registering the System"
  subscription-manager register --org="ACME" --name="deploytest-dc2" --activationkey="acme-rhel7"

  echo "Enabling Satellite Tools Repo"
  echo "DEPRECATED: This may be removed in a future version of Satellite, please add Satellite Tools to your activation key(s)."
  subscription-manager repos --enable=rhel-*-satellite-tools-*-rpms

  echo "Installing Katello Agent"
  yum -t -y -e 0 install katello-agent

... and it fails to enable the repos and install the agent, since it doesn't have a valid subscription.

A minute after the Katello task was finished I see virt-who log:

 [INFO] @subscriptionmanager.py:195 - Sending update in hosts-to-guests mapping for config "vsphere-dc2": 34 hypervisors and 357 guests found

and a new Katello task is started, for the hosts-to-guests mapping from the vSphere in dc2.

Now it is possible to run the commands manually.

So, what I believe I see is this (after trying to read the code, having tcpdump going, etc.):
- There are two persistent TCP connections from virt-who, one to each vSphere.
- virt-who gets the mappings from the vSpheres when the VMs are provisioned (and queues them).
- There is only one connection set up from virt-who to the Satellite server, used when there are updates.
- virt-who sends the mapping over (quickly), but then waits until Satellite has handled all the mappings and is done, then disconnects (and that takes 4 minutes for 88 hypervisors and 1606 guests).
- Then it waits another minute (there is a 60-second timeout specified in virt-who's __init__.py).
- Only then does it send over the mappings for the next vSphere. Too late...
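What this looks like, modelled as a minimal Python sketch (hypothetical names and structure, based only on the observations above, not on virt-who's actual code):

```python
import time

def process_queue(queue, send_and_wait, pause=60):
    """Send queued hosts-to-guests reports strictly one at a time.

    Each report blocks until Satellite has fully processed it
    (send_and_wait), then a fixed pause is observed before the
    next report. All names here are illustrative.
    """
    sent = []
    for config_name, mapping in queue:
        send_and_wait(config_name, mapping)  # blocks ~4 minutes for dc1
        sent.append(config_name)
        time.sleep(pause)                    # the observed 60-second gap
    return sent
```

With this model, dc2's report cannot even start until dc1's has been fully processed plus the pause, which matches the timeline in the logs.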

Is there anything I can do to make this go quicker? Including getting a "Request for enhancement" made?

Obviously Satellite (Candlepin) has performance problems. (E.g. the 60-second pause between reports seems to have been added to prevent it being overloaded.)

Is it really necessary to send the full hosts-to-guests mapping every time one of these more than a thousand guests changes?
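For illustration, a delta-based report could be computed from two successive snapshots like this (a sketch with made-up data shapes; this is not virt-who's real report format, and Candlepin would also need an API that accepts partial updates):

```python
def mapping_delta(old, new):
    """Return (changed, removed): hypervisors whose guest sets differ
    between two snapshots, and hypervisors that disappeared.

    Both arguments map hypervisor IDs to sets of guest UUIDs
    (illustrative shapes only).
    """
    changed = {hyp: guests for hyp, guests in new.items()
               if old.get(hyp) != guests}
    removed = [hyp for hyp in old if hyp not in new]
    return changed, removed
```

A single newly powered-on VM would then produce a report covering one hypervisor instead of all 88.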

Responses

Terje,

Thank you for including so much information with this discussion. There is one aspect here I don't understand. You've noted that you typically deploy VMs in pairs, one VM in each pair being deployed to a different data centre (DC). What I don't understand is how virt-who's processing of the host/guest mappings on vSphere instance 1 (at DC 1) could be delaying the successful registration of the second VM's deployment. When a VM is deployed and its registration is attempted, it ought to be granted a temporary subscription until Satellite can confirm on which hypervisor it is hosted, and therefore what subscriptions are available. A temporary subscription is valid for 24 hours.

On the topic of virt-who's performance, I cannot explain just why it is performing as you've identified. I'm hoping that someone from Engineering may provide insight. In the meantime, perhaps you could run two instances of virt-who, one per data centre? Alternatively, are there VMs which could be excluded from virt-who's scope? This may also help reduce the processing being done by virt-who and so improve responsiveness.

For details of limiting virt-who's scope, see Limiting the scope of virt-who access.

I would suggest you raise a support case for this problem as it's a difficult one to diagnose and resolve via an online discussion. If you come to a conclusion, it would be great to have that recorded here, for the benefit of anyone in a similar situation.

We (this is for a customer) have opened a case, and their answer eventually was also to rely on the temporary subscription. This doesn't work for us because there are at the moment subscriptions assigned to the activation keys used, both for Red Hat and custom products. I'll discuss with them whether this can be done differently, but at the moment it is not working, and this was the reason for me to dive into virt-who to see how it really works.

Actually, the support case was originally opened because we were seeing SSLTimeout errors when the task exceeds 3 minutes. Like this:

 2016-11-29 16:16:42,318 [INFO] @subscriptionmanager.py:194 - Sending update in hosts-to-guests mapping for config "vcenterdc1": 88 hypervisors and 1609 guests found
 2016-11-29 16:19:42,406 [ERROR] @executor.py:156 - Error in communication with subscription manager: SSLTimeoutError: timed out

This 3-minute timeout is settable, and after some (clueless) suggestions from support I dived into the code (I'm not a programmer, but I sometimes write scripts in Python, so I know the basics) and found that it was set in rhsm's config.py. Finally, support came up with the information that one can now add server_timeout = X in /etc/rhsm/rhsm.conf.

So a tip to others who have many hypervisors and VMs: if you get SSLTimeouts in /var/log/rhsm/rhsm.log (something that seems to happen if the "Hypervisors" task in Satellite takes more than 3 minutes to complete), the setting is in rhsm, not in virt-who. It seems to be a recent addition. (See e.g. Bug 1346368 - man rhsm.conf is missing a description for the server_timeout configuration, https://bugzilla.redhat.com/show_bug.cgi?id=1346368)
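For reference, the setting goes in the [server] section of /etc/rhsm/rhsm.conf; the value below is just an example, pick one comfortably above your longest Hypervisors task:

```ini
# /etc/rhsm/rhsm.conf (excerpt)
[server]
# Request timeout, in seconds, for calls to the subscription service.
# The default corresponds to the 3-minute timeout seen above; raise it
# if the Hypervisors task in Satellite regularly takes longer.
server_timeout = 600
```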

About your suggestion of running two virt-whos: does Katello (Candlepin) handle this, or handle this well? From the virt-who code and from bug reports I see that there has been work done to prevent it from being overloaded, or at least so it looks to me. E.g. the 60-second pause that was added (in virt-who's __init__.py) before allowing a new report to be sent. If you add a second virt-who, this pause will not apply across both, since it is per instance of virt-who (and is it really needed?). Is this a tested scenario?

Anyway, one of the things I really started wondering about when looking at how it works is why it sends all hypervisor/guest mappings every time there is a single change. There can be hundreds or thousands of them. It probably worked well for the developer(s) on, may we guess, a developer's workstation with a little test system with a few hypervisors and guests. But this isn't exactly a very "enterprisey" way of doing it; it just doesn't scale very well. Is there a reason for this, or wouldn't it be better to just send the changes? Maybe a full report at startup (and additionally at some interval), but normally just the mappings that have changed?

And there are so many things we have had problems with that are of a similar character (like the infamous errata DB search on the dashboard that, together with some of the other widgets, contributed to a 3 1/2-minute login delay), so I have this idea of raising awareness of this -- in order to get it fixed.

So it would be nice if we could get a discussion about it. I don't know if anyone from Engineering "hangs around" here, or maybe there is a better place to discuss it. Ideas welcome.

Terje,

I'll contact Engineering and see if they can contribute to this discussion. I believe we have many other customers with similarly-sized fleets of virtual machines, and I'm curious about their experience with virt-who. If I understand correctly, the essential problem here is that because virt-who insists on reporting details on the entire fleet of VMs, it takes a long time to do so, which can result in timeouts.

Could you please share the support case number regarding the SSL timeout? I'd be curious to look at the case's history.

01748744

Terje,

Some follow-up questions...

  1. What version of virt-who is installed?
  2. How are you provisioning VMs? Could you insert a delay of 5 minutes between deployments? I'd be curious to know if you encountered the same problem when there was a delay between deployments. I'm not suggesting this as a solution but merely a test.

virt-who-0.17-10.el7.noarch

We haven't tried adding a delay, but I believe that would work, since what I have done sometimes is to run the commands from the kickstart file manually after I see that virt-who has run for datacenter 2. (Then, under the content host, I can go to Subscriptions > Events and see that a subscription for RHEL for Virtual Datacenters has been attached.)
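Instead of a fixed delay, the same effect could be had by retrying the repo-enable step until the subscription has been attached. A sketch for the kickstart %post (the retry helper and the attempt/delay values are made up):

```shell
#!/bin/sh
# Hypothetical retry helper: run a command until it succeeds,
# up to a maximum number of attempts, sleeping between tries.
retry() {
  max_tries=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$max_tries" ]; do
    "$@" && return 0
    echo "attempt $i/$max_tries failed, retrying in ${delay}s" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# In %post, wrap the step that fails while Satellite is still
# processing the hosts-to-guests mapping, e.g.:
# retry 10 60 subscription-manager repos --enable=rhel-*-satellite-tools-*-rpms
```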

Actually, since datacenter 2 has far fewer hypervisors and VMs (34 hypervisors and 357 guests), it even works if I provision a pair of servers in the reverse order, so that the one in datacenter 2 starts up before the one in datacenter 1.

We might add a delay as a temporary workaround, but it is not a solution, especially since we are growing and are in the process of adding more hypervisors and VMs. The task takes longer and longer, so this delay would have to grow too, and the bigger the datacenter gets, the more often there are changes. Remember it is not only when a new server is provisioned, but also when a guest's state changes (up, down, paused). Many (most?) of our servers are development and test servers that can change often.

Already the Monitor > Tasks page in Satellite mostly lists Hypervisors tasks, with a few other tasks in between.

Terje,

Thank you for providing that additional information. It helps me understand your environment better, including sizing and workflows.

Reading about reversing the order of deployment, to DC2 first, then DC1 second, is interesting. Given virt-who's behaviour, it makes sense that this would work since there's less data to be processed from DC2.

I am certainly not proposing adding a delay as a permanent solution. Instead it would be at best a temporary workaround, at least a means of better understanding what's happening in virt-who and Satellite. I will raise the profile of this discussion with Engineering to try and get their views on what's happening here.

Until virt-who's behaviour can be modified, assuming that's possible, so that it handles only changes in the virtual environment, there are two other options available.

The first is to run two instances of virt-who, one each for DC1 and DC2. Although the environments are not the same size, this would allow the changes in each DC to be processed in parallel. At present, virt-who handles events in succession; while it is not possible to make a single virt-who instance's processing parallel, having multiple instances allows for some parallelisation of virt-who processing.

The second is to register the hosts with an activation key with no products associated with it. The product subscriptions would be contained in a second activation key, and the activation keys stacked in the host group - e.g. "acme-empty7,acme-custom7". In this way, when hosts are registered, they would immediately be granted a 24-hour grace period.
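If you try the two-instance route, each instance would get its own configuration file in /etc/virt-who.d/ on its own host. A sketch of what the DC1 side might look like (hostname, username, and owner/env values are placeholders; the encrypted password would come from the virt-who-password utility):

```ini
# /etc/virt-who.d/dc1.conf -- on the virt-who host dedicated to DC1
[vsphere-dc1]
type=esx
server=vcenter-dc1.example.com
username=virtwho@vsphere.local
encrypted_password=<output of virt-who-password>
owner=ACME
env=Library
```

A second host would carry an equivalent dc2.conf pointing at the DC2 vCenter, so the two mappings are sent over independent connections.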

You have mentioned that VMs' state changes are also being reported. Details of subscription events should be reported in the RHSM log file, /var/log/rhsm/rhsm.log. The virt-who daemon can be configured to output debug logging, as described in the Virtual Instances Guide. That level of logging may provide more insight into the virt-who daemon's behaviour.

Thanks. I think involving Engineering is "The Right Thing"(tm) to do here. Even if we have workarounds at our site for the not-getting-a-subscription-in-time problem, the real problem is that virt-who (together with Candlepin) doesn't scale as one should expect from an enterprise product.

And we need to be able to scale: as of this week we have added more vCenters, so we now have 4. There are only a few servers in the new centers yet, but that will change.

I might test running two virt-who instances, separating out the one for DC1, but I have been a bit reluctant to try it because, browsing through bug reports etc., I see that there has been work done to lower the load on Candlepin (like virt-who adding a 60-second delay before contacting it again, something that of course will not happen if there are multiple instances). And I don't see this scenario mentioned in the docs either. Is anybody else using it, and if so, what is their experience?

By the way, I have used debug mode extensively (even adding more logging statements in the Python files to get more info), but it is not fun when you come to the big JSON structure that is dumped into the log.

Terje,

I agree that the root cause of the issue you're facing should be resolved.

Running multiple instances of virt-who is not mentioned in the Virtual Instances Guide, but I will raise a ticket to get that done.

In the meantime, I recommend you try the split virt-who configuration. I expect this will avoid the long processing time on one DC from delaying the processing of hosts from the other DC.
