Installing errata from satellite web UI (Sat6.1.7)

Latest response

When installing errata either from bulk actions or for a single content host we cannot get the action to complete as we always receive a host did not respond within 20 seconds message.

There appear to be a total of 15 poll_attempts when looking at the dynflow console for pulp_tasks and non failed, then there is a runtime error with the host did not respond.

This happens even though gofer and katello-agent are installed on the systems.

We are running over a very limited bandwidth network, with 3,700 content hosts, and 13 capsules out there as well.

Is there any way to debug why this is occurring?

Responses

If you found something like this in your log files it would indicate rest_client_timeout value is being reached "Request Timeout (RestClient::RequestTimeout)". Also did you install the katello-agent from the Red Hat Satellite Tools repository?

I think you hit https://bugzilla.redhat.com/show_bug.cgi?id=1281947 - let try restarting qdrouterd on Satellite (and on Capsule, if the hosts you try to install errata on are registered to a capsule).

If that won't help, let's try:

qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q | grep pulp.agent | grep UUID

where UUID is UUID of some content host expecting the errata to install. If that qpid broker queue will have some content (first number (queue depth) will be >0), then katello agent on the host hasnt successfully fetched messages from the qpid broker (it inter-connects there via qdrouter).

If there will be zero, then either katello didnt generate pulp task (improbable - let try apply the action once again and see if 2nd and 3rd number in that output will raise - # enqueues and # dequeues), or rather katello agent got the tasks and put them to /var/lib/gofer/messaging/pending/katelloplugin/ on the content host - worth checking if the directory has something (see timestamps in the json filenames for a hint when task was fetched).

Kind regards, Pavel Moravec GSS SEG

Thanks will see if this is what's causing the issue(s).

We have seen that on those content hosts that have issue goferd is at 100% cpu core utilization. We did update the the bugfix of goferd that was supposed to resolve this issue when there are connection problems to the capsule, however we are still seeing this issue even when the content host can connect to the capsule.

Hello. It looks like I'm running into this problem as well. I ran the qpid-stat command and my results are below. Also there was 1 json file in the /var/lib/gofer/messaging/katelloplugin/ directory. Unsure what to do from here. Thanks.

 pulp.agent.160be7a8-92b3-48ef-b094-373ecf8da101               Y                     22    22      0    16.7k  16.7k       0         0     1
  pulp.agent.510549d2-08bf-47a1-86dd-f9f510e8eaf5               Y                      0     0      0       0      0        0         0     1
  pulp.agent.666574c5-beb0-4868-b744-5c25507cf2ea               Y                     23    23      0    17.3k  17.3k       0         0     1
  pulp.agent.8ab0c655-1b14-4d04-b3ec-b15bdb98d43f               Y                     32    32      0    23.3k  23.3k       0         0     1

Hello, the three content hosts with UUIDs:

160be7a8-92b3-48ef-b094-373ecf8da101
666574c5-beb0-4868-b744-5c25507cf2ea
8ab0c655-1b14-4d04-b3ec-b15bdb98d43f

(you can check their hostnames e.g. via hammer content-host list) have pending requests to errata apply / package (un)install or potentially capsule sync. The 510549d2-08bf-47a1-86dd-f9f510e8eaf5 has nothing pending since no request has been made to it - its queue has no enqueues at all (2nd number in that row).

Generally, this means either qdrouterd on Satellite or Capsule is down or not responding (check that service or listening port 5647 - service restart might help, then) or goferd on those Content Hosts not running / connected to qdrouterd on port 5647.

Worth checking statuses of qdrouterd on Satellite/Capsule, goferd on the content hosts and established TCP connection initiated from the hosts to Satellite/Capsule port 5647 (is it opened on firewall?).

The fact goferd has permanent some json file means it does not process that request - that sounds like goferd not running or some bug / problem in executing the job, if process is running and the file persists there forever - worth opening a support case and providing the file, then.

We have seen this when goferd gets stuck ... there were a few bugs reported on goferd and race conditions,

https://access.redhat.com/solutions/1605793 https://access.redhat.com/solutions/1344383

But we still continue to see goferd get stuck :

Mar 16 07:02:21 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] gofer.messaging.adapter.proton.connection:100 - connecting: URL: amqps://satellite-ancf.er.awips.noaa.gov:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Mar 16 07:02:21 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] root:487 - connecting to satellite-ancf.er.awips.noaa.gov:5647...
Mar 16 07:02:32 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] root:527 - Disconnected
Mar 16 07:02:32 px1-bcq goferd: [ERROR][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] gofer.messaging.adapter.proton.connection:106 - connect: proton+amqps://satellite-ancf.er.awips.noaa.gov:5647, failed: Connection amqps://satellite-ancf.er.awips.noaa.gov:5647 disconnected
Mar 16 07:02:32 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] gofer.messaging.adapter.proton.connection:108 - retry in 10 seconds
Mar 16 07:02:42 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] gofer.messaging.adapter.proton.connection:100 - connecting: URL: amqps://satellite-ancf.er.awips.noaa.gov:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Mar 16 07:02:42 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] root:487 - connecting to satellite-ancf.er.awips.noaa.gov:5647...
Mar 16 07:02:53 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] root:527 - Disconnected
Mar 16 07:02:53 px1-bcq goferd: [ERROR][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] gofer.messaging.adapter.proton.connection:106 - connect: proton+amqps://satellite-ancf.er.awips.noaa.gov:5647, failed: Connection amqps://satellite-ancf.er.awips.noaa.gov:5647 disconnected
Mar 16 07:02:53 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] gofer.messaging.adapter.proton.connection:108 - retry in 12 seconds
Mar 16 07:03:05 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] gofer.messaging.adapter.proton.connection:100 - connecting: URL: amqps://satellite-ancf.er.awips.noaa.gov:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Mar 16 07:03:05 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] root:487 - connecting to satellite-ancf.er.awips.noaa.gov:5647...
Mar 16 07:03:23 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] gofer.messaging.adapter.proton.connection:103 - connected: amqps://satellite-ancf.er.awips.noaa.gov:5647
Mar 16 07:03:23 px1-bcq goferd: [INFO][pulp.agent.c9442202-ba23-4ca4-92eb-28c0c8422ddd] root:507 - connected to satellite-ancf.er.awips.noaa.gov:5647
Mar 16 07:03:39 px1-bcq goferd: [INFO][Thread-3] rhsm.connection:646 - Using certificate authentication: key = /etc/pki/consumer/key.pem, cert = /etc/pki/consumer/cert.pem, ca = /etc/rhsm/ca/, insecure = False
Mar 16 07:03:39 px1-bcq goferd: [INFO][Thread-3] rhsm.connection:657 - Connection Built: host: satellite-ancf.er.awips.noaa.gov, port: 443, handler: /rhsm
Mar 16 07:03:39 px1-bcq goferd: [ERROR][Thread-3] katelloplugin:148 - send enabled report failed: Error loading certificate
Mar 16 09:28:50 px1-bcq goferd: [INFO][Thread-3] rhsm.connection:646 - Using certificate authentication: key = /etc/pki/consumer/key.pem, cert = /etc/pki/consumer/cert.pem, ca = /etc/rhsm/ca/, insecure = False
Mar 16 09:28:50 px1-bcq goferd: [INFO][Thread-3] rhsm.connection:657 - Connection Built: host: satellite-ancf.er.awips.noaa.gov, port: 443, handler: /rhsm
Mar 16 09:28:50 px1-bcq goferd: [ERROR][Thread-3] katelloplugin:148 - send enabled report failed: Error loading certificate
Mar 16 10:25:06 px1-bcq auditd[2400]: Audit daemon rotating log files
Mar 16 13:28:59 px1-bcq goferd: [INFO][Thread-3] rhsm.connection:646 - Using certificate authentication: key = /etc/pki/consumer/key.pem, cert = /etc/pki/consumer/cert.pem, ca = /etc/rhsm/ca/, insecure = False
Mar 16 13:28:59 px1-bcq goferd: [INFO][Thread-3] rhsm.connection:657 - Connection Built: host: satellite-ancf.er.awips.noaa.gov, port: 443, handler: /rhsm
Mar 16 13:28:59 px1-bcq goferd: [INFO][Thread-3] katelloplugin:345 - reporting: {'enabled_repos': {'repos': []}}

Where it just seems to loose it's connection and then gets hung up.

Also strace on the python process following child PIDs shows:

[pid  4384] read(12, 0x7f4e6a6819f0, 64) = -1 EBADF (Bad file descriptor)
[pid  3761] <... select resumed> )      = 0 (Timeout)
[pid  4384] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  3761] futex(0x1fdf920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4384] <... futex resumed> )       = 0
[pid  3761] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4384] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  4384] futex(0x1fdf920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  3761] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4384] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  3761] <... futex resumed> )       = 0
[pid  4384] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  3761] futex(0x1fdf920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4384] <... futex resumed> )       = 0
[pid  3761] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4384] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  3761] stat("/etc/pki/consumer/cert.pem",  <unfinished ...>
[pid  4384] poll([{fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 2, 1853 <unfinished ...>
[pid  3761] <... stat resumed> {st_mode=S_IFREG|0640, st_size=1602, ...}) = 0
[pid  4384] <... poll resumed> )        = 1 ([{fd=12, revents=POLLIN}])
[pid  4384] read(12, 0x7f4e6a6819f0, 64) = -1 EBADF (Bad file descriptor)
[pid  4384] futex(0x1fdf920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  3761] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4384] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4384] poll([{fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 2, 1852 <unfinished ...>
[pid  3761] <... futex resumed> )       = 0
[pid  4384] <... poll resumed> )        = 1 ([{fd=12, revents=POLLIN}])
[pid  4384] read(12, 0x7f4e6a6819f0, 64) = -1 EBADF (Bad file descriptor)
[pid  4384] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  3761] futex(0x1fdf920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  4384] <... futex resumed> )       = 0
[pid  4384] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  4384] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  3761] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  4384] <... futex resumed> )       = 0
[pid  4384] futex(0x1fdf920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  3761] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4384] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  3761] <... futex resumed> )       = 0
[pid  4384] futex(0x1fdf920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  3761] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4384] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  3761] <... futex resumed> )       = 0
[pid  4384] poll([{fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 2, 1852 <unfinished ...>
[pid  3761] stat("/etc/yum.repos.d/redhat.repo",  <unfinished ...>
[pid  4384] <... poll resumed> )        = 1 ([{fd=12, revents=POLLIN}])
[pid  4384] read(12, 0x7f4e6a6819f0, 64) = -1 EBADF (Bad file descriptor)
[pid  4384] poll([{fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 2, 1852 <unfinished ...>
[pid  3761] <... stat resumed> {st_mode=S_IFREG|0664, st_size=2987, ...}) = 0
[pid  4384] <... poll resumed> )        = 1 ([{fd=12, revents=POLLIN}])
[pid  4384] read(12, 0x7f4e6a6819f0, 64) = -1 EBADF (Bad file descriptor)
[pid  4384] futex(0x1fdf920, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid  3761] futex(0x1fdf920, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  4384] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid  3761] <... futex resumed> )       = 0
[pid  3761] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[pid  4384] poll([{fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 2, 1852) = 1 ([{fd=12, revents=POLLIN}])
[pid  4384] read(12, 0x7f4e6a6819f0, 64) = -1 EBADF (Bad file descriptor)
[pid  4384] poll([{fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 2, 1851) = 1 ([{fd=12, revents=POLLIN}])
[pid  4384] read(12, 0x7f4e6a6819f0, 64) = -1 EBADF (Bad file descriptor)
[pid  4384] poll([{fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 2, 1851) = 1 ([{fd=12, revents=POLLIN}])
[pid  4384] read(12, 0x7f4e6a6819f0, 64) = -1 EBADF (Bad file descriptor)

Which still looks like a race condition in the updated goferd package.

Mar 16 09:28:50 px1-bcq goferd: [ERROR][Thread-3] katelloplugin:148 - send enabled report failed: Error loading certificate

That sounds like SSL certificate issue. Also see that goferd repeatedly logs connecting to satellite-ancf.er.awips.noaa.gov:5647... followed by Disconnected. That also supports the invalid certificate theory.

Let try reinstalling the latest CA certificate and restart goferd to reload them:

rpm -Uvh http://satellite-ancf.er.awips.noaa.gov/pub/katello-ca-consumer-latest.noarch.rpm
service goferd restart

So with firewall disabled on both satellite and client I still get the error. Running ss -antup doesn't show any connections to the client from the server aside from the ssh connection. I've restarted goferd on the client and qdrouterd on the satellite with no luck. Time to open up a support case? I have a few different errata that I would like to install via satellite right now. I can see they're available via yum check-update on the clients but the point of satellite is to not have to do anything on the clients.

Thanks!

First, there should be a connection from the client to the Satellite (to Satellite's listening port 5647, assuming you have Sat and Sat tools 6.1 and not 6.0), not vice versa.

What does goferd logs to /var/log/messages after the restart? Do you have proper verison of katello-ca-default package installed? Worth trying on the client:

rpm -qa | grep katello-ca-consumer
rpm -Uvh http://your.satellite.example.com/pub/katello-ca-consumer-latest.noarch.rpm

If you wont see an evident cause in /var/log/messages or in the CA certificate installed for SSL communication to the Satellite, let open a support case with: sosreport from the client and Satellite.

Pavel,

Definitely no connection between client and satellite server. Definitely using the Sat tools 6.1 repo in satellite and clients see the repo correctly.

# rpm -qa | grep katello-ca-consumer
katello-ca-consumer-satellite.server.com-1.0-1.noarch

# rpm -Uvh http://your.satellite.example.com/pub/katello-ca-consumer-latest.noarch.rpm
...
package katello-ca-consumer-satellite.server.com-1.0-1.noarch is already installed

# grep -i gofer /var/log/messages

Mar 22 07:43:45 client systemd: Stopping Gofer Agent...
Mar 22 07:43:45 client systemd: Started Gofer Agent.
Mar 22 07:43:45 client systemd: Starting Gofer Agent...
Mar 22 07:43:45 client goferd: [INFO][Thread-1] gofer.rmi.store:116 - Using: /var/lib/gofer/messaging/pending/builtin
Mar 22 07:43:45 client goferd: [INFO][MainThread] gofer.agent.plugin:615 - plugin:builtin loaded using: /usr/share/gofer/plugins/builtin.py
Mar 22 07:43:45 client goferd: [INFO][MainThread] gofer.agent.main:79 - agent started.

Doesn't look like anything is out of order with those results?

There should be a message similar to this in /var/log/messages on successful connection. I would try re-installing goferd and the certificate package to see if it helps any. This is for the 6.1 version of Satellite tools and Satellite 6.1.x server.

Mar 21 14:52:08 {client name} goferd: [INFO][pulp.agent.07fda6b3-2962-4de3-b436-af64ca8a0534] gofer.messaging.adapter.proton.connection:100 - connecting: URL: amqps://{satellite server}:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Mar 21 14:52:08 {client name} goferd: [INFO][pulp.agent.07fda6b3-2962-4de3-b436-af64ca8a0534] root:487 - connecting to {satellite server}:5647...
Mar 21 14:52:08 {client name} goferd: [INFO][pulp.agent.07fda6b3-2962-4de3-b436-af64ca8a0534] gofer.messaging.adapter.proton.connection:103 - connected: amqps://{satellite server}:5647
Mar 21 14:52:08 {client name} goferd: [INFO][pulp.agent.07fda6b3-2962-4de3-b436-af64ca8a0534] root:507 - connected to {satellite server}:5647
Mar 21 14:55:02 {client name} goferd: [INFO][worker-0] gofer.messaging.adapter.proton.connection:100 - connecting: URL: amqps://{satellite server}:5647|SSL: ca: /etc/rhsm/ca/katello-default-ca.pem|key: None|certificate: /etc/pki/consumer/bundle.pem|host-validation: None
Mar 21 14:55:02 {client name} goferd: [INFO][worker-0] root:487 - connecting to {satellite server}:5647...
Mar 21 14:55:02 {client name} goferd: [INFO][worker-0] gofer.messaging.adapter.proton.connection:103 - connected: amqps://{satellite server}:5647
Mar 21 14:55:02 {client name} goferd: [INFO][worker-0] root:507 - connected to {satellite server}:5647

Also check your subscription-manager config, just to make sure

I am not seeing anything like that after reinstalling the katello ca cert and reinstalling goferd. I will open up a support ticket this afternoon. THanks all for the help.

Close

Welcome! Check out the Getting Started with Red Hat page for quick tours and guides for common tasks.