[Satellite6] candlepin not working, hammer ping returns 404 on candlepin

Solution Verified - Updated -

Environment

  • Red Hat Satellite 6

Issue

  • various actions related to subscriptions (like content host register, managing manifest or so) are failing
  • hammer ping returns 404 error on candlepin
  • candlepin logs having exceptions in sending messages to qpid broker

Resolution

To resolve the problem

Ensure virt-who is not misconfigured and sending subscription status too frequently to Satellite / candlepin. See this solution for more.

Ensure foreman tasks have the ListenOnCandlepinEvents task running (in WebUI, go to Monitor -> Tasks -> apply filter "label = Actions::Candlepin::ListenOnCandlepinEvents" and check if there is exactly one task running).

Only on Satellite 6.1, ensure further gutterball is running properly by inspecting /var/log/gutterball/gutterball.log file.

To alleviate the problem

Optionally, after/in parallel to identifying the root cause, temporarily allow qpid broker to buffer more messages from candlepin to katello (or in Sat6.1, also audit logs that candlepin sends to gutterbal) via the broker.

  • Add to /etc/qpid/qpidd.conf:
default-queue-limit=0
  • Apply the change:
service qpidd restart
  • Monitor qpidd process memory usage, since it can take a lot due to the necessity to keep in memory much more messages for the candlepin->katello (or candlepin->gutterball in Sat6.1) communication. In case it consumes too much memory, revert the change to prevent out-of-memory issues.

  • Monitor message depth of qpid broker queue katello_event_queue (or also $(hostname -f):event on Sat6.1) that will buffer the messages / audits from candlepin to katello (or also to gutterball on Sat6.1):

qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q katello_event_queue
qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q $(hostname -f):event

where queue-depth should be decreasing over the time. If it is not, then the root cause has not been resolved yet.

Root Cause

Candlepin failed to send subscription event to katello (or also an audit log to gutterball in Sat6.1) via qpid broker that has buffered maximal number of messages in its relevant queue. This can be caused by either foreman-tasks not running ListenOnCandlepinEvents task, (or also gutterball not working properly in Sat6.1), or candlepin sending the audits too frequently - this usually means some virt-who instance reports subscriptions status too often.

The root cause of the problem must be fixed to resolve the issue, while increasing maximal queue depth can alleviate the problem in temporarily restoring candlepin operation.

Diagnostic Steps

  • hammer ping returns:
candlepin:
    Status:          FAIL
    Server Response: Message: 404 Resource Not Found
candlepin_auth:
    Status:          FAIL
    Server Response: Message: Katello::Resources::Candlepin::CandlepinPing: 404 Resource Not Found <html>... (skipping some generic 404 html page details) </html> (GET /candlepin/status)
pulp:
    Status:          ok
    Server Response: Duration: 61ms
pulp_auth:
    Status:          ok
    Server Response: Duration: 27ms
elasticsearch:
    Status:          ok
    Server Response: Duration: 35ms
foreman_tasks:
    Status:          ok
    Server Response: Duration: 1ms

qpid-stat -q having katello_event_queue (or also $(hostname -f):event queue on Sat6.1) large backlog of messages (in thousands, like 11.5k in below example) and mainly zero consumers (2nd column from the right is zero):

  katello_event_queue                                 Y                   11.5k  11.5k     0    95.4m  95.4m       0         0     2

(optionally, check that qpid-stat -u listing queues' subscriptions/consumers does not have a consumer of that queue).

Both qpid-stat outputs are in collected by foreman-debug (see files qpid_stat_queues and qpid_stat_subscriptions).

  • /var/log/candlepin/candlepin.log having either Exception when sending message:timed out waiting for completion or Error sending event to message bus (backtrace details might differ in various Satellite / candlepin versions):
2016-02-09 14:31:08,841 [=, org=] ERROR org.candlepin.audit.AMQPBusPublisher - Unable to send event: Event [id=null, target=COMPLIANCE, type=CREATED, time=Mon Dec 07 15:26:57 GMT 2015, entity=8a2e8c9851444d5601517d0d75c35615]
javax.jms.JMSException: Exception when sending message:timed out waiting for completion
    at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
    at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
    at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
    at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
    at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:60) ~[AMQPBusPublisher.class:na]
    at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:52) [ListenerWrapper.class:na]
    at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1114) [hornetq-core-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
    at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:57) [hornetq-core-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
    at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1249) [hornetq-core-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
    at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:106) [hornetq-core-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_95]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_95]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]
Caused by: org.apache.qpid.transport.SessionException: timed out waiting for completion
    at org.apache.qpid.transport.Session.invoke(Session.java:747) ~[qpid-common-0.30.redhat-1.jar:0.30.redhat-1]
    at org.apache.qpid.transport.Session.invoke(Session.java:627) ~[qpid-common-0.30.redhat-1.jar:0.30.redhat-1]
    at org.apache.qpid.transport.SessionInvoker.messageTransfer(SessionInvoker.java:93) ~[qpid-common-0.30.redhat-1.jar:0.30.redhat-1]
    at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:246) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
    ... 12 common frames omitted

or:

Jan 13 13:54:34 satellite.example.com server[11208]: ERROR [client] - HQ214000: Failed to call onMessage
Jan 13 13:54:34 satellite.example.com server[11208]: java.lang.RuntimeException: Error sending event to message bus
Jan 13 13:54:34 satellite.example.com server[11208]: at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:69)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:52)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1117)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:57)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1252)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:107)
Jan 13 13:54:34 satellite.example.com server[11208]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jan 13 13:54:34 satellite.example.com server[11208]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jan 13 13:54:34 satellite.example.com server[11208]: at java.lang.Thread.run(Thread.java:745)
Jan 13 13:54:34 satellite.example.com server[11208]: Caused by: javax.jms.JMSException: Exception when sending message:timed out waiting for sync: comple
Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:60)
Jan 13 13:54:34 satellite.example.com server[11208]: ... 8 more
Jan 13 13:54:34 satellite.example.com server[11208]: Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 151770
Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.transport.Session.sync(Session.java:867)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.transport.Session.sync(Session.java:837)
Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:253)
Jan 13 13:54:34 satellite.example.com server[11208]: ... 12 more

Only on Satellite 6.1 with gutterball triggering the problem:

  • /var/lib/candlepin/hornetq/largemsgs directory having nontrivial content (backlog of candlepin audits pending to be sent via qpidd to gutterball):
# du -hs /var/lib/candlepin/hornetq/journal /var/lib/candlepin/hornetq/largemsgs
31M     /var/lib/candlepin/hornetq/journal
215M    /var/lib/candlepin/hornetq/largemsgs
#

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

1 Comments

It would be useful if there was some sort of "Health Status" monitor in the Web UI that had these scripts either be executed via click or ran via quartz driven events or cron and reported a status.