candlepin not working on Red Hat Satellite and hammer ping returns 404 on candlepin

Solution Verified - Updated -

Environment

  • Red Hat Satellite 6

Issue

  • hammer ping returns 404 error on candlepin
  • candlepin logs having exceptions in sending messages to qpid broker

Resolution

To resolve the problem

Ensure virt-who is not misconfigured and sending subscription status too frequently to Satellite / candlepin. See this solution for more.

Ensure foreman tasks have the ListenOnCandlepinEvents task running (in WebUI, go to Monitor -> Tasks -> apply filter "label = Actions::Candlepin::ListenOnCandlepinEvents" and check if there is exactly one task running).

Only on Satellite 6.1, ensure further gutterball is running properly by inspecting /var/log/gutterball/gutterball.log file.

To alleviate the problem

Optionally, after/in parallel to identifying the root cause, temporarily allow qpid broker to buffer more messages from candlepin to katello (or in Sat6.1, also audit logs that candlepin sends to gutterbal) via the broker.

  • Add to /etc/qpid/qpidd.conf:

    default-queue-limit=0
    
  • Apply the change:

    service qpidd restart
    
  • Monitor qpidd process memory usage, since it can take a lot due to the necessity to keep in memory much more messages for the candlepin->katello (or candlepin->gutterball in Sat6.1) communication. In case it consumes too much memory, revert the change to prevent out-of-memory issues.

  • Monitor message depth of qpid broker queue katello_event_queue (or also $(hostname -f):event on Sat6.1) that will buffer the messages / audits from candlepin to katello (or also to gutterball on Sat6.1):

    qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q katello_event_queue
    qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q $(hostname -f):event
    

    where queue-depth should be decreasing over the time. If it is not, then the root cause has not been resolved yet.

For more KB articles/solutions related to Red Hat Satellite 6.x Candlepin Issues, please refer to the Consolidated Troubleshooting Article for Red Hat Satellite 6.x Candlepin Issues

Root Cause

Candlepin failed to send subscription event to katello (or also an audit log to gutterball in Sat6.1) via qpid broker that has buffered maximal number of messages in its relevant queue. This can be caused by either foreman-tasks not running ListenOnCandlepinEvents task, (or also gutterball not working properly in Sat6.1), or candlepin sending the audits too frequently - this usually means some virt-who instance reports subscriptions status too often.

The root cause of the problem must be fixed to resolve the issue, while increasing maximal queue depth can alleviate the problem in temporarily restoring candlepin operation.

Diagnostic Steps

  • hammer ping returns:

    candlepin:
        Status:          FAIL
        Server Response: Message: 404 Resource Not Found
    candlepin_auth:
        Status:          FAIL
        Server Response: Message: Katello::Resources::Candlepin::CandlepinPing: 404 Resource Not Found <html>... (skipping some generic 404 html page details) </html> (GET /candlepin/status)
    pulp:
        Status:          ok
        Server Response: Duration: 61ms
    pulp_auth:
        Status:          ok
        Server Response: Duration: 27ms
    elasticsearch:
        Status:          ok
        Server Response: Duration: 35ms
    foreman_tasks:
        Status:          ok
        Server Response: Duration: 1ms
    
  • qpid-stat -q having katello_event_queue (or also $(hostname -f):event queue on Sat6.1) large backlog of messages (in thousands, like 11.5k in below example) and mainly zero consumers (2nd column from the right is zero):

    katello_event_queue                                 Y                   11.5k  11.5k     0    95.4m  95.4m       0         0     2
    

    (optionally, check that qpid-stat -u listing queues subscriptions/consumers does not have a consumer of that queue).

    Both qpid-stat outputs are in collected by foreman-debug (see files qpid_stat_queues and qpid_stat_subscriptions).

  • /var/log/candlepin/candlepin.log having either Exception when sending message:timed out waiting for completion or Error sending event to message bus (backtrace details might differ in various Satellite / candlepin versions):

    2016-02-09 14:31:08,841 [=, org=] ERROR org.candlepin.audit.AMQPBusPublisher - Unable to send event: Event [id=null, target=COMPLIANCE, type=CREATED, time=Mon Dec 07 15:26:57 GMT 2015, entity=8a2e8c9851444d5601517d0d75c35615]
    javax.jms.JMSException: Exception when sending message:timed out waiting for completion
        at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
        at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
        at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
        at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
        at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:60) ~[AMQPBusPublisher.class:na]
        at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:52) [ListenerWrapper.class:na]
        at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1114) [hornetq-core-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
        at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:57) [hornetq-core-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
        at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1249) [hornetq-core-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
        at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:106) [hornetq-core-client-2.3.5.Final-redhat-2.jar:2.3.5.Final-redhat-2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_95]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_95]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_95]
    Caused by: org.apache.qpid.transport.SessionException: timed out waiting for completion
        at org.apache.qpid.transport.Session.invoke(Session.java:747) ~[qpid-common-0.30.redhat-1.jar:0.30.redhat-1]
        at org.apache.qpid.transport.Session.invoke(Session.java:627) ~[qpid-common-0.30.redhat-1.jar:0.30.redhat-1]
        at org.apache.qpid.transport.SessionInvoker.messageTransfer(SessionInvoker.java:93) ~[qpid-common-0.30.redhat-1.jar:0.30.redhat-1]
        at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:246) ~[qpid-client-0.30.redhat-1.jar:0.30.redhat-1]
        ... 12 common frames omitted
    

    or:

    Jan 13 13:54:34 satellite.example.com server[11208]: ERROR [client] - HQ214000: Failed to call onMessage
    Jan 13 13:54:34 satellite.example.com server[11208]: java.lang.RuntimeException: Error sending event to message bus
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:69)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.candlepin.audit.ListenerWrapper.onMessage(ListenerWrapper.java:52)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.hornetq.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1117)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.hornetq.core.client.impl.ClientConsumerImpl.access$500(ClientConsumerImpl.java:57)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.hornetq.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1252)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:107)
    Jan 13 13:54:34 satellite.example.com server[11208]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    Jan 13 13:54:34 satellite.example.com server[11208]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    Jan 13 13:54:34 satellite.example.com server[11208]: at java.lang.Thread.run(Thread.java:745)
    Jan 13 13:54:34 satellite.example.com server[11208]: Caused by: javax.jms.JMSException: Exception when sending message:timed out waiting for sync: comple
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:260)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.BasicMessageProducer.sendImpl(BasicMessageProducer.java:528)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.BasicMessageProducer.send(BasicMessageProducer.java:326)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.TopicPublisherAdapter.send(TopicPublisherAdapter.java:120)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.candlepin.audit.AMQPBusPublisher.onEvent(AMQPBusPublisher.java:60)
    Jan 13 13:54:34 satellite.example.com server[11208]: ... 8 more
    Jan 13 13:54:34 satellite.example.com server[11208]: Caused by: org.apache.qpid.transport.SessionException: timed out waiting for sync: complete = 151770
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.transport.Session.sync(Session.java:867)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.transport.Session.sync(Session.java:837)
    Jan 13 13:54:34 satellite.example.com server[11208]: at org.apache.qpid.client.BasicMessageProducer_0_10.sendMessage(BasicMessageProducer_0_10.java:253)
    Jan 13 13:54:34 satellite.example.com server[11208]: ... 12 more
    

Only on Satellite 6.1 with gutterball triggering the problem:

  • /var/lib/candlepin/hornetq/largemsgs directory having nontrivial content (backlog of candlepin audits pending to be sent via qpidd to gutterball):

    # du -hs /var/lib/candlepin/hornetq/journal /var/lib/candlepin/hornetq/largemsgs
    31M     /var/lib/candlepin/hornetq/journal
    215M    /var/lib/candlepin/hornetq/largemsgs
    #
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments