Frequent osad client disconnects and replaced jabberd sessions in /var/log/messages on Red Hat Satellite

Environment

  • Red Hat Satellite 5.6/5.7/5.8
  • Red Hat Enterprise Linux
  • OSAD

Issue

  • Several Red Hat Enterprise Linux virtual systems send XMLRPC requests to the Satellite every few minutes
  • High CPU usage on the Satellite server while the jabberd process is running
  • PostgreSQL reports deadlock errors on rhnPushClient:
            2015-03-08 11:41:34.747 EDT ERROR:  deadlock detected
            2015-03-08 11:41:34.747 EDT DETAIL:  Process 3327 waits for ShareLock on transaction 70793416; blocked by process 3521.
                    Process 3521 waits for ShareLock on transaction 70793412; blocked by process 3327.
                    Process 3327:
                        update rhnPushClient
                        set jabber_id = null
                        where jabber_id = E'osad-865bb87b50@satellite.example.com/osad' and
                            server_id <> 1000060957

                    Process 3521:
                        update rhnPushClient
                        set jabber_id = E'osad-865bb87b50@satellite.example.com/osad',
                            next_action_time = NULL,
                            last_ping_time = NULL
                        where server_id = 1000050534

            2015-03-08 11:41:34.747 EDT HINT:  See server log for query details.
            2015-03-08 11:41:34.747 EDT STATEMENT:  
                        update rhnPushClient
                        set jabber_id = null
                        where jabber_id = E'osad-865bb87b50@satellite.example.com/osad' and
                            server_id <> 1000060957
  • Disconnect and session replacement messages in /var/log/messages on the Satellite server:
Aug  8 14:50:05 satellite jabberd/c2s[3706]: [81] [::ffff:10.20.30.39, port=57797] connect
Aug  8 14:50:05 satellite jabberd/c2s[3706]: [81] legacy authentication succeeded: host=, username=osad-ed1c1bc548, resource=osad, TLS negotiated
Aug  8 14:50:05 satellite jabberd/c2s[3706]: [81] requesting session: jid=osad-ed1c1bc548@satellite.example.com/osad
Aug  8 14:50:05 satellite jabberd/c2s[3706]: [131] [::ffff:10.20.30.38, port=38157] disconnect jid=osad-ed1c1bc548@satellite.example.com/osad, packets: 6
Aug  8 14:50:05 satellite jabberd/sm[3698]: session replaced: jid=osad-ed1c1bc548@satellite.example.com/osad

Resolution

To reset the osad-auth.conf file for an individual system:

  • Ensure each client has a unique username in /etc/sysconfig/rhn/osad-auth.conf:
[osad-auth]
username = <unique-username>
  • Deleting /etc/sysconfig/rhn/osad-auth.conf and restarting osad on the client system will force it to regenerate a unique identifier:
# rm -f /etc/sysconfig/rhn/osad-auth.conf
# service osad restart
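To confirm that the client generated a new identifier after the restart, check the username line (a quick check, based on the file format shown above):

# grep '^username' /etc/sysconfig/rhn/osad-auth.conf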

To reset the osad-auth.conf file across all clients:

  • First, stop the osad daemons on the client systems through the Satellite WebUI by selecting all the systems in SSM and scheduling command execution for them:

a) To select all the systems, do the following:

Go to Systems -> Select items per page (500) -> Click on "Select All" -> Add Selected to SSM

  • Repeat the above step for every page (the WebUI cannot select all systems at once; you must display the maximum allowed systems per page, then click "Select All" and "Add Selected to SSM" on each page)

b) After the systems are in SSM, schedule execution of the needed commands:

Go to Systems -> System Set Manager -> Run remote commands, and schedule the two commands below:

# service osad stop
# rm -f /etc/sysconfig/rhn/osad-auth.conf
  • Now wait more than 4 hours for rhnsd on all the systems to execute the scheduled commands (by default, rhnsd checks in with the Satellite every 240 minutes)
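If you can reach some clients directly (for example over ssh), you can skip part of the wait on those clients by forcing an immediate check-in instead of waiting for the next rhnsd interval:

# rhn_check -vv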

  • After the 4 hours have passed, run the following on the Satellite command line:

 # service osa-dispatcher stop
 # service jabberd stop
 # spacewalk-sql -i 
 rhnschema=# delete from rhnPushClient;
 rhnschema=# commit;

The command below will take some time:

 rhnschema=# vacuum;
 rhnschema=# \q
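To verify the cleanup afterwards, you can reconnect and count the remaining rows; the result should be 0:

 # spacewalk-sql -i
 rhnschema=# select count(*) from rhnPushClient;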
  • Start the osad daemons on the clients by scheduling the command below through SSM again, using the same steps as in b) above (the systems remain in SSM, so you do not need to select them again as in a)):
# service osad start
  • Start the needed services on Satellite again:
 # service jabberd start
 # service osa-dispatcher start
  • It will take up to another 4 hours for all clients to pick up the command.

Even after following all the above steps, there may still be systems with duplicate osad-auth.conf files because they do not have script execution enabled through rhnsd/rhn_check (using commands from the rhncfg-actions RPM). On these systems the duplicate osad-auth.conf has to be deleted manually. To get the list of such systems, generate a fresh spacewalk-debug at least 2 hours after the 4-hour interval from the last step has elapsed.
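To build that list yourself from the Satellite's logs, a sketch along the following lines (assuming the default /var/log/messages location; adapt the path if you are reading the copy inside a spacewalk-debug tarball) prints each osad username that has been seen disconnecting from more than one client IP:

grep 'disconnect jid=osad-' /var/log/messages \
  | sed -re 's/.*\[([^,]*), port=[0-9]+\] disconnect jid=(osad-[^@]+)@.*/\2 \1/' \
  | sort -u \
  | awk '{print $1}' | sort | uniq -c \
  | awk '$1 > 1 { print $2 }'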

Root Cause

  • Multiple osad clients were connecting to jabberd using the same username. Each time a different client connected with the same osad username as an existing client, jabberd would close the existing client's connection and allow the new client's connection.
  • This behaviour is consistent with the XMPP specification, which jabberd implements. Per the XMPP RFC:
If there is already an active resource of the same name, the server MUST either
 (1) terminate the active resource and allow the newly-requested session, or
 (2) disallow the newly-requested session and maintain the active resource.
 Which of these the server does is up to the implementation, although it is
 RECOMMENDED to implement case #1.
  • osad reconnects to jabberd whenever it loses its connection, so affected clients reconnect over and over, resulting in a continuous stream of XMLRPC requests to the Satellite.
  • A request to improve this behavior has been filed in Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1205846

Diagnostic Steps

  • Errors in /var/log/osad on the clients:
2013-04-16 18:41:43 jabber_lib._orig_dispatch: <error>
<conflict xmlns = 'urn:ietf:params:xml:ns:xmpp-streams'  /></error>
The XMPP RFC explains the <conflict/> error:

If there is already an active resource of the same name, the server MUST either
 (1) terminate the active resource and allow the newly-requested session, or
 (2) disallow the newly-requested session and maintain the active resource.
 Which of these the server does is up to the implementation, although it is
 RECOMMENDED to implement case #1. In case #1, the server SHOULD send a
 <conflict/> stream error to the active resource, terminate the XML stream and
 underlying TCP connection for the active resource, and return a IQ stanza of
 type "result" (indicating success) to the newly-requested session. In case #2,
 the server SHOULD send a <conflict/> stanza error to the newly-requested
 session but maintain the XML stream for that connection so that the
 newly-requested session has an opportunity to negotiate a non-conflicting resource
 identifier before sending another request for session establishment.
[...]
Step 2 (alt): Server informs existing active resource of resource conflict (case #1):

<stream:error>
  <conflict xmlns='urn:ietf:params:xml:ns:xmpp-streams'/>
</stream:error>
  • Confirm that the Satellite's /var/log/messages shows this behavior:
  • First, 192.168.29.1 connects using username osad-fc2a194137:
Apr 15 12:35:07 satellite jabberd/c2s[31237]: [7] [::ffff:192.168.29.1, port=40802] connect
Apr 15 12:35:07 satellite jabberd/c2s[31237]: [7] legacy authentication succeeded: host=, username=osad-fc2a194137, resource=osad, TLS negotiated
Apr 15 12:35:07 satellite jabberd/c2s[31237]: [7] requesting session: jid=osad-fc2a194137@satellite.example.com/osad
Apr 15 12:35:08 satellite jabberd/sm[31230]: session started: jid=osad-fc2a194137@satellite.example.com/osad
  • Then 192.168.29.2 connects using osad-fc2a194137, causing 192.168.29.1 to be disconnected:
Apr 15 12:36:30 satellite jabberd/c2s[31237]: [10] [::ffff:192.168.29.2, port=51381] connect
Apr 15 12:36:30 satellite jabberd/c2s[31237]: [10] legacy authentication succeeded: host=, username=osad-fc2a194137, resource=osad, TLS negotiated
Apr 15 12:36:30 satellite jabberd/c2s[31237]: [10] requesting session: jid=osad-fc2a194137@satellite.example.com/osad
Apr 15 12:36:30 satellite jabberd/c2s[31237]: [7] [::ffff:192.168.29.1, port=40802] disconnect jid=osad-fc2a194137@satellite.example.com/osad, packets: 6
Apr 15 12:36:30 satellite jabberd/sm[31230]: session replaced: jid=osad-fc2a194137@satellite.example.com/osad
  • Then 192.168.29.3 connects using osad-fc2a194137, causing 192.168.29.2 to be disconnected:
Apr 15 12:36:45 satellite jabberd/c2s[31237]: [7] [::ffff:192.168.29.3, port=57149] connect
Apr 15 12:36:45 satellite jabberd/c2s[31237]: [7] legacy authentication succeeded: host=, username=osad-fc2a194137, resource=osad, TLS negotiated
Apr 15 12:36:45 satellite jabberd/c2s[31237]: [7] requesting session: jid=osad-fc2a194137@satellite.example.com/osad
Apr 15 12:36:45 satellite jabberd/c2s[31237]: [10] [::ffff:192.168.29.2, port=51381] disconnect jid=osad-fc2a194137@satellite.example.com/osad, packets: 6
Apr 15 12:36:45 satellite jabberd/sm[31230]: session replaced: jid=osad-fc2a194137@satellite.example.com/osad
  • Then 192.168.29.1 connects again using osad-fc2a194137, causing 192.168.29.3 to be disconnected:
Apr 15 12:38:03 satellite jabberd/c2s[31237]: [10] [::ffff:192.168.29.1, port=40804] connect
Apr 15 12:38:03 satellite jabberd/c2s[31237]: [10] legacy authentication succeeded: host=, username=osad-fc2a194137, resource=osad, TLS negotiated
Apr 15 12:38:03 satellite jabberd/c2s[31237]: [10] requesting session: jid=osad-fc2a194137@satellite.example.com/osad
Apr 15 12:38:03 satellite jabberd/c2s[31237]: [7] [::ffff:192.168.29.3, port=57149] disconnect jid=osad-fc2a194137@satellite.example.com/osad, packets: 6
Apr 15 12:38:03 satellite jabberd/sm[31230]: session replaced: jid=osad-fc2a194137@satellite.example.com/osad
  • Multiple IP addresses / osad clients using the same osad username:
$ grep 'disconnect jid=osad-fc2a194137' /var/log/messages | cut -d' ' -f7,9-10 | sort | uniq
[::ffff:192.168.28.10, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.11, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.12, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.13, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.14, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.15, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.16, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.17, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.18, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.19, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.20, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[::ffff:192.168.28.21, disconnect jid=osad-fc2a194137@satellite.example.com/osad,
[...]

$ grep 'disconnect jid=osad-fc2a194137' /var/log/messages | cut -d' ' -f7,9-10 | sort | uniq | wc -l
51
  • In this case, there are 51 different clients using the same osad username.

  • Searching for error alerts in the database log:

$ grep ERR postgresql-*.log | cut -d' ' -f5-|sort|uniq -c
      1  current transaction is aborted, commands ignored until end of transaction block
  90411  deadlock detected
      2  duplicate key value violates unique constraint "rhn_serverpath_sid_pos_uq"
      1  duplicate key value violates unique constraint "rhn_sp_snep_uq"

$ grep DETA postgresql-*.log|cut -d' ' -f4-|sort -k4|sed -e 's/[0-9]//g'|uniq -c
  45370 DETAIL:  Process  waits for ExclusiveLock on tuple (,) of relation  of database ; blocked by process .
  45041 DETAIL:  Process  waits for ShareLock on transaction ; blocked by process .
  • Display the top connections per osad ID:
$ grep osad  /var/log/messages |awk '{print $10}'|sort|uniq -c|sort -nr
  52379 
  23809 host=,
  14594 jid=osad-865bb87b50@satellite.example.com/osad,
   1829 jid=osad-45cb51a31c@satellite.example.com/osad,
    856 jid=osad-d219696bf4@satellite.example.com/osad,
    817 jid=osad-9e138596ca@satellite.example.com/osad,
    791 jid=osad-6971511ebd@satellite.example.com/osad,
    709 jid=osad-14feed8f17@satellite.example.com/osad,
    638 jid=osad-d7bdae500c@satellite.example.com/osad,
    415 jid=osad-603a8799a3@satellite.example.com/osad,
    348 jid=osad-94247d45fb@satellite.example.com/osad,
    339 jid=osad-888ed12267@satellite.example.com/osad,
    182 jid=osad-c202d4c0f4@satellite.example.com/osad,
    159 jid=osad-7477a43294@satellite.example.com/osad,
    134 jid=osad-84b3444e74@satellite.example.com/osad,
    128 jid=osad-6bf7105d67@satellite.example.com/osad,
    125 jid=osad-6551f80633@satellite.example.com/osad,
     93 jid=osad-febe6076d9@satellite.example.com/osad,
     85 jid=osad-9b9e3a4820@satellite.example.com/osad,
     75 jid=osad-a172f36d12@satellite.example.com/osad,
     75 jid=osad-8de8c36ae7@satellite.example.com/osad,
     75 jid=osad-58766e10bb@satellite.example.com/osad,
     72 jid=osad-abe660ab83@satellite.example.com/osad,
     68 jid=osad-754eb8ff0e@satellite.example.com/osad,
     63 jid=osad-61ac4f63b9@satellite.example.com/osad,
     58 jid=osad-7afcce9507@satellite.example.com/osad,
     57 jid=osad-f52c135c27@satellite.example.com/osad,
     53 jid=osad-ad3d0b4982@satellite.example.com/osad,
     51 jid=osad-d217e8f83f@satellite.example.com/osad,
     51 jid=osad-ab3c2fb690@satellite.example.com/osad,
     51 jid=osad-7a6d916cde@satellite.example.com/osad,
     51 jid=osad-69555a9a90@satellite.example.com/osad,
     50 jid=osad-3685e6b159@satellite.example.com/osad,
     48 jid=osad-7ed6116c8e@satellite.example.com/osad,
     47 jid=osad-80bcc9cce5@satellite.example.com/osad,
     45 jid=osad-2f8458fc62@satellite.example.com/osad,
     45 jid=osad-12b6904185@satellite.example.com/osad,
     30 jid=osad-77d3a272d9@satellite.example.com/osad,
     28 jid=osad-d54cc7e42e@satellite.example.com/osad,
     27 jid=osad-0ad7fe49ea@satellite.example.com/osad,
     26 jid=osad-b9b09958fc@satellite.example.com/osad,
     26 jid=osad-af484315c6@satellite.example.com/osad,
     26 jid=osad-67916f1411@satellite.example.com/osad,
     26 jid=osad-347dd00696@satellite.example.com/osad,
     24 jid=osad-c5e26473a1@satellite.example.com/osad,
     24 jid=osad-185cab905d@satellite.example.com/osad,
     23 jid=osad-52249ab6f6@satellite.example.com/osad,
     23 jid=osad-2db0271925@satellite.example.com/osad,
     22 jid=osad-9e5c7cfb8d@satellite.example.com/osad,
     21 jid=osad-cb8e23abfd@satellite.example.com/osad,
  • You can use the attached script monitor_deadlocks.sh to monitor the database for deadlock conditions.
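The attachment itself is not reproduced here; a minimal sketch of the same idea, assuming the embedded PostgreSQL logs live under /var/lib/pgsql/data/pg_log (adjust the path for your installation), is:

#!/bin/bash
# Minimal deadlock-monitor sketch (not the attached monitor_deadlocks.sh):
# follow the PostgreSQL logs and print a timestamped line for each deadlock.
LOG_DIR=/var/lib/pgsql/data/pg_log   # assumption: default embedded-DB log path
tail -Fn0 "$LOG_DIR"/postgresql-*.log | while read -r line; do
  case "$line" in
    *"deadlock detected"*) echo "$(date '+%F %T') $line" ;;
  esac
done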

Attachments

  • monitor_deadlocks.sh

1 Comment

Why is this important?

I had previously been using our Satellite (5.6) to push out our patch cycles, but found that the failure rate (jobs not being picked up quickly) was unsatisfactory, and I suspect this issue was at least one large reason. (The other major reason was that our configuration management system was exhibiting a race condition with rpm that corrupts the rpm database, but that's another story.)

What hosts are affected?

Here's a command line I made to show which hosts are exhibiting this behaviour; you may find it useful. Note that there is a bit of logic ($1 > 3) to give a little grace, because clients disconnect for various reasons (such as being restarted during a patch cycle). It only examines the current /var/log/messages file, so it assumes the log hasn't been rotated too recently. I considered using this as part of an Ansible playbook to do all of this, but the SQL provided deletes everything rather than particular clients.

grep 'jabberd.*disconnect jid=osad-' /var/log/messages | sed -re 's/.*\[([^,]*), port=[0-9]*\] disconnect jid=(osad-[^@]+)@.*/\1 \2/' | sort | uniq -c | awk '$1 > 3 { print $2 }' | while read ip ; do getent hosts $ip ; done | awk '{print $2}'

The output is a list of hostnames (it assumes all machines have a valid DNS PTR record). When I looked at the results, it only seemed to report VMs that had been cloned (or cloned from).

Ansible playbook

Make sure you review this. You'll need to change the three places that say "my_satellite_server" on the hosts lines. Beware that this will run across all hosts in your inventory, because the SQL affects all hosts registered to your Satellite; you can't do this piece by piece without changing how the SQL works.

---
# Non-idempotent script to reset OSAD authentication file.
# This will delete /etc/sysconfig/rhn/osad-auth.conf and
# restart OSAD. This is important to do on cloned machines
# so that the satellite can keep track of each client's
# session data reliably.
#
# cf. https://access.redhat.com/solutions/337003
#
# Cameron Kerr <cameron.kerr.nz@gmail.com>
#
# History:
#   2016-08-19   Initial release

- hosts: all:!my_satellite_server
  become: true
  gather_facts: false
  tasks:

   - name: stop osad
     service: name=osad state=stopped

   - name: delete osad auth file
     file: path=/etc/sysconfig/rhn/osad-auth.conf state=absent

- hosts: my_satellite_server
  become: true
  gather_facts: false
  tasks:

   - name: stop osa-dispatcher
     service: name=osa-dispatcher state=stopped

   - name: stop jabberd
     service: name=jabberd state=stopped

   - name: clean up the database
     # The vacuuming will take a while...
     shell: echo 'delete from rhnpushclient; commit; vacuum' | spacewalk-sql -
     register: sql_out

   - name: start jabberd
     service: name=jabberd state=started

   - name: start osa-dispatcher
     service: name=osa-dispatcher state=started

- hosts: all:!my_satellite_server
  become: true
  gather_facts: false
  serial: 10
  tasks:

   - name: start osad
     service: name=osad state=started

To run it, you would use something like the following (the playbook filename and inventory path below are assumed; substitute your own):
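# filename and inventory path are placeholders; substitute your own
ansible-playbook -i /path/to/inventory reset-osad-auth.yml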