Intermittent SSSD issues after moving from RHEL 6.5 to RHEL 6.6


I have a feeling this will likely turn into a bugzilla, but I was hoping someone may be able to provide some additional suggestions for troubleshooting.

One site has been using SSSD authentication against AD from RHEL 6.5 hosts for just on 12 months without issue (and those hosts are still 100% reliable). After some RHEL 6.6 hosts were added to the environment, it became apparent over time that there were intermittent but regularly occurring authentication issues. After collating logs, the issue was narrowed down to the RHEL 6.6 hosts.

The server configuration is as follows:
Windows Server 2012 (not R2)
Identity Management for UNIX (IMU) configured and providing UID, GID, home directory, shell, etc.

Both 'builds' of RHEL hosts use the same configuration:
RHEL 6.5 (working)
sssd-1.9.2-129.el6_5.4.x86_64

RHEL 6.6 (intermittent authentication issues)
sssd-1.11.6-30.el6_6.3.x86_64

The issue manifests itself as intermittent messages of "Authentication service cannot retrieve authentication info". The most common scenario is that a user logs in to a server over SSH using the SSSD backend and authenticates OK, then attempts to sudo with the same account less than 10 seconds later and SSSD returns the error. The same error is received for a period of time (~1-2 minutes), after which authentication may start working again. The error isn't limited to sudo: if a user attempts to log in during the 'broken' period, that attempt is also denied with the same error.

The full error looks like:

Apr 10 16:14:32 <servername> sudo: pam_sss(sudo:auth): received for user <username>: 9 (Authentication service cannot retrieve authentication info)
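
For anyone wanting to collate these messages across hosts, here is a minimal Python sketch. It is not what was used in the original troubleshooting; the regular expression is an assumption inferred from the single sample line above, and the function names are made up for illustration. It pulls out the host, service, user and PAM return code (9 is PAM_AUTHINFO_UNAVAIL in Linux-PAM) and counts failures per host.

```python
import re
from collections import Counter

# Pattern inferred from the one sample line in this post (assumption);
# other syslog formats or pam_sss message variants may need adjusting.
PAM_SSS_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<program>[^:\s]+): "
    r"pam_sss\((?P<service>[^:]+):(?P<phase>[^)]+)\): "
    r"received for user (?P<user>\S+): "
    r"(?P<code>\d+) \((?P<message>.*)\)$"
)


def parse_pam_sss_line(line):
    """Return a dict of fields for a pam_sss 'received for user' line, else None."""
    match = PAM_SSS_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["code"] = int(record["code"])
    return record


def count_failures_by_host(lines, code=9):
    """Count lines carrying the given PAM return code, keyed by hostname."""
    counts = Counter()
    for line in lines:
        record = parse_pam_sss_line(line)
        if record is not None and record["code"] == code:
            counts[record["host"]] += 1
    return counts
```

Feeding it the contents of /var/log/secure from each host (for example `count_failures_by_host(open("/var/log/secure"))`) should make it easy to see whether the failures really are confined to the 6.6 builds.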

After some troubleshooting (and multiple packet captures), it appears that when this error occurs there is no traffic going to the AD servers, which leads me to suspect that the SSSD cache may be involved. In the above scenario, when the user logs in, the user is authenticated and the Kerberos traffic is seen. For the failed sudo attempts that follow, no traffic is seen on the network. During this period of 'broken' authentication, 'id', 'groups', 'getent' etc. all still return the user attributes, and all AD-provided groups appear as expected.

Some additional testing involved manually clearing the cache (with rm, not sss_cache), which appears to work. On hosts where I have cleared the cache I haven't seen the error recur, but I suspect it will once the servers are back under real load.

Last point worth noting:
I initially thought this might be a nesting / group issue, as some of the users are members of 100+ POSIX groups and we saw "KRB_ERROR - KRB_ERR_RESPONSE_TOO_BIG (52)" being returned from the AD server, coinciding with login failures. This appears to be a UDP size limit. After switching to TCP only with "udp_preference_limit = 1", the KRB error went away, but the authentication error on the Linux hosts remained the same.
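
To rule out a host where the TCP-only setting was missed, something like the following Python sketch could check a krb5.conf for it. This is only an illustration under assumptions: the function name is hypothetical, it expects the setting to live in the [libdefaults] section, and it does not follow include or includedir directives.

```python
def get_libdefaults_setting(krb5_conf_text, key):
    """Return the value of `key` from [libdefaults], or None if it is not set.

    Deliberately simple: tracks the current [section] header and looks for
    'name = value' lines inside [libdefaults] only.
    """
    section = None
    for raw_line in krb5_conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "libdefaults" and "=" in line:
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    return None
```

For example, `get_libdefaults_setting(open("/etc/krb5.conf").read(), "udp_preference_limit")` would be expected to return "1" on hosts where the change has been applied, and None where it has not.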

I will be moving some of the 6.6 hosts to the next SSSD version (latest) soon, hoping it may include a fix, but I was unable to find anything relevant in the changelogs. I am also considering downgrading the SSSD installation to 1.9.2 on a RHEL 6.6 host to see if this resolves the issue.

Has anyone else had any SSSD issues moving from RHEL 6.5 -> RHEL 6.6?
Does anyone know if any obvious defaults were changed between the two versions? The summary in the release notes doesn't mention specifics:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.6_Release_Notes/authentication.html
Has anyone come across any bugs that sound even remotely similar to what I have described?

Thanks in advance!

Responses