Intermittent SSSD issues after moving from RHEL 6.5 to RHEL 6.6
I have a feeling this will likely turn into a bugzilla, but I was hoping someone may be able to provide some additional suggestions for troubleshooting.
Basically, one site has been using AD authentication through SSSD on RHEL 6.5 for about 12 months without issue (and those hosts are still working 100%). After some RHEL 6.6 hosts were added to the environment, it became apparent over time that there were intermittent but regularly occurring authentication issues. After collating logs, the issue was narrowed down to the RHEL 6.6 hosts.
The server configuration is as follows:
Windows Server 2012 (not R2)
IMU configured and providing UID, GID, home, shell, etc.
Both 'builds' of RHEL hosts use the same configuration
RHEL 6.5 (working)
sssd-1.9.2-129.el6_5.4.x86_64
RHEL 6.6 (intermittent authentication issues)
sssd-1.11.6-30.el6_6.3.x86_64
The issue manifests itself as intermittent messages of "Authentication service cannot retrieve authentication info". The most common scenario is that a user logs in to a server over SSH using the SSSD backend and authenticates OK, then attempts to sudo (using the same account, less than 10 seconds later) and SSSD returns the error. The same error is returned for a period of time (~1-2 minutes), after which authentication may start working again. The error isn't limited to sudo: if a user attempts to log in during the 'broken' period, that attempt is also denied with the same error.
The full error looks like:
Apr 10 16:14:32 <servername> sudo: pam_sss(sudo:auth): received for user <username>: 9 (Authentication service cannot retrieve authentication info)
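To see how the failures cluster in time, the log can be bucketed per minute. This is a minimal sketch, assuming the messages land in /var/log/secure in the format shown above; the function name count_auth_failures is made up for illustration.

```shell
# Count pam_sss "error 9" lines per minute in a syslog-style file.
# $1 is the log file; /var/log/secure is only an assumed default.
count_auth_failures() {
    logfile="${1:-/var/log/secure}"
    # Keep only the pam_sss lines that carry PAM error code 9.
    grep 'pam_sss(.*): received for user .*: 9 (' "$logfile" |
        # Reduce each line to "Mon DD HH:MM" (drop the seconds).
        awk '{ print $1, $2, substr($3, 1, 5) }' |
        # Group identical minutes and count them.
        sort | uniq -c
}
```

Running count_auth_failures /var/log/secure on a 6.5 host and a 6.6 host should show whether the failures come in the 1-2 minute bursts described above. The output is sorted as text, not chronologically.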
After some troubleshooting (and multiple packet captures), it appears that when this error occurs there is no traffic going to the AD servers, which leads me to suspect that the SSSD cache may be involved. In the above scenario, when the user logs in the user is authenticated and the Kerberos traffic is seen. For the multiple sudo attempts after that (that fail) no traffic is seen across the network. During this period of 'broken' authentication, 'id', 'groups' and 'getent' etc. can all return the user attributes and all AD provided groups appear as expected.
Some additional testing involved manually clearing the cache (with rm, not sss_cache), which appears to work. On instances where I have cleared the cache I haven't seen the error re-occur, but I suspect it will when the servers are used in anger again.
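For comparison, the two ways of resetting the cache can be sketched as below. This assumes the default cache directory /var/lib/sss/db and the RHEL 6 "service" command; the function names and the RUN variable are made up for illustration. By default the functions only print the commands they would run.

```shell
# RUN defaults to "echo" (dry run). Set RUN= (empty) and run as root
# to execute the commands for real.
RUN="${RUN-echo}"

# Soft reset: mark all cached entries as expired; SSSD keeps running.
invalidate_sssd_cache() {
    $RUN sss_cache -E
}

# Hard reset: stop SSSD, delete the on-disk cache files, start it again.
# Assumes the default cache location /var/lib/sss/db.
wipe_sssd_cache() {
    $RUN service sssd stop &&
    $RUN rm -f /var/lib/sss/db/* &&
    $RUN service sssd start
}
```

Note that the hard reset also discards any credentials stored via cache_credentials, so offline logins will not work until each user has authenticated online again.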
Last point worth noting:
I initially thought this might be a nesting / group issue, as some of the users are members of 100+ POSIX groups and we saw "KRB_ERROR - KRB_ERR_RESPONSE_TOO_BIG (52)" being returned from the AD server, coinciding with login failures. This appears to be a UDP size limit. Switching to TCP only with "udp_preference_limit = 1" made the KRB error go away, but the authentication error on the Linux hosts remained the same.
I will be moving some of the 6.6 hosts to the next SSSD version (latest) soon, hoping it may include a fix, but I was unable to find anything relevant in the changelogs. I am also considering downgrading the SSSD installation to 1.9.2 on a RHEL 6.6 host to see if this resolves the issue.
Has anyone else had any issues moving from RHEL 6.5 -> RHEL 6.6 in regards to SSSD?
Does anyone know if any obvious defaults were changed between the two versions? The summary in the release notes doesn't mention specifics:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.6_Release_Notes/authentication.html
Has anyone come across any bugs that sound even remotely similar to what I have described?
Thanks in advance!
Responses
Just a thought - I've been looking into similar 6.6 strangeness recently. We're moving from 6.4 rather than 6.5, but this may apply.
The sssd AD backend has acquired the capability to do dynamic DNS updates - see man sssd-ad and look for dyndns. The feature is on by default but seems to get confused in some environments. I'd suggest you check your DNS is fully consistent for the misbehaving host, and in particular there isn't a PTR record which points to an unqualified version of your hostname. A messed up DNS could possibly be causing your GSSAPI errors, but I'm not enough of a Kerberos expert to say for certain.
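A rough way to check the forward and reverse mappings on a suspect host is sketched below. It uses getent, which follows the "hosts" line in /etc/nsswitch.conf, so an entry in /etc/hosts can mask what DNS really returns; a DNS-only tool such as dig or host would be needed to rule that out. The function names are made up for illustration.

```shell
# Print the name(s) that a reverse lookup returns for an address.
reverse_names() {
    getent hosts "$1" | awk '{ for (i = 2; i <= NF; i++) print $i }'
}

# Succeed only if the given name is qualified (contains a dot).
is_qualified() {
    case "$1" in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Check this host: every name its address maps back to should be qualified.
check_host_dns() {
    fqdn=$(hostname -f) || return 1
    addr=$(getent hosts "$fqdn" | awk 'NR == 1 { print $1 }')
    [ -n "$addr" ] || { echo "no address for $fqdn"; return 1; }
    for name in $(reverse_names "$addr"); do
        is_qualified "$name" || { echo "unqualified name for $addr: $name"; return 1; }
    done
    echo "ok: $fqdn <-> $addr"
}
```

Calling check_host_dns on a misbehaving host reports the first unqualified name it finds, which is the kind of PTR record described above.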
Hi, if you have those instructions on yum downgrading the entire sssd suite to a specific version, I would very much appreciate it. Thanks in advance. I have similar problems: the sudo command sometimes fails to resolve the groups, even straight after logon. If I issue an 'id' command it will work for a few minutes, then stop again. My issue on RH 6.5 is that I have it working on sssd 1.12.4-47.el6_7.8, but not 1.13.3-22.el6_8.4.
I want to downgrade my servers to my working set, 1.12.4-47, but it's not straightforward in yum.
Neil
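On the downgrade question: yum downgrade accepts name-version-release arguments, and because the sssd packages depend on each other at the same version, they generally all have to be listed in one transaction. The sketch below only builds and prints such a command. The package-name pattern is an assumption about what ships from the sssd source package, so check it against "rpm -qa" on your own host, and the function name is made up for illustration.

```shell
# Build a "yum downgrade" command that pins SSSD-related packages to one
# version-release. Package names are read from stdin, one per line.
# The command is printed, not executed.
sssd_downgrade_cmd() {
    target="$1"
    # Assumed pattern for packages built from the sssd source package.
    grep -E '^(sssd|libsss_|libipa_hbac|python-sss)' |
        sort |
        awk -v vr="$target" '
            BEGIN { printf "yum downgrade" }
            { printf " %s-%s", $0, vr }
            END { print "" }'
}
```

For example, rpm -qa --qf '%{NAME}\n' | sssd_downgrade_cmd 1.12.4-47.el6_7.8 prints a single command line to review before running it. The downgrade can only succeed if the older packages are still available in an enabled repository (yum --showduplicates list sssd shows what is on offer), and it will fail if the installed set includes a sub-package that did not exist in the target version.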
The thing that caught me was that the DNS dynamic updates were enabled by default, and our DNS configuration was not as tight as it should have been. To cut a long (embarrassing) story short, the DNS updates were producing A records in the root zone and PTR records to effectively unqualified names. That was sufficient to break Kerberos for us. I've now disabled updates in the root (they shouldn't really have been allowed in the first place) and set ad_hostname and all seems well. For now. After the other stories I've read about 6.6, I think I'll be letting the test boxes bed in for a fair bit before we roll this one out.
Here's the fella. I've removed the LDAP search bases. You'll need to make the expected substitutions for %{domain} and %{fqdn}.
The file was hand-crafted for 6.4. Since then, I've only added ad_hostname for 6.6. We originally set the file up with the LDAP backend rather than AD, as I also needed to support some ancient Solaris machines on the same network. This had its own interesting and varied edge cases, so I moved to the AD provider for Linux.
UNIX UIDs and GIDs are stored as LDAP attributes. We're using AD for users, groups, services, netgroups and the automounter only at the moment. /etc/nsswitch.conf entries for all of these are 'files sss'. /etc/pam.d/system-auth-ac looks the same for both 6.4 and 6.6.
Rather annoyingly (for the purposes of debugging), the DCs have just been upgraded to 2012R2, so I'm not convinced we're necessarily exercising all the code in the AD back end anyway!
Another possible scenario which could generate your symptoms occurs to me, but I think it's pretty unlikely. We had something similar to this a while ago. Originally one of our three DCs did not support all the encryption types supported by the others - it was missing AES256. The brokenness happened when the initial Kerberos authentication exchange grabbed an AES256 TGT from a working DC. That TGT wouldn't work when presented to the TGS on the faulty DC, but it was fine with the others.
The only reason why this could be causing your problem would be if your 6.6 process for adding the machine to the domain is for some reason using a subset of the encryption types your 6.4 process uses. Might be worth trying 'klist -k -e /etc/krb5.keytab' on both a 6.4 and a 6.6 machine and making sure the expected encryption types are listed, and all the key version numbers are consistent.
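To make that comparison easier, the klist output can be reduced to its key version numbers and encryption types and then diffed between hosts. This is a minimal sketch, assuming the usual layout of "klist -k -e" where each entry line starts with the KVNO and ends with the encryption type in parentheses; the function name is made up for illustration.

```shell
# Reduce "klist -k -e" output (read from stdin) to sorted, unique
# "KVNO enctype" pairs. Header lines do not match and are dropped.
keytab_enctypes() {
    sed -n 's/^ *\([0-9][0-9]*\) .*(\([^()]*\)) *$/\1 \2/p' | sort -u
}
```

Running klist -k -e /etc/krb5.keytab | keytab_enctypes as root on each machine and diffing the two results should make a missing encryption type or a mismatched key version number stand out.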
[sssd]
services = nss, pam, autofs
config_file_version = 2
domains = %{domain}
[nss]
filter_groups = root
filter_users = root
[pam]
[sudo]
[autofs]
[ssh]
[pac]
# Name used here determines Active Directory domain
[domain/%{domain}]
debug_level = 0xf0
# Use 'ad' rather than 'ldap/krb5'. See OR28801
id_provider = ad
auth_provider = ad
# The AD access provider checks for account expiration (default for sssd
# is 'permit')
access_provider = ad
# We have Posix attributes stored in AD already
ldap_id_mapping = false
# (RHEL 6.6) for dynamic DNS updates to work, we need this...
ad_hostname = %{fqdn}
# (RHEL 6.4) autofs has to be provided by LDAP...
autofs_provider = ldap
# ...for which we need sensible settings
ldap_sasl_mech = GSSAPI
# Kerberos configuration:-
# - Enable automatic renewals. The specified time (#seconds) is the
#   interval at which we check to see if the ticket is 50% gone.
# - Store offline passwords. This copes with the DCs being unavailable
#   at logon.
# - Use the standard template name for the Kerberos credential cache so that
#   standard Kerberos utilities can see it too.
# - Allow a TGT to be renewed for a working week.
krb5_renew_interval = 10800
krb5_store_password_if_offline = true
#krb5_renewable_lifetime = 7d
# In our setup, referrals are not needed
ldap_referrals = false
# Generic options. Cache credentials. Don't enumerate passwd and group
cache_credentials = true
enumerate = false
# -----------------------------------------------------------------
# S C H E M A
#
# We use the Active Directory default of rfc2307bis, which is pretty
# much mandated by "id_provider = ad". See OR28837
# -----------------------------------------------------------------
#----------------------------
ldap_user_search_base = ...
#----------------------------
ldap_group_search_base = ...
ldap_group_name = mail
#----------------------------
ldap_service_search_base = ...
#----------------------------
ldap_netgroup_search_base = ...
#----------------------------
# We store the automounter data in NIS maps, as our current AD servers do
# not support the automount classes of RFC2307(bis)
ldap_autofs_search_base = ...
ldap_autofs_map_object_class = nisMap
ldap_autofs_map_name = nisMapName
ldap_autofs_entry_object_class = nisObject
ldap_autofs_entry_key = cn
ldap_autofs_entry_value = nisMapEntry
Hey Team,
If you would like any assistance from Red Hat regarding this issue, please feel free to open a support ticket.
I'm sure we would be able to assist you.
Thanks
Try setting:
ad_server = dc1.domain.com, dc2.domain.com, dc3...
under
[domain/...]
Alternatively set
dns_discovery_domain
if your env allows it.
Both solutions above worked for me with
id_provider = ad
There's also the SSSD-Troubleshooting page with a lot of useful and, perhaps, pertinent information, as well as a number of tickets specific to RHEL 6.6 issues.
Yes, I was having the exact problem you described above...
"Apr 10 16:14:32
... in the secure log on at least two dozen servers upgraded from 1.9.x to the latest 1.11 after updating to RHEL 6.6.
This behaviour happened mostly but not always after a reboot, and intermittently after.
I looked really hard and found nothing in the logs that helped. Turned on debug to 8 in sssd.conf and still nothing stood out to me.
Like you I had my DCs defined in the krb5.conf.
Nothing really led me directly to this solution; I just started thinking it was perhaps related to DNS, so I specified the IPs of my DCs (which worked) and the rest followed from that.
