Intermittent SSSD issues after moving from RHEL 6.5 to RHEL 6.6

I have a feeling this will likely turn into a bugzilla, but I was hoping someone may be able to provide some additional suggestions for troubleshooting.

Basically, one site has been using AD authentication via SSSD on RHEL 6.5 for just on 12 months without issue (and those hosts are still 100% reliable). After some RHEL 6.6 hosts were added to the environment, it became apparent over time that there were intermittent (but regularly occurring) issues with authentication. After collating logs, the issue was narrowed down to the RHEL 6.6 hosts.

The server configuration is as follows:
Windows Server 2012 (not R2)
IMU (Identity Management for UNIX) configured and providing UID, GID, home, shell, etc.

Both 'builds' of RHEL hosts use the same configuration
RHEL 6.5 (working)
sssd-1.9.2-129.el6_5.4.x86_64

RHEL 6.6 (intermittent authentication issues)
sssd-1.11.6-30.el6_6.3.x86_64

The issue manifests itself as intermittent messages of "Authentication service cannot retrieve authentication info". The most common scenario is that a user logs in to a server over SSH using the SSSD backend and authenticates OK; then, when attempting to sudo (using the same account less than 10 seconds later), SSSD returns the error. The same error is returned for a period of time (~1-2 minutes), after which authentication may start working again. The error isn't limited to sudo: if a user attempts to log in during the 'broken' period, that attempt is also denied with the same error.

The full error looks like:

Apr 10 16:14:32 <servername> sudo: pam_sss(sudo:auth): received for user <username>: 9 (Authentication service cannot retrieve authentication info)
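
For anyone collating these errors across several hosts, a small script can tally them per host and user. This is only a minimal Python sketch, not something shipped with SSSD: the names parse_pam_sss_failures and failures_per_user are hypothetical, and the regular expression assumes the syslog layout of the line quoted above. Return code 9 is PAM_AUTHINFO_UNAVAIL in Linux-PAM.

```python
import re
from collections import Counter

# Assumed layout (taken from the log line quoted above):
#   Apr 10 16:14:32 host sudo: pam_sss(sudo:auth): received for user bob: 9 (text)
PAM_SSS_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<prog>[^:\[\s]+)(?:\[\d+\])?:\s+"
    r"pam_sss\((?P<service>[^:]+):(?P<phase>\w+)\):\s+"
    r"received for user (?P<user>\S+): (?P<code>\d+) \((?P<text>.*)\)$"
)


def parse_pam_sss_failures(lines):
    """Yield one dict per pam_sss 'received for user' line."""
    for line in lines:
        match = PAM_SSS_RE.match(line.strip())
        if match:
            yield match.groupdict()


def failures_per_user(lines, code="9"):
    """Count lines with the given PAM return code, keyed by (host, user)."""
    return Counter(
        (rec["host"], rec["user"])
        for rec in parse_pam_sss_failures(lines)
        if rec["code"] == code
    )
```

Feeding it the lines of /var/log/secure from each host gives a per-host, per-user count, which makes it easier to see whether the failures are confined to the 6.6 hosts or to particular users.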

After some troubleshooting (and multiple packet captures), it appears that when this error occurs there is no traffic going to the AD servers, which leads me to suspect that the SSSD cache may be involved. In the above scenario, when the user logs in the user is authenticated and the Kerberos traffic is seen. For the multiple sudo attempts after that (that fail) no traffic is seen across the network. During this period of 'broken' authentication, 'id', 'groups' and 'getent' etc. can all return the user attributes and all AD provided groups appear as expected.

Some additional testing involved manually clearing the cache (with rm, not sss_cache), which appears to work. On instances where I have cleared the cache I haven't seen the error re-occur, but I suspect it will when the servers are used in anger again.

Last point worth noting:
I initially thought this might be a nesting / group issue, as some of the users are members of 100+ POSIX groups and we witnessed "KRB_ERROR - KRB_ERR_RESPONSE_TOO_BIG (52)" being returned from the AD server, coinciding with login failures. This appears to be a UDP limit; after switching to TCP only using "udp_preference_limit = 1" the KRB error went away, but the authentication error on the Linux hosts remained the same.

I will be moving some of the 6.6 hosts to the next SSSD version (latest) soon, hoping it may include a fix, but I was unable to find anything relevant in the changelogs. I am also considering downgrading the SSSD installation to 1.9.2 on a RHEL 6.6 host to see if this resolves the issue.

Has anyone else had any issues with SSSD when moving from RHEL 6.5 -> RHEL 6.6?
Does anyone know if any obvious defaults were changed between the two versions? The summary in the release notes doesn't mention specifics:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/6.6_Release_Notes/authentication.html
Has anyone come across any bugs that sound even remotely similar to what I have described?

Thanks in advance!

Responses

I have narrowed this down a little more. There is a log message that coincides with the failed login attempts:

sssd_be: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Server not found in Kerberos database)

Strangely though, Kerberos works immediately before the failed attempt (for the user login), but then generates this error when attempting to sudo seconds later.

I put sssd into debug and found the following messages coincided with the sudo failures:

(Mon Apr 13 17:22:44 2015) [sssd[nss]] [nss_cmd_getbynam] (0x0400): Running command [17] with input [shadowman].
(Mon Apr 13 17:22:44 2015) [sssd[nss]] [check_cache] (0x0400): Cached entry is valid, returning..
(Mon Apr 13 17:22:44 2015) [sssd[nss]] [nss_cmd_getpwnam_search] (0x0400): Returning info for user [shadowman@domain.local]
(Mon Apr 13 17:22:44 2015) [sssd[nss]] [nss_cmd_getbynam] (0x0400): Running command [17] with input [shadowman].
(Mon Apr 13 17:22:44 2015) [sssd[nss]] [check_cache] (0x0400): Cached entry is valid, returning..
(Mon Apr 13 17:22:44 2015) [sssd[nss]] [nss_cmd_getpwnam_search] (0x0400): Returning info for user [shadowman@domain.local]
(Mon Apr 13 17:22:44 2015) [sssd[pam]] [accept_fd_handler] (0x0400): Client connected!
(Mon Apr 13 17:22:44 2015) [sssd[pam]] [sss_dp_issue_request] (0x0400): Issuing request for [0x40b570:3:shadowman@domain.local]
(Mon Apr 13 17:22:44 2015) [sssd[pam]] [sss_dp_get_account_msg] (0x0400): Creating request for [domain.local][3][1][name=shadowman]
(Mon Apr 13 17:22:44 2015) [sssd[pam]] [sss_dp_internal_get_send] (0x0400): Entering request [0x40b570:3:shadowman@domain.local]
(Mon Apr 13 17:22:44 2015) [sssd[be[domain.local]]] [be_req_set_domain] (0x0400): Changing request domain from [domain.local] to [domain.local]
(Mon Apr 13 17:22:44 2015) [sssd[pam]] [pam_check_user_search] (0x0400): Returning info for user [shadowman@domain.local]
(Mon Apr 13 17:22:44 2015) [sssd[pam]] [sss_dp_req_destructor] (0x0400): Deleting request: [0x40b570:3:shadowman@domain.local]
(Mon Apr 13 17:22:44 2015) [sssd[be[domain.local]]] [be_req_set_domain] (0x0400): Changing request domain from [domain.local] to [domain.local]

What concerns me is "Cached entry is valid, returning", which then results in a failed authentication attempt. This lines up with the fact that the failed login attempts don't appear to generate any network traffic.
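
To line up the responder logs (nss, pam) with the backend and krb5_child logs, it can help to parse the debug lines into a single timeline. The following is a rough Python sketch, assuming the debug line layout shown in the excerpt above. The names SssdRecord, parse_sssd_line and timeline are made up for illustration, and the timestamp parsing assumes English day and month names.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class SssdRecord(NamedTuple):
    when: datetime
    component: str   # short name, e.g. "nss", "pam", "be", "krb5_child"
    function: str
    level: int       # debug level bitmask, e.g. 0x0400
    message: str


# Assumed layout: (timestamp) [sssd[component]] [function] (0xLEVEL): message
LINE_RE = re.compile(
    r"^\((?P<ts>[^)]+)\)\s+"
    r"(?P<comp>\[+sssd\[.*?\]+)\s+"
    r"\[(?P<func>\w+)\]\s+"
    r"\((?P<level>0x[0-9a-fA-F]+)\):\s+"
    r"(?P<msg>.*)$"
)


def parse_sssd_line(line: str) -> Optional[SssdRecord]:
    """Parse one debug line; return None for lines in another format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    short = re.search(r"sssd\[(\w+)", match["comp"])
    return SssdRecord(
        when=datetime.strptime(match["ts"], "%a %b %d %H:%M:%S %Y"),
        component=short.group(1) if short else match["comp"],
        function=match["func"],
        level=int(match["level"], 16),
        message=match["msg"],
    )


def timeline(lines):
    """Merge lines from several log files into one list ordered by time."""
    records = [rec for rec in map(parse_sssd_line, lines) if rec is not None]
    return sorted(records, key=lambda rec: rec.when)
```

With the lines from sssd_nss.log, sssd_pam.log and the domain log concatenated, filtering the timeline on a failing user's name around the time of a sudo failure shows whether the backend was ever asked to go online for that request.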

In a successful auth attempt you can see it generate the request for the user, connect to the AD server and return the details for the user (rather than deleting the request, as it appears to do above). I would paste that output but it's a little too verbose.

I have also attempted to disable the cache with 'entry_cache_timeout = 1' and 'memcache_timeout = 1', but unfortunately the error still persists.

-edit-

Also have the following from the logs:

(Mon Apr 13 17:24:01 2015) [[sssd[krb5_child[18719]]]] [main] (0x0400): krb5_child started.
(Mon Apr 13 17:24:01 2015) [[sssd[krb5_child[18719]]]] [main] (0x0400): Will perform online auth
(Mon Apr 13 17:24:01 2015) [[sssd[krb5_child[18719]]]] [get_and_save_tgt] (0x0400): Attempting kinit for realm [DOMAIN.LOCAL]
(Mon Apr 13 17:24:01 2015) [[sssd[krb5_child[18719]]]] [validate_tgt] (0x0400): TGT verified using key for [host/hostname.domain.local@domain.LOCAL].
(Mon Apr 13 17:24:02 2015) [[sssd[krb5_child[18719]]]] [main] (0x0400): krb5_child completed successfully

In my capture, this appeared after the failed sudo attempts. Once this appeared, auth seemed to function as expected again... which would also explain the 'Server not found' Kerberos messages.

Just a thought - I've been looking into similar 6.6 strangeness recently. We're moving from 6.4 rather than 6.5, but this may apply.

The sssd AD backend has acquired the capability to do dynamic DNS updates - see man sssd-ad and look for dyndns. The feature is on by default but seems to get confused in some environments. I'd suggest you check your DNS is fully consistent for the misbehaving host, and in particular there isn't a PTR record which points to an unqualified version of your hostname. A messed up DNS could possibly be causing your GSSAPI errors, but I'm not enough of a Kerberos expert to say for certain.
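
A quick way to spot the kind of inconsistency described here is to compare forward and reverse lookups for the host. The sketch below is a hedged Python example, not part of SSSD: check_forward_reverse is a hypothetical helper, and it uses the system resolver via the standard socket module, so entries in /etc/hosts can mask what DNS itself returns.

```python
import socket


def check_forward_reverse(fqdn, forward=socket.gethostbyname_ex,
                          reverse=socket.gethostbyaddr):
    """Return a list of human-readable problems; empty means consistent.

    The resolver functions are parameters so they can be swapped out,
    for example with fakes when testing.
    """
    problems = []
    want = fqdn.rstrip(".").lower()
    _, _, addresses = forward(fqdn)
    for addr in addresses:
        try:
            ptr_name, _, _ = reverse(addr)
        except socket.herror:
            problems.append(f"{addr}: no reverse (PTR) entry")
            continue
        got = ptr_name.rstrip(".").lower()
        if "." not in got:
            # The case called out above: PTR points at an unqualified name.
            problems.append(f"{addr}: PTR {ptr_name!r} is unqualified")
        elif got != want:
            problems.append(f"{addr}: PTR {ptr_name!r} does not match {fqdn!r}")
    return problems
```

Calling check_forward_reverse(socket.getfqdn()) on a misbehaving host and on a healthy one gives a simple side-by-side comparison.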

Thanks for the suggestion.

All DNS is configured statically through an IPAM appliance, but I double checked the forward/reverse records and they all look good.

I have also updated a pool of servers to the latest SSSD version (minor increment up) and the issue has continued.

I have now reverted another group of hosts back to the 1.9.2-129 packages for SSSD and the problems look to have gone away on these servers. If anyone wants to carry out an SSSD downgrade on RHEL 6.6 for testing, let me know and I will post the procedure. There is plenty of yum tedium due to the fairly substantial difference in packages between 1.9 and 1.11!

Hi, if you have those instructions for downgrading the entire sssd suite to a specific version with yum, I would very much appreciate them. Thanks in advance. I have similar problems: the sudo command can fail to resolve groups, even straight after logon. If I issue an 'id' command it will work for a few minutes, then stop again. My issue on RH 6.5 is that I have it working on sssd 1.12.4-47.el6_7.8, but not 1.13.3-22.el6_8.4.

I want to downgrade my servers to my working set, 1.12.4-47, but it's not straightforward in yum.

Neil

Matt,

It would appear that we are not the only ones having issues with the 6.6 changes:
http://freeipa-users.redhat.narkive.com/TjBl4Ym5/centos-ipa-client-fails-after-upgrade-to-6-6

The thing that caught me was that the DNS dynamic updates were enabled by default, and our DNS configuration was not as tight as it should have been. To cut a long (embarrassing) story short, the DNS updates were producing A records in the root zone and PTR records to effectively unqualified names. That was sufficient to break Kerberos for us. I've now disabled updates in the root (they shouldn't really have been allowed in the first place) and set ad_hostname and all seems well. For now. After the other stories I've read about 6.6, I think I'll be letting the test boxes bed in for a fair bit before we roll this one out.

I'd prefer having an embarrassing story over unresolved intermittent auth issues any day!

If it doesn't put you out, would you be able to paste your (sanitised) config file for sssd? I'm interested to see what it looks like. Did you generate your sssd.conf fresh, or are you using your 6.5 config with only the ad_hostname change? Also, which version of Windows AD are you using, and are you using IMU or ID mapping?

I am using a very basic config that has worked since very early RHEL 6, but I am now considering regenerating it using authconfig with the 1.11 version of SSSD to see if any of the standard configuration authconfig generates has changed.

I am finding that the caching is a black box, and it's not straightforward to completely turn caching off for troubleshooting... which slows progress.

Here's the fella. I've removed the LDAP search bases. You'll need to make the expected substitutions for %{domain} and %{fqdn}.

The file was hand-crafted for 6.4. Since then, I've only added ad_hostname for 6.6. We originally set the file up with the LDAP backend rather than AD, as I also needed to support some ancient Solaris machines on the same network. This had its own interesting and varied edge cases, so I moved to the AD provider for Linux.

UNIX UIDs and GIDs are stored as LDAP attributes. We're using AD for users, groups, services, netgroups and the automounter only at the moment. /etc/nsswitch.conf entries for all of these are 'files sss'. /etc/pam.d/system-auth-ac looks the same for both 6.4 and 6.6.

Rather annoyingly (for the purposes of debugging), the DCs have just been upgraded to 2012R2, so I'm not convinced we're necessarily exercising all the code in the AD back end anyway!

Another possible scenario which could generate your symptoms occurs to me, but I think it's pretty unlikely. We had something similar to this a while ago. Originally one of our three DCs did not support all the encryption types supported by the others - it was missing AES256. The brokenness happened when the initial Kerberos authentication exchange grabbed an AES256 TGT from a working DC. That TGT wouldn't work when presented to the TGS on the faulty DC, but it was fine with the others.

The only reason why this could be causing your problem would be if your 6.6 process for adding the machine to the domain is for some reason using a subset of the encryption types your 6.4 process uses. Might be worth trying 'klist -k -e /etc/krb5.keytab' on both a 6.4 and a 6.6 machine and making sure the expected encryption types are listed, and all the key version numbers are consistent.
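
If there are many hosts to compare, the 'klist -k -e' output can be captured to a file per host and diffed with a script. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, and it assumes each keytab entry is printed as "<kvno> <principal> (<encryption type>)", with the encryption type taken verbatim from inside the parentheses because its wording varies between krb5 versions.

```python
import re
from collections import defaultdict

# Assumed entry layout: "<kvno> <principal> (<encryption type>)".
# Header lines ("Keytab name: ...", "KVNO Principal", dashes) do not match.
ENTRY_RE = re.compile(
    r"^\s*(?P<kvno>\d+)\s+(?P<principal>\S+)\s+\((?P<enctype>.+)\)\s*$"
)


def parse_keytab_listing(text):
    """Map each principal to a set of (kvno, enctype) pairs."""
    entries = defaultdict(set)
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            entries[match["principal"]].add(
                (int(match["kvno"]), match["enctype"])
            )
    return dict(entries)


def enctypes(text):
    """All encryption types present anywhere in the listing."""
    return {
        enc
        for pairs in parse_keytab_listing(text).values()
        for _, enc in pairs
    }


def principals_with_mixed_kvno(text):
    """Principals that appear with more than one key version number."""
    return sorted(
        principal
        for principal, pairs in parse_keytab_listing(text).items()
        if len({kvno for kvno, _ in pairs}) > 1
    )
```

Taking the set difference of enctypes() for a working and a failing host shows any encryption type that only one of them has. A principal with several key version numbers is not necessarily wrong (old keys may be kept on purpose), but it is worth a look.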

[sssd]
services = nss, pam, autofs
config_file_version = 2
domains = %{domain}

[nss]
filter_groups = root
filter_users = root

[pam]

[sudo]

[autofs]

[ssh]

[pac]

# Name used here determines Active Directory domain
[domain/%{domain}]
debug_level = 0xf0
# Use 'ad' rather than 'ldap/krb5'. See OR28801
id_provider = ad
auth_provider = ad
# The AD access provider checks for account expiration (default for sssd
# is 'permit')
access_provider = ad
# We have Posix attributes stored in AD already
ldap_id_mapping = false

# (RHEL 6.6) for dynamic DNS updates to work, we need this...
ad_hostname = %{fqdn}

# (RHEL 6.4) autofs has to be provided by LDAP...
autofs_provider = ldap

# ...for which we need sensible settings
ldap_sasl_mech = GSSAPI

# Kerberos configuration:-
# - Enable automatic renewals. The specified time (#seconds) is the
#   interval at which we check to see if the ticket is 50% gone.
# - Store offline passwords. This copes with the DCs being unavailable
#   at logon.
# - Use the standard template name for the Kerberos credential cache so that
#   standard Kerberos utilities can see it too.
# - Allow a TGT to be renewed for a working week.
krb5_renew_interval = 10800
krb5_store_password_if_offline = true
#krb5_renewable_lifetime = 7d

# In our setup, referrals are not needed
ldap_referrals = false

# Generic options. Cache credentials. Don't enumerate passwd and group
cache_credentials = true
enumerate = false

# -----------------------------------------------------------------
# S C H E M A
#
# We use the Active Directory default of rfc2307bis, which is pretty
# much mandated by "id_provider = ad". See OR28837
# -----------------------------------------------------------------

#----------------------------
ldap_user_search_base = ...

#----------------------------
ldap_group_search_base = ...
ldap_group_name = mail

#----------------------------
ldap_service_search_base = ...

#----------------------------
ldap_netgroup_search_base = ...

#----------------------------
# We store the automounter data in NIS maps, as our current AD servers do
# not support the automount classes of RFC2307(bis)
ldap_autofs_search_base = ...
ldap_autofs_map_object_class = nisMap
ldap_autofs_map_name = nisMapName
ldap_autofs_entry_object_class = nisObject
ldap_autofs_entry_key = cn
ldap_autofs_entry_value = nisMapEntry

Thanks Matt,

I investigated the Kerberos tickets and they all look good. My initial thought was that one of the AD servers was upsetting SSSD, but when troubleshooting we could see (in the pcap) all requests going to a single AD server.

Interestingly I have found this SSSD bug ticket
https://fedorahosted.org/sssd/ticket/2531

This references the sssd-common-1.11.6-30.el6 package as having the problem, and also references the following Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1173738

Unfortunately this linked Bugzilla is private, but if the bug is now resolved upstream and was originally raised in RHEL Bugzilla, I am wondering why the backport hasn't shown up in RHEL yet (it has been closed for ~3 months).

I am not sure this is even related, but the 1.11.6-30 should hopefully get a version bump soon.

Hey Team,
If you would like any assistance from Red Hat regarding this issue, please feel free to open a support ticket.
I'm sure we would be able to assist you.

Thanks

Try setting:
ad_server = dc1.domain.com, dc2.domain.com, dc3...
under
[domain/...]

Alternatively set
dns_discovery_domain
if your env allows it.
Both solutions above worked for me with
id_provider = ad

Gianni,

Thanks for the response.

Were you having the same issue as above?

I have explicitly specified the AD servers during testing to limit traffic to a single DC and I was still able to replicate the issue.

I still feel it is caching, and potentially something to do with nested groups. I have never had a first login fail (before cache exists), only subsequent logins that appear (when sssd is logging debug) to be returning a cache hit. I can also clarify that if one user has the authentication issue, other users can login OK (ie. it seems to be user specific also).

I have reverted 50+ hosts to 1.9.2-129 using the identical configuration that I was using for 1.11.6, and the issue has completely gone away on these hosts.

Added the ad_server line and it's worked wonders so far. Thanks for the insight.

Take a look at the changelog for the 1.11.7 version; ticket #2399 specifically.

Hi Amel,

Thanks for the suggestion. The SSSD service is running fine (it successfully authenticates users); it just fails to authenticate users intermittently. These hosts were also built from 6.6 (i.e. not built as 6.5 and then upgraded), although I may not have made that clear prior to this post.

There's also the SSSD-Troubleshooting page with a lot of useful and, perhaps, pertinent information. As well as a number of tickets specific to RHEL6.6 issues.

Thanks again.

I have been through the SSSD tickets, and have linked above to what I believe the issue could be.
https://fedorahosted.org/sssd/ticket/2531

I am going to wait for a new 1.11 release from Red Hat to see if that resolves the issue (and hoping for a backport of above patch). As it stands, 1.9.2 doesn't have any outstanding Errata and doesn't display the issue... so rolling back is a safe/reliable option.

As mentioned above, this configuration has been fine for 12+ months and hasn't been changed for 1.11.6. As soon as 1.11.6 is installed only 1.11.6 hosts start displaying this behaviour.

Another related question: can anyone provide insight into the change in SSSD version from 1.9 -> 1.11 when 1.9.x is still listed as the stable LTM release on the SSSD project page? Was this for the Samba integration support?

Yes, I was having the exact problem as you described above...
"Apr 10 16:14:32 sudo: pam_sss(sudo:auth): received for user : 9 (Authentication service cannot retrieve authentication info)"
... in the secure log on at least two dozen servers upgraded from 1.9.x to 1.11 latest after updating to rhel6.6.
This behaviour happened mostly but not always after a reboot, and intermittently after.

Thanks Gianni, I really appreciate you registering and posting the details (it also helps confirm that I'm not crazy).

Looking back through the configuration changes I've tested, I explicitly defined the servers in krb5.conf, not in SSSD, so it is definitely something I will try. I also found a potentially related bug (I don't have the details with me) where, if the primary DC lookup fails for whatever reason and that DC is also the primary DNS server in resolv.conf, DNS resolution won't attempt the subsequent DNS servers.

Can I ask what led you to this solution? Were you able to find something specific in the logs, or a Bugzilla / error that matched your symptoms? If you have an error etc., I can search the debug logs I have dumped to confirm.

I looked really hard and found nothing in the logs that helped. Turned on debug to 8 in sssd.conf and still nothing stood out to me.

Like you I had my DCs defined in the krb5.conf.

Nothing really led me directly to this solution, I just started thinking it was perhaps related to DNS, so I specified the IP of my DCs (worked) and the rest followed from that.

We had similar issues recently, though we were migrating to 2012 domain controllers. I haven't investigated your issue deeply, but wanted to share. We had to add the following to sssd.conf:

ldap_referrals = False

Thanks for that Robert,

I've had one site in the past that suffered slowness when ldap_referrals was enabled due to AD server configuration, but this was consistent across all Linux hosts in the environment (i.e. not intermittent). It was also on an older SSSD.

Were you seeing issues specifically on 1.11.6?

I have now rolled back all hosts in the environment to 1.9.2 (custom channel with newer SSSD components removed) and haven't seen the problem since.
