Best practice for bulk patching of RHEL with Satellite

Hi all,

Has anybody already established a best practice for bulk patching of RHEL machines with the help of the Satellite Server (version 6.2+)?
Our use case is more or less this: a few score Red Hat machines (RHEL 6.x and RHEL 7.2) need to be upgraded to the newest minor version (6.9 and 7.3 respectively). All the test machines should be upgraded first, all the production machines afterwards.

If anybody has gained some experience already with mass patching of RHEL systems using the Satellite Server, could you please share it?

Thanks and best regards,
Joanna

Responses

Satellite Object refresher

Before I begin, here is a refresher on the relationships between the Satellite objects. (This is my own interpretation; if what I say goes against official documentation, then trust the docs.)

Once you view the items as objects with one-to-one or one-to-many relationships, their purpose becomes much clearer.

A repository is a collection of content of one type ('yum', 'puppet', 'docker', or 'ostree'), but you will most likely use yum, and the repository will contain RPMs. A Product consists of one or more repositories that are relevant to the Product name. A Content View can consist of one or more repositories whose inclusion is logically organized by you.

Note - Products appear to exist only for organizing subscriptions. Content views are what you will use in your patching workflow.

Publishing a new version of a content view creates a content-view version object that maintains a catalog of all the packages in the content view at that point in time.

A lifecycle environment allows you to define logical environments to which other objects can be assigned. (This explanation sucks, but the concrete example below makes it clear.)

Promoting a content view means taking an existing content-view version and assigning it to a new lifecycle environment.

What ties all these objects together? Activation keys. You can create a single activation key that is assigned to a specific content view and a specific lifecycle environment.

Workflow

Now I'm going to give you my actual workflow to make these objects a bit more concrete. This is done in a small environment ( < 100 machines) so take that into consideration.

We have Red Hat 6 and 7 Workstations and Servers, so I have created four content views (cv_RedHat6_Server, cv_RedHat7_Workstation, and so on). For each content view, I added the relevant repositories (Red Hat Software Collections for RHEL 6 Server and Red Hat Enterprise Linux 6 Server RPMs both go in cv_RedHat6_Server, and so on).
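
For reference, the hammer calls for that step look roughly like this (an untested sketch; "MyOrg" and the repository ID are placeholders, not values from this thread):

    hammer repository list --organization "MyOrg"    # note the IDs of the repos you need
    hammer content-view create --organization "MyOrg" --name "cv_RedHat6_Server"
    hammer content-view add-repository --organization "MyOrg" \
        --name "cv_RedHat6_Server" --repository-id 42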

We then created two lifecycle environments; I just called them 'Development' and 'Production'.

Keep in mind there is a default lifecycle environment called 'Library'. Any time you publish a new content-view version, it is automatically placed in the 'Library' environment.
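
Creating the two environments with hammer might look like this (a hedged sketch; "MyOrg" is a placeholder). Each environment names its prior, forming the promotion path Library -> Development -> Production:

    hammer lifecycle-environment create --organization "MyOrg" \
        --name "Development" --prior "Library"
    hammer lifecycle-environment create --organization "MyOrg" \
        --name "Production" --prior "Development"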

Now create the needed activation keys. You should have one activation key for every combination of your defined environments and content views. Since I have two environments (Development, Production) and four content views (cv_RedHat6_Server, cv_RedHat7_Server, cv_RedHat6_Workstation, cv_RedHat7_Workstation), I created eight activation keys (ak_Development_RedHat6_Server, ak_Production_RedHat6_Server, and so on).
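
Each key is just a content-view/lifecycle-environment pairing; as an untested sketch (again, "MyOrg" is a placeholder):

    hammer activation-key create --organization "MyOrg" \
        --name "ak_Development_RedHat6_Server" \
        --content-view "cv_RedHat6_Server" \
        --lifecycle-environment "Development"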

Assign each activation key to each machine it belongs to. I used Ansible to define group variables and substituted the variable for the activation key in my Satellite-client playbook. Whatever your method, you should absolutely automate the process of joining clients to Satellite. I have it automated and it's still a headache with trouble machines!
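
Under the hood, the registration a playbook like that performs boils down to something like the following (a hedged sketch; the Satellite hostname and organization label are assumptions):

    # Trust the Satellite's CA, then register with the activation key that
    # encodes the right content view and lifecycle environment:
    rpm -Uvh http://satellite.example.com/pub/katello-ca-consumer-latest.noarch.rpm
    subscription-manager register --org="MyOrg" \
        --activationkey="ak_Development_RedHat6_Server"
    # Optional: katello-agent lets Satellite push errata to the client
    yum -y install katello-agent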

At this point, let's say you have synced new content and you want to test it. You publish a new version of your content view and promote that content-view version to the Development environment. Once promoted, all machines that belong to your Development environment will have the new content available, while the Production machines will still be pointing at the previous content-view version. After patching and rebooting your Development machines successfully, you can with some confidence (not full confidence, though) promote the content to Production and patch your production machines. If you find that a content-view version has issues, you can promote a previous version back into the environment (effectively demoting the broken one).
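
The publish/promote cycle in hammer terms, as an untested sketch ("MyOrg" and the version number are placeholders; check the real version with 'hammer content-view version list'):

    hammer content-view publish --organization "MyOrg" \
        --name "cv_RedHat7_Server" --description "monthly errata"
    # The new version lands in Library; promote it to Development first:
    hammer content-view version promote --organization "MyOrg" \
        --content-view "cv_RedHat7_Server" --version "2.0" \
        --to-lifecycle-environment "Development"
    # After successful testing, promote the same version onward:
    hammer content-view version promote --organization "MyOrg" \
        --content-view "cv_RedHat7_Server" --version "2.0" \
        --to-lifecycle-environment "Production"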

So this provides one layer of defense, but really isn't foolproof unless your development machines are exactly the same as your production in every way.

Alternatives

If you are in a small environment, you can get away with taking a VM snapshot of your important servers before patching, but I have seen that this takes far too much time in a larger environment, so it really isn't feasible there.

Another method is to roll back your updates with yum history list -> yum history rollback [id]. This works fine for simple packages, but if a kernel update breaks your machine, you're out of luck.
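
For example (the transaction ID 42 is hypothetical; use whatever 'yum history list' shows for your update run):

    yum history list         # list recent transactions with their IDs
    yum history info 42      # inspect what transaction 42 changed
    yum history rollback 42  # return to the state right after transaction 42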

My final idea is that it's time to finally learn LVM snapshots. They look like a lighter alternative to VM snapshots and can really save your butt. The only problem is, I can't be the only one whose servers have their volume group space fully used up, and snapshots require extra space. I'll leave it at that, as I need to do much more research before recommending them for patching.
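
For anyone who does have free extents, the basic flow would be roughly this (untested here; the volume group "rhel" and logical volume "root" are the RHEL 7 defaults, yours may differ):

    vgs    # check for free space in the volume group first
    # Take a snapshot before patching (size must hold the expected changes):
    lvcreate --size 5G --snapshot --name root_presnap /dev/rhel/root
    # ...patch and test...
    # If the patching went badly, merge the snapshot back into the origin;
    # for an in-use root LV the merge is deferred until the next reboot:
    lvconvert --merge /dev/rhel/root_presnap
    # If all is well instead, drop the snapshot to reclaim the space:
    lvremove /dev/rhel/root_presnap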

This is just my workflow, which isn't close to perfect, but it may give you some ideas on how to patch.

Hi Luke, thanks so much for your answer. Your explanation is really good and helps a lot in preparing a suitable patching scenario for our Red Hat environment. One additional question: do you automate the patching itself as well (e.g. with remote execution + katello-agent), or do you have to log into every RHEL machine and do the upgrade manually? Have a nice day, Joanna

We use host collections to isolate our patching to selected systems (e.g. a host collection for UNIT systems, a host collection for INTG systems, etc.). On given days during the month we schedule Remote Execution jobs against a target host collection that call a script locally on each system to do the updates (e.g. a local script that runs yum -y update or yum -y update --security, depending on the type of errata we are applying). This script in turn also schedules a reboot via the at command if one is necessary (kernel update, etc.).
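
A minimal sketch of such a local script (my own reconstruction, not the poster's actual script; note that --security needs yum-plugin-security on RHEL 6):

    #!/bin/bash
    # Pass "security" as $1 to restrict the run to security errata.
    set -euo pipefail

    if [ "${1:-all}" = "security" ]; then
        yum -y update --security
    else
        yum -y update
    fi

    # If a newer kernel than the running one is now installed,
    # schedule a reboot for 02:00 via at(1):
    running=$(uname -r)
    latest=$(rpm -q --last kernel | head -1 | awk '{print $1}' | sed 's/^kernel-//')
    if [ "$running" != "$latest" ]; then
        echo "shutdown -r now" | at 02:00
    fi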

Which errata are available is controlled from the content view. We simply publish and promote to the lab and prod environments when we are ready for errata to be made available.

If you are using katello-agent to do the patching, it seems to work well, but we've experienced some timeout issues with REX when scheduling multiple errata (e.g. using the REX template to schedule specific errata to hundreds or thousands of systems). So we ended up writing our own local script to handle the patching.

Up through 6.2.10 this has worked very well for us, even at scale.

I think the nice thing about Satellite is that there are many ways to accomplish the same thing, and they largely depend on your use case and what your security restrictions are.

My systems are in host collections, each of which is assigned a content view: non-prod or prod. From there I use a series of bash and hammer commands to apply errata and reboot.
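
A hedged sketch of what that pair of hammer commands can look like (names and the erratum ID are placeholders; the first command relies on katello-agent on the clients):

    # Apply one erratum to every host in the "prod" collection:
    hammer host-collection erratum install --organization "MyOrg" \
        --name "prod" --errata "RHSA-2017:1842"
    # Then reboot those hosts with a Remote Execution job:
    hammer job-invocation create \
        --job-template "Run Command - SSH Default" \
        --search-query 'host_collection = "prod"' \
        --inputs command="shutdown -r +1"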

Hi,

Do we have any mechanism to patch a bulk of servers at one time using Satellite?

Please post, it will be helpful.

Joanna,

For bulk patching, one must take (what I and our customers consider to be) an unacceptable risk: assigning a NOPASSWD sudo directive to the foreman account. So (for us) Ansible (not Tower) seems like a nicer method to patch multiple systems at once.

Each organization, business, or group of people will likely have different operational needs. That being said: patch the things that are less mission-critical to your business, evaluate for function, and then proceed with a wider scope of patching. If you have a test or development network, patch there first, perhaps again in an incremental approach. If you have more than one set of Oracle servers (or whatever your unique function set is), patch one, then the other, or patch one group, then the other.

The whole idea in all of this is to mitigate risk. I'd recommend stridently against patching everything at once. Examine your own sets/groups of systems and look for what is sensible to patch in an incremental approach. I'd also recommend setting up a consistent patch date so you can point to the calendar for those who are resistant to patching and say: "this is our agreed-upon patch window".

Regards,

RJ

We have a considerable delay between when we patch development, test, and prod, for example. Using Ansible, how would you ensure that the production systems get exactly the same package versions as the patches applied to dev? Typically we find that our repos have changed notably in the 1-2 months between dev and prod patching.

Disable 'Mirror on Sync' for your repositories. This keeps the older downloaded packages, as well as the newer ones from the upstream repos, when they sync.
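
If your version exposes it in hammer, that toggle can be flipped per repository roughly like this (untested; "MyOrg" and the ID are placeholders):

    hammer repository list --organization "MyOrg"    # find the repository ID
    hammer repository update --organization "MyOrg" --id 42 --mirror-on-sync no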

We use date-based rule filters (e.g. end of month) on the content view. That gives us the opportunity to keep the baseline common to Dev and Prod.
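
In hammer terms that could look like the following (an untested sketch; names and the cutoff date are placeholders):

    # Create an inclusion filter that only lets through errata up to a cutoff:
    hammer content-view filter create --organization "MyOrg" \
        --content-view "cv_RedHat7_Server" --name "errata-by-date" \
        --type erratum --inclusion true
    hammer content-view filter rule create --organization "MyOrg" \
        --content-view "cv_RedHat7_Server" \
        --content-view-filter "errata-by-date" \
        --end-date 2017-06-30 --types enhancement,bugfix,security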

Regarding: "[..] an unacceptable risk of assigning a nopassword sudo directive to the foreman account. [..]"

We're also looking into this.

Can some of your/our worries be mitigated by the following? (Off the top of my head; not tested/verified.)

  • Change the username used for SSH remote execution (remote_execution_ssh_user).

  • Use Kerberos instead of SSH keys for the authentication. That way, a ticket must be created before a remote SSH login (and thus a remote command) is possible. You can wait for the ticket to expire automatically - or invalidate it manually, so that the SSH user can no longer log in to the remote servers.

  • If you have your sudo rules in LDAP, how about using "sudoNotAfter"?

  • Restrict the sudo commands available to the remote user (see the sketch right after this list).
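
On that last point, an untested sudoers sketch ("rexuser" is a hypothetical remote-execution account, and the command list is an assumption about what patching needs):

    # Limit sudo to the exact commands instead of NOPASSWD: ALL
    cat > /etc/sudoers.d/rex <<'EOF'
    rexuser ALL=(root) NOPASSWD: /usr/bin/yum -y update, /usr/bin/yum -y update --security, /sbin/shutdown -r *
    EOF
    chmod 0440 /etc/sudoers.d/rex
    visudo -c    # sanity-check the sudoers syntax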

Additionally, there are probably some other ways to run arbitrary code on the servers:

  • Local users on the Satellite who have write access to the Puppet module directories can already run arbitrary commands as root on every server (if the Satellite is the Puppet master).

  • Every user who can control the Satellite's qpidd can probably install arbitrary RPMs... including their post-scripts.

  • The Apache user ("apache") that owns the Puppet master process can probably do anything, by sending bogus Puppet modules to the agents.

  • Write permissions on the Pulp files can probably be used to add bogus RPMs that you and I will willingly install the next time we patch our servers.

  • Write permissions on some of the databases/files on the Satellite can probably make either Puppet or Katello run arbitrary code on the servers.
