Impact of Disk Speed on Satellite 6 Operations
Environment
- Red Hat Satellite 6 server
- Red Hat Satellite 6 Capsule server
Issue
Satellite 6 depends heavily on fast-performing IO for proper operation throughout the system. Database queries, file copies, API traffic, and more are all greatly affected by the storage configured for Satellite 6.
The storage with the largest effect is the partition backing the directories under /var, as outlined in the Installation Guide:
Red Hat Satellite 6.11 Installation Guide
Poorly performing IO can cause:
- High load averages
- Slow to exceedingly slow content operations such as synchronizations and Content View publish and promote operations.
- Long-running API queries: queries that hit the database may take extra time to complete, causing unexpected consequences.
- Client-initiated API throughput issues: if you are seeing a growing number of Actions::Katello::Host::* API tasks taking longer than expected and backing up in the queue, you may want to investigate your IO.
The Satellite 6 server and its Capsules require disk IO at or above 60-80 Megabytes per second of average throughput for read operations. Anything below this value can have severe implications for the operation of the Satellite.
As we outline below, this value is not particularly hard to achieve with local spinning HDDs and is easily achievable with local SSDs.
The difficulty comes when using Satellite 6 with network-attached storage, especially on 1G or slower networks, which can quickly become saturated and unable to provide the performance necessary for proper Satellite 6 operation.
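As a rough spot check of sequential read throughput (the target directory, file size, and block size are only examples; the storage-benchmark script in the Diagnostic Steps below is the recommended test):
# Sequential direct-read spot check with a large block size; small block sizes
# can report misleadingly low numbers (see the comments below)
fio --name=quickcheck --rw=read --size=2g --directory=/var/lib/pulp --direct=1 --blocksize=1M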
Diagnostic Steps
The Satellite team has built a tool, storage-benchmark, that can be used to test disk IO on your Satellite Server. It is available here: https://github.com/RedHatSatellite/satellite-support/blob/master/storage-benchmark
This supersedes foreman-maintain's quicker and less intensive built-in 'fio' check, which can sometimes produce misleading results.
The storage-benchmark script executes a series of more intensive 'fio' based IO tests against a target directory specified at execution time. The test creates a very large file, double (2x) the size of the physical RAM on the system, to ensure that we are not simply testing OS-level caching of the storage. The results are meant to provide guidance and are not a hard-and-fast indicator of how your Satellite will perform.
NOTE: We recommend you stop all services before executing this script; you will be prompted to do so.
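For example, a minimal sketch of a run (the target directory is an example; on some releases the tool is named foreman-maintain instead of satellite-maintain):
# Stop Satellite services before benchmarking (the script also prompts for this)
satellite-maintain service stop
# Run the benchmark against the storage you want to test (example target directory)
./storage-benchmark /var/lib/pulp
# Restart services once the benchmark completes
satellite-maintain service start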
Our Satellite Performance and Scale team has executed the storage-benchmark against a series of different hardware in our lab environment to help provide some examples of expected performance from different vendors. This is not an exhaustive list.
RESULTS
[1] Toshiba MG03ACA1 SATA Disk 931GiB (1TB)
Running READ test via fio:
READ: bw=115MiB/s (121MB/s), 115MiB/s-115MiB/s (121MB/s-121MB/s), io=66.0GiB (70.9GB), run=586724-586724msec
Running WRITE test via fio:
WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=66.0GiB (70.9GB), run=523016-523016msec
READ MiB/s: 115MiB/s
WRITE MiB/s: 129MiB/s
[2] DELL PERC H710 SCSI Disk 2791GiB
Running READ test via fio:
READ: bw=773MiB/s (811MB/s), 773MiB/s-773MiB/s (811MB/s-811MB/s), io=132GiB (142GB), run=174866-174866msec
Running WRITE test via fio:
WRITE: bw=685MiB/s (719MB/s), 685MiB/s-685MiB/s (719MB/s-719MB/s), io=132GiB (142GB), run=197195-197195msec
READ MiB/s: 773MiB/s
WRITE MiB/s: 685MiB/s
[3] NFS via 1G network
Running READ test via fio:
READ: bw=102MiB/s (107MB/s), 102MiB/s-102MiB/s (107MB/s-107MB/s), io=98.0GiB (105GB), run=982855-982855msec
Running WRITE test via fio:
WRITE: bw=55.8MiB/s (58.5MB/s), 55.8MiB/s-55.8MiB/s (58.5MB/s-58.5MB/s), io=52.8GiB (56.7GB), run=968853-968853msec
READ MiB/s: 102MiB/s
WRITE MiB/s: 55MiB/s
[4] DELL PERC H710 SAS 931 GiB (999GB)
Running READ test via fio:
READ: bw=92.3MiB/s (96.8MB/s), 92.3MiB/s-92.3MiB/s (96.8MB/s-96.8MB/s), io=264GiB (283GB), run=2929191-2929191msec
Running WRITE test via fio:
WRITE: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=264GiB (283GB), run=2473006-2473006msec
READ MiB/s: 92MiB/s
WRITE MiB/s: 109MiB/s
[5] NVMe Solid State Drive
Running READ test via fio:
READ: bw=2124MiB/s (2227MB/s), 2124MiB/s-2124MiB/s (2227MB/s-2227MB/s), io=788GiB (846GB), run=379896-379896msec
Running WRITE test via fio:
WRITE: bw=1409MiB/s (1477MB/s), 1409MiB/s-1409MiB/s (1477MB/s-1477MB/s), io=698GiB (750GB), run=507484-507484msec
READ MiB/s: 2124MiB/s
WRITE MiB/s: 1409MiB/s
[6] Solid State Drive - SATA
Running READ test via fio:
READ: bw=692MiB/s (725MB/s), 692MiB/s-692MiB/s (725MB/s-725MB/s), io=788GiB (846GB), run=1166398-1166398msec
Running WRITE test via fio:
WRITE: bw=361MiB/s (379MB/s), 361MiB/s-361MiB/s (379MB/s-379MB/s), io=443GiB (476GB), run=1256281-1256281msec
READ MiB/s: 692MiB/s
WRITE MiB/s: 361MiB/s
Observations:
- Tested on 6 hardware combinations: SATA, SAS, SCSI, NFS, SSD, and NVMe.
- Cleared the page cache before running the tests using this command: "swapoff -a; echo 3 > /proc/sys/vm/drop_caches; swapon -a"
- Overall the testing went well; the average throughput for read operations was above 80 MiB/s in each case, which is good. Satellites with this type of storage perform well.
- If you see speeds below the 60-80 MB/s range, you should consider alternative configurations or hardware.
Comments
Got two completely different results when running foreman-maintain and fio.
stracing foreman-maintain reveals that it runs
hdparm -t <device>
I got pretty slow results when using the defaults (4K block size), but when bumping up the block size, things improved drastically.
block size = 4K (default):
[ root@rhsat01aplpd.rjf.com : Fri Aug 10, 08:18 AM : /root ] $ fio --name=job1 --rw=read --size=1g --directory=/var/lib/pulp --direct=1 --blocksize=4k
READ: bw=15.3MiB/s (16.0MB/s), 15.3MiB/s-15.3MiB/s (16.0MB/s-16.0MB/s), io=150MiB (157MB), run=9803-9803msec
block size = 16K:
[ root@rhsat01aplpd.rjf.com : Fri Aug 10, 08:35 AM : /root ] $ fio --name=job1 --rw=read --size=1g --directory=/var/lib/pulp --direct=1 --blocksize=16k
Run status group 0 (all jobs):
READ: bw=52.0MiB/s (55.6MB/s), 52.0MiB/s-52.0MiB/s (55.6MB/s-55.6MB/s), io=1024MiB (1074MB), run=19321-19321msec
block size = 1MB:
[ root@rhsat01aplpd.rjf.com : Fri Aug 10, 08:17 AM : /root ] $ fio --name=job1 --rw=read --size=1g --directory=/var/lib/pulp --direct=1 --blocksize=1024k
Run status group 0 (all jobs):
READ: bw=489MiB/s (513MB/s), 489MiB/s-489MiB/s (513MB/s-513MB/s), io=1024MiB (1074MB), run=2095-2095msec
block size = 4MB:
[ root@rhsat01aplpd.rjf.com : Fri Aug 10, 08:18 AM : /root ] $ fio --name=job1 --rw=read --size=1g --directory=/var/lib/pulp --direct=1 --blocksize=4096k
Run status group 0 (all jobs):
READ: bw=828MiB/s (869MB/s), 828MiB/s-828MiB/s (869MB/s-869MB/s), io=1024MiB (1074MB), run=1236-1236msec
Can someone comment on how this will relate to Sat performance?
Thanks
Matt, thanks for the comment; this document definitely needs updating to take block size into consideration.
Like you, we have also seen very significant speed differences with varying block sizes; the default value used by fio can produce misleadingly low benchmark numbers that make your storage look like it is performing worse than it does during normal operations.
Will ensure we get this updated in the coming days.
I stumbled across this KB because one of my customers has this slow performance issue. Looks like this KB is still not updated because it still reads "last update April 2nd".
Knowing the most relevant block size for Satellite would indeed be critical to make this article great!
One more thing: fio is also readily available in the base RHEL 7 repo, which makes it much simpler to check performance before starting to install Satellite.
Mike,
I am a little confused by all of this myself as I am seeing some very poor results. I tested using fio without "--blocksize=" set and then with it set for 4K and I saw wildly different results. When I looked at "man fio", though, it specifically says that the default blocksize value is 4096:
Results with no "--blocksize=" value set:
Results with "--blocksize=4K" value set:
Granted, both results fall far below the required 60-80 Megabytes per second, but I am still left wondering why there is such a large and repeatable discrepancy between using no "--blocksize=" value (accepting the default of 4096 bytes) and setting "--blocksize=4K", which equates to the default.
Is this the same behavior you have seen?
Any updates here? I am working on some tuning and have not seen any clarification on the block size/disk performance issue.
Thanks
All,
We are working to update foreman-maintain so that it only warns the user when its internal quick 'fio' benchmark falls below our recommended throughput, rather than requiring a whitelist parameter to continue.
We are also working on an updated benchmark script you can run (which will likely be integrated into foreman-maintain in the future) to get more accurate real-world storage information. This test does not use directio and will utilize the OS + caching as normal operations would.
You can find our first version of the script here:
https://github.com/RedHatSatellite/satellite-support/blob/master/storage-benchmark
To execute, just download it to your Satellite, chmod +x, and run:
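For example (the raw download URL and target directory below are assumptions):
# Download the script (URL assumed from the repository link above)
curl -O https://raw.githubusercontent.com/RedHatSatellite/satellite-support/master/storage-benchmark
chmod +x storage-benchmark
# Run it against the storage you want to test (example target directory)
./storage-benchmark /var/lib/pulp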
As noted in the README block in the script:
Feedback is welcome.
Could you please also do tests on a SAN, where latency isn't as good as with local disks but is compensated for by high throughput, in order to make sure the measurement also fits such setups, which are common at enterprise customers? And accordingly document a minimal expected throughput.
Note: one may have to reduce the RAM of the machine if there is not enough space for the /var/lib/pulp/<rnd.file.2.x.ram>
so set mem=20G at GRUB/iLO
https://stackoverflow.com/questions/13484016/setting-limit-to-total-physical-memory-available-in-linux
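As a sketch of one way to do this on RHEL 7 (the 20G value is just the example from the note above):
# Temporarily cap usable RAM with a kernel boot parameter (takes effect after reboot)
grubby --update-kernel=ALL --args="mem=20G"
reboot
# Remove the cap again once benchmarking is done
grubby --update-kernel=ALL --remove-args="mem"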
This article seems to effectively eliminate using a VM as a Satellite Server or capsule.
How do you come to this conclusion? The requirements are relatively high, but not so high that they would keep you from virtualizing Satellite/Capsule; you just need to pay attention to what you do in terms of performance.
Not sure this will help, but here are some additional data points. Our Satellite is a 16 vCPU/64GB virtual machine on VMware. We're using Pure Storage SSD-only arrays.
You definitely need to set the block size to the blocksize of your device. You can make fio return some really nice results if you set blocksize much higher than what the device is configured for. If you did everything using defaults, you probably have a 4k block device.
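For reference, a quick way to check a device's logical and physical sector sizes (the device name is an example):
# Logical (LOG-SEC) and physical (PHY-SEC) sector sizes for all block devices
lsblk -o NAME,LOG-SEC,PHY-SEC
# Block size the kernel uses for a specific device
blockdev --getbsz /dev/sda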
Pure Storage recommends using the --ioengine=libaio flag for Linux, but that's the standard async engine. Since Red Hat's original recommendations are for synchronous reads, change the ioengine to sync.
Based on the docs, since I'm using the sync engine, iodepth doesn't matter, and since I'm only doing reads, fsync is also not needed.
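A sketch of the kind of sync-engine invocation being described here (the directory, file size, and job count are assumptions):
# Synchronous direct sequential read across multiple jobs, aggregated via group_reporting
fio --name=readtest --rw=read --size=1g --directory=/var/lib/pulp --direct=1 --blocksize=4k --ioengine=sync --numjobs=8 --group_reporting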
However, dropping the numjobs to 1 drops my results to a consistent ~10MB/s, so instead of what I thought were 8 parallel jobs, maybe it's 8 separate jobs and then the results are combined. Not sure how useful that is.
Anyway, doing all of this just showed to me that i/o in a virtual machine seems all over the place. I can make this tool give me good results, but I'm not entirely sure how to make it give me accurate results.
We've been using this satellite for over a year now with over 3000 content hosts and multiple capsules, so personally I think the concern in the docs about I/O is a little overblown.
Quote: "Network backed storage quickly hits the 1G limit of ~10MB/sec throughput."
I hope you mean 100MB/sec throughput because the theoretical throughput for 1G network is 125MB/sec. Just saying. :)
VMDK?
I think this information is incorrect or out of date. The foreman-maintain disk check says our disk is too slow, but it uses direct IO (dio). When I run it, or run the equivalent test manually, my host fails the check:
fio --name=job1 --rw=read --size=1g --directory=/var --direct=1
When I turn dio off, my results are excellent.
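For comparison, the buffered (non-direct) variant of the same test would look something like this (an illustration, not the exact command used):
# Same read test without --direct=1, so the OS page cache is used
fio --name=job1 --rw=read --size=1g --directory=/var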
The performance tuning guide for Satellite addresses this: https://access.redhat.com/node/5001161/40/0/17386571
Perfscale team is validating this kbase with different storage media. Will update results/doc soon.
All, see the refreshed version of this document, which utilizes the more intensive and accurate storage-benchmark tool outlined above. This is more accurate than the built-in checks in foreman-maintain, which will likely be phased out in a future release.
Awesome! Thanks Mike!
Hi Mike, I noticed that you only published read and write results but your script also does a random read/write test which you didn't publish. Wouldn't that test be a more accurate representation of disk performance than sequential read and write tests? Doesn't the foreman-maintain script do a random test as well?
Erehwin,
random read/write may provide a bit more accurate representation. If we do re-run these tests in the future we will include them!
foreman-maintain currently just does a read test:
The article is really nice, but only if you run Satellite on a physical host with local storage. However, in most enterprises, Satellite (and Capsules) are installed as VMs (in our case on RHV) running on top of a SAN device. I would really like to see some results based on this scenario and various optimization proposals.