multipath - parallel or multiplexed I/O?
Hi,
I'm a little bit confused.
I would like to ask whether the round-robin mode of dm-multipath on RHEL 6 implies that at any given time only one path (within its path group) is used for the I/O in flight.
As far as I know, multipath cannot do "parallel" or "simultaneous" I/O over multiple paths; instead it can read or write over only one path at a time. But I may be wrong, because colleagues have pointed out that one of multipath's purposes is to spread I/O over the available paths, so it should happen "in parallel".
My guess is that this is true, meaning that all available paths are used, but in a "multiplexed" fashion rather than "at the same time".
I'm not able to find explicit confirmation of this in the official documentation.
Thanks in advance.
Responses
Great question. Now I have to ask, are you just having an "intellectual discussion" or actually trying to solve a problem? ;-)
PREFACE: this is ALL speculation on my part. I doubt that a true multipath (where both paths are writing the same data) could exist. I'm not sure why that would be advantageous (i.e. why would the subsystem want to process two (or more) identical streams, each with a likely difference in response time, etc.? This would also drive up the I/O unnecessarily, I would think). To address your question "imply that on any given time only one path is used" - I would assume that technically that is accurate... but in our observation, depending on the polling-interval, it could possibly appear that they are using all paths simultaneously.
A lot of this conversation is still predicated on what type of SAN you have deployed (Array Vendor and Fabric Topology), specifically whether the array can handle any sort of Active-Active traffic.
The DM Multipathing documentation seems to indicate that Active-Active is actually Round-Robin (which makes sense). My opinion is that Round-Robin is as simultaneous as it can get ;-)
I looked at one of our "flagship" hosts which has 4 paths (2 separate fabrics I believe) connecting to 2 FEPs on the Array, using Round-Robin. And this is certainly not using any Scientific Method of Analysis ;-)
[root@tmsdba01 ~]# multipath -ll -v2 | egrep -B3 -A3 sdg
ORION1_1809 dm-9 HITACHI,OPEN-V
size=10G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
|- 0:0:0:6 sdg 8:96 active ready running
|- 1:0:0:6 sdx 65:112 active ready running
|- 2:0:0:6 sdao 66:128 active ready running
`- 3:0:0:6 sdbg 67:160 active ready running
[root@tmsdba01 ~]# grep -B2 -A4 "OPEN-V" /etc/multipath.conf
device {
vendor "HITACHI"
product "OPEN-V"
path_grouping_policy multibus
path_checker readsector0
getuid_callout "/opt/sysadmin/_id_rhel6 --whitelisted -g -u /dev/%n"
}
[root@tmsdba01 ~]# ls /sys/class/fc_host/
host0 host1 host2 host3
[root@tmsdba01 ~]# cat /sys/class/fc_host/host[0-3]/port_state
Online
Online
Online
Online
[root@tmsdba01 ~]# iostat 2 | egrep 'sdg|sdx|sdao|sdbg|dm-9'
sdg 24.92 892.02 13.48 1001627040 15132417
sdx 24.92 891.97 13.47 1001567398 15125828
sdao 24.92 892.06 13.34 1001670255 14978704
sdbg 24.92 891.86 13.52 1001438592 15177297
dm-9 99.65 3559.14 53.80 3996448245 60414246
sdg 2.50 18.00 0.50 36 1
sdx 3.00 80.00 1.00 160 2
sdao 2.50 33.00 32.00 66 64
sdbg 2.50 48.00 1.50 96 3
dm-9 10.50 179.00 35.00 358 70
sdg 1.50 16.00 32.00 32 64
sdx 1.50 32.00 0.50 64 1
sdao 2.00 48.50 0.00 97 0
sdbg 1.50 32.00 4.00 64 8
dm-9 6.50 128.50 36.50 257 73
Again - opinion here ;-)
I think the loop assumption is fair. When you think about it, though, the sequential aspect only becomes an issue if one part of the loop is not as responsive as the rest (i.e. pathA/B/C respond in 5ms and pathD in 10ms - then the entire operation would be waiting on pathD). I believe there is a way to tune multipath for access I would refer to as "tiered" (some paths faster/better than others). I can only assume that the timing differential between when pathA returns an ACK and when pathD does would literally be 3 clock cycles. So, very inconsequential.
My question is: can those requests happen simultaneously, each one on its own path (I mean, can they overlap on the timeline, if that makes sense)?
Or does multipath wait for the 1st request to finish before sending the 2nd one down the other path (reading or writing over only one path at a time, as I said before)?
I guess the latter. But I may be wrong ...
I get what you mean (I think) - you're describing an out-of-order scenario. This really would be a fun conversation to hear at Summit. I wonder if this dilemma is a Kernel method, or multipath, or both?
To simplify, if we needed a 1MB chunk of data for an operation, MP would then break the read into 4 equal parts (256K) and send the request down each path at the same time. I believe it would send the 4 reads and accept what it could (out-of-order) and wait until the final chunk is received and then assemble them in the correct order. And since I assume that's how it works, I would also assume that scenario must occur quite often. EDIT: I don't think anything can EVER happen simultaneously. Regardless of how many procs, HBAs, etc.. I assume there has to be at least one clock cycle difference between every single operation.
I now wonder if IP multipathing actually sends the same "bits" down multiple paths, or in essence round-robins everything sequentially...
If we discussed this in college, I was asleep!
In general, multipathing solutions don't do "I've got 1MB of traffic and four paths I can send it down, so I'll cut the datagram into four 256KB chunks and send 256KB down each path." In the aggregate, doing so would be VERY bad for a shared SAN. In general, you want your I/Os to be as large as a given path will allow. Smaller packets mean greater processing overhead. In a busy storage environment (e.g., multiple hosts attached to the same SAN array), too many undersized I/Os can saturate the array's storage processors' processing capability long before reaching their maximum ingest size limits (think about the performance difference between transferring scads of tiny files versus a few large files where the aggregate data-set sizes are identical).
Sort of like RAID stripe-widths, you have a set atomicity. If your original I/O-chunk is greater than one width-wide and up to two widths-wide, and you've got four available paths, the original data-chunk will get split into only two chunks that are transmitted across two of the four paths. How subsequent chunks are broken down will depend on your multipathing routing policy.
Back when I was doing storage optimization work for a storage software vendor, one of the most frequent tuning tasks was optimizing the multipathing software's framing atomicity to the array's atomicity. As the software and arrays matured, much of the need for that kind of tuning went away as the storage software vendor partnered with array vendors to create plugins that auto-tuned the host-side multipather's I/O policies to the attached array's capabilities. :p
In any kind of I/O-balancing scenario (IP or FC networking, DNS lookups, backup media server selection, etc.), you're typically going to have three option types: failover, round-robin, or adaptive selection.
In failover, no matter how many paths you have, you only ever use one path. The other paths are there for redundancy/service-resiliency.
In a round-robin scenario, you're taking X amount of data and chunking it up into uniformly sized chunks, then spitting those chunks out across as many paths as are available. You don't really care whether path A and path B have uniform effective throughput capability (i.e., is path A 100% idle while path B might be 90% saturated). You send the IOs out each path and let your buffers and queues handle the overruns.
In adaptive multipathing, your algorithm is sensitive to the responsiveness of any given path and distributes the chunks based on each path's measured available carrying capacity.
You need to tweak your path_select* (and related multipath policy-options) to get the best performance out of your HBA/SAN/array capabilities.
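As a rough multipath.conf sketch of those three styles (the selector names are the ones shipped with RHEL 6.x; the values are illustrative, and the right combination still depends on your HBA/SAN/array):
defaults {
        # "failover" grouping gives one active path with the rest on standby;
        # "multibus" puts every path into one round-robin group
        path_grouping_policy    multibus
        # adaptive alternatives shipped with RHEL 6.x: "queue-length 0", "service-time 0"
        path_selector           "round-robin 0"
}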
Hello all.
I am an experienced multipath user, with over 11PB of usable storage connected to over 20 GPFS clusters, all running dm-multipath, with several storage vendors and several storage array models. We are getting 98+% scaling across 4 x 8 gbit FC host controllers, and 92+% scaling across 6 storage arrays.
There are many different parameters in at least 7 layers of the Linux IO stack that need to be cross-coordinated with ultra-large IO and wide striping in mind:
• Storage
• Fibre Channel / SAS / IB SRP Driver
• SCSI Device Handler Module
• Block Layer
• DM-Multipath
• Logical Volume Manager
• File System
A constriction in any of the seven layers can limit the overall end-to-end efficiency and throughput. Such a constriction can also mask the effect of changes in other layers.
You can often experiment by changing some parameters at a given level ... with no apparent change. In this case, there are often constrictions above and/or below the layer you are changing that are affecting the outcome.
Automotive Drive Train Analogy
Using an automotive analogy, request-based DM-Multipath under Red Hat® Enterprise Linux® (RHEL) 6.1+ is a competent element of the storage "drive train". However, like automotive drive trains, all the proper "gear ratios" need to be assigned depending on the capabilities of the sub-components. The end-to-end gear ratio is an arithmetic function of all the intermediate components with multiple combinations yielding similar end results. There can be more than one "right" combination.
The gear ratios and number of transmission speeds for a "car" will be much different from those of a tractor-trailer truck. The "out of the box" experience of RHEL Linux 6.1+ and DM-Multipath is closer to a mid-range "car" than a tractor-trailer truck in this analogy. Putting a high power truck engine into a "car" drive train will be sub-optimal without changing some gear ratios. Correspondingly, putting a car engine into a "truck" drive train will also be sub-optimal without changing some gear ratios.
Linux DM-Multipath and the various proprietary multipath managers are components in the disk "drive train", and they all have their own view of their internal "gear ratios" and what they expect the “engine” and the rest of the drive train to look like. Some proprietary multipath “gear ratios” are very appropriate for many combinations of engines and drive trains, but are relatively unchangeable.
The Linux native DM-Multipath facilities can be more flexible, but may have to be explicitly configured to operate at stellar levels. By properly crafting all the “gear ratios”, request-based DM-Multipath can scale to higher levels of performance than many proprietary multipath managers can.
To your specific "academic" question: you CAN tailor the Linux IO stack to perform either parallel or multiplexed IO, depending on the ratio of the IO size to the max IO sizes of the intervening layers. Using a parallel capability can improve throughput for a given IO, but does so by consuming more CPU overhead and using more resources. The parallel smaller IOs, when eventually serviced by the storage, could run slower or faster depending on their relationship with "full stripes" on the storage array.
Also, individual components in the IO stack "drive train" have different capabilities and constraints, and may handle IO above certain sizes, or below other sizes, less well. So the "academic" discussion ends up being tainted by the practical constraints of a specific fibre channel controller, driver, or storage model.
Of course, doing "large IO" that ends up being fragmented on disk, requiring multiple IOs and seeks to be serviced, is usually inefficient. So the filesystem and LVM can play a big part.
For this quick discussion ... I am referring to large, multi-MB RANDOM IO, where the IO_SIZE worth of data is contiguous on disk. Accessing a single disk in physical sequential order is much easier, but breaks when additional "sequential" streams are competing for the same disks. Scale this to dozens and hundreds of sequential streams ... and the access pattern looks random.
Here is a brief cheat sheet.
Storage:
The storage should be configured to be IO_SIZE “friendly” and large-IO focused. In general, RAID groups should be configured with the smallest stripe segment size that allows IO_SIZE IOs without an extra rotational latency penalty.
From a practical standpoint, many storage vendors use fixed 64kb stripe segment sizes, which limits a (8+2) RAID6 to a full stripe width of only 8 x 64kb = 512kb. Practitioner observation: few storage vendors can perform more than 4 back-to-back full-stripe IOs without an additional rotational penalty. In this case, the storage vendor is likely limited to 4 x full-stripe = 4 x 512kb = 2MB IO size, without hitting a performance penalty. For this type of vendor, a 4MB IO will likely not "scale" compared to 2MB IO. It will be faster, but not as fast as expected, due to the additional rotational latency.
BTW: There are vendors out there with 512kb stripe segment sizes, so a (8+2) RAID6 topology would have a 4 MB hardware full-stripe. This type of storage can likely perform 16MB IO, without a performance penalty.
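To make that arithmetic concrete, here it is as a tiny shell sketch (the segment size, data-disk count, and back-to-back limit are the example values assumed above, not universal constants):
SEGMENT_KB=64          # per-disk stripe segment size
DATA_DISKS=8           # data disks in an (8+2) RAID6
BACK_TO_BACK=4         # full-stripe IOs the array handles without an extra rotational penalty
FULL_STRIPE_KB=$((SEGMENT_KB * DATA_DISKS))        # 512 KB
MAX_IO_KB=$((FULL_STRIPE_KB * BACK_TO_BACK))       # 2048 KB = 2 MB
echo "full stripe: ${FULL_STRIPE_KB} KB, largest penalty-free IO: ${MAX_IO_KB} KB"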
If you are really doing RANDOM IO, you may not want to enable read-ahead caching on the storage array. Enable durable write caching if it is faster. If the cache mirroring overhead is high, it may be faster overall to disable write cache. If you are using RAID5 or RAID6, you most often need write caching enabled to avoid excessive read/modify/write cycles for less-than-full-stripe IOs.
Fibre channel controller and driver.
High speed controllers belong in high speed PCIe slots. A dual-port 8 Gbit FC controller needs a PCIe 2.0 x4 (4-lane) slot, for example. The "lspci -vv" command can be used to explore the PCI topology. Are the high speed controllers in high speed slots?
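For example (the PCI address below is a placeholder; take the real one from the first command), the negotiated link width and speed show up in the LnkCap/LnkSta lines:
lspci | grep -i 'fibre channel'
lspci -vv -s 0b:00.0 | grep -E 'LnkCap|LnkSta'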
Most high speed controllers also support the PCI “MSI-X” feature (Message Signaled Interrupts – Extended). It should be enabled. This is typically done via a modprobe “options” entry in the modprobe.conf file. For the Emulex fibre channel driver, the modprobe.conf syntax is:
options lpfc lpfc_use_msi=2
Other vendors’ drivers have different syntax. Use “modinfo” on the driver module name to display the available driver options.
The MSI-X status can be validated via "cat /proc/interrupts".
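For example, counting the vectors the Emulex driver registered gives a quick hint (a sketch only; multiple per-port vectors usually mean MSI/MSI-X is active, and the exact interrupt-type string in /proc/interrupts varies by kernel version):
grep lpfc /proc/interrupts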
The Linux “scatter/gather” table size needs to be large enough to allow IO_SIZE IO, if possible. For most drivers, this typically requires a “sg_tablesize” value of 256 or greater for 4MB IO. Different vendors have different defaults for this parameter and may require a modprobe.conf entry to increase the value. The QLogic driver defaults to a value of 1024. For Emulex, a modprobe.conf entry needs to be added to increase the value to 256 or greater, such as:
options lpfc lpfc_sg_seg_cnt=256
The LSI SAS driver, “mpt2sas”, uses the parameter named “max_sgl_entries” to control this value. Its maximum value in RHEL 6.x is currently only 128.
The value of sg_tablesize can be validated via:
“cat /sys/class/scsi_host/host{n}/sg_tablesize”
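A quick loop over every SCSI host makes it easy to spot one that did not pick up the larger value (host numbering is system-specific):
for h in /sys/class/scsi_host/host*; do
    printf '%s: %s\n' "$h" "$(cat "$h"/sg_tablesize)"
done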
If any changes are made to the modprobe.conf file, the kernel boot image needs to be rebuilt (via “dracut” or “mkinitrd”) and the system rebooted.
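As an illustration only (on RHEL 6.x the options usually live in a file under /etc/modprobe.d/, and the file name below is arbitrary), the sequence looks like:
# /etc/modprobe.d/lpfc.conf containing:  options lpfc lpfc_use_msi=2 lpfc_sg_seg_cnt=256
dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
# ... then reboot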
SCSI Device Handler Module Pre-Load
This is not actually a "large IO" issue, but a tuning that should be done on all systems with a large number of disks.
The following discussion about pre-loading SCSI device handlers is normally documented in the Linux “readme” or “release notice” documentation, but it is often overlooked.
Not all storage systems support fully symmetric “active/active” path topologies. For storage systems with asymmetric or “active/passive” topologies, special handling is needed during the boot path discovery process to avoid unnecessary disk path IO errors and delays.
Storage-specific SCSI “device handler” modules must be loaded early in the kernel boot process. These storage specific device handler modules are intelligent enough to avoid sending IO to the non-primary active disk paths during the path discovery process, before Linux DM-Multipath is running. These modules can be found at: /lib/modules/{version}/kernel/drivers/scsi/device_handler/*
The typical “in-box” storage specific modules for RHEL 6.x are:
scsi_dh_alua
scsi_dh_emc
scsi_dh_hp_sw
scsi_dh_rdac
The storage vendors may also provide their own SCSI device handler module as part of their Linux driver kit.
Unfortunately, the default Linux kernel build procedure has inconsistently pre-loaded these required SCSI device handlers over time. Depending on the Linux version and patch level being used, these SCSI device handler modules may or may not be pre-loaded by default. Failure to pre-load these modules can cause greatly elongated boot times as the disk path discovery process waits for ten second timeouts on the IO commands sent to the non-active paths. We experienced a 48-minute boot time delay on a system with 288 passive paths without the proper SCSI device handler pre-loaded, for example.
For RHEL 6.x, the workaround is to add the “rdloaddriver=scsi_dh_xxxx” option to the “kernel” statement in the /boot/grub/grub.conf file. For example, for rdac-style storage, the option “rdloaddriver=scsi_dh_rdac” would be added.
For RHEL 5.x, the workaround is to add the “--preload=scsi_dh_xxxx” to the mkinitrd command used to build the kernel. Other Linux distributions use similar procedures to pre-load kernel modules early during boot.
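For example, an RHEL 6.x grub.conf kernel line for rdac-style storage might look like the following (the kernel version and root device are placeholders; the rdloaddriver= option is the only part that matters here):
kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root rdloaddriver=scsi_dh_rdac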
Linux Block Layer *** this is important
At the Linux block layer, you want to enable IO_SIZE or larger. For file systems with sophisticated IO management routines, you may also want to disable most block-layer-level IO scheduling features. In general, it is better to tell the kernel "get out of my way" until you understand what is going on, and then add back the various "smart" features that are worthwhile. Fortunately these settings are specified on a per-path basis, so you can use "cfq" timesharing scheduling for your home directory and "noop" for a large-file Lustre file system.
scheduler noop
read_ahead_kb 0
max_sectors_kb The IO_SIZE in kb or greater. Note ... the default is only 512kb.
These parameters are changed by setting the corresponding parameter value in:
/sys/block/$D/queue/*
Where $D is the “sd” disk path name or “dm-*” multipath device name. The parameters associated with each individual disk path need to be properly configured. For Linux releases with request-based DM-Multipath, such as RHEL 6.x, the “dm-*” multipath pseudo-disks’ parameters also need to be properly configured.
Remember ... in RHEL 6.x, you need to configure the block-layer parameters for the dm-* entries as well. Only when you "open up" the block-layer parameters at both the dm-* and slave-path levels can you do large IO.
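A minimal sketch of applying the three settings above (the device names are just the examples from earlier in this thread; in practice you would loop over every slave sd* path and every dm-* multipath device from a boot script or udev rule, and max_sectors_kb cannot exceed the device's max_hw_sectors_kb):
for d in sdg sdx sdao sdbg dm-9; do
    echo noop > /sys/block/$d/queue/scheduler 2>/dev/null   # some dm-* devices expose no scheduler
    echo 0    > /sys/block/$d/queue/read_ahead_kb
    echo 4096 > /sys/block/$d/queue/max_sectors_kb          # allow 4 MB IOs
done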
Multipath:
Along with the settings needed for the specific model of storage that you are using, there are a few additional critical parameters:
The two key multipath.conf parameters that need to be properly configured are “rr_min_io” and “rr_weight”:
rr_min_io 1 (under RHEL 6.1)
rr_min_io_rq 1 (under RHEL 6.2+)
rr_min_io 8 (under RHEL 5.x = 4MB / 512KB)
rr_weight uniform (NOT “priorities”)
Under RHEL 6.2+ this results in each individual IO (limited by the block layer) being sent down a different path. If you have 2 active paths, with a default block-layer max size of 512kb, then an incoming 1 MB IO will be split in 2 and sent along both paths in parallel. If the storage can't re-coalesce the IO back to the large size, performance may suffer.
Some in-box and vendor-provided multipath.conf entries use default values of 100 to 1000 for rr_min_io or rr_min_io_rq, and "priorities" for rr_weight rather than uniform. On one popular brand of storage, the "priorities" setting results in a "weight" of 6, multiplied by an rr_min_io of 100 to 1000 ... so that 600 to 6000 IOs are sent to the same path before switching to the second path ... quite a bit of clumping on one path, and starvation of the others. If your disk is doing 60 IO/sec, 600 to 6000 IOs could be 10 to 100 seconds' worth of IO.
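Put together, a device stanza along these lines illustrates the intent (RHEL 6.2+ parameter names; the vendor/product strings are the HITACHI example from earlier in the thread, and the rest of the stanza must of course match your array's requirements):
device {
        vendor                  "HITACHI"
        product                 "OPEN-V"
        path_grouping_policy    multibus
        path_selector           "round-robin 0"
        rr_min_io_rq            1
        rr_weight               uniform
}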
Logical Volume Manager
The Linux Logical Volume Manager (LVM) is a thin software layer on top of logical disks that allows concatenation, striping, and other combinations of logical disks into larger virtual disks. Logical volume management is a form of host-based storage virtualization providing more flexible methods of allocating space than conventional partitioning.
You may also use the logical volume management functionality built into the advanced file systems rather than the Linux Logical Volume Manager or the Linux Clustered Logical Volume Manager. In our configuration the Linux LVM is not used, so no special configuration notes are needed for it.
File System (large IO aware)
The file system needs to be configured to allow IO_SIZE or greater. For IBM GPFS, this is the cluster-wide “maxblocksize” parameter, and the per-file-system “blocksize” parameter (-B option on mmcrfs).
It is also recommended to use the “scatter” block allocation scheme (-j option on mmcrfs).
Other large-file optimized file system managers have similar settings. Quantum StorNext, for example, uses a combination of their “blocksize” and “stripe breadth” to convey the largest contiguous IO size.
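Tying the GPFS parameters above together, a hedged illustration (exact mmcrfs argument order differs between GPFS releases; the file system name and NSD stanza file are placeholders):
mmchconfig maxblocksize=4M                      # cluster-wide ceiling
mmcrfs gpfsfs1 -F nsd.stanza -B 4M -j scatter   # 4 MB blocksize, scatter allocation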
Net-net ... you can do large IO with proper selection of sub-components and proper configuration. That IO can be multiplexed across paths, or done in parallel across paths for large IO. The request-based multipath in RHEL 6.1+ helps dramatically.
With a strong enough storage farm, we're doing over 2,950 MiB/sec across 4 x 8 Gbit FC controllers ... using a single-threaded "dd". We have also done over 5,900 MiB/sec in read/write full duplex mode.
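For reference, a single-threaded streaming-read check of that sort might look like the following (device name, block size, and count are placeholders; iflag=direct bypasses the page cache so you measure the paths rather than RAM):
dd if=/dev/mapper/mpathX of=/dev/null bs=4M count=10240 iflag=direct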
I would recommend starting with "round-robin" as the multipath path-selector. Once you understand this configuration, you can try "queue-length 0" or "service-time 0". "Service-time 0" works well if your IO is not otherwise balanced. If the IO is very balanced to begin with, both service-time and queue-length degenerate to round-robin. Also, remember, if you have many LUNs ... much of this is a zero-sum game. If you bias IO to one path for a given LUN, you are stealing bandwidth on that path for another LUN.
Scaling beyond 2 active paths is tricky and requires some extra diligence. Using the defaults often results in negative scaling ... where 4 active paths run slower than 2 active paths. Beware.
However, it can be done. We normally operate with 8 active paths across 4 controllers.
