Logical, physical, CPU, ack thread counts
The logical, physical, CPU, and I/O acknowledgement work can be spread across multiple threads. The number of threads for each type can be specified during initial configuration, or changed later by restarting the VDO device.
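As a sketch, thread counts can be adjusted with the vdo manager utility; the device name vdo0 is a placeholder, and the option names shown may vary by version:

```shell
# Sketch only: adjust VDO thread counts with the vdo manager utility.
# "vdo0" is a hypothetical device name; option names may differ by version.
vdo modify --name=vdo0 \
    --vdoLogicalThreads=4 \
    --vdoPhysicalThreads=2 \
    --vdoCpuThreads=2 \
    --vdoAckThreads=1

# The new thread counts take effect after the device is restarted.
vdo stop --name=vdo0
vdo start --name=vdo0
```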
One core, or one thread, can do a finite amount of work during a given time. Having one thread compute all data-block hash values, for example, would impose a hard limit on the number of data blocks that could be processed per second. Dividing the work across multiple threads (and cores) relieves that bottleneck.
As a thread or core approaches 100% usage, more work items will tend to queue up for processing. While this may result in the CPU having fewer idle cycles, queueing delays and latency for individual I/O requests will typically increase. According to some queueing theory models, utilization levels above 70% or 80% can lead to excessive delays, several times longer than the normal processing time. Thus it may be helpful to distribute work further when a thread or core shows 50% or higher utilization, even if those threads or cores are not always busy.
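The sharp growth in delay at high utilization can be illustrated with the classic M/M/1 queueing model, where mean residence time is the service time divided by (1 − utilization). This is only a rough model, not a measurement of VDO itself:

```shell
# M/M/1 model: mean residence time = service_time / (1 - utilization).
# Rough illustration of why latency grows sharply past ~70-80% utilization.
for rho in 0.50 0.70 0.80 0.90; do
    awk -v r="$rho" \
        'BEGIN { printf "utilization %2.0f%% -> latency multiplier %4.1fx\n", r * 100, 1 / (1 - r) }'
done
```

At 50% utilization a request spends about 2x its service time in the system; at 90% the multiplier is roughly 10x, which is why spreading work before saturation pays off.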
In the opposite case, where a thread or CPU is very lightly loaded (and thus very often asleep), supplying work for it to do is more likely to incur some additional cost. (A thread attempting to wake another thread must acquire a global lock on the scheduler's data structures, and may potentially send an inter-processor interrupt to transfer work to another core). As more cores are configured to run VDO threads, it becomes less likely that a given piece of data will be cached as work is moved between threads or as threads are moved between cores — so too much work distribution can also degrade performance.
The work performed by the logical, physical, and CPU threads per I/O request will vary based on the type of workload, so systems should be tested with the different types of workloads they are expected to service.
Write operations in sync mode involving successful deduplication will entail extra I/O operations (reading the previously stored data block), some CPU cycles (comparing the new data block to confirm that they match), and journal updates (remapping the LBN to the previously-stored data block's PBN) compared to writes of new data. When duplication is detected in async mode, data write operations are avoided at the cost of the read and compare operations described above; only one journal update can happen per write, whether or not duplication is detected.
If compression is enabled, reads and writes of compressible data will require more processing by the CPU threads.
Blocks containing all zero bytes (zero blocks) are treated specially, as they commonly occur. A special entry represents such data in the block map, and zero blocks are never written to or read from the storage device. Thus, tests that write or read all-zero blocks may produce misleading results. The same is true, to a lesser degree, of tests that overwrite zero blocks or uninitialized blocks (those never written since the VDO device was created), because the reference count updates performed by the physical threads are not required for zero or uninitialized blocks.
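To keep these shortcuts from skewing benchmark results, test data can be generated from random bytes so that the full write path is exercised; the file path below is a scratch placeholder:

```shell
# Generate incompressible, non-zero test data so benchmarks exercise the
# full write path rather than VDO's special-cased zero blocks.
testfile=$(mktemp)
dd if=/dev/urandom of="$testfile" bs=4096 count=256 status=none

# Confirm the file is the expected size (256 x 4 KiB = 1 MiB).
stat -c '%s' "$testfile"
rm -f "$testfile"
```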
Acknowledging I/O operations is the only task that is not significantly affected by the type of work being done or the data being operated upon, as one callback is issued per I/O operation.
CPU Affinity and NUMA
Accessing memory across NUMA node boundaries takes longer than accessing memory on the local node. With Intel processors sharing the last-level cache between cores on a node, cache contention between nodes is a much greater problem than cache contention within a node.
Tools such as top cannot distinguish between CPU cycles that do useful work and cycles that are stalled. These tools interpret cache contention and slow memory accesses as actual work. As a result, moving a thread between nodes may appear to reduce the thread's apparent CPU utilization while increasing the number of operations it performs per second.
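Hardware performance counters give a more honest picture than top. As a sketch, perf can report stalled cycles for a running thread; supported event names vary by CPU model and kernel version, and the PID placeholder must be filled in:

```shell
# Sketch: compare total vs. stalled cycles for a running thread.
# Replace <pid> with the thread of interest; available event names
# (e.g. stalled-cycles-backend) vary by CPU model and kernel version.
perf stat -e cycles,instructions,stalled-cycles-backend -p <pid> sleep 10
```

A low instructions-per-cycle ratio with many stalled cycles suggests memory or cache contention rather than genuine computational load.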
While many of VDO's kernel threads maintain data structures that are accessed by only one thread, they do frequently exchange messages about the I/O requests themselves. Contention may be high if VDO threads are run on multiple nodes, or if threads are reassigned from one node to another by the scheduler. If it is possible to run other VDO-related work (such as I/O submission to VDO, or interrupt processing for the storage device) on the same node as the VDO threads, contention may be further reduced.
If practical, collect the VDO threads on one node using the taskset utility, and run other VDO-related work on the same node as well. If that node lacks the CPU capacity to keep up with processing demands, consider memory contention when choosing which threads to move onto other nodes. For example, if a storage device's driver maintains a significant number of data structures, it may help to move both the device's interrupt handling and VDO's I/O submissions (the bio threads that call the device's driver code) to another node together. Keeping I/O acknowledgment (the ack threads) paired with the higher-level threads that submit I/O (user-mode threads doing direct I/O, or the kernel's page cache flush thread) is also good practice.
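As a sketch, taskset can confine a new process, or re-pin an existing thread, to the CPUs of a single node. The CPU range, workload command, and thread name below are examples; the actual topology can be checked with numactl --hardware, and VDO thread names differ per system:

```shell
# Launch a workload confined to CPUs 0-7 (e.g. the CPUs of NUMA node 0).
# "some_io_generator" is a placeholder for the actual benchmark command.
taskset -c 0-7 some_io_generator &

# Re-pin an already-running kernel thread by PID; "kvdo0:cpuQ0" is an
# example VDO thread name and may differ on a given system.
pid=$(pgrep -f 'kvdo0:cpuQ0')
taskset -pc 0-7 "$pid"
```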
Frequency throttling
If power consumption is not an issue, writing the string performance to the /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor files, if they exist, might produce better results. If these sysfs nodes do not exist, Linux or the system's BIOS may provide other options for configuring CPU frequency management.
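A minimal loop for this is shown below against a scratch directory so it can be tried safely; on a real system, point the loop at /sys/devices/system/cpu and run it as root:

```shell
# Write "performance" to every scaling_governor node under $dir.
# $dir is a scratch mock of the sysfs layout for safe demonstration;
# on a real system it would be /sys/devices/system/cpu (run as root).
dir=$(mktemp -d)
mkdir -p "$dir/cpu0/cpufreq" "$dir/cpu1/cpufreq"
echo powersave > "$dir/cpu0/cpufreq/scaling_governor"
echo powersave > "$dir/cpu1/cpufreq/scaling_governor"

for gov in "$dir"/cpu*/cpufreq/scaling_governor; do
    [ -e "$gov" ] && echo performance > "$gov"
done

cat "$dir"/cpu*/cpufreq/scaling_governor
rm -rf "$dir"
```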
Performance measurements are further complicated by CPUs that dynamically vary their frequencies based on workload, because the time needed to accomplish a specific piece of work may vary due to other work the CPU has been doing, even without task switching or cache contention.