5.7. Kernel Notes

This section notes the differences between kernel 2.6.9 (on which Red Hat Enterprise Linux 4 is based) and 2.6.18 (which Red Hat Enterprise Linux 5 will inherit) as of July 12, 2006. Additional features that we are currently working on upstream (for example, virtualization) and that will appear late in 2.6.18 or in 2.6.19 are not highlighted here. In other words, this list shows only what is already included in the upstream Linus tree, not what is currently in development. Consequently, it is neither a final nor a complete list of the new Red Hat Enterprise Linux 5 features, although it does give a good overview of what can be expected. Note also that this section picks out only highlights of upstream changes and is therefore not comprehensive; it omits several low-level hardware support enhancements and device driver details.
The following is a good source for a next level-of-detail view:
Performance / Scalability
  • Big Kernel Lock preemption (2.6.10)
  • Voluntary preemption patches (2.6.13) (subset in Red Hat Enterprise Linux 4)
  • Lightweight user-space priority inheritance (PI) support for futexes, useful for real-time applications (2.6.18)
  • New 'mutex' locking primitive (2.6.16)
  • High resolution timers (2.6.16)
    • In contrast to the low-resolution timeout API implemented in kernel/timer.c, hrtimers provide finer resolution and accuracy depending on system configuration and capabilities. These timers are currently used for itimers, POSIX timers, nanosleep and precise in-kernel timing.
  • Modular, on-the-fly switchable I/O schedulers (2.6.10)
    • In Red Hat Enterprise Linux 4, the I/O scheduler could be selected only through a boot option, and only system-wide rather than per-queue.
  • New Pipe implementation (2.6.11)
    • 30-90% performance improvement in pipe bandwidth
    • circular buffer allows more buffering rather than blocking writers
  • "Big Kernel Semaphore": turns the Big Kernel Lock into a semaphore
    • reduces latency by breaking up long lock hold times and adding voluntary preemption
  • X86 "SMP alternatives"
  • kernel-headers package
    • replaces the glibc-kernheaders package
    • is better suited to the new headers_install feature of the 2.6.18 kernel
    • notable kernel header-related changes:
      • removed <linux/compiler.h> header file, as it is no longer useful
      • removed _syscallX() macros; user-space should use syscall() from the C library instead
      • removed <asm/atomic.h> and <asm/bitops.h> header files; the C compiler provides its own atomic built-in functions, which are better suited to user-space programs
      • content previously protected with #ifdef __KERNEL__ is now removed completely with the unifdef tool; defining __KERNEL__ in order to view parts which should not be visible to user-space is no longer effective
      • removed the PAGE_SIZE macro from some architectures, due to variance in page sizes; user-space should use sysconf(_SC_PAGE_SIZE) or getpagesize() instead
    • removed several header files and header content to make the headers better suited to user-space
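
As a rough illustration of querying the page size at run time instead of relying on the removed PAGE_SIZE macro, the following sketch uses Python's wrappers around sysconf(_SC_PAGE_SIZE) and getpagesize(). Python is used here only for brevity; the helper names page_size and pages_needed are hypothetical and are not part of any header or library.

```python
import os
import resource


def page_size():
    """Return the system page size in bytes, queried at run time."""
    # Equivalent to the C call sysconf(_SC_PAGE_SIZE).
    return os.sysconf("SC_PAGE_SIZE")


def pages_needed(nbytes):
    """Return the number of whole pages required to hold nbytes."""
    size = page_size()
    # Ceiling division: round up to the next full page.
    return (nbytes + size - 1) // size


if __name__ == "__main__":
    # resource.getpagesize() wraps the C getpagesize() function.
    print("sysconf:", page_size(), "getpagesize:", resource.getpagesize())
    print("pages for 10000 bytes:", pages_needed(10000))
```

Because the value is read from the running system, the same code behaves correctly on architectures and kernel configurations that use different page sizes.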
Generic Feature Additions
  • kexec and kdump (2.6.13)
    • diskdump and netdump have been replaced by kexec and kdump, which ensure faster boot-up and creation of reliable kernel vmcores for diagnostic purposes. For more information and configuration instructions, please refer to /usr/share/doc/kexec-tools-<version>/kexec-kdump-howto.txt (replace <version> with the corresponding version of the kexec-tools package installed).
    • Note that at present, virtualized kernels cannot use the kdump function.
  • inotify (2.6.13)
    • user interface for this is through the following syscalls: sys_inotify_init, sys_inotify_add_watch, and sys_inotify_rm_watch.
  • Process Events Connector (2.6.15)
    • reports fork, exec, id change, and exit events for all processes to user-space.
    • Applications that may find these events useful include accounting / auditing (for example, ELSA), system activity monitoring (for example, top), security, and resource management (for example, CKRM). Semantics provide the building blocks for features like per-user-namespace, "files as directories" and versioned file systems.
  • Generic RTC (RealTime Clock) subsystem (2.6.17)
  • splice (2.6.17)
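
The inotify interface listed above can be sketched from user-space as follows. This is a minimal example, assuming a Linux system whose C library exports the inotify_init, inotify_add_watch and inotify_rm_watch wrappers for the syscalls; the function names parse_events and watch_once are hypothetical, and only three of the available event mask bits are shown.

```python
import ctypes
import os
import struct

# Event mask bits as defined in <sys/inotify.h>.
IN_MODIFY = 0x00000002
IN_CREATE = 0x00000100
IN_DELETE = 0x00000200

# Fixed part of struct inotify_event: int wd; uint32_t mask, cookie, len;
_EVENT_HEADER = struct.Struct("iIII")


def parse_events(buf):
    """Decode bytes read from an inotify descriptor into
    (wd, mask, cookie, name) tuples."""
    events = []
    offset = 0
    while offset + _EVENT_HEADER.size <= len(buf):
        wd, mask, cookie, name_len = _EVENT_HEADER.unpack_from(buf, offset)
        offset += _EVENT_HEADER.size
        # The name field is NUL-padded to name_len bytes.
        raw_name = buf[offset:offset + name_len]
        offset += name_len
        name = raw_name.split(b"\0", 1)[0].decode(errors="replace")
        events.append((wd, mask, cookie, name))
    return events


def watch_once(path, mask=IN_CREATE | IN_MODIFY | IN_DELETE):
    """Block until at least one event occurs under path (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    try:
        wd = libc.inotify_add_watch(fd, os.fsencode(path), mask)
        if wd < 0:
            raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
        data = os.read(fd, 4096)  # blocks until an event is queued
        libc.inotify_rm_watch(fd, wd)
        return parse_events(data)
    finally:
        os.close(fd)
```

Calling watch_once("/tmp") and then creating a file in /tmp from another shell should return a tuple whose mask has the IN_CREATE bit set and whose name is the new file's name.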
File System / LVM
  • EXT3
    • support for Extended Attributes in the body of large inode in ext3: saves space and improves performance in some cases (2.6.11)
  • Device mapper multipath support
  • ACL support for NFSv3 and NFSv4 (2.6.13)
  • NFS: supports large reads and writes on the wire (2.6.16)
    • The Linux NFS client now supports transfer sizes of up to 1MB.
  • VFS changes
  • Big CIFS update (2.6.15)
    • features several performance improvements as well as support for Kerberos and CIFS ACL
  • autofs4: updated to provide direct mount support for user-space autofs (2.6.18)
  • cachefs core enablers (2.6.18)
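
To check whether an NFS mount actually uses the larger transfer sizes mentioned above, the negotiated values can be read from the mount options. The sketch below assumes the sizes are reported through the standard rsize and wsize options in /proc/mounts-style text; the function name nfs_transfer_sizes and the sample mount line are hypothetical.

```python
MAX_TRANSFER = 1024 * 1024  # 1MB, the upper limit stated in this section


def nfs_transfer_sizes(mounts_text):
    """Map each NFS mount point to its (rsize, wsize) in bytes.

    mounts_text is text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    sizes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        # Keep only key=value options such as rsize=32768.
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        rsize = int(opts["rsize"]) if "rsize" in opts else None
        wsize = int(opts["wsize"]) if "wsize" in opts else None
        sizes[fields[1]] = (rsize, wsize)
    return sizes


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mount_point, (rsize, wsize) in nfs_transfer_sizes(f.read()).items():
            print(mount_point, "rsize:", rsize, "wsize:", wsize)
```

The values finally in effect depend on what the client and server negotiate, so a mount may report less than the 1MB maximum.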
Security
  • Multilevel security implementation for SELinux (2.6.12)
  • Audit subsystem
    • support for process-context based filtering (2.6.17)
    • more filter rule comparators (2.6.17)
  • TCP/UDP getpeercon: enables security-aware applications to retrieve the entire security context of a process on the other side of a socket using an IPSec security association. If only MLS-level information is needed, or interoperability with legacy UNIX systems is required, NetLabel can be used in place of IPSec.
Networking
  • Added several TCP congestion modules (2.6.13)
  • IPv6: supports several new sockopt / ancillary data in Advanced API (2.6.14)
  • IPv4/IPv6: UFO (UDP Fragmentation Offload) Scatter-gather approach (2.6.15)
    • UFO is a feature wherein the Linux kernel network stack offloads the IP fragmentation of large UDP datagrams to hardware. This reduces the overhead incurred by the stack in fragmenting large UDP datagrams into MTU-sized packets.
  • Added nf_conntrack subsystem (2.6.15)
    • The existing connection tracking subsystem in netfilter can only handle IPv4. There were two ways to add connection tracking support for IPv6: either duplicate all of the IPv4 connection tracking code into an IPv6 counterpart, or (the choice taken by these patches) design a generic layer that can handle both IPv4 and IPv6, so that only one sub-protocol (TCP, UDP, etc.) connection tracking helper module needs to be written. In fact, nf_conntrack is capable of working with any layer 3 protocol.
  • IPv6
    • RFC 3484-compliant source address selection (2.6.15)
    • added support for Router Preference (RFC4191) (2.6.17)
    • added Router Reachability Probing (RFC4191) (2.6.17)
    • added support for Multiple Routing Tables and Policy Routing
  • Wireless updates
    • hardware crypto and fragmentation offload support
    • QoS (WME) support, "wireless spy support"
    • mixed PTK/GTK
    • CCMP/TKIP support and WE-19 HostAP support
    • BCM43xx wireless driver
    • ZD1211 wireless driver
    • WE-20, version 20 of the Wireless Extensions (2.6.17)
    • added the hardware-independent software MAC layer, "Soft MAC" (2.6.17)
    • added LEAP authentication type
  • Added generic segmentation offload (GSO) (2.6.18)
    • can improve performance in some cases, though it needs to be enabled through ethtool
  • DCCPv6 (2.6.16)
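
To give a sense of the work that UFO moves from the network stack to hardware, the following sketch computes how a large UDP datagram is split into MTU-sized IPv4 fragments. It assumes a 20-byte IPv4 header without options and an 8-byte UDP header; the function name ipv4_fragment_sizes is hypothetical and the code models only the arithmetic, not the kernel implementation.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def ipv4_fragment_sizes(udp_payload_len, mtu=1500):
    """Return the IP payload size in bytes of each fragment needed to
    carry a UDP datagram with the given payload length."""
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because the fragment offset field counts 8-byte units.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    remaining = udp_payload_len + UDP_HEADER
    sizes = []
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    for payload in (1472, 1473, 65507):
        print(payload, "bytes ->", len(ipv4_fragment_sizes(payload)), "fragment(s)")
```

With a 1500-byte MTU, a payload of 1472 bytes fits in a single packet, while the largest possible UDP payload (65507 bytes) requires 45 fragments; without UFO each of these is built by the stack in software.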
Added Hardware Support

Note

This section only enumerates the most generic features among many.
  • x86-64 clustered APIC support (2.6.10)
  • Infiniband support (2.6.11)
  • Hot plug
    • added generic memory add/remove and supporting functions for memory hotplug (2.6.15)
  • SATA/libata enhancements, additional hardware support
    • A completely reworked libata error handler; the result of all this work should be a more robust SATA subsystem which can recover from a wider range of errors.
    • Native Command Queuing (NCQ), the SATA version of tagged command queuing - the ability to have several I/O requests to the same drive outstanding at the same time. (2.6.18)
    • Hotplug support (2.6.18)
  • EDAC support (2.6.16)
    • The EDAC goal is to detect and report errors that occur within the system.
  • Added a new ioatdma driver for the Intel(R) I/OAT DMA engine (2.6.18)
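
Errors reported by EDAC can be inspected from user-space. The sketch below assumes that an EDAC memory controller driver is loaded and that it exposes per-controller ce_count (corrected errors) and ue_count (uncorrected errors) files under /sys/devices/system/edac/mc; the function name edac_error_counts is hypothetical.

```python
import glob
import os


def _read_int(path):
    """Read a sysfs file that contains a single integer."""
    with open(path) as f:
        return int(f.read().strip())


def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} for each memory
    controller directory (mc0, mc1, ...) found under root."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        corrected = _read_int(os.path.join(mc_dir, "ce_count"))
        uncorrected = _read_int(os.path.join(mc_dir, "ue_count"))
        counts[os.path.basename(mc_dir)] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in edac_error_counts().items():
        print(name, "corrected:", ce, "uncorrected:", ue)
```

On a system without a loaded EDAC driver the directory is absent and the function returns an empty dictionary.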
NUMA (Non-Uniform Memory Access) / Multi-core
  • Cpusets (2.6.12)
    • Cpusets now provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks. Cpusets constrain the CPU and memory placement of tasks only to the resources within a task's current cpuset. These are essential in managing dynamic job placement on large systems.
  • NUMA-aware slab allocator (2.6.14)
    • This creates slabs on multiple nodes and manages slabs in such a way that locality of allocations is optimized. Each node has its own list of partial, free and full slabs. All object allocations for a node occur from node-specific slab lists.
  • Swap migration (2.6.16)
    • Swap migration allows the physical location of pages to be moved between nodes in a NUMA system while the process is running.
  • Huge pages (2.6.16)
    • Added NUMA policy support for huge pages: the huge_zonelist() function in the memory policy layer provides a list of zones ordered by NUMA distance. The hugetlb layer will walk that list looking for a zone that has available huge pages but is also in the nodeset of the current cpuset.
    • Huge pages now obey cpusets.
  • Per-zone VM counters
    • provide zone-based VM statistics, which are necessary for determining the memory state of a zone
  • Netfilter ip_tables: NUMA-aware allocation. (2.6.16)
  • Multi-core
    • Added a new scheduler domain for representing multi-core systems with caches shared between cores. This makes it possible to make smarter CPU scheduling decisions on such systems, improving performance greatly in some cases. (2.6.17)
    • Power saving policy for the CPU scheduler: with multi-core/SMT CPUs, power consumption can be reduced by leaving some packages idle while others do all the work, instead of spreading the tasks over all CPUs.
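
The CPU and memory node assignments of a cpuset are written as comma-separated lists of numbers and ranges, for example "0-3,8". The following sketch expands such a list into a set of integers; it assumes this list format for the cpuset cpus and mems files, and the function name parse_cpu_list is hypothetical.

```python
def parse_cpu_list(text):
    """Expand a list such as "0-3,8,10-11" into a set of integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or stray commas
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


if __name__ == "__main__":
    allowed = parse_cpu_list("0-3,8,10-11")
    print(sorted(allowed))
    print("CPU 9 allowed:", 9 in allowed)
```

A job placement tool could use the resulting set to check whether a task's requested CPUs fall within its current cpuset before launching it.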