What are DRAM faults?
Dynamic random access memory (DRAM) provides the bulk of low-latency random-access data storage in a computer.
DRAM-related faults can be roughly classified along the following lines:
1. Data is read from an incorrect address.
2. Data is written to an incorrect address.
3. Data is corrupted while it is read (that is, wrong data is returned, but the data in memory is unchanged).
4. Data is corrupted while it is written (that is, it is written to the right address, but the wrong bits are stored).
5. Data is corrupted spontaneously, without a write to that memory address.
The “dynamic” part of DRAM refers to the fact that a periodic refresh is needed to avoid spontaneous corruption (item 5 in the list above). The system will perform this refresh automatically, in the background. The refresh operation needs energy and blocks access to the memory regions being refreshed, which is why there is an incentive to minimize refresh.
Systems come with varying levels of protection against DRAM faults. During hardware design, simulations are used to make sure that the refresh operation is sufficient to prevent data corruption, and that the memory bus does not corrupt addresses and data in transit (items 1 to 4 in the list). Current consumer systems typically do not protect stored data at all because there is no redundancy, and rely solely on a well-designed memory refresh. Many server systems use DRAM with error-correction codes (ECC), greatly increasing resiliency against spontaneous corruption. Some server systems have further protections.
How can DRAM faults be triggered?
DRAM faults can happen spontaneously, like any hardware failure. Sometimes this is attributed to “cosmic rays”.
If DRAM modules are defective, or if there is some other hardware defect such as a faulty power supply operating outside its specification, DRAM faults can occur at a rate at which they become quite noticeable and affect system stability.
It has been known for a long time that certain memory access patterns are more likely to expose defective DRAM modules. The memtest86+ RAM tester provided as part of Red Hat Enterprise Linux uses such test patterns to expose hardware defects in the memory subsystem.
How does the RowHammer attack work?
“Rowhammer” refers to a particular technique for DRAM stressing that uses the CLFLUSH/CLFLUSHOPT/CLWB family of CPU instructions. These are unprivileged instructions provided by the i386 and x86_64 architectures, and they can be abused to increase DRAM traffic to specific locations. This makes them particularly interesting for writing DRAM fault inducers. Such programs can run as ordinary processes or in hypervisor guests, because the CLFLUSH instruction does not require special privileges.
The kind of corruption induced by “rowhammer” corresponds to item 5 in the initial list. Stored data is altered, without an explicit memory write, and not necessarily at the addresses being accessed. Reportedly, this happens because the access pattern invalidates designed-in assumptions about the required DRAM refresh rate: the memory refresh does not happen often enough to preserve the stored data reliably when DRAM is accessed in this way.
The CLFLUSH instruction makes it possible to create a particularly effective memory stresser that works against a large variety of systems. CPUs provide other means (such as non-temporal memory accesses) which can bypass caches as well. A particularly clever stress tester might achieve a very similar effect by carefully choosing memory accesses so that cache lines are evicted as desired.
What is the SPOILER vulnerability and how does it relate to the RowHammer attack?
The SPOILER vulnerability is a micro-architectural leak which allows an attacker to determine virtual-to-physical page mappings from unprivileged user space processes. It leverages the data dependency of speculative load and store operations in the Memory Order Buffer and uses mfence instructions to measure timing discrepancies that reveal the memory layout. This makes it possible to detect ranges of contiguous physical memory pages, which makes RowHammer much more effective and easier to carry out, reducing the attack time from weeks to seconds.
What is the impact of DRAM faults?
Memory is a fundamental system component. Everything depends on its correct operation, and if it does not work correctly, all bets are off. Consequently, most DRAM faults have the potential to undermine the correct operation of the system. This includes enforcement of security boundaries and protection of critical data, such as private key material.
If such faults happen, both local and remote attackers may benefit from them. As usual, local attackers are in a better position to trigger DRAM faults deliberately.
Are sandboxing solutions affected?
SELinux and containers do not offer protection because they do not intercept the entire instruction stream before execution, so they cannot block instructions such as CLFLUSH.
What about hypervisors?
Hypervisors such as RHEV do not currently provide protection against DRAM faults and their abuse by guests.
Is the Chrome browser able to protect users?
Google has publicly claimed that they addressed a vulnerability related to DRAM fault injection in their Chrome browser, specifically in the Native Client functionality.
The Native Client component of Chrome uses a trusted compiler approach to execute native machine code downloaded from untrusted sources. The instruction stream is scanned and audited for dangerous instructions that would allow escaping the sandbox. Previous Chrome versions permitted the CLFLUSH instruction, which enabled CLFLUSH-based DRAM fault inducers to run. Native Client in current Chrome versions does not permit the CLFLUSH instruction, which prevents this approach from working.
Is a software-based solution possible for this kind of problem?
It might be theoretically possible to map memory to hypervisor guests or operating system processes in such a way that “rowhammer”-style DRAM fault injection only affects data within the same security domain (guest or process). It is unlikely that this is feasible in practice for several reasons:
- It would be necessary to disable read-only page sharing across security boundaries (such as Kernel Same-page merging, or shared read-only mappings between processes), greatly reducing density of system loads.
- The mechanism is highly dependent on the memory configuration, and would only be effective for very specific system configurations, depending on CPU silicon and microcode revision, system firmware version, mainboard revision, and so on.
Even today, on systems with such facilities, it is possible to monitor the occurrence of MCE events (using mcelog), and especially EDAC (Error Detection And Correction) counters, using the edac-util command from the edac-utils package. These tools can be used to spot the early signs of DRAM-related system degradation, before exploitable DRAM faults happen.
What should customers do to deal with DRAM faults?
Red Hat considers DRAM faults an issue that hardware vendors need to address, with sufficient design reserves and technologies like ECC memory.
Red Hat recommends running memtest86+ in case DRAM defects are suspected, for example after the kernel has logged MCE events. See this article for information about relevant log messages.
System vendors may have additional information about how DRAM faults affect their hardware platforms, and Red Hat customers are advised to contact them for additional information, as required.