Large difference in performance observed between RHEL 5 & RHEL 6 when running programs using memory mapping
Environment
- Red Hat Enterprise Linux (RHEL) 6.2
- Red Hat Enterprise Linux 6.4
- kernel-2.6.32-279.19.1.el6.x86_64
Issue
- A large difference in performance between RHEL 5 and RHEL 6 when running programs using memory mapping. In particular, when using
msync
. For instance, the difference for a simple program that opens a file, maps it into memory and then syncs it is:
FIXED IN kernel-2.6.32-358.18.1.el6
RHEL 5
real 0m5.33s
user 0m0.24s
sys 0m0.22s
RHEL 6
real 1m49.92s
user 0m0.21s
sys 0m0.54s
- Another environment sees a similar result:
RHEL 5
time ./mmapRH
real 0m8.593s
user 0m0.349s
sys 0m0.344s
RHEL 6
time ./mmapRH
real 1m24.281s
user 0m0.346s
sys 0m0.744s
- Results from
iozone
(http://www.iozone.org/) when using themmap
andmsync
options (-B
and-G
).
RHEL 5 (Machine 1)
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read
512 4 402 34856 3198002 3393088 2992871 2891 3242347 21572 3102214
RHEL 5 (Machine 2)
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read
512 4 380 23380 1158641 3370026 3239987 10210 3239419 18574 3083779
RHEL 6
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read
512 4 285 525 1484182 1610307 2736355 971 2316371 455 1560822
- An additional set of
iozone
in another environment:
RHEL 5
./iozone -B -G
Iozone: Performance Test of File I/O
Version $Revision: 3.414 $
Compiled for 32 bit mode.
Build: linux
Run began: Sat Mar 19 13:01:50 2033
Using mmap files
Using msync(MS_SYNC) on mmap files
Command line used: ./iozone -B -G
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read
512 4 2176 5920 2245789 3389239 3158460 14555 2370476 13646 2992263
RHEL 6
./iozone -B -G
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read
512 4 177 652 2484401 2780921 2439220 1027 2497211 651 2667377
----------------------------------
Resolution
- Internal Red Hat Bugzilla: 951937 - Provides this fix to RHEL 6.5 release in kernel-2.6.32-431.el6.
-
Internal Red Hat Bugzilla: 981177 - Provides this fix to the RHEL 6.4 Errata release in kernel-2.6.32-358.18.1.el6.
-
Patchset submitted does the following:
-
Introduces small fixes and cleanups to
writeback_in_progress()
and
determine_dirtyable_memory()
routines. -
Provides users (devices and filesystems) of a
backing_dev_info
the ability
to declare whether or not it is necessary to ensure that page contents cannot
change during write-out, while the actual code assumes this always is true. -
Relaxes the
wait_on_page_writeback()
calls, so that they only occur if
something needs it. -
Adjusts the remaining supported filesystems to use this improved
conditional-wait logic in the hope of providing stable page writes on
all filesystems.
-
-
Patches (5):
mm: fix writeback_in_progress()
mm/page-writeback.c: subtract min_free_kbytes from dirtyable memory
bdi: allow block devices to say that they require stable page writes
mm: only enforce stable page writes if the backing device requires it
block: optionally snapshot page contents to provide stable pages
during write
Documentation/ABI/testing/sysfs-class-bdi | 5 ++++
block/blk-core.c | 8 ++++--
block/blk-integrity.c | 4 +++
fs/buffer.c | 2 +-
fs/ext3/super.c | 1 +
fs/ext4/inode.c | 2 +-
fs/fs-writeback.c | 4 ++-
fs/gfs2/file.c | 2 +-
include/linux/backing-dev.h | 7 +++++
include/linux/fs.h | 1 +
include/linux/pagemap.h | 1 +
mm/Kconfig | 13 +++++++++
mm/backing-dev.c | 11 +++++++
mm/bounce.c | 48 ++++++++++++++++++++++++++++---
mm/filemap.c | 3 +-
mm/page-writeback.c | 27 +++++++++++++++++
16 files changed, 127 insertions(+), 12 deletions(-)
Root Cause
- During the course of investigation for Bug 951937, it was discovered that complaints about non-deterministic long write latency
were due to the changes introduced by the following commit:
bac3645 [fs] Stable page cache to better support T10/DIF
- Buffered I/O was being delayed arbitrarily, even when stable page writes
were not necessary.
Diagnostic Steps
Reproducer #1
Compile the following test code (gcc -o mmapRH mmapRH.c
, assuming the source file is named mmapRH.c
) on a RHEL 5 and RHEL 6 machine. Run time mmapRH
on both machines and see the elapsed time difference.
#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#define vm_offset_t size_t
#define SIZE 0x80000
#define PAGE_MASK (sysconf(_SC_PAGE_SIZE) - 1)
#define page_aligned(x) ((((vm_offset_t)(x)) & PAGE_MASK) == 0)
#define trunc_page(x) ((vm_offset_t)(((vm_offset_t)(x)) & ~PAGE_MASK))
#define PAGELET_SIZE 512
int main(){
int *baseaddr;
int j;
int *current;
int fd;
clock_t cputime;
char filename[200];
size_t size = SIZE;
char *buf;
int flag = O_RDWR | O_CREAT;
int map_flags=MAP_SHARED | MAP_FILE;
strcpy(filename,"mmap.out");
j=sysconf(_SC_PAGE_SIZE);
int i;
for( i=0;i<100;i++)
{
flag += O_DSYNC;
fd=open(filename,flag,0666);
ftruncate(fd,size);
current=(int *)mmap((void *)0, size, PROT_WRITE, map_flags, fd, 0);
if (current == (int *)MAP_FAILED) {
perror("mmap");
exit(0);
}
close(fd);
baseaddr=current;
madvise((caddr_t)current,size,MADV_SEQUENTIAL);
for(j=0;j<SIZE;j++)
{
*baseaddr=j;
if ((j) && (page_aligned(baseaddr))){
if (msync((char *)(trunc_page(baseaddr)
- sysconf(_SC_PAGE_SIZE)),
sysconf(_SC_PAGE_SIZE),6) == -1){
perror("msync");
exit(EXIT_FAILURE);
}
}
baseaddr++;
if (baseaddr >= current+SIZE/4)
break;
}
}
}
Reproducer #2
Reproduce using iozone
: Download and compile iozone, run iozone -B -G
on both a RHEL 5 and RHEL 6 and notice the performance difference with write operations.
$ cd iozone3_414/src/current
$ make linux
Reproducer #3
/*
* g++ -O3 --std=c++0x -o test_mmap test_mmap.cpp
*/
#include <vector>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <time.h>
#include <cstdlib>
#include <ctime>
static const size_t MBYTES = 1048576;
static const size_t SEG_SIZE = 4 * MBYTES;
static const size_t STEP_SIZE = SEG_SIZE / 2048;
static inline uint64_t
get_clock_ticks(){
register uint32_t low, high;
asm volatile("rdtscp" : "=a" (low), "=d" (high): : "%ecx");
return ( ( (uint64_t)high ) << 32 ) | low;
}
static double
get_cpu_freq_in_hz()
{
char buf[1024];
double clock_in_mhz = 0;
FILE* f = fopen( "/proc/cpuinfo", "r" );
while ( fgets( buf, sizeof(buf), f ) != 0 ) {
if ( sscanf( buf, "cpu MHz : %lf", &clock_in_mhz ) == 1 ) {
// printf( "Found clock : %.4f\n", clock_in_mhz );
break;
}
}
fclose( f );
return clock_in_mhz * 1000000;
}
static inline double
calc_interval_in_secs( uint64_t start, uint64_t finish )
{
static double freq_in_hz = get_cpu_freq_in_hz();
return (finish - start) / freq_in_hz;
}
static inline uint64_t
convert_usecs_to_ticks( double usecs ) {
return usecs / 1000000.0 * get_cpu_freq_in_hz();
}
class Stopwatch {
public:
static uint64_t get_default_report_threshold_in_ticks() { /* in ticks */
static uint64_t report_threshold_in_ticks_ = convert_usecs_to_ticks( 500000.0 );
return report_threshold_in_ticks_;
}
Stopwatch( const char* text )
: text_( text ), start_( get_clock_ticks() )
, report_threshold_in_ticks_( get_default_report_threshold_in_ticks() )
{
++indent_level_;
}
Stopwatch( const char* text, uint64_t report_threshold_in_usecs )
: text_( text ), start_( get_clock_ticks() )
, report_threshold_in_ticks_( convert_usecs_to_ticks( report_threshold_in_usecs ) )
{
++indent_level_;
}
~Stopwatch()
{
uint64_t finish = get_clock_ticks();
uint64_t dur = finish - this->start_;
--indent_level_;
if ( dur >= this->report_threshold_in_ticks_ ) {
double dur_in_secs = calc_interval_in_secs( this->start_, finish );
printf( "%.*s%s / took %.1f usecs, report_threshold_in_usecs in usecs=%llu\n", indent_level_ * 2, "", this->text_, dur_in_secs * 1000000.0, (unsigned long long)( report_threshold_in_ticks_ / ( get_cpu_freq_in_hz() ) * 1000000 ) ); fflush(stdout);
}
}
private:
const char* text_;
uint64_t start_;
uint64_t report_threshold_in_ticks_;
static int indent_level_;
};
int Stopwatch::indent_level_ = 0;
class Test_File {
public:
Test_File()
: file_name_( 0 ), fd_( -1 ), offset_( 0 ), seg_size_( SEG_SIZE ), step_size_( SEG_SIZE/8 ), segs_()
{
}
void open_file_name( const char* file_name )
{
this->file_name_ = strdup( file_name );
this->fd_ = open( this->file_name_, O_RDWR | O_CREAT, 0777 );
if ( this->fd_ < 0 ) {
fprintf( stderr, "Error calling open(): %d (%s), file_name=%s\n", errno, strerror(errno), this->file_name_ );
exit( errno );
}
}
void read_out_of_band(){
int random = 0;
random = std::rand() % (segs_.size());
memcpy(&read_char_[0], segs_[random], 32);
if(random > 0) memcpy(&read_char_[0], segs_[random-1], 32);
}
void step() {
size_t offset = this->offset_;
this->offset_ += this->step_size_;
if ( ( offset % this->seg_size_ ) == 0 ) {
{
Stopwatch s_lseek( "lseek" );
lseek( this->fd_, offset + this->seg_size_ - 1, SEEK_SET );
}
{
Stopwatch s_write( "write 1 byte at end of seg" );
write( this->fd_, "", 1 );
}
char* p = 0;
int prot = PROT_READ | PROT_WRITE;
int flags = MAP_SHARED;
{
Stopwatch s_mmap( "mmap 1 seg" );
p = (char*) mmap( 0, this->seg_size_, prot, flags, this->fd_, offset );
}
if ( p == MAP_FAILED ) {
fprintf( stderr, "Error calling mmap(): offset=%llu, %d (%s)\n", (unsigned long long)offset, errno, strerror(errno) );
exit( errno );
}
this->segs_.push_back( p );
size_t seg_num = offset / SEG_SIZE;
if ( false && seg_num >= 100 ) {
Stopwatch s_madvise( "madvise MADV_DONTNEED", 500 );
char* q = this->segs_[ seg_num - 100 ];
int r = madvise( q, SEG_SIZE, MADV_DONTNEED );
if ( r == - 1 ) {
fprintf( stderr, "Error calling madvise(): offset=%llu, %d (%s)\n",
(unsigned long long)offset, errno, strerror(errno) );
exit( errno );
}
}
}
{
Stopwatch s_memset( "memset 1 seg", 1000 );
char* p = this->segs_[ this->segs_.size() - 1 ];
memset( p + offset, '\1', this->step_size_ );
}
}
~Test_File() {
close( this->fd_ );
free( (char*)file_name_ );
}
private:
const char* file_name_;
int fd_;
int fd_rd_;
size_t offset_;
size_t seg_size_;
size_t step_size_;
char read_char_[32];
std::vector<char*> segs_;
};
int
main( int argc, const char* argv[] )
{
srand(time(0));
char outdir[256];
Test_File test_files[100];
size_t max = 16 * 1024 * MBYTES;
size_t num_files = 50;
if(argc < 2) strcpy(outdir, ".");
else strcpy(outdir, argv[1]);
// Workaround for not picking up /etc/localtime automatically
setenv( "TZ", ":/etc/localtime", 0 );
for ( size_t k=0; k < num_files; ++k ) {
char file_name[100];
snprintf( file_name, sizeof(file_name), "%s/TEST_%d_%llu", outdir, getpid(), (unsigned long long)k );
printf( "file_name=%s\n", file_name );
test_files[k].open_file_name( file_name );
}
time_t t;
time( &t );
unsigned int seed = t;
srandom( seed );
for ( size_t o=0; o < max; o += STEP_SIZE ) {
if ( o % (1 * STEP_SIZE) == 0 ) {
long v = (rand()%1000) * 2000;//( random() + 0.0 ) / RAND_MAX * 200000;
usleep( v );
time_t t;
struct tm tm;
char date_time_buf[200];
time( &t );
localtime_r( &t, &tm );
strftime( date_time_buf, sizeof(date_time_buf), "%T", &tm );
//printf( "o=%zu, %s sleep = %llu\n", o, date_time_buf, v );
}
for ( size_t k=0; k < num_files; ++k ) {
test_files[k].step();
test_files[k].read_out_of_band();
}
}
return 0;
}
Steps to reproduce:
- Start the application on RHEL 5 and RHEL 6:
;; RHEL 5
# ./mmap_test-static.2.6.18-194.32.1.el5 1>nostrace.rhel5.static.log 2>&1 &
;; RHEL 6
# ./mmap_test.static 1>nostrace.rhel63.static.log 2>&1 &
- Let it run.
- After a while, check the time taken for the
memset
call:
# cat nostrace.rhel5.static.log | grep usecs= | cut -d ' ' -f6 | sort -n | tail -5
1735.3
1834.3
1877.0
3059.5
8256.7
# cat nostrace.rhel63.static.log | grep usecs= | cut -d ' ' -f6 | sort -n | tail -5
1423835.9
1428859.7
1612747.8
1731135.4
2190878.2
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
Comments