Large difference in performance observed between RHEL 5 & RHEL 6 when running programs using memory mapping

Solution Verified - Updated -

Environment

  • Red Hat Enterprise Linux (RHEL) 6.2
  • Red Hat Enterprise Linux 6.4
  • kernel-2.6.32-279.19.1.el6.x86_64

Issue

  • A large difference in performance between RHEL 5 and RHEL 6 when running programs using memory mapping. In particular, when using msync. For instance, the difference for a simple program that opens a file, maps it into memory and then syncs it is:

FIXED IN kernel-2.6.32-358.18.1.el6

RHEL 5
real    0m5.33s
user    0m0.24s
sys     0m0.22s
RHEL 6
real    1m49.92s
user    0m0.21s
sys     0m0.54s
  • Another environment sees a similar result:
RHEL 5
    time ./mmapRH

    real    0m8.593s
    user    0m0.349s
    sys     0m0.344s
RHEL 6
    time ./mmapRH

    real    1m24.281s
    user    0m0.346s
    sys     0m0.744s
  • Results from iozone (http://www.iozone.org/) when using the mmap and msync options (-B and -G).
RHEL 5 (Machine 1)
                                                                       random  random    bkwd      record     stride                    
    KB  reclen   write rewrite    read      reread      read       write       read        rewrite     read                   
    512       4     402  34856  3198002  3393088 2992871    2891     3242347    21572   3102214
RHEL 5 (Machine 2)
                                                                       random  random    bkwd       record     stride                   
    KB  reclen   write rewrite    read       reread     read       write        read        rewrite     read                   
   512       4     380   23380  1158641  3370026 3239987   10210     3239419    18574   3083779
RHEL 6
                                                                      random  random     bkwd       record     stride                    
    KB  reclen   write rewrite    read      reread     read       write         read        rewrite     read                   
   512       4     285     525   1484182   1610307 2736355    971      2316371      455     1560822
  • An additional set of iozone in another environment:
RHEL 5
        ./iozone -B -G
        Iozone: Performance Test of File I/O
                Version $Revision: 3.414 $
                Compiled for 32 bit mode.
                Build: linux
        Run began: Sat Mar 19 13:01:50 2033

        Using mmap files
        Using msync(MS_SYNC) on mmap files
        Command line used: ./iozone -B -G
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
                                                                                 random    random    bkwd        record    stride                                 
              KB  reclen   write   rewrite    read       reread    read          write      read         rewrite    read                                 
             512       4     2176    5920  2245789  3389239 3158460   14555    2370476      13646    2992263
RHEL 6
        ./iozone -B -G
                                                                                 random    random    bkwd        record     stride                                  
             KB   reclen   write   rewrite    read       reread    read          write      read         rewrite     read                                  
             512       4     177       652  2484401  2780921 2439220    1027      2497211      651      2667377
----------------------------------

Resolution

  • Internal Red Hat Bugzilla: 951937 - Provides this fix to RHEL 6.5 release in kernel-2.6.32-431.el6.
  • Internal Red Hat Bugzilla: 981177 - Provides this fix to the RHEL 6.4 Errata release in kernel-2.6.32-358.18.1.el6.

  • Patchset submitted does the following:

    • Introduces small fixes and cleanups to writeback_in_progress() and
      determine_dirtyable_memory() routines.

    • Provides users (devices and filesystems) of a backing_dev_info the ability
      to declare whether or not it is necessary to ensure that page contents cannot
      change during write-out, while the actual code assumes this always is true.

    • Relaxes the wait_on_page_writeback() calls, so that they only occur if
      something needs it.

    • Adjusts the remaining supported filesystems to use this improved
      conditional-wait logic in the hope of providing stable page writes on
      all filesystems.

  • Patches (5):

  mm: fix writeback_in_progress()
  mm/page-writeback.c: subtract min_free_kbytes from dirtyable memory
  bdi: allow block devices to say that they require stable page writes
  mm: only enforce stable page writes if the backing device requires it
  block: optionally snapshot page contents to provide stable pages
    during write

 Documentation/ABI/testing/sysfs-class-bdi |  5 ++++
 block/blk-core.c                          |  8 ++++--
 block/blk-integrity.c                     |  4 +++
 fs/buffer.c                               |  2 +-
 fs/ext3/super.c                           |  1 +
 fs/ext4/inode.c                           |  2 +-
 fs/fs-writeback.c                         |  4 ++-
 fs/gfs2/file.c                            |  2 +-
 include/linux/backing-dev.h               |  7 +++++
 include/linux/fs.h                        |  1 +
 include/linux/pagemap.h                   |  1 +
 mm/Kconfig                                | 13 +++++++++
 mm/backing-dev.c                          | 11 +++++++
 mm/bounce.c                               | 48 ++++++++++++++++++++++++++++---
 mm/filemap.c                              |  3 +-
 mm/page-writeback.c                       | 27 +++++++++++++++++
 16 files changed, 127 insertions(+), 12 deletions(-)

Root Cause

  • During the course of investigation for Bug 951937, it was discovered that complaints about non-deterministic long write latency
    were due to the changes introduced by the following commit:
bac3645 [fs] Stable page cache to better support T10/DIF
  • Buffered I/O was being delayed arbitrarily, even when stable page writes
    were not necessary.

Diagnostic Steps

Reproducer #1

Compile the following test code (gcc -o mmapRH mmapRH.c, assuming the source file is named mmapRH.c) on a RHEL 5 and RHEL 6 machine. Run time mmapRH on both machines and see the elapsed time difference.

#include <stdio.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdlib.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>

#define vm_offset_t size_t 

#define SIZE 0x80000 
#define PAGE_MASK       (sysconf(_SC_PAGE_SIZE) - 1)
#define page_aligned(x) ((((vm_offset_t)(x)) & PAGE_MASK) == 0)
#define trunc_page(x)   ((vm_offset_t)(((vm_offset_t)(x)) & ~PAGE_MASK))
#define PAGELET_SIZE 512

int main(){
   int *baseaddr;
   int j;
   int *current;
   int fd;
   clock_t cputime;
   char filename[200];
   size_t size = SIZE;
   char *buf;
   int flag = O_RDWR | O_CREAT;
   int map_flags=MAP_SHARED | MAP_FILE;

   strcpy(filename,"mmap.out");
   j=sysconf(_SC_PAGE_SIZE);

int i;
for( i=0;i<100;i++)
{
   flag += O_DSYNC;
   fd=open(filename,flag,0666);
   ftruncate(fd,size);
   current=(int *)mmap((void *)0, size, PROT_WRITE, map_flags, fd, 0);
   if (current == (int *)MAP_FAILED) {
       perror("mmap");
       exit(0);
   }
   close(fd);
   baseaddr=current;
   madvise((caddr_t)current,size,MADV_SEQUENTIAL);
   for(j=0;j<SIZE;j++)
   {
         *baseaddr=j;
         if ((j) && (page_aligned(baseaddr))){
             if (msync((char *)(trunc_page(baseaddr) 
                         - sysconf(_SC_PAGE_SIZE)),
                       sysconf(_SC_PAGE_SIZE),6) == -1){
                 perror("msync");
                 exit(EXIT_FAILURE);
             }
         }
         baseaddr++;
         if (baseaddr >= current+SIZE/4)
             break;
   }
}
}

Reproducer #2

Reproduce using iozone: Download and compile iozone, run iozone -B -G on both a RHEL 5 and RHEL 6 and notice the performance difference with write operations.

     $ cd iozone3_414/src/current
     $ make linux

Reproducer #3

/*
 *  g++ -O3 --std=c++0x -o test_mmap test_mmap.cpp
 */

#include <vector>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <time.h>
#include <cstdlib>
#include <ctime>

static const size_t         MBYTES = 1048576;
static const size_t         SEG_SIZE = 4 * MBYTES;
static const size_t         STEP_SIZE = SEG_SIZE / 2048;

static inline uint64_t
get_clock_ticks(){
    register uint32_t   low, high; 
    asm volatile("rdtscp" : "=a" (low), "=d" (high): : "%ecx");
    return ( ( (uint64_t)high ) << 32 ) | low;
}

static double
get_cpu_freq_in_hz()
{
    char            buf[1024];
    double          clock_in_mhz = 0;
    FILE*           f = fopen( "/proc/cpuinfo", "r" );

    while ( fgets( buf, sizeof(buf), f ) != 0 ) {
        if ( sscanf( buf, "cpu MHz         : %lf", &clock_in_mhz ) == 1 ) {
//            printf( "Found clock : %.4f\n", clock_in_mhz );
            break;
        }
    }
    fclose( f );
    return clock_in_mhz * 1000000;
}

static inline double
calc_interval_in_secs( uint64_t start, uint64_t finish )
{
    static double   freq_in_hz = get_cpu_freq_in_hz();
    return (finish - start) / freq_in_hz;
}

static inline uint64_t
convert_usecs_to_ticks( double usecs ) {
    return usecs / 1000000.0 * get_cpu_freq_in_hz();
}

class Stopwatch {
public:
    static uint64_t get_default_report_threshold_in_ticks() {    /* in ticks */
        static uint64_t report_threshold_in_ticks_   = convert_usecs_to_ticks( 500000.0 );
        return report_threshold_in_ticks_;
    }

    Stopwatch( const char* text )
        : text_( text ), start_( get_clock_ticks() )
        , report_threshold_in_ticks_( get_default_report_threshold_in_ticks() )
    {
        ++indent_level_;
    }

    Stopwatch( const char* text, uint64_t report_threshold_in_usecs )
        : text_( text ), start_( get_clock_ticks() )
        , report_threshold_in_ticks_( convert_usecs_to_ticks( report_threshold_in_usecs ) )
    {
        ++indent_level_;
    }

    ~Stopwatch()
    {
        uint64_t    finish = get_clock_ticks();
        uint64_t    dur = finish - this->start_;

        --indent_level_;
        if ( dur >= this->report_threshold_in_ticks_ ) {
            double  dur_in_secs = calc_interval_in_secs( this->start_, finish );
            printf( "%.*s%s / took %.1f usecs, report_threshold_in_usecs in usecs=%llu\n", indent_level_ * 2, "", this->text_, dur_in_secs * 1000000.0, (unsigned long long)( report_threshold_in_ticks_ / ( get_cpu_freq_in_hz() ) * 1000000 ) ); fflush(stdout);
        }
    }
private:
    const char*         text_;
    uint64_t            start_;
    uint64_t            report_threshold_in_ticks_;
    static int          indent_level_;
};

int         Stopwatch::indent_level_        = 0;

class Test_File {
public:
    Test_File()
        : file_name_( 0 ), fd_( -1 ), offset_( 0 ), seg_size_( SEG_SIZE ), step_size_( SEG_SIZE/8 ), segs_()
    {
    }

    void open_file_name( const char* file_name )
    {
        this->file_name_ = strdup( file_name );
        this->fd_ = open( this->file_name_, O_RDWR | O_CREAT, 0777 );
        if ( this->fd_ < 0 ) {
            fprintf( stderr, "Error calling open(): %d (%s), file_name=%s\n", errno, strerror(errno), this->file_name_ );
            exit( errno );
        }
    }

    void read_out_of_band(){
      int random  = 0;
      random = std::rand() % (segs_.size());    
      memcpy(&read_char_[0], segs_[random], 32);
      if(random > 0) memcpy(&read_char_[0], segs_[random-1], 32);
    }


    void step() {
        size_t  offset = this->offset_;
        this->offset_ += this->step_size_;

        if ( ( offset % this->seg_size_ ) == 0 ) {
            {
                Stopwatch   s_lseek( "lseek" );
                lseek( this->fd_, offset + this->seg_size_ - 1, SEEK_SET );
            }

            {
                Stopwatch   s_write( "write 1 byte at end of seg" );
                write( this->fd_, "", 1 );
            }

            char*       p = 0;
            int         prot = PROT_READ | PROT_WRITE;
            int         flags = MAP_SHARED;
            {
                Stopwatch   s_mmap( "mmap 1 seg" );
                p = (char*) mmap( 0, this->seg_size_, prot, flags, this->fd_, offset );
            }
            if ( p  == MAP_FAILED ) {
                fprintf( stderr, "Error calling mmap(): offset=%llu, %d (%s)\n", (unsigned long long)offset, errno, strerror(errno) );
                exit( errno );
            }

            this->segs_.push_back( p );

            size_t      seg_num = offset / SEG_SIZE;
            if ( false && seg_num >= 100 ) {
                Stopwatch   s_madvise( "madvise MADV_DONTNEED", 500 );
                char*       q = this->segs_[ seg_num - 100 ];
                int         r = madvise( q, SEG_SIZE, MADV_DONTNEED );
                if ( r == - 1 ) {
                    fprintf( stderr, "Error calling madvise(): offset=%llu, %d (%s)\n",
                                            (unsigned long long)offset, errno, strerror(errno) );
                    exit( errno );
                }
            }
        }

        {
            Stopwatch   s_memset( "memset 1 seg", 1000 );
            char*       p = this->segs_[ this->segs_.size() - 1 ];
            memset( p + offset, '\1', this->step_size_ );
        }
    }

    ~Test_File() {
        close( this->fd_ );
        free( (char*)file_name_ );
    }
private:
    const char*         file_name_;
    int                 fd_;
    int                 fd_rd_;
    size_t              offset_;
    size_t              seg_size_;
    size_t              step_size_;
    char                read_char_[32];
    std::vector<char*>  segs_;
};


int
main( int argc, const char* argv[] )
{
    srand(time(0));
    char outdir[256];

    Test_File   test_files[100];
    size_t      max = 16 * 1024 * MBYTES;
    size_t      num_files = 50;
    if(argc < 2) strcpy(outdir, ".");
    else         strcpy(outdir, argv[1]);
    // Workaround for not picking up /etc/localtime automatically
    setenv( "TZ", ":/etc/localtime", 0 );

    for ( size_t k=0; k < num_files; ++k ) {
        char        file_name[100];
        snprintf( file_name, sizeof(file_name), "%s/TEST_%d_%llu", outdir, getpid(), (unsigned long long)k );

        printf( "file_name=%s\n", file_name );
        test_files[k].open_file_name( file_name );
    }

    time_t              t;
    time( &t );
    unsigned int        seed = t;
    srandom( seed );

    for ( size_t o=0; o < max; o += STEP_SIZE ) {
        if ( o % (1 * STEP_SIZE) == 0 ) {
            long        v = (rand()%1000) * 2000;//( random() + 0.0 ) / RAND_MAX * 200000;
            usleep( v );

            time_t      t;
            struct tm   tm;
            char        date_time_buf[200];
            time( &t );
            localtime_r( &t, &tm );
            strftime( date_time_buf, sizeof(date_time_buf), "%T", &tm );
            //printf( "o=%zu, %s sleep = %llu\n", o, date_time_buf, v );
        }

        for ( size_t k=0; k < num_files; ++k ) {
            test_files[k].step();
            test_files[k].read_out_of_band();
        }
    }

    return 0;
}

Steps to reproduce:

  • Start the application on RHEL 5 and RHEL 6:
;; RHEL 5
# ./mmap_test-static.2.6.18-194.32.1.el5 1>nostrace.rhel5.static.log 2>&1 &
;; RHEL 6
# ./mmap_test.static 1>nostrace.rhel63.static.log 2>&1 &
  • Let it run.
  • After a while, check the time taken for the memset call:
# cat nostrace.rhel5.static.log | grep usecs= | cut -d ' ' -f6 | sort -n | tail -5
1735.3
1834.3
1877.0
3059.5
8256.7

# cat nostrace.rhel63.static.log | grep usecs= | cut -d ' ' -f6 | sort -n | tail -5
1423835.9
1428859.7
1612747.8
1731135.4
2190878.2

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

Comments