Development

  • 100x performance regression between gcc 3.4.5 and gcc 4.X

    While analyzing the performance of the "bashmark" memory benchmark, I found a
    100x performance regression between gcc 3.4.5 and gcc 4.X.
    test_cmd.cpp (simplified bashmark memory RW test)
    #include <stdint.h>
    #include <cstring>

    template <const uint8_t Block_Size, const uint32_t Loops>
    static void int_membench(uint8_t* mb1, uint8_t* mb2)
    {
        for(uint32_t i = 0; i < Loops; i+=1)
        {
    #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
            T T T T T
            T T T T T
    #undef T
        }
    }

    template <const uint32_t Buf_Size, const uint32_t Loops>
    static void membench()
    {
        static uint8_t mb1[Buf_Size];
        static uint8_t mb2[Buf_Size];
        for(uint32_t i = 0; i < 10000; i+=1)
            int_membench<Buf_Size, Loops>(mb1, mb2);
    }

    int main()
    {
        membench<128, 4000>();
        return 0;
    }
    GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
    GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
    GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
    Compiler options:
    -march=athlon-xp
    -fomit-frame-pointer
    -mfpmath=sse -msse
    -ftracer -fweb
    -maccumulate-outgoing-args
    -ffast-math
    I've played with various settings (without -march, without -ftracer and
    -fweb, etc.) without any serious difference; GCC4 is always many times slower
    than GCC 3.4.5.
    Looking at the generated assembly showed that GCC4 doesn't inline the memcpy
    and memset calls.
    test.c (uber simplified problem demonstration)
    #include <string.h>

    char* f(char* b)
    {
        static char a[64];
        memcpy(a, b, 64);
        memset(a, 0, 64);
        return a;
    }
    GCC4 will generate calls to memcpy and memset in this example. GCC3 will inline
    all calls.
    So it looks like the GCC4 inliner is broken at some point.
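
    For anyone who wants to reproduce the timings without the templates, here is
    a minimal C sketch of the same copy/fill loop with timing built in. The names
    run_membench, time_membench and BLOCK_SIZE are made up for illustration and
    are not part of bashmark; the checksum is only there so the loop has an
    observable result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

enum { BLOCK_SIZE = 128 };  /* same block size as membench<128, 4000> */

static uint8_t mb1[BLOCK_SIZE];
static uint8_t mb2[BLOCK_SIZE];

/* Copy mb2 into mb1, then refill mb2 with the low byte of i, `loops` times.
   Returns the byte sum of mb1 so the work has a visible result. */
static uint32_t run_membench(uint32_t loops)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < loops; i += 1) {
        memcpy(mb1, mb2, BLOCK_SIZE);
        memset(mb2, (int)(i & 0xff), BLOCK_SIZE);
    }
    for (size_t k = 0; k < BLOCK_SIZE; k += 1)
        sum += mb1[k];
    return sum;
}

/* CPU time in seconds spent on `loops` memcpy/memset pairs. */
static double time_membench(uint32_t loops)
{
    clock_t start = clock();
    (void)run_membench(loops);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

    Calling time_membench(400000000u) performs the same number of memcpy/memset
    pairs as the original test (10000 x 4000 x 10), so the result can be compared
    across compiler versions.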
  • No.1 | | 2099 bytes | |

    On 3/12/06, Nickolay Kolchin <nbkolchin (AT) gmail (DOT) com> wrote:
    So, it looks like GCC4 inliner is broken at some point.

    Inlining of memcpy/memset is architecture dependent (I see calls
    on ppc for gcc 3.4, too). This is a stupid benchmark and as such
    not worth optimizing for.

    Richard.
  • No.2 | | 2698 bytes | |

    On 3/12/06, Richard Guenther <richard.guenther (AT) gmail (DOT) com> wrote:
    Inlining of memcpy/memset is architecture dependent (I see calls
    on ppc for gcc 3.4, too). This is a stupid benchmark and as such
    not worth optimizing for.

    bashmark (http://bashmark.coders-net.de/ ) is a benchmark. My code is
    just a test to demonstrate the problem and as such can't be stupid. :)

    A situation where the compiler generates code from a simple test that runs
    100 times slower than the code from the previous compiler version is not
    normal anyway (and GCC3 generates smaller code, too).

    I thought that this regression was caused by different "max-inline-*"
    param settings in 4.X.

    In any case: memcpy/memset inlining is broken in current GCC at least
    on athlon arch.
  • No.3 | | 3163 bytes | |

    On Sun, 2006-03-12 at 16:55 +0300, Nickolay Kolchin wrote:
    bashmark (http://bashmark.coders-net.de/ ) is a benchmark. My code is
    just a test to demonstrate the problem and as such can't be stupid. :)

    A situation where the compiler generates code from a simple test that runs
    100 times slower than the code from the previous compiler version is not
    normal anyway (and GCC3 generates smaller code, too).

    I thought that this regression was caused by different "max-inline-*"
    param settings in 4.X.

    In any case: memcpy/memset inlining is broken in current GCC at least
    on athlon arch.

    Yes, why is the benchmark not valid?
    Then we would appreciate if the developers could recommend a valid test.

    Here is what I get on my platform:

    gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)
    Architecture = i686
    OS: Linux
    Kernel: 2.6.15-1.1833_FC4

    [williams@bengal src]$ time ./test_cmd

    real 0m50.583s
    user 0m50.003s
    sys 0m0.220s

    Thanks,
    Ernesto
  • No.4 | | 155 bytes | |

    Yes, why is the benchmark not valid?
    It is valid. We should understand why this behavior has changed so drastically.
    Gr.
    Steven
  • No.5 | | 497 bytes | |

    On 3/12/06, Ernest L. Williams Jr. <ernesto (AT) ornl (DOT) gov> wrote:
    In any case: memcpy/memset inlining is broken in current GCC at least
    on athlon arch.

    Let's say it changed. Also, memcpy/memset "inlining" is not regular inlining
    but is driven by completely different heuristics.

    Yes, why is the benchmark not valid?
    Then we would appreciate if the developers could recommend a valid test.

    What is the benchmark supposed to measure?

    Richard.
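
    To illustrate the distinction drawn above, a hedged sketch: copy_block is an
    ordinary user function, so whether it gets inlined is up to the regular
    inliner, while the memcpy call with a constant size is something the compiler
    may either expand in place or leave as a library call, by a separate
    decision. The function names are hypothetical, and what either one compiles
    to depends on the compiler version, target and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK = 64 };  /* same size as the buffer in test.c */

/* Ordinary user function: a candidate for the regular inliner. */
static inline void copy_block(char *dst, const char *src)
{
    for (size_t k = 0; k < BLOCK; k += 1)
        dst[k] = src[k];
}

/* Library call with a constant size: the compiler may expand it in place
   or emit a call to memcpy. */
static void copy_block_libc(char *dst, const char *src)
{
    memcpy(dst, src, BLOCK);
}

/* Returns 1 if both variants produce the same bytes for the given source. */
static int copies_agree(const char *src)
{
    char a[BLOCK];
    char b[BLOCK];
    copy_block(a, src);
    copy_block_libc(b, src);
    return memcmp(a, b, BLOCK) == 0;
}
```

    Both variants copy the same 64 bytes; the difference only shows up in the
    generated assembly, which is where a call to memcpy would or would not
    appear.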
  • No.6 | | 595 bytes | |

    On 3/12/06, Steven Bosscher <stevenb.gcc (AT) gmail (DOT) com> wrote:

    It is valid. We should understand why this behavior has changed so drastically.

    I've attached assembler output from different compiler versions:

    3.4.5-athlon-xp: gcc-3.4.5 -march=athlon-xp
    3.4.5-pentium4: gcc-3.4.5 -march=pentium4
    4.1.0-athlon-xp: gcc-4.1.0 -march=athlon-xp

    As you can see, gcc-3.4.5 generates the fastest code for
    "-march=athlon-xp". This code should also run faster on any pentium
    machine.

    gcc-4.1.0 generates the "same" slow code for both the "pentium" and "athlon" archs.
