Compilers

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • The case of directly-executable parsers

    6 answers - 1510 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    Directly-executable LR parsers are well-known and a number of
    researchers have published amazing speedups over table-driver LR
    parsers, in the 2x to 6x or more range, cf.:
    Achyutram Bhamidipaty and Todd A. Proebsting
    "Very Fast YACC-Compatible Parsers (For Very Little Effort)
    R.N.Horspool and M. Whitney
    "Even faster LR parsing"
    Thomas J. Pennello
    "Very fast LR parsing"
    I have developed such a LALR(1) generator, which generates a
    directly-executable parser in IS C. Admittedly, the parser performs
    only a few low-level optimizations - inverted table for non-terminal
    transitions and separating the most frequent transition as a default.
    However, the speedup I observe is nowhere near the previously reported
    ones. Compared to a parser, generated by Bison 2.1 for the IS C 99
    grammar (taken straight from the standard, with one minor
    modification), the directly-executable parser is only 5-9% faster on
    P4 and about 12% faster on P3 Xeon, at the same time being about 3
    times bigger.
    I assume I'm not actually doing anything stupid with the generated
    parser. A sample of the generated code (preprocessed and indent(1)ed)
    is here:
    What could be the reason for this parser performing so poorly? for
    the Bison parser performing so well? Has anyone got some recent
    observations with directly-executable vs. table driven recognizers -
    scanning (re2c?), parsing, tree-matching?
    ~velco
  • No.1 | | 1339 bytes | |

    Velco,

    What could be the reason for this parser performing so poorly? for
    the Bison parser performing so well? Has anyone got some recent
    observations with directly-executable vs. table driven recognizers -
    scanning (re2c?), parsing, tree-matching?

    The generated code contains lots of jumps. I am guessing that
    instruction pipeline never gets to crank up before it is flushed, also
    all those jumps must be very bad for lookahead prediction and I dread
    to think about the cache swapping that must be going on.

    In the "good old days" jumps did not have such a big impact on
    performance (well, at least on many processors). So I am guessing
    that the excessive number of jumps are skewing the performance
    characteristics, which did not happen on the processors used for the
    previous timing comparisons.

    Also gcc does not seem to be consistent in optimizing those switch ()
    { case L: goto Q; } constructs. Sometimes it turns the switch into
    a nested-if, other times it populates the jump table with the
    destination address of the goto.

    Have you tried using dynamic profiling information to get gcc to tune
    the generated code? That might result in some of the less frequently
    executed code moved some place where it won't keep filling the cache.
  • No.2 | | 2251 bytes | |

    Derek M. Jones wrote:
    Velco,

    What could be the reason for this parser performing so poorly? for
    the Bison parser performing so well? Has anyone got some recent
    observations with directly-executable vs. table driven recognizers -
    scanning (re2c?), parsing, tree-matching?

    The generated code contains lots of jumps. I am guessing that
    instruction pipeline never gets to crank up before it is flushed, also
    all those jumps must be very bad for lookahead prediction and I dread
    to think about the cache swapping that must be going on.

    In the "good old days" jumps did not have such a big impact on
    performance (well, at least on many processors). So I am guessing
    that the excessive number of jumps are skewing the performance
    characteristics, which did not happen on the processors used for the
    previous timing comparisons.

    Yeah, that sounds reasonable. It probably also explains why on PIII
    Xeon the speedup is a bit better.

    Also gcc does not seem to be consistent in optimizing those switch ()
    { case L: goto Q; } constructs. Sometimes it turns the switch into
    a nested-if, other times it populates the jump table with the
    destination address of the goto.

    I have also implemented a kind of a "switch optimizer". After
    removing the most frequent transition, the keys are separated in
    "dense" ranges, where density threshold is a parameter (say, 0.33 or
    0.5). Then a linear or a binary search is performed depending on the
    number of ranges (say >= 8 binary search, otherwise - linear).

    This did not result in any performance gain, although it did reduce the
    code size by a decent amount.

    Have you tried using dynamic profiling information to get gcc to
    tune the generated code? That might result in some of the less
    frequently executed code moved some place where it won't keep
    filling the cache.

    Yes, the results were obtained with gcc 4.0.3 or 4.2 as follows:
    1) compilation with -march=<whatever-fomit-frame-pointer
    -fprofile-generate
    2) one run on sample data
    3) compilation with -march=<whatever-fomit-frame-pointer
    -fprofile-use
    4) timing on the same data

    ~velco
  • No.3 | | 949 bytes | |

    2006-01-30, momchil.velikov@gmail.com <momchil.velikov@gmail.comwrote:
    I have also implemented a kind of a "switch optimizer". After removing
    the most frequent transition, the keys are separated in "dense" ranges,
    where density threshold is a parameter (say, 0.33 or 0.5). Then a linear
    or a binary search is performed depending on the number of ranges (say >=
    8 binary search, otherwise - linear).

    This did not result in any performance gain, although it did reduce the
    code size by a decent amount.

    The authors of re2c did fiddle around with the binary search in some
    circumstances, and they discuss this in the accompanying LPLAS
    article that comes with the source. However, their conclusions might
    be somewhat dated. I seem to recall they had different results using
    if/else blocks vs. switch statements. Re2c appears to have a new
    maintainer:

    http://re2c.org
    -Clint
  • No.4 | | 845 bytes | |

    momchil.velikov@gmail.com wrote:

    What could be the reason for this parser performing so poorly? for
    the Bison parser performing so well? Has anyone got some recent

    GNU Bison uses a different algorithm from YACC. See:

    *checkout*/bison/bison/REFERENCES?rev=1.2

    I quote:

    "Also, Bison uses a faster but less space-efficient encoding for the
    parse tables (see Corbett's PhD thesis from Berkeley, "Static
    Semantics in Compiler Error Recovery", June 1985, Report No. UCB/CSD
    85/251), and more modern technique for generating the look-ahead sets.
    (See Frank DeRemer and Thomas Pennello, "Efficient Computation of
    LALR(1) Look-Ahead Sets", ACM Transactions on Programming Languages
    and Systems (TPLAS) 4, 4 ( 1982), 615-649. Their
    technique is the standard one now.)"

    HTH,
    Ranjit.
  • No.5 | | 6594 bytes | |

    momchil.velikov@gmail.com wrote:
    Directly-executable LR parsers are well-known and a number of
    researchers have published amazing speedups over table-driver LR
    parsers, in the 2x to 6x or more range, cf.:

    Achyutram Bhamidipaty and Todd A. Proebsting
    "Very Fast YACC-Compatible Parsers (For Very Little Effort)

    R.N.Horspool and M. Whitney
    "Even faster LR parsing"

    Thomas J. Pennello
    "Very fast LR parsing"

    I have developed such a LALR(1) generator, which generates a
    directly-executable parser in IS C. Admittedly, the parser performs
    only a few low-level optimizations - inverted table for non-terminal
    transitions and separating the most frequent transition as a default.
    However, the speedup I observe is nowhere near the previously reported
    ones. Compared to a parser, generated by Bison 2.1 for the IS C 99
    grammar (taken straight from the standard, with one minor
    modification), the directly-executable parser is only 5-9% faster on
    P4 and about 12% faster on P3 Xeon, at the same time being about 3
    times bigger.

    I assume I'm not actually doing anything stupid with the generated
    parser. A sample of the generated code (preprocessed and indent(1)ed)
    is here:

    What could be the reason for this parser performing so poorly? for
    the Bison parser performing so well? Has anyone got some recent
    observations with directly-executable vs. table driven recognizers -
    scanning (re2c?), parsing, tree-matching?

    There are several possibilities. You have not written, which compiler
    you are using, so we can only guess. Your code is full of examples of
    obvious inefficiencies, e.g.,

    do
    {
    }
    while (0);

    Such construct will most likely be removed by the optimizer, but they
    may confuse it such that the code does not get as good as it should be.
    It is a dangerous assumption to assume that a compiler generates good
    code for constructs human programmers would never write. It might be
    that your compiler can handle only so many loops or basic blocks before
    it gives up optimizing.

    You are manipulating the stack through a set of inlined procedures. You
    are passing the stack as a pointer to a record. First, are you sure the
    procedures are inlined? Secondly, the compiler might not be smart
    enough to realize that the stack as a variable is not truly aliased.
    There could be a severe performance hit because the compiler may think
    the stack is aliased and therefore put the top of the stack in main
    memory in stead of having it in registers. I would drop that library of
    stack operations and implement the stack without the use of a record
    structure and as local variables of the parse routine.

    Then you are jumping around much more than is necessary, e.g.,

    goto symbol_387;

    // lots of code

    symbol_387:
    switch (state)
    {
    default:
    goto push_317;
    }

    First you are again assuming that the compiler will generate good code
    from weird constructs that human programmers will never write. You are
    doing that with the switch statement with only a default case label.
    Secondly, the compiler will have to be very smart in order to generate
    code such that you jump directly from the first goto to push_317. Why
    not substitute "goto push_317;" for "goto symbol_387;" and delete the
    code following "symbol_387:".

    Along the same line: From each block of code starting with a label like
    reduce_* you jump to some code starting with a label like symbol Why
    not move the code of symbol_* to after the first piece of code jumping
    to it?

    Then there is your variable state. There is no need to assign to it if
    the non kernel of the state does not include items of productions very
    the right hand side is not the empty string. You do not need to push
    the state, because it will never be read. All you need to do is to
    increase the stack top pointer with 1 and let the stack top have an
    undefined value. It will never be read anyway.

    Also, you are passing the get_token routine in a record. The optimizer
    cannot find out that it is the same routine that is called each time.
    It will have to generate code to dereference the ctx variable each time
    the routine is called. In most cases there is one input source, one
    output, and the parser is called once. I would call the lexer directly
    and only make support for multiple concurrent calls to the parser optional.

    You call xg__stack_ensure in lots of places, but it is possible to
    optimize most of these away. Consider a DFA were you store all the read
    tokens in a stack, i.e., the highs of the stack is equal to the number
    of read symbols. If there is no loops in the DFA, then it is a simple
    search to find the maximum number of symbols that can be read. Thus you
    can always make sure the stack is big enough before running the DFA. If
    there are loops, then you can do with one check for each time the DFA
    run through all the states of the loop to ensure enough free space in
    the stack, e.g., assume that the numbers are states and 1->2 is a
    transition from state 1 to state 2. We do not care about which symbols
    are read:

    1->2
    2->3
    3->4
    4->5
    5->2
    5->6

    state 1 is the start state and state 6 is the accept state. You can do
    with checking the among of free space on the stack at the start and in
    one of the states 2, 3, 4, or 5. The same trick can be used in a LR(1)
    automaton. In your case it will most likely save you most of the
    procedure calls to xg__stack_ensure.

    A few more comments: First what about error recovery? A parser
    generator that generates parsers without error recovery has only so much
    use, and the error recovery can have many implications on how the
    directly executed parser is implemented. Secondly, you do not have any
    semantic actions on your reductions. All modern computers have cache
    between main storage and the processor. That cache will speed up code
    as long as the code can be in the cache. That has two implications.
    The first implication is that you current benchmarking really is not
    that interesting. It should wait until you have your semantic actions
    in place. The second implication is that if the table driven parser can
    be in the cache, but you directly executed cannot, then it is no
    surprise that the directly executed parser is not that fast.

    Karsten Nyblad
  • No.6 | | 8905 bytes | |

    Kdarsten Nyblad wrote:
    momchil.velikov@gmail.com wrote:
    Directly-executable LR parsers are well-known and a number of
    researchers have published amazing speedups over table-driver LR
    parsers, in the 2x to 6x or more range, cf.:

    Achyutram Bhamidipaty and Todd A. Proebsting
    "Very Fast YACC-Compatible Parsers (For Very Little Effort)

    R.N.Horspool and M. Whitney
    "Even faster LR parsing"

    Thomas J. Pennello
    "Very fast LR parsing"

    I have developed such a LALR(1) generator, which generates a
    directly-executable parser in IS C. Admittedly, the parser performs
    only a few low-level optimizations - inverted table for non-terminal
    transitions and separating the most frequent transition as a default.
    However, the speedup I observe is nowhere near the previously reported
    ones. Compared to a parser, generated by Bison 2.1 for the IS C 99
    grammar (taken straight from the standard, with one minor
    modification), the directly-executable parser is only 5-9% faster on
    P4 and about 12% faster on P3 Xeon, at the same time being about 3
    times bigger.

    I assume I'm not actually doing anything stupid with the generated
    parser. A sample of the generated code (preprocessed and indent(1)ed)
    is here:

    What could be the reason for this parser performing so poorly? for
    the Bison parser performing so well? Has anyone got some recent
    observations with directly-executable vs. table driven recognizers -
    scanning (re2c?), parsing, tree-matching?

    There are several possibilities. You have not written, which compiler
    you are using, so we can only guess. Your code is full of examples of
    obvious inefficiencies, e.g.,

    do
    {
    }
    while (0);

    [You have probably missed it, I have given the compilers and
    the command line options used in another message in this
    very thread.]

    This is an artefact from discarding (via -DNDEBUG) parser trace
    code ("Entering state X. Next token is "). The compiler does
    discard it.

    Such construct will most likely be removed by the optimizer, but they
    may confuse it such that the code does not get as good as it should be.
    It is a dangerous assumption to assume that a compiler generates good
    code for constructs human programmers would never write. It might be
    that your compiler can handle only so many loops or basic blocks before
    it gives up optimizing.

    Not the case.

    You are manipulating the stack through a set of inlined procedures. You
    are passing the stack as a pointer to a record. First, are you sure the
    procedures are inlined?

    Yes.

    Secondly, the compiler might not be smart
    enough to realize that the stack as a variable is not truly aliased.
    There could be a severe performance hit because the compiler may think
    the stack is aliased and therefore put the top of the stack in main
    memory in stead of having it in registers. I would drop that library of
    stack operations and implement the stack without the use of a record
    structure and as local variables of the parse routine.

    This indeed had an insignificant improvement (along with a fixed
    size stack), but I think the improvement was from the lack of stack
    limit checks.

    Then you are jumping around much more than is necessary, e.g.,

    goto symbol_387;

    // lots of code

    symbol_387:
    switch (state)
    {
    default:
    goto push_317;
    }

    First you are again assuming that the compiler will generate good code
    from weird constructs that human programmers will never write.

    Indeed, I fully expect from the compiler to perform dead code
    ellimination
    and simple jump optimizations. From research or toy compilers I do not
    expect good performance anyway. Unconditional jump to uncoinditional
    jump is trivial to optimize. Yes, the compiler does it.

    You are
    doing that with the switch statement with only a default case label.

    The compiler emits a single jump, in the case it's not subsumed by
    other optimization. I see no reason for it to be compiled to different
    code.

    Secondly, the compiler will have to be very smart in order to generate
    code such that you jump directly from the first goto to push_317. Why
    not substitute "goto push_317;" for "goto symbol_387;" and delete the
    code following "symbol_387:".

    In the general case there can be many jumps to "symbol_N:", one for
    each production having N as its left-hand side.

    And, no, the compiler does not need to be very smart, this is a trivial
    optimization and the compiler in fact performs it.

    Along the same line: From each block of code starting with a label like
    reduce_* you jump to some code starting with a label like symbol Why
    not move the code of symbol_* to after the first piece of code jumping
    to it?

    See above. And in the cases there are multiple jumps, it is up to
    the compiler to decide whether to duplicate the code.

    Then there is your variable state. There is no need to assign to it if
    the non kernel of the state does not include items of productions very
    the right hand side is not the empty string. You do not need to push
    the state, because it will never be read. All you need to do is to
    increase the stack top pointer with 1 and let the stack top have an
    undefined value. It will never be read anyway.

    I'm not sure if I understand this. Do you mean there's no need to
    push the state number for states with no non-terminal transitions?
    Indeed, I can do this (when discarding trace code anyway).

    Also, you are passing the get_token routine in a record. The optimizer
    cannot find out that it is the same routine that is called each time.
    It will have to generate code to dereference the ctx variable each time
    the routine is called. In most cases there is one input source, one
    output, and the parser is called once. I would call the lexer directly
    and only make support for multiple concurrent calls to the parser optional.

    This is not an acceptible design.

    You call xg__stack_ensure in lots of places, but it is possible to
    optimize most of these away. Consider a DFA were you store all the read
    tokens in a stack, i.e., the highs of the stack is equal to the number
    of read symbols. If there is no loops in the DFA, then it is a simple
    search to find the maximum number of symbols that can be read. Thus you
    can always make sure the stack is big enough before running the DFA. If
    there are loops, then you can do with one check for each time the DFA
    run through all the states of the loop to ensure enough free space in
    the stack, e.g., assume that the numbers are states and 1->2 is a
    transition from state 1 to state 2. We do not care about which symbols
    are read:

    1->2
    2->3
    3->4
    4->5
    5->2
    5->6

    state 1 is the start state and state 6 is the accept state. You can do
    with checking the among of free space on the stack at the start and in
    one of the states 2, 3, 4, or 5. The same trick can be used in a LR(1)
    automaton. In your case it will most likely save you most of the
    procedure calls to xg__stack_ensure.

    Thanks, I'll consider this one. However, I've tried a fixed size
    stack, with no checks whatsoever and it improved the code
    insignificantly. Thus, this also is cannot be a reason for 2x or
    4x slowdown.

    A few more comments: First what about error recovery? A parser
    generator that generates parsers without error recovery has only so much
    use, and the error recovery can have many implications on how the
    directly executed parser is implemented.

    I'm not interested in optimizing the case of syntax errors present
    in the input. Also, I have no reasons to believe the current way of
    generating the parser routine is an obstacle to error recovery,
    see Bhamidipaty and Proebsting's paper.

    Secondly, you do not have any
    semantic actions on your reductions. All modern computers have cache
    between main storage and the processor. That cache will speed up code
    as long as the code can be in the cache. That has two implications.
    The first implication is that you current benchmarking really is not
    that interesting. It should wait until you have your semantic actions
    in place. The second implication is that if the table driven parser can
    be in the cache, but you directly executed cannot, then it is no
    surprise that the directly executed parser is not that fast.

    This is a likely reason. Both the Bison generated and this parser fit
    in L2 cache, but the Bison parser probably fits in L1 too (it's 9K text
    + rodata, ~300B data and bss).

    ~velco

Re: The case of directly-executable parsers


max 4000 letters.
Your nickname that display:
In order to stop the spam: 4 + 3 =
QUESTION ON "Compilers"

EMSDN.COM