Kdarsten Nyblad wrote:
momchil.velikov@gmail.com wrote:
Directly-executable LR parsers are well-known and a number of
researchers have published amazing speedups over table-driver LR
parsers, in the 2x to 6x or more range, cf.:
Achyutram Bhamidipaty and Todd A. Proebsting
"Very Fast YACC-Compatible Parsers (For Very Little Effort)
R.N.Horspool and M. Whitney
"Even faster LR parsing"
Thomas J. Pennello
"Very fast LR parsing"
I have developed such a LALR(1) generator, which generates a
directly-executable parser in IS C. Admittedly, the parser performs
only a few low-level optimizations - inverted table for non-terminal
transitions and separating the most frequent transition as a default.
However, the speedup I observe is nowhere near the previously reported
ones. Compared to a parser, generated by Bison 2.1 for the IS C 99
grammar (taken straight from the standard, with one minor
modification), the directly-executable parser is only 5-9% faster on
P4 and about 12% faster on P3 Xeon, at the same time being about 3
times bigger.
I assume I'm not actually doing anything stupid with the generated
parser. A sample of the generated code (preprocessed and indent(1)ed)
is here:
What could be the reason for this parser performing so poorly? for
the Bison parser performing so well? Has anyone got some recent
observations with directly-executable vs. table driven recognizers -
scanning (re2c?), parsing, tree-matching?
There are several possibilities. You have not written, which compiler
you are using, so we can only guess. Your code is full of examples of
obvious inefficiencies, e.g.,
do
{
}
while (0);
[You have probably missed it, I have given the compilers and
the command line options used in another message in this
very thread.]
This is an artefact from discarding (via -DNDEBUG) parser trace
code ("Entering state X. Next token is "). The compiler does
discard it.
Such construct will most likely be removed by the optimizer, but they
may confuse it such that the code does not get as good as it should be.
It is a dangerous assumption to assume that a compiler generates good
code for constructs human programmers would never write. It might be
that your compiler can handle only so many loops or basic blocks before
it gives up optimizing.
Not the case.
You are manipulating the stack through a set of inlined procedures. You
are passing the stack as a pointer to a record. First, are you sure the
procedures are inlined?
Yes.
Secondly, the compiler might not be smart
enough to realize that the stack as a variable is not truly aliased.
There could be a severe performance hit because the compiler may think
the stack is aliased and therefore put the top of the stack in main
memory in stead of having it in registers. I would drop that library of
stack operations and implement the stack without the use of a record
structure and as local variables of the parse routine.
This indeed had an insignificant improvement (along with a fixed
size stack), but I think the improvement was from the lack of stack
limit checks.
Then you are jumping around much more than is necessary, e.g.,
goto symbol_387;
// lots of code
symbol_387:
switch (state)
{
default:
goto push_317;
}
First you are again assuming that the compiler will generate good code
from weird constructs that human programmers will never write.
Indeed, I fully expect from the compiler to perform dead code
ellimination
and simple jump optimizations. From research or toy compilers I do not
expect good performance anyway. Unconditional jump to uncoinditional
jump is trivial to optimize. Yes, the compiler does it.
You are
doing that with the switch statement with only a default case label.
The compiler emits a single jump, in the case it's not subsumed by
other optimization. I see no reason for it to be compiled to different
code.
Secondly, the compiler will have to be very smart in order to generate
code such that you jump directly from the first goto to push_317. Why
not substitute "goto push_317;" for "goto symbol_387;" and delete the
code following "symbol_387:".
In the general case there can be many jumps to "symbol_N:", one for
each production having N as its left-hand side.
And, no, the compiler does not need to be very smart, this is a trivial
optimization and the compiler in fact performs it.
Along the same line: From each block of code starting with a label like
reduce_* you jump to some code starting with a label like symbol Why
not move the code of symbol_* to after the first piece of code jumping
to it?
See above. And in the cases there are multiple jumps, it is up to
the compiler to decide whether to duplicate the code.
Then there is your variable state. There is no need to assign to it if
the non kernel of the state does not include items of productions very
the right hand side is not the empty string. You do not need to push
the state, because it will never be read. All you need to do is to
increase the stack top pointer with 1 and let the stack top have an
undefined value. It will never be read anyway.
I'm not sure if I understand this. Do you mean there's no need to
push the state number for states with no non-terminal transitions?
Indeed, I can do this (when discarding trace code anyway).
Also, you are passing the get_token routine in a record. The optimizer
cannot find out that it is the same routine that is called each time.
It will have to generate code to dereference the ctx variable each time
the routine is called. In most cases there is one input source, one
output, and the parser is called once. I would call the lexer directly
and only make support for multiple concurrent calls to the parser optional.
This is not an acceptible design.
You call xg__stack_ensure in lots of places, but it is possible to
optimize most of these away. Consider a DFA were you store all the read
tokens in a stack, i.e., the highs of the stack is equal to the number
of read symbols. If there is no loops in the DFA, then it is a simple
search to find the maximum number of symbols that can be read. Thus you
can always make sure the stack is big enough before running the DFA. If
there are loops, then you can do with one check for each time the DFA
run through all the states of the loop to ensure enough free space in
the stack, e.g., assume that the numbers are states and 1->2 is a
transition from state 1 to state 2. We do not care about which symbols
are read:
1->2
2->3
3->4
4->5
5->2
5->6
state 1 is the start state and state 6 is the accept state. You can do
with checking the among of free space on the stack at the start and in
one of the states 2, 3, 4, or 5. The same trick can be used in a LR(1)
automaton. In your case it will most likely save you most of the
procedure calls to xg__stack_ensure.
Thanks, I'll consider this one. However, I've tried a fixed size
stack, with no checks whatsoever and it improved the code
insignificantly. Thus, this also is cannot be a reason for 2x or
4x slowdown.
A few more comments: First what about error recovery? A parser
generator that generates parsers without error recovery has only so much
use, and the error recovery can have many implications on how the
directly executed parser is implemented.
I'm not interested in optimizing the case of syntax errors present
in the input. Also, I have no reasons to believe the current way of
generating the parser routine is an obstacle to error recovery,
see Bhamidipaty and Proebsting's paper.
Secondly, you do not have any
semantic actions on your reductions. All modern computers have cache
between main storage and the processor. That cache will speed up code
as long as the code can be in the cache. That has two implications.
The first implication is that you current benchmarking really is not
that interesting. It should wait until you have your semantic actions
in place. The second implication is that if the table driven parser can
be in the cache, but you directly executed cannot, then it is no
surprise that the directly executed parser is not that fast.
This is a likely reason. Both the Bison generated and this parser fit
in L2 cache, but the Bison parser probably fits in L1 too (it's 9K text
+ rodata, ~300B data and bss).
~velco