Michael Champion said:
See
"XML Parsing - A Threat to Database Performance." Be forewarned that the
conclusion may be unpalatable:
By rights, it seems that there should be some market for a highly
optimized XML parser. You need high performance, you seek high performance
libraries; if there are none, you get them made internally or externally.
But I don't recall ever having seen any requests on XML-DEV for high speed
parsers: certainly none with any dollars behind them.
If some companies get together and say "We will pay $$$ for a higher
performance XML parser" they would get one. A $10,000 first prize and
$5,000 second prize for the winning parser on specified data, schema and
platform would be enough to stimulate a lot of hackers and researchers,
not to mention prompting people with in-house, private parsers to open
source them. When you move to an Open Source software economy, the issue
for business becomes "How do we stimulate development in areas that help
us?"
This week I was listening to people from a client airline who had to
write their own XML parser in PL/I for optimized access to mainframe DB2.
The lack of such a parser suggests to me that organizations using
databases need to adopt a new, pro-active stance in getting high
performance, open source XML software written. Passivity in this area
will ensure they have only unsuitable implementations.
If you look at, say, Apache Xerces and Xalan, you can see that
hyper-efficiency plays little part in the game. The same is true, by and
large, for the other open source software. Hyper-efficient design is not
an optimization that can be tacked on afterwards; it has to be at the
core of the design, and you cannot expect a general-purpose,
cross-platform parser to be optimal. (For example, one trick that goes as
far back as Mark's predecessor in the late 80s, I believe, was for a
parser to contain two parsers: one optimized for the most common case and
encoding, which for XML would be an entity-less document, and another to
handle all the other cases.)
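
To make the two-parser trick concrete, here is a minimal Python sketch. It
is only an illustration under stated assumptions, not how any existing
parser is built: the names is_simple, fast_events, full_events and parse
are hypothetical, the "common case" is assumed to be pure ASCII with no
entities, attributes, comments, CDATA or processing instructions, and the
fallback is simply the standard library's xml.sax. The fast path is not a
conforming XML parser; anything it does not recognize is handed to the
full parser.

```python
import re
import xml.sax
from xml.sax.handler import ContentHandler

# Fast path grammar (assumption): bare start/end tags and text only.
_TOKEN = re.compile(rb"<(/?)([A-Za-z_][A-Za-z0-9_.-]*)>|([^<]+)")


def is_simple(data: bytes) -> bool:
    """Cheap pre-check: pure ASCII and no '&', '<!' or '<?' constructs."""
    return data.isascii() and not any(c in data for c in (b"&", b"<!", b"<?"))


def fast_events(data: bytes):
    """Tokenize the restricted subset; raise ValueError on anything else."""
    events, stack, pos, root_done = [], [], 0, False
    while pos < len(data):
        m = _TOKEN.match(data, pos)
        if m is None:
            raise ValueError("construct outside the fast path")
        closing, name, text = m.groups()
        if text is not None:
            if stack:
                events.append(("text", text.decode("ascii")))
            elif text.strip():
                raise ValueError("text outside the root element")
        elif closing:
            if not stack or stack.pop() != name:
                raise ValueError("mismatched end tag")
            events.append(("end", name.decode("ascii")))
            root_done = not stack
        else:
            if root_done:
                raise ValueError("more than one root element")
            stack.append(name)
            events.append(("start", name.decode("ascii")))
        pos = m.end()
    if stack or not root_done:
        raise ValueError("unclosed or missing root element")
    return events


class _Collector(ContentHandler):
    """Collects the same event tuples from a full SAX parse (attributes dropped)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # The SAX parser may deliver text in several chunks; merge them.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + content)
        else:
            self.events.append(("text", content))


def full_events(data: bytes):
    handler = _Collector()
    xml.sax.parseString(data, handler)
    return handler.events


def parse(data: bytes):
    """Try the specialized parser first, fall back to the general one."""
    if is_simple(data):
        try:
            return fast_events(data)
        except ValueError:
            pass  # let the full parser handle it or report the real error
    return full_events(data)
```

With this sketch, parse(b"<a><b>hi</b></a>") stays on the fast path, while
parse(b"<a>x &amp; y</a>") fails the pre-check and goes through xml.sax.
Both return the same kind of event list, so the caller does not need to
know which parser ran.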
My expectation is that XML parsing can be significantly sped up by making
better use of SSE intrinsics*, by integrating parsing with transcoding, by
doing validation and type assignment with streaming path-matching rather
than automata (i.e. transforming horizontal grammars into vertical paths),
and by parsing directly to native data types, for numbers for example. I
am sure many other people have a shopping list of good ideas, but AFAIK no
parser implements any of these things at the moment. Parser innovation
has stalled, and it surely should be an issue of serious concern (and by
serious concern I mean $$$) to high-volume companies to get it restarted.
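
As an illustration of the "direct parsing to native data types" item, the
following Python sketch converts the digits of a number straight out of
the input buffer, without first cutting out and decoding a string. The
function name parse_decimal and its (buf, start, end) signature are
assumptions made for this example. In Python itself this will not be
faster than int(); the point is only the shape of the loop that a
lower-level parser could run while it scans character data.

```python
def parse_decimal(buf: bytes, start: int, end: int) -> int:
    """Convert the ASCII digits in buf[start:end] to an int, reading the
    buffer in place. An optional leading '+' or '-' is accepted."""
    if start >= end:
        raise ValueError("empty number")
    pos = start
    negative = buf[pos] == 0x2D          # '-'
    if negative or buf[pos] == 0x2B:     # '+'
        pos += 1
        if pos == end:
            raise ValueError("sign without digits")
    value = 0
    while pos < end:
        digit = buf[pos] - 0x30          # distance from '0'
        if not 0 <= digit <= 9:
            raise ValueError("non-digit byte at offset %d" % pos)
        value = value * 10 + digit
        pos += 1
    return -value if negative else value


# Usage: the tokenizer already knows where the character data begins and ends.
buf = b"<qty>1250</qty>"
start = buf.index(b">") + 1
end = buf.index(b"<", start)
quantity = parse_decimal(buf, start, end)   # 1250, as a native int
```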
The other aspect is that there is no "type aware SAX" API. Without this,
Open Source or even proprietary versus public parsers are not
interchangeable. This applies to Java most, but the principle is the
same: we need agreements at the interfaces (a.k.a. standards).
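
To give an idea of what a "type aware SAX" interface could look like, here
is a small Python sketch layered on the standard xml.sax module. Nothing
in it is an existing standard: the TypedHandler callback, the
TypingAdapter class, the collect_typed helper and the path-to-type table
are all hypothetical names invented for this example. The types are
assigned by looking up the current element path in a table, so it also
shows the path-matching idea in its simplest form. It only handles leaf
elements with simple content.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TypedHandler:
    """Hypothetical type-aware callback interface (not an existing standard)."""

    def typed_value(self, path, value):
        raise NotImplementedError


class TypingAdapter(ContentHandler):
    """Wraps ordinary SAX events and emits typed values for known paths."""

    def __init__(self, types, target):
        super().__init__()
        self._types = types      # e.g. {"/order/id": int}
        self._target = target    # a TypedHandler
        self._path = []
        self._text = []

    def startElement(self, name, attrs):
        self._path.append(name)
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        path = "/" + "/".join(self._path)
        convert = self._types.get(path)
        if convert is not None:
            # Type assignment by path lookup, then conversion to a native value.
            self._target.typed_value(path, convert("".join(self._text).strip()))
        self._path.pop()
        self._text = []


def collect_typed(data: bytes, types):
    """Convenience helper: return the (path, value) pairs as a list."""

    class _ListHandler(TypedHandler):
        def __init__(self):
            self.values = []

        def typed_value(self, path, value):
            self.values.append((path, value))

    target = _ListHandler()
    xml.sax.parseString(data, TypingAdapter(types, target))
    return target.values
```

For a document such as
<order><id>42</id><total>19.95</total><note>rush</note></order> and the
table {"/order/id": int, "/order/total": float}, the application receives
an int and a float instead of strings, and untyped elements such as note
are simply not reported. Because the application only sees TypedHandler,
any parser that can drive that interface could be swapped in.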
Cheers
Rick Jelliffe
* See and search for
Intrinsics. The O'Reilly blog site is being altered and is a complete mess
at the moment, so sorry about the odd format for this archive.
The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
initiative of OASIS <http://www.oasis-open.org>