Here are some patches for standard XM XML parsing, simultaneously
improving efficiency both in time and space. , the patches
improve throughput by roughly a factor 1.5 for a wide range of
documents. Memory consumption is reduced, depending on the amount of
whitespace-only Text nodes (think pretty-printing and deeply nested
structures). It passes all XM tests.
I'm sending the test input data files privately to Elliotte as it's
too big for the mailing list.
Here are some results for lhc-teams.xml, lhc-teams-noindent.xml,
randj.xml, periodic.xml
For each file, the results before and after the patch are shown
below, in that order.
[Note that all results are with the current xom-1.1-CVS plus the
"fat" stringified Text.java, (1) ParentN() patch,
Verifier improvements, XMHandler Stack -ArrayList replacement,
plus a variety of assorted simple one or two line performance
patches, minus the QName LRU cache. I've grown tired of backporting,
so I've now moved my complete patchset forward to xom-1.1-CVS. This
will most likely be the basis of the next Nux version. Yes, XM is
improving much slower than my patch set is growing, so I'm not sure
where this trajectory is leading. day I might have to make some
hard decisions that address this issue, I'm not sure yet.]
As always, measurements are on Linux, sun-jdk-1.5, server VM,
xerces-2.7.0.
[pabst /] export JAVAPTS='-
server -agentlib:hprof=cpu=samples,depth=10,interval=10 -
'
[pabst /] bin/fire-java
nux.xom.tests.BinaryXMLTest deser xom 0 500 data/samples/data/lhc-
teams.xml
patchesEnabled=true
now processing
compression factor = 2.4675896
SUMMARY
files = 500
secs = 50.411
mean throughput MB/s = 13.8365345
mean compression factor = 2.4675894
files/sec = 9.91847
checksum = 0
Dumping CPU usage by sampling running threads done.
[pabst /] bin/fire-java
nux.xom.tests.BinaryXMLTest deser xom 0 500 data/samples/data/lhc-
teams.xml
patchesEnabled=true
now processing
compression factor = 2.4675896
SUMMARY
files = 500
secs = 35.741
mean throughput MB/s = 19.515781
mean compression factor = 2.4675894
files/sec = 13.989535
checksum = 0
Dumping CPU usage by sampling running threads done.
[pabst /] bin/fire-java
nux.xom.tests.BinaryXMLTest deser xom 0 500 data/samples/data/lhc-
teams-noindent.xml
patchesEnabled=true
now processing
compression factor = 1.9681828
SUMMARY
files = 500
secs = 36.832
mean throughput MB/s = 13.242769
mean compression factor = 1.9681828
files/sec = 13.575151
checksum = 0
Dumping CPU usage by sampling running threads done.
[pabst /] bin/fire-java
nux.xom.tests.BinaryXMLTest deser xom 0 500 data/samples/data/lhc-
now processing
compression factor = 1.9681828
SUMMARY
files = 500
secs = 27.751
mean throughput MB/s = 17.576221
mean compression factor = 1.9681828
files/sec = 18.017368
checksum = 0
Dumping CPU usage by sampling running threads done.
[pabst /] less java.hprof.txt
[pabst /] less java.hprof.txt
[pabst /]
[pabst /]
[pabst /]
[pabst /] bin/fire-java
nux.xom.tests.BinaryXMLTest deser xom 0 3000 data/samples/data/
randj.xml
patchesEnabled=true
now processing data/samples/data/randj.xml
compression factor = 1.3920223
SUMMARY
files = 3000
secs = 45.016
mean throughput MB/s = 14.851439
mean compression factor = 1.3920223
files/sec = 66.642975
checksum = 0
Dumping CPU usage by sampling running threads done.
[pabst /]
[pabst /] bin/fire-java
nux.xom.tests.BinaryXMLTest deser xom 0 3000 data/samples/data/
randj.xml
patchesEnabled=true
now processing data/samples/data/randj.xml
compression factor = 1.3920223
SUMMARY
files = 3000
secs = 39.171
mean throughput MB/s = 17.067533
mean compression factor = 1.3920223
files/sec = 76.587265
checksum = 0
Dumping CPU usage by sampling running threads done.
[pabst /] bin/fire-java
nux.xom.tests.BinaryXMLTest deser xom 0 6000 data/samples/data/
periodic.xml
patchesEnabled=true
now processing data/samples/data/periodic.xml
compression factor = 3.619404
SUMMARY
files = 6000
secs = 45.9
mean throughput MB/s = 14.523899
mean compression factor = 3.6194043
files/sec = 130.71895
checksum = 0
Dumping CPU usage by sampling running threads done.
[pabst /] bin/fire-java
nux.xom.tests.BinaryXMLTest deser xom 0 6000 data/samples/data/
periodic.xml
patchesEnabled=true
now processing data/samples/data/periodic.xml
compression factor = 3.619404
SUMMARY
files = 6000
secs = 38.575
mean throughput MB/s = 17.28184
mean compression factor = 3.6194043
files/sec = 155.54115
checksum = 0
Dumping CPU usage by sampling running threads done.
[pabst /]
XMHandler.java (sorry for the completely messed up copy&paste
indentation):
// protected Stack parents;
protected ArrayList parents; // WH
public void startDocument() {
inDTD = false;
document = factory.startMakingDocument();
parent = document;
current = document;
// parents = new Stack();
parents = new ArrayList(); // WH
// parents.push(document);
parents.add(document); // WH
inProlog = true;
position = 0;
// buffer = new StringBuffer(); // WH
buffer = null;
first = null;
doctype = null;
if (locator != null) {
documentBaseURI = locator.getSystemId();
// According to the XML spec,
// "It is an error for a fragment identifier
// (beginning with a # character) to be part of a system
identifier"
// but some parsers including Xerces seem to get this
wrong, so we'll
document.setBaseURI(documentBaseURI);
}
}
public void endDocument() {
factory.finishMakingDocument(document);
// parents.pop();
parents.remove(parents.size()-1); // WH
// // tmp; for diagnostics only
// String s = "stats= [";
// long sum = 0;
// for (int i=0; i < stats.length; i++) {
// sum += stats[i];
// s += stats[i] + " ";
// }
// s += "] = [";
// for (int i=0; i < stats.length; i++) {
// s += Math.round(100.0f * stats[i] / sum);
// s += "% ";
// stats[i] = 0;
// }
// s += "]";
// System.err.println(s);
//// System.err.println("multiWhitespace=" + multiWhitespace);
}
public void startElement(String namespaceURI, String localName,
String qualifiedName, org.xml.sax.Attributes attributes) {
// Need to push this, even if it's null
// parents.push(element);
parents.add(element); // WH
int length = attributes.getLength();
for (int i = 0; i < length; i++) {
}
public void endElement(
String namespaceURI, String localName, String qualifiedName) {
// If we're immediately inside a skipped element
// we need to reset current to null, not to the parent
// current = (ParentNode) parents.pop();
current = (ParentNode) parents.remove(parents.size()-1); // WH
flushText();
}
protected StringBuffer buffer;
protected String first;
private static final String[] CRLF_SP;
private static final String[] LF_SP;
private static final String[] SPACES;
private static final String[] CRLF_TAB;
private static final String[] LF_TAB;
private static final String[] TABS;
// tmp; for diagnostics only
// protected int[] stats = new int[4];
// protected int callCounter;
// private int totalTextMemory;
// protected boolean isWhitespace = false;
// protected int multiWhitespace;
static { // initializer
// Make interned string for memory sharing of frequent
whitespace.
// CR LF = 0x0D 0x0A = \r \n
int size = 50; // somewhat arbitrary (12 nesting levels
* 4 blanks)
char[] spaceChars = new char[size + 2];
spaceChars[0] = '\r';
spaceChars[1] = '\n';
for (int i = 2; i < spaceChars.length; i++) spaceChars[i] =
' ';
String spaces = new String(spaceChars);
CRLF_SP = new String[size]; // "\r\n", "\r\n ", "\r\n ",
LF_SP = new String[size]; // "\n", "\n ", "\n ",
SPACES = new String[size]; // "", " ", " ",
for (int j = 0; j < size; j++) { // overlay and share memory
CRLF_SP[j] = spaces.substring(0, j + 2);
LF_SP[j] = spaces.substring(1, j + 2);
SPACES[j] = spaces.substring(2, j + 2);
}
// now do the same for tab characters rather than blanks
char[] tabChars = new char[size + 2];
tabChars[0] = '\r';
tabChars[1] = '\n';
for (int i = 2; i < tabChars.length; i++) tabChars[i] = '\t';
String tabs = new String(tabChars);
CRLF_TAB = new String[size]; // "\r\n", "\r\n\t", "\r\n\t
\t",
LF_TAB = new String[size]; // "\n", "\n\t", "\n\t\t",
TABS = new String[size]; // "", "\t", "\t\t",
for (int j = 0; j < size; j++) { // overlay and share memory
CRLF_TAB[j] = tabs.substring(0, j + 2);
CRLF_TAB[j] = tabs.substring(1, j + 2);
TABS[j] = tabs.substring(2, j + 2);
}
}
public void characters(char[] text, int start, int length) {
/*
* Based on measurements of a wide variety of real-world
documents, one
* call to characters() per flushText() is extremely common,
a second
* one quite uncommon, and more calls to it are extremely
uncommon.
* Hence, unnecessary temporary String and StringBuffer
memory copies
* and allocations can be avoided, improving performance,
*irrespective* whether
* the "fat" stringified Text.java is used, or the Text.java
version with
* the UTF-8 storage trick.
* See endDocument() on how to measure character() per
flushText() frequency, uncommenting
* some diagnostic statistics gathering code.
*
* Plus, whitepace separator lines make up a large fraction
of data in
* heavily pretty printed documents. So, let's share most of
those
* string objects via an extremely low-overhead interning
mechanism.
*
* All in all this should make average memory consumption of
a XM
* document with the stringified Text patch comparable or
lower than with the
* Text.java UTF-8 trick. Plus, it retains the performance
benefits of
* the stringified Text patch, for use cases reading or
writing Text a
* lot, for example many XPath/XQuery use cases.
*/
if (length <= 0) return; // nothing to do (important: do not
remove; ensures correct algo semantics)
if (first == null && buffer == null) {
// first call to characters()
if (length < SPACES.length) { // any possibility for
interning whitespace-only?
// scan backwards to beginning of run of blanks or tabs
int i = start + length;
char ch = text[];
if (ch != ' ' && ch != '\t') {
ch = ' ';
} else {
i++; while ( >= start && text[i] == ch) ;
}
switch (i+1-start) { // is there internable
whitespace-only?
case 0: {
first = ch == ' ' ? SPACES[length] : TABS
[length];
break;
}
case 1: {
if (text[start] == '\n')
first = ch == ' ' ? LF_SP[length-1] :
LF_TAB[length-1];
break;
}
case 2: {
if (text[start] == '\r' && text[start+1] ==
'\n')
first = ch == ' ' ? CRLF_SP[length-2] :
CRLF_TAB[length-2];
break;
}
default : // fall through to non-interned string
copy
}
}
// fallback to non-interned string copy
if (first == null) first = new String(text, start, length);
} else {
if (buffer == null) {
// second call to characters(); infrequently called
buffer = new StringBuffer(first.length() +
length); // doesn't waste memory
buffer.append(first);
first = null;
}
// second or higher call to characters(); third or
higher call is very rare
buffer.append(text, start, length);
}
if (finishedCDATA) inCDATA = false;
// callCounter++;
}
// accumulate all text that's in the buffer into a text node
protected void flushText() {
String text = buffer != null ? buffer.toString() : first;
if (text != null) {
// stats[Math.min(3, callCounter)]++;
// if (counter == 2) System.err.println("counter == 2: ("
+ text + ")");
// if (counter >= 3) System.err.println("counter >= 3: ("
+ text + ")");
Nodes result;
if (!inCDATA) {
// result = factory.makeText(buffer.toString());
result = factory.makeText(text);
} else {
// result = factory.makeCDATASection(buffer.toString
());
result = factory.makeCDATASection(text);
}
int count = result.size();
for (int i = 0; i < count; i++) {
Node node = result.get(i);
if (node.isAttribute()) {
((Element) parent).addAttribute((Attribute) node);
} else {
parent.appendChild(node);
}
}
// buffer = new StringBuffer();
first = null;
buffer = null;
}
inCDATA = false;
finishedCDATA = false;
// callCounter = 0;
}
protected boolean inCDATA = false;
protected boolean finishedCDATA = false;
public void startCDATA() {
// System.out.println("startCDATA: first="+ first +",
buffer="+buffer + ".");
if (first == null && buffer == null) inCDATA = true;
// if (first != null || buffer != null) inCDATA = true; //
bug!
finishedCDATA = false;
}
// public void startCDATA() {
// if (buffer.length() == 0) inCDATA = true;
// finishedCDATA = false;
// }
public void endCDATA() {
// System.out.println("end CDATA: first="+ first +",
buffer="+buffer + ".");
finishedCDATA = true;
}
class NonVerifyingHandler.java:
public void startElement(String namespaceURI, String localName,
String qualifiedName, org.xml.sax.Attributes attributes) {
// Need to push this, even if it's null
// parents.push(element);
parents.add(element); // WH
}
public void endElement(
String namespaceURI, String localName, String qualifiedName) {
// If we're immediately inside a skipped element
// we need to reset current to null, not to the parent
// current = (ParentNode) parents.pop();
current = (ParentNode) parents.remove(parents.size()-1); // WH
flushText();
}
// accumulate all text that's in the buffer into a text node
protected void flushText() {
String text = buffer != null ? buffer.toString() : first;
// System.out.println("flush:" + inCDATA +", first="+ first
+", buffer="+buffer + ".");
if (text != null) {
// stats[Math.min(3, callCounter)]++;
// if (counter 1 && isWhitespace) System.err.println
("concat whitespace!!!");
// if (counter == 2) System.err.println("counter == 2:" +
text + "#");
// if (counter >= 3) System.err.println("counter >= 3:" +
text + "#");
Text result;
if (!inCDATA) {
result = Text.build(text);
// result = Text.build(buffer.toString());
} else {
result = CDATASection.build(text);
// result = CDATASection.build(buffer.toString());
}
parent.fastInsertChild(result, parent.getChildCount());
// buffer = new StringBuffer();
first = null;
buffer = null;
}
inCDATA = false;
finishedCDATA = false;
// callCounter = 0;
}
XM-interest mailing list
XM-interest (AT) lists (DOT) ibiblio.org