Java

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • More parsing performance patches

    8 answers - 15791 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    Here are some patches for standard XM XML parsing, simultaneously
    improving efficiency both in time and space. , the patches
    improve throughput by roughly a factor 1.5 for a wide range of
    documents. Memory consumption is reduced, depending on the amount of
    whitespace-only Text nodes (think pretty-printing and deeply nested
    structures). It passes all XM tests.
    I'm sending the test input data files privately to Elliotte as it's
    too big for the mailing list.
    Here are some results for lhc-teams.xml, lhc-teams-noindent.xml,
    randj.xml, periodic.xml
    For each file, the results before and after the patch are shown
    below, in that order.
    [Note that all results are with the current xom-1.1-CVS plus the
    "fat" stringified Text.java, (1) ParentN() patch,
    Verifier improvements, XMHandler Stack -ArrayList replacement,
    plus a variety of assorted simple one or two line performance
    patches, minus the QName LRU cache. I've grown tired of backporting,
    so I've now moved my complete patchset forward to xom-1.1-CVS. This
    will most likely be the basis of the next Nux version. Yes, XM is
    improving much slower than my patch set is growing, so I'm not sure
    where this trajectory is leading. day I might have to make some
    hard decisions that address this issue, I'm not sure yet.]
    As always, measurements are on Linux, sun-jdk-1.5, server VM,
    xerces-2.7.0.
    [pabst /] export JAVAPTS='-
    server -agentlib:hprof=cpu=samples,depth=10,interval=10 -
    '
    [pabst /] bin/fire-java
    nux.xom.tests.BinaryXMLTest deser xom 0 500 data/samples/data/lhc-
    teams.xml
    patchesEnabled=true
    now processing
    compression factor = 2.4675896
    SUMMARY
    files = 500
    secs = 50.411
    mean throughput MB/s = 13.8365345
    mean compression factor = 2.4675894
    files/sec = 9.91847
    checksum = 0
    Dumping CPU usage by sampling running threads done.
    [pabst /] bin/fire-java
    nux.xom.tests.BinaryXMLTest deser xom 0 500 data/samples/data/lhc-
    teams.xml
    patchesEnabled=true
    now processing
    compression factor = 2.4675896
    SUMMARY
    files = 500
    secs = 35.741
    mean throughput MB/s = 19.515781
    mean compression factor = 2.4675894
    files/sec = 13.989535
    checksum = 0
    Dumping CPU usage by sampling running threads done.
    [pabst /] bin/fire-java
    nux.xom.tests.BinaryXMLTest deser xom 0 500 data/samples/data/lhc-
    teams-noindent.xml
    patchesEnabled=true
    now processing
    compression factor = 1.9681828
    SUMMARY
    files = 500
    secs = 36.832
    mean throughput MB/s = 13.242769
    mean compression factor = 1.9681828
    files/sec = 13.575151
    checksum = 0
    Dumping CPU usage by sampling running threads done.
    [pabst /] bin/fire-java
    nux.xom.tests.BinaryXMLTest deser xom 0 500 data/samples/data/lhc-
    now processing
    compression factor = 1.9681828
    SUMMARY
    files = 500
    secs = 27.751
    mean throughput MB/s = 17.576221
    mean compression factor = 1.9681828
    files/sec = 18.017368
    checksum = 0
    Dumping CPU usage by sampling running threads done.
    [pabst /] less java.hprof.txt
    [pabst /] less java.hprof.txt
    [pabst /]
    [pabst /]
    [pabst /]
    [pabst /] bin/fire-java
    nux.xom.tests.BinaryXMLTest deser xom 0 3000 data/samples/data/
    randj.xml
    patchesEnabled=true
    now processing data/samples/data/randj.xml
    compression factor = 1.3920223
    SUMMARY
    files = 3000
    secs = 45.016
    mean throughput MB/s = 14.851439
    mean compression factor = 1.3920223
    files/sec = 66.642975
    checksum = 0
    Dumping CPU usage by sampling running threads done.
    [pabst /]
    [pabst /] bin/fire-java
    nux.xom.tests.BinaryXMLTest deser xom 0 3000 data/samples/data/
    randj.xml
    patchesEnabled=true
    now processing data/samples/data/randj.xml
    compression factor = 1.3920223
    SUMMARY
    files = 3000
    secs = 39.171
    mean throughput MB/s = 17.067533
    mean compression factor = 1.3920223
    files/sec = 76.587265
    checksum = 0
    Dumping CPU usage by sampling running threads done.
    [pabst /] bin/fire-java
    nux.xom.tests.BinaryXMLTest deser xom 0 6000 data/samples/data/
    periodic.xml
    patchesEnabled=true
    now processing data/samples/data/periodic.xml
    compression factor = 3.619404
    SUMMARY
    files = 6000
    secs = 45.9
    mean throughput MB/s = 14.523899
    mean compression factor = 3.6194043
    files/sec = 130.71895
    checksum = 0
    Dumping CPU usage by sampling running threads done.
    [pabst /] bin/fire-java
    nux.xom.tests.BinaryXMLTest deser xom 0 6000 data/samples/data/
    periodic.xml
    patchesEnabled=true
    now processing data/samples/data/periodic.xml
    compression factor = 3.619404
    SUMMARY
    files = 6000
    secs = 38.575
    mean throughput MB/s = 17.28184
    mean compression factor = 3.6194043
    files/sec = 155.54115
    checksum = 0
    Dumping CPU usage by sampling running threads done.
    [pabst /]
    XMHandler.java (sorry for the completely messed up copy&paste
    indentation):
    // protected Stack parents;
    protected ArrayList parents; // WH
    public void startDocument() {
    inDTD = false;
    document = factory.startMakingDocument();
    parent = document;
    current = document;
    // parents = new Stack();
    parents = new ArrayList(); // WH
    // parents.push(document);
    parents.add(document); // WH
    inProlog = true;
    position = 0;
    // buffer = new StringBuffer(); // WH
    buffer = null;
    first = null;
    doctype = null;
    if (locator != null) {
    documentBaseURI = locator.getSystemId();
    // According to the XML spec,
    // "It is an error for a fragment identifier
    // (beginning with a # character) to be part of a system
    identifier"
    // but some parsers including Xerces seem to get this
    wrong, so we'll
    document.setBaseURI(documentBaseURI);
    }
    }
    public void endDocument() {
    factory.finishMakingDocument(document);
    // parents.pop();
    parents.remove(parents.size()-1); // WH
    // // tmp; for diagnostics only
    // String s = "stats= [";
    // long sum = 0;
    // for (int i=0; i < stats.length; i++) {
    // sum += stats[i];
    // s += stats[i] + " ";
    // }
    // s += "] = [";
    // for (int i=0; i < stats.length; i++) {
    // s += Math.round(100.0f * stats[i] / sum);
    // s += "% ";
    // stats[i] = 0;
    // }
    // s += "]";
    // System.err.println(s);
    //// System.err.println("multiWhitespace=" + multiWhitespace);
    }
    public void startElement(String namespaceURI, String localName,
    String qualifiedName, org.xml.sax.Attributes attributes) {
    // Need to push this, even if it's null
    // parents.push(element);
    parents.add(element); // WH
    int length = attributes.getLength();
    for (int i = 0; i < length; i++) {
    }
    public void endElement(
    String namespaceURI, String localName, String qualifiedName) {
    // If we're immediately inside a skipped element
    // we need to reset current to null, not to the parent
    // current = (ParentNode) parents.pop();
    current = (ParentNode) parents.remove(parents.size()-1); // WH
    flushText();
    }
    protected StringBuffer buffer;
    protected String first;
    private static final String[] CRLF_SP;
    private static final String[] LF_SP;
    private static final String[] SPACES;
    private static final String[] CRLF_TAB;
    private static final String[] LF_TAB;
    private static final String[] TABS;
    // tmp; for diagnostics only
    // protected int[] stats = new int[4];
    // protected int callCounter;
    // private int totalTextMemory;
    // protected boolean isWhitespace = false;
    // protected int multiWhitespace;
    static { // initializer
    // Make interned string for memory sharing of frequent
    whitespace.
    // CR LF = 0x0D 0x0A = \r \n
    int size = 50; // somewhat arbitrary (12 nesting levels
    * 4 blanks)
    char[] spaceChars = new char[size + 2];
    spaceChars[0] = '\r';
    spaceChars[1] = '\n';
    for (int i = 2; i < spaceChars.length; i++) spaceChars[i] =
    ' ';
    String spaces = new String(spaceChars);
    CRLF_SP = new String[size]; // "\r\n", "\r\n ", "\r\n ",
    LF_SP = new String[size]; // "\n", "\n ", "\n ",
    SPACES = new String[size]; // "", " ", " ",
    for (int j = 0; j < size; j++) { // overlay and share memory
    CRLF_SP[j] = spaces.substring(0, j + 2);
    LF_SP[j] = spaces.substring(1, j + 2);
    SPACES[j] = spaces.substring(2, j + 2);
    }
    // now do the same for tab characters rather than blanks
    char[] tabChars = new char[size + 2];
    tabChars[0] = '\r';
    tabChars[1] = '\n';
    for (int i = 2; i < tabChars.length; i++) tabChars[i] = '\t';
    String tabs = new String(tabChars);
    CRLF_TAB = new String[size]; // "\r\n", "\r\n\t", "\r\n\t
    \t",
    LF_TAB = new String[size]; // "\n", "\n\t", "\n\t\t",
    TABS = new String[size]; // "", "\t", "\t\t",
    for (int j = 0; j < size; j++) { // overlay and share memory
    CRLF_TAB[j] = tabs.substring(0, j + 2);
    CRLF_TAB[j] = tabs.substring(1, j + 2);
    TABS[j] = tabs.substring(2, j + 2);
    }
    }
    public void characters(char[] text, int start, int length) {
    /*
    * Based on measurements of a wide variety of real-world
    documents, one
    * call to characters() per flushText() is extremely common,
    a second
    * one quite uncommon, and more calls to it are extremely
    uncommon.
    * Hence, unnecessary temporary String and StringBuffer
    memory copies
    * and allocations can be avoided, improving performance,
    *irrespective* whether
    * the "fat" stringified Text.java is used, or the Text.java
    version with
    * the UTF-8 storage trick.
    * See endDocument() on how to measure character() per
    flushText() frequency, uncommenting
    * some diagnostic statistics gathering code.
    *
    * Plus, whitepace separator lines make up a large fraction
    of data in
    * heavily pretty printed documents. So, let's share most of
    those
    * string objects via an extremely low-overhead interning
    mechanism.
    *
    * All in all this should make average memory consumption of
    a XM
    * document with the stringified Text patch comparable or
    lower than with the
    * Text.java UTF-8 trick. Plus, it retains the performance
    benefits of
    * the stringified Text patch, for use cases reading or
    writing Text a
    * lot, for example many XPath/XQuery use cases.
    */
    if (length <= 0) return; // nothing to do (important: do not
    remove; ensures correct algo semantics)
    if (first == null && buffer == null) {
    // first call to characters()
    if (length < SPACES.length) { // any possibility for
    interning whitespace-only?
    // scan backwards to beginning of run of blanks or tabs
    int i = start + length;
    char ch = text[];
    if (ch != ' ' && ch != '\t') {
    ch = ' ';
    } else {
    i++; while ( >= start && text[i] == ch) ;
    }
    switch (i+1-start) { // is there internable
    whitespace-only?
    case 0: {
    first = ch == ' ' ? SPACES[length] : TABS
    [length];
    break;
    }
    case 1: {
    if (text[start] == '\n')
    first = ch == ' ' ? LF_SP[length-1] :
    LF_TAB[length-1];
    break;
    }
    case 2: {
    if (text[start] == '\r' && text[start+1] ==
    '\n')
    first = ch == ' ' ? CRLF_SP[length-2] :
    CRLF_TAB[length-2];
    break;
    }
    default : // fall through to non-interned string
    copy
    }
    }
    // fallback to non-interned string copy
    if (first == null) first = new String(text, start, length);
    } else {
    if (buffer == null) {
    // second call to characters(); infrequently called
    buffer = new StringBuffer(first.length() +
    length); // doesn't waste memory
    buffer.append(first);
    first = null;
    }
    // second or higher call to characters(); third or
    higher call is very rare
    buffer.append(text, start, length);
    }
    if (finishedCDATA) inCDATA = false;
    // callCounter++;
    }
    // accumulate all text that's in the buffer into a text node
    protected void flushText() {
    String text = buffer != null ? buffer.toString() : first;
    if (text != null) {
    // stats[Math.min(3, callCounter)]++;
    // if (counter == 2) System.err.println("counter == 2: ("
    + text + ")");
    // if (counter >= 3) System.err.println("counter >= 3: ("
    + text + ")");
    Nodes result;
    if (!inCDATA) {
    // result = factory.makeText(buffer.toString());
    result = factory.makeText(text);
    } else {
    // result = factory.makeCDATASection(buffer.toString
    ());
    result = factory.makeCDATASection(text);
    }
    int count = result.size();
    for (int i = 0; i < count; i++) {
    Node node = result.get(i);
    if (node.isAttribute()) {
    ((Element) parent).addAttribute((Attribute) node);
    } else {
    parent.appendChild(node);
    }
    }
    // buffer = new StringBuffer();
    first = null;
    buffer = null;
    }
    inCDATA = false;
    finishedCDATA = false;
    // callCounter = 0;
    }
    protected boolean inCDATA = false;
    protected boolean finishedCDATA = false;
    public void startCDATA() {
    // System.out.println("startCDATA: first="+ first +",
    buffer="+buffer + ".");
    if (first == null && buffer == null) inCDATA = true;
    // if (first != null || buffer != null) inCDATA = true; //
    bug!
    finishedCDATA = false;
    }
    // public void startCDATA() {
    // if (buffer.length() == 0) inCDATA = true;
    // finishedCDATA = false;
    // }
    public void endCDATA() {
    // System.out.println("end CDATA: first="+ first +",
    buffer="+buffer + ".");
    finishedCDATA = true;
    }
    class NonVerifyingHandler.java:
    public void startElement(String namespaceURI, String localName,
    String qualifiedName, org.xml.sax.Attributes attributes) {
    // Need to push this, even if it's null
    // parents.push(element);
    parents.add(element); // WH
    }
    public void endElement(
    String namespaceURI, String localName, String qualifiedName) {
    // If we're immediately inside a skipped element
    // we need to reset current to null, not to the parent
    // current = (ParentNode) parents.pop();
    current = (ParentNode) parents.remove(parents.size()-1); // WH
    flushText();
    }
    // accumulate all text that's in the buffer into a text node
    protected void flushText() {
    String text = buffer != null ? buffer.toString() : first;
    // System.out.println("flush:" + inCDATA +", first="+ first
    +", buffer="+buffer + ".");
    if (text != null) {
    // stats[Math.min(3, callCounter)]++;
    // if (counter 1 && isWhitespace) System.err.println
    ("concat whitespace!!!");
    // if (counter == 2) System.err.println("counter == 2:" +
    text + "#");
    // if (counter >= 3) System.err.println("counter >= 3:" +
    text + "#");
    Text result;
    if (!inCDATA) {
    result = Text.build(text);
    // result = Text.build(buffer.toString());
    } else {
    result = CDATASection.build(text);
    // result = CDATASection.build(buffer.toString());
    }
    parent.fastInsertChild(result, parent.getChildCount());
    // buffer = new StringBuffer();
    first = null;
    buffer = null;
    }
    inCDATA = false;
    finishedCDATA = false;
    // callCounter = 0;
    }
    XM-interest mailing list
    XM-interest (AT) lists (DOT) ibiblio.org
  • No.1 | | 1295 bytes | |

    Wolfgang Hoschek wrote:
    Here are some patches for standard XM XML parsing, simultaneously
    improving efficiency both in time and space. , the patches
    improve throughput by roughly a factor 1.5 for a wide range of
    documents. Memory consumption is reduced, depending on the amount of
    whitespace-only Text nodes (think pretty-printing and deeply nested
    structures). It passes all XM tests.

    Scanning the patches the main change seems to be moving from Stack,
    which is a synchronized Vector, to an ArrayList which is unsynchronized.
    Is that accurate? Make sense. Shouldn't be too hard to implement. I
    might want to write my own unsynchronized Stack class just to keep the
    code cleaner however (not to mention, java.util.Stack is a classic
    example of the wrong way to use inheritance).

    I do suspect you're overestimating the prevalence of whitespace only
    text nodes. Pretty printing is great for humans but in any sort of XML
    situation where you care about performance, memory, or size, one of the
    first things you do is throw away boundary white space, ideally before
    the document is generated but if not when the document is parsed. XM
    makes it very easy to use a NodeFactory that never builds these nodes in
    the first place.
  • No.2 | | 1632 bytes | |

    The second new optimization I see here is avoiding the creation of
    intermediate string buffers for most calls to characters. Your
    measurements match my expectations that most of the time there'll be
    exactly one call to characters() per call to flushText(). The exceptions
    tend to occur on a few corners:

    1. Documents with very large (16K+) runs of plain, unmarked up text
    2. A few parsers that call characters separately for entity and/or
    character references
    3. particularly brain-damaged older version of Xerces that actually
    invoked the characters() method once for each and every character in
    plain text that was loaded from entity references
    4. A few parsers may call characters separately when a CDATA section
    interrupts non-CDATA section text

    #1 is really worth considering when optimizing, and even that's
    uncommon.

    The basic idea looks good. Given that more than one call to characters
    per text, I wonder if we can simplify it somewhat. Suppose we ignored
    StringBuffers completely? In other words, if characters was called a
    second or third time we just append directly to the string? This would
    leave the code about as simple as it is now, maybe simpler; and still
    take the same path as your patch for the 99% case where characters() is
    only called once. The hit wouldn't even be too bad when characters() was
    called more than once, because it almost never would be called more than
    two or three times. (Aside from the very weird #3 where performance
    would be abysmal, but honestly that really requires an updated parser.)
  • No.3 | | 773 bytes | |

    Wolfgang Hoschek wrote:

    private static final String[] CRLF_SP;
    private static final String[] LF_SP;
    private static final String[] SPACES;

    private static final String[] CRLF_TAB;
    private static final String[] LF_TAB;
    private static final String[] TABS;

    Do you really need the carriage returns here? The XML parser is going to
    normalize all carriage returns and carriage return/linefeed pairs in the
    input document to linefeeds when parsing. A carriage return is only gong
    to appear if the user has used a numeric character reference to
    specially insert a carriage return. That does happen but it's quite
    uncommon. I suspect you'd be safe (and the code simpler) if it only
    recognized raw linefeeds as a line break,
  • No.4 | | 1351 bytes | |

    >
    I do suspect you're overestimating the prevalence of whitespace
    only text nodes. Pretty printing is great for humans but in any
    sort of XML situation where you care about performance, memory, or
    size, one of the first things you do is throw away boundary white
    space, ideally before the document is generated but if not when the
    document is parsed. XM makes it very easy to use a NodeFactory
    that never builds these nodes in the first place.

    Nice theory. But in practice, the performance impact of using a
    NodeFactory is so large that it is way faster to build the tree
    without a NodeFactory (that is, with NonVerifyingHandler) and then
    strip away thereafter, if so desired. I forgot the exact numbers, but
    they were not pretty.

    As note in private some time ago, the current xom-1.1-CVS has a very
    different profile depending on whether a tree is build from 1) a non
    verifying Builder, or 2) a Builder with a custom NodeFactory, or 3)
    via the normal node constructors, or 4) the node copy constructors.
    The QName LRU patch would make XM performance behaviour much more
    uniform. It would become faster for (2) and (3), have certainly
    negligible overhead for (1) and no overhead for (4).

    XM-interest mailing list
    XM-interest (AT) lists (DOT) ibiblio.org
  • No.5 | | 1828 bytes | |

    Jul 22, 2005, at 4:42 AM, Elliotte Harold wrote:

    >private static final String[] TABS;
    >>

    >
    >

    Do you really need the carriage returns here? The XML parser is
    going to normalize all carriage returns and carriage return/
    linefeed pairs in the input document to linefeeds when parsing. A
    carriage return is only gong to appear if the user has used a
    numeric character reference to specially insert a carriage return.
    That does happen but it's quite uncommon. I

    It doesn't hurt to play it safe. Parser behaviour may unexpectedly
    change.

    The basic idea looks good. Given that more than one call to
    characters per text, I wonder if we can simplify it somewhat.
    Suppose we ignored StringBuffers completely? In other words, if
    characters was called a second or third time we just append
    directly to the string? This would leave the code about as simple
    as it is now, maybe simpler; and still take the same path as your
    patch for the 99% case where characters() is only called once. The
    hit wouldn't even be too bad when characters() was called more than
    once, because it almost never would be called more than two or
    three times. (Aside from the very weird #3 where performance would
    be abysmal, but honestly that really requires an updated parser.)

    Again, the idea is to play it safe, ensuring a performance desaster
    can never happen. With the variant you checked into CVS today a
    String0 + + StringN concatenation desaster would be possible. The
    algorithm is really not that difficult, and there's a reason it is
    formulated the way it is.

    Wolfgang.

    XM-interest mailing list
    XM-interest (AT) lists (DOT) ibiblio.org
  • No.6 | | 3770 bytes | |

    The
    algorithm is really not that difficult, and there's a reason it is
    formulated the way it is.

    Wolfgang.

    A little cosmetic modularization yields:

    public void characters(char[] text, int start, int length) {
    /*
    * Based on measurements of a wide variety of real-world
    documents, one
    * call to characters() per flushText() is extremely common,
    a second
    * one quite uncommon, and more calls to it are extremely
    uncommon.
    * Hence, unnecessary temporary String and StringBuffer
    memory copies
    * and allocations can be avoided, improving performance,
    *irrespective* whether
    * the "fat" stringified Text.java is used, or the Text.java
    version with
    * the UTF-8 storage trick.
    * See endDocument() on how to measure character() per
    flushText() frequency, uncommenting
    * some diagnostic statistics gathering code.
    *
    * Plus, whitepace separator lines make up a large fraction
    of data in
    * heavily pretty printed documents. So, let's share most of
    those
    * string objects via an extremely low-overhead interning
    mechanism.
    *
    * All in all this should make average memory consumption of
    a XM
    * document with the stringified Text patch comparable or
    lower than with the
    * Text.java UTF-8 trick. Plus, it retains the performance
    benefits of
    * the stringified Text patch, for use cases reading or
    writing Text a
    * lot, for example many XPath/XQuery use cases.
    */
    // invariants throughout this class:
    // only one of (first, buffer) has content, i.e. at any time
    there holds:
    // first != null implies buffer == null
    // buffer != null implies first == null
    // if there is content, it is non-empty, i.e at any time
    there holds:
    // first != null implies first.length() 0
    // buffer != null implies buffer.length() 0

    if (length <= 0) return; // nothing to do (important: do not
    remove; ensures correct algo semantics)

    if (first == null && buffer == null) { // first call to
    characters()
    first = firstCharacters(text, start, length);
    // first = new String(text, start, length);
    } else {
    if (buffer == null) { // second call to characters();
    infrequently called
    buffer = new StringBuffer(first.length() +
    length); // doesn't waste memory
    buffer.append(first);
    first = null;
    }
    // second or higher call to characters(); third or
    higher call is very rare
    buffer.append(text, start, length);
    }
    if (finishedCDATA) inCDATA = false;
    // callCounter++;
    }

    private static String firstCharacters(char[] text, int start,
    int length) {
    if (length < SPACES.length) { // any possibility for
    interning whitespace-only?
    // scan backwards to beginning of run of blanks or tabs
    int i = start + length;
    char c = text[];
    if (c != ' ' && c != '\t') {
    c = ' ';
    } else {
    i++; while ( >= start && text[i] == c) ;
    }

    switch (i+1-start) { // is there internable whitespace-
    only?
    case 0: {
    return c == ' ' ? SPACES[length] : TABS[length];
    }
    case 1: {
    if (text[start] == '\n')
    return c == ' ' ? LF_SP[length-1] : LF_TAB
    [length-1];
    break;
    }
    case 2: {
    if (text[start] == '\r' && text[start+1] == '\n')
    return c == ' ' ? CRLF_SP[length-2] :
    CRLF_TAB[length-2];
    break;
    }
    default : // fall through to non-interned string copy
    }
    }
    // fallback to non-interned string copy
    return new String(text, start, length);
    }

    XM-interest mailing list
    XM-interest (AT) lists (DOT) ibiblio.org
  • No.7 | | 4352 bytes | |

    And since the whitespace interning isn't really applicable to the
    Text version with UTF8 storage, it can be skipped via

    private static String firstCharacters(char[] text, int start,
    int length) {
    if (Text.IS_FAT && length < SPACES.length) {
    int i = start + length;

    }
    return new String(text, start, length);
    }

    Text.java:
    static final boolean IS_FAT = true/false depending on the
    version

    Wolfgang.

    Jul 22, 2005, at 3:16 PM, Wolfgang Hoschek wrote:

    The
    >
    >algorithm is really not that difficult, and there's a reason it is
    >formulated the way it is.
    >>

    >Wolfgang.
    >>

    >

    A little cosmetic modularization yields:

    public void characters(char[] text, int start, int length) {
    /*
    * Based on measurements of a wide variety of real-world
    documents, one
    * call to characters() per flushText() is extremely
    common, a second
    * one quite uncommon, and more calls to it are extremely
    uncommon.
    * Hence, unnecessary temporary String and StringBuffer
    memory copies
    * and allocations can be avoided, improving performance,
    *irrespective* whether
    * the "fat" stringified Text.java is used, or the
    Text.java version with
    * the UTF-8 storage trick.
    * See endDocument() on how to measure character() per
    flushText() frequency, uncommenting
    * some diagnostic statistics gathering code.
    *
    * Plus, whitepace separator lines make up a large fraction
    of data in
    * heavily pretty printed documents. So, let's share most
    of those
    * string objects via an extremely low-overhead interning
    mechanism.
    *
    * All in all this should make average memory consumption
    of a XM
    * document with the stringified Text patch comparable or
    lower than with the
    * Text.java UTF-8 trick. Plus, it retains the performance
    benefits of
    * the stringified Text patch, for use cases reading or
    writing Text a
    * lot, for example many XPath/XQuery use cases.
    */
    // invariants throughout this class:
    // only one of (first, buffer) has content, i.e. at any
    time there holds:
    // first != null implies buffer == null
    // buffer != null implies first == null
    // if there is content, it is non-empty, i.e at any time
    there holds:
    // first != null implies first.length() 0
    // buffer != null implies buffer.length() 0

    if (length <= 0) return; // nothing to do (important: do
    not remove; ensures correct algo semantics)

    if (first == null && buffer == null) { // first call to
    characters()
    first = firstCharacters(text, start, length);
    // first = new String(text, start, length);
    } else {
    if (buffer == null) { // second call to characters();
    infrequently called
    buffer = new StringBuffer(first.length() +
    length); // doesn't waste memory
    buffer.append(first);
    first = null;
    }
    // second or higher call to characters(); third or
    higher call is very rare
    buffer.append(text, start, length);
    }
    if (finishedCDATA) inCDATA = false;
    // callCounter++;
    }
    --
    private static String firstCharacters(char[] text, int start,
    int length) {
    if (length < SPACES.length) { // any possibility for
    interning whitespace-only?
    // scan backwards to beginning of run of blanks or tabs
    int i = start + length;
    char c = text[];
    if (c != ' ' && c != '\t') {
    c = ' ';
    } else {
    i++; while ( >= start && text[i] == c) ;
    }

    switch (i+1-start) { // is there internable whitespace-
    only?
    case 0: {
    return c == ' ' ? SPACES[length] : TABS[length];
    }
    case 1: {
    if (text[start] == '\n')
    return c == ' ' ? LF_SP[length-1] : LF_TAB
    [length-1];
    break;
    }
    case 2: {
    if (text[start] == '\r' && text[start+1] == '\n')
    return c == ' ' ? CRLF_SP[length-2] :
    CRLF_TAB[length-2];
    break;
    }
    default : // fall through to non-interned string copy
    }
    }
    // fallback to non-interned string copy
    return new String(text, start, length);
    }

    XM-interest mailing list
    XM-interest (AT) lists (DOT) ibiblio.org
  • No.8 | | 816 bytes | |

    Wolfgang Hoschek wrote:

    >Do you really need the carriage returns here? The XML parser is going
    >to normalize all carriage returns and carriage return/ linefeed pairs
    >in the input document to linefeeds when parsing. A carriage return is
    >only gong to appear if the user has used a numeric character
    >reference to specially insert a carriage return. That does happen but
    >it's quite uncommon. I


    It doesn't hurt to play it safe. Parser behaviour may unexpectedly change.

    In this case the parser behavior is not going to change unless the XML
    spec changes, which doesn't seem very likely. Converting carriage
    returns to line feeds is a written into the spec. A parser that failed
    to do this would be non-conformant.

Re: More parsing performance patches


max 4000 letters.
Your nickname that display:
In order to stop the spam: 2 + 1 =
QUESTION ON "Java"

EMSDN.COM