Apache

  • VInt's as prefix. Was: bytecount as prefix


    Hi,
    I'm the author of CLucene (a C++ port of Lucene). I've been following
    the 'using byte count as prefix' discussion and I think this
    discussion sort of ties into something we are trying to achieve.
    We are trying to optimise the way the index writing works, and we also
    want to be able to index & store fields which are using a Reader
    object.
    The second part is in theory a very easy solution, we can use a
    streamfilter to buffer the reads that the analyser makes, and
    integrate the FieldsWriter into the invertDocument function so that
    the buffers are written while the analysers are run. Since there is no
    way of knowing the length of the reader, we would then have to go back
    and write the field length. Here is where the problem is, though: this
    is not possible currently because we are using a VInt for the field
    data length.
    If we can use non variable length integers for the field data length
    it makes it much easier for two things:
    1) memory optimisations like the compressed field can benefit from
    this: we don't have to store the entire compressed output in memory,
    but can rather write it directly to the fields output stream.
    2) it makes it possible to store AND index a field using a reader in a
    single pass, thus removing the need to read twice (which might not
    always be possible for some reader implementations).
    The second feature is very important for us!
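To illustrate the back-patching idea described above, here is a minimal Java sketch. It assumes a seekable output (java.io.RandomAccessFile stands in for the fields stream) and a fixed 4-byte length prefix. The class and method names are hypothetical and are not taken from Lucene or CLucene.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BackPatchedLengthDemo {
    /** Streams `in` to `out` behind a 4-byte length prefix that is filled in afterwards. */
    static int writeField(RandomAccessFile out, InputStream in) throws IOException {
        long lengthPos = out.getFilePointer();
        out.writeInt(0);                       // placeholder: length not known yet
        byte[] buf = new byte[8192];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {     // single pass over the reader
            out.write(buf, 0, n);
            total += n;
        }
        long endPos = out.getFilePointer();
        out.seek(lengthPos);                   // go back and patch the real length
        out.writeInt(total);
        out.seek(endPos);                      // resume writing after the field data
        return total;
    }

    /** Reads one length-prefixed field written by writeField. */
    static byte[] readField(RandomAccessFile in) throws IOException {
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        return data;
    }

    /** Writes two fields to a temp file, reads them back, and joins them with '|'. */
    static String roundTrip(String first, String second) {
        try {
            File tmp = File.createTempFile("fields", ".bin");
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                writeField(raf, new ByteArrayInputStream(first.getBytes(StandardCharsets.UTF_8)));
                writeField(raf, new ByteArrayInputStream(second.getBytes(StandardCharsets.UTF_8)));
                raf.seek(0);
                String a = new String(readField(raf), StandardCharsets.UTF_8);
                String b = new String(readField(raf), StandardCharsets.UTF_8);
                return a + "|" + b;
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("title", "body text read from a stream"));
    }
}
```

The patch step works only because the prefix occupies the same number of bytes whatever the final length turns out to be, which is the property a plain VInt lacks.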
    So I would like to propose a discussion on how this could be achieved:
    My idea is to set a bit in the config like FIELD_DNT_USE_VINT. I don't
    think using a static (fixed-width) Int for every field is necessary; these
    few extra (unnecessary) bytes for each field would add up to a lot. A
    static Int would be used only when strictly necessary, and the
    implementation could decide when to use it.
    These are the rough changes that I think would need to be made:
    final Document doc(int n) throws IOException {
      byte bits = fieldsStream.readByte();
      boolean dontUseVint = (bits & FieldsWriter.FIELD_DNT_USE_VINT) != 0;

      << Binary fields (compressed or plain binary) are an easy change >>
      if ((bits & FieldsWriter.FIELD_IS_BINARY) != 0) {
        final byte[] b = new byte[dontUseVint ?
            fieldsStream.readInt() :
            fieldsStream.readVInt()]; << CHANGE HERE
      if (compressed) {
        final byte[] b = new byte[dontUseVint ?
            fieldsStream.readInt() :
            fieldsStream.readVInt()]; << CHANGE HERE

      << Reading a field value as a string >>
      String value;
      if (dontUseVint) {
        << I'm not completely sure about this section,
           since changes relating to 'bytecount as prefix' would affect this >>
        int length = fieldsStream.readInt();
        char[] chars = new char[length];
        fieldsStream.readChars(chars, 0, length);
        value = new String(chars, 0, length);
      } else
        value = fieldsStream.readString();

      Field f = new Field(fi.name, // name
          value, // read value << CHANGE HERE - use different string length
          store,
          index,
          termVector);
    Now is probably the best time to implement something like this before
    lucene 2.0 is released. I think it wouldn't be a complicated change;
    for now, we don't need to make any changes to the FieldsWriter
    (optimisations using this can be done later).
    ben
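The flag-bit proposal above can be sketched as a self-contained Java example. This is only an illustration under stated assumptions: the bit value chosen for FIELD_DNT_USE_VINT is hypothetical, java.io data streams stand in for Lucene's own stream classes, and the VInt helpers use the low-byte-first layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class FlaggedLengthDemo {
    // Hypothetical bit value, chosen only for this sketch.
    static final int FIELD_DNT_USE_VINT = 0x08;

    /** Low-byte-first variable-length int: 7 data bits per byte, high bit = "more follows". */
    static void writeVInt(DataOutputStream out, int i) throws IOException {
        while ((i & ~0x7F) != 0) {
            out.writeByte((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.writeByte(i);
    }

    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    /** Serializes flag byte + length (fixed 4 bytes or VInt) + payload. */
    static byte[] encode(byte[] value, boolean fixedLength) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(fixedLength ? FIELD_DNT_USE_VINT : 0);
            if (fixedLength) {
                out.writeInt(value.length);
            } else {
                writeVInt(out, value.length);
            }
            out.write(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the flag byte first, then picks the length decoding it selects. */
    static byte[] decode(byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            byte bits = in.readByte();
            boolean dontUseVint = (bits & FIELD_DNT_USE_VINT) != 0;
            int length = dontUseVint ? in.readInt() : readVInt(in);
            byte[] value = new byte[length];
            in.readFully(value);
            return value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = {10, 20, 30};
        System.out.println(encode(payload, true).length + " bytes with a fixed-width length");
        System.out.println(encode(payload, false).length + " bytes with a VInt length");
    }
}
```

Fields written without the flag cost nothing extra, so only the fields that need back-patching pay for the wider prefix.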
    On 5/7/06, Marvin Humphrey <marvin (AT) rectangular (DOT) com> wrote:
    Got it.
    This was the problem, in TermInfosWriter.writeTerm():
    - lastTerm = term;
    + lastBytes = bytes;
    }
    Without lastTerm being updated, the auxiliary term dictionary got
    screwed up. This problem only manifested on large tests because small
    tests never moved past the first entry, which is always a field number
    of -1 and an empty string.
    I'll post a full working patch to JIRA as soon as I'm at a location
    where I can connect my laptop to the net.
    Marvin Humphrey
    Rectangular Research
    http://www.rectangular.com/
    To unsubscribe, e-mail: java-dev-unsubscribe (AT) lucene (DOT) apache.org
    For additional commands, e-mail: java-dev-help (AT) lucene (DOT) apache.org
  • No.1 | | 656 bytes | |

    Hi Ben,

    I have now tried to compile the contributions,
    but the build fails in the highlighter,
    specifically in the Encoder.

    I am using svn revision 2063.

    I tried to use jstreams, but I don't know how.
    Could you give me an example of adding a field with content to a document?

    string Content = "something in utf-8 text";
    Document *doc= _CLNEW Document;
    doc->add( *Field::Text("Content", Content.c_str(), true ) );

    thx
    bye
    thomas

  • No.2 | | 1127 bytes | |

    On May 11, 2006, at 3:24 AM, Ben van Klinken wrote:

    Here is where the problem is, though: this
    is not possible currently because we are using a VInt for the field
    data length.

    What we really need is the ability to add "leading zeroes" to a VInt.

    I believe that this is possible if we change the definition of VInt
    so that the high bytes are written first, rather than the low bytes.
    The "BER compressed integer", used by Perl's pack() function, is
    defined this way. A proof-of-concept Perl script is below.

    Marvin Humphrey
    Rectangular Research
    http://www.rectangular.com/

    #
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $pad = pack( 'C', 128 ); # "leading zero": 1000 0000
    my $serialized = pack( 'wwawaaw', 127, 128, $pad, 129, $pad, $pad, 154 );

    my @numbers = unpack( 'w*', $serialized );
    print "@numbers\n"; # prints "127 128 129 154"
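For readers following along in Java, the same "leading zeroes" idea can be sketched as below. This is an illustration only: the class and method names are hypothetical, and the layout is the high-byte-first (BER-style) encoding that the Perl script demonstrates, not the VInt format Lucene currently writes.

```java
import java.io.ByteArrayOutputStream;

class PaddedVIntDemo {
    /**
     * Encodes a non-negative int high-byte-first, 7 data bits per byte, with the
     * high bit set on every byte except the last. The result is left-padded with
     * 0x80 bytes ("leading zeroes") until it is at least minBytes long.
     */
    static byte[] encode(int value, int minBytes) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value");
        }
        int groups = 1;                                   // number of 7-bit groups needed
        for (int v = value >>> 7; v != 0; v >>>= 7) {
            groups++;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = groups; i < minBytes; i++) {
            out.write(0x80);                              // zero payload, "more follows"
        }
        for (int g = groups - 1; g > 0; g--) {
            out.write(0x80 | ((value >>> (7 * g)) & 0x7F));
        }
        out.write(value & 0x7F);                          // final byte: high bit clear
        return out.toByteArray();
    }

    /** Decodes one value; padding bytes contribute only zero bits. */
    static int decode(byte[] bytes) {
        int i = 0;
        int pos = 0;
        byte b;
        do {
            b = bytes[pos++];
            i = (i << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);
        return i;
    }

    public static void main(String[] args) {
        byte[] padded = encode(154, 4);
        System.out.println(padded.length + " bytes decode to " + decode(padded));
    }
}
```

A writer could reserve, say, five bytes with encode(0, 5), stream the field data, and later overwrite those five bytes with encode(actualLength, 5), while a reader would need no special case for the padded form.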

  • No.3 | | 848 bytes | |

    On 5/11/06, Marvin Humphrey <marvin (AT) rectangular (DOT) com> wrote:
    I believe that this is possible if we change the definition of VInt
    so that the high bytes are written first, rather than the low bytes.
    The "BER compressed integer"

    Great idea Marvin! The decoding could be slightly faster with
    reverse-byte order since you don't have to maintain a shift-count:

    public int readVInt() throws IOException {
      byte b = readByte();
      int i = b & 0x7F;
      while ((b & 0x80) != 0) {
        b = readByte();
        i = (i << 7) | (b & 0x7F);
      }
      return i;
    }

    Of course there is that *little* detail of backward compatibility ;-)
    -Yonik
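To make the shift-count remark concrete, the sketch below places a low-byte-first decoder next to a high-byte-first one. It is an illustration, not code from Lucene: the class and method names are hypothetical, and both methods read from a byte array rather than an IndexInput.

```java
class VIntDecodeComparison {
    /** Low-byte-first layout: each later byte lands 7 bits higher, so a shift count is tracked. */
    static int readLowFirst(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    /** High-byte-first layout: the accumulator is shifted left by a constant 7 each step. */
    static int readHighFirst(byte[] buf) {
        int pos = 0;
        byte b = buf[pos++];
        int i = b & 0x7F;
        while ((b & 0x80) != 0) {
            b = buf[pos++];
            i = (i << 7) | (b & 0x7F);
        }
        return i;
    }

    public static void main(String[] args) {
        // 300 = (2 << 7) | 44, so the two layouts store the same groups in opposite order.
        byte[] lowFirst = {(byte) 0xAC, 0x02};
        byte[] highFirst = {(byte) 0x82, 0x2C};
        System.out.println(readLowFirst(lowFirst) + " " + readHighFirst(highFirst));
    }
}
```

Whether dropping the shift counter is measurably faster would have to be benchmarked; the sketch shows only the structural difference between the two loops.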

  • No.4 | | 866 bytes | |

    On May 11, 2006, at 8:02 AM, Yonik Seeley wrote:

    Of course there is that *little* detail of backward compatibility ;-)

    There is that. :)

    Between using bytecounts as String prefixes, transitioning from
    modified UTF-8 to standard UTF-8, and potentially changing the
    definition of VInt, there are a lot of backwards-incompatible changes
    looming for the I/O classes.

    Maybe we should consider loading differing subclasses of IndexInput/
    IndexOutput based on the detected file format version? If this were
    C, I'd use function pointers. What's the best way to approximate
    that in Java?

    Marvin Humphrey
    Rectangular Research
    http://www.rectangular.com/

  • No.5 | | 1142 bytes | |

    On 5/11/06, Marvin Humphrey <marvin (AT) rectangular (DOT) com> wrote:
    Maybe we should consider loading differing subclasses of IndexInput/
    IndexOutput based on the detected file format version? If this were
    C, I'd use function pointers. What's the best way to approximate
    that in Java?

    Nothing but subclassing.

    There are already different subclasses of IndexInput and IndexOutput.
    The problem is, there are already 7 implementations of IndexInput, so
    one would need to create 7 more implementations with a different
    readVInt(), for example.

    You could perhaps decouple and factor out part of the functionality
    into a VIntReader and VIntWriter, for example, but readVInt() is
    called *so* often, I'd be pretty afraid of the performance
    implications. Java 1.5 HotSpot might be able to handle it, but then
    there are people who need to use -client, people stuck on Java 1.4,
    etc.
    -Yonik
    Solr, the open-source Lucene search server
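The factoring-out idea in the message above might look roughly like the following Java sketch. Everything here is hypothetical: the VIntReader interface, the two implementations, and the version numbers are invented for illustration and do not correspond to actual Lucene classes or format versions.

```java
class VersionedVIntDemo {
    /** Hypothetical strategy interface; plays the role a function pointer would in C. */
    interface VIntReader {
        int readVInt(byte[] buf, int offset);
    }

    /** Decoder for the existing low-byte-first layout. */
    static final class LowByteFirstReader implements VIntReader {
        public int readVInt(byte[] buf, int offset) {
            byte b = buf[offset++];
            int i = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = buf[offset++];
                i |= (b & 0x7F) << shift;
            }
            return i;
        }
    }

    /** Decoder for the proposed high-byte-first layout. */
    static final class HighByteFirstReader implements VIntReader {
        public int readVInt(byte[] buf, int offset) {
            byte b = buf[offset++];
            int i = b & 0x7F;
            while ((b & 0x80) != 0) {
                b = buf[offset++];
                i = (i << 7) | (b & 0x7F);
            }
            return i;
        }
    }

    /** Chooses a decoder once, when the file is opened; the version cutoff is made up. */
    static VIntReader forFormatVersion(int version) {
        return version >= 2 ? new HighByteFirstReader() : new LowByteFirstReader();
    }

    public static void main(String[] args) {
        VIntReader oldFormat = forFormatVersion(1);
        VIntReader newFormat = forFormatVersion(2);
        System.out.println(oldFormat.readVInt(new byte[]{(byte) 0xAC, 0x02}, 0));
        System.out.println(newFormat.readVInt(new byte[]{(byte) 0x82, 0x2C}, 0));
    }
}
```

The design keeps one set of IndexInput implementations, at the price of an interface call on every readVInt(). That per-call indirection is the cost the message above is worried about, and whether a given JVM inlines it is not something this sketch can settle.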

  • No.6 | | 1046 bytes | |

    OK, I haven't been following the 2.0 thing very well :)
    But we at CLucene are trying to get this stream thing going, so we would
    like to do something which will be compatible with Java Lucene.

    So if there's something I can do with the reference version so that what
    we are doing isn't incompatible, it would be a great help for us.

    ben

    On 5/11/06, Doug Cutting <cutting (AT) apache (DOT) org> wrote:
    Ben van Klinken wrote:
    What's the chance of this making it into Lucene 2.0? Let me know if
    there's anything i can do to get this into Lucene 2.

    Lucene 2.0 is all but out the door. We're talking about Lucene 2.x or
    Lucene 3 here.

    Doug

  • No.7 | | 442 bytes | |

    Ben van Klinken wrote:
    What's the chance of this making it into Lucene 2.0? Let me know if
    there's anything i can do to get this into Lucene 2.

    Lucene 2.0 is all but out the door. We're talking about Lucene 2.x or
    Lucene 3 here.

    Doug

  • No.8 | | 590 bytes | |

    What we really need is the ability to add "leading zeroes" to a VInt.
    I really like this idea! A VInt can then be written with a static length.

    Then in CLucene we can implement our stream optimisations without any
    changes to the code logic.

    What's the chance of this making it into Lucene 2.0? Let me know if
    there's anything I can do to get this into Lucene 2.

    cheers

    ben

