Compression

  • Hutter Prize submission: paq8hp4


    I have posted test results for paq8hp4 by Alexander Ratushnyak on
    enwik8 for the Hutter prize (enwik9 to follow). This improves by
    1.000024% over paq8hp3, just over the 1% threshold required for the
    prize (500 euros per 1% improvement). The prize is structured so that
    a series of 1% improvements pays a little more than one large
    improvement.
    http://cs.fit.edu/~mmahoney/compression/text.html#1400
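    The claim that a series of 1% improvements pays a little more than
    one large improvement can be checked with a few lines of arithmetic.
    The sketch below assumes the payout is exactly linear (500 euros per
    1% improvement over the previous record) and ignores any minimum
    threshold or rounding; the function names are made up for
    illustration and the official rules may differ in detail.

```python
EUROS_PER_PERCENT = 500.0  # figure quoted in the post


def payout(old_size, new_size):
    """Payout under the ASSUMPTION that it is linear in the percentage
    improvement relative to the previous record."""
    improvement_percent = 100.0 * (old_size - new_size) / old_size
    return EUROS_PER_PERCENT * improvement_percent


def two_steps_vs_one(start_size, step_percent=1.0):
    """Compare two successive improvements of step_percent each with a
    single submission that reaches the same final size."""
    factor = 1.0 - step_percent / 100.0
    mid_size = start_size * factor
    final_size = mid_size * factor
    stepwise = payout(start_size, mid_size) + payout(mid_size, final_size)
    single = payout(start_size, final_size)
    return stepwise, single


if __name__ == "__main__":
    stepwise, single = two_steps_vs_one(100_000_000.0)
    print(f"two 1% steps: {stepwise:.2f} euros, one step: {single:.2f} euros")
```

    Under this assumption, two 1% steps shrink the file to 0.99 * 0.99 =
    0.9801 of its original size, which counts as only 1.99% when claimed
    in one submission, because the second step is measured against an
    already smaller record.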
    There is a 30-day public comment period on the Hutter Prize group,
    where you can view or post comments. An important issue is that the
    program is derived from GPL-licensed code, but the source code has
    not yet been published.
    I have been experimenting with paq8hp4. As with earlier versions, you can
    use it as a preprocessor to other compressors by compressing with
    option -0. The program also outputs its dictionary to a temporary
    file. The preprocessing consists of encoding capital letters with
    lower case plus special symbols, and encoding words in the dictionary
    with 1-3 byte symbols (big-endian). Details here:
    http://cs.fit.edu/~mmahoney/compression/text.html#1400
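    To make the preprocessing step concrete, here is a rough sketch of a
    transform of this kind. It is not paq8hp4's actual code: the flag
    bytes, the byte ranges used for the 1-3 byte codes, and the function
    names are all assumptions chosen for illustration. The only details
    taken from the post are that capitals become lower case plus a
    special symbol, and that dictionary words become 1-3 byte codes with
    the most significant byte first. The tier sizes 80, 2560 (80 * 32),
    and 40960 (80 * 32 * 16) are the dictionary word counts given later
    in the post.

```python
HIGH, MEDIUM, LOW = 80, 2560, 40960  # word counts per frequency tier

# Hypothetical flag bytes marking capitalization (not paq8hp4's real values).
CAP_FIRST = b"\x01"
CAP_ALL = b"\x02"


def word_code(index):
    """Map a dictionary index to a 1-3 byte code, most significant byte first.

    Assumed layout: the last byte is in 0x80..0xCF (80 values), a middle
    byte in 0xD0..0xEF (32 values), a leading byte in 0xF0..0xFF (16 values).
    """
    if index < HIGH:
        return bytes([0x80 + index])
    index -= HIGH
    if index < MEDIUM:
        hi, lo = divmod(index, 80)
        return bytes([0xD0 + hi, 0x80 + lo])
    index -= MEDIUM
    if index < LOW:
        top, rest = divmod(index, 32 * 80)
        hi, lo = divmod(rest, 80)
        return bytes([0xF0 + top, 0xD0 + hi, 0x80 + lo])
    raise ValueError("index outside dictionary")


def encode_word(word, dictionary):
    """Encode one word: capitalization flag, then dictionary code or literal.

    This sketch ignores the clash between the codes and non-ASCII text.
    """
    lower = word.lower()
    if word == lower:
        prefix = b""
    elif len(word) > 1 and word.isupper():
        prefix = CAP_ALL
    elif word[0].isupper() and word[1:] == lower[1:]:
        prefix = CAP_FIRST
    else:
        return word.encode("utf-8")  # mixed case: leave untouched
    index = dictionary.get(lower)
    body = word_code(index) if index is not None else lower.encode("utf-8")
    return prefix + body


if __name__ == "__main__":
    toy_dictionary = {"the": 0, "of": 1}  # toy example, not the real word list
    print(encode_word("The", toy_dictionary))
    print(encode_word("of", toy_dictionary))
```

    Frequent words get the shortest codes, and because "The" and "the"
    share one dictionary entry, a downstream compressor sees the same
    symbol in both cases.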
    As an experiment I preprocessed enwik8 with each of the versions
    paq8hp1 through paq8hp4 and compressed the output with ppmonstr. Each
    successive version improves compression, even though the four
    dictionaries are identical except for the order of the words. In the
    early versions, the 80 high, 2560 medium, and 40960 low frequency words
    are sorted alphabetically, except that the high frequency words are
    grouped by syntactic role (articles, prepositions, pronouns, etc). In
    paq8hp3, the medium frequency words are grouped semantically ("first"
    and "second", "atlantic" and "pacific", etc), and the low frequency
    words are sorted from the right (by reversed spelling) to bring common
    suffixes together (-s, -ed, -ing, etc). In paq8hp4 there is further
    semantic grouping of the low frequency words.
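    The suffix ordering described above can be illustrated with a small
    sketch. It assumes that "sorted from the right" means comparing
    words starting at their last letter, which is the same as sorting by
    reversed spelling; the function name and word list are made up for
    illustration.

```python
def sort_by_suffix(words):
    """Sort words by their reversed spelling so shared endings are adjacent."""
    return sorted(words, key=lambda w: w[::-1])


if __name__ == "__main__":
    sample = ["walked", "cats", "running", "jumped", "dogs", "sing"]
    print(sort_by_suffix(sample))
    # ['walked', 'jumped', 'running', 'sing', 'dogs', 'cats']
```

    Words ending in -ed, -ing, and -s land next to each other, so they
    receive neighboring dictionary codes. That may be why reordering
    alone helps a context-modeling compressor such as ppmonstr, even
    when the set of words is unchanged.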
    -- Matt Mahoney
