I have posted test results for paq8hp4 by Alexander Ratushnyak on
enwik8 for the Hutter prize (enwik9 to follow). This improves by
1.000024% over paq8hp3, just over the 1% threshold required for the
prize (500 euros per 1% improvement). The prize is structured so that
a series of 1% improvements pays a little more than one large
improvement.
http://cs.fit.edu/~mmahoney/compression/text.html#1400
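
To see why a series of small improvements pays a little more than one large one, here is a rough sketch of the arithmetic. It assumes the payout is linear in the improvement relative to the previous record, at 500 euros per 1%. The starting size is a made-up number, not a test result, and the function name is only for illustration.

```python
# Assumption: payout is linear in the relative improvement over the
# previous record, at 500 euros per 1% (the rate quoted above).
EUROS_PER_PERCENT = 500.0

def payout(prev_size, new_size):
    """Euros paid for shrinking the record from prev_size to new_size."""
    percent = 100.0 * (prev_size - new_size) / prev_size
    return EUROS_PER_PERCENT * percent

start = 1_000_000.0      # hypothetical record size, not a real result
step1 = start * 0.99     # first 1% improvement
step2 = step1 * 0.99     # second 1% improvement, measured against step1

# Two separate 1% claims versus one claim covering the same total gain.
two_steps = payout(start, step1) + payout(step1, step2)
one_jump = payout(start, step2)
print(round(two_steps, 2), round(one_jump, 2))
```

Under that assumption two 1% steps earn about 1000 euros, while the same total reduction claimed at once is a 1.99% improvement and earns about 995 euros, because the second step is measured against an already smaller baseline.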
There is a 30-day public comment period on the Hutter Prize group, where
you can view or post comments. An important issue is that the program is
derived from GPL-licensed code but the source code is not yet published.
I have been experimenting with paq8hp4. Like earlier versions, you can
use it as a preprocessor to other compressors by compressing with
option -0. The program also outputs its dictionary to a temporary
file. The preprocessing consists of encoding capital letters with
lower case plus special symbols, and encoding words in the dictionary
with 1-3 byte symbols (big-endian). Details here:
http://cs.fit.edu/~mmahoney/compression/text.html#1400
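
The two preprocessing steps can be sketched roughly as follows. This is only an illustration of the idea, not the actual paq8hp4 code: the flag bytes, the byte ranges used for the codes, and the names word_code and encode_word are all my assumptions. The sketch assigns 1 byte codes to the first 80 dictionary entries, 2 byte codes to the next 2560, and 3 byte codes to the next 40960, most significant byte first.

```python
# Hypothetical byte values; the real paq8hp4 symbol assignments may differ.
CAP_FIRST = b"\x01"            # next word starts with a capital letter
CAP_ALL = b"\x02"              # next word is entirely upper case
N1, N2, N3 = 80, 2560, 40960   # words that get 1, 2 and 3 byte codes

def word_code(index):
    """Map a dictionary index to a 1-3 byte code, most significant byte first."""
    if index < N1:
        return bytes([0x80 + index])
    index -= N1
    if index < N2:
        return bytes([0xD0 + index // 80, 0x80 + index % 80])
    index -= N2
    if index < N3:
        return bytes([0xF0 + index // 2560,
                      0xD0 + (index // 80) % 32,
                      0x80 + index % 80])
    raise ValueError("index outside dictionary")

def encode_word(word, dictionary):
    """Encode one word as an optional case flag plus a code or raw bytes."""
    if len(word) > 1 and word.isupper():
        flag, lower = CAP_ALL, word.lower()
    elif word[:1].isupper() and word[1:] == word[1:].lower():
        flag, lower = CAP_FIRST, word.lower()
    else:
        flag, lower = b"", word    # lower case or mixed case: no flag
    if lower in dictionary:
        return flag + word_code(dictionary[lower])
    # Not in the dictionary: emit the letters as they are. A real scheme
    # would also have to escape raw bytes that collide with code bytes.
    return flag + lower.encode("utf-8")

# Toy dictionary mapping word -> index; the indexes are made up.
dictionary = {"the": 0, "first": 80, "compression": 2640}
print(encode_word("The", dictionary))
print(encode_word("first", dictionary))
print(encode_word("compression", dictionary))
```

In this layout the last byte of every code falls in a range that no leading byte uses, so a decoder can tell where each code ends without a length field.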
As an experiment I preprocessed enwik8 with each of the versions
paq8hp1 through paq8hp4 and compressed the output with ppmonstr. Each
successive version improves compression, even though the four
dictionaries are identical except for the order of the words. In the
early versions, the 80 high, 2560 medium, and 40960 low frequency words
are sorted alphabetically, except that the high frequency words are
grouped by syntactic role (articles, prepositions, pronouns, etc). In
paq8hp3, the medium frequency words are grouped semantically ("first"
and "second", "atlantic" and "pacific", etc), and the low frequency
words are sorted by their reversed spelling (right to left) to bring common
suffixes together (-s, -ed, -ing, etc). In paq8hp4 there is further
semantic grouping of the low frequency words.
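
The suffix ordering used for the low frequency words in paq8hp3 can be illustrated with a short sketch. The word list here is made up for the example and is not taken from the real dictionary.

```python
# Sort words by their spelling read from right to left, so that words
# sharing a suffix (-ed, -ing, -s) end up next to each other.
words = ["walked", "cats", "running", "talked", "dogs", "jumping"]
by_suffix = sorted(words, key=lambda w: w[::-1])
print(by_suffix)
# ['talked', 'walked', 'running', 'jumping', 'dogs', 'cats']
```

Words with the same ending receive neighboring dictionary indexes and therefore similar codes, which presumably gives the compressor that follows more context to work with.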
-- Matt Mahoney