Networking

  • Improved OCR Plugin with approximate matching


    Hello there,
    I have improved the original plugin (found at
    ) so that it supports fuzzy
    matching. This way, mistakes made by the OCR engine or
    intentional obfuscations in the text no longer make recognition
    impossible. This is done with a relative distance calculation
    between the pattern (a word from a given word list) and a line in the
    recognized input. The plugin also uses dynamic scoring (more matched
    words means a higher score; this can be adjusted in the source).
    You can find a full description and an example in the wiki under:
    Ideas for improvements and criticism are always welcome :)
    Best regards,
    Chris
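
    The matching and scoring described above could be sketched roughly as follows. This is an illustration in Python rather than the plugin's Perl, and it uses a plain Levenshtein distance over sliding windows, which is an assumption and not necessarily the distance calculation the plugin uses. The names word_matches and ocr_score, the 0.25 threshold and the score values are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def word_matches(word: str, line: str, threshold: float = 0.25) -> bool:
    """True if some window of `line` is close enough to `word`.

    "Close enough" means edit distance divided by the word length
    (the relative distance) is at most `threshold` (hypothetical value).
    """
    word, line = word.lower(), line.lower()
    n = len(word)
    if n == 0:
        return False
    # Slide a window of the word's length over the line; a line shorter
    # than the word is compared as a whole.
    windows = [line[i:i + n] for i in range(max(1, len(line) - n + 1))]
    best = min(levenshtein(word, w) for w in windows)
    return best / n <= threshold


def ocr_score(lines, wordlist, base_score=4.0, extra_score=1.0, min_hits=2):
    """Dynamic scoring: more matched words means a higher score."""
    hits = sum(1 for w in wordlist if any(word_matches(w, l) for l in lines))
    if hits < min_hits:
        return 0.0
    return base_score + (hits - min_hits) * extra_score
```

    With this sketch, an OCR'd line such as "hot st0ck alert" still matches the list word "stock", because one wrong character out of five gives a relative distance of 0.2.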
  • No.1 | | 834 bytes | |


    seems to work but i never see a score about 1.00.

    the docs say the default score is 4. did i miss something?
  • No.2 | | 153 bytes | |

    >seems to work but i never see a score about 1.00.
    >the docs say the default score is 4. did i miss something?

    above 1.00 i meant.
  • No.3 | | 1149 bytes | |

    From: "uNiXpSyC" <marco@uNiXpSyC>

    >seems to work but i never see a score about 1.00.
    >
    >the docs say the default score is 4. did i miss something?

    You probably never amended your local.cf or equivalent with the
    score for the rule. So it gets the default score of 1.
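
    For reference, a score override of the kind described above might look like the following in local.cf. This is only a sketch: the rule name FUZZY_OCR is an assumption (check the plugin's sample cf file for the exact name), and 4.0 is simply the default value the docs are said to mention.

```
# local.cf (hypothetical rule name; verify against the plugin's sample cf file)
score FUZZY_OCR 4.0
```

    Running spamassassin --lint afterwards checks that the configuration still parses.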

    {^_^}
  • No.4 | | 1773 bytes | |

    Hello again,

    I only wanted to add a small note: I recently saw gifs that cannot be
    converted using imagemagick because they are either sloppily generated
    or intentionally partly corrupted. Please think about using giftopnm
    and jpegtopnm instead. If you have a better idea, tell me.

    To use giftopnm and jpegtopnm, change the code from:

    if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
        # old: a single ImageMagick call for both formats
        open OCR, "|/usr/bin/convert - pnm:- |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";

    to:

    if (($ctype eq "image/gif") || ($ctype eq "image/jpeg")) {
        # new: pick the netpbm converter that matches the content type
        if ($ctype eq "image/gif") {
            open OCR, "|/usr/bin/giftopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
        } else {
            open OCR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > /tmp/spamassassin.focr.$$";
        }

    Note that with imagemagick, things can get really bad. I experienced a
    highly increased time to convert (about 30 seconds and then an error
    message from imagemagick for a 7kb gif file). So I really advise you
    to change the code to use different tools. These will also complain,
    for example:

    giftopnm: Extraneous data at end of image. Skipped to end of image
    giftopnm: bogus character 0x4f, ignoring
    giftopnm: bogus character 0xa7, ignoring
    giftopnm: bogus character 0xc0, ignoring
    giftopnm: bogus character 0x8a, ignoring
    giftopnm: Unable to read Color 33 from colormap

    But it still continues and the text gets recognized correctly.

    Best regards,

    Chris
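
    The converter selection above can also be pictured as a small lookup plus a time-limited pipeline. The following is a hedged Python sketch, not the plugin's code: ocr_pipeline and run_ocr are hypothetical names, the tool paths are the ones quoted in this message, and the 10-second timeout is an arbitrary value chosen to avoid the kind of 30-second stall described above.

```python
import subprocess

GOCR = "/usr/bin/gocr"

# One netpbm converter per content type, as in the Perl change above.
CONVERTERS = {
    "image/gif": "/usr/bin/giftopnm",
    "image/jpeg": "/usr/bin/jpegtopnm",
}


def ocr_pipeline(ctype):
    """Return the two command lines (converter, then gocr) or None."""
    converter = CONVERTERS.get(ctype)
    if converter is None:
        return None
    # No file argument for the converter: netpbm tools read stdin by default.
    return [[converter], [GOCR, "-i", "-"]]


def run_ocr(image_bytes, ctype, timeout=10):
    """Convert the image to PNM, run gocr on it, and return the text.

    Returns None for unsupported types, missing tools, or timeouts.
    """
    stages = ocr_pipeline(ctype)
    if stages is None:
        return None
    try:
        pnm = subprocess.run(stages[0], input=image_bytes,
                             capture_output=True, timeout=timeout)
        text = subprocess.run(stages[1], input=pnm.stdout,
                              capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return text.stdout.decode("utf-8", errors="replace")
```

    Warnings such as the giftopnm messages shown above go to stderr, so they do not end up in the recognized text.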
  • No.5 | | 1174 bytes | |


    Hi,

    Could this plugin be extended to support png images?
    I receive quite a few of them.
    I guess it's probably just a line or two in addition to the jpg and gif handling.
    Also, might it be a good idea not to trust the content-type but instead
    use file or another 'detection utility'? As mentioned on the original
    ocrplugin page, gif2pnm and jpg2pnm have been abandoned because of
    sometimes wrong content types?

    Matt
  • No.6 | | 2537 bytes | |

    Matthias Keller wrote:
    >Hi
    >
    >Could this plugin be extended to support png images? I receive
    >quite a few of them. I guess it's probably just a line or two in
    >addition to the jpg and gif. Also might it be a good idea not to
    >trust the content-type but instead use file or another 'detection
    >utility'? As mentioned on the original ocrplugin page - gif2pnm and
    >jpg2pnm have been abandoned because of sometimes wrong content
    >types?
    >
    >Matt

    That is a good idea. I will try to use the file command
    somewhere to make sure we are really using the correct tool to
    convert. I explicitly use giftopnm and jpegtopnm here (from netpbm)
    because, as I mentioned in an earlier reply, I received some gifs
    which are corrupt, and these cause convert from imagemagick to drain
    CPU for 30 seconds and more without any result, so one should really
    avoid imagemagick here.

    In the latest version I am working on, I invoke giffix (from
    giflib) to fix these gifs before converting them with giftopnm.
    Adding png support will not be hard; I will put it on the todo list.

    I will post a new version in the wiki and announce it here as soon as
    I am finished. :)

    Thanks for the input :)

    Chris
  • No.7 | | 1981 bytes | |

    Perhaps corrupted gifs should be treated as spam?

  • No.8 | | 1199 bytes | |


    See

    Major changes: replaced imagemagick with netpbm, added png support, invoke
    giffix for broken gifs, detect the image format by magic bytes rather than
    by content-type, and added various configuration options.

    Feedback is welcome :)

    Chris
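
    Detecting the image format by magic bytes instead of the declared content-type might look roughly like this. It is a Python sketch under stated assumptions, not the plugin's Perl: sniff_image_type and content_type_mismatch are hypothetical names, while the byte signatures themselves are the standard GIF, JPEG and PNG file headers.

```python
def sniff_image_type(data: bytes):
    """Guess the image type from the first bytes of the file."""
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    return None  # unknown or not an image we handle


def content_type_mismatch(declared: str, data: bytes) -> bool:
    """True if the data is a known image type that differs from the header."""
    actual = sniff_image_type(data)
    return actual is not None and actual != declared.lower()
```

    A mail part declared as image/jpeg whose data starts with GIF89a would then be routed to the gif converter regardless of its header.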
  • No.10 | | 1491 bytes | |


    Hi Chris,

    Wanted to report back: works like a charm here. Thanks for the png
    support; I got one today with 23 hits :)

    Now I just have to figure out why I get such poor results on colourful
    images with gocr.

    Thanks for your work!!

    Matt
  • No.11 | | 702 bytes | |


    Since installation yesterday, my system hit FUZZY_OCR in 204 messages.
    scored 18, ten scored in the 20's and the rest between 30 and 83. Scan time
    ran between 6.4 and 16.6 seconds per message. I'm using a ton of SARE rules
    on a RHE server, dual Xeon 2.4 GHz with 2 GB RAM.

    If OCR is processor/memory intensive, could it be configured to kick in for
    lower scoring messages only?

    Tom Green
  • No.13 | | 1181 bytes | |


    I installed the above plugin, and I keep getting the same error.

    [root@beyond spamtest]# spamassassin -t < spam-gif-1.txt
    sh: /usr/bin/giffix: No such file or directory
    giftopnm: error reading magic number
    (null): Error reading magic number from Netpbm image stream. Most often,
    this means your input file is empty.
    sh: /usr/bin/giffix: No such file or directory
    giftopnm: error reading magic number
    (null): Error reading magic number from Netpbm image stream. Most often,
    this means your input file is empty.
    sh: /usr/bin/giffix: No such file or directory
    giftopnm: error reading magic number
    (null): Error reading magic number from Netpbm image stream. Most often,
    this means your input file is empty.

    I noticed the error occurs when the attachment is in gif format.
  • No.14 | | 1506 bytes | |


    You are missing a tool. It is called "giffix" and is part of the "giflib"
    package. Without it, the plugin can't fix broken gifs to analyze them.
    Install giflib.

    Chris
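
    A quick way to check up front that helper tools like the one above are installed is to look them up on the PATH. The snippet below is a small Python sketch (missing_helpers is a hypothetical name, not part of the plugin); the helper names are the tools mentioned in this thread.

```python
import shutil

# Helper tools mentioned in this thread.
HELPERS = ["giffix", "giftopnm", "jpegtopnm", "gocr"]


def missing_helpers(helpers=HELPERS):
    """Return the helper names that cannot be found on the PATH."""
    return [name for name in helpers if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_helpers():
        print(f"missing helper: {name}")
```

    If giffix shows up in the output, the giflib tools are not installed, or are installed somewhere outside the PATH.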

  • No.15 | | 1647 bytes | |

    >You are missing a tool. It is called "giffix" and part of the "giflib"
    >package. Without it, the plugin can't fix broken gifs to analyze them.
    >Install giflib.

    I did a yum install giflib, but it installed another package. What is the
    package for yum?

    libungif i386 4.1.3-3.fc4.2 updates-released 39 k
  • No.16 | | 2021 bytes | |

    According to Google, libungif seems correct for yum. If the giffix
    binary still isn't present, try installing giflib from source; that
    isn't a big deal.

    Chris
  • No.17 | | 406 bytes | |

    Just a little typo I think:

    In the wiki it says:

    " focr_tmp_path - String determining the absolute path to a directory where the plugin may write temporary files to (without trailing slash)
    focr_verbosity - Verbose level (0 - 2). "

    As far as I found out, it should be "focr_verbose" and not "focr_verbosity". In the example config file it is written correctly.

    Mathias
  • No.18 | | 571 bytes | |

    On Tue, Aug 08, 2006 at 12:43:24AM +0200, decoder wrote:
    >You can find a full description and an example in the wiki under:
    >
    >Ideas for improvements or critics are always welcome :)

    Hi,

    First, thanks for working on such a great plugin!

    I had to make this adjustment to the jpegtopnm call to get it to work
    with jpeg files:

    < open IMAGE_PROCESSOR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > $tempfile";
    > open IMAGE_PROCESSOR, "|/usr/bin/jpegtopnm |/usr/bin/gocr -i - > $tempfile";

    Hope this helps

    Yifang
  • No.19 | | 1585 bytes | |


    Hello again,

    I just released a new version which contains all suggestions made here
    on the mailing list. Changelog:

    * Added scoring for wrong content-type
    * Added scoring for broken gif images
    * Added configuration for helper applications
    * Added autodisable_score feature to disable the OCR engine if the
    message already has enough points

    You can now obtain the plugin as a tarball, the download URL is at the
    end of the wiki page. ()

    All new options in the config file, especially score adjustments for
    the new features, are explained there as well and in the sample cf file.

    Chris
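
    The idea behind autodisable_score and the two new scoring items in the changelog above can be pictured as follows. This is a Python sketch with hypothetical function names and score values; whether the plugin compares with "less than" or "less than or equal" is an assumption here, so check the sample cf file for the real behaviour and defaults.

```python
def should_run_ocr(current_score: float, autodisable_score: float) -> bool:
    """Skip the expensive OCR step once a message already has enough points."""
    return current_score < autodisable_score


def image_penalty(wrong_ctype: bool, corrupt_gif: bool,
                  wrong_ctype_score: float = 1.5,
                  corrupt_score: float = 2.5) -> float:
    """Extra points for a wrong content-type and for a broken gif.

    The default values are placeholders, not the plugin's defaults.
    """
    penalty = 0.0
    if wrong_ctype:
        penalty += wrong_ctype_score
    if corrupt_gif:
        penalty += corrupt_score
    return penalty
```

    With an autodisable_score of 10, a message that other rules have already scored at 12.5 would skip OCR entirely, which saves the scan time for mail that is going to be tagged anyway.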

  • No.20 | | 2107 bytes | |

    Hi,
    I get the following warnings when linting:
    [29661] warn: config: warning: description exists for non-existent rule FUZZY_OCR_CORRUPT_IMG
    [29661] warn: config: warning: description exists for non-existent rule FUZZY_OCR_WRONG_CTYPE
    [29661] warn: lint: 2 issues detected, please rerun with debug enabled for more information
  • No.21 | | 2614 bytes | |

    Indeed, I didn't notice that. It runs fine though. I'll fix it anyway
    by putting the descriptions into the plugin config so that they are parsed by
    the plugin.

    Thx
  • No.22 | | 426 bytes | |

    On Thursday 10 August 2006 05:56, decoder wrote:
    >Hello again,
    >
    >I just released a new version which contains all suggestions made here
    >on the mailing list. Changelog:
    >
    >* Added scoring for wrong content-type
    >* Added scoring for broken gif images

    Decoder: Could I convince you to say WHAT you are releasing
    a new version OF, rather than leaving the impression that
    Spamassassin released a new version?
  • No.23 | | 1576 bytes | |


    Hello there,

    I've just released version 2.1c, which fixes problems when using
    Spamassassin + Mailscanner (score is always 1.0).

    Thanks to Howard Kash for this bug report and patch.

    (Minor) changes:
    - Fixed a typo (treshold -> threshold); if you are using this variable
    in your config, you need to fix it there.
    - Removed the '-' from the jpegtopnm arguments to provide backwards
    compatibility with older netpbm (as someone else mentioned here before)

    The updated version can be found at the usual download URL (see the
    spamassassin wiki under FuzzyOcr)

    Best regards

    Christian
  • No.24 | | 1637 bytes | |


    A new beta is available (2.2-beta1).

    It includes a fix for a bug with jpeg content-types reported by
    Matthias Keller. Changes:
    - Debug file handling removed; instead, the tempfiles don't get
    deleted when in debug mode (verbose 1).
    - Logfile support; all debug messages go there
    - Many more debug messages
    - Error handling/logging (thanks to Ron Bender for pointing that out)
    - Added the necessary priority line to the cf file (thanks to Mark
    Martinec and others for reminding me about that)

    Please note that this is a beta so you should probably try it out
    in non-production environments first before blaming me ;D

    Chris
  • No.25 | | 2859 bytes | |

    Hi Chris,

    Wanted to report back: it's all working nicely and smoothly so far.
    Thanks to your plugin, yesterday an onslaught of about 30 image spams
    within one minute was blocked efficiently. Especially with my much
    extended wordlist, most of them get blocked accurately. My only concern
    is the varying results from gocr, which nobody has been able to help me with.
    I've tried 3 different gocr 0.40 versions and none seems to be as good
    as yours. You don't have the source to your version somewhere so I
    could try it?

    I've got one request though: when announcing a new version, could you
    post under a new subject instead of replying to the old one, maybe with the
    subject "FuzzyOcr v2.2-beta1 released"? In my thread-sorted view I
    always have to go look for a message in an old thread.

    By the way, I've subscribed to your list, though I feel general discussion about
    your plugin should be done here, whereas support inquiries and that kind of thing
    can nicely fit on the separate one.

    Matt
