Java

  • UTF8 accents & umlauts filter?


    We use ICU4J to do the filtering based on Unicode blocks. See
    for a sense of what
    you can do. It's worth it for us because we need to normalize Cyrillic
    as well as Roman text; it might be overkill for other situations. But it
    does good work. The first example on the page linked above shows
    accent-stripping: you normalize to NFD (decomposed Unicode, where
    accents are represented as non-spacing characters), then delete all the
    non-spacing characters, and finally normalize back to composed Unicode
    (NFC).
    Peter
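The three steps described above (decompose to NFD, delete the non-spacing marks, recompose) can be sketched roughly as follows. This is only an illustration, not the code Peter's team uses: it assumes the JDK's own `java.text.Normalizer` in place of ICU4J so that it runs without extra dependencies, and the class and method names (`AccentStripper`, `stripAccents`) are made up for the example.

```java
import java.text.Normalizer;

public class AccentStripper {

    /**
     * Hypothetical helper: decomposes to NFD, deletes non-spacing marks
     * (Unicode general category Mn), then recomposes to NFC.
     */
    public static String stripAccents(String input) {
        // Step 1: split each accented letter into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: drop every non-spacing mark.
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Step 3: normalize back to composed form.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Input is "ete uber" written with e-acute and u-umlaut.
        System.out.println(stripAccents("\u00E9t\u00E9 \u00FCber")); // prints "ete uber"
    }
}
```

Because the approach relies on canonical decomposition, it also works outside Latin script (Cyrillic short i becomes plain i), while letters that have no decomposition, such as o-with-stroke, pass through unchanged and would need a separate mapping.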
    -----Original Message-----
    From: Michael Imbeault [mailto:michael.imbeault (AT) sympatico (DOT) ca]
    Sent: Wednesday, September 13, 2006 9:34 PM
    To: java-user (AT) lucene (DOT) apache.org
    Subject: Re: UTF8 accents & umlauts filter?
    Thanks Yonik & Ken for both answers; I think the explanations went a
    little over my head, but I think you understood what I was talking
    about! Basically, a better filter to remove all possible accents (and
    umlauts as a bonus, for completeness' sake; I personally would have no
    use for it).
    I think it's way more work and way more complicated than I initially
    thought it would be. Does anyone feel able to do this?
    Michael Imbeault
    CHUL Research Center (CHUQ)
    2705 boul. Laurier
    Ste-Foy, QC, Canada, G1V 4G2
    Tel: (418) 654-2705, Fax: (418) 654-2212
    Yonik Seeley wrote:
    Thanks for the links, Michael; this one does look interesting:
    The challenge would be to make it fast; perhaps a custom hash table,
    or look into the cost of a perfect hash function.
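One way to get the speed discussed above is to precompute the mapping once and do a plain array lookup per character. The sketch below swaps in a direct-indexed array rather than a hash table or perfect hash, and it is an assumption-laden illustration: the covered range (U+00C0 to U+024F), the class name `AccentTable`, and the use of `java.text.Normalizer` to build the table are all choices made for this example, not anything from the thread.

```java
import java.text.Normalizer;

public class AccentTable {
    // Illustrative range: Latin-1 Supplement through Latin Extended-B.
    private static final char FIRST = '\u00C0';
    private static final char LAST = '\u024F';
    private static final char[] TABLE = new char[LAST - FIRST + 1];

    static {
        // Build the table once: map each precomposed letter to its base letter.
        for (char c = FIRST; c <= LAST; c++) {
            String d = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            // A decomposed form is base letter + marks; otherwise keep the char as-is.
            TABLE[c - FIRST] = d.length() > 1 ? d.charAt(0) : c;
        }
    }

    /** Replaces each in-range character by its table entry; others are untouched. */
    public static String fold(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            char c = out[i];
            if (c >= FIRST && c <= LAST) {
                out[i] = TABLE[c - FIRST];
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00C5ngstr\u00F6m")); // prints "Angstrom"
    }
}
```

The lookup costs one bounds check and one array read per character. The trade-off is coverage: only precomposed characters inside the chosen range are folded, and text that already uses separate combining marks is left alone.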
    Just to clear up some unicode/terminology issues:
    There are latin1 characters (the actual glyphs) represented by Unicode
    code points 0->255. There is also a latin1 encoding for Unicode (which
    can only represent Unicode code points 0->255).
    UTF8 is another encoding for Unicode characters (or code points), but
    that's not really relevant to a filter.
    So ISOLatin1AccentFilter removes accents from characters <= 255, and
    it doesn't matter what the original encoding was (ASCII, latin1, UTF8,
    UTF16, etc.)
    -Yonik
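To make the point about encodings concrete, here is a small sketch showing that the same text stored as latin1 bytes and as UTF8 bytes decodes to identical Unicode code points, which is all a filter ever sees. The class and method names (`EncodingDemo`, `codePoints`) are hypothetical and exist only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {

    /** Decodes bytes with the given charset and returns the resulting code points. */
    public static int[] codePoints(byte[] bytes, Charset charset) {
        return new String(bytes, charset).codePoints().toArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "cafe" with e-acute (code point 233)

        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // 4 bytes
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);        // 5 bytes

        int[] fromLatin1 = codePoints(latin1, StandardCharsets.ISO_8859_1);
        int[] fromUtf8 = codePoints(utf8, StandardCharsets.UTF_8);

        // Different byte sequences, same code points after decoding.
        System.out.println(latin1.length + " vs " + utf8.length + " bytes");
        System.out.println(Arrays.equals(fromLatin1, fromUtf8)); // prints "true"
    }
}
```

Since a Lucene token filter works on already-decoded characters, what limits it is the range of code points it maps, not the encoding the document was stored in.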
    --
    On 9/12/06, Michael Imbeault <michael.imbeault (AT) sympatico (DOT) ca> wrote:
    >Right now Lucene has an accent filter (ISOLatin1AccentFilter) that
    >removes accents on ISO text. What about a UTF8AccentFilter? Is
    >it planned to add such a filter? It would be very useful, as
    >ISOLatin1AccentFilter isn't able to remove some complex accents in
    >some languages encoded in UTF8 (I would paste examples, but I'm not
    >sure that they would display correctly). I think I saw a post long
    >ago on this mailing list about something like that, but it has never
    >been released officially.
    >
    >See
    >
    >2001, first post about utf8 accents:
    >ing=accent;#648
    >
    >2004, a good solution, but still incomplete:
    >tring=accent;#10792
    >
    >2006, best attempt yet, but sadly undelivered:
    >tring=accent;#32142
    >
    >I think Lucene would benefit from a complete UTF8 accents remover;
    >right now the best solution I have is to process everything in PHP
    >before indexing and at query time (and it's a little slow).
    >

    To unsubscribe, e-mail: java-user-unsubscribe (AT) lucene (DOT) apache.org
    For additional commands, e-mail: java-user-help (AT) lucene (DOT) apache.org
