We use ICU4J to do the filtering based on Unicode blocks. See
for a sense of what
you can do. It's worth it for us because we need to normalize Cyrillic
as well as Roman text; it might be overkill for other situations. But it
does good work. The first example on the page linked above shows
accent-stripping: you normalize to NFD (decomposed Unicode, where
accents are represented as non-spacing characters), then delete all the
non-spacing characters, and finally normalize back to composed Unicode.
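
The three steps just described (decompose, drop the non-spacing marks,
recompose) can be sketched roughly as follows. This is only an
illustration, not the ICU4J example referred to above: it swaps in the
JDK's own java.text.Normalizer (Java 6 and later) so that it runs without
extra jars, and the class and method names (AccentStripper, stripAccents)
are made up for the sketch.

```java
import java.text.Normalizer;

class AccentStripper {
    // Hypothetical helper: NFD-decompose, delete non-spacing marks
    // (Unicode general category Mn), then recompose to NFC.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Latin and Cyrillic input: e-acute, i-diaeresis, Cyrillic io and zhe
        System.out.println(stripAccents("caf\u00e9 na\u00efve \u0451\u0436"));
    }
}
```

Two caveats with this approach: characters that have no canonical
decomposition (for example U+00F8, o with stroke) pass through unchanged,
and Cyrillic letters such as U+0439 (short i) do decompose, so they lose
their breve as well. Whether that is wanted depends on the language.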
Peter
-----Original Message-----
From: Michael Imbeault [mailto:michael.imbeault (AT) sympatico (DOT) ca]
Sent: Wednesday, September 13, 2006 9:34 PM
To: java-user (AT) lucene (DOT) apache.org
Subject: Re: UTF8 accents & umlauts filter?
Thanks Yonik & Ken for both answers; I think the explanations went a
little over my head, but I think you understood what I was talking
about! Basically, a better filter to remove all possible accents (&
umlauts as a bonus, for completeness' sake; I personally would have no
use for it).
I think it's way more work and way more complicated than I initially
thought it would be. Does anyone feel able to do this?
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212
Yonik Seeley wrote:
Thanks for the links, Michael; this one does look interesting:
The challenge would be to make it fast: perhaps a custom hash table,
or look into the cost of a perfect hash function.
Just to clear up some unicode/terminology issues:
There are latin1 characters (the actual glyphs) represented by Unicode
code points 0->255. There is also a latin1 encoding for Unicode (which
can only represent Unicode code points 0->255).
UTF8 is another encoding for Unicode characters (or code points), but
that's not really relevant to a filter.
So ISOLatin1AccentFilter removes accents from characters <= 255, and
it doesn't matter what the original encoding was (ascii, latin1, UTF8,
UTF16, etc).
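
To make the code point vs. encoding distinction concrete, here is a small
sketch (class and method names are invented for illustration). The same
text is encoded as latin1 and as UTF8, giving different byte sequences;
once decoded back into a Java String the two are identical, which is why a
filter that works on chars never sees the original encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: encode text with the given charset, then decode it again.
    static String roundTrip(String text, Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    // Hypothetical helper: true if every char is a code point <= 255.
    static boolean allLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "\u00e9t\u00e9"; // e-acute, t, e-acute
        int latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1).length;
        int utf8Bytes = text.getBytes(StandardCharsets.UTF_8).length;
        String fromLatin1 = roundTrip(text, StandardCharsets.ISO_8859_1);
        String fromUtf8 = roundTrip(text, StandardCharsets.UTF_8);
        System.out.println("latin1 bytes: " + latin1Bytes + ", UTF8 bytes: " + utf8Bytes);
        System.out.println("same chars after decoding: " + fromLatin1.equals(fromUtf8));
        System.out.println("all code points <= 255: " + allLatin1(fromUtf8));
    }
}
```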
-Yonik
--
On 9/12/06, Michael Imbeault <michael.imbeault (AT) sympatico (DOT) ca> wrote:
>Right now Lucene has an accent filter (ISOLatin1AccentFilter) that
>removes accents from ISO Latin-1 text. What about a UTF8AccentFilter? Is
>it planned to add such a filter? It would be very useful, as
>ISOLatin1AccentFilter isn't able to remove some complex accents in
>some languages encoded in UTF8 (I would paste examples, but I'm not
>sure that they would display correctly). I think I saw a post long
>ago on this mailing list about something like that, but it has never
>been released officially.
>>
>See
>>
>2001, first post about utf8 accents:
>
>ing=accent;#648
>>
>2004, a good solution, but still incomplete :
>
>tring=accent;#10792
>>
>2006, best attempt yet, but sadly undelivered :
>
>tring=accent;#32142
>>
>>
>I think Lucene would benefit from a complete UTF8 accent remover;
>right now the best solution I have is to process everything in PHP
>before indexing and at query time (and it's a little slow).
>
To unsubscribe, e-mail: java-user-unsubscribe (AT) lucene (DOT) apache.org
For additional commands, e-mail: java-user-help (AT) lucene (DOT) apache.org