Sorting strings containing special characters (German 'Umlaute')

Hi!
I know that this topic has been discussed in the past, but I could not
find a working solution for my problem: sorting (lists of) strings
containing special characters like "ä", "ö" (German umlauts).
Consider the following list:
l = ["Aber", "Beere", "Ärger"]
For sorting, the letter "Ä" is supposed to be treated like "Ae",
therefore sorting this list should yield
l = ["Aber", "Ärger", "Beere"]
I know about the module locale and its function strcoll(string1,
string2), but currently this does not work correctly for me. Consider
locale.strcoll("Ärger", "Beere")
1
Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
Can someone help?
Btw: I'm using WinXP (German) and
locale.getdefaultlocale()
prints
('de_DE', 'cp1252')
TIA.
Dierk
No.1 | | 1583 bytes |
We tried this in a JavaScript version and it seems to work. Sorry for the
possibly bad translation to Python:
# coding: cp1252
def _deSpell(a):
    # Decode the cp1252 byte string, then spell out each special character.
    u = a.decode('cp1252')
    return (u.replace(u'\u00C4', 'Ae').replace(u'\u00e4', 'ae')
             .replace(u'\u00D6', 'Oe').replace(u'\u00f6', 'oe')
             .replace(u'\u00DC', 'Ue').replace(u'\u00fc', 'ue')
             .replace(u'\u00C5', 'Ao').replace(u'\u00e5', 'ao'))

def deSort(a, b):
    # Python 2 comparison function: compare the spelled-out forms.
    return cmp(_deSpell(a), _deSpell(b))

l = ["Aber", "Ärger", "Beere"]
l.sort(deSort)
print l
No.2 | | 1692 bytes |
DierkErdmann (AT) mail (DOT) com wrote:
> I know that this topic has been discussed in the past, but I could not
> find a working solution for my problem: sorting (lists of) strings
> containing special characters like "ä", "ö" (German umlauts).
> Consider the following list:
> l = ["Aber", "Beere", "Ärger"]
> For sorting the letter "Ä" is supposed to be treated like "Ae",
I don't think so:
>>> sorted(["Ast", "Ärger", "Ara"], locale.strcoll)
['Ara', '\xc3\x84rger', 'Ast']
>>> sorted(["Ast", "Aerger", "Ara"])
['Aerger', 'Ara', 'Ast']
> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]
> I know about the module locale and its method strcoll(string1,
> string2), but currently this does not work correctly for me. Consider
> locale.strcoll("Ärger", "Beere")
> 1
> Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
> Can someone help?
> Btw: I'm using WinXP (German) and
> locale.getdefaultlocale()
> prints
> ('de_DE', 'cp1252')
The default locale is not used by default; you have to set it explicitly:
>>> import locale
>>> locale.strcoll("Ärger", "Beere")
1
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> locale.strcoll("Ärger", "Beere")
-1
By the way, you will avoid a lot of "Ärger"* if you use unicode right from
the start.
Finally, for efficient sorting, a key function is preferable over a cmp
function:
>>> sorted(["Ast", "Ärger", "Ara"], key=locale.strxfrm)
['Ara', '\xc3\x84rger', 'Ast']
Peter
(*) German for "trouble"
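For readers on Python 3, where every str is already Unicode, the same idea (set the locale, then sort with a key function) might look roughly like the sketch below. The helper name sort_with_locale and the fallback when the requested locale is not installed are assumptions of this sketch, not something from the post above. The resulting order depends entirely on the platform's locale data.

```python
import locale

def sort_with_locale(words, locale_name=""):
    """Sort words by the collation rules of a locale.

    locale_name="" asks for the user's default locale. The helper name and
    the fallback behaviour are assumptions made for this sketch.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            # Requested locale is not installed: keep whatever is active.
            pass
        # strxfrm turns each string into a key that compares like strcoll.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting

print(sort_with_locale(["Ast", "Ärger", "Ara"]))
```

Restoring the previous setting keeps the call from changing collation for the rest of the program, since setlocale affects the whole process.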
No.3 | | 543 bytes |
DierkErdmann (AT) mail (DOT) com writes:
> For sorting the letter "Ä" is supposed to be treated like "Ae",
> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]
Are you sure? Maybe I'm thinking of another language, but I thought Ä should
be sorted together with A, but after A if the words are otherwise equal.
E.g. Antwort, Ärger, Beere. A proper strcoll handles that by
translating "Ärger" to e.g. ["Arger", <something like "E\0\0\0\0">],
then it can sort first by the un-accented name and then by the rest.
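The two-level idea described here (compare the accent-free spelling first, let the accents only break ties) can be approximated in Python 3 without any locale support. This is only a rough sketch: the names strip_accents and two_level_key are made up for illustration, "Arger" is a hypothetical word added to show the tie-break, and a real strcoll builds its keys differently.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed (e.g. 'Ä' becomes 'A')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def two_level_key(word):
    # Primary key: accent-free, case-folded spelling.
    # Secondary key: the original word, which only matters on ties.
    return (strip_accents(word).casefold(), word)

print(sorted(["Beere", "Ärger", "Antwort", "Arger"], key=two_level_key))
# ['Antwort', 'Arger', 'Ärger', 'Beere']
```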
No.4 | | 796 bytes |
Hallvard B Furuseth wrote:
> DierkErdmann (AT) mail (DOT) com writes:
> > For sorting the letter "Ä" is supposed to be treated like "Ae",
> > therefore sorting this list should yield
> > l = ["Aber", "Ärger", "Beere"]
> Are you sure? Maybe I'm thinking of another language, I thought Ä
> should be sorted together with A, but after A if the words are
> otherwise equal.
In German, there are some different forms:
- the classic sorting for e.g. word lists: umlauts and plain vowels
  are of same value (like you mentioned): ä = a
- name list sorting for e.g. phone books: umlauts have the same
  value as their substitutes (like Dierk described): ä = ae
There are others, too, but those are the most widely used.
Regards,
Björn
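The phone-book form (ä = ae) can be sketched in Python 3 as a key function that expands each umlaut before comparing. The table name _PHONEBOOK and the function name phonebook_key are illustrative, and the ß = ss entry is an assumption that is not discussed in this thread.

```python
# Expansion table for the phone-book variant.
# Assumed mapping: ä=ae, ö=oe, ü=ue, plus ß=ss (not from the thread).
_PHONEBOOK = str.maketrans({
    "Ä": "Ae", "ä": "ae",
    "Ö": "Oe", "ö": "oe",
    "Ü": "Ue", "ü": "ue",
    "ß": "ss",
})

def phonebook_key(word):
    # Expand the umlauts, then ignore case for the comparison.
    return word.translate(_PHONEBOOK).casefold()

print(sorted(["Aber", "Beere", "Ärger"], key=phonebook_key))
# ['Aber', 'Ärger', 'Beere']
```

With this key, "Ärger" compares as "aerger" and therefore lands between "Aber" and "Beere", which is the order Dierk asked for.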
No.5 | | 1456 bytes |
On 2 March, 15:25, Peter <__pete (AT) web (DOT) de> wrote:
> DierkErdm (AT) mail (DOT) com wrote:
> > For sorting the letter "Ä" is supposed to be treated like "Ae",
There are several ways of defining the sorting order. The variant "ä
equals ae" follows DIN 5007-2 (according to Wikipedia); defining (a
equals ä) complies with DIN 5007-1. Therefore both options are
possible.
> The default locale is not used by default; you have to set it explicitly
> >>> import locale
> >>> locale.strcoll("Ärger", "Beere")
> 1
> >>> locale.setlocale(locale.LC_ALL, "")
> 'de_DE.UTF-8'
> >>> locale.strcoll("Ärger", "Beere")
> -1
On my machine
locale.setlocale(locale.LC_ALL, "")
gives
'German_Germany.1252'
But this does not affect the sorting order as it does on your
computer.
locale.strcoll("Ärger", "Beere")
yields 1 in both cases.
Thank you for your hint about using unicode from the beginning; see the
difference:
s1 = unicode("Ärger", "latin-1")
s2 = unicode("Beere", "latin-1")
locale.strcoll(s1, s2)
1
locale.setlocale(locale.LC_ALL, "")
'German_Germany.1252'
locale.strcoll(s1, s2)
-1
compared to
s1 = "Ärger"
s2 = "Beere"
locale.strcoll(s1, s2)
1
locale.setlocale(locale.LC_ALL, "")
'German_Germany.1252'
locale.strcoll(s1, s2)
1
Thanks for your help.
Dierk
No.6 | | 1159 bytes |
Bjoern Schliessmann wrote:
> Hallvard B Furuseth wrote:
> > DierkErdmann (AT) mail (DOT) com writes:
> In German, there are some different forms:
> - the classic sorting for e.g. word lists: umlauts and plain vowels
>   are of same value (like you mentioned): ä = a
> - name list sorting for e.g. phone books: umlauts have the same
>   value as their substitutes (like Dierk described): ä = ae
> There are others, too, but those are the most widely used.
Björn, in one of our projects we are sorting in JavaScript in several languages:
English, German, Scandinavian languages, Japanese. From somewhere (I cannot
actually remember) we got this sort spelling function for Scandic languages:
a
    .replace(/\u00C4/g,'A~') //A umlaut
    .replace(/\u00e4/g,'a~') //a umlaut
    .replace(/\u00D6/g,'O~') //O umlaut
    .replace(/\u00f6/g,'o~') //o umlaut
    .replace(/\u00DC/g,'U~') //U umlaut
    .replace(/\u00fc/g,'u~') //u umlaut
    .replace(/\u00C5/g,'A~~') //A ring
    .replace(/\u00e5/g,'a~~'); //a ring
Does this actually make sense?
No.7 | | 1159 bytes |
Robin Becker wrote:
> Björn, in one of our projects we are sorting in JavaScript in
> several languages: English, German, Scandinavian languages,
> Japanese. From somewhere (I cannot actually remember) we got this
> sort spelling function for Scandic languages:
> a
>     .replace(/\u00C4/g,'A~') //A umlaut
>     .replace(/\u00e4/g,'a~') //a umlaut
>     .replace(/\u00D6/g,'O~') //O umlaut
>     .replace(/\u00f6/g,'o~') //o umlaut
>     .replace(/\u00DC/g,'U~') //U umlaut
>     .replace(/\u00fc/g,'u~') //u umlaut
>     .replace(/\u00C5/g,'A~~') //A ring
>     .replace(/\u00e5/g,'a~~'); //a ring
> Does this actually make sense?
If I'm not mistaken, this would sort all umlauts after the "pure"
vowels. This is, according to
<http://de.wikipedia.org/wiki/Alphabetische_Sortierung>, used in Austria.
If you can't understand German, the rules given there in the
section "Einsortierungsregeln" (roughly: ordering rules) translate
as follows:
"X und Y sind gleich": "X equals Y"
"X kommt nach Y": "X comes after Y"
Regards & HTH,
Björn
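To see what the tilde replacements do, here is a rough Python 3 rendering of the same trick. The names _AFTER_VOWEL and umlaut_after_vowel_key are invented for this sketch, and "Azur" is just a test word. It relies on '~' having a higher code point than 'z', so "a~" compares greater than "a" followed by any ASCII letter.

```python
# Append '~' to the base vowel so an umlaut sorts after every plain-vowel word.
_AFTER_VOWEL = str.maketrans({
    "ä": "a~", "ö": "o~", "ü": "u~",
    "å": "a~~",
})

def umlaut_after_vowel_key(word):
    # Lower-case first so the table only needs lower-case entries.
    return word.lower().translate(_AFTER_VOWEL)

print(sorted(["Ärger", "Azur", "Beere", "Ara"], key=umlaut_after_vowel_key))
# ['Ara', 'Azur', 'Ärger', 'Beere']
```

Note that "Ärger" lands after "Azur" here: every Ä word follows every A word, which differs from both German forms listed earlier in the thread. A word containing a literal '~' would confuse this key.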
No.8 | | 1023 bytes |
Robin Becker wrote:
> Björn, in one of our projects we are sorting in JavaScript in several
> languages: English, German, Scandinavian languages, Japanese. From
> somewhere (I cannot actually remember) we got this sort spelling
> function for Scandic languages:
> a
>     .replace(/\u00C4/g,'A~') //A umlaut
>     .replace(/\u00e4/g,'a~') //a umlaut
>     .replace(/\u00D6/g,'O~') //O umlaut
>     .replace(/\u00f6/g,'o~') //o umlaut
>     .replace(/\u00DC/g,'U~') //U umlaut
>     .replace(/\u00fc/g,'u~') //u umlaut
>     .replace(/\u00C5/g,'A~~') //A ring
>     .replace(/\u00e5/g,'a~~'); //a ring
> Does this actually make sense?
I think this order is not correct for Finnish, which is one of the
Scandinavian languages. The Finnish alphabet in alphabetical order is:
a-z, å, ä, ö
If I understand correctly, your replacements cause the order of the last
3 characters to be
ä, å, ö
which is wrong.
HTH,
Jussi
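One way to get the order Jussi describes (a-z, then å, ä, ö) is to rank each character by its position in an explicit alphabet string. The sketch below is Python 3; FINNISH_ALPHABET, _RANK and finnish_key are illustrative names, the sample words are arbitrary, and ranking unknown characters last is an assumption. Real Finnish collation has further rules that this ignores.

```python
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
_RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}

def finnish_key(word):
    # Map each character to its position in the alphabet string.
    # Characters outside the alphabet rank after 'ö' (an assumption).
    return [_RANK.get(ch, len(FINNISH_ALPHABET)) for ch in word.lower()]

print(sorted(["öljy", "äiti", "zeta", "åke", "apina"], key=finnish_key))
# ['apina', 'zeta', 'åke', 'äiti', 'öljy']
```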