Development

  • Sorting strings containing special characters (german 'Umlaute')

    8 answers - 847 bytes

    Hi!
    I know that this topic has been discussed in the past, but I could not
    find a working solution for my problem: sorting (lists of) strings
    containing special characters such as the German umlauts (ä, ö, ü).
    Consider the following list:
    l = ["Aber", "Beere", "Ärger"]
    For sorting the letter "Ä" is supposed to be treated like "Ae",
    therefore sorting this list should yield
    l = ["Aber", "Ärger", "Beere"]
    I know about the module locale and its method strcoll(string1,
    string2), but currently this does not work correctly for me. Consider
    locale.strcoll("Ärger", "Beere")
    1
    Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
    Can someone help?
    Btw: I'm using WinXP (German) and
    locale.getdefaultlocale()
    prints
    ('de_DE', 'cp1252')
    TIA.
    Dierk
  • No.1 | | 1583 bytes | |

    DierkErdmann (AT) mail (DOT) com wrote:
    > I know that this topic has been discussed in the past, but I could not
    > find a working solution for my problem: sorting (lists of) strings
    > containing special characters such as the German umlauts (ä, ö, ü).
    > [...]

    We tried this in a JavaScript version and it seems to work. Sorry for the
    long line and the possibly bad translation to Python.

    #coding: cp1252
    def _deSpell(a):
        # Decode the cp1252 byte string, then spell out each special letter.
        u = a.decode('cp1252')
        return (u.replace(u'\u00C4', 'Ae').replace(u'\u00e4', 'ae')
                 .replace(u'\u00D6', 'Oe').replace(u'\u00f6', 'oe')
                 .replace(u'\u00DC', 'Ue').replace(u'\u00fc', 'ue')
                 .replace(u'\u00C5', 'Ao').replace(u'\u00e5', 'ao'))

    def deSort(a, b):
        # Python 2 comparison function: compare the spelled-out forms.
        return cmp(_deSpell(a), _deSpell(b))

    l = ["Aber", "Ärger", "Beere"]
    l.sort(deSort)
    print l
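
    For readers on Python 3, where cmp() and the comparison-function
    argument of list.sort() no longer exist, the same idea can be sketched
    as a key function. This is only a sketch, not the poster's code: the
    names UMLAUT_EXPANSION and phonebook_key are made up for illustration,
    only the German umlauts are mapped, and lower-casing the key is an
    added assumption so that upper and lower case sort together.

```python
# Hypothetical names; only the German umlauts are handled here.
UMLAUT_EXPANSION = str.maketrans({
    "Ä": "Ae", "ä": "ae",
    "Ö": "Oe", "ö": "oe",
    "Ü": "Ue", "ü": "ue",
})

def phonebook_key(word):
    """Spell out umlauts as two letters, then compare case-insensitively."""
    return word.translate(UMLAUT_EXPANSION).lower()

words = ["Beere", "Ärger", "Aber"]
print(sorted(words, key=phonebook_key))
# -> ['Aber', 'Ärger', 'Beere']
```

    The key is computed once per word, so the original strings stay
    untouched and only their sort position changes.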
  • No.2 | | 1692 bytes | |

    DierkErdmann (AT) mail (DOT) com wrote:

    > I know that this topic has been discussed in the past, but I could not
    > find a working solution for my problem: sorting (lists of) strings
    > containing special characters such as the German umlauts (ä, ö, ü).
    > Consider the following list:
    > l = ["Aber", "Beere", "Ärger"]
    >
    > For sorting the letter "Ä" is supposed to be treated like "Ae",

    I don't think so:

    >>> sorted(["Ast", "Ärger", "Ara"], locale.strcoll)
    ['Ara', '\xc3\x84rger', 'Ast']

    >>> sorted(["Ast", "Aerger", "Ara"])
    ['Aerger', 'Ara', 'Ast']

    > therefore sorting this list should yield
    > l = ["Aber", "Ärger", "Beere"]
    >
    > I know about the module locale and its method strcoll(string1,
    > string2), but currently this does not work correctly for me. Consider
    > locale.strcoll("Ärger", "Beere")
    > 1
    >
    > Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
    > Can someone help?
    >
    > Btw: I'm using WinXP (German) and
    > locale.getdefaultlocale()
    > prints
    > ('de_DE', 'cp1252')

    The default locale is not used by default; you have to set it explicitly:

    >>> import locale
    >>> locale.strcoll("Ärger", "Beere")
    1
    >>> locale.setlocale(locale.LC_ALL, "")
    'de_DE.UTF-8'
    >>> locale.strcoll("Ärger", "Beere")
    -1

    By the way, you will avoid a lot of "Ärger"* if you use unicode right from
    the start.

    Finally, for efficient sorting, a key function is preferable over a cmp
    function:

    >>> sorted(["Ast", "Ärger", "Ara"], key=locale.strxfrm)
    ['Ara', '\xc3\x84rger', 'Ast']

    Peter

    (*) German for "trouble"
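
    The advice above (set the locale explicitly, then sort with
    locale.strxfrm as the key) can be sketched for Python 3 as follows.
    This is a hedged sketch under assumptions: the helper name
    locale_sorted is hypothetical, the locale name de_DE.UTF-8 is only an
    example and must be installed on the system (Windows uses names of the
    form German_Germany.1252), and the resulting order depends entirely on
    the platform's collation tables.

```python
import locale

def locale_sorted(words, locale_name=""):
    """Sort words by the collation rules of locale_name ("" = user default).

    Raises locale.Error if the requested locale is not installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key whose plain comparison
        # matches strcoll, so every word is transformed only once.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore it

if __name__ == "__main__":
    try:
        print(locale_sorted(["Ast", "Ärger", "Ara"], "de_DE.UTF-8"))
    except locale.Error:
        print("de_DE.UTF-8 is not available on this system")
```

    Restoring the previous setting keeps the change local to the call,
    since setlocale affects the whole process.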
  • No.3 | | 543 bytes | |

    DierkErdmann (AT) mail (DOT) com writes:
    > For sorting the letter "Ä" is supposed to be treated like "Ae",
    > therefore sorting this list should yield
    > l = ["Aber", "Ärger", "Beere"]

    Are you sure? Maybe I'm thinking of another language, but I thought "Ä"
    should be sorted together with "A", but after "A" if the words are
    otherwise equal. E.g. Antwort, Ärger, Beere. A proper strcoll handles that
    by translating "Ärger" to e.g. ["Arger", <something like "E\0\0\0\0">],
    then it can sort first by the un-accented name and then by the rest.
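
    The two-level idea described above (compare the unaccented spelling
    first, and use the accents only to break ties) can be imitated without
    strcoll. A rough Python 3 sketch, assuming that stripping combining
    marks after NFD normalization is an acceptable stand-in for the
    unaccented form; the function names are hypothetical, "Arger" is a
    made-up word that only demonstrates the tie-break, and a real
    collation library does considerably more than this.

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed, e.g. 'Ärger' -> 'Arger'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def two_level_key(word):
    # Primary level: the unaccented, case-folded spelling.
    # Secondary level: the original word, used only to break ties.
    return (strip_accents(word).casefold(), word)

words = ["Beere", "Ärger", "Arger", "Antwort"]
print(sorted(words, key=two_level_key))
# -> ['Antwort', 'Arger', 'Ärger', 'Beere']
```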
  • No.4 | | 796 bytes | |

    Hallvard B Furuseth wrote:
    > DierkErdmann (AT) mail (DOT) com writes:
    >
    >> For sorting the letter "Ä" is supposed to be treated like "Ae",
    >> therefore sorting this list should yield
    >> l = ["Aber", "Ärger", "Beere"]
    >
    > Are you sure? Maybe I'm thinking of another language, I thought "Ä"
    > should be sorted together with "A", but after "A" if the words are
    > otherwise equal.

    In German, there are some different forms:
    - the classic sorting for e.g. word lists: umlauts and plain vowels
    are of same value (like you mentioned): ä = a
    - name list sorting for e.g. phone books: umlauts have the same
    value as their substitutes (like Dierk described): ä = ae

    There are others, too, but those are the most widely used.

    Regards,

    Björn
  • No.5 | | 1456 bytes | |

    On 2 March, 15:25, Peter <__pete (AT) web (DOT) de> wrote:
    > DierkErdm (AT) mail (DOT) com wrote:
    >> For sorting the letter "Ä" is supposed to be treated like "Ae",

    There are several ways of defining the sorting order. The variant "ä
    equals ae" follows DIN 5007 (according to Wikipedia); defining "a
    equals ä" complies with DIN 5007-1. Therefore both options are
    possible.

    > The default locale is not used by default; you have to set it explicitly
    >
    > >>> import locale
    > >>> locale.strcoll("Ärger", "Beere")
    > 1
    > >>> locale.setlocale(locale.LC_ALL, "")
    > 'de_DE.UTF-8'
    > >>> locale.strcoll("Ärger", "Beere")
    > -1

    On my machine
    locale.setlocale(locale.LC_ALL, "")
    gives
    'German_Germany.1252'

    But this does not affect the sorting order as it does on your
    computer.
    locale.strcoll("Ärger", "Beere")
    yields 1 in both cases.

    Thank you for your hint about using unicode from the beginning; see the
    difference:
    s1 = unicode("Ärger", "latin-1")
    s2 = unicode("Beere", "latin-1")
    locale.strcoll(s1, s2)
    1
    locale.setlocale(locale.LC_ALL, "")
    'German_Germany.1252'
    locale.strcoll(s1, s2)
    -1

    compared to

    s1 = "Ärger"
    s2 = "Beere"
    locale.strcoll(s1, s2)
    1
    locale.setlocale(locale.LC_ALL, "")
    'German_Germany.1252'
    locale.strcoll(s1, s2)
    1

    Thanks for your help.

    Dierk

  • No.6 | | 1159 bytes | |

    Bjoern Schliessmann wrote:

    > In German, there are some different forms:
    > - the classic sorting for e.g. word lists: umlauts and plain vowels
    > are of same value (like you mentioned): ä = a
    > - name list sorting for e.g. phone books: umlauts have the same
    > value as their substitutes (like Dierk described): ä = ae
    >
    > There are others, too, but those are the most widely used.

    Björn, in one of our projects we are sorting in JavaScript in several
    languages (English, German, Scandinavian languages, Japanese); from
    somewhere (I cannot actually remember where) we got this sort spelling
    function for Scandinavian languages:

    a
      .replace(/\u00C4/g,'A~') //A umlaut
      .replace(/\u00e4/g,'a~') //a umlaut
      .replace(/\u00D6/g,'O~') //O umlaut
      .replace(/\u00f6/g,'o~') //o umlaut
      .replace(/\u00DC/g,'U~') //U umlaut
      .replace(/\u00fc/g,'u~') //u umlaut
      .replace(/\u00C5/g,'A~~') //A ring
      .replace(/\u00e5/g,'a~~'); //a ring

    Does this actually make sense?
  • No.7 | | 1159 bytes | |

    Robin Becker wrote:

    > Björn, in one of our projects we are sorting in javascript in
    > several languages English, German, Scandinavian languages,
    > Japanese; from somewhere (I cannot actually remember) we got this
    > sort spelling function for scandic languages
    >
    > a
    > .replace(/\u00C4/g,'A~') //A umlaut
    > .replace(/\u00e4/g,'a~') //a umlaut
    > .replace(/\u00D6/g,'O~') //O umlaut
    > .replace(/\u00f6/g,'o~') //o umlaut
    > .replace(/\u00DC/g,'U~') //U umlaut
    > .replace(/\u00fc/g,'u~') //u umlaut
    > .replace(/\u00C5/g,'A~~') //A ring
    > .replace(/\u00e5/g,'a~~'); //a ring
    >
    > does this actually make sense?

    If I'm not mistaken, this would sort all umlauts after the "pure"
    vowels. This is, according to
    <http://de.wikipedia.org/wiki/Alphabetische_Sortierung>, used in Austria.

    If you can't understand German, the rules given there in
    section "Einsortierungsregeln" (roughly: ordering rules) translate
    as follows:

    "X und Y sind gleich": "X equals Y"
    "X kommt nach Y": "X comes after Y"

    Regards&HTH,

    Björn
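
    Why the "~" suffix pushes umlauts behind the plain vowels can be
    checked directly: "~" (code point 126) is greater than every ASCII
    letter, so "A~" compares greater than "Az". A small Python 3 sketch of
    the same replacements, using a hypothetical tilde_key helper and a
    made-up word list:

```python
# Same substitutions as in the quoted JavaScript, assuming 'O~' was
# meant for the capital O umlaut. TILDE_TABLE and tilde_key are
# illustrative names, not part of the original code.
TILDE_TABLE = str.maketrans({
    "Ä": "A~", "ä": "a~",
    "Ö": "O~", "ö": "o~",
    "Ü": "U~", "ü": "u~",
    "Å": "A~~", "å": "a~~",
})

def tilde_key(word):
    """Rewrite special letters so plain code-point order sorts them last."""
    return word.translate(TILDE_TABLE)

words = ["Ärger", "Azur", "Ast"]
print(sorted(words, key=tilde_key))
# -> ['Ast', 'Azur', 'Ärger']
```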
  • No.8 | | 1023 bytes | |

    Robin Becker wrote:

    > Björn, in one of our projects we are sorting in javascript in several
    > languages English, German, Scandinavian languages, Japanese; from
    > somewhere (I cannot actually remember) we got this sort spelling
    > function for scandic languages
    >
    > a
    > .replace(/\u00C4/g,'A~') //A umlaut
    > .replace(/\u00e4/g,'a~') //a umlaut
    > .replace(/\u00D6/g,'O~') //O umlaut
    > .replace(/\u00f6/g,'o~') //o umlaut
    > .replace(/\u00DC/g,'U~') //U umlaut
    > .replace(/\u00fc/g,'u~') //u umlaut
    > .replace(/\u00C5/g,'A~~') //A ring
    > .replace(/\u00e5/g,'a~~'); //a ring
    >
    > does this actually make sense?

    I think this order is not correct for Finnish, which is one of the
    Scandinavian languages. The Finnish alphabet in alphabetical order is:

    a-z, å, ä, ö

    If I understand correctly, your replacements cause the order of the last
    3 characters to be

    ä, å, ö

    which is wrong.

    HTH,
    Jussi
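
    One way to get the Finnish order, with å, ä, ö following z, is to
    rank characters by their position in an explicit alphabet string
    instead of rewriting them. A minimal Python 3 sketch;
    FINNISH_ALPHABET and finnish_key are names invented here, and
    characters outside the alphabet (digits, punctuation, other accented
    letters) are simply placed after it, which is a simplification.

```python
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}

def finnish_key(word):
    """Map each character to its position in the Finnish alphabet."""
    # Characters not in the alphabet sort after it, by code point.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]

words = ["öljy", "ääni", "åker", "zeppelin", "auto"]
print(sorted(words, key=finnish_key))
# -> ['auto', 'zeppelin', 'åker', 'ääni', 'öljy']
```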
