Databases

NAVIGATION
CATEGORIES
REFERRENCE
LINKS
  • Locale/encoding problem/question

    6 answers - 885 bytes - related search similar search Add To My Delicious Add To My Stumble Upon Add To My Google Mark Add To My Facebook Add To My Digg Add To My Reddit

    Hello all,
    I somehow managed to stuff up the encoding (or locale or something) in a
    transfer of a database from one machine to another (also different linux
    distribution).
    The problem is this: the origional database was created and populated
    with data using whatever default locale/encoding was installed on the
    first machine.
    Words would appear correctly, as in: H
    After dumping the database and restoring on a different one, the word
    incorrectly appears as: H
    Before I do any more damage or waste time, I'd appreciate any ideas or
    pointers on correctly performing the initdb -E<encoding>
    <locale(or any other ideas on resolving this issue) , before
    I try and restore the data again.
    Thanks
    henk
    (end of broadcast)
    TIP 5: don't forget to increase your free space map settings
  • No.1 | | 1268 bytes | |

    Fri, Aug 04, 2006 at 10:48:17AM +0200, henka (AT) cityweb (DOT) co.za wrote:
    Hello all,

    I somehow managed to stuff up the encoding (or locale or something) in a
    transfer of a database from one machine to another (also different linux
    distribution).

    The problem is this: the origional database was created and populated
    with data using whatever default locale/encoding was installed on the
    first machine.

    Two big questions:

    1. What encoding are the two database (\l will tell you)?
    2. What encoding are the clients expecting?

    It is entirely possible that the databases have got the right encoding,
    but the client you're accessing it doesn't understand. For example, if
    you've got both your databases on UTF-8, then the transfer went fine,
    but perhaps the client on the new machine doesn't display UTF-8.

    This happens because psql and other clients default to the same
    encoding as the server and don't check to see if the terminal actually
    supports that. So what you're decribing could also be a result of the
    server sending you UTF-8, the client displaying that, but the xterm or
    whatever you're using thinking it's Latin1.

    Hope this helps,
  • No.2 | | 1814 bytes | |

    Fri, Aug 04, 2006 at 10:48:17AM +0200, henka (AT) cityweb (DOT) co.za wrote:
    >Hello all,
    >>

    >I somehow managed to stuff up the encoding (or locale or something) in a
    >transfer of a database from one machine to another (also different linux
    >distribution).
    >>

    >The problem is this: the origional database was created and populated
    >with data using whatever default locale/encoding was installed on the
    >first machine.
    >

    Two big questions:

    1. What encoding are the two database (\l will tell you)?
    2. What encoding are the clients expecting?

    Thanks for the response, Martijn.

    I *think* the client_encoding origionally in the db was UTF-8 (but I could
    be wrong, it might have been LATIN1). I would imagine that LATIN1 would
    be the right one, since it needs to display standard english, plus some
    others (such as etc).

    The multibyte chars show up in xterm (putty) -and- when the data is
    displayed using php in a browser - both incorrectly.

    I've even tried using LATIN1 (ie, explicitly setting it to latin1 using
    initdb, and then restoring the database after changing the 'utf-8' strings
    in the dump data to 'latin1'). This still yields the funny chars.

    To be honest, I have no idea what the origional encoding was.

    Can you suggest any other approaches I can try to restore the database so
    that those chars display correctly?

    All comments are welcome.

    Regards
    Henk

    (end of broadcast)
    TIP 9: In versions below 8.0, the planner will ignore your desire to
    choose an index scan if your joining column's datatypes do not
    match
  • No.3 | | 1246 bytes | |

    Fri, Aug 04, 2006 at 11:58:22AM +0200, henka (AT) cityweb (DOT) co.za wrote:
    Fri, Aug 04, 2006 at 10:48:17AM +0200, henka (AT) cityweb (DOT) co.za wrote:
    Two big questions:

    1. What encoding are the two database (\l will tell you)?
    2. What encoding are the clients expecting?

    I've even tried using LATIN1 (ie, explicitly setting it to latin1 using
    initdb, and then restoring the database after changing the 'utf-8' strings
    in the dump data to 'latin1'). This still yields the funny chars.

    Wait, so the dump is in utf-8? You shouldn't need to edit the dump,
    postgresql will convert the encodings on the fly while loading.

    To be honest, I have no idea what the origional encoding was.

    It should be in the dump file, almost the first line. Locale is of no
    interest to pg_dump, you'll have to decide how you want it.

    Can you suggest any other approaches I can try to restore the database so
    that those chars display correctly?

    Well, at the very least, does it go away if you type:

    set client_encoding=latin1;

    Please provide more details about your setup too, your client is on
    windows? The server is ?

    Have a nice day,
  • No.4 | | 1844 bytes | |

    Fri, Aug 04, 2006 at 11:58:22AM +0200, henka (AT) cityweb (DOT) co.za wrote:
    >Fri, Aug 04, 2006 at 10:48:17AM +0200, henka (AT) cityweb (DOT) co.za wrote:
    >Two big questions:
    >>

    >1. What encoding are the two database (\l will tell you)?
    >2. What encoding are the clients expecting?
    >
    >I've even tried using LATIN1 (ie, explicitly setting it to latin1 using
    >initdb, and then restoring the database after changing the 'utf-8'
    >strings
    >in the dump data to 'latin1'). This still yields the funny chars.
    >

    Wait, so the dump is in utf-8? You shouldn't need to edit the dump,
    postgresql will convert the encodings on the fly while loading.

    I've actually found two versions - one with UTF-8, and the other LATIN1.

    >To be honest, I have no idea what the origional encoding was.
    >

    It should be in the dump file, almost the first line. Locale is of no
    interest to pg_dump, you'll have to decide how you want it.

    Yes: UTF-8 and the other is LATIN1

    Well, at the very least, does it go away if you type:

    set client_encoding=latin1;

    No it doesn't. That was one of the first things I tried after reading the
    docs.

    Please provide more details about your setup too, your client is on
    windows? The server is ?

    Server:
    Linux debian sarge
    PG: 8.1.4
    show all:
    lc_collate C
    lc_ctype C
    lc_messages C
    lc_monetary C
    lc_numeric C
    lc_time C
    client_encoding LATIN1 (or UTF-8)

    Clients:
    Windows using PuTTY (ie, for psql), and dynamic web content with PHP/Pg
    (on any browser).

    (end of broadcast)
    TIP 6: explain analyze is your friend
  • No.5 | | 1587 bytes | |

    henka (AT) cityweb (DOT) co.za writes:
    >It should be in the dump file, almost the first line. Locale is of no
    >interest to pg_dump, you'll have to decide how you want it.


    Yes: UTF-8 and the other is LATIN1

    Note that this represents what the original server *thought* the
    encoding was. But it's not at all impossible that the server thought
    the data was LATIN1 when it was really UTF8. (The other way around is
    less plausible because the server would have been able to detect
    encoding errors.) If you were using clients that treated the data
    as UTF8 without paying attention to what the server thought, you'd
    not have realized you were mislabeling the data.

    But, if you tried to load data marked as LATIN1 into a server using
    UTF8, it'd have applied a LATIN1 to UTF8 conversion, and then
    everything's hosed.

    I'd suggest actually inspecting the data in the dump file: it's not that
    hard to tell UTF8 from LATIN1 if you look at the byte sequences.

    you could just take the file marked LATIN1, edit it to change the
    client_encoding setting to say the data is UTF8, and see if you can
    load it. If it's not UTF8, 8.1.4 will almost certainly detect a ton of
    encoding errors.

    regards, tom lane

    (end of broadcast)
    TIP 1: if posting/reading through Usenet, please send an appropriate
    subscribe-nomail command to majordomo (AT) postgresql (DOT) org so that your
    message can get through to the mailing list cleanly
  • No.6 | | 1874 bytes | |

    henka (AT) cityweb (DOT) co.za writes:
    It should be in the dump file, almost the first line. Locale is of no
    interest to pg_dump, you'll have to decide how you want it.
    >
    >Yes: UTF-8 and the other is LATIN1
    >

    Note that this represents what the original server *thought* the
    encoding was. But it's not at all impossible that the server thought
    the data was LATIN1 when it was really UTF8. (The other way around is
    less plausible because the server would have been able to detect
    encoding errors.) If you were using clients that treated the data
    as UTF8 without paying attention to what the server thought, you'd
    not have realized you were mislabeling the data.

    But, if you tried to load data marked as LATIN1 into a server using
    UTF8, it'd have applied a LATIN1 to UTF8 conversion, and then
    everything's hosed.

    I'd suggest actually inspecting the data in the dump file: it's not that
    hard to tell UTF8 from LATIN1 if you look at the byte sequences.

    you could just take the file marked LATIN1, edit it to change the
    client_encoding setting to say the data is UTF8, and see if you can
    load it. If it's not UTF8, 8.1.4 will almost certainly detect a ton of
    encoding errors.

    Thanks Tom, your suggestion worked.

    Just to document this for others, this is what I did:
    - created a new empty DB: initdb -ELATIN1 -D data.
    - edited dump file with UTF8 encoding and changed to LATIN1 (doing the
    reverse resulted in encoding errors during restore).
    - restored database

    So, it looks like it was the reverse: the db thought it was UTF8, when in
    fact it was LATIN1.

    Regards

    (end of broadcast)
    TIP 4: Have you searched our list archives?

    http://archives.postgresql.org

Re: Locale/encoding problem/question


max 4000 letters.
Your nickname that display:
In order to stop the spam: 7 + 6 =
QUESTION ON "Databases"

EMSDN.COM