Solving character encoding problems for dbf files

From OpenJUMP Wiki
Revision as of 01:23, 12 October 2009 by Mentaer

Problem: on Linux, one cannot load and save shapefiles that contain language-specific characters if the locale is UTF-8 based.

Discussion from mailing list

A1 - Martin wrote: Java respects the default charset that your system uses. Sadly, UTF-8 does not seem to be supported by the dbf importer. You can handle this explicitly via the LANG variable (e.g. set it in your startup file or enter it on the command line):

LANG=de_DE@euro openjump 
LANG=de_DE.UTF-8 openjump 

These lead to different interpretations of the character encoding in the dbf files. In my case (German) I can save the dbf file as *ISO-8859-15/EURO* and correctly load it with OpenJUMP started as:

LANG=de_DE@euro openjump 

This should fix the encoding for the apps you mention below.
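To see which character encoding a given LANG value actually selects, the locale command can be asked directly. The following is only a sketch: the helper name charmap_of is made up for illustration, and it sets LC_ALL rather than LANG so that other LC_* variables in the environment cannot mask the result. On glibc systems a locale name that is not installed typically falls back to the C locale's charmap, which is a hint that the name was not understood.

```shell
#!/bin/sh
# charmap_of NAME: print the charmap the system associates with locale NAME.
# (hypothetical helper; LC_ALL is used so no other LC_* setting interferes)
charmap_of() {
    LC_ALL="$1" locale charmap 2>/dev/null
}

# Compare a few candidate values before starting OpenJUMP with one of them.
for name in de_DE@euro de_DE.UTF-8 cs_CZ cs_CZ.UTF-8; do
    printf '%s -> %s\n' "$name" "$(charmap_of "$name")"
done
```

The output depends on which locales are enabled on the machine, so no expected result is shown here.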

A2 - Tomas answered: Oh man, it worked! Thank you very much for the suggestion. My current locale is cs_CZ.UTF-8, which somehow is not understood, as you mention. But with LANG=cs_CZ.WIN-1250 openjump-pirol I can open the win-1250 encoded shp/dbf files, although the characters are still not displayed correctly. Because of that, I also tried LANG=cs_CZ.ISO-10646-1 openjump-pirol, with which I can open both win-1250 and utf-8 encoded shp/dbf files, again with the accented characters displayed incorrectly in both, but I _can_ open them. I am also able to save the dataset as a different shapefile; I do not know whether it is saved correctly, but it seems to be. The dbf file has a different size than the original (40051 bytes vs. 52316 bytes!), but the attribute table looks the same when the two are opened side by side in QGIS. For the time being, I'll go with iso-10646-1 as my default, but I'll try other possibilities to see whether I can get the accented characters displayed correctly.

Regards, Tomas B.


Solution for Linux with UTF-8 encoded default locale

This solution description is specific to the Czech language and Ubuntu Feisty Linux. For other languages and Unix/Linux variants it should be similar.

The Solution

  • I enabled cs_CZ locale (iso-8859-2), which is part of my Ubuntu distribution. My default locale is still cs_CZ.UTF-8.
  • I put the following line near the beginning of the OpenJump startup script.
LANG=cs_CZ
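
A slightly more defensive form of that startup-script line might look like the sketch below. It only switches LANG when the locale is actually enabled. The helper name has_locale and the variable wanted are made up for illustration, and clearing LC_ALL is an extra assumption that is not part of the original recipe (POSIX gives LC_ALL precedence over LANG, so a leftover value would defeat the change).

```shell
#!/bin/sh
# has_locale NAME: succeed if NAME appears as a whole line on stdin
# (stdin is expected to be the output of "locale -a"; hypothetical helper)
has_locale() {
    grep -qxF -- "$1"
}

wanted=cs_CZ   # the iso-8859-2 locale enabled in step 1

if locale -a 2>/dev/null | has_locale "$wanted"; then
    LANG=$wanted
    export LANG
    unset LC_ALL   # assumption: nothing else relies on LC_ALL being set
else
    echo "locale $wanted is not enabled; keeping LANG=${LANG:-unset}" >&2
fi

# ... the rest of the OpenJump startup script follows here ...
```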

Now if I open an iso-8859-2 encoded shp/dbf file, the characters are displayed correctly. I can also open win-1250 and utf-8 encoded files. The former are displayed almost correctly, since win-1250 and iso-8859-2 differ in the codes of only four characters of the Czech alphabet (AFAIK). I can save the layer as a different shapefile, but if I then open that file in OpenJump, the accented characters are damaged. It is possible to save the layer correctly as .jml (Jump GML) or .fme (FME GML).
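
Why win-1250 files look "almost" right under an iso-8859-2 locale can be checked by comparing the byte a character gets in each encoding. This is a small sketch using iconv and od; the helper name hex_in is made up for illustration, and CP1250 is the name iconv commonly uses for win-1250. The characters are written as UTF-8 octal escapes so the script itself stays plain ASCII.

```shell
#!/bin/sh
# hex_in ENCODING: read UTF-8 text on stdin, print its bytes in ENCODING as hex
# (hypothetical helper)
hex_in() {
    iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
}

# "š" (U+0161) is encoded differently in the two code pages
printf '\305\241' | hex_in ISO-8859-2; echo   # prints b9
printf '\305\241' | hex_in CP1250; echo       # prints 9a

# "á" (U+00E1) has the same code in both
printf '\303\241' | hex_in ISO-8859-2; echo   # prints e1
printf '\303\241' | hex_in CP1250; echo       # prints e1
```

Running the same comparison for other accented letters shows which ones are affected.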

To enable the locales (step 1) I used the following procedure (it enables all supported locales containing the string cs_CZ; the procedure may differ in other Linux distributions, but it probably works for all Debian/Ubuntu based variants):

sudo sh
cat /usr/share/i18n/SUPPORTED | grep "cs_CZ" > /var/lib/locales/supported.d/local
dpkg-reconfigure locales
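
The same selection can be wrapped in a small function so that it can be previewed, or piped through sudo tee instead of opening a root shell. This is only a sketch: the function name select_locales is made up for illustration, and the paths are the Ubuntu ones used above.

```shell
#!/bin/sh
# select_locales PATTERN [SUPPORTED_FILE]
# Print the entries of a SUPPORTED-style file that contain PATTERN.
# (hypothetical helper; the default path is the one used in the procedure above)
select_locales() {
    pattern=$1
    supported=${2:-/usr/share/i18n/SUPPORTED}
    grep -F -- "$pattern" "$supported"
}

# Preview only; nothing is written here.
if [ -r /usr/share/i18n/SUPPORTED ]; then
    select_locales cs_CZ
fi

# To apply it without a root shell (Debian/Ubuntu paths assumed):
#   select_locales cs_CZ | sudo tee /var/lib/locales/supported.d/local >/dev/null
#   sudo dpkg-reconfigure locales
```

The tee form is needed because a plain redirection after sudo would be performed by the unprivileged shell, not by root.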

Notes

  • You can list currently enabled locales with
locale -a
  • You can list in advance the locales which are going to be enabled with
cat /usr/share/i18n/SUPPORTED | grep "cs_CZ"


Regards, Tomas B.

There is a tool called cstocs which includes a conversion utility for dbf files. It is mainly intended for the Czech language, but it also implements a few other important encodings.

It may also be available for other distributions (Ubuntu has it).