Sunday, 21 September 2014

Fun with codepages, Unicode and encoding

...or how to survive in the encoding jungle, with some helpful tools.


I have stumbled into encoding issues between legacy applications and modern "Unicode" capable ones several times.

In the following I will try to explain the basics in a simple way - and also supply the source and a binary (in case anyone out there needs the utility but doesn't have a compiler) for a small utility that takes a file containing Unicode characters and encodes it into a given code page. I was a bit surprised that I wasn't able to find such a utility at the time - so I wrote one to solve the problem.


The scenario was that one system needed to provide text files that would be transferred to a 3rd party service provider that, at that time, wasn't able to handle files containing Unicode characters. And the files were generated in all possible languages.

First, a couple of basic things:

Unicode is a character set - not an encoding.

The list of Unicode characters covers everything from ASCII to Egyptian hieroglyphs. It has room for over a million characters (1,114,112 code points), each with a unique number referred to as a code point - the letter D is U+0044, i.e. 0x44.
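To make the idea of a code point concrete, here is a minimal sketch in Python (my choice for illustration, it is not the language of the utility). The helper name `describe` is hypothetical; `ord` and `unicodedata.name` are standard library calls.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# A Latin letter, a Greek letter and an Egyptian hieroglyph.
# Escapes are used so the example does not depend on the editor's encoding.
for ch in ("D", "\u03a9", "\U00013000"):
    print(describe(ch))
```

Note that nothing here says how the character is stored - the code point is just a number.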

UTF-8 is an encoding - not a character set.

The encoding defines how a list of those numbers is represented as bytes - so our 0x44 ends up as the single byte 01000100 when encoded with UTF-8. Characters outside the ASCII range take two or more bytes in UTF-8.
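A small sketch of the difference between the number and its bytes, assuming Python 3.8 or later (for `bytes.hex` with a separator). The helper `hex_bytes` is a hypothetical name used only for this illustration.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as space-separated hex."""
    return text.encode(encoding).hex(" ")

# The same two code points, stored with three different encodings.
samples = {"D": "D", "Greek capital delta": "\u0394"}
for label, ch in samples.items():
    for encoding in ("utf-8", "utf-16-le", "cp1253"):
        print(f"{label:20} {encoding:10} {hex_bytes(ch, encoding)}")
```

The letter D is a single byte 44 in UTF-8 and in code page 1253, while the Greek delta (U+0394) becomes two bytes in UTF-8 and UTF-16 but only one byte in code page 1253.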

BOM (Byte order mark)

This might add some value, but it can also cause you some pain - if you are not aware of its existence, or of some applications' habit of adding it at will.

A tool like Notepad++ will let you play around with the BOM - and check whether a program like Notepad has "added" some value to your "normal" text file, like the bytes 0xEF, 0xBB and 0xBF at the start of the file. The recommendation is to keep the BOM out of a UTF-8 encoded file when it is not needed.
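If you would rather check for the UTF-8 BOM in code than in an editor, a minimal sketch in Python could look like this. The function name `strip_utf8_bom` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + b"hello"
print(strip_utf8_bom(with_bom))        # the BOM bytes are gone
print(strip_utf8_bom(b"hello"))        # data without a BOM is left untouched

# Alternatively, the "utf-8-sig" codec drops the BOM while decoding.
print(with_bom.decode("utf-8-sig"))
```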

So as long as we stick to the "Western" (ISO 8859-1) 8-bit range we are good - since the Unicode consortium decided to make the first 256 code points match those. But if we need Greek or Romanian in a non-Unicode form - we need to do some encoding from one character set to another set of characters enumerated by a code page.
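A quick way to see both halves of that statement, sketched in Python: the ISO 8859-1 byte values equal the Unicode code points, while Greek and Romanian letters need a different code page. The helper `fits_in` is a hypothetical name, and the sample characters are my own picks.

```python
# Every ISO 8859-1 byte value maps to the Unicode code point with the same number.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = all(ord(ch) == i for i, ch in enumerate(latin1_text))
print(latin1_matches)

def fits_in(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print(fits_in("sm\u00f8rrebr\u00f8d", "latin-1"))  # Danish: fits in ISO 8859-1
print(fits_in("\u0391", "latin-1"))                # Greek alpha: does not fit
print(fits_in("\u0391", "cp1253"))                 # ...but fits in code page 1253
print(fits_in("\u0103", "cp1250"))                 # Romanian a-breve: code page 1250
```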

So as an example I made a text file containing the Greek alphabet in Unicode characters encoded with UTF-16, so every character takes up 2 bytes - and the BOM is U+FEFF (the bytes 0xFF 0xFE at the start of a little-endian file, which is what Windows normally writes). I should mention that I do not speak, write or understand Greek, but the alphabet is a good example.
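As a rough stand-in for such a file, this Python sketch builds the first three Greek lowercase letters as little-endian UTF-16 with a BOM and dumps the bytes. The function name `utf16le_with_bom` is hypothetical, and little-endian byte order is an assumption based on what Windows normally writes.

```python
import codecs

def utf16le_with_bom(text: str) -> bytes:
    """Encode text as little-endian UTF-16, prefixed with the BOM (FF FE)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# alpha, beta, gamma (U+03B1, U+03B2, U+03B3)
data = utf16le_with_bom("\u03b1\u03b2\u03b3")
print(data.hex(" "))  # ff fe b1 03 b2 03 b3 03
```

The first two bytes are the BOM, and each letter then takes two bytes with the low byte first.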

The text in a Unicode capable editor:

And shown in a hex editor to illustrate what the encoding did to the characters, and how they are stored on disk.

So now we want to convert this text file to something a non-Unicode capable system can use.

I am using an English Windows 8.1 Pro with Danish language settings so my LCID (locale ID) is 1030, and in the command prompt, the command CHCP will show code page 850 - and Windows will use the ANSI code page 1252.

But we want this file to end up as code page 1253 - so that the receiving system is able to read the data correctly. No Unicode.

So if I run my utility at the command prompt like:

UNI2CP greek-unicode.txt 1253 greek-1253.txt
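For readers who just want to see the idea, here is a minimal sketch of the same kind of conversion in Python. It is not the source of UNI2CP; the function name `convert_to_codepage`, its parameters and the choice to replace unmappable characters with "?" are all assumptions made for this illustration.

```python
from pathlib import Path

def convert_to_codepage(src, dst, codepage, src_encoding="utf-16"):
    """Read src as Unicode text and write it to dst in a Windows code page.

    The "utf-16" codec uses the BOM to pick the byte order and drops it.
    Characters missing from the target code page are written as "?".
    """
    # Work on raw bytes so line endings are passed through unchanged.
    text = Path(src).read_bytes().decode(src_encoding)
    Path(dst).write_bytes(text.encode(f"cp{codepage}", errors="replace"))
```

Calling `convert_to_codepage("greek-unicode.txt", "greek-1253.txt", 1253)` would then mirror the command line above.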

The output file now looks a bit different in the hex editor:

Normally you would probably be on a system running the OEM/ANSI code page you need for the output. In my case you will need an editor like Notepad2 or Notepad++, where you can set the encoding used to read the file to something different from your current setup.

Shown with my default encoding (1252):

..and with the encoding we convert to (1253):
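The same effect can be reproduced without an editor. This Python sketch decodes two sample bytes with both code pages and prints the Unicode names of what comes out; the helper `decode_names` and the sample bytes are my own choices for illustration.

```python
import unicodedata

def decode_names(raw: bytes, codepage: str) -> list:
    """Decode raw bytes with a code page and return the Unicode character names."""
    return [unicodedata.name(ch) for ch in raw.decode(codepage)]

raw = bytes([0xC1, 0xC4])  # two bytes as they could appear in a 1253 file
for codepage in ("cp1252", "cp1253"):
    print(codepage, decode_names(raw, codepage))
```

The bytes on disk are identical in both cases; read as 1252 they come out as accented Latin capitals, read as 1253 they are the Greek capitals alpha and delta.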

I haven't spent time on the source code - since it is just loading the file with one "encoding" and saving it with another - but feel free to have a look. I would rather spend time on another helpful tool:

SBAppLocale

Since Microsoft does not support its AppLocale tool on anything other than XP and Server 2003 (it wasn't really that helpful anyway - trying to fix a badly designed OS) - you might need something similar that enables you to run older non-Unicode capable programs in environments like Citrix, without the need for multiple server instances - one for every LCID needed. This works even though Windows has the "restriction" that you select one locale (LCID) per system, and all your programs should conform to that or be Unicode capable.

SBAppLocale does what Microsoft AppLocale should have done - it is free and still works. Just call the program by:

SBAppLocale [-freeconsole] <locale_num> <command> [<command arg1> <command arg2> ..]

You need to have the required locale installed - but I guess at least Windows 8 Pro has them all out of the box.

So a non-Unicode application that needs a different LCID - like an old Greek program - can be launched this way:

SBAppLocale 1032 myGreekProgram.exe

And there you go - ye olde Turbo Pascal for Windows 1.5 with OWL - maybe you should retire that one :-D

And since programs can be launched in sessions with different LCIDs, consolidating older, infrequently used programs on a Citrix-like setup might make sense - launching each program with the user's preferred LCID. Keep bandwidth issues in mind, though, since these programs were probably not written with low bandwidth in mind.

This is no excuse for not doing things the correct Unicode way. Someone might also think that Citrix is a strategic tool - it is more of a fix for software that does not fit the infrastructure.

Links:

Some free Hex-editors that I am pretty sure are written in Delphi: XVI32 and HxD
UNI2CP.EXE: Source code and compiled executable.
The 2 text files: greek-unicode.txt and greek-1253.txt