Wednesday, September 17, 2008

Codepage and Compression/DEFLATE

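Data compression is a great way to reduce file or content size significantly, but the PC's code page can affect the compression/decompression (DEFLATE/INFLATE) round trip too. If you compress data on a PC using code page A, you might not be able to get the same text back by decompressing it on a PC using code page B. Some compression algorithms involve Huffman coding (here), which works on raw bytes, so the byte encoding is significant: the same string turns into different bytes under different code pages, and the decompressed bytes are read back using whatever code page is active at that point.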
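To make the point concrete, here is a minimal sketch in Python (not the zLib sample from this post). It assumes Python's built-in `zlib` module and its `cp936` and `cp1252` codecs behave like the Windows code pages; the `round_trip` helper and the sample string are hypothetical names made up for illustration.

```python
import zlib

def round_trip(text, write_codepage, read_codepage):
    """Hypothetical helper: encode, DEFLATE, INFLATE, then decode."""
    raw = text.encode(write_codepage)      # text -> bytes under code page A
    packed = zlib.compress(raw)            # DEFLATE
    unpacked = zlib.decompress(packed)     # INFLATE
    # The compressor itself is byte-exact: the bytes always survive.
    assert unpacked == raw
    # Bytes -> text under code page B; undecodable bytes become U+FFFD.
    return unpacked.decode(read_codepage, errors="replace")

sample = "中文 text"  # made-up sample string

same = round_trip(sample, "cp936", "cp936")    # same code page on both sides
mixed = round_trip(sample, "cp936", "cp1252")  # written under 936, read under 1252

print(same == sample)
print(mixed == sample)
```

In this sketch the bytes always come back intact; it is the final bytes-to-text step that depends on the code page.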

Here's the sample code executed using zLib:

0) On the same PC, the DEFLATE/INFLATE round trip works fine.

1) DEFLATE on a PC with code page 1252 (XP: Control Panel->Regional and Language Options). Compare the output with what you get from (2) below.

(From wiki: Windows-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages.)

2) DEFLATE on a PC with code page 936 (XP: Control Panel->Regional and Language Options). Compare the output with what you get from (1) above.

(From wiki: A character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 95 characters and 33 control codes in this range.
A byte with the high bit set indicates that it is the first of 2 bytes. Loosely speaking, the first byte is in the range 81–FE (that is, never 80 or FF), and the second byte is 40–FE for some areas and 80–FE for others.)
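Steps (1) and (2) can be roughly imitated on a single machine. The sketch below is an approximation, not the original VC++ sample: it assumes Python's `cp1252` and `cp936` codecs stand in for the two PCs, and the sample strings are made up.

```python
import zlib

# Made-up sample text; the degree and plus-minus signs exist in both code pages.
sample = "Temp: 25°C ± 1°C"

raw_1252 = sample.encode("cp1252")  # one byte per character
raw_936 = sample.encode("cp936")    # non-ASCII characters take two bytes

deflated_1252 = zlib.compress(raw_1252)
deflated_936 = zlib.compress(raw_936)

print(len(raw_1252), len(raw_936))
print(deflated_1252 == deflated_936)

# Pure ASCII text (bytes 00-7F) encodes identically under both code pages,
# so the DEFLATE output matches as well.
ascii_only = "plain ASCII"
ascii_match = (zlib.compress(ascii_only.encode("cp1252"))
               == zlib.compress(ascii_only.encode("cp936")))
print(ascii_match)
```

The compressed outputs only diverge once the text contains characters outside the 00–7F range.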


You can check your PC's code page using the Win32 API function GetACP.
You can also get the VC++ zLib sample here, and the VB6 GetACP sample here.
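For a quick check without VC++ or VB6, GetACP can also be called from Python through `ctypes`. This is only a sketch: `active_code_page` is a hypothetical helper name, and it returns `None` on anything other than Windows, where GetACP does not exist.

```python
import ctypes
import sys

def active_code_page():
    """Return the ANSI code page number (e.g. 1252 or 936), or None off Windows."""
    if sys.platform == "win32":
        # GetACP takes no arguments and returns the code page identifier.
        return ctypes.windll.kernel32.GetACP()
    return None

code_page = active_code_page()
print(code_page)
```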
