Apache Commons – Exitcode

Recently one of our clients reported a bug involving Czech national characters in file names inside a ZIP archive: they simply didn’t display correctly. After some analysis I discovered something I would never have believed possible nowadays. Windows 7 has no native support for UTF-8 encoded file names in ZIP archives! Come on, Microsoft, it’s 2011, and support for UTF-8 file names has been around for at least 5 years (officially introduced in version 6.3.0 of the ZIP specification).

The whole problem with file name encoding lies in the fact that the ZIP format by default uses the IBM PC character set, also known as IBM Code Page 437, IBM437 or CP437. Unfortunately, this code page limits file names to the characters of the original MS-DOS range, so it’s quite restrictive. Therefore, if you want to use most national characters in file names within a ZIP archive, you basically have two options:

Use UTF-8 and set the language encoding flag to tell the processing tool that the file names are encoded in UTF-8
Use whatever encoding is native to your specific target platform
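To make the limitation concrete, here is a minimal sketch that asks the JDK whether a file name can be represented in each candidate charset. The class and method names are made up for illustration, and it assumes your JDK ships the IBM437 and IBM852 charsets (standard JDK builds generally do). The name is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;

class ZipNameEncodingCheck {

    // Returns true if every character of the name can be represented in the given charset.
    static boolean canEncode(String name, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        // "1_" followed by l-caron, s-caron, c-caron and ".txt"
        String name = "1_\u013E\u0161\u010D.txt";
        for (String cs : new String[] {"IBM437", "IBM852", "UTF-8"}) {
            System.out.println(cs + " can encode the name: " + canEncode(name, cs));
        }
    }
}
```

A name that fails this check for CP437 cannot be stored with the ZIP default encoding, which is what forces the choice between the two options.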

First Option
With the first option you achieve the best interoperability across operating systems. The downside of this approach is that Windows users have to use a third-party application to handle ZIP archives, because the built-in compressed folders feature doesn’t display UTF-8 characters correctly. All the well-known ZIP tools I tried on Windows (WinZip, WinRAR, 7-Zip) displayed UTF-8 encoded file names properly. 7-Zip on Unix-based systems also displayed such file names correctly. Here is a Java code snippet that creates a ZIP archive containing two empty files with Slovak national characters in each file name.

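import java.io.FileOutputStream;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;

ZipArchiveOutputStream zipOut = new ZipArchiveOutputStream(new FileOutputStream("/tmp/utf8.zip"));
// Encode entry names as UTF-8 and mark them with the language encoding flag
zipOut.setEncoding("UTF-8");
zipOut.setUseLanguageEncodingFlag(true);
// Two empty entries whose names contain national characters
zipOut.putArchiveEntry(new ZipArchiveEntry("1_ľščťžýáíé.txt"));
zipOut.closeArchiveEntry();
zipOut.putArchiveEntry(new ZipArchiveEntry("2_úäôňďúě.txt"));
zipOut.closeArchiveEntry();
zipOut.flush();
// close() finishes the archive and closes the underlying file stream
zipOut.close();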

This example uses the Apache Commons Compress library, which lets you specify the encoding and set the language encoding flag. If you are lucky enough to be on Java 7 already (released last month), you can use the classes from the java.util.zip package, which gained new constructors that accept a charset. In addition, these classes use UTF-8 by default and read and write the language encoding flag. On Java 1.6 and earlier, just stay with the commons-compress library.
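For the Java 7 route, a rough sketch could look like the following. It builds the archive in memory rather than on disk so it can be run anywhere; the class and helper names are my own, and the flag check simply reads the general purpose bit flag at offset 6 of the first local file header, where bit 11 (0x0800) is the language encoding flag.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class JdkZipUtf8Example {

    // Builds an in-memory ZIP with one empty entry, encoding its name in the given charset.
    static byte[] zipWithEntry(String entryName, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zipOut = new ZipOutputStream(buffer, charset)) {
            zipOut.putNextEntry(new ZipEntry(entryName));
            zipOut.closeEntry();
        }
        return buffer.toByteArray();
    }

    // Reads the name of the first entry, decoding it with the given charset.
    static String firstEntryName(byte[] zipBytes, Charset charset) throws IOException {
        try (ZipInputStream zipIn = new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            return zipIn.getNextEntry().getName();
        }
    }

    // Checks bit 11 of the little-endian general purpose bit flag in the first local file header.
    static boolean hasLanguageEncodingFlag(byte[] zipBytes) {
        int flags = (zipBytes[6] & 0xFF) | ((zipBytes[7] & 0xFF) << 8);
        return (flags & 0x0800) != 0;
    }

    public static void main(String[] args) throws IOException {
        // "1_" followed by l-caron, s-caron, c-caron and ".txt"
        String name = "1_\u013E\u0161\u010D.txt";
        byte[] zip = zipWithEntry(name, StandardCharsets.UTF_8);
        System.out.println("name round-trips: " + name.equals(firstEntryName(zip, StandardCharsets.UTF_8)));
        System.out.println("language encoding flag set: " + hasLanguageEncodingFlag(zip));
    }
}
```

To write a real file, wrap a FileOutputStream instead of the ByteArrayOutputStream; the rest stays the same.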

Second Option
The second option is the way to go when you target only one operating system with a specific code page (that was our customer’s case and the approach I eventually used). Suppose all your users run Windows with code page 852 (CP852 or IBM852, the standard code page used in Central European countries). In this case you can generate the ZIP archive in almost the same way as above, but this time set the encoding to CP852 and omit the encoding flag.

ZipArchiveOutputStream zipOut = new ZipArchiveOutputStream(new FileOutputStream("/tmp/cp852.zip"));
// Encode entry names in CP852; the language encoding flag is left unset
zipOut.setEncoding("CP852");
zipOut.putArchiveEntry(new ZipArchiveEntry("1_ľščťžýáíé.txt"));
zipOut.closeArchiveEntry();
zipOut.putArchiveEntry(new ZipArchiveEntry("2_úäôňďúě.txt"));
zipOut.closeArchiveEntry();
zipOut.flush();
zipOut.close();

Every tool on a platform whose default code page is 852 will display the national characters in this ZIP file correctly, including the Windows compressed folders tool. To find out which code page Windows currently uses, navigate to the following node in the registry:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage

and look for the value named OEMCP.
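To see why the archive’s encoding has to match the reader’s code page, here is a simplified sketch that skips the ZIP container entirely: it encodes a file name the way a writer using one code page would, then decodes the raw bytes the way a reader assuming another code page would. The class and method names are hypothetical, and it assumes the JDK provides the IBM852 and IBM437 charsets.

```java
import java.nio.charset.Charset;

class CodePageMismatchDemo {

    // Encodes the name with the writer's charset, then decodes the raw bytes
    // with the charset the reader assumes.
    static String reinterpret(String name, String writerCharset, String readerCharset) {
        byte[] raw = name.getBytes(Charset.forName(writerCharset));
        return new String(raw, Charset.forName(readerCharset));
    }

    public static void main(String[] args) {
        // "1_" followed by l-caron, s-caron, c-caron and ".txt"
        String name = "1_\u013E\u0161\u010D.txt";
        System.out.println("CP852 read as CP852 intact: "
                + name.equals(reinterpret(name, "IBM852", "IBM852")));
        System.out.println("CP852 read as CP437 intact: "
                + name.equals(reinterpret(name, "IBM852", "IBM437")));
    }
}
```

The ASCII part of the name survives either way; only the national characters change, which matches the kind of garbling our client saw.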

And remember, there is no such thing as a universal, always-working approach to encoding file names in ZIP archives.