chrisinmtown.github.io

Validating Files for the Unicode Character Set

18 July 2011

Recently I’ve been using Apache Tika to extract text from files containing all sorts of content, and write that text to files using UTF-8 encoding. After botching it a few times I finally forced myself to learn a bit about encoding systems and Java’s representation of Unicode code points. I also learned that using the word “character” all too often leads to confusion!

My colleague D. B. tells me that Java’s support for international character sets was written about the time the Unicode committee finalized a standard that allowed a character to be represented in 16 bits. Then, after Sun’s people cast that 16-bit assumption into concrete deep within java.lang.Character and other classes, the Unicode committee had a change of heart, and now code points run up to U+10FFFF, so it takes a 32-bit int to represent a character properly. This information is all well documented at the Sun (oops, I mean Oracle) Java API web site, but it took several reads before it started to sink in for me.

What I learned, and what really bothers me still, is that a Java String may contain parts (avoiding the “C” word here) that cannot be represented using a single Java char value! In other words, the String.charAt(int) method may give back only half of a surrogate pair, which must be handled with extreme care. So iterating over the little parts that make up a String is not as simple as I once thought. And I have not yet accustomed myself to using the String.codePointAt(int) method.
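To make the charAt versus codePointAt difference concrete, here is a minimal sketch. The class name CodePointDemo and the sample string are made up for illustration; the String and Character methods are the standard ones.

```java
class CodePointDemo {

    /** Counts code points by stepping over surrogate pairs as one unit. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length();) {
            int cp = s.codePointAt(i);
            // charCount is 2 for a supplementary code point, otherwise 1.
            i += Character.charCount(cp);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies outside the 16-bit range, so Java
        // stores it as a surrogate pair: two char values.
        String s = "a\uD834\uDD1Eb";
        System.out.println(s.length());                             // 4
        System.out.println(countCodePoints(s));                     // 3
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```

The loop advances by Character.charCount(cp) rather than by one, which is the part that is easy to forget when converting old charAt loops.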

The task I had to accomplish was to check whether files contain valid UTF-8 byte sequences. Some tools were giving garbage results and I needed to figure out why. The basic java.io.InputStreamReader will happily translate byte streams to character streams, but by default it silently substitutes a replacement character for bad input instead of reporting encoding/decoding errors (also see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4646959). I found that the java.nio.charset package offers some help that does not require loops to iterate over the parts in a String, so I went with that.
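The difference between lenient and strict decoding can be seen on a plain byte array, without any files. This is only a sketch: the class and method names are hypothetical, and it assumes Java 7 or later for java.nio.charset.StandardCharsets (on older JVMs, Charset.forName("UTF-8") does the same job).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {

    /** Strict check: the decoder throws instead of substituting. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException ex) {
            return false;
        }
    }

    /** Lenient decoding: bad bytes silently become U+FFFD. */
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] good = "d\u00f8lt".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "d\u00f8lt".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
        // The lenient path raises no error; the result contains U+FFFD.
        System.out.println(lenientDecode(bad).indexOf('\uFFFD') >= 0); // true
    }
}
```

The two calls to onMalformedInput and onUnmappableCharacter are what turn silent substitution into an exception.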

Below you will find the source code for an example Java class that checks whether file content can be decoded using the UTF-8 character set. It also supports creation of a file containing a byte that is valid in the ISO-8859-1 encoding system but not in the UTF-8 system, so you can test the class and see the exception that is thrown.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/**
 * Reads bytes from file using a decoder that throws an exception if the bytes
 * cannot be decoded as UTF-8. Can be used to validate a single file or a
 * directory of files with full recursion. Includes a feature to create a file
 * with a non-UTF-8 byte sequence for testing.
 */
public class ValidateEncodingApp {

    private final String charsetName;
    private final CharsetDecoder decoder;
    private int filesPassed = 0, filesFailed = 0;

    /**
     * Obtains a decoder and configures it to blow up on problems.
     * 
     * @param charsetName
     *            Name of character set to use; e.g., UTF-8
     */
    public ValidateEncodingApp(String charsetName) {
        this.charsetName = charsetName;
        Charset charset = Charset.forName(charsetName);
        decoder = charset.newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    /**
     * @return The character set name supplied to the constructor
     */
    public String getCharsetName() {
        return charsetName;
    }

    /**
     * @return The number of files that could not be validated for whatever
     *         reason.
     */
    public int getFilesFailed() {
        return filesFailed;
    }

    /**
     * @return The number of files successfully validated.
     */
    public int getFilesPassed() {
        return filesPassed;
    }

    /**
     * Reads and decodes bytes from the specified file, which will find byte
     * sequences that cannot be decoded using the configured character set.
     * Traps all exceptions and reports to System.err.
     * 
     * @param file
     *            File to read and decode.
     * @return True on success, false on IOException or decoding error
     */
    private boolean validateFile(File file) {
        boolean result = true;
        long total = 0;
        InputStreamReader isr = null;
        char[] buf = new char[65536];
        int chars;
        try {
            isr = new InputStreamReader(new FileInputStream(file), decoder);
            while ((chars = isr.read(buf)) >= 0)
                total += chars;
            ++filesPassed;
        } catch (CharacterCodingException ex) {
            ++filesFailed;
            // Note: total counts decoded chars, not bytes read.
            System.err.println("Decoding failed after " + total
                    + " chars in file " + file.getPath() + ": " + ex.toString());
            result = false;
        } catch (IOException ex) {
            ++filesFailed;
            System.err.println("Failed to read file " + file.getPath() + ": "
                    + ex.toString());
            result = false;
        } finally {
            if (isr != null)
                try {
                    isr.close();
                } catch (IOException ex) {
                    System.err.println("Failed to close file: " + ex.toString());
                }
        }
        return result;
    }

    /**
     * Validates all files in the specified directory. Optionally recurses into
     * subdirectories. Prints message to stdout when it starts and again when it
     * finishes traversing a directory.
     * 
     * @param dir
     *            Directory to traverse.
     * @param isRecurse
     *            If true, will call itself to process subdirectories.
     * @throws Exception
     */
    private void validateDirectory(File dir, boolean isRecurse) throws Exception {
        if (!dir.exists() || !dir.isDirectory())
            throw new IllegalArgumentException("Not found or not a directory: "
                    + dir.getPath());
        System.out.println("Entering directory " + dir.getPath());
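        File[] contents = dir.listFiles();
        // listFiles() returns null if the directory cannot be read.
        if (contents == null)
            throw new IOException("Failed to list directory: " + dir.getPath());
        for (File f : contents) {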
            if (f.isFile())
                validateFile(f);
            else if (f.isDirectory() && isRecurse)
                validateDirectory(f, isRecurse);
        }
        System.out.println("Leaving directory " + dir.getPath());
    }

    /**
     * Prints message(s) and exits
     * 
     * @param msg
     */
    private static void usage(String msg) {
        if (msg != null)
            System.err.println(msg);
        System.err.println("Usage: ValidateEncodingApp -validate (file|dir)");
        System.err.println("       ValidateEncodingApp -create filename");
        // Exit with a nonzero status because usage() is only called on errors.
        System.exit(1);
    }

    public static void main(String[] args) {

        // TODO: extend to accept encoding system name
        // and recurse option as arguments.
        String charsetName = "UTF-8";
        boolean recurse = true;

        if (args.length != 2)
            usage("Unexpected number of arguments");

        String cmd = args[0];
        File file = new File(args[1]);

        if ("-create".equals(cmd)) {
            if (file.exists())
                usage("Will not overwrite existing file: " + file.getPath());

            System.out.println("Creating file with ISO-8859-1 content: "
                    + file.getPath());
            try {
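                // The word "dolt" is written here with an o with a diagonal
                // slash through it (U+00F8). ISO-8859-1 encodes that character
                // as the single byte 0xF8, which never appears in valid UTF-8.
                final byte[] invalid = "d\u00f8lt".getBytes("iso-8859-1");
                FileOutputStream fos = new FileOutputStream(file);
                try {
                    fos.write(invalid);
                } finally {
                    // Close the stream even if the write fails.
                    fos.close();
                }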
            } catch (IOException ex) {
                System.err.println("Failed to write file: " + ex.toString());
            }
        } else if ("-validate".equals(cmd)) {
            if (!file.exists())
                usage("Not found: " + file.getPath());

            try {
                ValidateEncodingApp validator = new ValidateEncodingApp(
                        charsetName);
                if (file.isFile()) {
                    System.out.println("Validating decoding of bytes as "
                            + charsetName + " for file " + file.getPath());
                    validator.validateFile(file);
                } else {
                    System.out.println("Validating decoding of bytes as "
                            + charsetName + " for files in/below directory "
                            + file.getPath());
                    validator.validateDirectory(file, recurse);
                }
                System.out.println("Validation success count is "
                        + validator.getFilesPassed());
                System.out.println("Validation failure count is "
                        + validator.getFilesFailed());

            } catch (Exception ex) {
                ex.printStackTrace();
            }
        } else {
            usage("Unexpected command: " + cmd);
        }

    }

}
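
One limitation of the class above: the total counter in validateFile counts decoded chars rather than bytes, so it gives only an approximate position for the problem. If the exact byte offset matters, a sketch along the following lines may help. It is not part of the original class; the names FirstBadByteDemo and firstInvalidOffset are hypothetical, and it assumes the whole input fits in memory as a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

class FirstBadByteDemo {

    /**
     * Returns the offset of the first byte that cannot be decoded,
     * or -1 if every byte decodes cleanly.
     */
    static int firstInvalidOffset(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput is true because the array holds all of the input.
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError())
                return in.position(); // bad sequence starts here
            if (result.isUnderflow())
                return -1;            // all input consumed
            out.clear();              // overflow: discard chars, continue
        }
    }

    public static void main(String[] args) throws Exception {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] bad = "d\u00f8lt".getBytes("iso-8859-1");
        System.out.println(firstInvalidOffset(bad, utf8)); // 1
    }
}
```

This uses the lower-level decode(ByteBuffer, CharBuffer, boolean) call, which returns a CoderResult instead of throwing and leaves the input buffer positioned at the start of the offending bytes.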

Blog index / feedback to christopher döt lott át gmail dðt com
