Examining invalid UTF-8

Recently we had this problem at work:
java.io.CharConversionException: Invalid UTF-8 start byte 0xb4 (at char #664428955, byte #664427999)
No one was exactly sure about the best way to debug this, so I set out to hack something together.
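
As far as I can tell, the message means that the parser expected the first byte of a character but found 0xb4, which in UTF-8 can only appear in the middle of a multi-byte sequence. Here is a minimal sketch that classifies a byte by its leading bits. The class name Utf8ByteKind and the labels it returns are made up for illustration.

```java
class Utf8ByteKind
{
    /** Classifies a single byte by the role it can play in UTF-8. */
    static String kind(int b)
    {
        b &= 0xFF; // treat the value as an unsigned byte
        if (b < 0x80) return "ASCII";                    // 0xxxxxxx
        if (b < 0xC0) return "continuation byte";        // 10xxxxxx
        if (b < 0xC2) return "never valid";              // 0xC0, 0xC1: overlong encodings
        if (b < 0xE0) return "start of 2-byte sequence"; // 110xxxxx
        if (b < 0xF0) return "start of 3-byte sequence"; // 1110xxxx
        if (b < 0xF5) return "start of 4-byte sequence"; // 11110xxx, up to U+10FFFF
        return "never valid";                            // 0xF5..0xFF
    }

    public static void main(String[] args)
    {
        // 0xb4 is 10110100 in binary, so it falls in the 10xxxxxx range
        System.out.println("0xb4: " + kind(0xB4));
    }
}
```

So either the byte before 0xb4 is wrong or missing, or the data at that position is not UTF-8 at all.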

Here’s a Java program that reads a range of bytes from a (supposedly) UTF-8 file and tries to decode them as a UTF-8 string. If decoding succeeds, the string is printed; otherwise the exception and its stack trace are printed. Usage:

$ java -cp "." ByteToUtfReader file.rdf 679609087 1000

This reads byte 679609087 of file.rdf (counting from 1) together with the 1000 bytes preceding it.

Here’s the code in case it helps someone. Warning: quick and dirty!

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class ByteToUtfReader
{

    public static void main(String[] args)
    {
        String file;
        long byteNo;
        int toRead;

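        try
        {
            file = args[0];
            // parseLong/parseInt return primitives directly, no boxing needed
            byteNo = Long.parseLong(args[1]);
            toRead = Integer.parseInt(args[2]);
        } catch (Exception e)
        {
            System.out.println("Invalid arguments. Usage: "
                    + "java ByteToUtfReader <file> <byteNo> <bytesBefore>");
            return;
        }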

        try
        {
            String result = utfStringAtByte(file, byteNo, toRead);
            System.out.println(result);
        } catch (Exception e)
        {
            System.out.println("Problem reading or decoding: " + e);
            e.printStackTrace();
        }
    }

    static String utfStringAtByte(String file, long byteNo, int toRead)
        throws Exception
    {
        return bytesToUtfString(readBytes(file, byteNo, toRead));
    }

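    static byte[] readBytes(String file, long byteNo, int toRead)
        throws Exception
    {
        byte[] bytes = new byte[toRead + 1];
        int off = 0;

        // try-with-resources closes the stream even if an exception is thrown
        try (FileInputStream f = new FileInputStream(file))
        {
            long toSkip = byteNo - toRead - 1;
            long skipped = f.skip(toSkip);
            if (skipped != toSkip)
                System.err.println("Warning: skipped only " + skipped
                        + " bytes instead of the requested " + toSkip);

            // read() may return fewer bytes than requested, so keep reading
            // until the buffer is full or the end of the file is reached
            while (off < bytes.length)
            {
                int n = f.read(bytes, off, bytes.length - off);
                if (n < 0)
                    break;
                off += n;
            }
        }

        if (off < bytes.length)
            System.err.println("Warning: read only " + off
                    + " bytes instead of the requested " + bytes.length);
        // Return only the bytes actually read, without zero padding
        return java.util.Arrays.copyOf(bytes, off);
    }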

    static String bytesToUtfString(byte[] bytes)
        throws Exception
    {
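        // The behavior of this constructor is unspecified if bytes contains
        // invalid sequences, so it is of no use for detecting them:
        // return new String(bytes, "UTF-8");

        // http://www.exampledepot.com/egs/java.nio.charset/ConvertChar.html
        // A new decoder reports malformed input by default, so decode()
        // throws a CharacterCodingException on the first invalid sequence.
        Charset utf = Charset.forName("UTF-8");
        CharsetDecoder dec = utf.newDecoder();
        return dec.decode(ByteBuffer.wrap(bytes)).toString();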
    }
}

If you’re unlucky, you might have to “triangulate” a bit to find the faulty byte(s) in the file. In our case, the exception was propagated from the Woodstox XML processor. It reported the number of the offending byte, but that number turned out not to be exactly right, probably because of internal buffering. So I simply played with the arguments to my program, decreasing the number of bytes to read and adjusting the offset into the file, and checked each time whether it produced a valid string. If you still get the exception with 0 additional bytes to read, you have found the exact byte that’s invalid (or a byte from the middle of a multi-byte character, so it’s worth checking its neighbours too).
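
The manual search can probably be shortened by asking the decoder where it stopped. The following is only a sketch, not what I used at the time: the class name FirstInvalidUtf8 and the method firstInvalidOffset are made up, and the byte array is assumed to come from something like readBytes above. It relies on the documented behaviour of CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean), which leaves the input buffer positioned at the start of the malformed bytes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class FirstInvalidUtf8
{
    /**
     * Returns the index (within bytes) of the first byte sequence that is
     * not valid UTF-8, or -1 if the whole array decodes cleanly.
     */
    static int firstInvalidOffset(byte[] bytes)
    {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never yields more chars than input bytes, so this is enough
        CharBuffer out = CharBuffer.allocate(bytes.length);

        // endOfInput = true: a truncated sequence at the end counts as an error
        CoderResult result = dec.decode(in, out, true);

        // On error, the input position marks the start of the bad bytes
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args)
    {
        // Hypothetical sample: two ASCII bytes, a stray 0xb4, one more ASCII byte
        byte[] sample = { 'a', 'b', (byte) 0xB4, 'c' };
        System.out.println("first invalid offset: " + firstInvalidOffset(sample));
    }
}
```

The offset is relative to the start of the window, so the skip count has to be added to get a position in the file. If the window happens to start in the middle of a multi-byte character, the first few bytes will be reported as invalid even though the file is fine at that point, so an offset of 0 deserves a second look with a slightly earlier start.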
