Examining invalid UTF-8


Recently we ran into this error at work:

Invalid UTF-8 start byte 0xb4 (at char #664428955, byte #664427999)

No one was sure of the best way to debug it, so I set out to hack something together.

Here’s a Java program that reads a range of bytes from a file that is supposed to be UTF-8 and tries to decode them as a UTF-8 string. If decoding succeeds, the string is printed; if an exception is thrown, the stack trace is printed instead. Usage:

$ java -cp "." ByteToUtfReader file.rdf 679609087 1000

reads byte 679609087 and the 1000 bytes preceding it from file.rdf.
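
To illustrate the idea, here is a minimal sketch of reading such a byte window and decoding it strictly. It is not the program from this post: the class name Utf8WindowSketch and the helpers readWindow and firstInvalidOffset are made up for the example, and it assumes the byte position is a 0-based file offset. One thing worth knowing: new String(bytes, StandardCharsets.UTF_8) does not throw on malformed input, it silently substitutes U+FFFD, so the sketch uses a CharsetDecoder configured to report errors.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the original ByteToUtfReader.
class Utf8WindowSketch {

    // Reads the byte at endByte plus the `length` bytes before it
    // (assumption: endByte is a 0-based offset into the file).
    static byte[] readWindow(String path, long endByte, int length) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long start = Math.max(0, endByte - length);
            long end = Math.min(file.length(), endByte + 1); // exclusive
            byte[] window = new byte[(int) Math.max(0, end - start)];
            file.seek(start);
            file.readFully(window);
            return window;
        }
    }

    // Returns the window offset of the first malformed sequence, or -1 if
    // the whole window decodes as UTF-8.
    static int firstInvalidOffset(byte[] window) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(window);
        // UTF-8 never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(window.length + 1);
        CoderResult result = decoder.decode(in, out, true);
        // On error, the input position is left at the start of the bad bytes.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] window = readWindow(args[0], Long.parseLong(args[1]), Integer.parseInt(args[2]));
        int bad = firstInvalidOffset(window);
        if (bad < 0) {
            System.out.println(new String(window, StandardCharsets.UTF_8));
        } else {
            System.out.printf("Malformed UTF-8 at window offset %d (byte 0x%02x)%n",
                    bad, window[bad] & 0xff);
        }
    }
}
```

Because the window starts and ends at arbitrary byte offsets, it can cut a multi-byte character in half. An error reported within the first or last few bytes of the window may therefore be an artifact of the cut rather than a real problem in the file.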

Here’s the code in case it helps someone. Warning: quick and dirty!