Wednesday, December 22, 2004

Character Encoding in Java

Usually I wouldn't comment on work, but this is so generic and so weird that I had to. I had a problem where a Java byte array was converted to a String and then back to a byte array, and changed in the process. I thought this was strange, and it was causing a problem in my application, so I investigated. For those kids with computers at home, you can follow along with this code sample.

import java.io.UnsupportedEncodingException;

public class TestBuffers {
    public static void main(String[] args) {
        // Fill the array with every possible byte value, 0x00 through 0xFF.
        byte[] bytes = new byte[256];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) i;
        }

        try {
            // Decode the bytes as IBM-037 (EBCDIC), then encode the String back.
            String s1 = new String(bytes, "IBM-037");
            byte[] b2 = s1.getBytes("IBM-037");

            for (int i = 0; i < b2.length; i++) {
                // Mask with 0xFF so values above 0x7F print as two hex digits
                // instead of a sign-extended int such as ffffff80.
                String temp = i
                        + " " + Integer.toHexString(bytes[i] & 0xFF)
                        + " " + Integer.toHexString(b2[i] & 0xFF);
                System.out.print(temp);

                if (bytes[i] != b2[i]) {
                    System.out.print(" different");
                }
                System.out.println();
            }
        } catch (UnsupportedEncodingException e) {
            // Thrown if this JVM does not provide the IBM-037 charset.
            e.printStackTrace();
        }
    }
}


So if you run this, what do you see? Well, it is pretty normal except for IBM-037 0x25, which magically becomes IBM-037 0x15.

Essentially, the first conversion turns an EBCDIC LF (0x25) into a Unicode LF (U+000A), then the second conversion turns the Unicode LF into an EBCDIC NL (0x15). Weird, eh?
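
To watch the two conversions separately, here is a minimal sketch that decodes one byte to its Unicode code point and then encodes that code point back. The class name NewlineRoundTrip and the helpers decode and encode are made up for illustration. It assumes Java 6 or later (for the Charset overloads of the String constructor and getBytes) and a JVM that ships the IBM037 charset.

```java
import java.nio.charset.Charset;

public class NewlineRoundTrip {
    // Hypothetical helper: decode a single byte and return the code point it becomes.
    static int decode(Charset cs, int b) {
        return new String(new byte[] { (byte) b }, cs).codePointAt(0);
    }

    // Hypothetical helper: encode a single code point and return its first byte, unsigned.
    static int encode(Charset cs, int codePoint) {
        byte[] out = new String(Character.toChars(codePoint)).getBytes(cs);
        return out[0] & 0xFF;
    }

    public static void main(String[] args) {
        // Assumption: IBM037 is an optional charset, so check before using it.
        if (!Charset.isSupported("IBM037")) {
            System.out.println("IBM037 is not available in this runtime");
            return;
        }
        Charset ebcdic = Charset.forName("IBM037");

        // 0x15 is the EBCDIC NL and 0x25 is the EBCDIC LF.
        for (int b : new int[] { 0x15, 0x25 }) {
            int codePoint = decode(ebcdic, b);
            int back = encode(ebcdic, codePoint);
            System.out.printf("0x%02X -> U+%04X -> 0x%02X%n", b, codePoint, back);
        }
    }
}
```

If the explanation above holds on your JVM, the 0x25 line should pass through U+000A and come back as 0x15. Using a Charset object instead of a charset name also avoids the checked UnsupportedEncodingException.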