Coast of Araska: Character Encoding in Java

Wednesday, December 22, 2004

Character Encoding in Java

Usually I wouldn't comment on work, but this is so generic, and so weird, that I had to. I had a problem where a Java byte array was converted to a String and then back to a byte array, and changed in the process. I thought this was strange, and it was causing a problem with my application, so I investigated. For those kids with computers at home, you can follow along with this code sample.



import java.io.UnsupportedEncodingException;

public class TestBuffers {

	public static void main(String[] args) {
		// Build an array holding every possible byte value, 0x00 to 0xFF.
		byte[] bytes = new byte[256];
		for (int i = 0; i < bytes.length; i++) {
			bytes[i] = (byte) i;
		}

		try {
			// Decode the bytes as IBM-037 (EBCDIC), then encode them straight back.
			String s1 = new String(bytes, "IBM-037");
			byte[] b2 = s1.getBytes("IBM-037");

			for (int i = 0; i < b2.length; i++) {
				// Mask with 0xFF so values above 0x7F print as two hex digits
				// rather than sign-extended (ffffff80 and so on).
				String temp =
					i
					+ "\t"
					+ Integer.toHexString(bytes[i] & 0xFF)
					+ "\t"
					+ Integer.toHexString(b2[i] & 0xFF);
				System.out.print(temp);

				// Flag any byte that did not survive the round trip.
				if (bytes[i] != b2[i]) {
					System.out.print(" different");
				}
				System.out.println("");
			}
		} catch (UnsupportedEncodingException e) {
			// Thrown if this JVM does not provide the IBM-037 charset.
			e.printStackTrace();
		}
	}
}

So if you run this, what do you see? Well, it is pretty normal except for IBM-037 0x25, which magically becomes IBM-037 0x15.

Essentially, the first conversion turns an EBCDIC LF (0x25) into a Unicode LF, and then the second conversion turns the Unicode LF into an EBCDIC NL (0x15). Weird, eh?
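
To watch just that one byte make the trip, here is a minimal sketch that prints the byte going in, the Unicode character in the middle, and the byte coming out. The class name LfRoundTrip and the helper roundTrip are made up for illustration, and it assumes your JVM ships the IBM037 charset (the canonical name for the same code page); if it doesn't, Charset.forName will throw.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original test program.
public class LfRoundTrip {

	// Decode the bytes with the given charset, then encode the resulting
	// String back to bytes with the same charset.
	static byte[] roundTrip(byte[] input, Charset cs) {
		return new String(input, cs).getBytes(cs);
	}

	public static void main(String[] args) {
		// Assumption: the IBM037 charset is available on this JVM.
		Charset ebcdic = Charset.forName("IBM037");

		byte[] original = { 0x25 };                  // EBCDIC LF
		String decoded = new String(original, ebcdic);
		byte[] back = roundTrip(original, ebcdic);

		// Show the byte before, the code point in between, and the byte after.
		System.out.printf("0x%02x -> U+%04X -> 0x%02x%n",
				original[0] & 0xFF, (int) decoded.charAt(0), back[0] & 0xFF);
	}
}
```

If your JVM behaves the way mine did, the line should read 0x25, then U+000A, then 0x15. The same roundTrip helper can be pointed at any other byte or charset to check whether it survives.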
