When a Java program has to work with several character sets, the JVM may need to be started with the appropriate options.
First, let's look at the basics of Java character encodings.
- Internally, the JVM always works with Unicode.
- Data transferred into or out of the JVM is converted using the encoding named by the file.encoding property of the JVM, unless an encoding is given explicitly.
- Data coming into the JVM is converted from the encoding specified by file.encoding to Unicode.
- Data leaving the JVM is converted from Unicode to the encoding specified by file.encoding.
- When a Java program has to process data in an encoding other than the one specified by file.encoding, the following classes can be used; they accept an explicit encoding that takes precedence over the default:
- java.io.InputStreamReader
- java.io.FileReader (constructors that take a Charset were only added in Java 11)
- java.io.OutputStreamWriter
- java.io.FileWriter (constructors that take a Charset were only added in Java 11)
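To make the last point concrete, here is a minimal sketch of writing and reading text with an explicitly chosen charset, so that file.encoding plays no part. The class name ExplicitCharsetDemo and its encode/decode helpers are made up for this illustration, and in-memory byte arrays stand in for real files or sockets.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class ExplicitCharsetDemo {

    // Encode text to bytes with the given charset; the JVM default is not consulted.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and therefore flushed) at this point.
        return buffer.toByteArray();
    }

    // Decode bytes back into a String with the given charset.
    public static String decode(byte[] bytes, Charset charset) {
        StringBuilder result = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            char[] chunk = new char[1024];
            int read;
            while ((read = reader.read(chunk)) != -1) {
                result.append(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape so the source file stays ASCII
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(latin1.length); // 4: one byte per character
        System.out.println(utf8.length);   // 5: the accented letter takes two bytes
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The same text produces different byte sequences depending on the charset passed to the writer, which is why the reader on the other side has to be given the matching charset.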
The default character set of the JVM varies across platforms. The following piece of code shows three ways to get the JVM's default character set.
System.out.println(System.getProperty("file.encoding"));
System.out.println(
    new java.io.OutputStreamWriter(
        new java.io.ByteArrayOutputStream()).getEncoding()
);
System.out.println(java.nio.charset.Charset.defaultCharset().name());
Output on Linux:
ANSI_X3.4-1968
ASCII
US-ASCII
This property can be set using System.setProperty("file.encoding", {desired encoding});
However, doing this did not help me much, since the core Java libraries do not re-read this property to determine the default encoding: the default charset is determined once, early in JVM startup, and cached afterwards.
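A small sketch of that behaviour, as far as I have seen it on the JDKs I know: overwrite file.encoding at runtime and check whether Charset.defaultCharset() moves. The class name RuntimeFileEncodingDemo and its helper method are invented for this example, and the property is restored afterwards so the rest of the program is not affected.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of any library.
public class RuntimeFileEncodingDemo {

    // Returns true if the default charset is unchanged after file.encoding is overwritten.
    public static boolean defaultSurvivesPropertyChange() {
        // Calling defaultCharset() first makes sure the default has been resolved and cached.
        Charset before = Charset.defaultCharset();
        String original = System.getProperty("file.encoding");
        // Pick a value that is guaranteed to differ from the current default.
        String other = before.name().equals("UTF-8") ? "ISO-8859-1" : "UTF-8";
        try {
            System.setProperty("file.encoding", other);
            return Charset.defaultCharset().equals(before);
        } finally {
            // Put the property back the way it was.
            if (original != null) {
                System.setProperty("file.encoding", original);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset unchanged: " + defaultSurvivesPropertyChange());
    }
}
```

If the default really has to change, the value needs to be in place before the JVM starts, or the charset has to be passed explicitly wherever bytes are converted.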
My problem was reading from a java.net.URLConnection, so I used the following piece of code:
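import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

URL url = new URL(urlStr);
URLConnection connection = url.openConnection();
// Create an InputStreamReader with the UTF-8 charset
BufferedReader in = new BufferedReader(new InputStreamReader(
        connection.getInputStream(), Charset.forName("UTF-8")));
// If the raw bytes of the stream (a byte[] named bytes here) have to be turned into a String, pass the charset explicitly as well:
String str = new String(bytes, Charset.forName("UTF-8"));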
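For completeness, here is a rough, self-contained version of the same idea that can be run without a network connection. A ByteArrayInputStream stands in for connection.getInputStream(), and the class StreamToStringDemo and its readAll helper are names I made up for this sketch. It also shows what goes wrong when the bytes are decoded with the wrong charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
public class StreamToStringDemo {

    // Read the whole stream into a String, decoding with the given charset.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] chunk = new char[4096];
            int read;
            while ((read = reader.read(chunk)) != -1) {
                text.append(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a server would send: the letter e with an acute accent, in UTF-8.
        byte[] utf8Bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoded with the matching charset, the original character comes back.
        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        System.out.println(right.equals("\u00e9")); // true

        // Decoded as ISO-8859-1, each of the two bytes becomes its own character.
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 2
    }
}
```

With a real URLConnection, the charset to use would normally come from what the server actually sends; UTF-8 here is simply the assumption carried over from the code above.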
