Output extended ASCII characters in Java
15-12-2008 17:11

Each character of text is represented by a numeric value assigned by some encoding scheme. The particular type of encoding, the number of bits and bytes it requires, the transformations between encodings, and related issues thus become important, especially for a language like Java that is aimed at worldwide use. Encoding becomes particularly relevant to I/O when text is moved between systems that may use different encoding schemes.

So we give a brief overview of character encodings here.

The 7-bit ASCII code set is the best known, but there are many extended 8-bit sets in which the first 128 codes are ASCII and the extra 128 codes provide symbols and characters needed for languages other than English.

For example, the ISO-Latin-1 set (ISO standard 8859-1) provides the characters for most Western European languages and for a few others, such as Indonesian.
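As a small illustration of the difference between 7-bit ASCII and an 8-bit extended set, the sketch below encodes the same character with both. The class name ExtendedSetDemo and the helper encode are made up for this example; StandardCharsets is part of the standard library (Java 7 and later).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of the original course code. */
class ExtendedSetDemo {

    /** Returns the unsigned byte values produced by encoding the text. */
    static int[] encode(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        int[] values = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            values[i] = bytes[i] & 0xFF; // Java bytes are signed, so mask to 0..255
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "\u00f6"; // o with diaeresis, written as an escape

        // ISO-Latin-1 holds this character in its upper 128 codes, at 0xF6.
        int latin1 = encode(text, StandardCharsets.ISO_8859_1)[0];
        System.out.println("ISO-8859-1: 0x" + Integer.toHexString(latin1));

        // 7-bit ASCII has no such character; getBytes substitutes '?' (0x3F).
        int ascii = encode(text, StandardCharsets.US_ASCII)[0];
        System.out.println("US-ASCII:   0x" + Integer.toHexString(ascii));
    }
}
```

The substitution of a question mark is how String.getBytes handles characters that the target set cannot represent, which is why text written with the wrong encoding often ends up full of question marks.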

Java itself is based on the 2-byte Unicode representation of characters. The sixteen bits provide for a character set of 65,536 entries and so allow for broad international use.

The first 256 entries in 2-byte Unicode are identical to the ISO-Latin-1 set. That makes 2-byte Unicode inefficient for programs in English, since the second byte is seldom needed. Therefore, a scheme called UTF-8 is used to encode text characters (e.g. string literals) in Java class files.
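The relationship between the 16-bit char type and ISO-Latin-1 can be checked directly. The following is a minimal sketch; the class name Latin1RangeDemo and its method are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of the original course code. */
class Latin1RangeDemo {

    /** True if every ISO-Latin-1 byte decodes to the char with the same value. */
    static boolean latin1MatchesUnicode() {
        for (int value = 0; value < 256; value++) {
            byte[] oneByte = { (byte) value };
            String decoded = new String(oneByte, StandardCharsets.ISO_8859_1);
            if (decoded.length() != 1 || decoded.charAt(0) != value) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A Java char occupies sixteen bits, so its values run from 0 to 65535.
        System.out.println("bits per char: " + Character.SIZE);
        System.out.println("largest char value: " + (int) Character.MAX_VALUE);
        System.out.println("Latin-1 matches first 256 values: " + latin1MatchesUnicode());
    }
}
```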

The UTF-8 code varies from 1 byte to 3 bytes for these 16-bit values. If a byte begins with a 0 bit, then the lower 7 bits represent one of the 128 ASCII characters. If the byte begins with the bits 110, then it is the first of a two-byte pair that represents a Unicode value from 128 to 2047. If a byte begins with 1110, then it is the first of a three-byte set that can hold any of the other 16-bit Unicode values. The remaining bytes of a multi-byte sequence each begin with the bits 10.
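The byte patterns described above can be observed with the standard UTF-8 encoder. This is only a sketch: Utf8LengthDemo and its two helpers are hypothetical names, and the standard encoder used here differs in a few details from the variant stored in class files (for example in how the zero character is written).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of the original course code. */
class Utf8LengthDemo {

    /** Number of bytes the standard UTF-8 encoder emits for one char. */
    static int utf8Length(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    /** First encoded byte as an 8-digit binary string. */
    static String leadByteBits(char c) {
        int lead = String.valueOf(c).getBytes(StandardCharsets.UTF_8)[0] & 0xFF;
        String bits = Integer.toBinaryString(lead);
        return "00000000".substring(bits.length()) + bits; // left-pad to 8 digits
    }

    public static void main(String[] args) {
        // An ASCII letter, a Latin-1 letter, and the euro sign (above 2047).
        char[] samples = { 'A', '\u00f6', '\u20ac' };
        for (char c : samples) {
            System.out.println("value " + (int) c
                + ": " + utf8Length(c) + " byte(s), lead byte " + leadByteBits(c));
        }
    }
}
```

The lead bytes come out with the prefixes 0, 110 and 1110 respectively, matching the one-, two- and three-byte cases.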

Thus, UTF-8 trades the ability to use only one byte most of the time for occasionally needing up to three bytes. For text in English and many other languages, this is a good tradeoff that can drastically reduce file size compared with strict 2-byte Unicode.
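To make the size tradeoff concrete, the sketch below compares the encoded length of the same text in UTF-8 and in a fixed two-bytes-per-char form. EncodingSizeDemo is a hypothetical class name, and UTF-16BE stands in here for "strict Unicode" because it writes exactly two bytes per char without a byte-order mark.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of the original course code. */
class EncodingSizeDemo {

    /** Encoded size in bytes using the variable-length UTF-8 scheme. */
    static int utf8Size(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Encoded size in bytes using two bytes for every char. */
    static int twoByteSize(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String english = "Hello, world";      // 12 ASCII characters
        String german = "Gr\u00fc\u00dfe";    // 5 characters, 2 outside ASCII

        System.out.println("English: " + utf8Size(english)
            + " bytes in UTF-8, " + twoByteSize(english) + " in two-byte form");
        System.out.println("German:  " + utf8Size(german)
            + " bytes in UTF-8, " + twoByteSize(german) + " in two-byte form");
    }
}
```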

Java typically runs on platforms that use one-byte extended ASCII character encodings. Therefore, text I/O with the local platform, or with other platforms over the network, must convert between the encodings. As we mentioned in the previous section, the original one-byte streams were not convenient for this, so the Reader/Writer classes for two-byte I/O were introduced.

The default encoding is typically ISO-Latin-1, but your program can find the local encoding with the following static method in the System class:

  String local_encoding = System.getProperty ("file.encoding");
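A complete program that reports the local encoding might look like the sketch below. DefaultEncodingDemo is a hypothetical class name. Charset.defaultCharset() is an additional standard call, not mentioned in the text above, that reports the charset the JVM actually uses; on recent JDKs (18 and later) this is UTF-8 unless configured otherwise, so the result will not necessarily be ISO-Latin-1.

```java
import java.nio.charset.Charset;

/** Hypothetical demo class, not part of the original course code. */
class DefaultEncodingDemo {

    /** Canonical name of the charset the JVM uses by default. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The system property named in the text; its value depends on the platform.
        String local_encoding = System.getProperty("file.encoding");
        System.out.println("file.encoding   = " + local_encoding);

        // The charset object the runtime falls back on when none is given.
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```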

The encoding can be explicitly specified in some cases via the constructor, as in the following file output:

  FileOutputStream out_file = new FileOutputStream ("Turkish.txt");
  OutputStreamWriter file_writer = new OutputStreamWriter (out_file, "8859_3");

A similar overloaded constructor is available for InputStreamReader. See the book by Harold for more information about character encoding in Java.
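The following sketch shows both constructors working together, and what happens when the reader assumes a different encoding than the writer used. It writes to an in-memory byte array rather than a file so that it can be run anywhere; EncodingRoundTripDemo and its write and read helpers are hypothetical names, and the encodings "ISO-8859-1" and "UTF-8" were chosen because every Java platform is required to support them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical demo class, not part of the original course code. */
class EncodingRoundTripDemo {

    /** Encodes text to bytes through an OutputStreamWriter. */
    static byte[] write(String text, String encoding) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, encoding)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray(); // writer is closed and flushed by now
    }

    /** Decodes bytes to text through an InputStreamReader. */
    static String read(byte[] data, String encoding) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), encoding)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String text = "\u00f6l"; // two characters, the first outside ASCII
        byte[] latin1 = write(text, "ISO-8859-1");
        byte[] utf8 = write(text, "UTF-8");
        System.out.println(latin1.length + " bytes as ISO-8859-1, "
            + utf8.length + " bytes as UTF-8");

        // Reading with the matching encoding restores the original text.
        System.out.println("round trip ok: " + read(utf8, "UTF-8").equals(text));

        // Reading UTF-8 bytes as ISO-8859-1 turns each byte into its own character.
        System.out.println("mismatched read length: " + read(utf8, "ISO-8859-1").length());
    }
}
```

The mismatched read yields three characters instead of two, which is the familiar garbling seen when a file is opened with the wrong encoding.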

MORE ABOUT UNICODE

If a character is not available on your keyboard, it can be specified in a Java program by its Unicode value. This value is written as four hexadecimal digits preceded by the \u escape prefix. For example, the “ö” character is given by \u00F6 and “è” by \u00E8.
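A console version of the same idea, as a minimal sketch: UnicodeEscapeDemo and its constants are hypothetical names. It prints numeric values rather than the characters themselves, since whether the characters display correctly depends on the console encoding.

```java
/** Hypothetical demo class, not part of the original course code. */
class UnicodeEscapeDemo {

    // Each escape below is replaced by a single character at compile time.
    static final String O_DIAERESIS = "\u00F6";
    static final char E_GRAVE = '\u00E8';

    // A doubled backslash keeps the escape text literal (six characters).
    static final String LITERAL_TEXT = "\\u00F6";

    public static void main(String[] args) {
        System.out.println("length of escaped string: " + O_DIAERESIS.length());
        System.out.println("its value: 0x" + Integer.toHexString(O_DIAERESIS.charAt(0)));
        System.out.println("char value: 0x" + Integer.toHexString(E_GRAVE));
        System.out.println("length of literal text: " + LITERAL_TEXT.length());
    }
}
```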

The program UnicodesApplet shows examples of characters specified by their Unicode values and drawn on the applet panel.


UnicodesApplet.java

import javax.swing.*;
import java.awt.*;

/** Unicode demo program. **/
public class UnicodesApplet extends JApplet
{

  public void init () {
    Container content_pane = getContentPane ();

    // Create an instance of DrawingPanel
    DrawingPanel drawing_panel = new DrawingPanel ();

    // Add the DrawingPanel to the content pane.
    content_pane.add (drawing_panel);

  } // init

} // class UnicodesApplet

/** Display unicode characters. **/
class DrawingPanel extends JPanel
{
  public void paintComponent (Graphics g) {
    // First paint background unless you will
    // paint whole area yourself.
    super.paintComponent (g);

    // Each string is built from Unicode escapes, so the panel
    // draws the accented characters themselves.
    g.drawString ("\u00e5 = \u00e5", 10, 12 );
    g.drawString ("\u00c5 = \u00c5", 10, 24 );
    g.drawString ("\u00e4 = \u00e4", 10, 36 );
    g.drawString ("\u00c4 = \u00c4", 10, 48 );
    g.drawString ("\u00d6 = \u00d6", 10, 60 );
    g.drawString ("\u00f6 = \u00f6", 10, 72 );

  } // paintComponent

} // class DrawingPanel
Link: http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/characterEncodng.html
