Unicode Redux

Wolfgang Laun

Note: Originally distributed as The Java Specialists' Newsletter issues 109 and 111

The first Unicode standard was published in 1991, shortly after the Java project was started. A 16-bit design was considered sufficient to encompass the characters of all the world's living languages. Unicode 2.0, which no longer restricted codepoints to 16 bits, appeared in 1996, but Java's first release had emerged the year before. Java had to follow suit, but char remained a 16-bit type. This article reviews several topics related to character and string handling in Java.

Quiz

  1. What is the minimum number of keys on a keyboard you need for typing any Java program?
  2. How many lines does this statement print?
    System.out.println( "To be or not to be\u000Athat is here the question" );
    
  3. How can you represent your system's line terminator within a string literal using Unicode escapes (\uHHHH)?
  4. How many different identifiers of length two can you use in a Java program?
  5. Given that Character.MIN_VALUE is 0 and Character.MAX_VALUE is 65535, how many different Unicode characters can be represented by a char variable?
  6. How can you obtain the 5th Unicode code point from a (sufficiently long) String value?
  7. Given that String s has length 1, is the result of s.toUpperCase() always the same as String.valueOf(Character.toUpperCase(s.charAt(0)))?
  8. Can you explain how method words can be called to produce an output like the one shown below?
    private static void words( String w1, String w2 ){
        String letPat = "[^\\p{Cntrl}]+";
        assert w1.matches(letPat) && w2.matches(letPat);
        System.out.println(w1 + " - " + w2 + ": " + w1.equals(w2));
    }
    
    Genève - Genève: false
    
  9. How would you sort a table of strings containing German words?

Introduction

The use of “characters” in Java isn't quite as simple as the primitive type char might suggest; several misconceptions prevail. Notice that the word “character” goes back to the Greek word “charássein” (i.e., to scratch, engrave), which may be the reason why so many scratch their heads over the resulting intricacies.

Several issues need to be covered, ranging from the representation of Java programs to the implementation of the data types char and java.lang.String, and the handling of character data during input and output.

“[Java] programs are written using the Unicode character set.” (Language specification, § 3.1) This simple statement is followed by some small print explaining that each Java SE platform is based on one of the evolving Unicode specifications, with SE 5.0 based on Unicode 4.0. In contrast to earlier character set definitions, Unicode distinguishes between, on the one hand, the mapping of characters as abstract concepts (e.g., “Greek capital letter omega”) to a subset of the natural numbers, called code points, and, on the other hand, the representation of code points by values stored in units of a computer's memory. The Unicode standard defines seven such character encoding schemes.

It would all be (relatively) simple if Unicode were the only standard in effect. Other character sets are in use, typically infested with vendor-specific technicalities, and character data is bandied about without much consideration of what a sequence of storage units is intended to represent.

Another source of confusion arises from the limitations of our hardware. While high-resolution monitors let you represent any character in a wide range of glyphs with variations in font, style, size and colour, our keyboards are limited to a relatively small set of characters. This has given rise to the workaround of escape sequences, i.e., a convention by which a character you don't have on your keyboard can be represented by a sequence of characters taken from a small set you presumably do have.

Writing Java Programs

A Java program needs to be stored as a “text file” on your computer's file system, but this doesn't mean much except that there is a convention for representing line ends, and even this is cursed by the famous differences between all major OS families. The Java Language Specification is not concerned with the way this text is encoded, even though it says that lexical processing expects the text to contain Unicode characters. That's why a Java compiler features the standard option -encoding encoding. As long as your program contains nothing but the 26 letters, the 10 digits, white space and the special characters for separators and operators, you may not have to worry much about encoding, provided that the Java compiler is set to accept your system's default encoding and the IDE or editor plays along. Check http://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html for a list of supported encodings.
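If your sources are saved in an encoding that differs from the platform default, you can say so explicitly when compiling; for instance (file name made up for illustration):

javac -encoding UTF-8 MyProgram.java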

Several encodings map the aforementioned set of essential characters uniformly to the same set of code units of some 8-bit code. The character 'A', for instance, is encoded as 0x41 in US-ASCII, UTF-8 and in any of the codes ISO-8859-1 through ISO-8859-15, or windows-1250 through windows-1258. If you need to represent a Unicode code point beyond 0x7F you can evade all possible misinterpretations by supplying the character in the Unicode escape form defined by the Java language specification: characters '\' and 'u' must be followed by exactly four hexadecimal digits. Using this, the French version of “Hello world!” can be written as

public class AlloMonde {
    public static void main( String[] args ){
        System.out.println( "All\u00F4 monde!" );
    }
}
Since absolutely any character can be represented by a Unicode escape, you might write this very same program using nothing but Unicode escapes, as shown below, with line breaks added for readability:
\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0063\u006c\u0061
\u0073\u0073\u0020\u0041\u006c\u006c\u006f\u004d\u006f\u006e
\u0064\u0065\u0020\u007b\u000a\u0020\u0020\u0020\u0020\u0070
\u0075\u0062\u006c\u0069\u0063\u0020\u0073\u0074\u0061\u0074
\u0069\u0063\u0020\u0076\u006f\u0069\u0064\u0020\u006d\u0061
\u0069\u006e\u0028\u0020\u0053\u0074\u0072\u0069\u006e\u0067
\u005b\u005d\u0020\u0061\u0072\u0067\u0073\u0020\u0029\u007b
\u000a\u0009\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f
\u0075\u0074\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e
\u0028\u0020\u0022\u0041\u006c\u006c\u00f4\u0020\u006d\u006f
\u006e\u0064\u0065\u0021\u0022\u0020\u0029\u003b\u000a\u0020
\u0020\u0020\u0020\u007d\u000a\u007d\u000a
So, the minimum number of keys you need for such an exercise is 18: the 16 hexadecimal digits plus '\' and 'u'. (On some keyboards you may need the shift key for '\'.)

The preceding tour de force contains several instances of the escape \u000a, which represents the line feed control character - the line separator on Unices. By definition, the Java compiler converts all Unicode escapes to characters before it combines them into a sequence of tokens to be parsed according to the grammar. Most of the time you don't have to worry much about this, but there's a notable exception: using \u000A or \u000D in a character literal or a string literal is not going to create one of these characters as a character value - it indicates a line end to the lexical parser, which violates the rule that neither carriage return nor line feed may occur as themselves within a literal. These are the places where you have to use one of the escape sequences \n and \r. (Heinz wrote about this almost 11 years ago in newsletter 50.)

Attentive readers might now want to challenge my claim that all Java programs can be written using only 18 keys, which did not include 'n' and 'r'. But there are two ways to make do with these 18 characters. The first one uses an octal escape, i.e., \12 or \15. The other one is the long-winded representation of the two characters of the escape sequence by their Unicode escapes: \u005C\u006E and \u005C\u0072.
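The little class below, assembled just for illustration, shows all three spellings side by side; each println produces two lines of output:

public class EscapeDemo {
    public static void main( String[] args ){
        System.out.println( "one\ntwo" );            // the usual escape sequence
        System.out.println( "one\12two" );           // octal escape for line feed
        System.out.println( "one\u005C\u006Etwo" );  // \n spelt in Unicode escapes
    }
}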

Another fancy feature of Java is based on the rule that identifiers may contain any character that is a “Java letter” or a “Java letter-or-digit”. The language specification (cf. § 3.8) enumerates neither set explicitly; it delegates the decision to the java.lang.Character methods isJavaIdentifierStart and isJavaIdentifierPart, respectively. This lets you create an unbelievable number of identifiers even as short as only two characters. Investigating all char values yields 45951 and 46908 qualifying values, respectively, and this would produce 2,155,469,506 identifiers of length two! (We have to subtract two for the two keywords of length two, of course.)
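The little program below reproduces the count; note that the exact numbers depend on the Unicode tables of your JDK version, so newer releases may report different values:

public class CountIdentifierChars {
    public static void main( String[] args ){
        long starts = 0, parts = 0;
        for( char c = Character.MIN_VALUE; ; c++ ){
            if( Character.isJavaIdentifierStart( c ) ) starts++;
            if( Character.isJavaIdentifierPart( c ) )  parts++;
            if( c == Character.MAX_VALUE ) break;
        }
        // identifiers of length two: start * part, minus the keywords "do" and "if"
        System.out.println( starts + " starts, " + parts + " parts: "
                            + (starts * parts - 2) + " identifiers" );
    }
}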

The decisions about which characters may start or continue a Java identifier exhibit a good measure of laissez-faire. Along with the dollar sign you can use any other currency sign there is. (Isn't ¢lass a nice alternative to the ugly clazz?) More remarkable is the possibility of starting an identifier with characters that are classified as numeric, e.g., Ⅸ, the Roman numeral nine, a single character, is a valid identifier. Most astonishing is the option to use most control characters as part of an identifier, all the more so because they don't have printable representations at all. Here is one example, with a backspace following the initial letter 'A': A\u0008. Given a suitable editor, you can create a source file where the backspace is represented as a single byte, with the expected effect when the file is displayed on standard output:

public class FancyName {
    public static void main( String[] args ){
        String  = "backspace";
        System.out.println(  );
    }
}

Character Values

We may now try to answer the question of how many character values can be stored in a variable of type char, which actually is an integral type. The extreme values Character.MIN_VALUE and Character.MAX_VALUE are 0 and 65535, respectively. These 65536 numeric values would be open to any interpretation, but the Java language specification says that these values are UTF-16 code units, the values used in the UTF-16 encoding of Unicode texts. Any representation of Unicode must be capable of representing the full range of code points, with an upper bound of 0x10FFFF. Thus, code points beyond 0xFFFF need to be represented by pairs of UTF-16 code units, and the values used in these so-called surrogate pairs are exempt from being used as code points themselves. In java.lang.Character we find the static methods isHighSurrogate and isLowSurrogate, simple tests that return true for the ranges '\uD800' through '\uDBFF' and '\uDC00' through '\uDFFF', respectively. Also, by definition, code units 0xFFFE and 0xFFFF do not represent Unicode characters. From this we can deduce that at most 65536 - (0xE000 - 0xD800) - 2, i.e., 63486, Unicode code points can be represented as a char value.
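A small demonstration of the surrogate machinery, using MUSICAL SYMBOL G CLEF (U+1D11E), which lies beyond the BMP:

int clef = 0x1D11E;                                    // MUSICAL SYMBOL G CLEF
char[] units = Character.toChars( clef );              // yields a surrogate pair
System.out.println( units.length );                    // 2
System.out.println( Character.isHighSurrogate( units[0] ) );  // true
System.out.println( Character.isLowSurrogate( units[1] ) );   // true
System.out.println( Character.toCodePoint( units[0], units[1] ) == clef );  // true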

The actual number of Unicode characters that can be represented in a char variable is certainly lower, simply because there are gaps in and between the blocks set aside for the various alphabets and symbol sets.

It is evident that the full range of Unicode code points can only be stored in a variable of type int. This has not always been so: originally, Java was meant to implement Unicode characters when all code points could still be represented by a 16-bit unsigned integer. Since that time, Unicode has outgrown this Basic Multilingual Plane (BMP), so that Java SE 5.0 had to make amends, adding character property methods to java.lang.Character, in parallel to the existing ones with a char parameter, whose int parameter identifies a code point.
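The difference between the two flavours shows as soon as a code point exceeds 0xFFFF; a minimal illustration using GOTHIC LETTER AHSA (U+10330):

int ahsa = 0x10330;                                      // GOTHIC LETTER AHSA
System.out.println( Character.isLetter( (char)ahsa ) );  // false: value truncated to 0x0330
System.out.println( Character.isLetter( ahsa ) );        // true: the int overload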

Character Strings

When a character can be encoded with a single 16-bit value, a character string can be simply encoded as an array of characters. But the failure of char to cover all Unicode code points breaks the simplicity of this design. Accessing a string based on the progressive count of code points or Unicode characters isn't possible by mere index calculation any more, because code points are represented by one or two successive code units.

Given that we have a String value where surrogate pairs occur intermingled with individual code units identifying a code point, how do you obtain the number of Unicode characters in this string? How can you obtain the n-th Unicode character off the start of the string?

The answers to both questions are simple because there are String methods providing an out-of-the-box solution. First, the number of Unicode characters in a String is obtained like this:

public static int ucLength( String s ){
    return s.codePointCount( 0, s.length() );
}
Two method calls are sufficient for implementing the equivalent of method charAt, the first one for obtaining the offset of the n-th Unicode character in terms of code unit offsets, whereupon the second one extracts one or two code units for obtaining the integer code point.
public static int ucCharAt( String s, int index ){
    int iPos = s.offsetByCodePoints( 0, index );
    return s.codePointAt( iPos );
}
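A quick check of both helpers with a string mixing BMP characters and a surrogate pair (the G clef again):

String s = "a\uD834\uDD1Eb";            // 'a', U+1D11E as a surrogate pair, 'b'
System.out.println( s.length() );       // 4 code units
System.out.println( ucLength( s ) );    // 3 Unicode characters
System.out.printf( "U+%X%n", ucCharAt( s, 1 ) );  // U+1D11E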

A Capital Case

When the world was young, the Romans used to chisel their inscriptions using just 21 letters in a form called Roman square capitals. This very formal form of lettering was not convenient for everyday writing, where a form called cursiva antigua was used, as difficult to read for us now as it must have been then. Plautus, a Roman comedian, wrote about them: “a hen wrote these letters”, which may very well be the origin of the term chicken scratch.

Additional letters, diacritics and ligatures morphing into proper letters are nowadays the constituents of the various alphabets used in western languages, and they come in upper case and lower case forms. Capitalization, i.e., the question of when to write the initial letter of a word in upper case, is quite an issue in some languages, with German, thanks to its baffling set of rules, a hot contender for first place. Moreover, writing headings or emphasized words in all upper case is in widespread use.

As an aside, note that the custom of capitalizing words (as used in English texts) may have subtle pitfalls. (Compare, for instance, “March with a Pole” to “march with a pole”, with two more possible forms.)

Java comes with the String methods toUpperCase and toLowerCase. Programmers might expect these methods to produce strings of equal length, and one to be the inverse of the other when initially applied to an all upper or lower case word. But this is not true. One famous case is the German lower case letter 'ß' (“sharp s”), which (officially) doesn't have an upper case form (yet). Executing these statements

Locale de_DE = new Locale( "de", "DE" );
String wort = "Straße";
System.out.println( "word = " + wort );
String WORT = wort.toUpperCase( de_DE );
System.out.println( "WORT = " + WORT );
produces
wort = Straße
WORT = STRASSE
which is correct. Clearly, Character.toUpperCase(char) cannot work this small miracle. (The ugly combination STRAßE should be avoided.) More fun is to be expected in the near (?) future, when the LATIN CAPITAL LETTER SHARP S (U+1E9E), added to Unicode in 2008, is adopted by trendy typesetters (or typing trendsetters), like this: STRAẞE.

Care must be taken in other languages, too. There is, for instance, the bothersome Dutch digraph IJ and ij. There is no such letter in any of the ISO 8859 character encodings, and keyboards come without it, so you'll have to type “IJSSELMEER”. Let's apply the standard Java sequence of statements for capitalizing a word to a string containing these letters:

Locale nl_NL = new Locale( "nl", "NL" );
String IJSSELMEER = "IJSSELMEER"; 
System.out.println( "IJSSELMEER = " + IJSSELMEER );
String IJsselmeer = IJSSELMEER.substring( 0, 1 ).toUpperCase( nl_NL ) +
                    IJSSELMEER.substring( 1 ).toLowerCase( nl_NL );
System.out.println( "IJsselmeer = " + IJsselmeer );
This snippet prints
IJSSELMEER = IJSSELMEER
IJsselmeer = Ijsselmeer
which is considered wrong; “IJsselmeer” would be the correct form. It should be obvious that a very special case like this is beyond any basic character translation you can expect from a Java API.

Combining Diacritical Marks

The codepoints in the Unicode block combining diacritical marks might be called the dark horses in the assembly of letters. They are nothing on their own, but when following the right kind of letter, they unwaveringly exercise their influence, literally “crossing the t's and dotting the i's”. They occur in non-Latin alphabets, and they add an almost exotic flavour to the Latin-derived alphabet, with, e.g., the diaeresis decorating vowels (mostly) and the háček adding body to consonants (mostly).

The combining marks can be used in fanciful ways, for instance: O͜O. While there are numerous Unicode codepoints for precombined letters with some diacritic, it is also permitted to represent them by their basic letter followed by the combining diacritical mark, and some applications might prefer to do it that way. You can guess that this means trouble if your software has to compare words. Method equals in String is certainly not prepared to deal with such subtleties, unless the strings have been subjected to a process called normalization. This can be done by applying the static method normalize of class java.text.Normalizer. Here is a short demonstration.

import java.text.Normalizer;

public class Normalize {
    public boolean normeq( String w1, String w2 ){
        if( w1.length() != w2.length() ){
            w1 = Normalizer.normalize(w1, Normalizer.Form.NFD);
            w2 = Normalizer.normalize(w2, Normalizer.Form.NFD);
        }
        return w1.equals(w2);
    }

    public void testEquals( String w1, String w2 ){
        System.out.println( w1 + " equals " + w2 + " " +
                            w1.equals(w2));
        System.out.println( w1 + " normeq " + w2 + " " +
                            normeq(w1, w2));
    }
}
The enum constant Normalizer.Form.NFD selects the kind of normalization to apply; here it is just the decomposition step that separates precombined letters into a Latin letter and the diacritical mark. Running some tests produced this output:
Genève equals Genève false
Genève normeq Genève true
háček equals háček false
háček normeq háček false
Do you see what went wrong? The blunder is in method normeq: you can't assume that equal lengths indicate the same normalization state. In the second pair of words, one was written with the first letter precomposed and the second one decomposed, and the other one vice versa; the string lengths are equal, but the character arrays are not, although the word is the same. There is no shortcut, but we can use this optimistic approach:
    public boolean normeq( String w1, String w2 ){
        if( w1.equals(w2) ){
            return true;
        } else {
            w1 = Normalizer.normalize(w1, Normalizer.Form.NFD);
            w2 = Normalizer.normalize(w2, Normalizer.Form.NFD);
            return w1.equals(w2);
        }
    }
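With this corrected normeq substituted in class Normalize, mixed compositions compare as expected; a small check, called from within the class and spelling the strings with escapes for clarity:

String w1 = "h\u00E1\u010Dek";      // precomposed á and č
String w2 = "ha\u0301c\u030Cek";    // a + combining acute, c + combining caron
System.out.println( normeq( w1, w2 ) );  // true: both have the same NFD form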

Collating Strings

Class java.lang.String implements java.lang.Comparable, but its method compareTo is just a rudimentary effort, with a resulting collating sequence that isn't good for anything except for storing strings in an array where binary search is used. Consider, for instance, these four words, which are presented in the order Germans expect them in their dictionaries: “Abend”, “aber”, “morden”, “Morgen”. Applying Arrays.sort to this set yields “Abend”, “Morgen”, “aber”, “morden”, due to all upper case letters in the range 'A' to 'Z' preceding all lower case letters.

Treating an upper case and the corresponding lower case letter as (almost) equal is just one of the many deviations from the character order required in a useful collation algorithm. Also, note that there's a wide range of applied collations, varying by language and usage. German dictionaries, for instance, use a collation where vowels with a diaeresis are ranked immediately after the unadorned vowel, and the letter 'ß', originally resulting from a ligature of 'ſ' (long s) and 'z', is treated like 'ss'. But for lists of names, as in a telephone book, the German standard establishes the equations 'ä' = 'ae', 'ö' = 'oe' and 'ü' = 'ue'. Book indices may require very detailed attention, e.g., when mathematical symbols have to be included.

The technical report Unicode Collation Algorithm (UCA) contains a highly detailed specification of a general collation algorithm, with all the bells and whistles required to cope with all nuances for ordering. For anyone planning a non-trivial application dealing with Unicode strings and requiring sorting and searching, this is a must-read, and it's highly informative for anybody with an interest in languages.

Even if not all intricacies outlined in the UCA report are implemented, a generally applicable collating algorithm must support the free definition of collating sequences, and it is evident that this requires more than just the possibility of defining an arbitrary ordering of the characters. The class RuleBasedCollator in java.text provides the most essential features for this. Here is a simple example for the use of RuleBasedCollator.

import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Comparator;

public class GermanSort implements Comparator<String> {
    RuleBasedCollator collator;
    public GermanSort() throws Exception {
        collator = createCollator();
    }
    private RuleBasedCollator createCollator() throws Exception {
        String german =
            "= '-',''' " +
            "< A,a;ä,Ä< B,b< C,c< D,d< E,e< F,f< G,g< H,h< I,i< J,j" +
            "< K,k< L,l< M,m< N,n< O,o;Ö,ö< P,p< Q,q< R,r< S,s< T,t" +
            "< U,u;Ü,ü< V,v< W,w< X,x< Y,y< Z,z" +
            "& ss=ß";
        return new RuleBasedCollator( german );
    }
    public int compare( String s1, String s2 ){
        return collator.compare( s1, s2 );
    }
    public boolean equals( Object obj ){
        return this == obj;
    }
    public void sort( String[] strings ){
        Arrays.sort( strings, this );
    }
}
The string german contains the definition of the rules, ranking the 26 letters of the ISO basic Latin alphabet by using the primary relational operator '<'. A weaker ordering principle is indicated by a semicolon, which places an umlaut after its stem vowel, and even less significant is the case difference, indicated by a comma. The initial part defines the hyphen and the apostrophe as ignorable characters. The last relation resets the position to 'ss' and ranks 'ß' as equal to it. (Note: The javadoc for this class is neither complete nor correct. Use the syntax illustrated in the preceding example for defining ignorables.)
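A quick trial, e.g., with a main method added to GermanSort (the word list is made up for illustration):

public static void main( String[] args ) throws Exception {
    String[] woerter = { "Straße", "Morgen", "aber", "morden", "Abend", "Mörder" };
    new GermanSort().sort( woerter );
    System.out.println( Arrays.toString( woerter ) );
    // prints: [Abend, aber, morden, Mörder, Morgen, Straße]
}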

There is, however, a much simpler way to obtain a Collator that is adequate for most collating tasks (or at least a good starting point): simply call method getInstance, preferably with a Locale parameter. This returns a prefabricated RuleBasedCollator object according to the indicated locale. Make sure to select the locale not only according to language, since the country may affect the collating rules. Also, the Collator instances available in this way may not be up-to-date, as the following little story illustrates. There used to be a French collating rule requiring the words “cote”, “côte”, “coté” and “côté” to be in this order, which is in contrast to normal accent ordering, i.e., “cote”, “coté”, “côte” and “côté”. Not too long ago, this fancy rule retreated to Canada. But, even with JDK 1.7, you may have to create a modified Collator by removing the modifier '@' from the string defining the sort rules:

Collator collator = Collator.getInstance( new Locale( "fr", "FR" ) );
String rules = ((RuleBasedCollator)collator).getRules();
// '@' is last
rules = rules.substring( 0, rules.length()-1 );
collator = new RuleBasedCollator( rules );
(Making the preceding code robust is left as an exercise to the reader.)

A Closer Look: Sorting Strings

Comparing Unicode strings according to a rule based collation is bound to be a non-trivial process, since the collator rules must be taken into account. You can get an idea of what this means when you look at class CollationElementIterator. This iterator, obtainable for strings by calling the RuleBasedCollator method getCollationElementIterator, delivers sequences of integers that, when compared to each other, result in the correct relation according to the collator. These integers are quite artsy combinations derived from a character, or from a character and the next one; even two or more key integers may result from a single character. For a once-in-a-while invocation of a collator's compare method this isn't going to hurt, but sorting more than a fistful of strings is an entirely different matter.
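You can watch the iterator at work with a few lines like these (using the platform's German collator; the element values are implementation dependent):

RuleBasedCollator rbc =
    (RuleBasedCollator)Collator.getInstance( Locale.GERMANY );
CollationElementIterator cei = rbc.getCollationElementIterator( "Straße" );
int elem;
while( (elem = cei.next()) != CollationElementIterator.NULLORDER ){
    System.out.printf( "%08x%n", elem );  // one or more elements per character
}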

This is where class CollationKey comes to the rescue. Objects are created by calling the (rule based) collator method getCollationKey for a string. Each object represents a value equivalent to the string's unique position in the set of all strings sorted according to this collator.

Putting this all together, an efficient sort of a collection of strings should create a collection of collation keys and sort it. Conveniently enough, the CollationKey method getSourceString delivers the corresponding string from which the key was created. This is shown in the sort method given below.

public String[] sort( String[] strings ){
    CollationKey[] keys = new CollationKey[strings.length];
    for( int i = 0; i < strings.length; i++ ){
        keys[i] = collator.getCollationKey( strings[i] );
    }
    Arrays.sort( keys );
    String[] sorted = new String[strings.length];
    for( int i = 0; i < sorted.length; i++ ){
        sorted[i] = keys[i].getSourceString();
    }
    return sorted;
}

Supplementary Characters

Supplementary characters, i.e., those that need to be expressed with surrogate pairs in UTF-16, are uncommon. However, it's important to know where they can turn up, as they may require precautions in your application code. They include:

  • Emoji symbols and emoticons, for inter-operating with Japanese mobile phones. While the BMP already contains quite a lot of emoticons, hundreds of Emoji characters were encoded in version 6.0 of the Unicode standard.
  • Uncommon (but not unused) CJK (i.e., Chinese, Japanese and Korean) characters, important for personal and place names.
  • Variation selectors for ideographic variation sequences.
  • Important symbols for mathematics.
  • Numerous minority scripts and historic scripts, important for some user communities.
  • Symbols for Domino and Mahjong tiles.
At the very least, Java applications for mobile devices will have to be aware of the growing number of Emoji symbols. Games could be another popular reason for the need to include supplementary characters.

Property Files

The javadoc for java.util.Properties states that the load and store methods read and write a character stream encoded in ISO 8859-1. This is an 8-bit character set, containing a selection of letters with diacritical marks as used in several European languages in addition to the traditional US-ASCII characters. Any other character must be represented using the Unicode escape (\uHHHH). This is quite likely to trip you up when you trustingly edit your properties file with an editor that's been educated to understand UTF-8. Although all printable ISO 8859-1 characters with code units greater than 0x7F happen to map to Unicode code points that are numerically equal to these code units, their UTF-8 encoding requires two bytes. (The appearance of 'Â' or 'Ã' in front of some other character is the typical evidence of such a misunderstanding.) Moreover, it's easy to create a character not contained in the ISO 8859-1 set. On my Linux system, emacs lets me produce the trade mark sign (™) with a few keystrokes. For a remedy, the same javadoc explains that the tool native2ascii accompanying the JDK may be used to convert a file from any encoding to ISO 8859-1 with Unicode escapes.
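The store method performs the escaping for you; a minimal sketch (property name made up):

import java.util.Properties;

public class PropsDemo {
    public static void main( String[] args ) throws Exception {
        Properties props = new Properties();
        props.setProperty( "brand", "Java\u2122" );  // ™ is not in ISO 8859-1
        props.store( System.out, "demo" );
        // prints something like: brand=Java\u2122
    }
}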

The Properties methods loadFromXML(InputStream) and storeToXML(OutputStream, String, String) read and write XML data, which should indicate its encoding in the XML declaration. It may be more convenient to use these methods than the edit-and-convert rigmarole required for a simple character stream.

Writing Text Files

We call a file a “text file” if its data is meant to be a sequence of lines containing characters. While any programming language may have its individual concept of the set of characters it handles as a value of a character data type (and a singular way of representing a character in memory), things aren't quite as simple as soon as you have to entrust your data to a file system. Other programs, on the same or on another system, should be able to read that data and interpret it so that they come up with the same set of characters. Standards institutes and vendors have created an overly rich set of encodings, i.e., prescriptions for mapping byte sequences to character sequences. On top of that, there are the various escape mechanisms which let you represent characters not contained in the basic set as sequences of characters from that set. The latter is an issue of interpretation according to various text formats, such as XML or HTML, and we'll skip it here.

Writing a sequence of characters and line terminators to a file should be a simple exercise, and the API of java.io does indeed provide all the essentials, but there are two things to consider: first, what should become of a “character” when it is stored on the medium or sent over the wire; second, how lines are separated.

If the set of characters in the text goes beyond what can be represented with one of the legacy encodings that use one 8-bit code unit per character, one of the Unicode encoding schemes UTF-8, UTF-16 or UTF-32 must be chosen, and it should be set explicitly as it is risky to rely on the default stored in the system property file.encoding.
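Selecting the encoding explicitly is straightforward with OutputStreamWriter; a minimal sketch (file name made up):

import java.io.*;

public class WriteUtf8 {
    public static void main( String[] args ) throws IOException {
        try( Writer out = new BufferedWriter( new OutputStreamWriter(
                new FileOutputStream( "allo.txt" ), "UTF-8" ) ) ){
            out.write( "All\u00F4 monde!" );
            out.write( System.getProperty( "line.separator" ) );
        }
    }
}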

Which one should you choose, provided you have a choice at all? If size matters, consider that UTF-16 produces 2 bytes per BMP character, whereas UTF-8 is a variable-width encoding, requiring 1, 2, 3 or more bytes per codepoint. Thus, if your text uses characters from US-ASCII only, the ratio between UTF-8 and UTF-16 will be 1:2; if you are writing an Arabic text, the ratio is bound to be 1:1; and for CJK it will be 3:2. Compressing the files narrows the distance considerably. Anyway, UTF-8 has become the dominant character encoding for the World Wide Web, and it is increasingly used as the default in operating systems. Therefore it's hard to put up a case against using this very flexible encoding.

Conclusion

Delving into Unicode's mysteries is a highly rewarding adventure. We have seen that Java provides some support for text processing according to the Unicode standard, but you should always keep in mind that this support may not be sufficient for more sophisticated applications. This has been one of the two motives for writing this article. And what was the other one? Ah, yes, having fun!


Copyright 2012 Heinz Kabutz
Chorafakia, Chania, Crete, 73100, Greece