Unicode Redux

Wolfgang Laun

Note: Originally distributed as The Java Specialists' Newsletter, issues 109 and 111.

The first Unicode standard was published in 1991, shortly after
the Java project was started. A 16-bit design was considered sufficient
to encompass the characters of all the world's living languages.
Unicode 2.0, which no longer restricted codepoints to 16 bits, appeared
in 1996, but Java's first release had emerged the year before. Java
had to follow suit, but without giving up compatibility; how this was resolved is one of the subjects of this article.

Introduction
The use of “characters” in Java isn't quite so simple as the simple type char might make you believe. Several issues need to be covered, ranging from the representation of Java programs to the implementation of the data types char and String.

“[Java] programs are written using the Unicode character set.” (Language specification, § 3.1) This simple statement is followed by some small print explaining that each Java SE platform relates to one of the evolving Unicode specifications, with SE 5.0 being based on Unicode 4.0. In contrast to earlier character set definitions, Unicode distinguishes between the association of characters as abstract concepts (e.g., “Greek capital letter omega”) with a subset of the natural numbers, called code points, on the one hand, and the representation of code points by values stored in units of a computer's memory on the other. The Unicode standard defines seven of these character encoding schemes.

It would all be (relatively) simple if Unicode were the only standard in effect. Other character sets are in use, typically infested with vendor-specific technicalities, and character data is bandied about without much consideration of what a sequence of storage units is intended to represent.

Another source of confusion arises from the limitations of our hardware. While high-resolution monitors let you represent any character in a wide range of glyphs with variations in font, style, size and colour, our keyboards are limited to a relatively small set of characters. This has given rise to the workaround of escape sequences, i.e., a convention by which a character you don't have on your keyboard can be represented by a sequence of ones taken from a small set you presumably have.

Writing Java Programs
A Java program needs to be stored as a “text file” on your computer's file system, but this doesn't mean much except that there is a convention for representing line ends, and even this is cursed by the famous differences between all major OS families. The Java Language Specification is not concerned with the way this text is encoded, even though it says that lexical processing expects this text to contain Unicode characters. That's why a Java compiler features the standard option -encoding, which tells it how the source text is encoded.

Several encodings map the set of characters essential for Java source code uniformly to the same set of code units of some 8-bit code. The character 'A', for instance, is encoded as 0x41 in US-ASCII, in UTF-8 and in any of the codes ISO-8859-1 through ISO-8859-15 or windows-1250 through windows-1258. If you need to represent a Unicode code point beyond 0x7F, you can evade all possible misinterpretations by supplying the character in the Unicode escape form defined by the Java language specification: the characters '\' and 'u' must be followed by exactly four hexadecimal digits. Using this, the French version of “Hello world!” can be written as

    public class AlloMonde {
        public static void main( String[] args ){
            System.out.println( "All\u00F4 monde!" );
        }
    }

Since absolutely any character can be represented by a Unicode escape, you might write this very same program using nothing but Unicode escapes, as shown below, with line breaks added for readability:

    \u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0063\u006c\u0061
    \u0073\u0073\u0020\u0041\u006c\u006c\u006f\u004d\u006f\u006e
    \u0064\u0065\u0020\u007b\u000a\u0020\u0020\u0020\u0020\u0070
    \u0075\u0062\u006c\u0069\u0063\u0020\u0073\u0074\u0061\u0074
    \u0069\u0063\u0020\u0076\u006f\u0069\u0064\u0020\u006d\u0061
    \u0069\u006e\u0028\u0020\u0053\u0074\u0072\u0069\u006e\u0067
    \u005b\u005d\u0020\u0061\u0072\u0067\u0073\u0020\u0029\u007b
    \u000a\u0009\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f
    \u0075\u0074\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e
    \u0028\u0020\u0022\u0041\u006c\u006c\u00f4\u0020\u006d\u006f
    \u006e\u0064\u0065\u0021\u0022\u0020\u0029\u003b\u000a\u0020
    \u0020\u0020\u0020\u007d\u000a\u007d\u000a

So, the minimum number of keys you need for such an exercise is 18: the 16 hexadecimal digits plus '\' and 'u'. (On some keyboards you may need the shift key for '\'.)
The preceding tour de force contains several instances of the escape \u000a, i.e., LINE FEED: since Unicode escapes are translated before the source is split into lines and tokens, each of them acts as a genuine line terminator.

Attentive readers might now want to challenge my claim that all Java programs can be written using only 18 keys, which did not include 'n' and 'r', the letters needed for the escape sequences \n and \r within character and string literals. But there are two ways to make do with these 18 characters. The first one uses an octal escape: \012 and \015 denote LINE FEED and CARRIAGE RETURN inside a literal, using only digits and the backslash. The second spells out the escape sequence itself with Unicode escapes: \u005c\u006e turns into \n before the literal is parsed.
Another fancy feature of Java is based on the rule that identifiers
may contain any character that is a “Java letter” or a
“Java letter-or-digit”. The language specification (cf. § 3.8)
enumerates neither set explicitly; it delegates the decision to the methods Character.isJavaIdentifierStart and Character.isJavaIdentifierPart.
The decisions as to which characters may start or be part of a Java identifier exhibit a good measure of laissez-faire. Along with the dollar sign you can use any other currency sign there is. (Isn't that generous?) Even ignorable control characters are admitted as part of an identifier, so a variable name may contain a BACKSPACE, as in this (admittedly contrived) example:

    public class FancyName {
        public static void main( String[] args ){
            // The variable name is 'b', BACKSPACE, 's' — invisible, but legal.
            String b\u0008s = "backspace";
            System.out.println( b\u0008s );
        }
    }
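You can ask the two Character methods directly which characters pass muster. A minimal probe (the class name is made up) prints the verdicts for a few candidates:

    public class IdentifierProbe {
        public static void main( String[] args ){
            // Probe: dollar, euro, sharp s, hyphen, backspace.
            for( char c : new char[]{ '$', '\u20AC', '\u00DF', '-', '\u0008' } ){
                System.out.printf( "U+%04X start=%b part=%b%n", (int) c,
                    Character.isJavaIdentifierStart( c ),
                    Character.isJavaIdentifierPart( c ) );
            }
        }
    }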
Character Values

We may now try to answer the question of how many character values can be stored in a variable of type char. Since char is a 16-bit type, it can hold 65,536 different values.

The actual number of Unicode characters that can be represented in a single char is smaller, however: the 2,048 code points U+D800 through U+DFFF are set aside as surrogates, reserved for the UTF-16 encoding of code points beyond 0xFFFF, and do not denote characters on their own.

It is evident that the full range of Unicode code points, which extends to U+10FFFF, can only be stored in a variable of type int, and this is what the code point oriented methods of class Character expect and return.
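A short demonstration (a sketch; the chosen code point is arbitrary) of how an int code point beyond the 16-bit range maps to a pair of char values:

    public class SurrogateDemo {
        public static void main( String[] args ){
            int codePoint = 0x1D11E;  // MUSICAL SYMBOL G CLEF, beyond the BMP
            char[] units = Character.toChars( codePoint );
            System.out.println( units.length );                          // 2
            System.out.println( Character.isHighSurrogate( units[0] ) ); // true
            System.out.println( Character.isLowSurrogate( units[1] ) );  // true
        }
    }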
Character Strings

When a character can be encoded with a single 16-bit value, a character string can be simply encoded as an array of characters. But the failure of 16 bits to accommodate all Unicode code points means that a Java String is, in fact, a sequence of UTF-16 code units: most characters occupy one char, while a supplementary character is stored as a surrogate pair occupying two.

Given that we have a String value where surrogate pairs occur intermingled with individual code units identifying a code point, how do you obtain the number of Unicode characters in this string? How can you obtain the n-th Unicode character off the start of the string? The answers to both questions are simple because there are String methods providing an out-of-the-box solution. First, the number of Unicode characters in a String is obtained like this:

    public static int ucLength( String s ){
        return s.codePointCount( 0, s.length() );
    }

Two method calls are sufficient for implementing the equivalent of method charAt: the first one obtains the offset of the n-th Unicode character in terms of code unit offsets, whereupon the second one extracts one or two code units to produce the integer code point.
    public static int ucCharAt( String s, int index ){
        int iPos = s.offsetByCodePoints( 0, index );
        return s.codePointAt( iPos );
    }
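To visit all code points of a string in sequence, the usual idiom (a sketch, using only the methods shown above plus Character.charCount) advances by one or two code units per step:

    public static void printCodePoints( String s ){
        for( int i = 0; i < s.length(); ){
            int cp = s.codePointAt( i );
            System.out.printf( "U+%04X%n", cp );
            i += Character.charCount( cp );   // skip 1 or 2 code units
        }
    }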
A Capital Case

When the world was young, the Romans used to chisel their inscriptions using just 21 letters, in a form called Roman square capitals. This very formal style of lettering was not convenient for everyday writing, where a form called cursiva antigua was used, as difficult to read for us now as it must have been then. Plautus, a Roman comedian, wrote about them: “a hen wrote these letters”, which may very well be the origin of the term chicken scratch. Additional letters, diacritics and ligatures morphing into proper letters are nowadays the constituents of the various alphabets used in western languages, and they come in upper case and lower case forms. Capitalization, i.e., the question of when to write the initial letter of a word in upper case, is quite an issue in some languages, with German being a hot contender for the first place, with its baffling set of rules. Moreover, writing headings or emphasized words in all upper case is in widespread use. As an aside, note that the custom of capitalizing words (as used in English texts) may have subtle pitfalls. (Compare, for instance, “March with a Pole” to “march with a pole”, with two more possible forms.)

Java comes with the method String.toUpperCase(Locale), which applies the case mapping rules of the given locale. The snippet

    Locale de_DE = new Locale( "de", "DE" );
    String wort = "Straße";
    System.out.println( "wort = " + wort );
    String WORT = wort.toUpperCase( de_DE );
    System.out.println( "WORT = " + WORT );

produces

    wort = Straße
    WORT = STRASSE

which is correct. Clearly, Character.toUpperCase(char) cannot
work this small miracle. (The ugly combination STRAßE should
be avoided.) More fun is to be expected in the near (?) future, when the
LATIN CAPITAL LETTER SHARP S (U+1E9E), added to Unicode in 2008,
is adopted by trendy typesetters (or typing trendsetters), like
this: STRAẞE.
Care must be taken in other languages, too. There is, for instance,
the bothersome Dutch digraph IJ and ij. There is no such letter
in any of the ISO 8859 character encodings and keyboards come without it,
and so you'll have to type “IJ” and “ij” as two separate letters. Now consider:

    Locale nl_NL = new Locale( "nl", "NL" );
    String IJSSELMEER = "IJSSELMEER";
    System.out.println( "IJSSELMEER = " + IJSSELMEER );
    String IJsselmeer =
        IJSSELMEER.substring( 0, 1 ).toUpperCase( nl_NL )
      + IJSSELMEER.substring( 1 ).toLowerCase( nl_NL );
    System.out.println( "IJsselmeer = " + IJsselmeer );

This snippet prints

    IJSSELMEER = IJSSELMEER
    IJsselmeer = Ijsselmeer

which is considered wrong; “IJsselmeer” would be the correct
form. It should be obvious that a very special case like this is beyond
any basic character translation you can expect from a Java API.
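Beyond the API, yes, but not beyond a few lines of your own. A minimal sketch (the method name is made up) special-cases the digraph before delegating to the locale-aware case mappings:

    // Capitalize a Dutch word, keeping the digraph "ij" in one case.
    public static String capitalizeDutch( String word, Locale nl ){
        int head = word.regionMatches( true, 0, "ij", 0, 2 ) ? 2 : 1;
        return word.substring( 0, head ).toUpperCase( nl )
             + word.substring( head ).toLowerCase( nl );
    }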
Combining Diacritical Marks

The codepoints in the Unicode block Combining Diacritical Marks might be called the dark horses in the assembly of letters. They are nothing on their own, but when following the right kind of letter, they unwaveringly exercise their influence, literally “crossing the t's and dotting the i's”. They occur in non-Latin alphabets, and they add an almost exotic flavour to the Latin-derived alphabet, with, e.g., the diaeresis decorating vowels (mostly) and the háček adding body to consonants (mostly).
The combining marks can be used in fanciful ways, for instance: O͜O.
While there are numerous Unicode codepoints for precombined letters
with some diacritic, it is also permitted to represent them by their
basic letter followed by the combining diacritical mark, and some
applications might prefer to do it that way. You can guess that
this means trouble if your software has to compare words.
Java supports normalization through class java.text.Normalizer, whose static method normalize converts a string into one of the Unicode normalization forms:

    import java.text.Normalizer;

    public class Normalize {
        public boolean normeq( String w1, String w2 ){
            if( w1.length() != w2.length() ){
                w1 = Normalizer.normalize( w1, Normalizer.Form.NFD );
                w2 = Normalizer.normalize( w2, Normalizer.Form.NFD );
            }
            return w1.equals( w2 );
        }

        public void testEquals( String w1, String w2 ){
            System.out.println( w1 + " equals " + w2 + " " + w1.equals( w2 ) );
            System.out.println( w1 + " normeq " + w2 + " " + normeq( w1, w2 ) );
        }
    }

The enum constant Normalizer.Form.NFD selects the kind of normalization to apply; here it is just the decomposition step that separates precombined letters into a basic letter and the diacritical mark. Running some tests produced this output:
    Genève equals Genève false
    Genève normeq Genève true
    háček equals háček false
    háček normeq háček false

Do you see what went wrong? The blunder is in method normeq: you can't assume that equal lengths indicate the same normalization state. In the second pair of words, one was written with the first accented letter precomposed and the second one decomposed, and the other one vice versa; the string lengths are equal, but the character arrays are not, even though the word is the same. There is no shortcut, but we can use this optimistic approach:
    public boolean normeq( String w1, String w2 ){
        if( w1.equals( w2 ) ){
            return true;
        } else {
            w1 = Normalizer.normalize( w1, Normalizer.Form.NFD );
            w2 = Normalizer.normalize( w2, Normalizer.Form.NFD );
            return w1.equals( w2 );
        }
    }
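With the corrected method, the failing pair compares as equal. A quick check (a sketch; the escapes spell the two mixed-form spellings of “háček” described above):

    Normalize n = new Normalize();
    String w1 = "h\u00E1c\u030Cek";   // first letter precomposed, second decomposed
    String w2 = "ha\u0301\u010Dek";   // first letter decomposed, second precomposed
    n.testEquals( w1, w2 );           // equals false, but normeq now true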
Collating Strings

Class String implements the interface Comparable<String> by comparing the numeric values of the underlying UTF-16 code units, which is of little use for ordering strings the way a human reader would expect.

Treating an upper case and the corresponding lower case letter as (almost) equal is just one of the many deviations from the character order required in a useful collation algorithm. Also, note that there's a wide range of applied collations, varying by language and usage. German dictionaries, for instance, use a collation where vowels with a diaeresis are ranked immediately after the unadorned vowel, and the letter 'ß', originally resulting from a ligature of 'ſ' (long s) and 'z', is treated like 'ss'. But for lists of names, as in a telephone book, the German standard establishes the equations 'ä' = 'ae', 'ö' = 'oe' and 'ü' = 'ue'. Book indices may require very detailed attention, e.g., when mathematical symbols have to be included.

The technical report Unicode Collation Algorithm (UCA) contains a highly detailed specification of a general collation algorithm, with all the bells and whistles required to cope with all nuances of ordering. For anyone planning a non-trivial application dealing with Unicode strings and requiring sorting and searching, this is a must-read, and it's highly informative for anybody with an interest in languages.
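The gap between code unit order and linguistic order is easy to demonstrate (a small sketch; class name and word pair are made up):

    import java.text.Collator;
    import java.util.Locale;

    public class CompareDemo {
        public static void main( String[] args ){
            String a = "Äpfel", b = "Birnen";
            // Code unit order: 'Ä' (U+00C4) sorts after 'B' (U+0042).
            System.out.println( a.compareTo( b ) > 0 );           // true
            // Collation order: 'Ä' ranks with 'A', before 'B'.
            Collator german = Collator.getInstance( Locale.GERMAN );
            System.out.println( german.compare( a, b ) < 0 );     // true
        }
    }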
Even if not all intricacies outlined in the UCA report are implemented,
a generally applicable collating algorithm must support the
free definition of collating sequences, and it is evident that
this requires more than just the possibility of defining an arbitrary
ordering of the characters. The class java.text.RuleBasedCollator lets you define your own collation from a string of rules, as this German example shows:

    import java.text.RuleBasedCollator;
    import java.util.Arrays;
    import java.util.Comparator;

    public class GermanSort implements Comparator<String> {
        RuleBasedCollator collator;

        public GermanSort() throws Exception {
            collator = createCollator();
        }

        private RuleBasedCollator createCollator() throws Exception {
            String german =
                "= '-',''' " +
                "< A,a;ä,Ä< B,b< C,c< D,d< E,e< F,f< G,g< H,h< I,i< J,j" +
                "< K,k< L,l< M,m< N,n< O,o;Ö,ö< P,p< Q,q< R,r< S,s< T,t" +
                "< U,u;Ü,ü< V,v< W,w< X,x< Y,y< Z,z" +
                "& ss=ß";
            return new RuleBasedCollator( german );
        }

        public int compare( String s1, String s2 ){
            return collator.compare( s1, s2 );
        }

        public boolean equals( Object obj ){
            return this == obj;
        }

        public void sort( String[] strings ){
            Arrays.sort( strings, this );
        }
    }

The string german contains the definition of the rules,
ranking the 26 letters of the ISO basic Latin alphabet by
using the primary relational operator '<'. A weaker ordering
principle is indicated by a semicolon, which places an umlaut
after its stem vowel, and even less significant is the case
difference, indicated by a comma. The initial part defines
the hyphen and the apostrophe as ignorable characters. The
last relations reset the position to 's', and rank 'ß' as
equal to 'ss'. (Note: The javadoc for this class is neither complete
nor correct. Use the syntax illustrated in the preceding example
for defining ignorables.)
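A possible main method for GermanSort (the city names are made up) shows the rules in action:

    public static void main( String[] args ) throws Exception {
        String[] cities = { "Öhringen", "Oberhausen", "Ostfildern" };
        new GermanSort().sort( cities );
        System.out.println( Arrays.toString( cities ) );
        // prints [Oberhausen, Öhringen, Ostfildern]; plain compareTo
        // would banish "Öhringen" behind "Ostfildern"
    }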
There is, however, a much simpler way to obtain a Collator: the factory method Collator.getInstance(Locale) returns one prepared for the given locale. Its rules can even serve as the starting point for a modified collator, as in this snippet, which removes the trailing '@' (the marker for the French ordering of accents):

    Collator collator = Collator.getInstance( new Locale( "fr", "FR" ) );
    String rules = ((RuleBasedCollator)collator).getRules();  // '@' is last
    rules = rules.substring( 0, rules.length()-1 );
    collator = new RuleBasedCollator( rules );

(Making the preceding code robust is left as an exercise to the reader.)

A Closer Look: Sorting Strings
Comparing Unicode strings according to a rule-based collation is bound to be a non-trivial process, since the collator rules must be taken into account. You can get an idea of what this means when you look at class java.text.CollationElementIterator, which delivers, one by one, the collation elements of a string with respect to a given RuleBasedCollator. Repeating this element-by-element comparison over and over, as a sort does, is wasteful.

This is where class java.text.CollationKey comes in: it represents a string by a key that can be compared cheaply against other keys of the same collator, so the rules have to be applied only once per string.

Putting this all together, an efficient sort of a collection of
strings should create a collection of collation keys and sort it.
Conveniently enough, the class CollationKey retains the string it was created from, available through getSourceString, so the sorted strings are easily recovered:

    public String[] sort( String[] strings ){
        CollationKey[] keys = new CollationKey[strings.length];
        for( int i = 0; i < strings.length; i++ ){
            keys[i] = collator.getCollationKey( strings[i] );
        }
        Arrays.sort( keys );
        String[] sorted = new String[strings.length];
        for( int i = 0; i < sorted.length; i++ ){
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

Supplementary Characters

Supplementary characters, the ones that need to be expressed with surrogate pairs in UTF-16, are uncommon. However, it's important to know where they can turn up, as they may require precautions in your application code. They include historic scripts, musical notation, mathematical alphanumeric symbols, emoji and other pictographs, and rarely used CJK ideographs.
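A two-line check (a sketch reusing methods introduced earlier) shows why precautions are needed: a single supplementary character already breaks the equation “length = number of characters”:

    String clef = new String( Character.toChars( 0x1D11E ) );      // G clef
    System.out.println( clef.length() );                           // 2 code units
    System.out.println( clef.codePointCount( 0, clef.length() ) ); // 1 character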
Property Files
The javadoc for class java.util.Properties states that the simple load and store methods, the ones working on an InputStream and an OutputStream, use the encoding ISO 8859-1, so that any character outside this set must be represented by a Unicode escape.
The overloads load(Reader) and store(Writer, String), available since Java SE 6, let you choose an arbitrary encoding for a property file instead.
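A sketch of reading a UTF-8 property file this way (the method and file names are made up):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.util.Properties;

    public class PropertyLoader {
        static Properties loadUtf8( String fileName ) throws IOException {
            Properties props = new Properties();
            try( Reader reader = new InputStreamReader(
                     new FileInputStream( fileName ), "UTF-8" ) ){
                props.load( reader );   // Reader-based load, Java SE 6 and later
            }
            return props;
        }
    }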
Writing Text Files

We call a file a “text file” if its data is meant to be a sequence of lines containing characters. While any programming language may have its individual concept of the set of characters it handles as values of a character data type (and a singular way of representing a character in memory), things aren't quite as simple as soon as you have to entrust your data to a file system. Other programs, on the same or on another system, should be able to read that data and interpret it so that they come up with the same sequence of characters. Standards institutes and vendors have created an overly rich set of encodings: prescriptions for mapping byte sequences to character sequences. On top of that, there are the various escape mechanisms which let you represent characters not contained in the basic set as sequences of characters from that set. The latter is an issue of interpretation according to various text formats, such as XML or HTML, and we'll skip it here.

Writing a sequence of characters and line terminators to a file should be a simple exercise, and the java.io API, with classes such as OutputStreamWriter and PrintWriter, makes it look simple indeed.

If the set of characters in the text goes beyond what can be represented with one of the legacy encodings that use one 8-bit code unit per character, one of the Unicode encoding schemes UTF-8, UTF-16 or UTF-32 must be chosen, and it should be set explicitly, as it is risky to rely on the default stored in the system property file.encoding.
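A minimal sketch (file name and text are arbitrary) of writing with an explicitly chosen encoding:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;

    public class WriteUtf8 {
        public static void main( String[] args ) throws IOException {
            try( PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(
                         new FileOutputStream( "allo.txt" ), "UTF-8" ) ) ){
                out.println( "All\u00F4 monde!" );
            }
        }
    }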
Which one should you choose, provided you have a choice at all? If size matters, consider that UTF-16 produces two bytes for every character of the Basic Multilingual Plane, whereas UTF-8 is a variable-width encoding, requiring 1, 2, 3 or more bytes per code point. Thus, if your text uses characters from US-ASCII only, the size ratio between UTF-8 and UTF-16 will be 1:2; if you are writing an Arabic text, the ratio is bound to be 1:1, and for CJK it will be 3:2. Compressing the files narrows the distance considerably. Anyway, UTF-8 has become the dominant character encoding for the World Wide Web, and it is increasingly used as the default in operating systems. Therefore it's hard to put up a case against using this very flexible encoding.

Conclusion

Delving into Unicode's mysteries is a highly rewarding adventure. We have seen that Java provides some support for text processing according to the Unicode standard, but you should always keep in mind that this support may not be sufficient for more sophisticated applications. This has been one of the two motives for writing this article. And what was the other one? Ah, yes, having fun!
Copyright 2012 Heinz Kabutz, Chorafakia, Chania, Crete, 73100, Greece