What is Unicode?
- For a computer to be able to store text and numbers that humans can understand, there needs to be a code that transforms characters into numbers. The Unicode standard defines such a code by using character encoding.
- Character encoding is important because it lets every device display the same information. A custom character encoding scheme might work brilliantly on one computer, but problems will occur if you send that same text to someone else.
- All character encoding does is assign a number to every character that can be used. You could make a character encoding right now.
- For example, I could say that the letter A becomes the number 13, a=14, 1=33, #=123, and so on.
- If the whole computer industry uses the same character encoding scheme, every computer can display the same characters.
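- As a minimal sketch of the made-up scheme above, the table below uses the invented values A=13, a=14, 1=33, #=123. The class name ToyEncoding and its methods are hypothetical and exist only for illustration. Decoding only works if the receiver holds exactly the same table, which is the whole argument for a shared standard.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyEncoding {
    // Hypothetical table with the made-up values from the text; not a real standard.
    static final Map<Character, Integer> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put('A', 13);
        TABLE.put('a', 14);
        TABLE.put('1', 33);
        TABLE.put('#', 123);
    }

    // Replace each character with its number from the table.
    static int[] encode(String text) {
        int[] out = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Integer code = TABLE.get(text.charAt(i));
            if (code == null) {
                throw new IllegalArgumentException("No code defined for: " + text.charAt(i));
            }
            out[i] = code;
        }
        return out;
    }

    // Reverse lookup: turn each number back into its character.
    static String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Character found = null;
            for (Map.Entry<Character, Integer> e : TABLE.entrySet()) {
                if (e.getValue() == code) {
                    found = e.getKey();
                    break;
                }
            }
            if (found == null) {
                throw new IllegalArgumentException("Unknown code: " + code);
            }
            sb.append(found);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = encode("Aa1#");
        System.out.println(Arrays.toString(codes)); // [13, 14, 33, 123]
        System.out.println(decode(codes));          // Aa1#
    }
}
```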
What is Unicode?
- ASCII (American Standard Code for Information Interchange) became the first widespread encoding scheme. However, it's limited to only 128 character definitions. This is fine for the most common English characters, numbers, and punctuation, but is a bit limiting for the rest of the world.
- Naturally, the rest of the world wants the same encoding scheme for their characters too. However, for a little while, depending on where you were, there might have been a different character displayed for the same ASCII code.
- In the end, the other parts of the world began creating their own encoding schemes and things started to get a little bit confusing. Not only were the coding schemes of different lengths, but programs also needed to figure out which encoding scheme they were supposed to use.
- It became apparent that a new character encoding scheme was needed, which is when the Unicode standard was created.
- The objective of Unicode is to unify all the different encoding schemes so that the confusion between computers can be limited as much as possible.
- These days, the Unicode standard defines values for over 128,000 characters, which are published by the Unicode Consortium. It has several character encoding forms:
- UTF-8: Uses only one byte (8 bits) to encode English characters. It can use a sequence of bytes to encode other characters. UTF-8 is widely used in email systems and on the internet.
- UTF-16: Uses two bytes (16 bits) to encode the most commonly used characters. If needed, the additional characters can be represented by a pair of 16-bit numbers.
- UTF-32: Uses four bytes (32 bits) to encode the characters. As the Unicode standard grew, it became apparent that a 16-bit number was too small to represent all the characters. UTF-32 is capable of representing every Unicode character as one number.
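- To make the size differences between the three encoding forms concrete, here is a small sketch that counts the bytes each one needs for a few sample characters. The class name EncodingSizes and the helper byteCount are hypothetical. It assumes the Java runtime provides a charset named "UTF-32BE", which common JDKs do but which is not one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Number of bytes the text occupies in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: this JVM ships a "UTF-32BE" charset (not a guaranteed one).
        Charset utf32 = Charset.forName("UTF-32BE");

        // Samples: U+0041, U+00E9, U+20AC, and U+1D160 (written as a surrogate pair).
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD834\uDD60" };

        for (String s : samples) {
            // The big-endian variants are used so that no byte order mark is counted.
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %d%n",
                    s.codePointAt(0),
                    byteCount(s, StandardCharsets.UTF_8),
                    byteCount(s, StandardCharsets.UTF_16BE),
                    byteCount(s, utf32));
        }
    }
}
```

- In this sketch UTF-8 grows from one to four bytes as the code point gets larger, UTF-16 needs two bytes or four, and UTF-32 always needs four.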
- A code point is the value that a character is given in the Unicode standard. The values according to Unicode are written as hexadecimal numbers and have a prefix of U+.
- For example, the characters I looked at earlier have these code points: A is U+0041, a is U+0061, 1 is U+0031, and # is U+0023.
- These code points are split into 17 different sections called planes, identified by numbers 0 through 16. Each plane holds 65,536 code points. The first plane, 0, holds the most commonly used characters, and is known as the Basic Multilingual Plane (BMP).
- The encoding schemes are made up of code units, which are used to provide an index for where a character is positioned on a plane.
- Consider UTF-16 as an example. Each 16-bit number is a code unit, and code units can be transformed into code points. For instance, the musical eighth note symbol has a code point of U+1D160 and lives on the second plane of the Unicode standard (plane 1, the Supplementary Multilingual Plane). It would be encoded using the pair of 16-bit code units 0xD834 and 0xDD60.
- For the BMP, the values of the code points and code units are identical.
- This allows a shortcut for UTF-16 that saves a lot of storage space. It only needs to use one 16-bit number to represent those characters.
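- The relationship between code points, planes, and UTF-16 code units can be sketched with a little arithmetic. The class name Surrogates and its methods are hypothetical. The hand-written conversion is only there to show the steps, and it does not validate its input. In real code, the standard library method Character.toChars does the same job.

```java
import java.util.Arrays;

public class Surrogates {
    // Each plane holds 0x10000 (65,536) code points, so the plane is the part above 16 bits.
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    // UTF-16 code units for a code point, computed by hand (no input validation).
    static char[] toUtf16(int codePoint) {
        if (codePoint < 0x10000) {
            // BMP: the single code unit has the same value as the code point.
            return new char[] { (char) codePoint };
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 0x1D160;
        char[] units = toUtf16(codePoint);

        System.out.println("plane: " + plane(codePoint));                         // plane: 1
        System.out.printf("units: %04X %04X%n", (int) units[0], (int) units[1]);  // units: D834 DD60

        // Cross-check against the standard library.
        System.out.println(Arrays.equals(units, Character.toChars(codePoint)));   // true
    }
}
```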
How Does Java Use Unicode?
- Java was created around the time when the Unicode standard had values defined for a much smaller set of characters. Back then, it was felt that 16 bits would be more than enough to encode all the characters that would ever be needed. With that in mind, Java was designed to use UTF-16. In fact, the char data type was originally used to represent a 16-bit Unicode code point.
- Since Java SE 5.0, a char represents a UTF-16 code unit. It makes little difference for representing characters that are in the Basic Multilingual Plane because the value of the code unit is the same as the code point. However, it does mean that for the characters on the other planes, two chars are needed.
- The important thing to remember is that a single char data type can no longer represent all the Unicode characters.
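- A short sketch of what this means in practice: String.length() counts chars (code units), not characters. The class name CharVsCodePoint and its two helper methods are hypothetical. The sample string holds the letter A followed by U+1D160, written as its two surrogate code units.

```java
public class CharVsCodePoint {
    // Number of 16-bit chars (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "A\uD834\uDD60"; // U+0041 followed by U+1D160

        System.out.println(charCount(s));      // 3
        System.out.println(codePointCount(s)); // 2

        // Iterate by code point rather than by char.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

- Code that loops over a string with charAt sees the two halves of a surrogate pair separately, so methods that work on code points, such as codePointAt and codePoints, are the safer choice when text may contain characters outside the BMP.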