The following code produces the output “Hello World!” (no really, try it).
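The listing itself is missing from this copy. A minimal reconstruction, assuming the original hid a statement behind a `\u000d` escape inside a line comment, might look like the sketch below. The class name `CommentEscapeDemo` and the helper `greeting()` are hypothetical and exist only so the effect can be checked.

```java
class CommentEscapeDemo {
    // Hypothetical helper: returns the message so the effect can be verified.
    static String greeting() {
        String message = "comment was not terminated";
        // The escape in this comment is read as a line break: \u000d message = "Hello World!";
        return message;
    }

    public static void main(String[] args) {
        System.out.println(greeting()); // prints Hello World!
    }
}
```

The assignment looks like part of the comment, but the compiler sees it as a statement on a line of its own.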
- Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding.
- […] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. […]
- This gives a fundamental guarantee for platform independence (independence of supported character sets) which has always been a key goal for the Java platform.
- Being able to write any Unicode character anywhere in the file is a neat feature, and especially important in comments, when documenting code in non-Latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side-effect.
- There are many gotchas on this theme, and Java Puzzlers by Joshua Bloch and Neal Gafter includes the following variant:
Is this a legal Java program?
(This program turns out to be a plain “Hello World” program.)
In the solution to the puzzler, they point out the following:
More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can’t be represented in any other way into your program. Avoid them in all other cases.
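The contrast in that advice can be sketched as follows. This is an illustrative example, not the book's own listing; the class and method names are hypothetical. The first method uses an escape for a character that an ASCII-only source file cannot hold. The second uses escapes for double quotes, which changes where the string literals begin and end.

```java
class EscapeUsageDemo {
    // Reasonable use: e-acute written as an escape, so the file stays pure ASCII.
    static String asciiSafeLiteral() {
        return "caf\u00e9";
    }

    // Confusing use: both escapes denote a double quote, so after translation
    // this is the sum of the lengths of two one-character string literals.
    static int confusingLength() {
        return "a\u0022.length() + \u0022b".length();
    }

    public static void main(String[] args) {
        System.out.println(asciiSafeLiteral().length()); // prints 4
        System.out.println(confusingLength());           // prints 2
    }
}
```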
- In Java source code, \u000d is equivalent in every way to an ASCII CR character. It is a line ending, plain and simple, wherever it occurs. The formatting in the question is misleading; what that sequence of characters actually syntactically corresponds to is:
- IMHO the most correct answer is therefore: the code executes because it isn’t in a comment; it’s on the next line. “Executing code in comments” is not allowed in Java, just like you would expect.
- The \u000d escape terminates a comment because \u escapes are uniformly converted to the corresponding Unicode characters before the program is tokenized. You could equally use \u002f\u002f instead of // to begin a comment.
- This is a bug in your IDE, which should syntax-highlight the line to make it clear that the \u000d ends the comment.
- This is also a design error in the language.
- It can’t be corrected now, because that would break programs that depend on it. \u escapes should either have been converted to the corresponding Unicode character by the compiler only in contexts where that “makes sense” (string literals and identifiers, and probably nowhere else), or have been forbidden to generate characters in the U+0000–007F range, or both.
- Either of those semantics would have prevented the comment from being terminated by the \u000d escape, without interfering with the cases where \u escapes are useful. Note that this includes the use of \u escapes inside comments as a way to encode comments in a non-Latin script, because the text editor could take a broader view of where \u escapes are significant than the compiler does.
- There is a similar design error in the C family, where backslash-newline is processed before comment boundaries are determined. For example:
// this is a comment \
this is still in the comment!
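Returning to the Java case, the "translate first, tokenize later" order described above can be sketched with a toy translator. This is a simplified illustration, not javac's implementation. The class `UnicodeEscapeTranslator` and its methods are hypothetical, malformed escapes are left untouched here (the real compiler rejects them), and the corner cases around escapes that themselves produce a backslash are ignored.

```java
class UnicodeEscapeTranslator {
    // Replaces each eligible escape (a backslash preceded by an even number of
    // backslashes, then one or more 'u', then four hex digits) with its character.
    static String translate(String source) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int backslashRun = 0; // consecutive backslashes immediately before position i
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\' && backslashRun % 2 == 0) {
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++;
                }
                if (j > i + 1 && j + 4 <= source.length() && isHex(source, j)) {
                    out.append((char) Integer.parseInt(source.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0; // simplification, see the note above
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    // True if the four characters starting at 'start' are ASCII hex digits.
    private static boolean isHex(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            char ch = s.charAt(k);
            boolean hex = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'f')
                    || (ch >= 'A' && ch <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String line = "// hidden \\u000d System.out.println(\"Hello World!\");";
        // After translation the carriage return splits the text into two lines,
        // so only the first part is still a comment for a later tokenizer.
        for (String part : translate(line).split("\r")) {
            System.out.println("[" + part + "]");
        }
        // Two escaped slashes become an ordinary comment opener.
        System.out.println(translate("\\u002f\\u002f also a comment"));
    }
}
```

Because the translation runs over the whole text without regard to comments or literals, a later tokenizer has no way to tell an escaped carriage return or slash from one that was typed directly.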