Why is executing Java code in comments with certain Unicode characters allowed?

The following code produces the output “Hello World!” (no really, try it).

java code
public static void main(String... args) 
{

   // The comment below is not a typo.
   // \u000d System.out.println("Hello World!");
}
The reason is that the Java compiler translates the Unicode escape \u000d into a carriage return, which is a line terminator, before it recognizes comments. The source is therefore treated as if it were written:

public static void main(String... args) 
{

   // The comment below is not a typo.
   //
   System.out.println("Hello World!");
}
As a result, code that appears to be inside a comment is compiled and executed.
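The translation step can be imitated outside the compiler. The following is a minimal sketch, not the compiler's actual implementation: the class name UnicodeEscapeDemo and both method names are made up for illustration, and malformed escapes (which a real compiler rejects) are simply copied through. It decodes the escapes first and only then splits the text into lines, which is the order that matters here.

```java
import java.util.Arrays;
import java.util.List;

class UnicodeEscapeDemo {

    /**
     * Simplified model of the first translation step: replace each Unicode
     * escape (a backslash, one or more 'u', four hex digits) with the
     * character it denotes. A backslash starts an escape only when it is
     * preceded by an even number of backslashes.
     */
    static String decodeUnicodeEscapes(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // consecutive backslashes immediately before position i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\' && backslashRun % 2 == 0
                    && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++; // any number of 'u' characters is allowed
                }
                if (j + 4 <= source.length() && isHex(source, j, j + 4)) {
                    out.append((char) Integer.parseInt(source.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0;
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Decodes first, then splits on the line terminators CR LF, CR and LF. */
    static List<String> logicalLines(String source) {
        return Arrays.asList(decodeUnicodeEscapes(source).split("\r\n|\r|\n"));
    }

    public static void main(String[] args) {
        // The backslash is doubled here so that the escape reaches the
        // string's value as six plain characters instead of being decoded
        // while this file itself is compiled.
        String line = "   // \\u000d System.out.println(\"Hello World!\");";
        for (String logical : logicalLines(line)) {
            System.out.println("[" + logical + "]");
        }
    }
}
```

Run on the line from the question, the sketch reports two logical lines: the comment marker on the first, and the println call on the second.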

  • Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding.
  • […] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. […]
  • This gives a fundamental guarantee for platform independence (independence of supported character sets) which has always been a key goal for the Java platform.
  • Being able to write any Unicode character anywhere in the file is a neat feature, and especially important in comments when documenting code in non-Latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side effect.
  • There are many gotchas on this theme and Java Puzzlers by Joshua Bloch and Neal Gafter included the following variant:

Is this a legal Java program?

java code
\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020\u0020
\u0063\u006c\u0061\u0073\u0073\u0020\u0055\u0067\u006c\u0079
\u007b\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020
\u0020\u0020\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063
\u0076\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028
\u0053\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0020
\u0020\u0020\u0020\u0020\u0061\u0072\u0067\u0073\u0029\u007b
\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f\u0075\u0074
\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e\u0028\u0020
\u0022\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u0022\u002b
\u0022\u006f\u0072\u006c\u0064\u0022\u0029\u003b\u007d\u007d

(This program turns out to be a plain “Hello World” program.)
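To see what the escaped text says, the escapes can be decoded mechanically. The sketch below is one way to do that, under a few assumptions: the class name UglyDecoder is hypothetical, the regex ignores the rule about escaped backslashes (the puzzler contains none), and Matcher.replaceAll with a function argument needs Java 9 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UglyDecoder {

    // A backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    // The eleven lines of the puzzler. Each backslash is doubled so that the
    // escapes survive compilation of this file and stay in the string values.
    static final String[] LINES = {
        "\\u0070\\u0075\\u0062\\u006c\\u0069\\u0063\\u0020\\u0020\\u0020\\u0020",
        "\\u0063\\u006c\\u0061\\u0073\\u0073\\u0020\\u0055\\u0067\\u006c\\u0079",
        "\\u007b\\u0070\\u0075\\u0062\\u006c\\u0069\\u0063\\u0020\\u0020\\u0020",
        "\\u0020\\u0020\\u0020\\u0020\\u0073\\u0074\\u0061\\u0074\\u0069\\u0063",
        "\\u0076\\u006f\\u0069\\u0064\\u0020\\u006d\\u0061\\u0069\\u006e\\u0028",
        "\\u0053\\u0074\\u0072\\u0069\\u006e\\u0067\\u005b\\u005d\\u0020\\u0020",
        "\\u0020\\u0020\\u0020\\u0020\\u0061\\u0072\\u0067\\u0073\\u0029\\u007b",
        "\\u0053\\u0079\\u0073\\u0074\\u0065\\u006d\\u002e\\u006f\\u0075\\u0074",
        "\\u002e\\u0070\\u0072\\u0069\\u006e\\u0074\\u006c\\u006e\\u0028\\u0020",
        "\\u0022\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0077\\u0022\\u002b",
        "\\u0022\\u006f\\u0072\\u006c\\u0064\\u0022\\u0029\\u003b\\u007d\\u007d"
    };

    /** Replaces every escape in the text with the character it denotes. */
    static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        return m.replaceAll(r -> Matcher.quoteReplacement(
                String.valueOf((char) Integer.parseInt(r.group(1), 16))));
    }

    public static void main(String[] args) {
        for (String line : LINES) {
            System.out.println(decode(line));
        }
    }
}

// Prints (trailing spaces not shown):
// public
// class Ugly
// {public
//     static
// void main(
// String[]
//     args){
// System.out
// .println(
// "Hello w"+
// "orld");}}
```

The decoded text is an ordinary class named Ugly whose main method prints the concatenation of "Hello w" and "orld".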

In the solution to the puzzler, they point out the following:

More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can’t be represented in any other way into your program. Avoid them in all other cases.
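In practice this advice mostly concerns control characters and quotes. Writing '\u000d' or '\u000a' as a char literal does not compile, because the escape becomes a real line terminator inside the literal before the literal is recognized; the ordinary escape sequences \r and \n do not have this problem. A small sketch of the distinction follows (the class and constant names are illustrative only):

```java
class EscapeChoices {

    // Ordinary escape sequences are interpreted inside the literal, after the
    // source has already been split into lines, so they are safe to use for
    // control characters.
    static final char CR = '\r';
    static final char LF = '\n';

    // A Unicode escape fits a character that the source encoding or the
    // keyboard may not handle. U+00E9 is the letter e with an acute accent.
    static final String CAFE = "caf\u00e9";

    public static void main(String[] args) {
        System.out.println((int) CR);                  // 13
        System.out.println((int) LF);                  // 10
        System.out.println(CAFE.length());             // 4
        System.out.println(CAFE.charAt(3) == 0x00E9);  // true
    }
}
```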

  • In Java source code \u000d is equivalent in every way to an ASCII CR character. It is a line ending, plain and simple, wherever it occurs. The formatting in the question is misleading; what that sequence of characters actually corresponds to syntactically is:
java code
public static void main(String... args) 
{
   // The comment below is not a typo.
   //
 System.out.println("Hello World!");
}
  • IMHO the most correct answer is therefore: the code executes because it isn’t in a comment; it’s on the next line. “Executing code in comments” is not allowed in Java, just like you would expect.

  • The \u000d escape terminates a comment because \u escapes are uniformly converted to the corresponding Unicode characters before the program is tokenized. You could equally use \u002f\u002f instead of // to begin a comment.
  • This is a bug in your IDE, which should syntax-highlight the line to make it clear that the \u000d ends the comment.
  • This is also a design error in the language.
  • It can’t be corrected now, because that would break programs that depend on it. \u escapes should either have been converted to the corresponding Unicode character by the compiler only in contexts where that “makes sense” (string literals and identifiers, and probably nowhere else), or have been forbidden to generate characters in the U+0000–007F range, or both.
  • Either of those semantics would have prevented the comment from being terminated by the \u000d escape, without interfering with the cases where \u escapes are useful. That includes the use of \u escapes inside comments as a way to encode comments in a non-Latin script, because the text editor could take a broader view of where \u escapes are significant than the compiler does.
  • There is a similar design error in the C family, where backslash-newline is processed before comment boundaries are determined, so a // comment that ends in a backslash continues onto the following line, e.g.

// this is a comment \
   this is still in the comment!
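Returning to Java: the restriction proposed above, that \u escapes should not produce characters in the U+0000–007F range, cannot be enforced by the language today, but it can be approximated as a check on source files. The following is a rough sketch of such a check, not an existing tool. The class name AsciiEscapeLint and the message format are assumptions, and the regex does not apply the even-backslash rule, so a doubled backslash followed by u and four hex digits inside a string literal would be reported as a false positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AsciiEscapeLint {

    // A backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    /** Returns one message per escape that decodes to U+0000 through U+007F. */
    static List<String> findAsciiEscapes(String source) {
        List<String> findings = new ArrayList<>();
        String[] lines = source.split("\r\n|\r|\n", -1);
        for (int i = 0; i < lines.length; i++) {
            Matcher m = ESCAPE.matcher(lines[i]);
            while (m.find()) {
                int codePoint = Integer.parseInt(m.group(1), 16);
                if (codePoint <= 0x7F) {
                    findings.add(String.format("line %d, column %d: escape for U+%04X",
                            i + 1, m.start() + 1, codePoint));
                }
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Backslashes are doubled so the escapes stay in the string values.
        String source = String.join("\n",
                "// The comment below is not a typo.",
                "// \\u000d System.out.println(\"Hello World!\");",
                "String s = \"caf\\u00e9\";");
        findAsciiEscapes(source).forEach(System.out::println);
        // Prints: line 2, column 4: escape for U+000D
    }
}
```

The escape for U+00E9 on the third line is not reported, since it lies outside the ASCII range and is the kind of use the answers above consider legitimate.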

About the author

Wikitechy Editor
