Thursday, July 12, 2007

What's wrong with Java

When I see the features added recently to Java, I'm sure glad I'm using .NET, C# and boo. Even though Java is a lot older than .NET, .NET seems to get the good features first. Care in point: Generics. For a former C++ developer, it seems stupid to give up type-safe collections; I don't know how many years it took before Java got generics (10?) but .NET got them in much less time (less than 4 years, I do believe) and Java was left playing catch-up. In fact, most of the "new" features in Java 5.0 seem to be things that C# had from the beginning:
  • enhanced for loop (foreach in C#)
  • autoboxing/unboxing
  • enums
  • varargs (variable argument lists - "params" in C#)
  • annotations (much like .NET attributes)
Java generics aren't even supported by the JVM, so you get the same performance penalty from casting that you did before. I've always been unhappy with Java's performance (especially for GUI programs), whereas .NET just doesn't seem slow.

Look, Sun, if nothing short of competition from Microsoft can prompt you to improve Java, you must not care very much much about it.

Let's see, what else...
  • Value types (structs). This is a big one for me because can offer a big performance boost in many situations. You don't want to allocate a new object if that object contains nothing more than an integer and some methods, do you? A new object sucks up at least 16-20 bytes of memory even if it just contains one integer or reference; creating it requires multiple method calls and all those bytes have to be initialized. Useful value types include
    • A "Point" type that has X and Y coordinates
    • A "FixedInt" type that contains a fixed-point number (the language must support operator overloading to make it easy to use, of course.)
    • A "BigInt" type that contains a small integer normally, but allocates a memory block for a large integer if necessary.
    • A "Handle" type that contains an opaque reference to something else
    • A "Symbol" type that contains a numeric identifier that represents a string (symbols are a built-in feature of Ruby and are typically used like enums, except they are more flexible)
    • A Pair type that contains a pair of values A and B; often you get better performance by not allocating memory for this purpose.
  • Multi-language support. Well, the JVM can certainly support multiple languages, but only .NET is specifically designed for it. Admittedly, the design isn't that great, but at Microsoft specifically considers the needs of other languages.
  • Delegates. The Java equivalent is using interfaces with one function in them, but this is relatively inflexible and certainly more annoying to use. Java provides inner (even anonymous) classes to help people use the pattern, but delegates are way better.
  • Closures (functions inside other functions, where the inner function can access local variables of the outer function). Java doesn't have that, does it? You can access "final" variables from a function-inside-a-class-inside-a-function, but that's all. By the way, .NET itself doesn't actually support closures, but C# fakes it well.
  • Iterators. Now this may be my favorite feature of C# 2.0; it would be hard to choose between iterators and generics. I love them not only because you can create enumerators easily (which is great) but also because you can approximate coroutines with them.
  • Swing. Ugh! It's ugly, it's slow, and the Windows "skin" isn't very convincing. There often seem to be glitches in Swing that you don't find in other programs, such as the failure to resize a window fluidly (i.e. the window doesn't redraw itself until you let go of your mouse button). Finally, and worst of all, developing Swing GUIs is a huge pain in the ass. I absolutely can't stand it. The .NET counterpart, Windows.Forms, doesn't seem all that well designed, but it looks good, it's relatively fast, and it's easy to write code for it. Plus, of course, a good Forms designer is a standard feature of any .NET IDE.
Right now I wish I could have the C# 3.0 "var" feature because I'm sick of typing

SomeJerkGaveThisClassALongName foo = new SomeJerkGaveThisClassALongName();

Obviously we should be able to write simply

var foo = new SomeJerkGaveThisClassALongName();

And there's a lot of other great stuff in C# 3.0 [.doc]:
  • Lambda expressions (syntactic sugar for anonymous inner functions) with type inference
  • Type inference for generic method calls
  • Extension methods (they are not well thought out, but I'd rather have them than not)
  • Object and collection initializers (to make code more brief)
  • Anonymous POD ("plain old data") classes, which work like tuples except that the fields have names.
Suddenly, C# is starting to seem a lot more like boo.

Having said all this, there are a couple of things from Java that I might like to have in C#:
  • The assert statement. Typing Debug.Assert() all the time is driving me nuts.
  • Inner classes that have an implicit link to the outer class
And let's see, if I could have some more features I think they would include
  • Traits
  • The ability to supply a default implementation for a member of an interface
  • Preconditions and postconditions on methods
  • Static and run-time unit checking (units as in metres, litres, bytes, pixels, etc.)

Thursday, July 05, 2007

2 or 3 major bugs in ANTLR 3

No matter what I do I can't seem to make a lexer that works correctly. Weird things have been happening constantly, starting with the multi-megabyte lexer that I always end up with if I use LL(*). So I've been using LL(4) but am still running into a number of problems.

I've been testing with the demo lexer below. Before I show it, I want to point out what's going on with the backslash:
  • in my language, backslash is an escape for producing identifiers out of punctuation, so "\&" is an identifier, as is "\a\b\c" as well as ordinary things like "abc".
  • backslash is punctuation (PUNC) if followed by a space. The space is NOT included in the token.

By the way, should these rules be equivalent (behaviorally)? I think they should:

BUG: ('_' | {false}? . | {false}? .)+; // Version 1
BUG: ('_' | {false}? .)+; // Version 2
BUG: ('_')+; // Version 3

But as you'll see, three different behaviors may result. Another thing I've noticed is that ANTLR seems to use more lookahead than necessary to do prediction, and because of this is somehow more likely to make incorrect predictions.

I've isolated a small set of rules to demonstrate some of the problems I've been having. Note that I usually use predicates like

{input.LA(2) == ' ' || input.LA(2) == '\t'}? '\\'

instead of

('\\' (' ' | '\t')) => '\\'

because as discussed recently, the latter does not work. And it seems that often the former doesn't work either. I've included examples of both apparently failing. Here's the grammar:
/////////////////////////////////////////////////////////////////////////
lexer grammar Bug1;

options
{
language = 'CSharp';
k = 4;
TokenLabelType = CommonToken;
}

@lexer::members {
public bool IsWS(int LA) { int c = input.LA(LA); return c == '
' || c == '\t'; }
public bool IsNewline(int LA) { int c = input.LA(LA); return c ==
'\r' || c == '\n'; }
public bool IsCtrlChar(int LA) { int c = input.LA(LA); return c < 32; }
public bool IsBackslash(int LA) { int c = input.LA(LA); return c == '\\'; }
}

WS: WS_CHAR+;
fragment WS_CHAR: ' ' | '\t';

BUG: ('_' | {false}? . | {false}? .)+; // Version 1
//BUG: ('_' | {false}? .)+; // Version 2
//BUG: ('_')+; // Version 3

ID: ID_LETTER+;
fragment ID_LETTER
: ('a'..'z' | 'A'..'Z')
| {IsBackslash(1) && !IsWS(2) && !IsCtrlChar(2)}?=> '\\' .
;

PUNC: PUNC_CHAR+;
fragment PUNC_CHAR:
{IsWS(2)}? '\\' // Version 1
//{IsBackslash(1) && IsWS(2)}?=> '\\' // Version 2
//('\\' WS_CHAR)=>'\\' // Version 3
| (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'
|'*'|'-'|'+'|'='|'|'|','|'['|']'|'?');
/////////////////////////////////////////////////////////////////////////

Next, here's the testing code:

/////////////////////////////////////////////////////////////////////////
public static void Main(string[] args)
{
ParseBug(@"ab");
ParseBug(@"ab& ");
ParseBug(@"ab&\");
ParseBug(@"ab&\cd");
ParseBug(@"ab\ cd");
}
static void ParseBug(string s)
{
System.Console.WriteLine(s);
ANTLRStringStream input = new ANTLRStringStream(s);
Lexer lexerBug = new Bug1Lexer(input);
IToken t;
while ((t = lexerBug.NextToken()).Type != Bug1Lexer.EOF) {
System.Console.WriteLine("{0} [{1}]",
FooParser.tokenNames[t.Type], t.Text);
}
System.Console.WriteLine("");
}
/////////////////////////////////////////////////////////////////////////

N.B. All of these test strings should work, but they don't. ANTLR gives no warnings. The output is:
ab
ID [ab]

ab&
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
PUNC [&]

ab&\
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
line 1:3 rule PUNC_CHAR failed predicate: {input.LA(2) == ' ' ||
input.LA(2) == '\t'}?

ab&\cd
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
line 1:2 required (...)+ loop did not match anything at character '&'
ID [\cd]

ab\ cd
ID [ab]
line 1:2 no viable alternative at character '\'
line 1:3 required (...)+ loop did not match anything at character ' '
ID [cd]
/////////////////////////////////////////////////////////////////////////

Now switch to Version 2 of BUG. The output is:
ab
ID [ab]

ab&
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
PUNC [&]

ab&\
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
line 1:2 rule BUG failed predicate: {false}?
line 1:3 no viable alternative at character '\'
WS [ ]

ab&\cd
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
line 1:2 rule BUG failed predicate: {false}?
ID [\cd]

ab\ cd
ID [ab]
line 1:2 no viable alternative at character '\'
line 1:3 rule BUG failed predicate: {false}?
ID [cd]
/////////////////////////////////////////////////////////////////////////

It is worth noting that this grammar is LL(1) because all other lookahead is done with semantic predicates. However, the above symptoms only occur if k>1. If k=2 then two errors remain; if k=1 then the result is the same as when using Version 3 of BUG.

Without further ado, here's Version 3 of BUG:
ab
ID [ab]

ab&
ID [ab]
PUNC [&]

ab&\
ID [ab]
PUNC [&\]
WS [ ]

ab&\cd
ID [ab]
line 1:3 rule PUNC_CHAR failed predicate: {IsWS(2)}?
ID [cd]

ab\ cd
ID [ab]
PUNC [\]
WS [ ]
ID [cd]
/////////////////////////////////////////////////////////////////////////

That's much better! But there's a problem with the predicate because it is not getting hoisted for some reason. mPUNC() sees a backslash and decides that's all it needs to run mPUNC_CHAR. In fact, mPUNC_CHAR() itself doesn't run the predicate until after it's predicted the result. It escapes me why it would decide that the alternative succeeds, then runs the predicate afterward - which fails, of course.

Now let's try Version 3 of PUNC_CHAR - something slightly different happens:
ab
ID [ab]

ab&
ID [ab]
PUNC [&]

ab&\
ID [ab]
PUNC [&\]
WS [ ]

ab&\cd
ID [ab]
line 1:3 no viable alternative at character '\'
ID [cd]

ab\ cd
ID [ab]
PUNC [\]
WS [ ]
ID [cd]
/////////////////////////////////////////////////////////////////////////

This time, mPUNC decides that a backslash alone is enough to predict a backslash, so it calls mPUNC_CHAR. mPUNC_CHAR actually bothers to run the predicate and, of course, it fails, causing a no viable alt error.

Only Version 2 works as it should:

ab
ID [ab]

ab&
ID [ab]
PUNC [&]

ab&\
ID [ab]
PUNC [&\]
WS [ ]

ab&\cd
ID [ab]
PUNC [&]
ID [\cd]

ab\ cd
ID [ab]
PUNC [\]
WS [ ]
ID [cd]
/////////////////////////////////////////////////////////////////////////

This concludes my little bug report.

Monday, July 02, 2007

ANTLR 3 - Customizing token numbers

> I'm not sure if lexers support vocabulary importing (I know
> parsers do), but if they do then you should be able to do it that
> way -- make a tokens file and import it into the lexer. Worth a
> try, anyway :)

Ahh, of course, I should have tried that.

And happily, it works! But I found the following caveats.

There is a bug that occurs when ANTLR imports and then exports a backslash. So if I have
parser grammar FooParser;
options {
tokenVocab=Foo;
...
lexer grammar Foo;
options {
tokenVocab=Foo2;

// In Foo2.tokens
'\\'=25
// In the generated Foo.tokens
'\'=25
'\\'=31 // Added by ANTLR

This causes a syntax error when compiling the parser. And I guess there is another bug in ANTLRWorks because after the syntax error, ANTLRWorks will keep repeating the same error every time you try to Generate Code, until you quit and restart the program.

By the way, I found that

'\\\\'=25

Seems to work as a single backslash.

There is another important caveat: ANTLR cannot handle "holes" when importing tokens into the parser, i.e. unused numbers in the list of tokens. You must start numbering tokens at 4 and continue up from there with consecutive integers. The problem is that the token names array called tokenNames[] in your parser will not have any empty elements in it, so if your tokens are

APPLE=4
GRAPE=5
LEMON=9
PEAR=10

then your token array will be
public static readonly string[] tokenNames = new string[]
{
"[invalid]",
"[eor]",
"[down]",
"[up]",
"APPLE",
"GRAPE",
"LEMON",
"PEAR"
};

Therefore, token name lookups will not work correctly.

On the plus side, you do not have to define all tokens in your .tokens file; ANTLR can add any additional tokens you define in the lexer and will number them correctly.

P.S. I'm using the C# target; perhaps YMMV for Java etc.

Sunday, July 01, 2007

ANTLR Warning suppression workaround

On p.285 of the ANTLR book it implies that you cannot suppress warnings in ANTLR v3 like you could in v2.

However, it appears that a semantic predicate works nicely as a workaround:
// Strings
SQ_STRING: '\''! ({true}? ESC_SEQ | ~'\'')* '\''!;
DQ_STRING: '"'! ({true}? ESC_SEQ | ~'"')* '"'!;
BQ_STRING: '`'! ({true}? ESC_SEQ | ~'`')* '`'!;
fragment ESC_SEQ: '\\' ~('\n'|'\r');
The generated code now contains LL(2) lookahead for no reason, and some redundant code. For example, code that originally read
   if ( (LA19_0 == '\"') )
{
alt19 = 1;
}
Now says
   if ( (LA19_0 == '\"') )
{
int LA19_1 = input.LA(2);
if ( (true) )
{
alt19 = 1;
}
}
However, its behavior appears to be the same.

The compiler will emit some "unreachable code" warnings. In C# you can disable them like this:
grammar Expr;
options {
language=CSharp;
}
@lexer::members {
#pragma warning disable 0162
}
@parser::members {
#pragma warning disable 0162
}