Thursday, July 05, 2007

2 or 3 major bugs in ANTLR 3

No matter what I do I can't seem to make a lexer that works correctly. Weird things have been happening constantly, starting with the multi-megabyte lexer that I always end up with if I use LL(*). So I've been using LL(4) but am still running into a number of problems.

I've been testing with the demo lexer below. Before I show it, I want to point out what's going on with the backslash:
  • in my language, backslash is an escape for producing identifiers out of punctuation, so "\&" is an identifier, as is "\a\b\c" as well as ordinary things like "abc".
  • backslash is punctuation (PUNC) if followed by a space. The space is NOT included in the token.

By the way, should these rules be equivalent (behaviorally)? I think they should:

BUG: ('_' | {false}? . | {false}? .)+; // Version 1
BUG: ('_' | {false}? .)+; // Version 2
BUG: ('_')+; // Version 3

But as you'll see, three different behaviors may result. Another thing I've noticed is that ANTLR seems to use more lookahead than necessary to do prediction, and because of this is somehow more likely to make incorrect predictions.

I've isolated a small set of rules to demonstrate some of the problems I've been having. Note that I usually use predicates like

{input.LA(2) == ' ' || input.LA(2) == '\t'}? '\\'

instead of

('\\' (' ' | '\t')) => '\\'

because as discussed recently, the latter does not work. And it seems that often the former doesn't work either. I've included examples of both apparently failing. Here's the grammar:
/////////////////////////////////////////////////////////////////////////
lexer grammar Bug1;

options
{
language = 'CSharp';
k = 4;
TokenLabelType = CommonToken;
}

@lexer::members {
public bool IsWS(int LA) { int c = input.LA(LA); return c == '
' || c == '\t'; }
public bool IsNewline(int LA) { int c = input.LA(LA); return c ==
'\r' || c == '\n'; }
public bool IsCtrlChar(int LA) { int c = input.LA(LA); return c < 32; }
public bool IsBackslash(int LA) { int c = input.LA(LA); return c == '\\'; }
}

WS: WS_CHAR+;
fragment WS_CHAR: ' ' | '\t';

BUG: ('_' | {false}? . | {false}? .)+; // Version 1
//BUG: ('_' | {false}? .)+; // Version 2
//BUG: ('_')+; // Version 3

ID: ID_LETTER+;
fragment ID_LETTER
: ('a'..'z' | 'A'..'Z')
| {IsBackslash(1) && !IsWS(2) && !IsCtrlChar(2)}?=> '\\' .
;

PUNC: PUNC_CHAR+;
fragment PUNC_CHAR:
{IsWS(2)}? '\\' // Version 1
//{IsBackslash(1) && IsWS(2)}?=> '\\' // Version 2
//('\\' WS_CHAR)=>'\\' // Version 3
| (':'|'.'|'~'|'!'|'@'|'$'|'%'|'^'|'&'
|'*'|'-'|'+'|'='|'|'|','|'['|']'|'?');
/////////////////////////////////////////////////////////////////////////

Next, here's the testing code:

/////////////////////////////////////////////////////////////////////////
public static void Main(string[] args)
{
ParseBug(@"ab");
ParseBug(@"ab& ");
ParseBug(@"ab&\");
ParseBug(@"ab&\cd");
ParseBug(@"ab\ cd");
}
static void ParseBug(string s)
{
System.Console.WriteLine(s);
ANTLRStringStream input = new ANTLRStringStream(s);
Lexer lexerBug = new Bug1Lexer(input);
IToken t;
while ((t = lexerBug.NextToken()).Type != Bug1Lexer.EOF) {
System.Console.WriteLine("{0} [{1}]",
FooParser.tokenNames[t.Type], t.Text);
}
System.Console.WriteLine("");
}
/////////////////////////////////////////////////////////////////////////

N.B. All of these test strings should work, but they don't. ANTLR gives no warnings. The output is:
ab
ID [ab]

ab&
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
PUNC [&]

ab&\
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
line 1:3 rule PUNC_CHAR failed predicate: {input.LA(2) == ' ' ||
input.LA(2) == '\t'}?

ab&\cd
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
line 1:2 required (...)+ loop did not match anything at character '&'
ID [\cd]

ab\ cd
ID [ab]
line 1:2 no viable alternative at character '\'
line 1:3 required (...)+ loop did not match anything at character ' '
ID [cd]
/////////////////////////////////////////////////////////////////////////

Now switch to Version 2 of BUG. The output is:
ab
ID [ab]

ab&
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
PUNC [&]

ab&\
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
line 1:2 rule BUG failed predicate: {false}?
line 1:3 no viable alternative at character '\'
WS [ ]

ab&\cd
line 1:0 rule BUG failed predicate: {false}?
line 1:1 rule BUG failed predicate: {false}?
line 1:2 rule BUG failed predicate: {false}?
ID [\cd]

ab\ cd
ID [ab]
line 1:2 no viable alternative at character '\'
line 1:3 rule BUG failed predicate: {false}?
ID [cd]
/////////////////////////////////////////////////////////////////////////

It is worth noting that this grammar is LL(1) because all other lookahead is done with semantic predicates. However, the above symptoms only occur if k>1. If k=2 then two errors remain; if k=1 then the result is the same as when using Version 3 of BUG.

Without further ado, here's Version 3 of BUG:
ab
ID [ab]

ab&
ID [ab]
PUNC [&]

ab&\
ID [ab]
PUNC [&\]
WS [ ]

ab&\cd
ID [ab]
line 1:3 rule PUNC_CHAR failed predicate: {IsWS(2)}?
ID [cd]

ab\ cd
ID [ab]
PUNC [\]
WS [ ]
ID [cd]
/////////////////////////////////////////////////////////////////////////

That's much better! But there's a problem with the predicate because it is not getting hoisted for some reason. mPUNC() sees a backslash and decides that's all it needs to run mPUNC_CHAR. In fact, mPUNC_CHAR() itself doesn't run the predicate until after it's predicted the result. It escapes me why it would decide that the alternative succeeds, then runs the predicate afterward - which fails, of course.

Now let's try Version 3 of PUNC_CHAR - something slightly different happens:
ab
ID [ab]

ab&
ID [ab]
PUNC [&]

ab&\
ID [ab]
PUNC [&\]
WS [ ]

ab&\cd
ID [ab]
line 1:3 no viable alternative at character '\'
ID [cd]

ab\ cd
ID [ab]
PUNC [\]
WS [ ]
ID [cd]
/////////////////////////////////////////////////////////////////////////

This time, mPUNC decides that a backslash alone is enough to predict a backslash, so it calls mPUNC_CHAR. mPUNC_CHAR actually bothers to run the predicate and, of course, it fails, causing a no viable alt error.

Only Version 2 works as it should:

ab
ID [ab]

ab&
ID [ab]
PUNC [&]

ab&\
ID [ab]
PUNC [&\]
WS [ ]

ab&\cd
ID [ab]
PUNC [&]
ID [\cd]

ab\ cd
ID [ab]
PUNC [\]
WS [ ]
ID [cd]
/////////////////////////////////////////////////////////////////////////

This concludes my little bug report.

0 Comments:

Post a Comment

<< Home