parsing - Which grammars can be parsed using recursive descent without backtracking? -

according "recursive descent parser" on wikipedia, recursive descent without backtracking (a.k.a. predictive parsing) possible ll(k) grammars.

elsewhere, have read implementation of lua uses such parser. however, language not ll(k). in fact, lua inherently ambiguous: a = f(g)(h)[i] = 1 mean a = f(g); (h)[i] = 1 or a = f; (g)(h)[i] = 1? ambiguity resolved greediness in parser (so above parsed erroneous a = f(g)(h)[i]; = 1).

this example seems show predictive parsers can handle grammars not ll(k). true can, in fact, handle superset of ll(k)? if so, there way find out whether given grammar in superset?

in other words, if designing language parse using predictive parser, need restrict language ll(k)? or there looser restriction can apply?

tl;dr

for suitable definition of recursive descent parser it, absolutely correct ll(k) languages can parsed recursive descent.

lua can parsed recursive descent parser precisely because language ll(k); is, ll(k) grammar exists lua. [note 1]

1. ll(k) language may have non-ll(k) grammars.

a language ll(k) if there ll(k) grammar recognizes language. doesn't mean every grammar recognizes language ll(k); there might number of non-ll(k) grammars recognize language. fact grammar not ll(k) says absolutely nothing language itself.

2. many practical programming languages described ambiguous grammar.

in formal language theory, language inherently ambiguous if every grammar language ambiguous. safe no practical programming language inherently ambiguous, since practical programming languages deterministically parsed (somehow). [note 2].

because writing strictly non-ambiguous grammar can tedious, pretty common language documentation provide ambiguous grammar, along textual material indicates how ambiguities resolved.

for example, many languages (including lua) documented grammar not explicitly include operator precedence, allowing simple rule expressions:

exp ::= exp binop exp | unop exp | term

that rule ambiguous, given list of operators, relative precedences , indication of whether each operator left- or right-associative, rule can mechanically expanded unambiguous expression grammar. indeed, many parser generators allow user provide precedence declarations separately, , perform mechanical expansion in course of producing parser. resulting parser, should noted, parser disambiguated grammar ambiguity of original grammar not imply parsing algorithm capable of dealing ambiguous grammars.

another common example of ambiguous reference grammars can mechanically disambiguated "dangling else" ambiguity found in languages c (but not in lua). grammar:

if-statement ::= "if" '(' exp ')' stmt                | "if" '(' exp ')' stmt "else" stmt

is ambiguous; intention parse "greedy". again, ambiguity not inherent. there mechanical transformation produces unambiguous grammar, following:

matched-statement ::= matched-if-stmt | other-statement statement         ::= matched-if-stmt | unmatched-if-stmt matched-if-stmt   ::= "if" '(' exp ')' matched-statement "else" matched-statement  unmatched-if-stmt ::= "if" '(' exp ')' statement                     | "if" '(' exp ')' matched-statement "else" unmatched-if-stmt

it quite common parser generators implicitly perform transformation. (for lr parser generator, transformation implemented deleting reduce actions if conflict shift action. simpler transforming grammar, has same effect.)

so lua (and other programming languages) not inherently ambiguous; , therefore can parsed parsing algorithms require unambiguous deterministic parsers. indeed, might little surprising there languages every possible grammar ambiguous. pointed out in wikipedia article cited above, existence of such languages proven rohit parikh in 1961; simple example of inherently-ambiguous context-free language

{aⁿb^mc^mdⁿ|n,m≥0} ∪ {aⁿbⁿc^md^m|n,m≥0}.

3. greedy ll(1) parsing of lua assignment , function call statements

as dangling else construction above, disambiguation of lua statement sequences performed allowing greedy parse. intuitively, procedure straight-forward; based on forbidding 2 consecutive statements (without intervening semicolon) second 1 starts token might continue first one.

in practice, not necessary perform transformation; can done implicitly during construction of parser. i'm not going bother generate complete lua grammar here. trust small subset of lua grammar here sufficient illustrate how transformation can work.

the following subset (largely based on reference grammar) exhibits precisely ambiguity indicated in op:

program        ::= statement-list statement-list ::= Ø                  | statement-list statement statement      ::= assignment | function-call | block | ';' block          ::= "do" statement-list "end" assignment     ::= var '=' exp exp            ::= prefixexp                          [note 3] prefixexp      ::= var | '(' exp ')' | function-call var            ::= name | prefixexp '[' exp ']' function-call  ::= prefixexp '(' exp ')'

(note: (i'm using Ø represent empty string, rather ε, λ, or %empty.)

the lua grammar left-recursive, not ll(k) (independent of ambiguity). removing left-recursion can done mechanically; i've done enough of here in order demonstrate subset ll(1). unfortunately, transformed grammar not preserve structure of parse tree, classic problem ll(k) grammars. simple reconstruct correct parse tree during recursive descent parse , i'm not going go details.

it simple provide ll(1) version of exp, result eliminates distinction between var (which can assigned to) , function-call (which cannot):

exp          ::= term exp-postfix exp-postfix  ::= Ø                | '[' exp ']' exp-postfix                | '(' exp ')' exp-postfix  term         ::= name | '(' exp ')'

but need recreate distinction in order able parse both assignment statements , function calls. that's straight-forward (but not promote understanding of syntax, imho):

a-or-fc-statement ::= term a-postfix a-postfix         ::= '=' exp                      | ac-postfix c-postfix         ::= Ø                     | ac-postfix ac-postfix        ::= '(' exp ')' c-postfix                     | '[' exp ']' a-postfix

in order make greedy parse unambiguous, need ban (from grammar) occurrence of s₁ s₂ s₁ ends exp , s₂ starts '('. in effect, need distinguish different types of statement, depending on whether or not statement starts (, , independently, whether or not statement ends exp. (in practice, there 3 types because there no statements start ( , not end exp. [note 4])

statement-list ::= Ø                  | s1 statement-list                  | s2 s2-postfix                  | s3 s2-postfix s2-postfix     ::= Ø                  | s1 statement-list                  | s2 s2-postfix s1             ::= block | ';' s2             ::= name a-postfix s3             ::= '(' exp ')' a-postfix

4. recursive descent parsing, , how can modified incorporate disambiguation?

in common usage, predictive recursive descent parser implementation of ll(k) algorithm in each non-terminal mapped procedure. each non-terminal procedure starts using table of possible lookahead sequences of length k decide alternative production non-terminal use, , "executes" production symbol symbol: terminal symbols cause next input symbol discarded if matches or error reported if doesn't match; non-terminal symbols cause non-terminal procedure called.

the tables of lookahead sequences can constructed using first_k , follow_k sets. (a production a→ω mapped sequence α of terminals if α ∈ first_k(ω follow_k(a)).) [note 5]

with definition of recursive descent parsing, recursive descent parser can handle precisely , solely ll(k) languages. [note 6]

however, alignment of ll(k) , recursive descent parsers ignores important aspect of recursive descent parser, is, first , foremost, program written in turing-complete programming language. if program allowed deviate rigid structure described above, parse larger set of languages, languages not context-free. (see, example, c context-sensitivity referenced in note 2.)

in particular, easy add "default" rule table mapping lookaheads productions. tempting optimization because considerably reduces size of lookahead table. commonly, default rule used non-terminals alternatives include empty right-hand side, in case of ll(1) grammar mapped symbol in follow set non-terminal. in implementation, lookahead table includes lookaheads first set, , parser automatically produces empty right-hand side, corresponding immediate return, other symbol. (as similar optimisation in lr(k) parsers, optimization can delay recognition of errors still recognized before additional token read.)

an ll(1) parser cannot include nullable non-terminal first , follow sets contain common element. however, if recursive descent parser uses "default rule" optimization, conflict never noticed during construction of parser. in effect, ignoring conflict allows construction of "greedy" parser (certain) non-deterministic grammars.

that's enormously convenient, because have seen above producing unambiguous greedy grammars lot of work , not lead vaguely resembling clear exposition of language. modified recursive parsing algorithm not more powerful; parses equivalent sll(k) grammar (without constructing grammar).

i not intend provide complete proof of above assertion, first step observe non-terminal can rewritten disjunction of new non-terminals, each single distinct first token, , possibly new non-terminal empty right-hand side. "only" necessary remove non-terminals follow set of nullable non-terminals creating new disjunctions.

notes

here, i'm talking grammar operates on tokenized stream, in comments have been removed , other constructs (such strings delimited "long brackets") reduced single token. without transformation, language not ll(k) (since comments -- can arbitrarily long -- interfere visibility of lookahead token). allows me sidestep question of how long brackets can recognised ll(k) grammar, not particularly relevant question.
there programming languages cannot deterministically parsed context-free grammar. notorious example perl, there well-known c construct (x)*y can parsed deterministically using information symbol x -- whether variable name or type alias -- , difficulties of correctly parsing c++ expressions involving templates. (see, example, questions why can't c++ parsed lr(1) parser? , is c++ context-free or context-sensitive?)
for simplicity, i've removed various literal constants (strings, numbers, booleans, etc.) table constructors , function definitions. these tokens cannot target of function-call, means expression ending 1 of these tokens cannot extended parenthesized expression. removing them simplifies illustration of disambiguation; procedure still possible full grammar, more tedious.
with full grammar, need consider expressions cannot extended (, there 4 distinct options.
there deterministic ll(k) grammars fail produce unambiguous parsing tables using algorithm, sippu & soisalon-soininen call strong ll(k) algorithm. possible augment algorithm using additional parsing state, similar state in lr(k) parser. might convenient particular grammars not change definition of ll(k) languages. sippu & soisalon-soininen demonstrate, possible mechanically derive ll(k) grammar sll(k) grammar produces same language. (see theorem 8.47 in volume 2).
the recursive definition algorithm precise implementation of canonical stack-based ll(k) parser, parser stack implicitly constructed during execution of parser using combination of current continuation , stack of activation records.

Search This Blog

Force Net