2.2. Context-Free Syntax Analysis

(1)

Compilers and Language Processing Tools

Summer Term 2011

Prof. Dr. Arnd Poetzsch-Heffter

Software Technology Group TU Kaiserslautern

(2)

Content of Lecture

1. Introduction

2. Syntax and Type Analysis
2.1 Lexical Analysis
2.2 Context-Free Syntax Analysis
2.3 Context-Dependent Syntax Analysis
3. Translation to Target Language
3.1 Translation of Imperative Language Constructs
3.2 Translation of Object-Oriented Language Constructs
4. Selected Aspects of Compilers
4.1 Intermediate Languages
4.2 Optimization
4.3 Data Flow Analysis
4.4 Register Allocation
4.5 Code Generation
5. Garbage Collection

(3)

2.2. Context-Free Syntax Analysis

(4)

Section outline

1. Specification of parsers
2. Implementation of parsers
2.1 Top-down syntax analysis
- Recursive descent
- LL(k) parsing theory
- LL parser generation
2.2 Bottom-up syntax analysis
- Principles of LR parsing
- LR parsing theory
- SLR, LALR, LR(k) parsing
- LALR parser generation

3. Error handling

4. Concrete and abstract syntax

(5)

Task of context-free syntax analysis

• Check if token stream (from scanner) matches context-free syntax of language

  - if erroneous: error handling

  - if correct: construct syntax tree

[Figure: token stream → parser → abstract / concrete syntax tree]

(6)

Task of context-free syntax analysis (2)

Remarks:

• Parsing can be interleaved with other actions processing the program (e.g. attributation).

• Syntax tree controls translation. We distinguish

  - Concrete syntax tree corresponding to the context-free grammar

  - Abstract syntax tree providing a more compact representation tailored to subsequent phases

(7)

2.2.1 Specification of Parsers

(8)

Specification of parsers

2 general specification techniques

• Syntax diagrams

• Context-free grammars (often in extended form)

(9)

Context-Free Grammars

Definition

Let

• N and T be two alphabets with N ∩ T = ∅

• Π a finite subset of N × (N∪T)^∗

• S ∈ N

Then, Γ = (N,T,Π,S) is a context-free grammar (CFG) where

• N is the set of nonterminals

• T is the set of terminals

• Π is the set of production rules

• S is the start symbol (axiom)

(10)

Context-Free Grammars (2)

Notations:

• A, B, C, ... denote nonterminals

• a, b, c, ... denote terminals

• x, y, z, ... denote strings of terminals, i.e. x ∈ T^∗

• α, β, γ, ψ, φ, σ, τ are strings of terminals and nonterminals, i.e. α ∈ (N∪T)^∗

Productions are denoted by A → α.

The notation A → α | β | γ | ... is an abbreviation for A → α, A → β, A → γ, ...

(11)

Derivation

Let Γ = (N,T,Π,S) be a CFG:

• ψ is directly derivable from φ in Γ, and φ directly produces ψ, written as φ ⇒ ψ, if there are σ, τ with

σAτ = φ and σατ = ψ and A → α ∈ Π

• ψ is derivable from φ in Γ, written as φ ⇒^∗ ψ, if there exist φ_0, ..., φ_n with φ = φ_0 and ψ = φ_n and φ_i ⇒ φ_{i+1} for all i ∈ {0, ..., n−1}.

• φ_0, ..., φ_n is called a derivation of ψ from φ.

• ⇒^∗ is the reflexive, transitive closure of ⇒.

(12)

Derivation (2)

• A derivation φ_0, ..., φ_n is a leftmost (rightmost) derivation if in every derivation step φ_i ⇒ φ_{i+1} the leftmost (rightmost) nonterminal in φ_i is replaced.

• Leftmost and rightmost derivation steps are denoted by φ ⇒_lm ψ and φ ⇒_rm ψ, respectively.

• The tree representation of a derivation is a syntax tree.

• L(Γ) = {z ∈ T^∗ | S ⇒^∗ z} is the language generated by Γ.

• x ∈ L(Γ) is a sentence of Γ (German: Satz).

• φ ∈ (N∪T)^∗ with S ⇒^∗ φ is a sentential form of Γ (German: Satzform).

(13)

Derivation (3)

Remarks:

• Each derivation corresponds to exactly one syntax tree. In reverse, for each syntax tree, there can be several derivations.

• For “syntax tree”, the term “derivation tree” is also used.

• For each language, there can be several generating grammars, i.e., the mapping L: Grammar→Language is in general not injective.

(14)

Ambiguity in Grammars

• A sentence is unambiguous if it has exactly one syntax tree. A sentence is ambiguous if it has more than one syntax tree.

• For each syntax tree, there exists exactly one leftmost derivation and exactly one rightmost derivation.

• Thus: A sentence is unambiguous iff it has exactly one leftmost (rightmost) derivation.

• A grammar is ambiguous if its language contains an ambiguous sentence.

• For programming languages, unambiguous grammars are essential, as the semantics and the translation are defined by the syntactic structure.

(15)

Ambiguity in Grammars (2)

Example 1: Grammar Γ₀ for expressions:

• S→E

• E →E+E

• E →E∗E

• E →(E)

• E →ID

Consider the input string

(av+av)∗bv+cv+dv

resulting in the following input for the context-free analysis:

(ID+ID)∗ID+ID+ID

(16)

Ambiguity in Grammars (3)

Syntax tree for (ID+ID)∗ID+ID+ID:

[Figure: one syntax tree for (ID+ID)∗ID+ID+ID according to Γ₀]

• Syntax tree does not match conventional rules of arithmetic.

• There are several syntax trees according to Γ₀ for this input, hence Γ₀ is ambiguous.
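The number of syntax trees that Γ₀ admits for a token sequence can be counted mechanically. The following is a minimal sketch, not part of the lecture material: the function name `count_trees` and the representation of the input as a list of token strings are assumptions made for illustration.

```python
from functools import lru_cache


def count_trees(tokens):
    """Count syntax trees of E for `tokens` under E -> E+E | E*E | (E) | ID."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct syntax trees deriving toks[i:j] from E.
        if j - i == 1:
            return 1 if toks[i] == "ID" else 0
        total = 0
        # Production E -> ( E )
        if j - i >= 3 and toks[i] == "(" and toks[j - 1] == ")":
            total += count(i + 1, j - 1)
        # Productions E -> E + E and E -> E * E:
        # try every operator position as the root of the tree.
        for m in range(i + 1, j - 1):
            if toks[m] in ("+", "*"):
                total += count(i, m) * count(m + 1, j)
        return total

    return count(0, len(toks))


# ID + ID * ID has two trees; the lecture's input has five.
print(count_trees("ID + ID * ID".split()))
print(count_trees("( ID + ID ) * ID + ID + ID".split()))
```

A count greater than one for any sentence shows that the grammar is ambiguous; a count of one for a single sentence says nothing about the grammar as a whole.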


(17)

Ambiguity in Grammars (4)

Example 2: Ambiguity in if-then-else construct

if B1 then if B2 then A := 8 else A := 7

First derivation

[Figure: first syntax tree over the token sequence IF ID THEN IF ID THEN ID EQ CO ELSE ID EQ CO, built from the nonterminals IFTHENELSE, IFTHEN, ANW (statement) and ZW (assignment)]

(18)

Ambiguity in Grammars (5)

Second derivation

if B1 then if B2 then A := 8 else A := 7

[Figure: second syntax tree over the token sequence IF ID THEN IF ID THEN ID EQ CO ELSE ID EQ CO, built from the nonterminals IFTHENELSE, IFTHEN, ANW (statement) and ZW (assignment)]

(19)

Ambiguity as Grammar Property

Ambiguity is a grammar property. The grammar for expressions Γ₀ is an example of an ambiguous grammar.

Γ₀:

• S→E

• E →E+E

• E →E∗E

• E →(E)

• E →ID

[Figure: two different syntax trees for ID + ID ∗ ID according to Γ₀]

(20)

Ambiguity as Grammar Property (2)

But there exists an unambiguous grammar for the same language:

Γ₁:

• S→E

• E →T +E|T

• T →F ∗T|F

• F →(E)|ID

[Figure: syntax tree for (ID+ID)∗ID+ID according to Γ₁]

(21)

Ambiguity as Grammar Property (3)

Remark:

• A context-free language for which every grammar is ambiguous is called inherently ambiguous.

• There are inherently ambiguous CFLs.

(22)

Literature

Recommended reading:

• Wilhelm, Maurer: Chapter 8, pp. 271 - 283 (Syntactic Analysis)

• Appel: Chapter 3, pp. 40-47

(23)

2.2.2 Implementation of Parsers

(24)

Implementation of parsers

Overview

• Top-down parsing
  - Recursive descent
  - LL parsing
  - LL parser generation

• Bottom-up parsing
  - LR parsing
  - LALR, SLR, LR(k) parsing
  - LALR parser generation

(25)

Methods for context-free analysis

• Manually developed, grammar-specific implementation (error-prone, inflexible)

• Backtracking (simple, but inefficient)

• Cocke-Younger-Kasami algorithm (1967):
  - for all CFGs in Chomsky normal form
  - based on the idea of dynamic programming
  - time complexity O(n³) (however, linear complexity is desired)

• Top-down methods: from axiom to word/token stream

• Bottom-up methods: from word/token stream to axiom
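To make the dynamic-programming idea behind the Cocke-Younger-Kasami algorithm concrete, here is a minimal recognizer sketch. The function name `cyk`, the rule encoding as two lists, and the example grammar for {aⁿbⁿ | n ≥ 1} are illustrative assumptions and do not come from the lecture.

```python
def cyk(word, start, unary, binary):
    """Decide whether `word` is derivable from `start` (grammar in Chomsky normal form).

    unary:  list of (A, a)       for productions A -> a
    binary: list of (A, (B, C))  for productions A -> B C
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][l] = set of nonterminals deriving word[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][1] = {a for a, t in unary if t == ch}
    for l in range(2, n + 1):              # length of the substring
        for i in range(n - l + 1):         # start position
            for split in range(1, l):      # length of the left part
                left = table[i][split]
                right = table[i + split][l - split]
                for a, (b, c) in binary:
                    if b in left and c in right:
                        table[i][l].add(a)
    return start in table[0][n]


# Hypothetical example grammar in CNF for { a^n b^n | n >= 1 }:
# S -> A B | A X,  X -> S B,  A -> a,  B -> b
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("A", "X")), ("X", ("S", "B"))]

print(cyk("aabb", "S", UNARY, BINARY))
print(cyk("abab", "S", UNARY, BINARY))
```

The three nested loops over length, start position and split point are the source of the cubic running time mentioned above.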

(26)

Example: Top-down analysis

Top-down analysis leads to leftmost derivation.

Example derivation with Γ₁:

S =>

E =>

T + E =>

F * T + E =>

( E ) * T + E =>

( T + E ) * T + E =>

( F + E ) * T + E =>

( ID + E ) * T + E =>

( ID + T ) * T + E =>

( ID + F ) * T + E =>

( ID + ID ) * T + E =>

( ID + ID ) * F + E =>

( ID + ID ) * ID + E =>

( ID + ID ) * ID + T =>

( ID + ID ) * ID + F =>

( ID + ID ) * ID + ID


(27)

Example: Bottom-up analysis

Bottom-up analysis leads to rightmost derivation.

Example derivation with Γ₁:

( ID + ID ) * ID + ID <=

( F + ID ) * ID + ID <=

( T + ID ) * ID + ID <=

( T + F ) * ID + ID <=

( T + T ) * ID + ID <=

( T + E ) * ID + ID <=

( E ) * ID + ID <=

F * ID + ID <=

F * F + ID <=

F * T + ID <=

T + ID <=

T + F <=

T + T <=

T + E <=

E <=

S

(28)

Context-free analysis with linear complexity

• Restrictions on grammar (not every CFG has a linear parser)

• Use of push-down automata or systems of recursive procedures

• Usage of look ahead to remaining input in order to select next production rule to be applied

(29)

Syntax analysis methods and parser generators

• Basic knowledge of syntax analysis is essential for use of parser generators.

• Parser generators are not always applicable.

• Often, error handling has to be done manually.

• The methods underlying parser generation are a good example of a generic technique (and a highlight of computer science!).

(30)

2.2.2.1 Top-down syntax analysis

(31)

Top-down syntax analysis

Learning objectives

• Understand the general principle of top-down syntax analysis

• Be able to implement recursive descent parsing (by example)

• Know expressiveness and limitations of top-down parsing

• Understand the basic concepts of LL(k) parsing

(32)

Recursive descent parsing

Basic idea

• Each nonterminal A is associated with a procedure. This procedure accepts a partial sentence derived from A.

• The procedure implements a finite automaton constructed from the productions with A as left-hand side. This automaton is called the item automaton of A.

• The recursion in the grammar is mapped to mutually recursive procedures, so that the call stack of the implementation language handles the recursion.

(33)

Construction of recursive descent parser

Let Γ₁′ be a CFG accepting w# iff w ∈ L(Γ₁), i.e., # is used as a special character denoting the end of the input.

Γ₁′:

• S → E#

• E → T+E | T

• T → F∗T | F

• F → (E) | ID

Construct the item automaton for each nonterminal.

(34)

Item automata

S → E#

[S → .E#] --E--> [S → E.#] --#--> [S → E#.]

E → T+E | T

[E → .T+E] --T--> [E → T.+E] --+--> [E → T+.E] --E--> [E → T+E.]
[E → .T] --T--> [E → T.]

T → F∗T | F

[T → .F∗T] --F--> [T → F.∗T] --∗--> [T → F∗.T] --T--> [T → F∗T.]
[T → .F] --F--> [T → F.]

(35)

Item automata (2)

F → (E) | ID

[F → .(E)] --(--> [F → (.E)] --E--> [F → (E.)] --)--> [F → (E).]
[F → .ID] --ID--> [F → ID.]

(36)

Recursive descent parsing procedures

• The recursive procedures are constructed from the item automata.

• The input is a token stream terminated by #.

• The variable currToken contains one token of look ahead, i.e., the first symbol of the input rest.

(37)

Recursive descent parsing procedures (2)

Production: S → E#

void S() {
    E();
    if (currToken == '#') {
        accept();
    } else {
        error();
    }
}

(38)

Recursive descent parsing procedures (3)

Production: E → T+E | T

void E() {
    T();
    if (currToken == '+') {
        readToken();
        E();
    }
}

Production: T → F∗T | F

void T() {
    F();
    if (currToken == '*') {
        readToken();
        T();
    }
}

(39)

Recursive descent parsing procedures (4)

Production: F → (E) | ID

void F() {
    if (currToken == '(') {
        readToken();
        E();
        if (currToken == ')') {
            readToken();
        } else error();
    } else if (currToken == ID) {
        readToken();
    } else error();
}

(40)

Recursive descent parsing procedures (5)

Remarks:

• Recursive descent

I is relatively easy to implement

I can easily be used with other tasks (see following example)

I is a typical example for syntax-directed methods (see also following example)

• Example uses one token look ahead.

• Error handling is not considered.

(41)

Recursive descent and evaluation

Example: Interpreter for expressions using recursive descent

int env(Ident);  // Ident -> int

// the local variables imr store intermediate results

int S() {
    int imr = E();
    if (currToken == '#') {
        return imr;
    } else {
        error();
        return err_result;
    }
}

(42)

Recursive descent and evaluation (2)

int E() {
    int imr = T();
    if (currToken == '+') {
        readToken();
        return imr + E();
    }
    return imr;  // production E → T
}

int T() {
    int imr = F();
    if (currToken == '*') {
        readToken();
        return imr * T();
    }
    return imr;  // production T → F
}

(43)

Recursive descent and evaluation (3)

int F() {
    int imr;
    if (currToken == '(') {
        readToken();
        imr = E();
        if (currToken == ')') {
            readToken();
            return imr;
        } else {
            error();
            return err_result;
        }
    } else if (currToken == ID) {
        readToken();
        return env(code(ID));  // code(ID): identifier carried by the ID token
    } else {
        error();
        return err_result;
    }
}
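For readers who want to run the interpreter, the following is a minimal sketch of the same recursive descent scheme in Python. It is not the lecture's code: the token representation (a list of strings ending in `#`), the dictionary `env` in place of the function `env`, and the use of exceptions instead of `error()` are assumptions made for illustration.

```python
def evaluate(tokens, env):
    """Evaluate a '#'-terminated token list; identifiers are looked up in env."""
    pos = 0

    def curr():
        # One token of look ahead: the first symbol of the input rest.
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def S():                      # S -> E #
        value = E()
        if curr() != "#":
            raise SyntaxError("expected '#'")
        return value

    def E():                      # E -> T + E | T
        value = T()
        if curr() == "+":
            advance()
            return value + E()
        return value

    def T():                      # T -> F * T | F
        value = F()
        if curr() == "*":
            advance()
            return value * T()
        return value

    def F():                      # F -> ( E ) | ID
        tok = curr()
        if tok == "(":
            advance()
            value = E()
            if curr() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return value
        if tok in env:            # any name bound in env counts as an ID token
            advance()
            return env[tok]
        raise SyntaxError(f"unexpected token {tok!r}")

    return S()


print(evaluate("( a + b ) * c + d #".split(), {"a": 1, "b": 2, "c": 3, "d": 4}))
```

Each nested function corresponds to one nonterminal, and the Python call stack plays the role of the parser stack.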

(44)

Recursive descent and evaluation (4)

• Extending the parser with actions/computations is easy to implement, but it mixes conceptually different phases/tasks and makes programs hard to maintain.

• Question: For which grammars does the recursive descent technique work?

→LL(k) parsing theory

(45)

LL parsing

• Basis for top-down syntax analysis

• First “L” refers to reading input from left to right

• Second “L” refers to search for leftmost derivations

(46)

LL(k) grammars

Definition (LL(k) grammar)

Let Γ = (N,T,Π,S) be a CFG and k ∈ ℕ.

Γ is an LL(k) grammar if for any two leftmost derivations

S ⇒^∗_lm uAα ⇒_lm uβα ⇒^∗_lm ux and

S ⇒^∗_lm uAα ⇒_lm uγα ⇒^∗_lm uy

the following holds:

if prefix(k,x) = prefix(k,y), then β = γ

where prefix(k,x) yields the longest prefix of x with length ≤ k.

(47)

LL(k) grammars (2)

Remarks:

• A grammar is an LL(k) grammar if for a leftmost derivation with k token look ahead the correct production for the next derivation step can be found.

• A language L_k ⊆ Σ^∗ is LL(k) if there exists an LL(k) grammar Γ with L(Γ) = L_k.

• The definition of LL(k) grammars provides no method to test if a grammar has the LL(k) property.

(48)

Non LL(k) grammars

Example 1: Grammar with left recursion Γ₂:

• S → E#

• E → E+T | T

• T → T∗F | F

• F → (E) | ID

Elimination of left recursion:

Replace productions of the form A → Aα | β, where β does not start with A, by A → βA′ and A′ → αA′ | ε.

(49)

Non LL(k) grammars (2)

Elimination of left recursion: From Γ₂ we obtain Γ₃.

Γ₂:

• S → E#

• E → E+T | T

• T → T∗F | F

• F → (E) | ID

Γ₃:

• S → E#

• E → TE′

• E′ → +TE′ | ε

• T → FT′

• T′ → ∗FT′ | ε

• F → (E) | ID
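The replacement rule can be applied mechanically. The sketch below handles only immediate left recursion (productions of the form A → Aα), which is all that Γ₂ needs. The grammar encoding as a dictionary of tuples, the function name, and the convention of naming the new nonterminal by appending `'` are assumptions for illustration; the sketch also assumes that the primed name is not already in use.

```python
def eliminate_left_recursion(grammar):
    """Remove immediate left recursion.

    grammar: dict mapping a nonterminal to a list of right-hand sides,
             each a tuple of symbols; the empty tuple stands for ε.
    """
    result = {}
    for a, rhss in grammar.items():
        alphas = [rhs[1:] for rhs in rhss if rhs and rhs[0] == a]   # A -> A α
        betas = [rhs for rhs in rhss if not rhs or rhs[0] != a]     # A -> β
        if not alphas:
            result[a] = list(rhss)
            continue
        a_new = a + "'"
        result[a] = [beta + (a_new,) for beta in betas]              # A  -> β A'
        result[a_new] = [alpha + (a_new,) for alpha in alphas] + [()]  # A' -> α A' | ε
    return result


GAMMA_2 = {
    "S": [("E", "#")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("ID",)],
}

for lhs, rhss in eliminate_left_recursion(GAMMA_2).items():
    print(lhs, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhss))
```

Applied to Γ₂, this yields the productions of Γ₃ listed above.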

(50)

Non LL(k) grammars (3)

Example 2: Grammar Γ₄ with unlimited look ahead

• STM → VAR := VAR | ID(IDLIST)

• VAR → ID | ID(IDLIST)

• IDLIST → ID | ID, IDLIST

Γ₄ is not an LL(k) grammar for any k.

(Proof: cf. Wilhelm, Maurer, Example 8.3.4, p. 319)

Transformation to LL(2) grammar Γ₄′:

• STM → ASS_CALL | ID := VAR

• ASS_CALL → ID(IDLIST) ASS_CALL_REST

• ASS_CALL_REST → := VAR | ε

(51)

Non LL(k) grammars (4)

Remarks:

• The transformed grammars accept the same language, but generate other syntax trees:

I From a theoretical point of view, this is acceptable.

I From a programming language implementation perspective, this is in generalnotacceptable.

• There are languages L for which no LL(k) grammar Γ exists that generates the language, i.e. L(Γ) = L. (Example: the language of grammar Γ₅)

(52)

Non LL(k) grammars (5)

Example 3:

For the following grammar Γ₅, there is no k such that Γ₅ is an LL(k) grammar.

• S → A | B

• A → aAb | 0

• B → aBbb | 1

Remark:

For L(Γ₅), there exists no LL(k) grammar.

Proof (that Γ₅ is not LL(k)).

Let k be arbitrary, but fixed.

Choose two derivations according to the LL(k) definition and show that, despite equal prefixes of length k, β and γ are not equal:

S ⇒^∗_lm S ⇒_lm A ⇒^∗_lm a^k 0 b^k

S ⇒^∗_lm S ⇒_lm B ⇒^∗_lm a^k 1 b^2k

Then: prefix(k, a^k 0 b^k) = a^k = prefix(k, a^k 1 b^2k), but β = A ≠ B = γ.
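The two witness sentences can be checked for any concrete k. This is only an illustration of the argument, with `prefix` and `witness` as hypothetical helper names and strings standing in for token sequences.

```python
def prefix(k, x):
    """Longest prefix of x with length <= k."""
    return x[:k]


def witness(k):
    """The two sentences used in the proof: a^k 0 b^k and a^k 1 b^(2k)."""
    x = "a" * k + "0" + "b" * k
    y = "a" * k + "1" + "b" * (2 * k)
    return x, y


for k in range(1, 4):
    x, y = witness(k)
    # Same k-symbol look ahead, although the derivations start with S -> A and S -> B.
    print(k, prefix(k, x) == prefix(k, y), x, y)
```

Whatever k is chosen, the look ahead consists only of the symbol a, so it cannot distinguish the production S → A from S → B.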

(53)

FIRST and FOLLOW sets

Definition

Let Γ = (N,T,Π,S) be a CFG, k ∈ ℕ;

T^≤k = {u ∈ T^∗ | length(u) ≤ k} denotes the set of all terminal strings of length at most k. We define:

• FIRST_k : (N∪T)^∗ → P(T^≤k)

FIRST_k(α) = {prefix(k,u) | α ⇒^∗ u}

where prefix(n,u) = u for all u with length(u) ≤ n.

• FOLLOW_k : (N∪T)^∗ → P(T^≤k)

FOLLOW_k(α) = {w | S ⇒^∗ βαγ ∧ w ∈ FIRST_k(γ)}

(54)

FIRST and FOLLOW sets in parse trees

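[Figure: parse tree with root S and an inner node X; FIRST_k(X) covers the beginning of the terminal string derived from X, FOLLOW_k(X) the beginning of the terminal string to the right of it]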

(55)

Characterization of LL(1) grammars

Definition (reduced CFG)

A CFG Γ = (N,T,Π,S) is reduced if each nonterminal occurs in a derivation and each nonterminal derives at least one word.

Lemma

A reduced CFG is LL(1) iff for any two distinct productions A → β and A → γ the following holds:

(FIRST₁(β) ⊕₁ FOLLOW₁(A)) ∩ (FIRST₁(γ) ⊕₁ FOLLOW₁(A)) = ∅

where L₁ ⊕₁ L₂ = {prefix(1,vw) | v ∈ L₁, w ∈ L₂}

Remark: FIRST and FOLLOW sets are computable, so this criterion can be checked automatically.

(56)

Example: FIRST_k and FOLLOW_k

Check that the modified expression grammar Γ₃ is LL(1).

• S → E#

• E → TE′

• E′ → +TE′ | ε

• T → FT′

• T′ → ∗FT′ | ε

• F → (E) | ID

Compute FIRST₁ and FOLLOW₁ for each nonterminal.
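The sets for Γ₃ can be obtained with a standard fixpoint iteration. The sketch below is one possible way to do this for k = 1, not the lecture's algorithm: the grammar encoding, the use of the empty string for ε, and the function names are assumptions. The final check applies the criterion of the lemma, using the fact that FIRST₁(β) ⊕₁ FOLLOW₁(A) equals FIRST₁(β) if ε ∉ FIRST₁(β), and (FIRST₁(β) \ {ε}) ∪ FOLLOW₁(A) otherwise.

```python
EPS = ""  # stands for the empty word ε

GAMMA_3 = {
    "S": [("E", "#")],
    "E": [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T": [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F": [("(", "E", ")"), ("ID",)],
}


def first_of(seq, first):
    """FIRST_1 of a symbol sequence, given FIRST_1 of the nonterminals."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in first else {sym}  # terminal: itself
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol can derive ε (or seq is empty)
    return out


def first_follow(grammar, start):
    first = {a: set() for a in grammar}
    follow = {a: set() for a in grammar}
    follow[start].add(EPS)  # nothing follows the start symbol
    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                new_first = first_of(rhs, first)
                if not new_first <= first[a]:
                    first[a] |= new_first
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # FOLLOW is only tracked for nonterminals
                    rest = first_of(rhs[i + 1:], first)
                    new_follow = rest - {EPS}
                    if EPS in rest:
                        new_follow |= follow[a]
                    if not new_follow <= follow[sym]:
                        follow[sym] |= new_follow
                        changed = True
    return first, follow


def is_ll1(grammar, first, follow):
    """Check pairwise disjointness of FIRST_1(rhs) ⊕_1 FOLLOW_1(A)."""
    for a, rhss in grammar.items():
        seen = set()
        for rhs in rhss:
            f = first_of(rhs, first)
            look = (f - {EPS}) | (follow[a] if EPS in f else set())
            if look & seen:
                return False
            seen |= look
    return True


FIRST, FOLLOW = first_follow(GAMMA_3, "S")
print(FOLLOW["E'"], FOLLOW["T'"], is_ll1(GAMMA_3, FIRST, FOLLOW))
```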

(57)

Example: FIRST_k and FOLLOW_k (2)

• F → (E) | ID:

(FIRST₁((E)) ⊕₁ FOLLOW₁(F)) ∩ (FIRST₁(ID) ⊕₁ FOLLOW₁(F))

= ({(} ⊕₁ FOLLOW₁(F)) ∩ ({ID} ⊕₁ FOLLOW₁(F))

= {(} ∩ {ID}

= ∅

• E′ → +TE′ | ε:

(FIRST₁(+TE′) ⊕₁ FOLLOW₁(E′)) ∩ (FIRST₁(ε) ⊕₁ FOLLOW₁(E′))

= ({+} ⊕₁ FOLLOW₁(E′)) ∩ ({ε} ⊕₁ FOLLOW₁(E′))

= {+} ∩ {#, )}

= ∅

• T′ → ∗FT′ | ε:

(FIRST₁(∗FT′) ⊕₁ FOLLOW₁(T′)) ∩ (FIRST₁(ε) ⊕₁ FOLLOW₁(T′))

= ({∗} ⊕₁ FOLLOW₁(T′)) ∩ ({ε} ⊕₁ FOLLOW₁(T′))

= {∗} ∩ {+, #, )}

= ∅

(58)

Proof of LL characterization lemma

• Direction from left to right:

Γ is LL(1) implies FIRST-FOLLOW disjointness.

Proof by contradiction:

(“FIRST-FOLLOW intersection non-empty” implies “not LL(1)”)

Let A → β and A → γ be two distinct productions of Γ (β ≠ γ) such that the FIRST-FOLLOW intersection is non-empty.

Case distinction. We consider three cases:

Case 1: β ⇒^∗ ε and γ ⇒^∗ ε

In this case, the LL(1) property does not hold for A → β, A → γ.

(59)

Proof of LL characterization lemma (2)

Case 2: β ⇏^∗ ε

Then, there is a z with length(z) = 1 and

z ∈ (FIRST₁(β) ⊕₁ FOLLOW₁(A)) ∩ (FIRST₁(γ) ⊕₁ FOLLOW₁(A))

Because Γ is reduced, there are two derivations:

S ⇒^∗ ψAα ⇒ ψβα ⇒^∗ ψzx

S ⇒^∗ ψAα ⇒ ψγα ⇒^∗ ψzy

and there is a u such that ψ ⇒^∗ u, i.e., there are leftmost derivations

S ⇒^∗_lm uAα ⇒_lm uβα ⇒^∗_lm uzx

S ⇒^∗_lm uAα ⇒_lm uγα ⇒^∗_lm uzy

But prefix(1,zx) = z = prefix(1,zy) contradicts the LL(1) property of Γ.

Case 3: γ ⇏^∗ ε: similar to Case 2.

(60)

Proof of LL characterization lemma (3)

• Direction from right to left:

FIRST-FOLLOW disjointness impliesΓis LL(1):

Proof:

Consider any two derivations with β ≠ γ:

S ⇒^∗_lm uAα ⇒_lm uβα ⇒^∗_lm ux

S ⇒^∗_lm uAα ⇒_lm uγα ⇒^∗_lm uy

that is, prefix(1,x) ∈ (FIRST₁(β) ⊕₁ FOLLOW₁(A)) and prefix(1,y) ∈ (FIRST₁(γ) ⊕₁ FOLLOW₁(A)). Because of FIRST-FOLLOW disjointness, prefix(1,x) ≠ prefix(1,y).

(61)

Parser generation for LL(k) languages

[Figure: an LL(k) parser generator takes a grammar as input and produces either a table for a push-down automaton / a parser program, or the error "Grammar is not LL(k)"]

(62)

Parser generation for LL(k) languages (2)

Remarks:

• Use of push-down automata with look ahead

• Select production from tables

• Advantages over bottom-up techniques in error analysis and error handling

Example system: ANTLR (http://www.antlr.org/)

Recommended reading for top-down analysis:

• Wilhelm, Maurer: Chapter 8, Sections 8.3.1 to 8.3.4, pp. 312 - 329

(63)

2.2.2.2 Bottom-up syntax analysis

(64)

Bottom-up syntax analysis

Learning objectives:

• General principles of bottom-up syntax analysis

• LR(k) analysis

• Resolving conflicts in parser generation

• Connection between CFGs and push-down automata

(65)

Basic ideas: bottom-up syntax analysis

• Bottom-up analysis is more powerful than top-down analysis, since production is chosen at the end of the analysis while in top-down analysis the production is selected up front.

• LR: read input from left (L)

and search for rightmost derivations (R)

(66)

Principles of LR parsing

1. Reduce from sentence to axiom according to the productions of Γ.

2. Reduction yields sentential forms αx with α ∈ (N∪T)^∗ and x ∈ T^∗, where x is the input rest.

3. α has to be a prefix of a right sentential form of Γ. Such prefixes are called viable prefixes. This prefix property has to hold invariantly during LR parsing to avoid dead ends.

4. Reductions are always made at the leftmost possible position.

More precisely:

(67)

Viable prefix

Definition

Let S ⇒^∗_rm βAu ⇒_rm βαu be a rightmost derivation, i.e., βαu is a right sentential form of Γ.

Then α is called a handle or redex of the right sentential form βαu. Each prefix of βα is a viable prefix of Γ.

(68)

Regularity of viable prefixes

Theorem

The language of viable prefixes of a grammarΓis regular.

Proof.

Cf. Wilhelm, Maurer, Thm. 8.4.1 and Corollary 8.4.2.1 (pp. 361, 362).

Essential proof steps are illustrated in the following by the construction of the LR-DFA(Γ).

(69)

Examples: towards LR parsing

• Consider Γ₁
  - S → aCD
  - C → b
  - D → a | b

Analysis of aba can lead to a dead end (cf. lecture).

Considering viable prefixes can avoid this.

(70)

Examples: towards LR parsing (2)

• Consider Γ₂
  - S → E#
  - E → a | (E) | EE

Analysis of ((a))(a)# (cf. lecture)

A stack can manage the prefixes already read.

(71)

Examples: towards LR parsing (3)

• Consider Γ₃
  - S → E#
  - E → E+T | T
  - T → ID

Analysis of ID + ID + ID # (cf. lecture)

(72)

LR parsing: shift and reduce actions

Schematic syntax tree for input xay with α ∈ (N∪T)^∗, a ∈ T, x, y ∈ T^∗ and start symbol S:

[Figure: schematic syntax tree with root S over the input x a y, with the position of the read pointer marked]

(73)

LR parsing: shift and reduce actions (2)

Shift step:

[Figure: shift step on the input x a y; the read pointer advances over the next terminal a]

Reduce step:

[Figure: reduce step on the input x a y; a handle in front of the read pointer is replaced by the left-hand side of its production, the read pointer does not move]

(74)

LR parsing: shift and reduce actions (3)

Problems:

• Make sure that all reductions guarantee that the resulting prefix remains a viable prefix.

• When to shift? When to reduce? Which production to use?

Solution:

For each grammar Γ, construct the automaton LR-DFA(Γ) (also called LR(0) automaton), which describes the viable prefixes.

(75)

Construction of LR-DFA

Let Γ = (N,T,Π,S) be a CFG.

• For each nonterminal A ∈ N, construct the item automaton.

• Build the union of the item automata: the start state is the start state of the item automaton for S, the final states are the final states of the item automata.

• Add ε-transitions from each state that contains the dot in front of a nonterminal A to the start state of the item automaton of A.

• Make the resulting automaton deterministic (power set construction); the result is LR-DFA(Γ).

Theorem

The automaton obtained from LR-DFA(Γ) by declaring all states to be final states accepts exactly the language of viable prefixes of Γ.
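The states of the deterministic automaton can also be computed directly, without building the nondeterministic automaton first. The sketch below uses the closure/goto formulation found in many textbooks, which combines the ε-transitions and the power set construction into one step; it is a swapped-in technique, not the construction described above. The item encoding as (production index, dot position) and all names are assumptions, and the state numbering it produces need not match the numbering q₀, ..., q₇ used in the lecture. The error state is not represented; a missing transition plays its role.

```python
# Productions of the lecture's grammar Γ₃, numbered by list index.
GRAMMAR = [
    ("S", ("E", "#")),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("ID",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def closure(items):
    """Add [B -> .γ] for every item with the dot in front of nonterminal B."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for idx, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot] and (idx, 0) not in result:
                    result.add((idx, 0))
                    work.append((idx, 0))
    return frozenset(result)


def goto(state, symbol):
    """Move the dot over `symbol` in all items that allow it."""
    moved = {(p, d + 1) for p, d in state
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved) if moved else None


def build_lr0():
    start = closure({(0, 0)})
    states, transitions, work = [start], {}, [start]
    while work:
        state = work.pop()
        symbols = {GRAMMAR[p][1][d] for p, d in state if d < len(GRAMMAR[p][1])}
        for sym in sorted(symbols):
            target = goto(state, sym)
            if target not in states:
                states.append(target)
                work.append(target)
            transitions[(states.index(state), sym)] = states.index(target)
    return states, transitions


STATES, TRANSITIONS = build_lr0()
for number, state in enumerate(STATES):
    print(number, sorted(state))
```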

(76)

Example: Construction of LR-DFA

Γ₃: S → E#, E → E+T | T, T → ID

Item automata:

[S → .E#] --E--> [S → E.#] --#--> [S → E#.]
[E → .E+T] --E--> [E → E.+T] --+--> [E → E+.T] --T--> [E → E+T.]
[E → .T] --T--> [E → T.]
[T → .ID] --ID--> [T → ID.]

ε-transitions lead from each item with the dot in front of a nonterminal A to the start items of A, e.g. from [S → .E#] to [E → .E+T] and [E → .T].

Making this automaton deterministic yields the following automaton:

(77)

Example: Construction of LR-DFA (2)

Power set construction:

q0: [S → .E#], [E → .E+T], [E → .T], [T → .ID]
q1: [S → E.#], [E → E.+T]
q2: [S → E#.]
q3: [E → E+.T], [T → .ID]
q4: [E → E+T.]
q5: [E → T.]
q6: [T → ID.]
q7: error state

Transitions: q0 --E--> q1, q0 --T--> q5, q0 --ID--> q6, q1 --#--> q2, q1 --+--> q3, q3 --T--> q4, q3 --ID--> q6. All other transitions are error transitions leading to q7.

Viable prefixes of maximal length: E#, T, ID, E+ID, E+T

(78)

Example: Construction of LR-DFA (3)

Remarks:

• In the example, each final state contains exactly one completely read production. This is in general not the case.

• If a final state contains more than one completely read production, we have a reduce/reduce conflict.

• If a final state contains a completely read production and an incompletely read production with a terminal after the dot, we have a shift/reduce conflict.

(79)

Analysis with LR-DFA

Analysis of ID + ID + ID # with LR-DFA (the viable prefix is the part before the marker |):

ID | + ID + ID # <=

T | + ID + ID # <=

E + ID | + ID # <=

E + T | + ID # <=

E + ID | # <=

E + T | # <=

E # | <=

S

(80)

Analysis with LR-DFA (2)

Note:

• The sentential forms always consist of a viable prefix and an input rest.

• If an LR-DFA is used, after each reduction the sentential form has to be read from the beginning.

Thus: Use pushdown automaton for analysis.

(81)

LR pushdown automaton

Definition

Let Γ = (N,T,Π,S) be a CFG. The LR-DFA pushdown automaton for Γ contains:

• a finite set of states Q (the states of LR-DFA(Γ))

• a set of actions Act = {shift, accept, error} ∪ red(Π), where red(Π) contains an action reduce(A → α) for each A → α ∈ Π

• an action table at: Q → Act

• a successor table succ: P × (N∪T) → Q with P = {q ∈ Q | at(q) = shift}

(82)

LR pushdown automaton (2)

Remarks:

• The LR-DFA pushdown automaton is a variant of pushdown automata particularly designed for LR parsing.

• States encode the read left context.

• If there are no conflicts, the action table can be directly constructed from the LR-DFA:

  - accept: final state of the item automaton of the start symbol

  - reduce: all other final states

  - error: error state

  - shift: all other states

(83)

Execution of Pushdown Automaton

• Configuration: Q^∗ × T^∗, where the variable stack denotes the sequence of states and the variable inr denotes the input rest

• Start configuration: (q₀, input), where q₀ is the start state of the LR-DFA

• Interpretation procedure:

(stack, inr) := (q0, input);
do {
    step(stack, inr);
} while (at(top(stack)) != accept
      && at(top(stack)) != error);
if (at(top(stack)) == error) return error;

with

(84)

Execution of Push-Down Automaton (2)

void step(var StateSeq stack, var TokenSeq inr) {
    State tk := top(stack);
    switch (at(tk)) {
    case shift:
        stack := push(succ(tk, top(inr)), stack);
        inr := tail(inr);
        break;
    case reduce A -> a:
        stack := mpop(length(a), stack);
        stack := push(succ(top(stack), A), stack);
        break;
    }
}
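The interpretation procedure and `step` can be tried out with the action and successor tables of the lecture's example grammar Γ₃ (S → E#, E → E+T | T, T → ID). The following Python sketch is an illustration under assumptions: the dictionary encoding of the tables, the representation of a reduce action as a tuple with the length of the right-hand side, and the recorded trace are not part of the lecture. Missing successor entries are mapped to the error state q7.

```python
ACTION = {
    "q0": "shift", "q1": "shift", "q2": "accept", "q3": "shift",
    "q4": ("reduce", "E", 3),  # E -> E + T
    "q5": ("reduce", "E", 1),  # E -> T
    "q6": ("reduce", "T", 1),  # T -> ID
    "q7": "error",
}
SUCC = {
    ("q0", "ID"): "q6", ("q0", "E"): "q1", ("q0", "T"): "q5",
    ("q1", "+"): "q3", ("q1", "#"): "q2",
    ("q3", "ID"): "q6", ("q3", "T"): "q4",
}


def parse(tokens):
    """Run the pushdown automaton; return (accepted, trace of configurations)."""
    stack, rest = ["q0"], list(tokens)
    trace = []
    while True:
        action = ACTION[stack[-1]]
        trace.append((tuple(stack), tuple(rest), action))
        if action == "accept":
            return True, trace
        if action == "error":
            return False, trace
        if action == "shift":
            symbol = rest.pop(0) if rest else None  # None: input exhausted
            stack.append(SUCC.get((stack[-1], symbol), "q7"))
        else:
            _, lhs, length = action
            del stack[-length:]                     # pop one state per symbol of the handle
            stack.append(SUCC.get((stack[-1], lhs), "q7"))


ok, trace = parse("ID + ID + ID #".split())
for stack, rest, action in trace:
    print(" ".join(stack), "|", " ".join(rest), "|", action)
```

Because the state on top of the stack encodes the viable prefix read so far, the automaton never has to re-read the sentential form after a reduction.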

(85)

Example: LR push down automaton

LR-DFA with states q0, ..., q7 for grammar Γ₃

Action table:

State | Action
q0    | shift
q1    | shift
q2    | accept
q3    | shift
q4    | reduce E → E+T
q5    | reduce E → T
q6    | reduce T → ID
q7    | error

Successor table (entries exist only for the shift states q0, q1, q3):

State | ID | +  | #  | E  | T
q0    | q6 | q7 | q7 | q1 | q5
q1    | q7 | q3 | q2 | q7 | q7
q3    | q6 | q7 | q7 | q7 | q4

Computation for input ID + ID + ID #:

Stack       | Input rest     | Action
q0          | ID + ID + ID # | shift
q0 q6       | + ID + ID #    | reduce T → ID
q0 q5       | + ID + ID #    | reduce E → T
q0 q1       | + ID + ID #    | shift
q0 q1 q3    | ID + ID #      | shift
q0 q1 q3 q6 | + ID #         | reduce T → ID
q0 q1 q3 q4 | + ID #         | reduce E → E+T
q0 q1       | + ID #         | shift
q0 q1 q3    | ID #           | shift
q0 q1 q3 q6 | #              | reduce T → ID
q0 q1 q3 q4 | #              | reduce E → E+T
q0 q1       | #              | shift
q0 q1 q2    |                | accept