Context-Free Analysis Lecture Compilers SS 2009 Dr.-Ing. Ina Schaefer


Context-Free Analysis

Lecture Compilers SS 2009

Dr.-Ing. Ina Schaefer

Software Technology Group TU Kaiserslautern


Content of Lecture

1. Introduction: Overview and Motivation
2. Syntax- and Type Analysis
   2.1 Lexical Analysis
   2.2 Context-free Syntax Analysis
   2.3 Context-sensitive Syntax Analysis
3. Translation to Target Language
   3.1 Translation of Imperative Language Constructs
   3.2 Translation of Object-Oriented Language Constructs
4. Selected Aspects of Compilers
   4.1 Intermediate Languages
   4.2 Optimization
   4.3 Command Selection
   4.4 Register Allocation
   4.5 Code Generation
5. Garbage Collection
6. XML Processing (DOM, SAX, XSLT)

Outline of Context-Free Analysis

1. Specification of Parsers
2. Implementation of Parsers
   Top-Down Syntax Analysis
     Recursive Descent
     LL(k) Parsing Theory
     LL Parser Generation
   Bottom-Up Syntax Analysis
     Principles of LR Parsing
     LR Parsing Theory
     SLR, LALR, LR(k) Parsing
     LALR-Parser Generation
3. Error Handling
4. Concrete and Abstract Syntax

Context-Free Syntax Analysis

Tasks

• Check whether the token stream (from the scanner) matches the context-free syntax of the language.
  - Error case: error handling
  - Correct case: construction of the syntax tree

[Figure: the parser with the token stream as its input.]

Context-Free Syntax Analysis (2)

Remarks:

• Parsing can be interleaved with other actions that process the program (e.g. attribute evaluation).
• The syntax tree controls important parts of the translation. Hence, we distinguish:
  - The concrete syntax tree corresponds to the context-free grammar.
  - The abstract syntax tree is geared towards further processing steps; it is a compact representation of the essential information.

Specification of Parsers

Two general specification techniques:

• Syntax Diagrams

• Context-Free Grammars (often in extended form)


Context-Free Grammars

Definition

Let N and T be two alphabets with N ∩ T = ∅, let Π be a finite subset of N × (N ∪ T)^∗, and let S ∈ N. Then Γ = (N,T,Π,S) is a context-free grammar (CFG) where

• N is the set of non-terminals

• T is the set of terminals

• Π is the set of production rules

• S is the start symbol (axiom)

Context-Free Grammars (2)

Notations:

• A,B,C, . . . denote non-terminals

• a,b,c, . . . denote terminals

• x,y,z, . . . denote strings of terminals, i.e. x ∈ T^∗

• α,β,γ,ψ,φ,σ,τ are strings of terminals and non-terminals, i.e. α ∈ (N ∪ T)^∗

Productions are denoted by A → α.

The notation A → α|β|γ|. . . is an abbreviation for the productions A → α, A → β, A → γ, . . .

Derivation

Let Γ = (N,T,Π,S) be a CFG:

• ψ is directly derivable from φ in Γ (φ produces ψ directly), written φ ⇒ ψ, if there exist σ, τ and a production A → α ∈ Π such that φ = σAτ and ψ = σατ.

• ψ is derivable from φ in Γ, written φ ⇒^∗ ψ, if there exist φ₀, . . . ,φ_n with φ = φ₀ and ψ = φ_n such that φ_i ⇒ φ_(i+1) holds for all i ∈ {0, . . . ,n−1}. The sequence φ₀, . . . ,φ_n is a derivation of ψ from φ.

• ⇒^∗ is the reflexive, transitive closure of ⇒.

Derivation (2)

• The derivation φ₀, . . . ,φ_n is a left derivation (right derivation) if in each φ_i the left-most (right-most) non-terminal is replaced. Left derivation steps are denoted by φ ⇒_lm ψ. Right derivation steps are denoted by φ ⇒_rm ψ.

• The tree-like representation of a derivation is a syntax tree.

• L(Γ) = {z ∈ T^∗ | S ⇒^∗ z} is the language generated by Γ.

• x ∈ L(Γ) is a sentence of Γ.

• φ ∈ (N ∪ T)^∗ with S ⇒^∗ φ is a sentential form of Γ.


Ambiguity in Grammars

• A sentence is unambiguous if it has exactly one syntax tree. A sentence is ambiguous if it has more than one syntax tree.

• For each syntax tree, there exists exactly one left derivation and exactly one right derivation.

• Thus it holds: A sentence is unambiguous if-and-only-if it has exactly one left (right) derivation.

• A grammar is ambiguous if its language contains an ambiguous sentence; otherwise it is unambiguous.

• For programming languages, unambiguous grammars are essential, as the semantics and the translation are defined by the syntactic structure.

Ambiguity in Grammars (2)

Example 1: Grammar for Expressions Γ₀

• S → E

• E → E +E

• E → E ∗E

• E → (E)

• E → ID

Consider the input string (av + av) ∗ bv + cv + dv, which reaches the context-free analysis as the token stream ( ID + ID ) ∗ ID + ID + ID.

Ambiguity in Grammars (3)

Syntax tree for (ID + ID)∗ID + ID + ID

[Figure: one syntax tree for ( ID + ID ) ∗ ID + ID + ID according to Γ₀. Source: © A. Poetzsch-Heffter, TU Kaiserslautern, 25.04.2007]

• Syntax tree does not match conventional rules of arithmetic.

• There are several syntax trees according to Γ₀ for this input, hence Γ₀ is ambiguous.
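To make the ambiguity of Γ₀ tangible, the following Python sketch counts the syntax trees of a token sequence. It is hard-wired to the productions of Γ₀; the function name `count_trees` and the token spelling are assumptions for this illustration. Since S → E is the only production for S, counting the trees with root E is sufficient.

```python
from functools import lru_cache


def count_trees(tokens):
    """Count the syntax trees that derive `tokens` from E according to Γ₀."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(i, j):
        # Number of syntax trees with root E for the slice tokens[i:j].
        n = 0
        if j - i == 1 and tokens[i] == "ID":
            n += 1                                   # E -> ID
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            n += trees(i + 1, j - 1)                 # E -> ( E )
        for m in range(i + 1, j - 1):
            if tokens[m] in ("+", "*"):              # E -> E + E | E * E
                n += trees(i, m) * trees(m + 1, j)
        return n

    return trees(0, len(tokens))


print(count_trees("ID + ID * ID".split()))
print(count_trees("( ID + ID ) * ID + ID + ID".split()))
```

A result greater than 1 means that the sentence is ambiguous. For the longer input, the parenthesized part behaves like a single operand, so the count equals the number of binary trees over four operands.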

Ambiguity in Grammars (4)

Example 2: Ambiguity in if-then-else construct

if B1 then if B2 then A := 9 else A := 7

First Derivation

[Figure: first syntax tree for the token stream IF ID THEN IF ID THEN ID EQ CO ELSE ID EQ CO.]

Ambiguity in Grammars (5)

Second Derivation

[Figure: second syntax tree for the same token stream. The two trees differ in which if the else branch is attached to.]

Ambiguity in Grammars (6)

Remarks:

• Each derivation corresponds to exactly one syntax tree. Conversely, for each syntax tree there can be several derivations.

• Instead of the term "syntax tree", the terms "structure tree" and "derivation tree" are also often used.

• For each language, there can be several generating grammars, i.e. the mapping L: Grammar → Language is in general not injective.

Ambiguity as Grammar Property

Ambiguity is a grammar property. The grammar for expressions Γ₀ is an example of an ambiguous grammar.

Γ₀:

• S → E

• E → E + E

• E → E ∗ E

• E → (E)

• E → ID

[Figure: two different syntax trees for ID + ID ∗ ID according to Γ₀.]

Ambiguity as Grammar Property (2)

But there exists an unambiguous grammar for the same language:

Γ₁:

• S → E

• E → T + E | T

• T → F ∗ T | F

• F → (E) | ID

[Figure: syntax tree for ( ID + ID ) ∗ ID + ID according to Γ₁.]

(However, there are also context-free languages that can only be described by ambiguous grammars.)

Literature

Recommended Reading:

• Wilhelm, Maurer: Chapter 8, pp. 271 - 283 (Syntactic Analysis)

• Appel: Chapter 3, pp. 40-47

Implementation of Parsers

Overview

• Top-Down Parsing
  - Recursive Descent
  - LL-Parsing
  - LL-Parser Generation

• Bottom-Up Parsing
  - LR-Parsing
  - LALR, SLR, LR(k)-Parsing
  - LALR-Parser Generation

Methods for Context-Free Analysis

• Manually developed, grammar-specific implementation (error-prone, inflexible)

• Backtracking (simple, but inefficient)

• Cocke-Younger-Kasami algorithm (1967):
  - for all CFGs in Chomsky normal form
  - based on the idea of dynamic programming
  - time complexity O(n³); however, linear complexity is desired

• Top-down methods: from the axiom to the token stream

• Bottom-up methods: from the token stream to the axiom
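As an illustration of the dynamic-programming idea behind the Cocke-Younger-Kasami algorithm, here is a minimal Python sketch of a recognizer. The grammar used is not from the lecture: it is a hypothetical grammar in Chomsky normal form for the language {aⁿbⁿ | n ≥ 1}, and the function name `cyk` and the rule encoding are assumptions as well.

```python
def cyk(word, unary, binary, start="S"):
    """Decide whether `word` is derivable from `start` (grammar in Chomsky normal form)."""
    n = len(word)
    if n == 0:
        return False
    # table[length][i] = set of non-terminals that derive word[i:i+length]
    table = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i, symbol in enumerate(word):
        table[1][i] = {lhs for lhs, terminal in unary if terminal == symbol}
    for length in range(2, n + 1):                 # three nested loops: O(n^3)
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[split][i]
                right = table[length - split][i + split]
                for lhs, (b, c) in binary:
                    if b in left and c in right:
                        table[length][i].add(lhs)
    return start in table[n][0]


# Hypothetical CNF grammar: S -> A B | A X, X -> S B, A -> a, B -> b
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("A", "X")), ("X", ("S", "B"))]

print(cyk("aabb", UNARY, BINARY))
print(cyk("aab", UNARY, BINARY))
```

The table is filled for increasing substring lengths, which is what gives the cubic running time in the length of the input.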

Example: Top-Down Analysis

Top-down analysis according to Γ₁; the result is a left derivation.

S ⇒
E ⇒
T + E ⇒
F * T + E ⇒
( E ) * T + E ⇒
( T + E ) * T + E ⇒
( F + E ) * T + E ⇒
( ID + E ) * T + E ⇒
( ID + T ) * T + E ⇒
( ID + F ) * T + E ⇒
( ID + ID ) * T + E ⇒
( ID + ID ) * F + E ⇒
( ID + ID ) * ID + E ⇒
( ID + ID ) * ID + T ⇒
( ID + ID ) * ID + F ⇒
( ID + ID ) * ID + ID

Example: Bottom-Up Analysis

Bottom-up analysis according to Γ₁; read from bottom to top, the result is a right derivation.

( ID + ID ) * ID + ID ⇐
( F + ID ) * ID + ID ⇐
( T + ID ) * ID + ID ⇐
( T + F ) * ID + ID ⇐
( T + T ) * ID + ID ⇐
( T + E ) * ID + ID ⇐
( E ) * ID + ID ⇐
F * ID + ID ⇐
F * F + ID ⇐
F * T + ID ⇐
T + ID ⇐
T + F ⇐
T + T ⇐
T + E ⇐
E ⇐
S

Context-free Analysis with linear complexity

• Restrictions on the grammar (not every CFG has a linear-time parser)

• Use of push-down automata or systems of recursive procedures

• Use of look-ahead into the remaining input in order to select the next production rule to be applied

Syntax Analysis Methods and Parser Generators

• Basic Knowledge of Syntax Analysis is essential for use of Parser Generators.

• Parser generators are not always applicable.

• Often, error handling has to be done manually.

• The methods underlying parser generation are a good example of a generic technique (and a highlight of computer science!).


Top-Down Syntax Analysis

Educational Objectives

• General Principle of Top-Down Syntax Analysis

• Recursive Descent Parsing (by example)

• Expressiveness of Top-Down Parsing

• Basic Concepts of LL(k) Parsing

Recursive Descent

Basic Idea

• Each non-terminal A is associated with a procedure. This procedure accepts a partial sentence derived from A.

• The procedure implements a finite automaton constructed from the productions whose left-hand side is A.

• Recursion in the grammar is mapped to mutually recursive procedures, so that the call stack of a higher-level programming language can be used for the implementation.

Construction of Recursive Descent Parser

Let Γ′₁ be a CFG (like Γ₁) with a terminal # denoting the end of the input.

Γ′₁:

• S → E#

• E → T + E | T

• T → F ∗ T | F

• F → (E) | ID

Construct the item automaton for each non-terminal A. The item automaton describes the analysis of those productions whose left-hand side is A.

Item Automata

[Figure: item automata for Γ′₁, one per non-terminal. Each transition moves the dot over the symbol that labels it.]

S → E#
  [S → .E#] --E--> [S → E.#] --#--> [S → E#.]

E → T + E | T
  [E → .T+E] --T--> [E → T.+E] --+--> [E → T+.E] --E--> [E → T+E.]
  [E → .T] --T--> [E → T.]

T → F ∗ T | F
  [T → .F∗T] --F--> [T → F.∗T] --∗--> [T → F∗.T] --T--> [T → F∗T.]
  [T → .F] --F--> [T → F.]

F → (E) | ID
  [F → .(E)] --(--> [F → (.E)] --E--> [F → (E.)] --)--> [F → (E).]
  [F → .ID] --ID--> [F → ID.]

Recursive Descent Parsing Procedures

• Item Automata can be mapped to recursive procedures.

• The input is a token stream terminated by #.

• The variable currSymbol holds one token of look-ahead, i.e. the first symbol of the remaining token stream.

Recursive Descent Parsing Procedures (2)

Production: S → E#

void S() {
    E();
    if (currSymbol == '#') {
        accept();
    } else {
        error();
    }
}

Recursive Descent Parsing Procedures (3)

Production: E → T + E | T

void E() {
    T();
    if (currSymbol == '+') {
        readSymbol();
        E();
    }
}

Production: T → F ∗ T | F

void T() {
    F();
    if (currSymbol == '*') {
        readSymbol();
        T();   // only recurse after a '*' has been read
    }
}

Recursive Descent Parsing Procedures (4)

Production: F → (E) | ID

void F() {
    if (currSymbol == '(') {
        readSymbol();
        E();
        if (currSymbol == ')') {
            readSymbol();
        } else {
            error();
        }
    } else if (currSymbol == ID) {
        readSymbol();
    } else {
        error();
    }
}
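The four procedures can be assembled into a complete recognizer. The following is a runnable Python sketch under a few assumptions that are not part of the lecture: tokens are plain strings (with "ID" standing for any identifier token), `error()` raises a `SyntaxError`, and the helper `accepts` wraps the call to `S()`.

```python
class Parser:
    """Recursive descent recognizer for Γ′₁ with one token of look-ahead."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    @property
    def curr_symbol(self):
        # The look-ahead token; None once the input is exhausted.
        return self.tokens[self.pos] if self.pos != len(self.tokens) else None

    def read_symbol(self):
        self.pos += 1

    def error(self):
        raise SyntaxError(f"unexpected {self.curr_symbol!r} at position {self.pos}")

    def S(self):                      # S -> E #
        self.E()
        if self.curr_symbol != "#":
            self.error()

    def E(self):                      # E -> T + E | T
        self.T()
        if self.curr_symbol == "+":
            self.read_symbol()
            self.E()

    def T(self):                      # T -> F * T | F
        self.F()
        if self.curr_symbol == "*":
            self.read_symbol()
            self.T()

    def F(self):                      # F -> ( E ) | ID
        if self.curr_symbol == "(":
            self.read_symbol()
            self.E()
            if self.curr_symbol == ")":
                self.read_symbol()
            else:
                self.error()
        elif self.curr_symbol == "ID":
            self.read_symbol()
        else:
            self.error()


def accepts(tokens):
    """Return True if the token sequence is a sentence of Γ′₁."""
    try:
        Parser(tokens).S()
        return True
    except SyntaxError:
        return False


print(accepts("( ID + ID ) * ID + ID #".split()))
print(accepts("ID + #".split()))
```

Raising an exception in `error()` is one simple way to abort the whole chain of recursive calls; the lecture leaves the behaviour of `error()` open.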


Recursive Descent Parsing Procedures (5)

Remarks:

• Recursive Descent
  - is relatively easy to implement
  - can easily be combined with other tasks (see following example)
  - is a typical example of syntax-directed methods (see also following example)

• The example uses one token of look-ahead.

Recursive Descent and Evaluation

Example: Interpreter for Expressions using recursive descent

int env(Ident);  // ID -> int

// local variable int_result stores intermediate results
int S() {
    int int_result = E();
    if (currSymbol == '#') {
        return int_result;
    } else {
        error();
        return error_result;
    }
}

Recursive Descent and Evaluation (2)

int E() {
    int int_result = T();
    if (currSymbol == '+') {
        readSymbol();
        return int_result + E();
    }
    return int_result;   // alternative E → T
}

int T() {
    int int_result = F();
    if (currSymbol == '*') {
        readSymbol();
        return int_result * T();
    }
    return int_result;   // alternative T → F
}

Recursive Descent and Evaluation (3)

int F() {
    int int_result;
    if (currSymbol == '(') {
        readSymbol();
        int_result = E();
        if (currSymbol == ')') {
            readSymbol();
            return int_result;
        } else {
            error();
            return error_result;
        }
    } else if (currSymbol == ID) {
        readSymbol();
        return env(code(ID));   // code(ID): the identifier of the ID token just read
    } else {
        error();
        return error_result;
    }
}
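A runnable counterpart of the interpreter above can be sketched in Python. The following assumptions are not from the lecture: tokens are plain strings, every token that is a key of the dictionary `env` counts as an identifier, and errors are reported by raising `SyntaxError` instead of returning `error_result`.

```python
def evaluate(tokens, env):
    """Evaluate a token sequence of Γ′₁ while parsing it by recursive descent."""
    pos = 0

    def curr():
        # The look-ahead token; None once the input is exhausted.
        return tokens[pos] if pos != len(tokens) else None

    def read():
        nonlocal pos
        pos += 1

    def error():
        raise SyntaxError(f"unexpected {curr()!r} at position {pos}")

    def S():                          # S -> E #
        result = E()
        if curr() != "#":
            error()
        return result

    def E():                          # E -> T + E | T
        result = T()
        if curr() == "+":
            read()
            return result + E()
        return result

    def T():                          # T -> F * T | F
        result = F()
        if curr() == "*":
            read()
            return result * T()
        return result

    def F():                          # F -> ( E ) | ID
        if curr() == "(":
            read()
            result = E()
            if curr() != ")":
                error()
            read()
            return result
        if curr() in env:
            value = env[curr()]       # look the identifier up before advancing
            read()
            return value
        error()

    return S()


print(evaluate("( a + b ) * c + d #".split(), {"a": 1, "b": 2, "c": 3, "d": 4}))
```

Here the value of an identifier is fetched before the look-ahead advances, which avoids having to remember the previous token.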


Recursive Descent and Evaluation (4)

• Extending the parser with actions/computations is easy to implement, but it mixes conceptually different tasks and makes programs hard to maintain.

• For which grammars does the recursive descent technique work?
  → LL(k) Parsing Theory

LL Parsing

• Basis for Top-Down Syntax Analysis

• First L: Read Input from Left to Right

• Second L: Search for Left Derivations


LL(k) Grammars

Definition (LL(k) Grammar)

Let Γ = (N,T,Π,S) be a CFG and k ∈ ℕ.

Γ is an LL(k) grammar if for any two left derivations

S ⇒^∗_lm uAα ⇒_lm uβα ⇒^∗_lm ux
S ⇒^∗_lm uAα ⇒_lm uγα ⇒^∗_lm uy

it holds that: if prefix(k,x) = prefix(k,y), then β = γ.

LL(k) Grammars (2)

Remarks:

• A grammar is an LL(k) grammar if, in a left derivation, k tokens of look-ahead suffice to find the correct production for the next derivation step.

• A language L ⊆ Σ^∗ is LL(k) if there exists an LL(k) grammar Γ with L(Γ) = L.

• The definition of LL(k) grammars provides no method to test whether a grammar has the LL(k) property.

Non LL(k) Grammars

Example 1: Grammar with Left Recursion Γ₂:

• S → E#

• E → E +T |T

• T → T ∗ F |F

• F → (E)|ID

Elimination of Left Recursion:

Replace productions of the form A → Aα | β, where β does not start with A, by A → βA′ and A′ → αA′ | ε.

Non LL(k) Grammars (2)

Elimination of Left Recursion: From Γ₂ we obtain Γ₃.

Γ₂:

• S → E#

• E → E + T | T

• T → T ∗ F | F

• F → (E) | ID

Γ₃:

• S → E#

• E → TE′

• E′ → +TE′ | ε

• T → FT′

• T′ → ∗FT′ | ε

• F → (E) | ID
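The replacement rule for immediate left recursion is mechanical, so it can be sketched in a few lines of Python. The encoding is an assumption made for this illustration: a grammar maps each non-terminal to a list of right-hand sides given as tuples, the empty tuple stands for ε, and the new non-terminal A′ is named by appending an apostrophe. Indirect left recursion is not handled.

```python
EPSILON = ()  # assumed encoding of the empty right-hand side ε


def eliminate_left_recursion(grammar):
    """Replace A -> A alpha | beta by A -> beta A' and A' -> alpha A' | epsilon."""
    result = {}
    for a, alternatives in grammar.items():
        alphas = [alt[1:] for alt in alternatives if alt[:1] == (a,)]
        betas = [alt for alt in alternatives if alt[:1] != (a,)]
        if not alphas:                       # no immediate left recursion
            result[a] = list(alternatives)
            continue
        a_prime = a + "'"                    # hypothetical name for A'
        result[a] = [beta + (a_prime,) for beta in betas]
        result[a_prime] = [alpha + (a_prime,) for alpha in alphas] + [EPSILON]
    return result


GAMMA_2 = {
    "S": [("E", "#")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("ID",)],
}

GAMMA_3 = eliminate_left_recursion(GAMMA_2)
for lhs, alternatives in GAMMA_3.items():
    print(lhs, "->", " | ".join(" ".join(alt) or "ε" for alt in alternatives))
```

Applied to Γ₂, the function yields the productions of Γ₃ listed above.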


Non LL(k) Grammars (3)

Example 2: Grammar Γ₄ with unlimited look ahead

• STM → VAR := VAR|ID(IDLIST)

• VAR → ID|ID(IDLIST)

• IDLIST → ID|ID,IDLIST

Γ₄ is not an LL(k) grammar for any k.

(Proof: cf. Wilhelm, Maurer, Example 8.3.4, p. 319)

Transformation to LL(2) grammar Γ′₄:

• STM → ASS_CALL | ID := VAR

• ASS_CALL → ID(IDLIST) ASS_CALL_REST

• ASS_CALL_REST → := VAR | ε

Non LL(k) Grammars (4)

Remark:

The transformed grammars accept the same language, but yield different syntax trees.

From a theoretical point of view, this is acceptable.

From a programming language implementation perspective, this is in general not acceptable.

There are languages L for which no LL(k) grammar Γ exists that generates the language, i.e. L(Γ) = L. (Example: the language of grammar Γ₅)

Non LL(k) Grammars (5)

Example 3: For L(Γ₅), there exists no LL(k) grammar.

Γ₅:

• S → A | B

• A → aAb | 0

• B → aBbb | 1

We show that there is no k such that Γ₅ is an LL(k) grammar.

Proof.

Let k be arbitrary but fixed. Choose two derivations according to the LL(k) definition and show that, despite equal prefixes of length k, β and γ are not equal:

S ⇒^∗_lm S ⇒_lm A ⇒^∗_lm a^k 0 b^k
S ⇒^∗_lm S ⇒_lm B ⇒^∗_lm a^k 1 b^2k

Then prefix(k, a^k 0 b^k) = prefix(k, a^k 1 b^2k) = a^k, but β = A ≠ B = γ.

FIRST and FOLLOW Sets

Definition

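Let Γ = (N,T,Π,S) be a CFG and k ∈ ℕ. T^≤k = {u ∈ T^∗ | length(u) ≤ k} denotes the set of all terminal strings of length at most k.

We define:

• FIRST_k : (N ∪ T)^∗ → P(T^≤k)
  with FIRST_k(α) = {prefix(k,u) | u ∈ T^∗ and α ⇒^∗ u}
  where prefix(n,u) = u for all u with length(u) ≤ n; otherwise prefix(n,u) consists of the first n symbols of u.

• FOLLOW_k : (N ∪ T)^∗ → P(T^≤k)
  with FOLLOW_k(α) = {w | S ⇒^∗ βαγ and w ∈ FIRST_k(γ)}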

FIRST and FOLLOW Sets in Parse Trees

[Figure: parse tree with root S and an inner node X, marking FIRST_k(X) and FOLLOW_k(X) in the derived terminal string.]

Characterization of LL(1) Grammars

Definition (Reduced CFG)

A CFG Γ = (N,T,Π,S) is reduced if each non-terminal occurs in a derivation starting from S and each non-terminal derives at least one word.

Lemma

A reduced CFG is LL(1) if and only if for any two distinct productions A → β and A → γ it holds that

(FIRST₁(β) ⊕₁ FOLLOW₁(A)) ∩ (FIRST₁(γ) ⊕₁ FOLLOW₁(A)) = ∅

where L₁ ⊕₁ L₂ = {prefix(1,vw) | v ∈ L₁, w ∈ L₂}.

Remark: FIRST and FOLLOW sets are computable, so this criterion can be checked automatically.
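The criterion of the lemma translates almost literally into code. The Python sketch below checks it for given FIRST₁ and FOLLOW₁ sets. The encoding is an assumption for this illustration: the empty string stands for ε, grammars are dictionaries of alternatives, and the FIRST₁ and FOLLOW₁ tables for the grammar Γ₃ (introduced earlier in the lecture) are entered by hand rather than computed.

```python
EPS = ""  # assumed encoding of the empty word ε


def oplus1(l1, l2):
    """L1 ⊕₁ L2 = {prefix(1, vw) | v in L1, w in L2} for words of length at most 1."""
    return {v if v != EPS else w for v in l1 for w in l2}


def first1(symbols, first):
    """FIRST₁ of a symbol string; a symbol without table entry is a terminal."""
    result = {EPS}
    for symbol in symbols:
        result = oplus1(result, first.get(symbol, {symbol}))
    return result


def is_ll1(grammar, first, follow):
    """Check the FIRST/FOLLOW criterion for every pair of alternatives."""
    for a, alts in grammar.items():
        for i in range(len(alts)):
            for j in range(i + 1, len(alts)):
                left = oplus1(first1(alts[i], first), follow[a])
                right = oplus1(first1(alts[j], first), follow[a])
                if left & right:
                    return False
    return True


GAMMA_3 = {
    "S": [["E", "#"]],
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["ID"]],
}
FIRST_1 = {"S": {"(", "ID"}, "E": {"(", "ID"}, "E'": {"+", EPS},
           "T": {"(", "ID"}, "T'": {"*", EPS}, "F": {"(", "ID"}}
FOLLOW_1 = {"S": {EPS}, "E": {"#", ")"}, "E'": {"#", ")"},
            "T": {"+", "#", ")"}, "T'": {"+", "#", ")"},
            "F": {"*", "+", "#", ")"}}

print(is_ll1(GAMMA_3, FIRST_1, FOLLOW_1))
```

For comparison, the two alternatives E → T + E | T of Γ₁ both start with FIRST₁(T), so the same check fails for them.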


Examples: FIRST_k and FOLLOW_k

Check that the modified expression grammar Γ₃ is LL(1).

• S → E#

• E → TE′

• E′ → +TE′ | ε

• T → FT′

• T′ → ∗FT′ | ε

• F → (E) | ID

Compute FIRST₁ and FOLLOW₁ for each non-terminal.
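One common way to compute these sets is a fixed-point iteration, sketched below in Python for k = 1. The lecture does not prescribe an algorithm, so the procedure, the dictionary encoding of Γ₃, the use of the empty string for ε and the function names are all assumptions of this sketch.

```python
EPS = ""  # assumed encoding of the empty word ε

GAMMA_3 = {
    "S": [["E", "#"]],
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["ID"]],
}


def first_of(symbols, first):
    """FIRST₁ of a symbol string, given the current FIRST₁ sets of the non-terminals."""
    result = set()
    for symbol in symbols:
        symbol_first = first[symbol] if symbol in first else {symbol}
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)          # every symbol can derive ε (or the string is empty)
    return result


def first_follow(grammar, start="S"):
    """Grow FIRST₁ and FOLLOW₁ until nothing changes any more."""
    first = {a: set() for a in grammar}
    follow = {a: set() for a in grammar}
    follow[start].add(EPS)   # S ⇒* S, so ε follows the start symbol
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for alt in alternatives:
                new_first = first_of(alt, first)
                if not new_first.issubset(first[a]):
                    first[a] |= new_first
                    changed = True
                for i, symbol in enumerate(alt):
                    if symbol not in grammar:
                        continue                     # terminals have no FOLLOW set
                    rest = first_of(alt[i + 1:], first)
                    new_follow = rest - {EPS}
                    if EPS in rest:
                        new_follow |= follow[a]
                    if not new_follow.issubset(follow[symbol]):
                        follow[symbol] |= new_follow
                        changed = True
    return first, follow


FIRST_1, FOLLOW_1 = first_follow(GAMMA_3)
for a in GAMMA_3:
    print(a, sorted(FIRST_1[a]), sorted(FOLLOW_1[a]))
```

The sets only ever grow and are bounded by the finite set of terminals plus ε, so the loop terminates.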

Examples: FIRST_k and FOLLOW_k (2)

• F → (E) | ID:
  (FIRST₁((E)) ⊕₁ FOLLOW₁(F)) ∩ (FIRST₁(ID) ⊕₁ FOLLOW₁(F))
  = ({(} ⊕₁ FOLLOW₁(F)) ∩ ({ID} ⊕₁ FOLLOW₁(F))
  = {(} ∩ {ID}
  = ∅

• E′ → +TE′ | ε:
  (FIRST₁(+TE′) ⊕₁ FOLLOW₁(E′)) ∩ (FIRST₁(ε) ⊕₁ FOLLOW₁(E′))
  = ({+} ⊕₁ FOLLOW₁(E′)) ∩ ({ε} ⊕₁ FOLLOW₁(E′))
  = {+} ∩ {#, )}
  = ∅

• T′ → ∗FT′ | ε:
  (FIRST₁(∗FT′) ⊕₁ FOLLOW₁(T′)) ∩ (FIRST₁(ε) ⊕₁ FOLLOW₁(T′))
  = ({∗} ⊕₁ FOLLOW₁(T′)) ∩ ({ε} ⊕₁ FOLLOW₁(T′))
  = {∗} ∩ {+, #, )}
  = ∅

Proof of LL Characterization Lemma

• Left-to-right direction: if Γ is LL(1), then the FIRST and FOLLOW characterization holds.

Proof by contradiction:

Suppose there are two productions A → β and A → γ with β ≠ γ and Φ = (FIRST₁(β) ⊕₁ FOLLOW₁(A)) ∩ (FIRST₁(γ) ⊕₁ FOLLOW₁(A)) ≠ ∅. Then there exists z ∈ Φ with length(z) = 1.

As Γ is reduced, there are two derivations:

S ⇒^∗ ψAα ⇒ ψβα ⇒^∗ ψzx
S ⇒^∗ ψAα ⇒ ψγα ⇒^∗ ψzy

Proof of LL Characterization Lemma (2)

Thus, we can construct the following left derivations:

S ⇒^∗_lm uAα ⇒_lm uβα ⇒^∗_lm uzx
S ⇒^∗_lm uAα ⇒_lm uγα ⇒^∗_lm uzy

with prefix(1,zx) = z = prefix(1,zy), which contradicts the LL(1) property of Γ.

Proof of LL Characterization Lemma (3)

• Right-to-left direction: if the FIRST and FOLLOW characterization holds, then Γ is LL(1).

Proof by contradiction:

Suppose Γ is not LL(1). Then there are two different derivations with β ≠ γ and length(z) = 1:

S ⇒^∗ ψAα ⇒ ψβα ⇒^∗ ψzx
S ⇒^∗ ψAα ⇒ ψγα ⇒^∗ ψzy

But then z ∈ (FIRST₁(β) ⊕₁ FOLLOW₁(A)) ∩ (FIRST₁(γ) ⊕₁ FOLLOW₁(A)), which contradicts the assumption that this intersection is empty.

Parser Generation for LL(k) Languages

[Figure: a grammar is fed into the LL(k) parser generator, which produces either the table for the push-down automaton or the error "Grammar is not LL(k)".]

Parser Generation for LL(k) Languages (2)

Remarks:

• Use of push-down automata with look ahead

• Select Production from Tables

• Advantages over bottom-up techniques in error analysis and error handling
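How a push-down automaton selects productions from a table can be illustrated with a small Python sketch for the LL(1) grammar Γ₃. The parse table below was filled in by hand from the FIRST₁ and FOLLOW₁ sets of Γ₃; its encoding and the function name `ll1_parse` are assumptions for this illustration and do not reflect the output format of any particular parser generator.

```python
NONTERMINALS = {"S", "E", "E'", "T", "T'", "F"}

# TABLE[(non-terminal, look-ahead)] = right-hand side to push
TABLE = {}
for look in ("(", "ID"):
    TABLE[("S", look)] = ["E", "#"]
    TABLE[("E", look)] = ["T", "E'"]
    TABLE[("T", look)] = ["F", "T'"]
TABLE[("E'", "+")] = ["+", "T", "E'"]
TABLE[("T'", "*")] = ["*", "F", "T'"]
for look in ("#", ")"):
    TABLE[("E'", look)] = []          # E' -> ε
for look in ("+", "#", ")"):
    TABLE[("T'", look)] = []          # T' -> ε
TABLE[("F", "(")] = ["(", "E", ")"]
TABLE[("F", "ID")] = ["ID"]


def ll1_parse(tokens, table=TABLE, start="S"):
    """Table-driven push-down recognizer with one token of look-ahead."""
    stack = [start]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos != len(tokens) else None
        if top in NONTERMINALS:
            rhs = table.get((top, look))
            if rhs is None:           # empty table entry: syntax error
                return False
            stack.extend(reversed(rhs))
        elif top == look:             # terminal on the stack matches the input
            pos += 1
        else:
            return False
    return pos == len(tokens)


print(ll1_parse("( ID + ID ) * ID + ID #".split()))
print(ll1_parse("ID + #".split()))
```

An empty table entry pinpoints the position of a syntax error, since it is detected as soon as the look-ahead cannot continue any left derivation.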

Example System: ANTLR (http://www.antlr.org/)

Recommended Reading for Top-Down Analysis:

• Wilhelm, Maurer: Chapter 8, Sections 8.3.1 to 8.3.4, pp. 312 - 329