Compilers and Language Processing Tools
Summer Term 2013
Arnd Poetzsch-Heffter Annette Bieniusa
Software Technology Group TU Kaiserslautern
Content of Lecture
1. Introduction
2. Syntax and Type Analysis
   2.1 Lexical Analysis
   2.2 Context-Free Syntax Analysis
   2.3 Context-Dependent Analysis (Semantic Analysis)
3. Translation to Intermediate Representation
   3.1 Languages for Intermediate Representation
   3.2 Translation of Imperative Language Constructs
   3.3 Translation of Object-Oriented Language Constructs
   3.4 Translation of Procedures
4. Optimization and Code Generation
   4.1 Assembly and Machine Code
   4.2 Optimization
   4.3 Register Allocation
Content of Lecture (2)
5. Selected Topics in Compiler Construction
   5.1 Garbage Collection
   5.2 Just-in-time Compilation
   5.3 XML Processing (DOM, SAX, XSLT)
2. Syntax and Type Analysis
Main learning objectives
• Know the tasks of different syntax analysis phases
• Understand how syntax analysis phases cooperate
• Know the specification techniques for syntax analysis and how to apply them
• Understand the generation techniques
• Be able to use the mentioned tools
• Understand lexical analysis
• Understand context-free analysis (parsing)
• Understand name and type analysis (context-sensitive)
Tasks of syntax analysis
• Check if input is syntactically correct
• Depending on the result:
  ▸ error message, or
  ▸ generation of an appropriate data structure for subsequent processing
Syntax and type analysis phases
Lexical analysis:
Character stream → token stream (or symbol stream)
Context-free analysis:
Token stream → syntax tree
Context-sensitive analysis / semantic analysis:
Syntax tree → syntax tree with cross references
[Figure: analysis pipeline for the source code: Character Stream → Scanner → Token Stream → Parser → Syntax Tree → Name and Type Analysis → Attributed Syntax Tree]
Reasons for separation of phases
• Lexical and context-free analysis
  ▸ Reduced load for the context-free analysis, e.g., whitespace is not needed for context-free analysis
• Context-free and context-sensitive analysis
  ▸ Context-sensitive analysis uses the tree structure instead of the token stream
  ▸ Advantages for the construction of the target data structure
• In both cases
  ▸ Increased efficiency
  ▸ Natural process (cf. natural language)
  ▸ More appropriate tool support
Lexical Analysis
2.2 Lexical Analysis
Lexical Analysis Introduction
2.2.1 Introduction
Lexical Analysis Introduction
Tasks of lexical analysis
• Break the input character stream into a token stream wrt. the language definition
• Classify tokens into token classes
• Representation of tokens
  ▸ hashing of identifiers
  ▸ conversion of constants
• Elimination of
  ▸ whitespace (spaces, comments, ...)
  ▸ external constructs (compiler directives, ...)
Lexical Analysis Introduction
Tasks of lexical analysis (2)
Terminology
• Token/symbol: a word over an alphabet of characters (often with additional information, e.g., token class, encoding, position, ...)
• Token class: a set of tokens (identifiers, constants, ...); token classes correspond to terminal symbols of a context-free grammar
Remark: The terms token and symbol refer to the same concept. The term token is generally used when talking about parsing technology, whereas the term symbol is used when talking about formal languages.
Lexical Analysis Introduction
Lexical analysis: Example
Input (line 23):
if( A <= 3.14 ) B = B

Token Class | String | Token Information    | Line:Col
IF          | "if"   |                      | 23:3
OPAR        | "("    |                      | 23:5
ID          | "A"    | 72 (hash)            | 23:7
RELOP       | "<="   | 4 (encoding)         | 23:9
FLOATCONST  | "3.14" | 3.14 (constant value)| 23:12
CPAR        | ")"    |                      | 23:16
ID          | "B"    | 84 (hash)            | 23:20
...
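In an implementation, a token like those in the table can be represented by a small data type. The following is a minimal sketch in Java; the record shape and field names are my assumptions, not part of the lecture:

```java
public class Tokens {
    // hypothetical token representation: class, lexeme, optional extra info, and position
    record Token(String tokenClass, String lexeme, Object info, int line, int col) {}

    public static void main(String[] args) {
        Token t = new Token("FLOATCONST", "3.14", Double.valueOf(3.14), 23, 12);
        System.out.println(t.tokenClass() + " \"" + t.lexeme() + "\" at " + t.line() + ":" + t.col());
        // prints: FLOATCONST "3.14" at 23:12
    }
}
```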
Lexical Analysis Specification of Scanners
2.2.2 Specification of Scanners
Lexical Analysis Specification of Scanners
Specification
The specification of the lexical analysis is a part of the language specification.
The two parts of lexical analysis specification:
• Scanning algorithm (often only implicit)
• Specification of tokens and token classes
Lexical Analysis Specification of Scanners
Examples: Scanning
1. Statement in C: B = B --- A;
   Problem: separation (-- and - are tokens)
   Solution: the longest token is chosen, i.e., B = B -- - A;
2. Java fragment: class public { public m() {...} }
   Problem: ambiguity (keyword vs. identifier)
   Solution: precedence rules
Lexical Analysis Specification of Scanners
Standard scan algorithm (concept)
Scanning is often implemented as a procedure:
• The procedure returns the next token
• Its state is the remainder of the input
• In error cases, it returns the token UNDEF without updating the input
Lexical Analysis Specification of Scanners
Standard scan algorithm (pseudo code)

CharStream inputRest := input;

Token nextToken() {
  Token curToken := longestTokenPrefix(inputRest);
  inputRest := cut(curToken, inputRest);
  return curToken;
}

where cut is defined as follows:
• if curToken ≠ UNDEF, curToken is removed from inputRest;
• else inputRest remains unchanged.
Lexical Analysis Specification of Scanners
Standard scan algorithm (2)
Token longestTokenPrefix(CharStream ir) {
  // requires availableChar(ir) > 0
  int curLength := 1;
  String curPrefix := prefix(curLength, ir);
  Token longestToken := UNDEF;
  while ( curLength <= availableChar(ir)
          && isTokenPrefix(curPrefix) ) {
    if ( isToken(curPrefix) ) {
      longestToken := token(curPrefix);
    }
    curLength++;
    curPrefix := prefix(curLength, ir);
  }
  return longestToken;
}
Lexical Analysis Specification of Scanners
Standard scan algorithm (3)
Predicates and functions to be defined:
• isTokenPrefix: String → boolean
• isToken: String → boolean
• token: String → Token
  ▸ yields the token UNDEF if the argument does not represent a token
  ▸ selects one of the tokens if several tokens are possible
Remarks:
• The standard scan algorithm is used for many modern languages, but not, e.g., for FORTRAN, where blanks are insignificant:
  ▸ DO 7 I = 1.25 ⇒ "DO7I" is an identifier.
  ▸ DO 7 I = 1,25 ⇒ "DO" is a keyword.
• Error cases are not handled.
• A complete realization of longestTokenPrefix is discussed in the following.
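The scan algorithm above can be instantiated directly once isToken and isTokenPrefix are fixed. Below is a sketch for a tiny hypothetical token set (the token set and the '|' output separator are my assumptions); it reproduces the longest-match behavior on the C example B = B --- A;:

```java
import java.util.List;

public class DemoScanner {
    static final String UNDEF = null;   // the token UNDEF modeled as null
    // hypothetical token set, just enough for the example "B = B --- A;"
    // (here whitespace is a token of its own and is skipped by the caller)
    static final List<String> TOKENS = List.of("=", "-", "--", "A", "B", ";", " ");

    static boolean isToken(String s) { return TOKENS.contains(s); }

    static boolean isTokenPrefix(String s) {
        return TOKENS.stream().anyMatch(t -> t.startsWith(s));
    }

    // direct realization of the pseudocode: remember the longest prefix that is a token
    static String longestTokenPrefix(String inputRest) {
        String longestToken = UNDEF;
        for (int curLength = 1; curLength <= inputRest.length(); curLength++) {
            String curPrefix = inputRest.substring(0, curLength);
            if (!isTokenPrefix(curPrefix)) break;
            if (isToken(curPrefix)) longestToken = curPrefix;
        }
        return longestToken;
    }

    static String inputRest = "B = B --- A;";

    // nextToken with the 'cut' semantics: consume the token, or leave the input untouched
    static String nextToken() {
        String curToken = longestTokenPrefix(inputRest);
        if (curToken != UNDEF) inputRest = inputRest.substring(curToken.length());
        return curToken;
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        String t;
        while (!inputRest.isEmpty() && (t = nextToken()) != UNDEF) {
            if (!t.equals(" ")) out.append(t).append('|');
        }
        System.out.println(out);   // prints: B|=|B|--|-|A|;|   ("---" splits as "--" then "-")
    }
}
```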
Lexical Analysis Specification of Scanners
Specification of token classes
• Token classes are defined by regular expressions (REs).
• An RE specifies the set of strings that belong to a certain token class.
Lexical Analysis Specification of Scanners
Regular Expressions
Let Σ be an alphabet, i.e., a non-empty set of characters. Σ* is the set of all words over Σ; ε is the empty word.
Definition (Regular expressions, regular languages)
• ε is an RE and specifies the language L = {ε}.
• Each a ∈ Σ is an RE and specifies the language L = {a}.
• Let r and s be two REs specifying the languages R and S, resp.
Then the following are REs and specify the language L:
  ▸ (r|s) with L = R ∪ S (union)
  ▸ rs with L = {vw | v ∈ R, w ∈ S} (concatenation)
  ▸ r* with L = {v1 ... vn | n ≥ 0, vi ∈ R for 1 ≤ i ≤ n} (Kleene star)
A language L ⊆ Σ* is called regular if there exists an RE r defining L.
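The set operations in the definition can be made concrete for finite languages. The following Java sketch (method names are mine, not from the slides) computes union, concatenation, and a bounded approximation of the Kleene star:

```java
import java.util.HashSet;
import java.util.Set;

public class RegLang {
    // L(r|s) = R ∪ S
    static Set<String> union(Set<String> r, Set<String> s) {
        Set<String> out = new HashSet<>(r);
        out.addAll(s);
        return out;
    }

    // L(rs) = { vw | v ∈ R, w ∈ S }
    static Set<String> concat(Set<String> r, Set<String> s) {
        Set<String> out = new HashSet<>();
        for (String v : r) for (String w : s) out.add(v + w);
        return out;
    }

    // L(r*) approximated with at most n repetitions: {ε} ∪ R ∪ RR ∪ ... ∪ R^n
    static Set<String> starUpTo(Set<String> r, int n) {
        Set<String> out = new HashSet<>();
        out.add("");                 // ε is always in L(r*)
        Set<String> pow = Set.of("");
        for (int i = 0; i < n; i++) {
            pow = concat(pow, r);    // pow = R^(i+1)
            out.addAll(pow);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("a"), b = Set.of("b");
        System.out.println(union(a, b).size());            // prints: 2
        System.out.println(concat(a, b));                  // prints: [ab]
        System.out.println(starUpTo(a, 2).contains("aa")); // prints: true
    }
}
```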
Lexical Analysis Specification of Scanners
Regular Expressions (2)
Remarks:
• L = ∅ is not regular according to this definition, but it is often considered regular.
• Other operators, e.g., +, ?, ., [], can be defined using the basic operators, e.g.:
  ▸ r+ ≡ (r r*), so that L(r+) = L(r*) \ {ε} if ε ∉ L(r)
  ▸ [aBd] ≡ a|B|d
  ▸ [a-g] ≡ a|b|c|d|e|f|g
Caution: Regular expressions only define valid tokens; they do not specify the programs or translation units of a programming language.
Lexical Analysis Implementation of Scanners
2.2.3 Implementation of Scanners
Lexical Analysis Implementation of Scanners
Implementation of scanners
sequence of regular expressions and actions
(input language of the scanner generator)
        ↓
Scanner Generator
        ↓
scanner program
(usually in a programming language)
Lexical Analysis Implementation of Scanners
Scanner generator: JFlex
• Typical use of JFlex:
  java -jar JFlex.jar Example.jflex
  javac Yylex.java
• Actions are written in Java
• Examples:
  1. Regular expression in JFlex:
     [a-zA-Z_][a-zA-Z_0-9]*
  2. JFlex input with abbreviations:
     ZI = [0-9]
     BU = [a-zA-Z_]
     BUZI = [a-zA-Z_0-9]
     %%
Lexical Analysis Implementation of Scanners
A complete JFlex example
enum Token { DO, DOUBLE, IDENT, FLOATCONST, STRING; }
%%
%line
%column
%debug
%type Token // declare token type
ZI = [0-9]
BU = [a-zA-Z_]
BUZI = [a-zA-Z_0-9]
ZE = [a-zA-Z_0-9!?\]\[\. \t...]
WhiteSpace = [ \t\n]
%%
{WhiteSpace}      { }
"double"          { return Token.DOUBLE; }
"do"              { return Token.DO; }
{BU}{BUZI}*       { return Token.IDENT; }
{ZI}+\.{ZI}+      { return Token.FLOATCONST; }
\"({ZE}|\\\")*\"  { return Token.STRING; }
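A generated scanner cannot be reproduced here, but its observable behavior can be approximated with java.util.regex. The sketch below mimics the rule set above under JFlex's disambiguation (longest match; on ties, the earliest rule in the file wins); it is an illustration, not JFlex's actual output:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniJFlex {
    enum Token { DO, DOUBLE, IDENT, FLOATCONST, STRING, WS }

    // rule order mirrors the JFlex file: longest match wins, earlier rule breaks ties
    static final Token[] KINDS = { Token.WS, Token.DOUBLE, Token.DO,
                                   Token.IDENT, Token.FLOATCONST, Token.STRING };
    static final Pattern[] RULES = {
        Pattern.compile("[ \t\n]+"),
        Pattern.compile("double"),
        Pattern.compile("do"),
        Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*"),
        Pattern.compile("[0-9]+\\.[0-9]+"),
        Pattern.compile("\"([^\"\\\\]|\\\\\")*\"")
    };

    static List<Token> scan(String input) {
        List<Token> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = -1;
            Token bestKind = null;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input).region(pos, input.length());
                // strict '>' keeps the earliest rule on equal-length matches
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestKind = KINDS[i];
                }
            }
            if (bestKind == null) throw new IllegalArgumentException("no token at " + pos);
            if (bestKind != Token.WS) out.add(bestKind);   // the whitespace rule has an empty action
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("do double dot 3.14 \"hi\""));
        // prints: [DO, DOUBLE, IDENT, FLOATCONST, STRING]
    }
}
```

Note how "double" is both a DOUBLE keyword and a possible IDENT of the same length: the earlier rule wins, which is exactly the precedence rule mentioned for keywords.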
Lexical Analysis Implementation of Scanners
Scanner generators
• Scanner generation uses the equivalence between
  ▸ regular expressions,
  ▸ non-deterministic finite automata (NFAs), and
  ▸ deterministic finite automata (DFAs).
• The construction proceeds in two steps:
  ▸ regular expressions → NFA
  ▸ NFA → DFA
Lexical Analysis Implementation of Scanners
Definition of NFA
Definition (Non-deterministic finite automaton)
A non-deterministic finite automaton is defined as a 5-tuple M = (Σ, Q, Δ, q0, F) where
• Σ is the input alphabet
• Q is the set of states
• q0 ∈ Q is the initial state
• F ⊆ Q is the set of final states
• Δ ⊆ Q × (Σ ∪ {ε}) × Q is the transition relation.
Lexical Analysis Implementation of Scanners
Regular expressions → NFA
Principle: For each regular sub-expression, construct an NFA with exactly one start and one final state that accepts the same language.
[Figure: translation scheme (one NFA per construct, each with exactly one start state s0 and one final state f0): for ε, an ε-transition from s0 to f0; for a ∈ Σ, an a-transition from s0 to f0; for (r|s), ε-transitions from a new start state into the sub-NFAs for R and S and from their final states into a new final state; for (rs), an ε-transition from the final state of R's NFA to the start state of S's NFA; for r*, ε-transitions that allow skipping or repeating the sub-NFA for R]
Lexical Analysis Implementation of Scanners
Example: Construction of NFA
[Figure: the translation scheme applied to the token definitions of the JFlex example; the resulting NFA has states s0–s26, with transitions spelling out "d o" and "d o u b l e" for the keywords, BU and BUZI for identifiers, ZI and '.' for float constants, '"', ZE, and '\' for strings, blank/TAB for whitespace, and ε-transitions connecting the sub-automata]
Lexical Analysis Implementation of Scanners
ε-closure
The function closure computes the ε-closure of a set of states s1, ..., sn.
Definition (ε-closure)
For an NFA M = (Σ, Q, Δ, q0, F) and a state q ∈ Q, the ε-closure of q is defined by
ε-closure(q) = {p ∈ Q | p is reachable from q via ε-transitions}
For S ⊆ Q, the ε-closure of S is defined by
ε-closure(S) = ⋃s∈S ε-closure(s)
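Assuming a transition relation stored as nested maps (my representation, not from the slides), the ε-closure can be computed by a simple breadth-first search over the ε-transitions:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class EpsilonClosure {
    static final char EPS = 'ε';   // marker character for ε-transitions (an assumption of this sketch)

    // ε-closure(S): all states reachable from S using only ε-transitions
    static Set<Integer> closure(Set<Integer> states,
                                Map<Integer, Map<Character, Set<Integer>>> delta) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int p : delta.getOrDefault(q, Map.of()).getOrDefault(EPS, Set.of())) {
                if (result.add(p)) work.push(p);   // only enqueue newly discovered states
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // s0 -ε-> s1 -ε-> s2, and s1 -a-> s3 (not ε-reachable)
        Map<Integer, Map<Character, Set<Integer>>> delta = Map.of(
            0, Map.of(EPS, Set.of(1)),
            1, Map.of(EPS, Set.of(2), 'a', Set.of(3)));
        System.out.println(closure(Set.of(0), delta).equals(Set.of(0, 1, 2)));   // prints: true
    }
}
```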
Lexical Analysis Implementation of Scanners
Longest token prefix with NFA
Token longestTokenPrefix(char[] ir) {
  // requires length(ir) > 0; ir[0] contains the first character
  StateSet curState := closure( {s0} );
  int curLength := 0;
  int tokenLength := undef;
  while ( curLength < length(ir) && !isEmptySet(curState) ) {
    if ( containsFinalState(curState) ) {
      tokenLength := curLength;
    }
    curState := closure(successor(curState, ir[curLength]));
    curLength++;
  }
  if ( containsFinalState(curState) ) { // a token may end exactly at the end of input
    tokenLength := curLength;
  }
  return token(prefix(ir, tokenLength));
}
Lexical Analysis Implementation of Scanners
Longest token prefix with NFA (2)
Remark:
Problem of ambiguity:
If more than one token matches the longest input prefix, the procedure token nondeterministically returns one of them.
Lexical Analysis Implementation of Scanners
NFA → DFA
Principle:
For each NFA, a DFA can be constructed that accepts the same language. (In general, this does not hold for NFAs with output.)
Properties of a DFA:
• No ε-transitions
• Transitions are deterministic given the input character
Lexical Analysis Implementation of Scanners
NFA → DFA (2)
Definition (Deterministic finite automaton)
A deterministic finite automaton is defined as a 5-tuple M = (Σ, Q, Δ, q0, F) where
• Σ is the input alphabet
• Q is the set of states
• q0 ∈ Q is the initial state
• F ⊆ Q is the set of final states
• Δ : Q × Σ → Q is the transition function.
Lexical Analysis Implementation of Scanners
NFA → DFA (3)
Construction: (according to John Myhill)
• The states of the DFA are sets of NFA states (powerset construction). Subsets of finite sets are again finite sets.
• The start state of the DFA is the ε-closure of the NFA start state.
• The final states of the DFA are the sets of states that contain an NFA final state.
• The successor state of a state S in the DFA under input a is obtained by
  ▸ computing all successors p of states q ∈ S under a in the NFA
  ▸ and adding the ε-closure of each such p.
Lexical Analysis Implementation of Scanners
NFA → DFA (4)
• If working with character classes (e.g., [a-f]), the characters and character classes at outgoing transitions must be disjoint.
• Completion of the automaton for error handling:
  ▸ Insert an additional state nT (nonToken state).
  ▸ Add a transition from state s to nT for each character for which no outgoing transition from s exists.
Lexical Analysis Implementation of Scanners
NFA → DFA (5)
Definition (DFA for an NFA)
Let M = (Σ, Q, Δ, q0, F) be an NFA. Then the DFA M′ corresponding to the NFA M is defined as M′ = (Σ, Q′, Δ′, q0′, F′) where
• the set of states is Q′ ⊆ P(Q), the power set of Q
• the initial state q0′ is the ε-closure of q0
• the final states are F′ = {S ⊆ Q | S ∩ F ≠ ∅}
• Δ′(S, a) = ε-closure({p | (q, a, p) ∈ Δ, q ∈ S}) for all a ∈ Σ.
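The definition can be turned into code. The sketch below performs the subset construction lazily, i.e., it computes the DFA states q0′ = ε-closure(q0) and Δ′(S, a) only as the input is read; the representation and names are my assumptions:

```java
import java.util.*;

public class SubsetConstruction {
    static final char EPS = 0;   // ε marker (an assumption of this sketch)

    record NFA(Map<Integer, Map<Character, Set<Integer>>> delta, int start, Set<Integer> finals) {}

    // ε-closure of a set of NFA states
    static Set<Integer> closure(Set<Integer> s, NFA n) {
        Set<Integer> result = new HashSet<>(s);
        Deque<Integer> work = new ArrayDeque<>(s);
        while (!work.isEmpty()) {
            for (int p : n.delta().getOrDefault(work.pop(), Map.of()).getOrDefault(EPS, Set.of()))
                if (result.add(p)) work.push(p);
        }
        return result;
    }

    // Δ'(S, a) = ε-closure({ p | (q, a, p) ∈ Δ, q ∈ S })
    static Set<Integer> step(Set<Integer> s, char a, NFA n) {
        Set<Integer> succ = new HashSet<>();
        for (int q : s) succ.addAll(n.delta().getOrDefault(q, Map.of()).getOrDefault(a, Set.of()));
        return closure(succ, n);
    }

    // run the DFA built on the fly; S ∈ F' iff S ∩ F ≠ ∅
    static boolean accepts(NFA n, String w) {
        Set<Integer> state = closure(Set.of(n.start()), n);   // q0' = ε-closure(q0)
        for (char c : w.toCharArray()) state = step(state, c, n);
        return !Collections.disjoint(state, n.finals());
    }

    public static void main(String[] args) {
        // NFA for (a|b)*ab: 0 -a/b-> 0, 0 -a-> 1, 1 -b-> 2 (final)
        NFA n = new NFA(Map.of(
            0, Map.of('a', Set.of(0, 1), 'b', Set.of(0)),
            1, Map.of('b', Set.of(2))), 0, Set.of(2));
        System.out.println(accepts(n, "abab"));   // prints: true
        System.out.println(accepts(n, "aba"));    // prints: false
    }
}
```

A full (eager) construction would additionally enumerate all reachable subsets and store Δ′ in a table, which is what a scanner generator does at generation time.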
Lexical Analysis Implementation of Scanners
Example: DFA
[Figure: the DFA obtained from the NFA of the previous example by the powerset construction; its states are labeled with the sets of NFA states they represent (e.g., the start state s0,1,2,5,12,14,18), and transitions carry disjoint character classes such as BU\{d}, BUZI\{o}, BUZI\{u}, BUZI\{b}, BUZI\{l}, BUZI\{e}, ZI, ZE, '.', '"', '\', blank, TAB; for readability, the edges to the nonToken state nT are only sketched]
Lexical Analysis Implementation of Scanners
Longest token prefix with DFA
Token longestTokenPrefix(char[] ir) {
  // requires length(ir) > 0; ir[0] contains the first character
  State curState := StartState;
  int curLength := 0;
  int tokenLength := undef;
  while ( curLength < length(ir) && curState != nT ) {
    if ( curState is FinalState ) {
      tokenLength := curLength;
    }
    curState := successor(curState, ir[curLength]);
    curLength++;
  }
  if ( curState is FinalState ) { // a token may end exactly at the end of input
    tokenLength := curLength;
  }
  return token(prefix(ir, tokenLength));
}
Lexical Analysis Implementation of Scanners
Longest token prefix with DFA (2)
Remarks:
• Computation of closure at construction time, not at runtime.
(Principle: Do as much statically as you can!)
• Problem of ambiguity still not solved. However, many scanner generators allow the user to control which token is returned. For example, JFlex returns the token of the first rule in the JFlex file that matches the longest input prefix.
Lexical Analysis Implementation of Scanners
Longest token prefix with DFA (3)
Implementation Aspects:
• Constructed DFA can be minimized.
• Input buffering is important: often use of cyclic arrays (caution with maximal token length, e.g. in case of comments)
• Encode DFA in table
• Choose suitable partitioning of alphabet in order to reduce number of transitions (i.e. size of table)
• Interface with parser: usually parser asks proactively for next token
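To illustrate the table encoding together with the nT completion, here is a small hand-written sketch for the FLOATCONST automaton ZI+ '.' ZI+ (the character-class partitioning, table layout, and names are mine):

```java
public class TableDFA {
    // character classes partitioning the alphabet: 0 = digit, 1 = '.', 2 = other
    static int charClass(char c) {
        if (c >= '0' && c <= '9') return 0;
        if (c == '.') return 1;
        return 2;
    }

    static final int NT = 4;   // nonToken state completing the automaton (error sink)
    // transition table [state][charClass] for ZI+ "." ZI+  (FLOATCONST)
    static final int[][] TABLE = {
        // digit  '.'   other
        {   1,    NT,   NT },   // 0: start
        {   1,     2,   NT },   // 1: integer part seen (not final)
        {   3,    NT,   NT },   // 2: dot seen
        {   3,    NT,   NT },   // 3: fraction digit seen (final)
        {  NT,    NT,   NT },   // 4: nT
    };
    static final boolean[] FINAL = { false, false, false, true, false };

    // maximal munch with the table: length of the longest FLOATCONST prefix, or -1
    static int longestPrefix(String in) {
        int state = 0, tokenLength = -1;
        for (int i = 0; i < in.length() && state != NT; i++) {
            state = TABLE[state][charClass(in.charAt(i))];
            if (state != NT && FINAL[state]) tokenLength = i + 1;
        }
        return tokenLength;
    }

    public static void main(String[] args) {
        System.out.println(longestPrefix("3.14)"));   // prints: 4
        System.out.println(longestPrefix("3..1"));    // prints: -1
    }
}
```

Mapping characters to a few classes first keeps the table at |states| × |classes| entries instead of |states| × |Σ|, which is the size reduction the last bullet refers to.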
Lexical Analysis Implementation of Scanners
Recommended reading
• Wilhelm, Seidl, Hack: Vol. 2, Chap. 2 (pp. 13–44)
• Wilhelm, Maurer: Chap. 7, pp. 239–269 (more theoretical)
• Appel: Chap. 2, pp. 16–37 (more practical)
Additional reading:
• Aho, Sethi, Ullman: Chap. 3 (very detailed)