Syntax and Type Analysis
Lecture Compilers SS 2009
Dr.-Ing. Ina Schaefer
Software Technology Group TU Kaiserslautern
Ina Schaefer Syntax and Type Analysis 1
Educational Objectives
• Tasks of Different Syntax Analysis Phases
• Interaction of Syntax Analysis Phases
• Specification Techniques for Syntax Analysis
• Generation Techniques
• Usage of Tools
• Lexical Analysis
• Context-free Analysis (Parsing)
• Context-sensitive Analysis
Introduction to Syntax and Type Analysis
Syntax Analysis
Tasks of Syntax Analysis
• Check whether the input is syntactically correct
• Depending on the result:
  - Error message
  - Generation of an appropriate data structure for subsequent processing
Syntax Analysis Phases
Lexical Analysis:
String → Token Stream (or Symbol String)
Context-free Analysis:
Token Stream → Tree
Context-sensitive Analysis:
Tree → Tree with Cross References
[Figure: Source Code as String → Scanner → Token Stream → Parser → Syntax Tree → Name and Type Analysis → Attributed Syntax Tree]
Reasons for Separation of Phases
• Lexical and Context-free Analysis
  - Reduced load for context-free analysis, e.g. whitespace is not needed for context-free analysis
• Context-free and Context-sensitive Analysis
  - Context-sensitive analysis uses the tree structure instead of the token stream
  - Advantages for the construction of the target data structure
• For Both Cases
  - Increased efficiency
  - Natural process (cf. natural language)
  - More appropriate tool support
Lexical Analysis
Tasks
• Break the input character string into a symbol stream (or token stream) according to the language definition
• Classify symbols into symbol classes
• Representation of symbols
  - Hashing of identifiers
  - Conversion of constants
• Elimination of
  - whitespace (spaces, comments, ...)
  - external constructs (compiler directives, ...)
Lexical Analysis (2)
Terminology
• Symbol: a word over an alphabet of characters (often with additional information, e.g. token class, encoding, position, ...)
• Symbol Class: a set of tokens (identifiers, constants, ...); symbol classes correspond to the terminal symbols of a context-free grammar
Lexical Analysis: Example
Input Line 23:
␣␣if(␣A␣<=␣3.14)␣␣␣B␣=␣B--
Result of Lexical Analysis:
Symbol Class   String   Encoding                          Line:Column
IF             "if"                                       23:3
OPAR           "("                                        23:5
ID             "A"      72   (hash code of identifier)    23:7
RELOP          "<="     4    (encoding of operator <=)    23:9
FLOATCONST     "3.14"   3.14 (value of constant)          23:12
CPAR           ")"                                        23:16
ID             "B"      84   (hash code of identifier)    23:20
...
Each row is one token; class, string, encoding and position together form the symbol information.
(Example: © A. Poetzsch-Heffter, TU Kaiserslautern, 25.04.2007)
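The symbol information from the example can be held in a small data structure. The following is only a minimal sketch in Java; the class name Symbol, its field names and the enum constants are assumptions made for illustration and do not come from the lecture.

```java
/** Sketch of a symbol (token): class, matched string, optional encoding, position. */
final class Symbol {
    // Hypothetical symbol classes, taken from the example table.
    enum SymbolClass { IF, OPAR, CPAR, ID, RELOP, FLOATCONST }

    final SymbolClass symbolClass;
    final String text;      // the matched character string
    final Object encoding;  // hash code, operator code or constant value; null if unused
    final int line;
    final int column;

    Symbol(SymbolClass symbolClass, String text, Object encoding, int line, int column) {
        this.symbolClass = symbolClass;
        this.text = text;
        this.encoding = encoding;
        this.line = line;
        this.column = column;
    }

    @Override
    public String toString() {
        String enc = (encoding == null) ? "" : encoding + " ";
        return symbolClass + " \"" + text + "\" " + enc + line + ":" + column;
    }
}
```

For instance, new Symbol(Symbol.SymbolClass.RELOP, "<=", 4, 23, 9) corresponds to the fourth row of the table.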
Lexical Analysis Specification of Scanners
Specification
The specification of the lexical analysis is part of the programming language specification.
It consists of two parts:
• Scanning algorithm (often only implicit)
• Specification of symbols and symbol classes
Examples: Scanning
1. Statement in C
B␣=␣B␣---␣A;
Problem: Separation (-- and - are both symbols)
Solution: The longest symbol is chosen, i.e. B␣=␣B␣--␣-␣A;
2. Java Fragment
class␣public␣{␣public␣m()␣{...}␣}
Problem: Ambiguity (keyword vs. identifier)
Solution: Precedence rules
Standard Scan-Algorithm (Concept)
Scanning is often implemented as a co-routine:
• The state is the remainder of the input
• The co-routine returns the next symbol and removes it from the remaining input
• In error cases, the co-routine returns the UNDEF symbol
Standard Scan-Algorithm (Pseudo Code)
String left_input := input;
Symbol nextSymbol() {
  Symbol curSymbol := longestSymbolPrefix(left_input);
  left_input := cut(curSymbol, left_input);
  return curSymbol;
}
where cut is defined as
• if curSymbol ≠ UNDEF, curSymbol is removed from left_input
• else left_input remains unchanged.
Standard Scan-Algorithm (2)
longestSymbolPrefix(String egr) {
  // egr is the remaining input; precondition: length(egr) > 0
  int curLength := 0;
  String curPrefix := prefix(curLength, egr);
  Symbol longestSymbol := UNDEF;
  while (curLength <= length(egr) && isSymbolPrefix(curPrefix)) {
    if (isSymbol(curPrefix)) {
      longestSymbol := curPrefix;
    }
    curLength++;
    // prefix(n, egr) is assumed to return egr itself for n > length(egr)
    curPrefix := prefix(curLength, egr);
  }
  return longestSymbol;
}
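The pseudo code for nextSymbol and longestSymbolPrefix can be sketched in runnable Java as follows. This is only an illustration under stated assumptions: the symbol set SYMBOLS is a hypothetical toy language, UNDEF is represented by null, and whitespace is not handled.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of the standard scan algorithm (longest symbol is chosen). */
class StandardScanner {
    // Hypothetical symbol set, used only for this illustration.
    static final List<String> SYMBOLS = Arrays.asList("-", "--", "=", ";", "A", "B");

    private String leftInput; // state of the co-routine: remainder of the input

    StandardScanner(String input) {
        this.leftInput = input;
    }

    // True if s is a prefix of at least one symbol (the empty string always is).
    static boolean isSymbolPrefix(String s) {
        for (String sym : SYMBOLS) {
            if (sym.startsWith(s)) return true;
        }
        return false;
    }

    static boolean isSymbol(String s) {
        return SYMBOLS.contains(s);
    }

    // Longest prefix of rest that is a symbol, or null (UNDEF) if there is none.
    static String longestSymbolPrefix(String rest) {
        String longestSymbol = null;
        int curLength = 0;
        while (curLength <= rest.length()
                && isSymbolPrefix(rest.substring(0, curLength))) {
            String curPrefix = rest.substring(0, curLength);
            if (isSymbol(curPrefix)) longestSymbol = curPrefix;
            curLength++;
        }
        return longestSymbol;
    }

    // Returns the next symbol and cuts it off; on UNDEF the input stays unchanged.
    String nextSymbol() {
        String curSymbol = longestSymbolPrefix(leftInput);
        if (curSymbol != null) leftInput = leftInput.substring(curSymbol.length());
        return curSymbol;
    }
}
```

Applied to the input B=B---A; (the C example without blanks), repeated calls of nextSymbol split the three minus signs into -- followed by -.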
Standard Scan-Algorithm (3)
Only two predicates have to be defined:
• isSymbolPrefix: String → bool
• isSymbol: String → bool
Remarks:
• The standard scan algorithm is used in many modern languages, but not e.g. in FORTRAN, because blanks are not significant there except in literal symbols, e.g.
  - DO 7 I = 1.25 → DO 7 I (read as DO7I) is an identifier.
  - DO 7 I = 1,25 → DO is a keyword.
• Error cases are not handled.
• A complete realisation of longestSymbolPrefix is discussed later.
Specification of Symbols
• Symbols are specified by regular expressions.
• Symbol classes are described informally.
Regular Expressions
Let Σ be an alphabet, i.e. a non-empty set of characters. Σ∗ is the set of all words over Σ; ε is the empty word.
Definition (Regular Expressions, Regular Languages)
• ε is a regular expression (r.e.) and denotes the language L = {ε}.
• Each a ∈ Σ is an r.e. and denotes the language L = {a}.
• Let r and s be two r.e. defining the languages R and S, resp.
Then the following are r.e. and define the corresponding language L:
  - (r|s) with L = R ∪ S (Union)
  - rs with L = {vw | v ∈ R, w ∈ S} (Concatenation)
  - r∗ with L = {v1 … vn | n ≥ 0, vi ∈ R for 1 ≤ i ≤ n} (Kleene Star)
A language L ⊆ Σ∗ is called regular iff there exists an r.e. r defining L.
Regular Expressions (2)
Remarks:
• L = ∅ is not regular according to the definition, but is often considered regular.
• Other operators, e.g. +, ?, ., [], can be defined using the basic operators, e.g.
  - r+ ≡ rr∗ (which equals r∗ \ {ε} whenever ε is not in the language of r)
  - [aBd] ≡ a|B|d
  - [a-g] ≡ a|b|c|d|e|f|g
Caution: Regular expressions only define valid symbols; they do not specify the program or translation units of a programming language.
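The derived operators can be compared with their definitions by basic operators on a few sample words. The sketch below uses Java's java.util.regex package; the class DerivedOperators and the method agreeOn are hypothetical names, and agreeing on samples is of course only a plausibility check, not a proof of equivalence.

```java
import java.util.regex.Pattern;

/** Sketch: check that two regular expressions agree on a set of sample words. */
class DerivedOperators {
    // True if r1 and r2 accept exactly the same words among the given samples.
    static boolean agreeOn(String r1, String r2, String... words) {
        Pattern p1 = Pattern.compile(r1);
        Pattern p2 = Pattern.compile(r2);
        for (String w : words) {
            // matches() tests the whole word, not just a substring
            if (p1.matcher(w).matches() != p2.matcher(w).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

For example, agreeOn("a+", "aa*", "", "a", "aaa", "b") compares r+ with rr∗, and agreeOn("[a-g]", "a|b|c|d|e|f|g", "a", "g", "h") compares a character class with its expansion.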
Lexical Analysis Implementation of Scanners
Implementation of Scanners
[Figure: Sequence of Regular Expressions and Actions (input language of the scanner generator) → Scanner Generator → Scanner Program (usually in a programming language)]
Scanner Generator: JFlex
• Typical Use of JFlex:
java -jar JFlex.jar Example.jflex
javac Yylex.java
• Actions are written in Java
• Examples:
1. Regular Expression in JFlex
[a-zA-Z_0-9][a-zA-Z_0-9]*
2. JFlex Input with Abbreviations (ZI: digit, BU: letter, BUZI: letter or digit)
ZI = [0-9]
BU = [a-zA-Z_]
BUZI = [a-zA-Z_0-9]
%%
{BU}{BUZI}* { anAction(); }
A complete JFlex Example
enum Token { DO, DOUBLE, IDENT, FLOATCONST, STRING; }
%%
%type Token // declare token type
ZI = [0-9]
BU = [a-zA-Z_]
BUZI = [a-zA-Z_0-9]
ZE = [a-zA-Z_0-9!?\]\[\.\t...]
%%
[ \t]* { /* whitespace */ }
"do" { return Token.DO; }
"double" { return Token.DOUBLE; }
{BU}{BUZI}* { return Token.IDENT; }
{ZI}+\.{ZI}+ { return Token.FLOATCONST; }
\"({ZE}|\\\")*\" { return Token.STRING; }
Scanner Generators
• Scanner generation uses the equivalence between
  - Regular expressions
  - Non-deterministic finite automata (NFA)
  - Deterministic finite automata (DFA)
• The construction method consists of two steps:
  - Regular Expressions → NFA
  - NFA → DFA
Definition of NFA
Definition (Non-deterministic Finite Automaton)
A non-deterministic finite automaton is defined as a 5-tuple M = (Σ,Q,∆,q0,F)
where
• Σ is the input alphabet
• Q is the set of states
• q0 ∈ Q is the initial state
• F ⊆ Q is the set of final states
• ∆ ⊆ Q × (Σ ∪ {ε}) × Q is the transition relation.
Regular Expressions → NFA
Principle: For each regular sub-expression, construct NFA with one start and end state that accepts the same language.
Step 1: Regular Expressions → NFA, translation scheme
(s0 and f0 are the new start and end states; s1, f1 and s2, f2 are the start and end states of the NFAs R and S constructed for r and s)
• ε: s0 --ε--> f0
• a: s0 --a--> f0
• (r|s): s0 --ε--> s1, s0 --ε--> s2, f1 --ε--> f0, f2 --ε--> f0
• (rs): f1 --ε--> s2 (start state s1, end state f2)
• r*: s0 --ε--> s1, f1 --ε--> s1, f1 --ε--> f0, s0 --ε--> f0
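The principle of building one NFA with a single start and end state per sub-expression can be sketched as follows. This is an illustration only: it assumes the usual Thompson-style scheme for ε, a, (r|s), (rs) and r*, and the names Thompson, Fragment, Edge and EPS are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: every regular sub-expression yields an NFA fragment with one start and one end state. */
class Thompson {
    static final char EPS = '\0'; // hypothetical marker for ε-transitions

    static final class Edge {
        final int from; final char label; final int to;
        Edge(int from, char label, int to) { this.from = from; this.label = label; this.to = to; }
    }

    static final class Fragment {
        final int start, end;
        final List<Edge> edges = new ArrayList<>();
        Fragment(int start, int end) { this.start = start; this.end = end; }
        Fragment edge(int from, char label, int to) {
            edges.add(new Edge(from, label, to));
            return this;
        }
    }

    private int nextState = 0;
    private int fresh() { return nextState++; }

    // New fragment that contains all edges of the given sub-fragments.
    private static Fragment combine(int start, int end, Fragment... parts) {
        Fragment f = new Fragment(start, end);
        for (Fragment p : parts) f.edges.addAll(p.edges);
        return f;
    }

    Fragment epsilon() { int s0 = fresh(), f0 = fresh(); return combine(s0, f0).edge(s0, EPS, f0); }

    Fragment symbol(char a) { int s0 = fresh(), f0 = fresh(); return combine(s0, f0).edge(s0, a, f0); }

    Fragment union(Fragment r, Fragment s) {
        int s0 = fresh(), f0 = fresh();
        return combine(s0, f0, r, s)
                .edge(s0, EPS, r.start).edge(s0, EPS, s.start)
                .edge(r.end, EPS, f0).edge(s.end, EPS, f0);
    }

    Fragment concat(Fragment r, Fragment s) {
        return combine(r.start, s.end, r, s).edge(r.end, EPS, s.start);
    }

    Fragment star(Fragment r) {
        int s0 = fresh(), f0 = fresh();
        return combine(s0, f0, r)
                .edge(s0, EPS, r.start).edge(r.end, EPS, r.start)
                .edge(r.end, EPS, f0).edge(s0, EPS, f0);
    }

    static int epsilonEdges(Fragment f) {
        int n = 0;
        for (Edge e : f.edges) if (e.label == EPS) n++;
        return n;
    }
}
```

With Thompson t = new Thompson(), the expression (a|bc)* is built as t.star(t.union(t.symbol('a'), t.concat(t.symbol('b'), t.symbol('c')))). The number of states and transitions grows linearly with the size of the regular expression.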
Example: Construction of NFA
[Figure: NFA obtained by translating the JFlex example above (keywords "do" and "double", identifiers, float constants, strings). The states are numbered up to s26; the sub-automata for the individual rules are connected by ε-transitions.]
ε-closure
The function closure computes the ε-closure of a set of states s1, …, sn.
Definition (ε-closure)
For an NFA M = (Σ, Q, ∆, q0, F) and a state q ∈ Q, the ε-closure of q is defined by
ε-closure(q) = {p ∈ Q | p is reachable from q via ε-transitions} (this includes q itself)
For S ⊆ Q, the ε-closure of S is defined by
ε-closure(S) = ⋃ {ε-closure(s) | s ∈ S}
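The ε-closure of a set of states can be computed with a simple worklist. The following Java sketch assumes that states are integers and that the ε-transitions are given as a map from a state to its direct ε-successors; the names EpsilonClosure and epsMoves are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch: ε-closure of a set of NFA states via a worklist. */
class EpsilonClosure {
    // epsMoves.get(q) lists the states reachable from q by exactly one ε-transition.
    static Set<Integer> closure(Set<Integer> states, Map<Integer, List<Integer>> epsMoves) {
        Set<Integer> result = new TreeSet<>(states);     // every state is in its own closure
        Deque<Integer> work = new ArrayDeque<>(states);  // states whose ε-successors are still unexplored
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int p : epsMoves.getOrDefault(q, Collections.<Integer>emptyList())) {
                if (result.add(p)) {   // p is new: explore its ε-successors as well
                    work.push(p);
                }
            }
        }
        return result;
    }
}
```

Each state enters the worklist at most once, so the computation terminates even if the ε-transitions form a cycle.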
Longest Symbol Prefix with NFA
longestSymbolPrefix(char[] egr) { // length(egr) > 0
  StateSet curState := closure({s0});
  int curLength := 0;
  int symbolLength := undef;
  while (curLength <= length(egr) && !isEmptySet(curState)) {
    if (contains(curState, finalState)) {
      symbolLength := curLength;
    }
    curLength++;
    // egr is indexed from 1; past the end of egr, successor is assumed to yield the empty set
    curState := closure(successor(curState, egr[curLength]));
  }
  return symbol(prefix(egr, symbolLength));
}
Longest Symbol Prefix with NFA (2)
Remark:
The problem of ambiguity is not solved yet:
If more than one token matches the longest input prefix, one of these tokens is returned by the function symbol.
NFA → DFA
Principle:
For each NFA, a DFA can be constructed that accepts the same language. (In general, this does not hold for NFA with output.)
Properties of DFA:
• No ε-transitions.
• Transitions are determined by a function.
NFA → DFA (2)
Definition (Deterministic Finite State Automaton)
A deterministic finite automaton is defined as a 5-tuple M = (Σ, Q, ∆, q0, F) where
• Σ is the input alphabet
• Q is the set of states
• q0 ∈ Q is the initial state
• F ⊆ Q is the set of final states
• ∆ : Q × Σ → Q is the transition function.
NFA → DFA (3)
Construction: (according to John Myhill)
• The states of the DFA are subsets of the NFA states (powerset construction). Since the NFA has finitely many states, there are only finitely many such subsets.
• The start state of the DFA is the ε-closure of the NFA start state.
• The final states of the DFA are the sets of states that contain an NFA final state.
• The successor state of a state S in the DFA under input a is obtained by
  - computing all successors p of q ∈ S under a in the NFA
  - and adding the ε-closure of p
NFA → DFA (4)
• If working with character classes (e.g. [a-f]), the characters and character classes at the outgoing transitions of a state must be disjoint.
• Completion of the automaton for error handling:
  - Insert an additional (final) state nT (the nonToken state)
  - For each state and each character for which no outgoing transition exists, add a transition to the nonToken state.
NFA → DFA (5)
Definition (DFA for NFA)
Let M = (Σ, Q, ∆, q0, F) be an NFA. Then the DFA M′ corresponding to the NFA M is defined as M′ = (Σ, Q′, ∆′, q0′, F′) where
• the set of states is Q′ ⊆ P(Q), the power set of Q
• the initial state q0′ is the ε-closure of q0
• the final states are F′ = {S ⊆ Q | S ∩ F ≠ ∅}
• ∆′(S, a) = ε-closure({p | (q, a, p) ∈ ∆, q ∈ S}) for all a ∈ Σ.
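The definition of the DFA for an NFA translates almost directly into code. The sketch below only builds the DFA states reachable from the start state; the class SubsetConstruction, the marker EPS and the integer encoding of NFA states are assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the powerset construction NFA → DFA (reachable states only). */
class SubsetConstruction {
    static final char EPS = '\0'; // hypothetical marker for ε-transitions

    // NFA transition relation: delta.get(q).get(label) = set of successor states.
    final Map<Integer, Map<Character, Set<Integer>>> delta = new HashMap<>();

    void addEdge(int from, char label, int to) {
        delta.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(label, k -> new TreeSet<>()).add(to);
    }

    private Set<Integer> step(int q, char label) {
        Map<Character, Set<Integer>> out = delta.get(q);
        if (out == null || !out.containsKey(label)) return Collections.emptySet();
        return out.get(label);
    }

    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int p : step(q, EPS)) if (result.add(p)) work.push(p);
        }
        return result;
    }

    // ∆'(S, a) = ε-closure({p | (q, a, p) ∈ ∆, q ∈ S})
    Set<Integer> dfaSuccessor(Set<Integer> s, char a) {
        Set<Integer> targets = new TreeSet<>();
        for (int q : s) targets.addAll(step(q, a));
        return closure(targets);
    }

    // Each DFA state is a set of NFA states; missing entries stand for transitions to nT.
    Map<Set<Integer>, Map<Character, Set<Integer>>> build(int q0, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = closure(Collections.singleton(q0));
        dfa.put(start, new HashMap<>());
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> s = work.remove();
            for (char a : alphabet) {
                Set<Integer> t = dfaSuccessor(s, a);
                if (t.isEmpty()) continue; // no NFA successor: would lead to nT
                dfa.get(s).put(a, t);
                if (!dfa.containsKey(t)) {
                    dfa.put(t, new HashMap<>());
                    work.add(t);
                }
            }
        }
        return dfa;
    }
}
```

A DFA state S is final iff it contains an NFA final state; this check is left out of the sketch. Building only the reachable subsets usually keeps the DFA far smaller than the full power set.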
Example: DFA
[Figure: DFA obtained from the example NFA by the powerset construction. The DFA states are sets of NFA states, e.g. the start state s0,1,2,5,12,14,18. For clarity, the transitions to nT are only sketched.]
Longest Symbol Prefix with DFA
longestSymbolPrefix(char[] egr) { // length(egr) > 0
  State curState := start_state;
  int curLength := 0;
  int symbolLength := undef;
  while (curLength <= length(egr) && curState != nT) {
    if (curState is FinalState) {
      symbolLength := curLength;
    }
    curLength++;
    // egr is indexed from 1; past the end of egr, successor is assumed to yield nT
    curState := successor(curState, egr[curLength]);
  }
  return symbol(prefix(egr, symbolLength));
}
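The DFA-based search for the longest symbol prefix can be sketched in Java as follows. The automaton is a hypothetical hand-written DFA for identifiers, integer constants and float constants (it is not the DFA of the lecture example), and the method returns the length of the longest symbol prefix, or -1 for undef, instead of a symbol.

```java
/** Sketch: longest symbol prefix with a small hand-written DFA. */
class DfaScanner {
    static final int NT = -1; // nonToken state

    // States: 0 start, 1 identifier, 2 integer, 3 after the '.', 4 float.
    // FINAL[state] is the symbol class accepted in that state, or null if it is not final.
    static final String[] FINAL = { null, "IDENT", "INTCONST", null, "FLOATCONST" };

    static int successor(int state, char c) {
        boolean letter = c >= 'a' && c <= 'z';
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case 0: return letter ? 1 : digit ? 2 : NT;
            case 1: return (letter || digit) ? 1 : NT;
            case 2: return digit ? 2 : (c == '.') ? 3 : NT;
            case 3: return digit ? 4 : NT;
            case 4: return digit ? 4 : NT;
            default: return NT;
        }
    }

    // Length of the longest prefix of input.substring(from) that is a symbol, or -1.
    static int longestSymbolPrefix(String input, int from) {
        int curState = 0;
        int curLength = 0;
        int symbolLength = -1;
        while (curState != NT) {
            if (FINAL[curState] != null) {
                symbolLength = curLength; // remember the last final state seen
            }
            if (from + curLength >= input.length()) {
                break; // end of input
            }
            curState = successor(curState, input.charAt(from + curLength));
            curLength++;
        }
        return symbolLength;
    }
}
```

For the input 12.x the automaton runs on to the state after the '.', fails on x, and falls back to the last final state, so the result is 2 (the integer constant 12).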
Longest Symbol Prefix with DFA (2)
Remarks:
• Computation of closure at construction time, not at runtime.
(Principle: Do as much statically as you can!)
• Problem of ambiguity still not solved.
Most scanner generators use ordering of rules in case of conflicts.
Longest Symbol Prefix with DFA (3)
Implementation Aspects:
• The constructed DFA can be minimized.
• Input buffering is important: cyclic arrays are often used (caution with the maximal token length, e.g. in the case of comments)
• Encode the DFA in a table
• Choose a suitable partitioning of the alphabet in order to reduce the number of transitions (i.e. the size of the table)
• Interface with the parser: usually the parser asks proactively for the next token (co-routines)
Recommended Reading
• Wilhelm, Maurer: Chap. 7, pp. 239-269 (more theoretical)
• Appel: Chap. 2, pp. 16-37 (more practical)
Additional Reading:
• Aho, Sethi, Ullman: Chap. 3 (very detailed)