
Syntax and Type Analysis Lecture Compilers SS 2009 Dr.-Ing. Ina Schaefer


(1)

Syntax and Type Analysis

Lecture Compilers SS 2009

Dr.-Ing. Ina Schaefer

Software Technology Group TU Kaiserslautern

Ina Schaefer Syntax and Type Analysis 1

(2)

Educational Objectives

Tasks of Different Syntax Analysis Phases

Interaction of Syntax Analysis Phases

Specification Techniques for Syntax Analysis

Generation Techniques

Usage of Tools

Lexical Analysis

Context-free Analysis (Parsing)

Context-sensitive Analysis

(3)

Introduction to Syntax and Type Analysis

Syntax Analysis

Tasks of Syntax Analysis

Check if Input is syntactically correct

Depending on the result:

- Error message
- Generation of an appropriate data structure for subsequent processing

(4)

Syntax Analysis Phases

Lexical Analysis: String → Token Stream (or Symbol String)

Context-free Analysis: Token Stream → Tree

Context-sensitive Analysis: Tree → Tree with Cross References

[Figure: Source Code as String → Scanner → Token Stream → Parser → Syntax Tree → Name and Type Analysis → Attributed Syntax Tree]

(5)

Reasons for Separation of Phases

Lexical and Context-free Analysis

- Reduced load for context-free analysis, e.g. whitespace is not required for context-free analysis

Context-free and Context-sensitive Analysis

- Context-sensitive analysis uses the tree structure instead of the token stream
- Advantages for construction of the target data structure

For Both Cases

- Increased efficiency
- Natural process (cf. natural language)
- More appropriate tool support

(6)

Lexical Analysis

(7)

Lexical Analysis

Tasks

Break the input character string into a symbol stream (or token stream) with respect to the language definition

Classify symbols into classes

Representation of symbols

- Hashing of identifiers
- Conversion of constants

Elimination of

- whitespace (spaces, comments, ...)
- external constructs (compiler directives, ...)

(8)

Lexical Analysis (2)

Terminology

Symbol: a word over an alphabet of characters (often with additional information, e.g. token class, encoding, position..)

Symbol Class: a set of symbols (identifiers, constants, ...); symbol classes correspond to the terminal symbols of a context-free grammar

(9)

Lexical Analysis: Example

Input Line 23:

␣␣if␣(␣A␣<=␣3.14␣)␣␣␣B␣=␣B--

Result of Lexical Analysis:

Token Class   String   Encoding                          Line:Column
IF            "if"                                       23:3
OPAR          "("                                        23:5
ID            "A"      72 (hash code of identifier)      23:7
RELOP         "<="     4 (encoding of operator <=)       23:9
FLOATCONST    "3.14"   3.14 (value of constant)          23:12
CPAR          ")"                                        23:16
ID            "B"      84 (hash code of identifier)      23:20
...

(Figure source: © A. Poetzsch-Heffter, TU Kaiserslautern, 25.04.2007)

(10)

Specification

The Specification of the Lexical Analysis is a Part of the Programming Language Specification.

The two Parts of Lexical Analysis Specification:

Scanning Algorithm (often only implicit)

Specification of Symbols and Symbol Classes

(11)

Lexical Analysis Specification of Scanners

Examples: Scanning

1. Statement in C

   B␣=␣B␣---␣A;

   Problem: Separation (-- and - are both symbols)
   Solution: The longest symbol is chosen, i.e. B␣=␣B␣--␣-␣A;

2. Java Fragment

   class␣public␣{␣public␣m()␣{...}␣}

   Problem: Ambiguity (keyword vs. identifier)
   Solution: Precedence rules

(12)

Standard Scan Algorithm (Concept)

Scanning is often implemented as a co-routine:

State is remainder of input

Co-Routine returns next symbol

In error cases, co-routine returns the UNDEF symbol and updates the input

(13)

Standard Scan Algorithm (Pseudo Code)

    String left_input := input;

    Symbol nextSymbol() {
      Symbol curSymbol := longestSymbolPrefix(left_input);
      left_input := cut(curSymbol, left_input);
      return curSymbol;
    }

where cut is defined as follows:

- if curSymbol ≠ UNDEF, curSymbol is removed from left_input
- else left_input remains unchanged.

(14)

Standard Scan Algorithm (2)

    longestSymbolPrefix(String egr) {
      // precondition: length(egr) > 0
      int curLength := 0;
      String curPrefix := prefix(curLength, egr);
      Symbol longestSymbol := UNDEF;
      while (curLength <= length(egr) && isSymbolPrefix(curPrefix)) {
        if (isSymbol(curPrefix)) {
          longestSymbol := curPrefix;
        }
        curLength++;
        if (curLength <= length(egr)) {   // do not read past the end of egr
          curPrefix := prefix(curLength, egr);
        }
      }
      return longestSymbol;
    }
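The two pseudocode procedures can be tried out with a small Python sketch. The symbol set (the operators `-`, `--`, `=`, `;` plus identifiers), the skipping of blanks between symbols, and stopping after the first UNDEF are assumptions made for illustration only; they are not part of the lecture's algorithm.

```python
import re

UNDEF = None  # stands in for the UNDEF symbol of the pseudocode

# Hypothetical toy language: four operator symbols and identifiers.
SYMBOLS = {"-", "--", "=", ";"}
IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")


def is_symbol(s):
    return s in SYMBOLS or IDENT.fullmatch(s) is not None


def is_symbol_prefix(s):
    # The empty string is a prefix of every symbol; every non-empty
    # prefix of an identifier is itself an identifier.
    return s == "" or is_symbol(s) or any(sym.startswith(s) for sym in SYMBOLS)


def longest_symbol_prefix(egr):
    cur_length = 0
    longest = UNDEF
    while cur_length <= len(egr) and is_symbol_prefix(egr[:cur_length]):
        if is_symbol(egr[:cur_length]):
            longest = egr[:cur_length]
        cur_length += 1
    return longest


def symbols(text):
    """Generator in the role of the co-routine: yields one symbol per step."""
    left_input = text.lstrip()          # assumption: blanks are skipped
    while left_input:
        cur = longest_symbol_prefix(left_input)
        if cur is UNDEF:
            yield UNDEF                 # input stays unchanged, so stop here
            return
        yield cur
        left_input = left_input[len(cur):].lstrip()


print(list(symbols("B = B --- A;")))
```

For the C statement from the scanning example, the sketch splits `---` into `--` followed by `-`, which is the longest-symbol rule in action.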

(15)

Standard Scan Algorithm (3)

Only two predicates have to be defined:

- isSymbolPrefix: String → bool
- isSymbol: String → bool

Remarks:

The standard scan algorithm is used in many modern languages, but not e.g. in FORTRAN, because blanks are not significant there except in literal symbols, e.g.

- DO 7 I = 1.25    "DO 7 I" is an identifier.
- DO 7 I = 1,25    "DO" is a keyword.

Error cases are not handled.

The complete realisation of longestSymbolPrefix is discussed later.

(16)

Specification of Symbols

Symbols are specified by regular expressions.

Symbol classes are described informally.

(17)

Regular Expressions

Let $\Sigma$ be an alphabet, i.e. a non-empty set of characters. $\Sigma^*$ is the set of all words over $\Sigma$, and $\varepsilon$ is the empty word.

Definition (Regular Expressions, Regular Languages)

$\varepsilon$ is a regular expression (r.e.) and denotes the language $L = \{\varepsilon\}$.

Each $a \in \Sigma$ is a r.e. and denotes the language $L = \{a\}$.

Let $r$ and $s$ be two r.e. defining the languages $R$ and $S$, resp. Then the following are r.e. and define the corresponding language $L$:

- $(r|s)$ with $L = R \cup S$ (Union)
- $rs$ with $L = \{vw \mid v \in R, w \in S\}$ (Concatenation)
- $r^*$ with $L = \{v_1 \ldots v_n \mid v_i \in R \text{ for } 1 \le i \le n,\ n \ge 0\}$ (Kleene Star)

The language $L \subseteq \Sigma^*$ is called regular iff there exists a r.e. $r$ defining $L$.

(18)

Regular Expressions (2)

Remarks:

$L = \emptyset$ is not regular according to the definition, but is often considered regular.

Other operators, e.g. +, ?, ., [], can be defined using the basic operators, e.g.

- $r^+ \equiv (rr^*) \equiv r^* \setminus \{\varepsilon\}$ (the last equivalence requires $\varepsilon \notin R$)
- $[aBd] \equiv a|B|d$
- $[a-g] \equiv a|b|c|d|e|f|g$

Caution: Regular expressions only define valid symbols and do not specify the program or translation units of a programming language.
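The derived operators can be checked experimentally. The sketch below uses Python's `re` module as a stand-in for the formal definition and compares the languages only for words up to a bounded length, so it is a sanity check and not a proof; the helper name `language` is made up for this example.

```python
import re
from itertools import product


def language(pattern, alphabet, max_len):
    """All words over `alphabet` of length <= max_len that match `pattern`."""
    words = ("".join(t)
             for n in range(max_len + 1)
             for t in product(alphabet, repeat=n))
    return {w for w in words if re.fullmatch(pattern, w)}


# r+ versus rr* and r* without the empty word (here r = a|B, which does not contain ε)
plus = language(r"(a|B)+", "aBd", 3)
print(plus == language(r"(a|B)(a|B)*", "aBd", 3))
print(plus == language(r"(a|B)*", "aBd", 3) - {""})

# character classes versus explicit alternatives
print(language(r"[aBd]", "aBd", 2) == language(r"a|B|d", "aBd", 2))
print(language(r"[a-g]", "abcdefgh", 2) == language(r"a|b|c|d|e|f|g", "abcdefgh", 2))
```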

(19)

Lexical Analysis Implementation of Scanners

Implementation of Scanners

[Figure: Sequence of Regular Expressions and Actions (input language of the scanner generator) → Scanner Generator → Scanner Program (usually in a programming language)]

(20)

Scanner Generator: JFlex

Typical Use of JFlex:

    java -jar JFlex.jar Example.jflex
    javac Yylex.java

Actions are written in Java.

Examples:

1. Regular Expression in JFlex

    [a-zA-Z_0-9] [a-zA-Z_0-9] *

2. JFlex Input with Abbreviations

    ZI = [0-9]
    BU = [a-zA-Z_]
    BUZI = [a-zA-Z_0-9]
    %%
    {BU}{BUZI}* { anAction(); }

(21)

A complete JFlex Example

    enum Token { DO, DOUBLE, IDENT, FLOATCONST, STRING; }
    %%
    %type Token // declare token type
    ZI = [0-9]
    BU = [a-zA-Z_]
    BUZI = [a-zA-Z_0-9]
    ZE = [a-zA-Z_0-9!?\]\[\.\t...]
    %%
    [ \t]*                { /* whitespace */ }
    "do"                  { return Token.DO; }
    "double"              { return Token.DOUBLE; }
    {BU}{BUZI}*           { return Token.IDENT; }
    {ZI}+\.{ZI}+          { return Token.FLOATCONST; }
    \"({ZE}|\\\")*\"      { return Token.STRING; }
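To see how such a rule list behaves, the following Python sketch imitates the matching policy that JFlex applies: the longest match wins, and among matches of equal length the rule listed first wins. It is only an illustration of that policy and not how JFlex is implemented. The STRING rule is left out because the ZE character class is elided in the specification, and the whitespace rule uses `+` instead of `*` so that it never matches the empty string.

```python
import re

# Rules in the order of the specification; None means "skip, no token".
RULES = [
    (re.compile(r"[ \t]+"), None),
    (re.compile(r"do"), "DO"),
    (re.compile(r"double"), "DOUBLE"),
    (re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*"), "IDENT"),
    (re.compile(r"[0-9]+\.[0-9]+"), "FLOATCONST"),
]


def tokens(text):
    pos = 0
    while pos < len(text):
        best_len, best_tok = 0, None
        for regex, tok in RULES:
            m = regex.match(text, pos)
            # Strictly longer matches replace the current best,
            # so on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_len, best_tok = m.end() - pos, tok
        if best_len == 0:
            raise ValueError(f"no rule matches at position {pos}")
        if best_tok is not None:
            yield best_tok, text[pos:pos + best_len]
        pos += best_len


print(list(tokens("do double doubles 3.14")))
```

The input `double` matches both the DOUBLE rule and the IDENT rule with the same length, and the earlier rule decides; `doubles` is longer as an identifier, so IDENT wins there.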

(22)

Scanner Generators

Scanner Generation uses the Equivalence between

- Regular expressions
- Non-deterministic finite automata (NFA)
- Deterministic finite automata (DFA)

The construction method is based on two steps:

- Regular Expressions → NFA
- NFA → DFA

(23)

Definition of NFA

Definition (Non-deterministic Finite Automaton)

A non-deterministic finite automaton is defined as a 5-tuple $M = (\Sigma, Q, \Delta, q_0, F)$ where

- $\Sigma$ is the input alphabet
- $Q$ is the set of states
- $q_0 \in Q$ is the initial state
- $F \subseteq Q$ is the set of final states
- $\Delta \subseteq Q \times (\Sigma \cup \{\varepsilon\}) \times Q$ is the transition relation.

(24)

Regular Expressions → NFA

Principle: For each regular sub-expression, construct an NFA with exactly one start state and one end state that accepts the same language.

[Figure: translation scheme with one NFA fragment each for ε, a, (r|s), (rs) and r*. Every fragment has a start state s0 and an end state f0; the sub-automata R (from s1 to f1) and S (from s2 to f2) are connected by ε-transitions. © A. Poetzsch-Heffter, TU Kaiserslautern]

(25)

Regular Expressions → NFA (2)

[Figure: step-by-step build-up of the translation scheme for ε, a, (r|s), (rs) and r*, each fragment with one start state s0 and one end state f0.]
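The translation scheme can be sketched in Python in the style of Thompson's construction. The tuple encoding of regular expressions (`"eps"`, `"chr"`, `"alt"`, `"cat"`, `"star"`) and the helper `accepts` are assumptions made for this example; the exact wiring of the ε-transitions follows the usual textbook construction and may differ in detail from the figure.

```python
from itertools import count

EPS = None  # label used for ε-transitions


def to_nfa(regex):
    """Return (start, end, delta); delta is a set of (state, label, state)."""
    fresh = count()
    delta = set()

    def build(r):
        kind = r[0]
        if kind == "cat":                       # (rs): link end of R to start of S
            s1, f1 = build(r[1])
            s2, f2 = build(r[2])
            delta.add((f1, EPS, s2))
            return s1, f2
        s0, f0 = next(fresh), next(fresh)       # one start and one end state
        if kind == "eps":
            delta.add((s0, EPS, f0))
        elif kind == "chr":
            delta.add((s0, r[1], f0))
        elif kind == "alt":                     # (r|s): branch into R and S
            for sub in r[1:]:
                s, f = build(sub)
                delta.update({(s0, EPS, s), (f, EPS, f0)})
        elif kind == "star":                    # r*: skip R or repeat it
            s1, f1 = build(r[1])
            delta.update({(s0, EPS, s1), (f1, EPS, s1),
                          (f1, EPS, f0), (s0, EPS, f0)})
        else:
            raise ValueError(kind)
        return s0, f0

    start, end = build(regex)
    return start, end, delta


def accepts(nfa, word):
    """Simulate the NFA on `word` by tracking the set of reachable states."""
    start, end, delta = nfa

    def closure(states):
        todo, seen = list(states), set(states)
        while todo:
            q = todo.pop()
            for (p, a, t) in delta:
                if p == q and a is EPS and t not in seen:
                    seen.add(t)
                    todo.append(t)
        return seen

    cur = closure({start})
    for ch in word:
        cur = closure({t for (p, a, t) in delta if p in cur and a == ch})
    return end in cur


# (a|b)*c
REGEX = ("cat", ("star", ("alt", ("chr", "a"), ("chr", "b"))), ("chr", "c"))
nfa = to_nfa(REGEX)
print(accepts(nfa, "abac"), accepts(nfa, "ab"))
```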

(26)

Example: Construction of NFA

[Figure: NFA constructed for the complete JFlex example, with states s0 to s26. Transitions are labelled with blank, TAB, the letters of "do" and "double", BU, BUZI, ZI, ".", ZE, "\" and the quote character; the remaining edges are ε-transitions. © A. Poetzsch-Heffter, TU Kaiserslautern]

(27)

ε-closure

Function closure computes the ε-closure of a set of states $s_1, \ldots, s_n$.

Definition (ε-closure)

For an NFA $M = (\Sigma, Q, \Delta, q_0, F)$ and a state $q \in Q$, the ε-closure of $q$ is defined by

$$\varepsilon\text{-closure}(q) = \{p \in Q \mid p \text{ reachable from } q \text{ via } \varepsilon\text{-transitions}\}$$

For $S \subseteq Q$, the ε-closure of $S$ is defined by

$$\varepsilon\text{-closure}(S) = \bigcup_{s \in S} \varepsilon\text{-closure}(s)$$
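A minimal worklist sketch of this definition in Python follows. The dictionary `eps_moves`, which maps a state to the states reachable by a single ε-transition, and the concrete state numbers are assumptions chosen for the example.

```python
def eps_closure(states, eps_moves):
    """All states reachable from `states` via zero or more ε-transitions."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        q = worklist.pop()
        for p in eps_moves.get(q, ()):
            if p not in closure:        # visit every state at most once
                closure.add(p)
                worklist.append(p)
    return closure


# Hypothetical ε-transitions: 0 → 1, 0 → 2, 1 → 3, 3 → 1 (a cycle), 4 → 5
eps_moves = {0: {1, 2}, 1: {3}, 3: {1}, 4: {5}}

print(eps_closure({0}, eps_moves))
print(eps_closure({0, 4}, eps_moves))
```

Every state belongs to its own closure, and the closure of a set is the union of the closures of its members, as in the definition.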

(28)

Longest Symbol Prefix with NFA

    longestSymbolPrefix(char[] egr) {
      // precondition: length(egr) > 0; egr[1] is the first character
      StateSet curState := closure({s0});
      int curLength := 0;
      int symbolLength := undef;
      while (curLength <= length(egr) && !isEmptySet(curState)) {
        if (contains(curState, finalState)) {
          symbolLength := curLength;
        }
        curLength++;
        if (curLength <= length(egr)) {   // do not read past the end of egr
          curState := closure(successor(curState, egr[curLength]));
        }
      }
      return symbol(prefix(egr, symbolLength));
    }
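The same loop can be written in Python with 0-based indexing. The NFA used here is a hand-built assumption for illustration: it recognises the keyword `do` (final state 3) and identifiers made of lower-case letters (final state 5), joined by ε-transitions from the start state 0.

```python
EPS = ""
LETTERS = "abcdefghijklmnopqrstuvwxyz"

# Transition relation as a dict: (state, label) -> set of successor states.
DELTA = {(0, EPS): {1, 4}, (1, "d"): {2}, (2, "o"): {3}}
for c in LETTERS:
    DELTA[(4, c)] = {5}
    DELTA[(5, c)] = {5}
FINAL = {3, 5}


def closure(states):
    result, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for p in DELTA.get((q, EPS), ()):
            if p not in result:
                result.add(p)
                todo.append(p)
    return result


def successor(states, ch):
    return {p for q in states for p in DELTA.get((q, ch), ())}


def longest_symbol_prefix(egr):
    cur_state = closure({0})
    cur_length = 0
    symbol_length = None                    # "undef" in the pseudocode
    while cur_state:
        if cur_state & FINAL:
            symbol_length = cur_length      # remember the last accepting position
        if cur_length == len(egr):
            break
        cur_state = closure(successor(cur_state, egr[cur_length]))
        cur_length += 1
    return None if symbol_length is None else egr[:symbol_length]


print(longest_symbol_prefix("dox = 1"), longest_symbol_prefix("do("))
```

For the input `do(` the state set after reading `do` contains both final states, which is exactly the ambiguity between keyword and identifier that still has to be resolved.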

(29)

Longest Symbol Prefix with NFA (2)

Remark:

The problem of ambiguity is not solved yet:

If more than one token matches the longest input prefix, one of these tokens is returned by the function symbol.

(30)

NFA → DFA

Principle:

For each NFA, a DFA can be constructed that accepts the same language. (In general, this does not hold for NFA with output.)

Properties of DFA:

No ε-transitions.

Transitions are determined by a function.

(31)

NFA → DFA (2)

Definition (Deterministic Finite State Automaton)

A deterministic finite automaton is defined as a 5-tuple $M = (\Sigma, Q, \Delta, q_0, F)$ where

- $\Sigma$ is the input alphabet
- $Q$ is the set of states
- $q_0 \in Q$ is the initial state
- $F \subseteq Q$ is the set of final states
- $\Delta : Q \times \Sigma \to Q$ is the transition function.

(32)

NFA → DFA (3)

Construction: (according to John Myhill)

The States of the DFA are subsets of NFA states

(powerset construction). Subsets of finite sets are also finite.

The start state of the DFA is the ε-closure of the NFA start state.

The final states of the DFA are the sets of states that contain an NFA final state.

The successor state of a state $S$ in the DFA under input $a$ is obtained by

- computing all successors $p$ of the states $q \in S$ under $a$ in the NFA
- and adding the ε-closure of $p$.

(33)

NFA → DFA (4)

If working with character classes (e.g. [a-f]), characters and character classes at outgoing transitions must be disjoint.

Completion of automaton for error handling:

- Insert an additional (final) state nT (the nonToken state)
- For each state, add a transition to the nonToken state for each character for which no outgoing transition exists.

(34)

NFA → DFA (5)

Definition (DFA for NFA)

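Let $M = (\Sigma, Q, \Delta, q_0, F)$ be an NFA. Then the DFA $M'$ corresponding to the NFA $M$ is defined as $M' = (\Sigma, Q', \Delta', q_0', F')$ where

- the set of states is $Q' \subseteq \mathcal{P}(Q)$, the power set of $Q$
- the initial state $q_0'$ is the ε-closure of $q_0$
- the final states are $F' = \{S \subseteq Q \mid S \cap F \neq \emptyset\}$
- $\Delta'(S, a) = \varepsilon\text{-closure}(\{p \mid (q, a, p) \in \Delta, q \in S\})$ for all $a \in \Sigma$.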
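The definition can be turned into a short Python sketch that builds only the subsets reachable from the start state. The dictionary encoding of the transition relation and the example NFA (it accepts the words over {a, b} that end in ab) are assumptions for illustration.

```python
from collections import deque

EPS = ""


def eps_closure(states, delta):
    result, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for p in delta.get((q, EPS), ()):
            if p not in result:
                result.add(p)
                todo.append(p)
    return result


def nfa_to_dfa(alphabet, delta, q0, final):
    """Powerset construction restricted to reachable subsets."""
    start = frozenset(eps_closure({q0}, delta))
    dfa_delta, seen, todo = {}, {start}, deque([start])
    while todo:
        S = todo.popleft()
        for a in alphabet:
            step = {p for q in S for p in delta.get((q, a), ())}
            T = frozenset(eps_closure(step, delta))
            dfa_delta[(S, a)] = T       # the empty set acts as an error state
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_final = {S for S in seen if S & final}
    return seen, dfa_delta, start, dfa_final


# Hypothetical NFA: 0 -ε-> 1, 1 loops on a and b, 1 -a-> 2, 2 -b-> 3 (final)
NFA_DELTA = {(0, EPS): {1}, (1, "a"): {1, 2}, (1, "b"): {1}, (2, "b"): {3}}
states, dfa_delta, start, dfa_final = nfa_to_dfa("ab", NFA_DELTA, 0, {3})


def dfa_accepts(word):
    S = start
    for ch in word:
        S = dfa_delta[(S, ch)]
    return S in dfa_final


print(len(states), dfa_accepts("aab"), dfa_accepts("aba"))
```

All ε-closures are computed while the DFA is built, so running `dfa_accepts` needs only one table lookup per input character.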

(35)

Example: DFA

[Figure: DFA obtained from the example NFA by the powerset construction. Each DFA state is a set of NFA states, e.g. the start state s0,1,2,5,12,14,18. Transitions are labelled with blank, TAB, the letters d, o, u, b, l, e, the classes BU\{d}, BUZI and BUZI without single letters, ZI, ".", ZE, "\" and the quote character. For clarity, transitions to the nonToken state nT are only sketched. © A. Poetzsch-Heffter, TU Kaiserslautern, 25.04.2007]

(36)

Longest Symbol Prefix with DFA

    longestSymbolPrefix(char[] egr) {
      // precondition: length(egr) > 0; egr[1] is the first character
      State curState := start_state;
      int curLength := 0;
      int symbolLength := undef;
      while (curLength <= length(egr) && curState != nT) {
        if (curState is FinalState) {
          symbolLength := curLength;
        }
        curLength++;
        if (curLength <= length(egr)) {   // do not read past the end of egr
          curState := successor(curState, egr[curLength]);
        }
      }
      return symbol(prefix(egr, symbolLength));
    }
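A table-driven version of this loop might look as follows in Python. The DFA is a small hand-written assumption: it accepts float constants of the form ZI+ "." ZI+ and, purely for illustration, integer constants under the made-up token name INTCONST. The alphabet is partitioned into three character classes, and every missing transition leads to the nonToken state NT.

```python
NT = -1  # nonToken (error) state


def char_class(ch):
    """Partition of the alphabet: 0 = digit, 1 = '.', 2 = anything else."""
    if ch in "0123456789":
        return 0
    if ch == ".":
        return 1
    return 2


# TABLE[state][char_class] -> next state
TABLE = [
    [1, NT, NT],   # 0: start
    [1, 2, NT],    # 1: digits read (final, INTCONST)
    [3, NT, NT],   # 2: digits and "." read
    [3, NT, NT],   # 3: fraction digits read (final, FLOATCONST)
]
FINAL = {1: "INTCONST", 3: "FLOATCONST"}


def longest_symbol_prefix(egr):
    state, cur_length, symbol = 0, 0, None
    while state != NT:
        if state in FINAL:
            symbol = (FINAL[state], egr[:cur_length])   # last accepting position
        if cur_length == len(egr):
            break
        state = TABLE[state][char_class(egr[cur_length])]
        cur_length += 1
    return symbol


print(longest_symbol_prefix("3.14)"))
print(longest_symbol_prefix("42.x"))
```

For `42.x` the automaton reads past `42.` before it reaches NT and then falls back to the last accepting position, which is why the scanner has to remember the symbol length separately from the current length.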

(37)


Longest Symbol Prefix with DFA (2)

Remarks:

Computation of closure at construction time, not at runtime.

(Principle: Do as much statically as you can!)

Problem of ambiguity still not solved.

Most scanner generators use ordering of rules in case of conflicts.


(38)

Longest Symbol Prefix with DFA (3)

Implementation Aspects:

Constructed DFA can be minimized.

Input buffering is important: often use of cyclic arrays (caution with maximal token length, e.g. in case of comments)

Encode DFA in table

Choose suitable partitioning of alphabet in order to reduce number of transitions (i.e. size of table)

Interface with Parser: usually parser asks proactively for next token (co-routines)

(39)

Recommended Reading

Wilhelm, Maurer: Chap. 7, pp. 239-269 (more theoretical)

Appel: Chap. 2, pp. 16-37 (more practical)

Additional Reading:

Aho, Sethi, Ullman: Chap. 3 (very detailed)
