
Syntax and Type Analysis

Lecture Compilers Summer Term 2011

Prof. Dr. Arnd Poetzsch-Heffter

Software Technology Group TU Kaiserslautern


Content of Lecture

1. Introduction: Overview and Motivation
2. Syntax and Type Analysis
   2.1 Lexical Analysis
   2.2 Context-Free Syntax Analysis
   2.3 Context-Dependent Syntax Analysis
3. Translation to Target Language
   3.1 Translation of Imperative Language Constructs
   3.2 Translation of Object-Oriented Language Constructs
4. Selected Aspects of Compilers
   4.1 Intermediate Languages
   4.2 Optimization
   4.3 Data Flow Analysis
   4.4 Register Allocation

2. Syntax and Type Analysis

Educational Objectives

Tasks of different syntax analysis phases

Interaction of syntax analysis phases

Specification techniques for syntax analysis

Generation techniques

Usage of tools

Lexical analysis

Context-free analysis (parsing)

Context-sensitive analysis


Introduction to Syntax and Type Analysis

Syntax Analysis

Tasks of Syntax Analysis

Check if input is syntactically correct

Depending on the result:
  • Error message
  • Generation of an appropriate data structure for subsequent processing


Syntax and Type Analysis Phases

Lexical analysis:

Character stream → token stream (or symbol stream)

Context-free analysis:

Token stream → syntax tree

Context-sensitive analysis:

Syntax tree → syntax tree with cross-references

[Figure: Source Code → (Character Stream) → Scanner → (Token Stream) → Parser → (Syntax Tree) → Name and ...; the phases together form Syntax and Type Analysis]


Reasons for Separation of Phases

Lexical and context-free analysis
  • Reduced load for context-free analysis, e.g., whitespace is not needed for context-free analysis

Context-free and context-sensitive analysis
  • Context-sensitive analysis uses the tree structure instead of the token stream
  • Advantages for the construction of the target data structure

For both cases
  • Increased efficiency
  • Natural process (cf. natural language)
  • More appropriate tool support


2.1. Lexical Analysis


Tasks

Break the input character stream into a token stream according to the language definition

Classify tokens into token classes

Representation of tokens
  • Hashing of identifiers
  • Conversion of constants

Elimination of
  • whitespace (spaces, comments, ...)
  • external constructs (compiler directives, ...)


Terminology

Token/symbol: a word over an alphabet of characters (often with additional information, e.g. token class, encoding, position, ...)

Token class: a set of tokens (identifiers, constants, ...); token classes correspond to the terminal symbols of a context-free grammar

Remark: the terms token and symbol refer to the same concept. The term token is generally used when talking about parsing technology, whereas the term symbol is used when talking about formal languages.


Lexical Analysis: Example

Input (line 23):

    if ( A <= 3.14 ) B = B--

Token Class   String    Token Information        Line:Col
IF            "if"                               23:3
OPAR          "("                                23:5
ID            "A"       72 (hash)                23:7
RELOP         "<="      4 (encoding)             23:9
FLOATCONST    "3.14"    3.14 (constant value)    23:12
CPAR          ")"                                23:16
ID            "B"       84 (hash)                23:20
...
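A scanner needs some value type to carry the rows of such a table to the parser. The following is a minimal sketch of one possible representation; the names `TokenClass`, `ScannedToken`, `TokenTableSketch`, and `exampleTokens` are hypothetical and not part of the lecture, and the position is read as line 23 followed by the column, which is an assumption about the table.

```java
import java.util.List;

/** Token classes that occur in the example (hypothetical names for this sketch). */
enum TokenClass { IF, OPAR, ID, RELOP, FLOATCONST, CPAR }

/**
 * One scanned token: class, matched string, optional extra information
 * (hash, encoding, or constant value; null if absent), and source position.
 */
record ScannedToken(TokenClass tokenClass, String text, Object info, int line, int column) {}

final class TokenTableSketch {
    /** The tokens listed in the example table for input line 23. */
    static List<ScannedToken> exampleTokens() {
        return List.of(
            new ScannedToken(TokenClass.IF, "if", null, 23, 3),
            new ScannedToken(TokenClass.OPAR, "(", null, 23, 5),
            new ScannedToken(TokenClass.ID, "A", 72, 23, 7),          // 72: hash of the identifier
            new ScannedToken(TokenClass.RELOP, "<=", 4, 23, 9),       // 4: encoding of the operator
            new ScannedToken(TokenClass.FLOATCONST, "3.14", 3.14, 23, 12),
            new ScannedToken(TokenClass.CPAR, ")", null, 23, 16),
            new ScannedToken(TokenClass.ID, "B", 84, 23, 20));
    }
}
```

Storing the converted constant (3.14 as a number) next to the matched string mirrors the "conversion of constants" task: later phases do not have to parse the text again.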


Lexical Analysis Specification of Scanners

Specification

The specification of the lexical analysis is a part of the language specification.

The two parts of lexical analysis specification:

Scanning algorithm (often only implicit)

Specification of tokens and token classes


Examples: Scanning

1. Statement in C

    B = B --- A;

Problem: Separation (both -- and - are tokens)
Solution: The longest token is chosen, i.e.,

    B = B -- - A;

2. Java fragment

    class public { public m() {...} }

Problem: Ambiguity (keyword or identifier)
Solution: Precedence rules


Standard Scan Algorithm (Concept)

Scanning is often implemented as a procedure:

Procedure returns next token

State is remainder of input

In error cases, returns the UNDEF token and updates the input


Standard Scan Algorithm (Pseudo Code)

CharStream inputRest := input;

Token nextToken() {
  Token curToken := longestTokenPrefix(inputRest);
  inputRest := cut(curToken, inputRest);
  return curToken;
}

where cut is defined as follows:

if curToken ≠ UNDEF, curToken is removed from inputRest

else inputRest remains unchanged.


Standard Scan Algorithm (2)

Token longestTokenPrefix(CharStream ir) {
  require availableChar(ir) > 0
  int curLength := 1;
  String curPrefix := prefix(curLength, ir);
  Token longestToken := UNDEF;
  while (curLength <= availableChar(ir) && isTokenPrefix(curPrefix)) {
    if (isToken(curPrefix)) {
      longestToken := curPrefix;   // remember the longest token seen so far
    }
    curLength++;
    curPrefix := prefix(curLength, ir);
  }
  return longestToken;
}


Standard Scan Algorithm (3)

Predicates to be defined:

isTokenPrefix: String → boolean

isToken: String → boolean

Remarks:

The standard scan algorithm is used in many modern languages, but not, e.g., in FORTRAN, because blanks are not significant there (except in literal tokens), e.g.
  • DO 7 I = 1.25    “DO 7 I” is an identifier.
  • DO 7 I = 1,25    “DO” is a keyword.

Error cases are not handled.

The complete realization of longestTokenPrefix is discussed later.
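The standard scan algorithm can be tried out on the C statement `B = B --- A;` from the scanning examples. The sketch below is only an illustration under stated assumptions: the toy token set (the operators `-`, `--`, `=`, `;` and alphabetic identifiers), the class name `StandardScanSketch`, the use of `null` for UNDEF, and the choice to skip whitespace and to throw on errors are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class StandardScanSketch {
    // Hypothetical toy token set: a few operators plus alphabetic identifiers.
    private static final Set<String> OPERATORS = Set.of("-", "--", "=", ";");

    static boolean isToken(String s) {
        return OPERATORS.contains(s) || s.matches("[A-Za-z]+");
    }

    // In this toy language every non-empty prefix of a token is itself a token,
    // so both predicates coincide. In general they differ: "3." is a prefix of
    // the float constant "3.14" but not a token.
    static boolean isTokenPrefix(String s) {
        return isToken(s);
    }

    /** Returns the longest prefix of rest that is a token, or null (UNDEF) if none exists. */
    static String longestTokenPrefix(String rest) {
        String longest = null;
        int len = 1;
        while (len <= rest.length() && isTokenPrefix(rest.substring(0, len))) {
            String prefix = rest.substring(0, len);
            if (isToken(prefix)) {
                longest = prefix;
            }
            len++;
        }
        return longest;
    }

    /** Repeatedly cuts the longest token off the input; whitespace between tokens is dropped. */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        String rest = input.stripLeading();
        while (!rest.isEmpty()) {
            String token = longestTokenPrefix(rest);
            if (token == null) {
                // Unlike the pseudo code, this sketch aborts instead of returning UNDEF.
                throw new IllegalArgumentException("no token at: " + rest);
            }
            tokens.add(token);
            rest = rest.substring(token.length()).stripLeading();
        }
        return tokens;
    }
}
```

Calling `StandardScanSketch.scan("B = B --- A;")` splits `---` into `--` followed by `-`, which is the longest-token rule at work.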


Specification of Token Classes

Token classes are defined by regular expressions (REs).

REs specify the set of strings that belong to a certain token class.


Regular Expressions

Let Σ be an alphabet, i.e. a non-empty set of characters. Σ* is the set of all words over Σ, ε is the empty word.

Definition (Regular expressions, regular languages)

ε is an RE and specifies the language L = {ε}.

Each a ∈ Σ is an RE and specifies the language L = {a}.

Let r and s be two REs specifying the languages R and S, resp.

Then the following are REs and specify the language L:
  • (r|s) with L = R ∪ S (union)
  • rs with L = {vw | v ∈ R, w ∈ S} (concatenation)
  • r* with L = {v1 ... vn | n ≥ 0, vi ∈ R for 1 ≤ i ≤ n} (Kleene star)

The language L ⊆ Σ* is called regular if there exists an RE r defining L.


Remarks:

L = ∅ is not regular according to the definition, but is often considered regular.

Other operators, e.g. +, ?, ., [], can be defined using the basic operators, e.g.
  • r+ ≡ (r r*) ≡ r* \ {ε}, where the second equivalence requires that ε is not in the language of r
  • [aBd] ≡ a|B|d
  • [a-g] ≡ a|b|c|d|e|f|g

Caution: Regular expressions only define valid tokens and do not specify the program or translation units of a programming language.
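The reduction of derived operators to the basic ones can be spot-checked with `java.util.regex`. This is a sketch, not a proof: it only compares two patterns on a finite list of sample strings, and Java's regex engine is a backtracking matcher rather than the automaton-based scanner discussed in this lecture. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

final class DerivedOperatorsSketch {
    /** True if both patterns accept exactly the same strings among the given samples. */
    static boolean agreeOn(String re1, String re2, List<String> samples) {
        Pattern p1 = Pattern.compile(re1);
        Pattern p2 = Pattern.compile(re2);
        for (String s : samples) {
            // matches() tests the whole string, which corresponds to language membership.
            if (p1.matcher(s).matches() != p2.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

For instance, `agreeOn("a+", "aa*", List.of("", "a", "aa", "b"))` compares `r+` with `r r*` for `r = a`, while comparing `a+` with `a*` on the empty string exposes the difference.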


Lexical Analysis Implementation of Scanners

Implementation of Scanners

Sequence of regular expressions and actions (input language of the scanner generator)
  → Scanner Generator
  → Scanner program (usually in a programming language)


Scanner Generator: JFlex

Typical use of JFlex:

    java -jar JFlex.jar Example.jflex
    javac Yylex.java

Actions are written in Java

Examples:

1. Regular expression in JFlex

    [a-zA-Z_0-9][a-zA-Z_0-9]*

2. JFlex input with abbreviations

ZI = [0-9]

BU = [a-zA-Z_]


A Complete JFlex Example

enum Token { DO, DOUBLE, IDENT, FLOATCONST, STRING; }

%%

%line
%column
%debug
%type Token    // declare token type

ZI = [0-9]
BU = [a-zA-Z_]
BUZI = [a-zA-Z_0-9]
ZE = [a-zA-Z_0-9!?\]\[\. \t...]
WhiteSpace = [ \t\n]

%%

{WhiteSpace}      { }
"double"          { return Token.DOUBLE; }
"do"              { return Token.DO; }
{BU}{BUZI}*       { return Token.IDENT; }
{ZI}+\.{ZI}+      { return Token.FLOATCONST; }
\"({ZE}|\\\")*\"  { return Token.STRING; }
<<EOF>>           { System.out.println("FINISHED"); return null; }


Scanner Generators

Scanner generation uses the equivalence between
  • Regular expressions
  • Non-deterministic finite automata (NFA)
  • Deterministic finite automata (DFA)

The construction method is based on two steps:
  • Regular expressions → NFA
  • NFA → DFA


Definition of NFA

Definition (Non-deterministic Finite Automaton)

A non-deterministic finite automaton is defined as a 5-tuple M = (Σ,Q,∆,q0,F)

where

Σ is the input alphabet

Q is the set of states

q0 ∈ Q is the initial state

F ⊆ Q is the set of final states

∆ ⊆ Q × (Σ ∪ {ε}) × Q is the transition relation.


Regular Expressions → NFA

Principle: For each regular sub-expression, construct NFA with one start and end state that accepts the same language.

Translation scheme (step 1: regular expressions → NFA):

• ε
• a
• (r|s)
• (rs)
• r*

[Figure: one NFA fragment per case. ε: s0 →ε f0. a: s0 →a f0. (r|s): new states s0 and f0, connected by ε-transitions to the fragments s1 R f1 and s2 S f2. (rs): s1 R f1 →ε s2 S f2. r*: new states s0 and f0 around the fragment s1 R f1, connected by ε-transitions.]
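The translation scheme can be sketched in code: each case builds a fragment with exactly one start and one end state. The following is a hedged illustration, not the lecture's implementation. All names (`ThompsonSketch`, `Fragment`, `Edge`, `EPS`) are hypothetical, the character `'\0'` is assumed to stand for ε, and the ε-edges used for `r*` follow the common Thompson-style construction, which may differ in detail from the lecture's figure.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ThompsonSketch {
    static final char EPS = '\0';                  // assumed marker for an epsilon transition

    record Edge(int from, char label, int to) {}
    record Fragment(int start, int end) {}         // exactly one start and one end state

    final List<Edge> edges = new ArrayList<>();
    private int nextState = 0;

    private int newState() { return nextState++; }
    private void edge(int from, char label, int to) { edges.add(new Edge(from, label, to)); }

    Fragment epsilon() { return symbol(EPS); }

    Fragment symbol(char a) {
        int s = newState(), f = newState();
        edge(s, a, f);
        return new Fragment(s, f);
    }

    Fragment union(Fragment r, Fragment s) {       // (r|s)
        int s0 = newState(), f0 = newState();
        edge(s0, EPS, r.start()); edge(s0, EPS, s.start());
        edge(r.end(), EPS, f0);   edge(s.end(), EPS, f0);
        return new Fragment(s0, f0);
    }

    Fragment concat(Fragment r, Fragment s) {      // (rs)
        edge(r.end(), EPS, s.start());
        return new Fragment(r.start(), s.end());
    }

    Fragment star(Fragment r) {                    // r*
        int s0 = newState(), f0 = newState();
        edge(s0, EPS, r.start()); edge(r.end(), EPS, f0);
        edge(s0, EPS, f0);        edge(r.end(), EPS, r.start());
        return new Fragment(s0, f0);
    }

    /** Simulates the NFA on w, tracking the set of states reachable so far. */
    boolean accepts(Fragment frag, String w) {
        Set<Integer> cur = closure(Set.of(frag.start()));
        for (char c : w.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (Edge e : edges) {
                if (e.label() == c && cur.contains(e.from())) next.add(e.to());
            }
            cur = closure(next);
        }
        return cur.contains(frag.end());
    }

    private Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (Edge e : edges) {
                if (e.from() == q && e.label() == EPS && result.add(e.to())) work.push(e.to());
            }
        }
        return result;
    }
}
```

Building `(a|bc)*` then reads as `t.star(t.union(t.symbol('a'), t.concat(t.symbol('b'), t.symbol('c'))))` for a `ThompsonSketch t`. The sketch assumes input strings never contain the `'\0'` character.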


Example: Construction of NFA

Translation applied to the example from slide 41 of the original German slide deck:

[Figure: NFA with states s0 to s26. Transitions are labeled with blank and TAB, the letters d, o, u, b, l, e, the character classes BU, BUZI, ZI, ZE, the characters "." and "\", and ε.]


ε-closure

Function closure computes the ε-closure of a set of states {s1, ..., sn}.

Definition (ε-closure)

For an NFA M = (Σ,Q,∆,q0,F) and a state q ∈ Q, the ε-closure of q is defined by

    ε-closure(q) = {p ∈ Q | p reachable from q via ε-transitions}

For S ⊆ Q, the ε-closure of S is defined by

    ε-closure(S) = ⋃_{s ∈ S} ε-closure(s)
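The closure of a set of states can be computed with a simple worklist. The sketch below assumes that the ε-transitions are given as a map from a state to its direct ε-successors; this representation and the names `EpsilonClosureSketch` and `epsTransitions` are choices made for the example, not taken from the lecture.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class EpsilonClosureSketch {
    /**
     * Computes the epsilon-closure of the given states.
     * epsTransitions maps a state to the states reachable by one epsilon transition.
     */
    static Set<Integer> closure(Set<Integer> states, Map<Integer, Set<Integer>> epsTransitions) {
        Set<Integer> result = new TreeSet<>(states);      // every state reaches itself
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int p : epsTransitions.getOrDefault(q, Set.of())) {
                if (result.add(p)) {                      // p is new: explore it later
                    work.push(p);
                }
            }
        }
        return result;
    }
}
```

Each state enters the worklist at most once, so the loop terminates even if the ε-transitions form a cycle.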


Longest Token Prefix with NFA

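Token longestTokenPrefix(char[] ir) {
  // requires length(ir) > 0; ir is indexed from 0
  StateSet curState := closure({s0});
  int curLength := 0;
  int tokenLength := undef;
  while (!isEmptySet(curState)) {
    if (contains(curState, FinalState)) {
      tokenLength := curLength;            // longest token found so far
    }
    if (curLength == length(ir)) break;    // input exhausted
    curState := closure(successor(curState, ir[curLength]));
    curLength++;
  }
  // The end of this procedure is missing in the source; the return below is an assumption.
  return (tokenLength == undef) ? UNDEF : token(prefix(tokenLength, ir));
}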


Remark:

Problem of ambiguity:

If more than one token matches the longest input prefix, procedure token nondeterministically returns one of them.


NFA → DFA

Principle:

For each NFA, a DFA can be constructed that accepts the same language. (In general, this does not hold for NFAs with output.)

Properties of DFA:

No ε-transitions

Transitions are deterministic given the input char


Definition (Deterministic Finite State Automaton)

A deterministic finite automaton is defined as a 5-tuple

M = (Σ,Q,∆,q0,F) where

Σ is the input alphabet

Q is the set of states

q0 ∈ Q is the initial state

F ⊆ Q is the set of final states

∆ : Q×Σ → Q is the transition function.


Construction: (according to John Myhill)

The states of the DFA are subsets of the NFA states (powerset construction). Since the set of NFA states is finite, it has only finitely many subsets, so the DFA is finite as well.

The start state of the DFA is the ε-closure of the NFA start state.

The final states of the DFA are the sets of states that contain an NFA final state.

The successor state of a state S in the DFA under input a is obtained by
  • computing all successors p of the states q ∈ S under a in the NFA
  • and adding the ε-closure of each p


If working with character classes (e.g. [a-f]), characters and character classes at outgoing transitions must be disjoint.

Completion of automaton for error handling:

  • Insert an additional (final) state nT (nonToken)
  • For each state and each character without an outgoing transition, add a transition to the nonToken state.


Definition (DFA for NFA)

Let M = (Σ,Q,∆,q0,F) be an NFA. Then the DFA M′ corresponding to the NFA M is defined as M′ = (Σ,Q′,∆′,q0′,F′) where

the set of states is Q′ ⊆ P(Q), the power set of Q

the initial state q0′ is the ε-closure of q0

the final states are F′ = {S ⊆ Q | S ∩ F ≠ ∅}

∆′(S,a) = ε-closure({p | (q,a,p) ∈ ∆, q ∈ S}) for all a ∈ Σ.
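The definition translates almost literally into a worklist algorithm that only builds the subsets reachable from the start state. The sketch below is an illustration under assumptions: `'\0'` stands for ε, DFA states are represented as sorted sets of NFA state numbers, the empty set plays the role of the error state, and all names (`SubsetConstructionSketch`, `Edge`, `toDfa`, `isFinal`) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class SubsetConstructionSketch {
    static final char EPS = '\0';                   // assumed marker for an epsilon transition
    record Edge(int from, char label, int to) {}    // one element (q, a, p) of the relation

    static Set<Integer> closure(Set<Integer> states, List<Edge> delta) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (Edge e : delta) {
                if (e.from() == q && e.label() == EPS && result.add(e.to())) work.push(e.to());
            }
        }
        return result;
    }

    /** Successor of DFA state s under a: closure of all NFA successors under a. */
    static Set<Integer> successor(Set<Integer> s, char a, List<Edge> delta) {
        Set<Integer> targets = new TreeSet<>();
        for (Edge e : delta) {
            if (e.label() == a && s.contains(e.from())) targets.add(e.to());
        }
        return closure(targets, delta);
    }

    /** Transition table of the DFA; the first key is the start state. */
    static Map<Set<Integer>, Map<Character, Set<Integer>>> toDfa(
            List<Edge> delta, int q0, Set<Character> sigma) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> table = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = closure(Set.of(q0), delta);
        table.put(start, new HashMap<>());
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> s = work.remove();
            for (char a : sigma) {
                Set<Integer> t = successor(s, a, delta);
                table.get(s).put(a, t);
                if (!table.containsKey(t)) {        // newly discovered DFA state
                    table.put(t, new HashMap<>());
                    work.add(t);
                }
            }
        }
        return table;
    }

    /** A DFA state is final if it contains at least one final NFA state. */
    static boolean isFinal(Set<Integer> dfaState, Set<Integer> nfaFinals) {
        for (int q : nfaFinals) {
            if (dfaState.contains(q)) return true;
        }
        return false;
    }
}
```

Restricting the construction to reachable subsets matches Q′ ⊆ P(Q) in the definition: the full power set is usually far larger than what the automaton actually needs.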


Example: DFA

[Figure: DFA constructed from the example NFA. The DFA states are labeled with sets of NFA states, e.g. s0,1,2,5,12,14,18 or s19,20,21,22,25. Transitions are labeled with blank (LZ), TAB, the letters d, o, u, b, l, e, the character classes BU, BUZI, ZI, ZE (partly with single letters excluded, e.g. BUZI\{u}), the characters "." and "\", and the quote character.]

For readability, transitions to nT are only sketched.


Longest Token Prefix with DFA

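Token longestTokenPrefix(char[] ir) {
  // requires length(ir) > 0; ir is indexed from 0
  State curState := StartState;
  int curLength := 0;
  int tokenLength := undef;
  while (curState != nT) {
    if (curState is FinalState) {
      tokenLength := curLength;            // longest token found so far
    }
    if (curLength == length(ir)) break;    // input exhausted
    curState := successor(curState, ir[curLength]);
    curLength++;
  }
  // The end of this procedure is missing in the source; the return below is an assumption.
  return (tokenLength == undef) ? UNDEF : token(prefix(tokenLength, ir));
}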


Remarks:

Computation of closure at construction time, not at runtime.

(Principle: Do as much statically as you can!)

Problem of ambiguity still not solved. However, many scanner generators allow the user to control which token is returned. For example, JFlex returns the token of the first rule in the JFlex file that matches the longest input prefix.


Implementation Aspects:

Constructed DFA can be minimized.

Input buffering is important: often use of cyclic arrays (caution with maximal token length, e.g. in case of comments)

Encode DFA in table

Choose suitable partitioning of alphabet in order to reduce number of transitions (i.e. size of table)

Interface with parser: usually parser asks proactively for next token
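Two of these aspects, the table encoding and the partitioning of the alphabet, can be sketched for a small DFA. The example below assumes a single token class, float constants of the form digits "." digits, and three character classes (digit, dot, everything else). The state names, the table layout, and the convention of returning -1 for "no token" are assumptions made for this illustration.

```java
final class DfaScanSketch {
    // Character classes partition the alphabet, so the table needs 3 columns
    // instead of one column per character.
    static final int DIGIT = 0, DOT = 1, OTHER = 2;

    static final int START = 0, INT_PART = 1, AFTER_DOT = 2, FRACTION = 3, NT = 4;

    // TABLE[state][charClass] = successor state; NT is the nonToken (error) state.
    static final int[][] TABLE = {
        /* START     */ { INT_PART, NT,        NT },
        /* INT_PART  */ { INT_PART, AFTER_DOT, NT },
        /* AFTER_DOT */ { FRACTION, NT,        NT },
        /* FRACTION  */ { FRACTION, NT,        NT },
        /* NT        */ { NT,       NT,        NT },
    };

    static int charClass(char c) {
        if (c >= '0' && c <= '9') return DIGIT;
        return c == '.' ? DOT : OTHER;
    }

    /** Length of the longest prefix of ir that is a token, or -1 if there is none. */
    static int longestTokenPrefix(String ir) {
        int state = START;
        int curLength = 0;
        int tokenLength = -1;
        while (state != NT) {
            if (state == FRACTION) {              // FRACTION is the only final state here
                tokenLength = curLength;
            }
            if (curLength == ir.length()) {
                break;                            // input exhausted
            }
            state = TABLE[state][charClass(ir.charAt(curLength))];
            curLength++;
        }
        return tokenLength;
    }
}
```

On the input `3.14.5` the automaton runs into nT at the second dot, and the remembered length 4 identifies `3.14` as the longest token; on `3.x` no final state is ever reached.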


Recommended Reading

Wilhelm, Maurer: Chap. 7, pp. 239-269 (more theoretical)

Appel: Chap. 2, pp. 16-37 (more practical)

Additional Reading:

Aho, Sethi, Ullman: Chap. 3 (very detailed)
