tailieunhanh - Lecture Programming languages (2/e): Chapter 3b - Tucker, Noonan

Chapter 3 - Lexical and syntactic analysis: Lexical analysis. This chapter presents the following content: Example tokens, other sequences, regular expressions, clite lexical syntax, finite state automata, some conventions, lexer code, from design to code,. | Programming Languages 2nd edition Tucker and Noonan Chapter 3 Lexical and Syntactic Analysis Syntactic sugar causes cancer of the semicolon. A. Perlis Contents Chomsky Hierarchy Lexical Analysis Syntactic Analysis Lexical Analysis Purpose: transform program representation Input: printable Ascii characters Output: tokens Discard: whitespace, comments Defn: A token is a logically cohesive sequence of characters representing a single symbol. Example Tokens Identifiers Literals: 123, , 'x', true Keywords: bool char . Operators: + - * / . Punctuation: ; , ( ) { } Other Sequences Whitespace: space tab Comments // any-char* end-of-line End-of-line End-of-file Why a Separate Phase? Simpler, faster machine model than parser 75% of time spent in lexer for non-optimizing compiler Differences in character sets End of line convention differs Regular Expressions RegExpr Meaning x a character x \x an escaped character, ., \n { name } a reference to a name M | N M or N M N M followed by N M* zero or more occurrences of M RegExpr Meaning M+ One or more occurrences of M M? Zero or one occurrence of M [aeiou] the set of vowels [0-9] the set of digits . Any single character Clite Lexical Syntax Category Definition anyChar [ -~] Letter [a-zA-Z] Digit [0-9] Whitespace [ \t] Eol \n Eof \004 Category Definition Keyword bool | char | else | false | float | if | int | main | true | while Identifier {Letter}({Letter} | {Digit})* integerLit {Digit}+ floatLit {Digit}+\.{Digit}+ charLit ‘{anyChar}’ Category Definition Operator = | || | && | == | != | | >= | + | - | * | / |! | [ | ] Separator : | . | { | } | ( | ) Comment // ({anyChar} | {Whitespace})* {eol} Generators Input: usually regular expression Output: table (slow), code C/C++: Lex, Flex Java: JLex Finite State Automata Set of states: representation – graph nodes Input alphabet + unique end symbol State transition function Labelled (using alphabet) arcs in graph Unique start state One or more final . | Programming Languages 2nd edition Tucker and Noonan Chapter 3 Lexical and Syntactic Analysis Syntactic sugar causes cancer of the semicolon. A. Perlis Contents Chomsky Hierarchy Lexical Analysis Syntactic Analysis Lexical Analysis Purpose: transform program representation Input: printable Ascii characters Output: tokens Discard: whitespace, comments Defn: A token is a logically cohesive sequence of characters representing a single symbol. Example Tokens Identifiers Literals: 123, , 'x', true Keywords: bool char . Operators: + - * / . Punctuation: ; , ( ) { } Other Sequences Whitespace: space tab Comments // any-char* end-of-line End-of-line End-of-file Why a Separate Phase? Simpler, faster machine model than parser 75% of time spent in lexer for non-optimizing compiler Differences in character sets End of line convention differs Regular Expressions RegExpr Meaning x a character x \x an escaped character, ., \n { name } a reference to a name M | N M or N M N M .