Compiler: Difference between revisions
(13 intermediate revisions by the same user not shown) | |||
Line 71: | Line 71: | ||
= Semantic Analysis = | = Semantic Analysis = | ||
* Last "front end" phase (together with lexical analysis & parsing) to enforce language | |||
* Catches all remaining errors | |||
Coolc checks | |||
# all identifiers are declared | |||
# types | |||
# inheritance relationships | |||
# classes defined only once | |||
# methods in a class defined only once | |||
# reserved identifiers are not misused | |||
# ... | |||
== Scope == | |||
* Static scope: scope depends only on the program text, not on run-time behavior | |||
* Dynamically scope: scope depends on execution of the program | |||
== Type == | |||
* It doesn't make sense to add a function pointer and an integer in C. | |||
* It does make sense to add two integers | |||
* But both have the same assembly language implementation! | |||
A language's type system specifies which operations are valid for which types. | |||
Three kinds of languages: | |||
* Statistically typed: All or almost all check of types is done as part of compilation (C, Java, Cool) | |||
* Dynamically typed: Almost all checking of types is done as part of program execution (Scheme, Python, Perl) | |||
* Untyped: No type checking (machine code) | |||
= Run-time Organization = | |||
== Activation Record & Stack == | |||
== Global and Heap == | |||
<pre> | |||
Memory | |||
Low address +----------------------+ | |||
| Code | | |||
+----------------------+ | |||
| Static Data | | |||
+----------------------+ | |||
| Stack (can grow) | | |||
| | | | |||
| v | | |||
|......................| | |||
| ^ | | |||
| | | | |||
| Heap (can grow) | | |||
| | | |||
High address +----------------------+ | |||
</pre> | |||
= Code Generation = | |||
= Optimization = | = Optimization = | ||
= | = Garbage collection = | ||
[https://www.cloudsavvyit.com/10031/what-is-garbage-collection-and-how-does-it-affect-your-programs-performance/ What Is Garbage Collection, and How Does It Affect Your Program’s Performance?] | |||
== Automatic Memory Managment == | |||
C and C++ programs have many storage bugs | |||
* forgetting to free unused memory | |||
* dereferencing a dangling pointer | |||
* overwriting parts of a data structure by accident | |||
* and so on ... | |||
How do we know an object will "never be used again"? | |||
Observation: a program can use only the objects that it can find: | |||
<pre> | |||
let x : A <- new A in { x <- y; ... } | |||
</pre> | |||
An object x is '''reachable''' if and only if | |||
* a register contains a pointer to x, or | |||
* another reachable object y contains a pointer to x. | |||
An unreachable object can never be used - such objects are '''garbage'''. | |||
== Mark and Sweep phases == | |||
When memory runs out, GC executes two phases | |||
* Mark phase: trace reachable objects | |||
* Sweep phase: collects garbage objects | |||
Every object has an extra bit: the mark bit | |||
* reserved for memory management | |||
* initially the mark bit is 0 | |||
* set to 1 for the reachable objects in the mark phase. | |||
== Stop and Copy == | |||
= Resource = | = Resource = | ||
* coursera.org | * [https://class.coursera.org/compilers-004/ coursera.org] |
Latest revision as of 21:50, 10 March 2021
Lexical Analysis
if (i == j) z = 0; else z = 1;
is indeed below in computers
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
An implementation must do
- Recognize substrings corresponding to tokens
- Identify the token class of each lexeme
Token Class
Identifier, keywords, '(', ')', Numbers, ...
- Token classes correspond to sets of strings.
- Identifier: A1, Foo, B17
- Integer: 0, 99
- Keyword: 'else' or 'if' or 'begin' or ...
- Whitespace: if___else
For the last code example, the tokens are: if, whitespace, (, i, == , j, \t, \n, else, z, =, 1, ;
Token string ---> Lexical Analysis -------> Parser
Regular Languages
Regular expressions specify regular languages.
Five constructs
- Two base cases - empty and 1-character strings
- Three compound expressions - union, concatenation, iteration.
Finite Automata
- Regular expressions = specification
- Finite automata = implementation
A finite automaton consists of
- input alphabet
- set of states
- start state
- set of accepting states
- set of transitions
Parsing
Input | Ouput | |
---|---|---|
Lexer | Strings of characters | Strings of tokens |
Parser | String of tokens | Parse tree |
Context-Free Grammars (CFG)
Parser must distinguish between valid and invalid strings of tokens.
Programming languages have recursive structure.
Semantic Analysis
- Last "front end" phase (together with lexical analysis & parsing) to enforce language
- Catches all remaining errors
Coolc checks
- all identifiers are declared
- types
- inheritance relationships
- classes defined only once
- methods in a class defined only once
- reserved identifiers are not misused
- ...
Scope
- Static scope: scope depends only on the program text, not on run-time behavior
- Dynamically scope: scope depends on execution of the program
Type
- It doesn't make sense to add a function pointer and an integer in C.
- It does make sense to add two integers
- But both have the same assembly language implementation!
A language's type system specifies which operations are valid for which types.
Three kinds of languages:
- Statistically typed: All or almost all check of types is done as part of compilation (C, Java, Cool)
- Dynamically typed: Almost all checking of types is done as part of program execution (Scheme, Python, Perl)
- Untyped: No type checking (machine code)
Run-time Organization
Activation Record & Stack
Global and Heap
Memory Low address +----------------------+ | Code | +----------------------+ | Static Data | +----------------------+ | Stack (can grow) | | | | | v | |......................| | ^ | | | | | Heap (can grow) | | | High address +----------------------+
Code Generation
Optimization
Garbage collection
What Is Garbage Collection, and How Does It Affect Your Program’s Performance?
Automatic Memory Managment
C and C++ programs have many storage bugs
- forgetting to free unused memory
- dereferencing a dangling pointer
- overwriting parts of a data structure by accident
- and so on ...
How do we know an object will "never be used again"?
Observation: a program can use only the objects that it can find:
let x : A <- new A in { x <- y; ... }
An object x is reachable if and only if
- a register contains a pointer to x, or
- another reachable object y contains a pointer to x.
An unreachable object can never be used - such objects are garbage.
Mark and Sweep phases
When memory runs out, GC executes two phases
- Mark phase: trace reachable objects
- Sweep phase: collects garbage objects
Every object has an extra bit: the mark bit
- reserved for memory management
- initially the mark bit is 0
- set to 1 for the reachable objects in the mark phase.