Compiler: Difference between revisions

From 太極
Jump to navigation Jump to search
 
(13 intermediate revisions by the same user not shown)
Line 71: Line 71:


= Semantic Analysis =
= Semantic Analysis =
* Last "front end" phase (together with lexical analysis & parsing) to enforce language
* Catches all remaining errors
Coolc checks
# all identifiers are declared
# types
# inheritance relationships
# classes defined only once
# methods in a class defined only once
# reserved identifiers are not misused
# ...
== Scope ==
* Static scope: scope depends only on the program text, not on run-time behavior
* Dynamically scope: scope depends on execution of the program
== Type ==
* It doesn't make sense to add a function pointer and an integer in C.
* It does make sense to add two integers
* But both have the same assembly language implementation!
A language's type system specifies which operations are valid for which types.
Three kinds of languages:
* Statistically typed: All or almost all check of types is done as part of compilation (C, Java, Cool)
* Dynamically typed: Almost all checking of types is done as part of program execution (Scheme, Python, Perl)
* Untyped: No type checking (machine code)
= Run-time Organization =
== Activation Record & Stack ==
== Global and Heap ==
<pre>
Memory
Low address  +----------------------+
            |          Code        |
            +----------------------+
            |      Static Data    |
            +----------------------+
            |  Stack (can grow)  |
            |          |          |
            |          v          |
            |......................|
            |          ^          |
            |          |          |
            |  Heap (can grow)    |
            |                      |
High address +----------------------+
</pre>
= Code Generation =


= Optimization =
= Optimization =


= Code Generation =
= Garbage collection =
[https://www.cloudsavvyit.com/10031/what-is-garbage-collection-and-how-does-it-affect-your-programs-performance/ What Is Garbage Collection, and How Does It Affect Your Program’s Performance?]
 
== Automatic Memory Managment ==
C and C++ programs have many storage bugs
* forgetting to free unused memory
* dereferencing a dangling pointer
* overwriting parts of a data structure by accident
* and so on ...
 
How do we know an object will "never be used again"?
 
Observation: a program can use only the objects that it can find:
<pre>
  let x : A <- new A in { x <- y; ... }
</pre>
 
An object x is '''reachable''' if and only if
* a register contains a pointer to x, or
* another reachable object y contains a pointer to x.
 
An unreachable object can never be used - such objects are '''garbage'''.
 
== Mark and Sweep phases ==
When memory runs out, GC executes two phases
* Mark phase: trace reachable objects
* Sweep phase: collects garbage objects
 
Every object has an extra bit: the mark bit
* reserved for memory management
* initially the mark bit is 0
* set to 1 for the reachable objects in the mark phase.
 
== Stop and Copy ==


= Resource =
= Resource =
* coursera.org
* [https://class.coursera.org/compilers-004/ coursera.org]

Latest revision as of 21:50, 10 March 2021

Lexical Analysis

  if (i == j)
    z = 0;
  else
    z = 1;

is indeed below in computers

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

An implementation must do

  1. Recognize substrings corresponding to tokens
  2. Identify the token class of each lexeme

Token Class

Identifier, keywords, '(', ')', Numbers, ...

  • Token classes correspond to sets of strings.
  • Identifier: A1, Foo, B17
  • Integer: 0, 99
  • Keyword: 'else' or 'if' or 'begin' or ...
  • Whitespace: if___else

For the last code example, the tokens are: if, whitespace, (, i, == , j, \t, \n, else, z, =, 1, ;

                                 Token    
   string ---> Lexical Analysis -------> Parser

Regular Languages

Regular expressions specify regular languages.

Five constructs

  • Two base cases - empty and 1-character strings
  • Three compound expressions - union, concatenation, iteration.

Finite Automata

  • Regular expressions = specification
  • Finite automata = implementation

A finite automaton consists of

  • input alphabet
  • set of states
  • start state
  • set of accepting states
  • set of transitions

Parsing

Input Ouput
Lexer Strings of characters Strings of tokens
Parser String of tokens Parse tree

Context-Free Grammars (CFG)

Parser must distinguish between valid and invalid strings of tokens.

Programming languages have recursive structure.

Semantic Analysis

  • Last "front end" phase (together with lexical analysis & parsing) to enforce language
  • Catches all remaining errors

Coolc checks

  1. all identifiers are declared
  2. types
  3. inheritance relationships
  4. classes defined only once
  5. methods in a class defined only once
  6. reserved identifiers are not misused
  7. ...

Scope

  • Static scope: scope depends only on the program text, not on run-time behavior
  • Dynamically scope: scope depends on execution of the program

Type

  • It doesn't make sense to add a function pointer and an integer in C.
  • It does make sense to add two integers
  • But both have the same assembly language implementation!

A language's type system specifies which operations are valid for which types.

Three kinds of languages:

  • Statistically typed: All or almost all check of types is done as part of compilation (C, Java, Cool)
  • Dynamically typed: Almost all checking of types is done as part of program execution (Scheme, Python, Perl)
  • Untyped: No type checking (machine code)

Run-time Organization

Activation Record & Stack

Global and Heap

Memory

Low address  +----------------------+
             |          Code        |
             +----------------------+
             |      Static Data     |
             +----------------------+
             |   Stack (can grow)   |
             |          |           |
             |          v           |
             |......................|
             |          ^           |
             |          |           |
             |   Heap (can grow)    |
             |                      |
High address +----------------------+

Code Generation

Optimization

Garbage collection

What Is Garbage Collection, and How Does It Affect Your Program’s Performance?

Automatic Memory Managment

C and C++ programs have many storage bugs

  • forgetting to free unused memory
  • dereferencing a dangling pointer
  • overwriting parts of a data structure by accident
  • and so on ...

How do we know an object will "never be used again"?

Observation: a program can use only the objects that it can find:

  let x : A <- new A in { x <- y; ... }

An object x is reachable if and only if

  • a register contains a pointer to x, or
  • another reachable object y contains a pointer to x.

An unreachable object can never be used - such objects are garbage.

Mark and Sweep phases

When memory runs out, GC executes two phases

  • Mark phase: trace reachable objects
  • Sweep phase: collects garbage objects

Every object has an extra bit: the mark bit

  • reserved for memory management
  • initially the mark bit is 0
  • set to 1 for the reachable objects in the mark phase.

Stop and Copy

Resource