1.2.2. High-level languages
Broadly speaking, the execution of a functional program carries out a series of function calls that lead to the result, with intermediate values stored exclusively in the registers. The execution of an imperative program carries out a sequence of modifications of memory cells named by identifiers, the values in the cells being computed during execution. The most widespread high-level languages include both functional and imperative features, along with various possibilities (modules, object features, etc.) to divide source code into pieces that can be reused.
Whatever the style of programming used, any program written in a high-level language needs to be translated into binary language to be executed. These translations are executed either every time the program is executed – in which case the translation program is known as an interpreter or just once, storing the produced binary code – in which case the translator is known as a compiler.
As we have seen, high-level languages facilitate the coding of algorithms. They ease reviewing of the source code of a program, as the text is more concise than it would be for the same algorithm in assembly code. This does not, however, imply that users gain a better understanding of the way the program works. To write a program, a precise knowledge of the constructs used – in other terms, their semantics, what they do and what they mean – is crucial to understand the source code. Bugs are not always the result of algorithm coding errors, and are often caused by an erroneous interpretation of elements of the language. For example, the incrementation operator ++
in C exists in two forms (i++
or ++i
), and its understanding is not as simple as it may seem. For example, the program:
C #include <stdio.h> int main () { int i = 0 ; printf (“%d\n”, i++) ; return (0) ; }
will print 0, but if i++
is replaced with ++i
, the same program will print 1.
There are a number of concepts that are common to all high-level languages: value naming, organization of namespaces, explicit memory management, etc. However, these concepts may be expressed using different syntactic constructs. The field of language semantics covers a set of logico-mathematical theories, which describe these concepts and their properties. Constructing the semantics of a program allows to the formal verification of whether the program possesses all of the required properties.
1.2.3. From source code to executable programs
The transition from the program source to its execution is a multistep process. Some of these steps may differ in different languages. In this section, we shall give an overview of the main steps involved in analyzing and transforming source code, applicable to most programming languages.
The source code of a program is made up of one or more text files. Indeed, to ease software architecture, most languages allow source code to be split across several files, known as compilation units. Each file is processed separately prior to the final phase, in which the results of processing are combined into one single executable file.
1.2.3.1. Lexical analysis
Lexical analysis is the first phase of translation: it converts the sequence of characters that is indeed the source file into a sequence of words, assigning each to a category. Comments are generally deleted at this stage. Thus, in the following text presumed to be written in C
/* This is a comment. */ if [x == 3 int +) cos ($v)
lexical analysis will recognize the keyword if
, the opening bracket, the identifier x
, the operator ==
, the integer constant 3
, the type identifier int
, etc. No word in C can contain the character $
, so a lexical error will be highlighted when $v
is encountered.
Lexical analysis may be seen as a form of “spell check”, in which each recognized word is assigned to a category (keyword, constant, identifier). These words are referred to as tokens.
1.2.3.2. Syntactic analysis
Every language follows grammar. For example, in English, a sentence is generally considered to be correctly formed if it contains a subject, verb and complement in an understandable order. Programming languages are no exception: syntactic analysis verifies that the phrases of a source file conform with the grammar of their language. For example, in C, the keyword if
must be followed by a bracketed expression, an instruction must end with a semicolon, etc. Clearly, the source text given in the example above in the context of lexical analysis does not respect the syntax of C.
Technically, the syntactic analyzer is in charge of the complete grammatical analysis of the source file. It calls the lexical analyzer every time it requires a token to progress through the analyzed source. Syntactic analysis is thus a form of grammar verification, and it also builds a representation of the source file by a data structure, which is often a tree, called the abstract syntax tree (AST). This data structure will be used by all the following phases of compilation, up to the point of execution by an interpreter or the creation of an executable file.
1.2.3.3. Semantic analyses
The first two analysis phases of compilation only concern the textual structure of the source. They do not concern the meaning of the program, i.e. its semantics. Source texts that pass the syntactic analysis phase do not always have meaning. The phrase “the sea eats a derivable rabbit” is grammatically correct, but is evidently nonsense.
The best-known semantic analysis is the typing analysis, which prohibits the combination of elements that are incompatible in nature. Thus, in the previous phase, “derivable” could be applicable to a function, but certainly not to a “rabbit”.
Semantic analyses do not reduce to a form of typing analysis but they all interpret the constructs of a program according to the semantics of the chosen language. Semantic analyses may be used to eliminate programs, which leads to execution errors. They may also apply some transformations to program code in order to get an executable file (dependency analysis, closure elimination, etc.). These semantic analyses may be carried out during subsequent passes of source code processing, even after the code generation phase described in the following section.
1.2.3.4. Code interpretation/generation
Once the abstract syntax tree (or a derived tree) has been created, there are two options. Either the tree may be executed directly via an interpreter, which is a program supplied by the programming language, or the AST is used to generate object code files, with the aim of creating an executable file that can be run independently. Let us first focus on the second approach. The interpretation mechanism will be discussed later.
Compilation uses the AST generated from the source file to produce a sequence of instructions to be executed either by the CPU or by a virtual machine (VM). The compilation is correct if the execution of this sequence of instructions gives a result, which conforms to the program’s semantics.
Optimization phases may take place during or after object code generation, with the aim of improving its compactness or its execution speed. Modern compilers implement a range of optimizations, which study lies outside the scope of this book. Certain optimizations are “universal”, while others may be specific to the CPU for which the code is generated.