Marc Kerbiquet

Writing a Fast Compiler

2024-02-04

I'm going to describe the various tricks I used to write fast compilers for my programming languages. By fast compilation, I mean compiling at least 500,000 lines of code per second (excluding blank lines and comments) on a single CPU core.

Does it Matter?

You may argue that compilation time is not important. After all, once released, who cares that a program took hours to build; as users, we only want it to work and to work fast. It's like complaining that the last Pixar movie took days for the final rendering.

However, it can severely affect the development cycle and make developers angry. It's 2024, and the most common complaint about Rust is still its compilation time.

Speed also affects the design of the compiler: when my biggest program is less than 100K SLOC and I can compile 500K SLOC per second, I don't really have to worry about separate compilation, since a complete build takes less than 200ms. This is good, because separate compilation can be tricky with genericity.

Designing a Language for Fast Compilation

If you're writing a compiler for an existing language, e.g. C++, you have no control here: you'll have to deal with a grammar that can't be parsed with simple LL(k) techniques, a preprocessor and a terrible module system. Conversely, if you're writing a compiler for your own programming language, careful design choices can help a lot.

I've always used a context-free grammar that can be easily parsed with a simple recursive descent parser. If a syntax is easy for a computer to parse, it will also be easy for a human to parse.

A simple syntax will also make the development of independent tools easier (static analyzers, formatting tools, refactoring, syntax highlighting, ...).

General Rules

Minimizing Code and Memory Access

Less code to execute and fewer memory accesses usually mean faster execution. While modern architectures don't make this principle strictly true, it is still a good rule to follow.

I avoid copying data as much as possible. Many languages use zero-terminated strings; I prefer a pair of (start_pointer, size) or (start_pointer, end_pointer) instead: it allows, for instance, referring to any sub-string of the input buffer directly, without making a copy.
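For illustration, here is a minimal sketch in C of such a string slice; the names (Slice, slice_make, slice_equal) are mine, not the compiler's:

    #include <stddef.h>
    #include <string.h>
    #include <stdbool.h>

    /* A non-owning view into the input buffer: no allocation, no copy. */
    typedef struct {
        const char *start;
        size_t      size;
    } Slice;

    /* Refer to a sub-string directly inside the source buffer. */
    static Slice slice_make(const char *start, size_t size) {
        Slice s = { start, size };
        return s;
    }

    /* Equality without any intermediate allocation. */
    static bool slice_equal(Slice a, Slice b) {
        return a.size == b.size && memcmp(a.start, b.start, a.size) == 0;
    }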

Reducing Memory Usage

The less memory I use, the more of it fits in cache.

Optimizing the Common Path

A lot of the compiler's work is checking for errors, but a program usually has no errors or very few. Therefore the code must be optimized with the assumption that errors are exceptional.

Memory Management: Using Memory Regions

A memory region is a contiguous block of memory where parts can be allocated but not de-allocated individually (the entire region must be de-allocated at once). Allocation consists of just advancing a pointer, and possibly creating a new block when the current one is full. This makes allocations extremely fast. De-allocation is also extremely fast since all objects of a region are freed at once.
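As a rough sketch of what such a region allocator can look like in C (the block size and names are arbitrary, and oversized allocations are ignored for brevity):

    #include <stdlib.h>
    #include <stdint.h>
    #include <stddef.h>

    #define REGION_BLOCK_SIZE (64 * 1024)

    typedef struct Block {
        struct Block *next;
        size_t        used;
        uint8_t       data[REGION_BLOCK_SIZE];
    } Block;

    typedef struct {
        Block *current;   /* block we are currently allocating from */
        Block *first;     /* head of the chain, kept for bulk freeing */
    } Region;

    static void *region_alloc(Region *r, size_t size) {
        size = (size + 7) & ~(size_t)7;   /* 8-byte alignment */
        if (r->current == NULL || r->current->used + size > REGION_BLOCK_SIZE) {
            /* Current block is full: chain a new one. */
            Block *b = malloc(sizeof(Block));
            b->next = NULL;
            b->used = 0;
            if (r->current) r->current->next = b; else r->first = b;
            r->current = b;
        }
        void *p = r->current->data + r->current->used;
        r->current->used += size;
        return p;
    }

    /* De-allocation frees every object of the region at once. */
    static void region_free(Region *r) {
        for (Block *b = r->first; b != NULL; ) {
            Block *next = b->next;
            free(b);
            b = next;
        }
        r->first = r->current = NULL;
    }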

This kind of memory management fits a compiler very well: a single region can be used for a compilation unit. In practice I use three regions.

However, when compiling functions, a lot of temporary objects are created, mainly dictionaries of names for each lexical scope. To handle this I create pools of regions: instead of creating and destroying regions, I pick one from a pool and put it back when finished.
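A possible sketch of such a pool, reusing the Region and Block types from the previous sketch (the fixed pool size is arbitrary):

    #define POOL_MAX 16

    typedef struct {
        Region regions[POOL_MAX];
        int    count;
    } RegionPool;

    static Region pool_acquire(RegionPool *p) {
        if (p->count > 0)
            return p->regions[--p->count];    /* reuse a region from the pool */
        Region r = { NULL, NULL };
        return r;                             /* pool empty: start with a fresh one */
    }

    static void pool_release(RegionPool *p, Region r) {
        if (p->count == POOL_MAX || r.first == NULL) {
            region_free(&r);                  /* pool full (or region empty): discard */
            return;
        }
        /* Keep the first block for reuse, free the overflow blocks. */
        Block *keep = r.first;
        for (Block *b = keep->next; b != NULL; ) {
            Block *next = b->next;
            free(b);
            b = next;
        }
        keep->next = NULL;
        keep->used = 0;
        r.first = r.current = keep;
        p->regions[p->count++] = r;
    }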

Resizable arrays and open-addressing hash tables are not suitable data structures for memory regions since they need a lot of de-allocations and re-allocations.

To store lists of elements, when possible I count the elements first and then allocate a fixed-size array; when that's not possible, I just use a linked list.

For hash tables, which are heavily used for names, I use separate chaining instead of open addressing. It eliminates re-allocation of arrays, but it requires choosing the size of the hash table carefully: the global namespace needs a bigger table than the inner scope of a function.
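A sketch of such a scope table with separate chaining, where both the bucket array and the entries are allocated in a region (it reuses the Slice and Region sketches above; the hash function and the names are only examples):

    /* Entries and the bucket array live in a region: nothing is ever
       re-allocated, the chains simply grow. */
    typedef struct Entry {
        struct Entry *next;     /* separate chaining */
        Slice         name;
        void         *symbol;   /* whatever the scope maps names to */
    } Entry;

    typedef struct {
        Entry **buckets;
        size_t  bucket_count;   /* large for the global namespace, small for a function scope */
    } Scope;

    static size_t hash_slice(Slice s) {
        size_t h = 5381;                          /* djb2, just for illustration */
        for (size_t i = 0; i < s.size; i++)
            h = h * 33 + (unsigned char)s.start[i];
        return h;
    }

    static void scope_init(Scope *sc, Region *r, size_t bucket_count) {
        sc->bucket_count = bucket_count;
        sc->buckets = region_alloc(r, bucket_count * sizeof(Entry *));
        for (size_t i = 0; i < bucket_count; i++)
            sc->buckets[i] = NULL;
    }

    static Entry *scope_lookup(Scope *sc, Slice name) {
        Entry *e = sc->buckets[hash_slice(name) % sc->bucket_count];
        for (; e != NULL; e = e->next)
            if (slice_equal(e->name, name))
                return e;
        return NULL;
    }

    static void scope_insert(Scope *sc, Region *r, Slice name, void *symbol) {
        size_t b = hash_slice(name) % sc->bucket_count;
        Entry *e = region_alloc(r, sizeof(Entry));
        e->next = sc->buckets[b];
        e->name = name;
        e->symbol = symbol;
        sc->buckets[b] = e;
    }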

Lexical Analyzer: Identifiers as Numbers

CPUs are not designed to work with strings; they are designed to work with fixed-size integers.

An important part of the compilation process is finding which entity is associated with an identifier. If identifiers are stored as strings, this will be slow.

The compiler does not need to know the content of an identifier except to present it to the user when an error occurs or to store it in a debug symbol table in the object file. It only needs to test equality between identifiers, so it can work internally with an integer assigned to each distinct identifier.

Since the lexer already needs to do a lookup in a dictionary for keywords, the additional cost of converting an identifier into a unique integer during lexical analysis is very small.
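A sketch of what this interning can look like, reusing the Slice, Region and hash helpers from the previous sketches; keywords could be pre-registered in the same table with reserved values:

    /* Maps each distinct identifier spelling to a small integer.  The spelling
       is kept only to print error messages and to emit debug symbols. */
    typedef struct Interned {
        struct Interned *next;
        Slice            name;
        int              id;
    } Interned;

    typedef struct {
        Interned *buckets[4096];
        int       last_id;        /* ids start at 1 */
        Region   *region;
    } InternTable;

    static int intern_identifier(InternTable *t, Slice name) {
        size_t b = hash_slice(name) % 4096;
        for (Interned *e = t->buckets[b]; e != NULL; e = e->next)
            if (slice_equal(e->name, name))
                return e->id;                     /* seen before: same integer */
        Interned *e = region_alloc(t->region, sizeof(Interned));
        e->name = name;
        e->id   = ++t->last_id;
        e->next = t->buckets[b];
        t->buckets[b] = e;
        return e->id;
    }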

The following source code

    func main(args)
        if args.size < 2

will produce this flow of lexical units:

  1. type = keyword, value = Keyword.func
  2. type = identifier, value = 1
  3. type = openParen
  4. type = identifier, value = 2
  5. type = newline
  6. type = keyword, value = Keyword.if
  7. type = identifier, value = 2
  8. type = point
  9. type = identifier, value = 3
  10. ...

where each distinct identifier gets its own number: 1 for main, 2 for args and 3 for size. Note that both occurrences of args map to the same value 2.

Optimizing the Syntax Analysis

I use a simple classic architecture for my compiler:

  Lexical Analysis ---> Syntax Analysis ---> Build ---> Code Generation

where the lexical analysis turns the source text into a flow of lexical units, the syntax analysis turns this flow into an abstract syntax tree (AST), the build step resolves names and types and reports errors, and the code generation produces machine code.

With a carefully designed grammar, the lexical analyser is just a loop with a giant switch statement processing input characters, and the syntax analyser is a hand-coded recursive descent parser combined with an operator-precedence scheme for infix expressions.

The recursive descent parser is made of functions like this:

    e = parseExpression
    i = parseInstruction
    b = parseBlock
    id = parseIdentifier
    ...

The natural way to handle errors is to return a special value when there is a syntax error, e.g. a null value, but that forces every caller to check the return value:

    e = parseExpression
    if e == nil
        return nil
    end

The trick is to report the error and return a dummy expression instead, e.g. a '0' expression. The syntax error function displays the first error and drops the following ones, since they can be completely unrelated to the real problem.
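Here is a hypothetical C sketch of the pattern; the Parser and Expr types, the token names and the error message are all invented for the example:

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { TOKEN_INTEGER, TOKEN_IDENTIFIER, TOKEN_OPEN_PAREN, TOKEN_EOF } TokenType;
    typedef struct { int is_literal; long value; } Expr;
    typedef struct { TokenType token; long token_value; int line; int error_reported; } Parser;

    static Expr zero_expr = { 1, 0 };              /* shared dummy '0' expression */

    /* Report only the first syntax error; the following ones are dropped. */
    static Expr *syntax_error(Parser *p, const char *message) {
        if (!p->error_reported) {
            fprintf(stderr, "line %d: %s\n", p->line, message);
            p->error_reported = 1;
        }
        return &zero_expr;                         /* a valid node, never nil */
    }

    static Expr *parse_primary(Parser *p) {
        if (p->token == TOKEN_INTEGER) {           /* the common, valid path */
            Expr *e = malloc(sizeof(Expr));
            e->is_literal = 1;
            e->value = p->token_value;
            return e;
        }
        /* ... identifiers, parenthesized expressions, etc. ... */
        return syntax_error(p, "expression expected");  /* no caller ever tests */
    }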

Not only is the code faster by eliminating many tests, it is also simpler: it is not cluttered with a test on every line, and it has the advantages of exceptions without the drawbacks (hidden exit points, cost of stack unwinding).

This trick comes with a cost: I stop at the first syntax error (contrary to the build step, where I try to report as many errors as possible).

Compiling Only What's Needed

When importing a library, you usually end up with tons of constants, structures and functions that you're not using.

It's a good idea to analyse only what's needed.

The syntactic analysis still has to read and parse the whole input in order to create the AST. For instance:

    class Label: Widget
        attr caption: String
        attr color: Color

        def init(caption: String)
            ...

        // ... hundreds of lines of code ...
    end

But for the build step it is just:

    class Label

The content of the class and its parent can be completely ignored until needed. The compiler just needs to associate Label with a yet-undefined entity, so that it can find it later and ensure that no other entity is defined with the same name.
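A sketch of how such a yet-undefined entity could be represented; the names and the state machine are only an illustration of the idea:

    /* A global name is registered as soon as its declaration is parsed,
       but its content is only analyzed on first use. */
    typedef enum { ENTITY_UNANALYZED, ENTITY_IN_PROGRESS, ENTITY_READY } EntityState;

    typedef struct Entity {
        const char     *name;      /* e.g. "Label" */
        EntityState     state;
        struct AstNode *ast;       /* parsed but unchecked declaration */
        /* resolved information (parent, attributes, methods, ...) filled on demand */
    } Entity;

    static void require_entity(Entity *e) {
        if (e->state == ENTITY_READY)
            return;                                /* already built */
        if (e->state == ENTITY_IN_PROGRESS)
            return;                                /* cyclic dependency: report an error */
        e->state = ENTITY_IN_PROGRESS;
        /* ... analyze e->ast: parent class, attributes, method signatures ... */
        e->state = ENTITY_READY;
    }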

This means that the compiler won't validate code until it is used. It can be annoying in some cases: for instance, when writing a library, you want to be sure that everything compiles. A command-line option can force the compilation of everything.

The design of the language can also help here. By not having everything in the global namespace, you can considerably reduce the number of declarations that need to be defined. For instance, GTK+ is a library with a C API, but in my language I can expose many of its functions as methods of their respective widget classes:

Original C API

    typedef struct _GtkLabel GtkLabel;

    GtkWidget*   gtk_label_new      (const gchar *str);
    void         gtk_label_set_text (GtkLabel *label, const gchar *str);
    const gchar* gtk_label_get_text (GtkLabel *label);
    ...

In Copper

    class GtkLabel: GtkMisc
        import func "gtk_label_new" new(String): Self

        import def "gtk_label_set_text" set_text(String)
        import def "gtk_label_get_text" get_text: String
        ...
    end

If GtkLabel is unused, none of its nested entities will even be declared.

With huge libraries where only a few elements are used, this can save a lot of time.

Code Generation

In addition to LLVM and C backends, I've developed my own x64 code generator.

LLVM is great: it is very easy to use, it generates highly optimized code and it supports many platforms. But it is slow: I've observed a factor-of-30 speed-up by developing my own x64 backend.

My LLVM backend has 2,000 lines of straightforward code, while my own x64 backend has more than 10,000 lines of complex code and does not even support floating-point numbers. It has been a huge effort to develop my own x64 generator, but it was worth it. Maybe someone should develop a non-SSA alternative to LLVM to make fast compilers easier to implement.

My code generator is mainly made of:

  1. A pseudo-code generator
  2. A duplicate function remover
  3. A register allocator
  4. A two-pass assembler

Pseudo-code Generator

An intermediate pseudo-code is generated; it makes some optimizations easier. It can be assembled directly to machine code; there is no need to translate it into assembly text and pass it to an assembler.

It is mostly like x64 assembly but with variables. After register allocation, variables are replaced by registers or by memory accesses. This pseudo-code is not generic like LLVM IR; it is tailored to x64: for instance, indexed memory addressing is used directly.
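To give an idea, a pseudo-instruction could be represented along these lines; this is only a sketch, not my actual data structures:

    /* An x64-flavoured pseudo-instruction: operands can be virtual registers
       (variables) before register allocation, and the memory operand mirrors
       x64's base + index*scale + displacement addressing. */
    typedef enum { OP_MOV, OP_ADD, OP_CMP, OP_JMP, OP_CALL /* ... */ } Opcode;
    typedef enum { OPERAND_VREG, OPERAND_PHYS_REG, OPERAND_IMMEDIATE, OPERAND_MEMORY } OperandKind;

    typedef struct {
        OperandKind kind;
        union {
            int     vreg;        /* virtual register, replaced after allocation */
            int     phys_reg;    /* rax = 0, rcx = 1, ... */
            long    immediate;
            struct {             /* indexed addressing: base + index*scale + disp */
                int  base_vreg;
                int  index_vreg; /* -1 if unused */
                int  scale;      /* 1, 2, 4 or 8 */
                long displacement;
            } memory;
        } u;
    } Operand;

    typedef struct {
        Opcode  op;
        Operand dst, src;
    } PseudoInstr;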

Duplicate Function Remover

This part is necessary because genericity in my language creates many functions that differ only in their data types but generate identical code. On one hand, this duplicate finder costs a lot in performance, as finding duplicates in a graph with potential recursion can be tricky; on the other hand, it eliminates a lot of functions to assemble.

Register Allocator

I use the linear scan algorithm for register allocation. The original paper describes it as intended for JIT compilers, but in my opinion it is also a perfect fit for fast compilers that want to generate reasonably fast code.
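Here is a simplified sketch of the algorithm, with a fixed register count, an unsorted active set and no interval splitting; real implementations usually keep the active list sorted by end point:

    #include <stdlib.h>

    #define NUM_REGS 4

    typedef struct {
        int start, end;   /* live range, in instruction numbers */
        int reg;          /* assigned register, or -1 if spilled */
    } Interval;

    static int by_start(const void *a, const void *b) {
        return ((const Interval *)a)->start - ((const Interval *)b)->start;
    }

    static void linear_scan(Interval *iv, int n) {
        Interval *active[NUM_REGS];
        int active_count = 0;
        int free_regs[NUM_REGS];
        int free_count = NUM_REGS;
        for (int r = 0; r < NUM_REGS; r++) free_regs[r] = r;

        qsort(iv, n, sizeof(Interval), by_start);

        for (int i = 0; i < n; i++) {
            /* Expire intervals that ended before this one starts. */
            for (int j = 0; j < active_count; ) {
                if (active[j]->end < iv[i].start) {
                    free_regs[free_count++] = active[j]->reg;
                    active[j] = active[--active_count];
                } else {
                    j++;
                }
            }
            if (free_count > 0) {
                iv[i].reg = free_regs[--free_count];
                active[active_count++] = &iv[i];
            } else {
                /* Spill the interval that ends last (possibly this one). */
                Interval *victim = &iv[i];
                for (int j = 0; j < active_count; j++)
                    if (active[j]->end > victim->end)
                        victim = active[j];
                if (victim != &iv[i]) {
                    iv[i].reg = victim->reg;   /* steal the register */
                    victim->reg = -1;          /* victim goes to memory */
                    for (int j = 0; j < active_count; j++)
                        if (active[j] == victim) { active[j] = &iv[i]; break; }
                } else {
                    iv[i].reg = -1;            /* current interval is spilled */
                }
            }
        }
    }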

Two-Pass Assembler

Assembling in two passes may seem slower at first, but in the first pass I don't write anything: I don't need resizable growing buffers, I just count bytes to compute offsets. In the second pass, I know the exact size, so I can pre-allocate a buffer of the right size and write without checking limits. The cost is that the write functions are virtual. Without a real benchmark I can't tell whether two passes are faster than one, but I found that the two-pass approach is not too bad.
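A sketch of the idea: the same emission routine runs twice against an emitter whose write functions are virtual (function pointers in C); the first pass only counts, the second writes into an exactly-sized buffer. The x64 bytes and names are just for illustration:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct Emitter Emitter;
    struct Emitter {
        void (*byte)(Emitter *, uint8_t);
        void (*bytes)(Emitter *, const void *, size_t);
        size_t   size;      /* pass 1: total size; pass 2: write position */
        uint8_t *output;    /* NULL during pass 1 */
    };

    static void count_byte(Emitter *e, uint8_t b)                { (void)b; e->size += 1; }
    static void count_bytes(Emitter *e, const void *p, size_t n) { (void)p; e->size += n; }
    static void write_byte(Emitter *e, uint8_t b)                { e->output[e->size++] = b; }
    static void write_bytes(Emitter *e, const void *p, size_t n) { memcpy(e->output + e->size, p, n); e->size += n; }

    /* The code generator only calls e->byte / e->bytes and never checks limits. */
    static void emit_function(Emitter *e) {
        e->byte(e, 0x55);                    /* push rbp     */
        e->bytes(e, "\x48\x89\xe5", 3);      /* mov rbp, rsp */
        e->byte(e, 0xc3);                    /* ret          */
    }

    static uint8_t *assemble(size_t *out_size) {
        Emitter e = { count_byte, count_bytes, 0, NULL };
        emit_function(&e);                   /* pass 1: count bytes, compute offsets */

        size_t total = e.size;
        e.byte   = write_byte;
        e.bytes  = write_bytes;
        e.size   = 0;
        e.output = malloc(total);            /* exact size, no resizing, no limit checks */
        emit_function(&e);                   /* pass 2: write the machine code */

        *out_size = total;
        return e.output;
    }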

Future Work

Compiling my projects in 100ms is good enough, so I don't need any more optimizations for now. Anyway, all the tricks described above are just basic techniques and common sense; there is still a lot of room for improvement and experiments.

Conclusion

I hope this will be helpful if you plan to write a compiler or want to make an existing one faster.