This document describes Seq’s compiler internals (excluding parsing). This includes everything in the top-level
After a Seq program is parsed, the result is an AST that is built of several types of objects:
Stmt: Top-level statements like
for, variable assignments, etc.
Expr: Expressions like
2 + 2,
Var: Variables defined either by assignment, or implicitly by e.g. a
for x in ...)
Func: Function (or generator) objects defined by the user
Type: Every expression and variable has an associated type, be it
Pattern: Patterns define how to match a value in a
matchstatement (and are used nowhere else)
Block: A list of statements
There are a couple other less frequently used objects in addition to these:
BaseFuncLite: A lightweight wrapper around an LLVM function (basically a lambda that describes how to codegen the function in LLVM IR); useful for easily writing built-in methods or functions in LLVM IR.
SeqModule: This represents top-level code, which gets wrapped in a
mainfunction at the IR level. A
SeqModuleinstance is the final result of parsing a Seq program; it contains everything within it.
SeqModule all inherit from the
BaseFunc class, which defines a relatively barebones function interface.
Each of these classes is described in more detail below, along with any unintuitive corner cases or gotchas. Afterwards, generics are described, and finally type resolution and code generation.
Stmt base class is defined in
compiler/lang/stmt.cpp, but the actual subclasses representing real statements are defined in
Statement objects typically hold expression objects internally. For example,
print 42 would be converted to a
42. Note also that expressions can be statements on their own (e.g. function calls that do not return anything), for which
ExprStmt is used.
Statement objects can also contain other statement objects. For example, a
While object (representing a
while-statement) contains an expression representing the condition, but also a list of statements that appear in the body of the loop. Any time a statement contains one or more internal “scopes” like this, a
Block object is used to store the internal statements. For example,
While stores just one
Block internally, but
If can store arbitrarily many, one for each branch.
Block is nothing more than a wrapper around a vector of statements.
Statement objects also have a
loop flag indicating whether they represent loops. If so, during code generation these statements must expose two LLVM blocks: one to jump to when a
break is encountered and one to jump to when a
continue is encountered. This allows
continue statements to be codegen’d easily by finding their innermost enclosing loops and branching to the appropriate block.
All expressions inherit from
Expr, and are defined in
compiler/lang/expr.cpp. Expressions are very similar to statements in their implementation, with the main difference being that every expression has an associated type, accessible via the
getType() method. While
getType() is usually simple (e.g. the type of
42 is clearly
int – no extra work required), occasionally
getType() needs to do nontrivial work to determine the type. An example of this is calling a generic function/method; the type of the call could depend on the input type of the function, so
getType() would have to clone the function, realize its generic type parameter with whatever the call type is, then deduce the return type.
A few other notes on expressions:
Every expression has a
tcfield pointing to a
TryCatchobject (subclass of
Stmt), representing the innermost enclosing
tryblock of the expression (or null if there isn’t any). This is purely to handle function calls within a
tryat the LLVM level. Basically, if you call a function in a
tryblock in LLVM IR, and the function can potentially throw an exception, LLVM IR provides an instruction (
invoke) for doing this, which takes as arguments the block to jump to if no exception is thrown and a block to jump to if an exception is in fact thrown. The latter block is given to us by the mentioned
Some fields of some expression objects can be modified as a result of calling
getType()(although not in a way that should affect the user). This has to do with automatic type parameter deduction: if you call a generic function or instantiate a generic class without providing explicit type parameters,
getType()must first deduce the type parameters. It does this by, for each type parameter
Tin each argument type of the function via a DFS, then extracting the corresponding type from the types the function is being called with. So, calling a function
f(array[int](10))would perform a DFS in
T, then extract the corresponding type in
array[int](or failing if this can’t be done), which is
intin this case, meaning we deduce
T = int. A new
FuncExprobject with explicitly realized types is then created to replace the original unrealized one (note that a pointer to the original is also kept for cloning purposes, described below).
Currently, the only times an expression contains a statement internally are comprehensions and generators (i.e.
GenExpr). These all contain a single
Forobject representing the “unrolled” generator loop. For example,
[a for a in b if a < 10]would be converted to a
for a in b: if a < 10: pass. This is nothing more than a convenient interface for creating these kinds of expression objects.
The types of most expressions (including all unary/binary operations and index/slice expressions) are determined through the type the operation is being applied to. For example, when determining the type of
a + b, the type of
ais searched for an
__add__method applicable to the type of
b, and the output type of this method is returned. This is discussed in greater detail in the section on types.
All variables in a program (implicit or explicit, local or global) are represented by a
Var object, defined in
compiler/lang/var.cpp. This includes standard assigned variables (
a = 10),
for-loop variables (
for a in ...) and function arguments (
def f(a)). This does not include functions themselves, however; these are represented with
Func objects (described below). Variables are conceptually just a wrapper around an LLVM pointer. If a variable is marked as global, this pointer will represent a global LLVM value, otherwise it will be an LLVM
alloca in the enclosing function. Variables also have types, which are usually set in the type resolution pass before code generation (described below); these types dictate how the variables will be translated to LLVM IR.
There is one peculiar case that requires the ability to “map” a variable
v to another
v', meaning any operation applied to
v will be delegated to
v'. This case is code generation of generators. Generators are converted to functions internally: e.g.
(a*2 for a in v) is transformed to
def _implicit_gen(v): for a in v: yield a*2, which is then called with
v as an argument. The way this is done internally is by creating a
Func object representing the implicit function, assigning the generator’s internal
For loop as the body of this function, and codegen’ing the function. This would work, if it weren’t for the fact that variables appearing in the
For are referencing variables outside the implicit function! In the earlier example, for instance,
v would have been assigned before the generator expression, but when we codegen the implicit function, we want
v to refer to the function’s argument. This problem is solved by “mapping” the original
v variable to the implicit function’s argument, codegen’ing the function, then unmapping
v. (Note that, because nested generators are possible, we actually maintain a stack of mappings rather than a single one, but the idea is the same.)
Functions (i.e. those defined with
def in a program) are represented by
Func objects, defined in
compiler/lang/func.cpp. For the most part,
Func is a wrapper around a
Block representing the function’s body, and a list of
Var objects representing its arguments. Function return types can be deduced automatically. This is achieved by passing a pointer to the corresponding
Yield object to the function anytime a
yield statement is parsed within it; then the return type (and whether the function is really a generator) is deduced by looking at the type of the expression being returned/yielded. Functions can also be marked as “external” (e.g.
extern c foo()), in which case no body is associated with the function.
Regular functions store a “preamble” LLVM block (which is their first block in the IR); this is where enclosed local variables (including function arguments) codegen their
alloca. Generators additionally maintain several other blocks for general generator bookkeeping (these are all described in the LLVM coroutine docs).
Lastly, function names are mangled in the LLVM IR, and include a combination of the base function name, argument type names, output type name, enclosing function name and enclosing class name (if the function is a method of some class). This is so a generic function doesn’t produce duplicate names in the IR if called on different types.
Of all the classes, types are probably the most complicated. All types inherit from the
Type class, and are located in
compiler/types/. Fundamentally, each
Type is a wrapper around an LLVM type, with some extra data dictating how to do various operations on instances of that type. Some types are “abstract”, meaning there can’t be any variables of that type (e.g.
Void). Types can contain other subtypes internally: for example an array type stores the base type of the array, and a generic class stores its generic type parameters.
All types have the following fields:
vtable.fields: A map of string (field name) to a pair of integer (0-based field offset in structure) and
Type(type of field).
vtable.methods: A map of string (method name) to
BaseFunc(method of type).
vtable.magic: A vector of
MagicMethodobjects – these are built-in magic methods. Each
MagicMethodobject stores a name (e.g.
__add__), a vector of argument types (excluding ‘self’), an output type, and a lambda that codegen’s a call to the method, given ‘self’ and a vector of arguments. Additionally,
asFunc()method, which returns a
BaseFuncobject representing the magic method. This allows for cases where a magic method is not called, but is referenced (e.g.
vtable.overloads: A vector of
MagicOverloadobjects – these are the magic methods defined in a Seq program. Each
MagicOverloadobject stores a name (e.g.
__add__) and a
BaseFuncobject representing the method. Note that magic methods defined by the user must not be generic, as they can be overloaded (e.g. you can have two
__add__methods that take different argument types).
Many operations, as codegen’d by various expression objects, are defined by magic methods. Returning to an earlier example: to codegen
a + b, the type of
a is searched for an
__add__ method applicable to the type of
b, be it a built-in one (
MagicMethod) or a user-defined one (
MagicOverload). This method defines both how to codegen the addition and what the output type is.
Importantly, note that a “method” is really a combination of the function defining the method and the object the method is called on (i.e. ‘self’). For example, after
f = a.foo where
foo is a method, we can call
f(); this works because
f is not just a function, but a struct of
a can be implicitly passed as the first argument of
foo. If we obtained
foo statically as in
f = A.foo (where
A is the type of
a), then there is no ‘self’ involved and
f would be just a function.
Thankfully, fields are much simpler than methods: to obtain the value of a field, we just search for its name in
vtable.fields and produce the appropriate LLVM instruction with the resulting offset.
A few other notes regarding types:
Every type has
numBaseType()methods for retrieving inner types (e.g.
array[int]). This is used in the DFS during type deduction mentioned above, and also for generating string names of types.
Types can be marked as “atomic”, which means that the type, when converted to an LLVM type, will contain no pointers to heap-allocated data. This saves our GC from having to scan objects of the type when searching for pointers.
Every type has a
getID()method, which returns a unique integer identifier for the type. This is used for exception handling to determine if an exception type matches any
Patterns are relatively unimportant in the grand scheme of things, as they only apply to
match statements. Nevertheless, all pattern objects inherit from
Pattern, and are defined in
compiler/lang/patterns.cpp. Patterns define how to “match” objects of a given type, and can be composed to form new patterns (e.g. as in
OrPattern). Some patterns are “catch-all”, meaning they will match anything; this is important as we want the cases of a
match to be exhaustive.
Functions and classes can be generic. General “generics” functionality, including type parameter realization and deduction, are factored out into a
Generic class, defined in
compiler/types/generic.cpp. This class is effectively a wrapper around a vector of generic type parameters, which are implemented as special type objects that delegate all operations to another underlying type.
The core operation when realizing a generic object is “cloning”, which is essentially deep copying. For example, to realize a generic function, the entire function is cloned, and the cloned type parameters are realized. Here is a concrete example:
def f[T](x: T): return x + 1
Now if we call
f[int](42), the entire function will be cloned to create a new function
def f2[T2](x2: T2): return x2 + 1
T2 will be realized as
int. To support this, all objects used by the compiler (including all those described here) are cloneable via a
clone method. This method takes as an argument the
Generic object that initiated the clone, which itself does some bookkeeping to ensure we don’t clone things twice (e.g. cloning two calls to a single function should produce only one cloned function at the end).
Cloning classes can be complicated if the class definition contains realizations of itself (e.g. using
A[int] in a generic
A class). To solve this, such realizations are deferred, meaning
A[int] (if used within
A) will not cause an actual realization of
A, but will instead create a generic type object that defers this realization until it is actually needed. Function and method realizations are always deferred similarly.
Code generation is done in two passes:
Type resolution: Traverse the AST and set types of variables, and deduce function return types. Note that type resolution on functions can potentially fail if generic types are involved (e.g. how can we resolve the type of
for a in xif
xis generic?); in this case, we fail silently in the hope of being able to resolve types fully when the function is realized.
Code generation: At this stage, all types should be fixed, so we re-traverse the AST and do the actual codegen.
Adding new statements and expressions¶
Looking at how some existing statements/expressions are implemented is probably the easiest way to understand what goes into them. Nevertheless, statements require the following:
resolveTypes(): This method should perform any required type resolution and recursively call
resolveTypes()on any internal
codegen0(): This method takes an LLVM basic block reference as an argument, and performs code generation for the statement. By the time it returns, the argument basic block reference should point to a basic block where subsequent code generation can resume.
clone(): This method should perform a deep copy of the given statement, and recursively call
clone()on any internal Seq objects contained in the statement. The method takes a
Genericobject as an argument, which keeps track of which objects have already been cloned; implementations should generally check if a clone of the given statement already exists before actually cloning (check source for examples). Returns in this function should be done with the
SEQ_RETURN_CLONEmacro, which internally sets some other clone fields for source information etc.
Expressions are very similar:
resolveTypes(): Same as above
codegen0(): Same as above, except this method returns an LLVM value representing the result of the expression (can possibly be null if type is void).
getType0(): Returns the expression’s type. You can assume this gets called after
resolveTypes(). Note that if the expression type is fixed, it can alternatively be passed to the
Exprconstructor directly and this method need not be overriden.
clone(): Same as above
Of course, the lexer/parser must also be updated to create any new statement/expression objects:
compiler/util/ocaml.cpp provides C stubs that get called from the OCaml lexer/parser to interface with the C++ objects; new statements/expressions should generally provide wrappers there as well.
If an error (be it a type error or something else) occurs in any of these methods, a
SeqException with an appropriate error message can be thrown.
Exception handling in Seq is implemented using Itanium ABI zero-cost exception handling. The implementation is fairly similar to how Clang compiles
catch constructs in C++. However, Seq also supports explicit
finally blocks, which greatly complicates the exception handling logic.
The first complicated case occurs when
finally statements are nested. Consider the following example:
def foo(): try: try: try: try: bar() # raises C except A: print 'A' finally: print 'f A' except B: print 'B' finally: print 'f B' except C: print 'C' finally: print 'f C' except D: print 'D' finally: print 'f D'
The output is:
f A f B C f C f D
The call to
bar() raises an exception of type
C, which is handled by the
except C block. However, all prior
finally blocks must be executed, so the exception really must be caught by the innermost
except, whereupon some logic needs to be applied to determine that the next two
finally blocks need to be executed, then the
except C handler, then the last two
To handle this case, the compiler uses the concept of a handler’s “depth”: when an exception is caught, the depth of its handler is determined based on the exception type (e.g. when
C is caught by
except A, its handler depth is 2; if the exception had type
A then the depth would be 0). Then, a total of that many
finally blocks are executed, after which the appropriate handler is executed, which in turn branches to latter
finally blocks. Internally, this is coordinated by storing a depth state for each nested
finally statement, which is set before reaching the
Another nontrivial case is as follows:
def foo(): try: return 1 finally: return 2
Surprisingly to some, this function should actually return
1. This means that
return statements in
finally cannot just directly return, but must instead first execute the appropriate
finally block(s). An analogous situation arises with
continue, which similarly cannot terminate the current loop iteration without executing the necessary
finally blocks. (This is not the case with
yield however, since it does not transfer control out of the block in the same way.) To handle this, nested
finally statements share a “state” that can be one of the following:
NOT_THROWN: Nothing unusual has happened; execute as normal
THROWN: An exception has been thrown and not caught; execute
finallythen resume exception
CAUGHT: An exception has been caught here; execute handler then
returnstatement was executed; store return value for later, execute all necessary
finallyblocks, and only then return
breakstatement was executed; execute all necessary
finallyblocks within the enclosing loop, and only then break
continuestatement was executed; execute all necessary
finallyblocks within the enclosing, and only then continue
finally blocks use this state to determine what block to branch to next. Note that “depth” takes precedence here: if the depth is positive then the next
finally block must be executed before considering the state.