The power of LLVM. Image: Rainer Knäpper
Interested in adding a whole new dimension to existing everyday C code?
Or what it's like to be among the first of projects to be trying such
things with new tools and technologies, and having to fix a linker bug
to do it? Read on.
Introduction to LLVM
LLVM is a full compiler infrastructure made to not
only compile various languages to various kinds of machine code, but also work with
code in other ways using re-usable components. Examples: an editor
that understands your code as you type (an IDE); a library that
lets you perform source-level transformations easily that are very
messy to do reliably at the text level; Clang
is a C/C++ compiler built on LLVM that can generate machine code
for many architectures, including i386 and ARM;
Klee, which simulates many executions
of programs with unspecified inputs; and so on.
More complete intro in PDF format here.
Introduction to LLVM Bitcode
At the core of how LLVM works is its Intermediate Representation,
or IR, or bitcode. It represents the logic of the program after
translation from the input language. So it describes the intended
action of the program independent of a particular target architecture.
Such bitcode files can be linked together to form the bitcode
representation of the final executable program.
Powerful applications of LLVM Bitcode
This allows some powerful reasoning over complete programs at a higher
abstraction level than machine code. Example:
- Klee makes use of this. It is
able to symbolically execute a program. It simulates
running a program without knowing what inputs it will receive
(I/O results, program arguments, etc.). Such unknowns are treated
symbolically and Klee tries to find execution paths and input
values that cause unwanted states (e.g. bad pointer dereference
or an assertion failing). A nice demonstration of KLEE's reasoning powers
is making it
solve a maze.
- For a demonstration combining another powerful form of abstraction with
LLVM bitcode: based on his work with
RUMP kernels,
Antti Kantee was able to use
emscripten to translate his kernels to Javascript.
The result is the NetBSD kernel booting in javascript.
Not only has analyzing programs become possible; also transforming them
before the machine code is generated. This has powerful applications.
Examples:
- All kinds of code instrumentation.
- Link-time optimisation. Optimisation is generally done at compile
time, so on a per-file basis; but the compiler is actually quite limited
in what it can do and what information it has available at this scope.
Examples: leaving out dead code; inlining functions that are defined
in a different file; having a better estimate of how many times
functions are called. These are all things that are impossible at the
file level but possible at link time. Needless to say, the possibilities to improve
on time- and space-optimisation this way are near-endless.
- Introspection. This makes available information about functions and
variables of the current (or another!) C program in the form of regular
C datastructures, allowing operations at runtime that are otherwise
impossible.
Minix and Liveupdate
Minix is close to supporting being built
with bitcode completely. The code is ready on a working branch,
just not merged with mainline yet. This allows for many powerful
features such as LTO, introspection, running KLEE etc. to be applied
to its codebase - userland and OS components alike.
The driving force behind building Minix with bitcode is the Liveupdate
project spearheaded by Cristiano Giuffrida.
His Ph.D thesis topic is the implementation of the updating of OS components
to newer versions while they are in use. It uses information obtained from
LLVM bitcode to make this possible. More details are in his Liveupdate papers here on
his homepage.
Minix is unique
Minix is unique in more ways than one. However, having this available
as a maintained feature in the codebase means Minix is in a position
to offer a powerful extra level of code analysis and instrumentation;
for applications known (such as LTO) and to be invented. It is the
first and so far only operating system that supports full bitcode
builds in the base system!
The future holds more promise - LLVM-based instrumentation supports
more projects in Minix that Cristiano is working on - memory
checkpointing for crash recovery, address space layout randomization,
and fault injection.
Gold bugs
To link bitcode files and perform transformations on them at link
time, the Binutils Gold linker
is required. Unfortunately we at Minix ran into a few problems and bugs when linking
everything with Gold.
The most recent one I recently fixed and submitted a fix upstream
for. It was
quite a trip down the rabbit hole to find that one from just the
symptoms of a crashing runtime linker! I really like being able to
contribute such a fix to a project upstream that we make heavy use of -
it's a way to give back and raise the profile of Minix a little.
Also the fact that we ran into these problems suggests that Minix is
unusual, perhaps unique, in supporting bitcode builds in its own codebase.
And therefore is a uniquely enticing platform to do OS-level LLVM transform
experimentation on.
Now what?
More LLVM-bsed projects here.
Acknowledgements
My thanks to Cristiano Giuffrida who
kindly reviewed a draft of this post and substantially improved upon it.
Any errors are still mine of course.