Quick overview of how Clang works internally

Clang has proven itself to be a mature compiler for C and C++, on par with GCC and the Microsoft compilers. What makes it special is that it’s not just a compiler: it’s also an infrastructure for building tools. Its library-based architecture makes it easy to reuse its components and to integrate them into other projects.

Clang Design:

Like many other compilers, Clang follows a three-phase design:

  • The front end, which parses the source code, checks it for errors, and builds a language-specific Abstract Syntax Tree (AST) to represent the input code.
  • The optimizer, which applies transformations to improve the code. In Clang it operates on the LLVM IR generated from the AST rather than on the AST itself.
  • The back end, which generates the final code to be executed by the machine; it depends on the target.

What is the difference between Clang and other compilers?

The most important difference in its design is that Clang is based on LLVM. The idea behind LLVM is to use the LLVM Intermediate Representation (IR), which plays a role similar to bytecode in Java.
LLVM IR is designed to host the mid-level analyses and transformations that you find in the optimizer section of a compiler. It was designed with many specific goals in mind, including supporting lightweight runtime optimizations, cross-function/interprocedural optimizations, whole program analysis, and aggressive restructuring transformations. The most important aspect of it, though, is that it is itself defined as a first-class language with well-defined semantics.

With this design, a large part of the compiler can be reused to create other compilers. For example, you can replace only the front end to support another language.
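To make the reuse argument concrete, here is a minimal C++ sketch. It is not LLVM code: the `Instr` representation and the functions `parsePostfix`, `parseInfixSum` and `evaluate` are all invented for illustration. Two toy front ends accept different input syntaxes but lower them to the same tiny intermediate form, so a single back end can serve both.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical shared intermediate representation: a tiny stack machine.
struct Instr {
    enum Kind { Push, Add, Mul } kind;
    int value;  // only meaningful for Push
};
using IR = std::vector<Instr>;

// Front end #1: postfix input such as "2 3 + 4 *".
IR parsePostfix(const std::string &src) {
    IR ir;
    std::istringstream in(src);
    std::string tok;
    while (in >> tok) {
        if (tok == "+") ir.push_back({Instr::Add, 0});
        else if (tok == "*") ir.push_back({Instr::Mul, 0});
        else ir.push_back({Instr::Push, std::stoi(tok)});
    }
    return ir;
}

// Front end #2: infix sums such as "1 + 2 + 3" (only '+' is supported).
IR parseInfixSum(const std::string &src) {
    IR ir;
    std::istringstream in(src);
    std::string tok;
    bool first = true;
    while (in >> tok) {
        if (tok == "+") continue;  // the operator is emitted after its right operand
        ir.push_back({Instr::Push, std::stoi(tok)});
        if (!first) ir.push_back({Instr::Add, 0});
        first = false;
    }
    return ir;
}

// The single back end: it only knows about the IR, not about either syntax.
int evaluate(const IR &ir) {
    std::vector<int> stack;
    for (const Instr &i : ir) {
        if (i.kind == Instr::Push) { stack.push_back(i.value); continue; }
        assert(stack.size() >= 2 && "malformed IR");
        int rhs = stack.back(); stack.pop_back();
        int lhs = stack.back(); stack.pop_back();
        stack.push_back(i.kind == Instr::Add ? lhs + rhs : lhs * rhs);
    }
    assert(stack.size() == 1 && "malformed IR");
    return stack.back();
}
```

Adding a third input syntax would only require a new function that returns an `IR`; `evaluate` stays untouched. This is the same property, at a much smaller scale, that lets different language front ends share the LLVM optimizer and back ends.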

I- Front End:

Clang is designed to be modular, and each compilation phase is handled by a specific module. Here are some of the projects involved in the front end phase:

Like any front end, it needs a lexer, a parser, and semantic analysis. The Clang front end can be invoked directly by passing the -cc1 argument. It offers several features, such as dumping the AST:

clang -cc1 -ast-dump test.c

This command line is handled by the cc1_main function. Here is the sequence of some of the interesting methods it executes:

[Figure: sequence of methods called from cc1_main]

The ExecuteAction method takes a parameter of type FrontendAction, which specifies the front end action to execute. FrontendAction is an abstract class, so we need to inherit from it to implement a concrete front end action.

Let’s discover all the front end actions implemented by Clang using CQLinq. To do that, we can search for all classes inheriting directly or indirectly from FrontendAction.

from t in Types
let depth0 = t.DepthOfDeriveFrom("clang.FrontendAction")
where depth0 >= 0 orderby depth0
select new { t, depth0 }

Many front end actions are available. For example, ASTDumpAction dumps the AST without creating the final executable. Almost all the front end actions inherit from ASTFrontendAction, which means that they work with the generated AST.

What’s interesting about this design is that we can easily plug in our own FrontendAction: we just have to implement a new one.

How can we process the AST?

Each ASTFrontendAction creates one or more ASTConsumer objects. ASTConsumer is an abstract class, and we have to implement our own AST consumer for our specific needs.
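The relationship between an action and its consumer can be sketched in plain C++ as follows. This is a simplified model, not the Clang API: everything lives in a hypothetical `toy` namespace, and `TranslationUnit` is a stand-in for the real AST. In Clang itself the corresponding hooks are, as far as I know, ASTFrontendAction::CreateASTConsumer and ASTConsumer::HandleTranslationUnit, which take more parameters than shown here.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

namespace toy {  // hypothetical names; this is not the Clang API

// Stand-in for a parsed translation unit: just the declared function names.
struct TranslationUnit {
    std::vector<std::string> functionNames;
};

// Plays the role of an AST consumer: it receives the finished AST.
class ASTConsumer {
public:
    virtual ~ASTConsumer() = default;
    virtual void handleTranslationUnit(const TranslationUnit &tu) = 0;
};

// Plays the role of an AST front end action: a factory for consumers.
class FrontendAction {
public:
    virtual ~FrontendAction() = default;
    virtual std::unique_ptr<ASTConsumer> createASTConsumer() = 0;
};

// The "driver": asks the action for a consumer, then hands it the AST.
void executeAction(FrontendAction &action, const TranslationUnit &tu) {
    std::unique_ptr<ASTConsumer> consumer = action.createASTConsumer();
    assert(consumer && "an action must provide a consumer");
    consumer->handleTranslationUnit(tu);
}

// A custom consumer: counts functions and stores the result in a caller-owned int.
class CountingConsumer : public ASTConsumer {
public:
    explicit CountingConsumer(int &out) : out_(out) {}
    void handleTranslationUnit(const TranslationUnit &tu) override {
        out_ = static_cast<int>(tu.functionNames.size());
    }
private:
    int &out_;
};

// A custom action: its only job is to create the consumer above.
class CountFunctionsAction : public FrontendAction {
public:
    std::unique_ptr<ASTConsumer> createASTConsumer() override {
        return std::unique_ptr<ASTConsumer>(new CountingConsumer(count));
    }
    int count = 0;
};

}  // namespace toy
```

The driver never needs to know which concrete action it runs. That separation is what makes it cheap to add a new tool: write one consumer with the analysis, and one action that creates it.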

The FrontendAction will invoke the AST consumer as shown in the following graph.

Let’s search for all ASTConsumer classes using CQLinq:

from t in Types
let depth0 = t.DepthOfDeriveFrom("clang.ASTConsumer")
where depth0 == 1
select new { t, depth0 }

CodeGenerator is an example of an ASTConsumer

As mentioned before, the power of LLVM lies in working with IR, and to generate it we need to traverse the AST. CodeGenerator is the class inheriting from ASTConsumer that is responsible for generating the IR. What’s interesting is that this work is isolated in a separate project named ClangCodeGen.
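To give an idea of what "generating IR from the AST" means, here is a minimal C++ sketch of a tree walk that emits one instruction per node. It is not how ClangCodeGen is implemented: the `Expr` type and the `num`, `bin` and `lower` helpers are hypothetical, and the output text is only loosely modeled on LLVM IR syntax.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression AST: a number, or a binary '+' / '*' node.
struct Expr {
    char op;                         // 'n' for a number, '+' or '*' otherwise
    int value;                       // used when op == 'n'
    std::unique_ptr<Expr> lhs, rhs;  // used for binary nodes
};

std::unique_ptr<Expr> num(int v) {
    return std::unique_ptr<Expr>(new Expr{'n', v, nullptr, nullptr});
}

std::unique_ptr<Expr> bin(char op, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    return std::unique_ptr<Expr>(new Expr{op, 0, std::move(l), std::move(r)});
}

// Walks the tree bottom-up and appends one instruction per binary node.
// Returns the operand that names this subtree's value: a literal or a %N temporary.
std::string lower(const Expr &e, std::vector<std::string> &out, int &nextTemp) {
    if (e.op == 'n') return std::to_string(e.value);
    assert(e.lhs && e.rhs && "binary node needs two children");
    std::string l = lower(*e.lhs, out, nextTemp);
    std::string r = lower(*e.rhs, out, nextTemp);
    std::string name = "%" + std::to_string(nextTemp++);
    const char *opcode = (e.op == '+') ? "add" : "mul";
    out.push_back(name + " = " + opcode + " i32 " + l + ", " + r);
    return name;
}
```

For the tree built by `bin('*', bin('+', num(2), num(3)), num(4))` and a counter starting at 1, `lower` appends `%1 = add i32 2, 3` and then `%2 = mul i32 %1, 4`, and returns `%2`. The nested structure of the AST becomes a flat list of instructions, which is the form the optimizer works on.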

Here are some of the classes involved in the LLVM IR generation:

II- Optimizer

To explain this phase, I can’t put it better than Chris Lattner, the father of LLVM, in this post:

“To give some intuition for how optimizations work, it is useful to walk through some examples. There are lots of different kinds of compiler optimizations, so it is hard to provide a recipe for how to solve an arbitrary problem. That said, most optimizations follow a simple three-part structure:

  • Look for a pattern to be transformed.
  • Verify that the transformation is safe/correct for the matched instance.
  • Do the transformation, updating the code.

The optimizer reads LLVM IR in, chews on it a bit, then emits LLVM IR, which hopefully will execute faster. In LLVM (as in many other compilers) the optimizer is organized as a pipeline of distinct optimization passes each of which is run on the input and has a chance to do something. Common examples of passes are the inliner (which substitutes the body of a function into call sites), expression reassociation, loop invariant code motion, etc. Depending on the optimization level, different passes are run: for example at -O0 (no optimization) the Clang compiler runs no passes, at -O3 it runs a series of 67 passes in its optimizer (as of LLVM 2.8).”
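The three-part structure and the idea of a pass pipeline can be sketched in a few lines of C++. This is a toy model, not LLVM code: the `Inst` type, the two passes and `runPipeline` are hypothetical names, and real LLVM passes are classes that work on real IR objects. The first pass implements the classic "X - X is 0" simplification and skips floating-point operations, since NaN - NaN is not 0.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>
#include <vector>

// Hypothetical three-address instruction: dest = op lhs, rhs.
struct Inst {
    std::string dest, op, lhs, rhs;
    bool isFloat;  // floating-point operations need extra care (NaN)
};
using Function = std::vector<Inst>;
using Pass = std::function<bool(Function &)>;  // returns true if it changed anything

// Pass 1: rewrite "sub X, X" into the constant 0.
bool simplifySubSelf(Function &f) {
    bool changed = false;
    for (Inst &i : f) {
        if (i.op != "sub" || i.lhs != i.rhs) continue;  // 1. look for the pattern
        if (i.isFloat) continue;                        // 2. verify that it is safe
        i = {i.dest, "const", "0", "", false};          // 3. do the transformation
        changed = true;
    }
    return changed;
}

bool isIntLiteral(const std::string &s) {
    if (s.empty()) return false;
    for (char c : s)
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
    return true;
}

// Pass 2: fold an integer "add" of two literals into a constant.
bool foldConstantAdd(Function &f) {
    bool changed = false;
    for (Inst &i : f) {
        if (i.op != "add" || i.isFloat) continue;
        if (!isIntLiteral(i.lhs) || !isIntLiteral(i.rhs)) continue;
        int sum = std::stoi(i.lhs) + std::stoi(i.rhs);
        i = {i.dest, "const", std::to_string(sum), "", false};
        changed = true;
    }
    return changed;
}

// The pipeline: each pass runs once, in order, over the same function.
// Returns how many passes reported a change.
int runPipeline(Function &f, const std::vector<Pass> &passes) {
    int passesThatChanged = 0;
    for (const Pass &p : passes) {
        assert(p && "pipeline entries must be callable");
        if (p(f)) ++passesThatChanged;
    }
    return passesThatChanged;
}
```

Because every pass has the same signature and only communicates through the function it rewrites, choosing an optimization level amounts to choosing which list of passes to hand to `runPipeline`. An empty list corresponds to the -O0 case in the quote above.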

Let’s discover the LLVMCore passes. To do that, we can search for classes inheriting from the Pass class:

from t in Types
let depth0 = t.DepthOfDeriveFrom("llvm.Pass")
where t.ParentProject.Name == "LLVMCore" && depth0 >= 0 orderby depth0
select new { t, depth0 }

Of course many other passes exist in the other LLVM modules.

III- Back End

Like the other phases, the back end, which is responsible for generating the output for a specific target, is very modular in the case of Clang. Let’s take as an example LLVMX86Target, the module that generates code for the x86 target.

Here’s the graph showing all the modules involved in generating the binaries for the x86 target.

Many modules are involved in this phase, and each one has its own specific responsibility, which enforces cohesion and encourages clean APIs and separation. This makes the code easier for developers to understand, since they only have to understand small pieces of the big picture.

Conclusion

The LLVM/Clang duo is not just a C/C++ compiler; it’s also an infrastructure for building tools, and its behavior is easy to extend. Many tools are included out of the box in the LLVM/Clang source code, and many others can be found on the web.

If you need a C/C++ parser to build a tool, Clang is a very good candidate.