(-: Geekerie, logiciel libre et vilain gauchisme

Playing with libclang

clang is a great compiler that I use daily at work and in my personal projects. It works well, and its error messages are much clearer than those of other compilers.

clang is also usable as a C++ parser library. This is of particular interest for two of my projects: vera++ and ITK wrappers.

vera++ is a C++ style checker that lets developers write their own checking rules in Tcl, Lua or python. The rules have access to the tokens of the parsed C/C++ files, but nothing more. There is no easy way to know that a token is part of a function parameter, a type declaration, etc. clang should be able to provide all this extra information, which would be very useful for checking the coding style.

ITK is a large C++ image processing library. The code makes heavy use of C++ templates. A combination of gccxml, swig and an ITK-specific code generator is used to give access to the library from python and java, plus doxygen and another code generator to integrate the documentation into the generated code. Unfortunately, gccxml is not very well maintained and is always painful to fix when a new compiler comes out. Indeed, it is currently broken on my Mac.

clang should be able to provide the same kind of information. Even better, it should be possible to remove the intermediate XML step by using the clang library directly in the code generator. clang should also be able to give some information that gccxml can’t: the typedefs and the data fields of a class.

AST Dump

There are several ways to look at the AST generated by clang for a C++ file. The simplest is to pass the flags -Xclang -ast-dump -fsyntax-only.
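For repeated runs, the same flags can be wrapped in a few lines of python. This is only a sketch: the function names are mine, and it assumes a clang binary is reachable on the PATH.

```python
import shutil
import subprocess


def ast_dump_command(source_path, compiler="clang"):
    """Build the command line that dumps the AST without generating code."""
    return [compiler, "-Xclang", "-ast-dump", "-fsyntax-only", source_path]


def dump_ast(source_path, compiler="clang"):
    """Run the compiler and return the textual AST dump.

    Returns None when the compiler cannot be found on the PATH.
    """
    if shutil.which(compiler) is None:
        return None
    result = subprocess.run(
        ast_dump_command(source_path, compiler),
        capture_output=True,
        text=True,
        check=False,  # a file with errors can still produce a partial dump
    )
    # Only stdout is returned; whatever the compiler wrote to stderr
    # stays available in result.stderr if needed.
    return result.stdout
```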

I’ve used this simple test.cpp:

template<typename TData>
class Foo
{
public:
  typedef Foo Self;
  typedef TData Data;
  Foo();
  // just a comment
  Data getData(int i, char const* s);
private:
  /// my precious data
  Data m_data;
};

typedef Foo<int> IntFoo;

Here is the result:

TranslationUnitDecl 0x1028242d0 <<invalid sloc>>
|-TypedefDecl 0x102824810 <<invalid sloc>> __int128_t '__int128'
|-TypedefDecl 0x102824870 <<invalid sloc>> __uint128_t 'unsigned __int128'
|-TypedefDecl 0x102824c30 <<invalid sloc>> __builtin_va_list '__va_list_tag [1]'
|-ClassTemplateDecl 0x102824dc0 </tmp/test.cpp:1:1, line:13:1> Foo
| |-TemplateTypeParmDecl 0x102824c80 <line:1:10, col:19> typename TData
| |-CXXRecordDecl 0x102824d30 <line:2:1, line:13:1> class Foo definition
| | |-CXXRecordDecl 0x102870c70 <line:2:1, col:7> class Foo
| | |-AccessSpecDecl 0x102870d00 <line:4:1, col:7> public
| | |-TypedefDecl 0x102870d40 <line:5:3, col:15> Self 'Foo<TData>'
| | |-TypedefDecl 0x102870da0 <line:6:3, col:17> Data 'TData'
| | |-CXXConstructorDecl 0x102870e60 <line:7:3, col:7> Foo<TData> 'void (void)'
| | |-CXXMethodDecl 0x102871100 <line:9:3, col:36> getData 'Data (int, const char *)'
| | | |-ParmVarDecl 0x102870f50 <col:16, col:20> i 'int'
| | | `-ParmVarDecl 0x102870ff0 <col:23, col:35> s 'const char *'
| | |-AccessSpecDecl 0x1028711e0 <line:10:1, col:8> private
| | `-FieldDecl 0x102871220 <line:12:3, col:8> m_data 'Data':'TData'
| |   `-FullComment 0x102871520 <line:11:6, col:22>
| |     `-ParagraphComment 0x1028714f0 <col:6, col:22>
| |       `-TextComment 0x1028714c0 <col:6, col:22> Text=" my precious data"
| `-ClassTemplateSpecializationDecl 0x102871280 <line:1:1, line:13:1> class Foo
|   |-TemplateArgument type 'int'
`-TypedefDecl 0x102871430 <line:15:1, col:18> IntFoo 'Foo<int>':'class Foo<int>'

So we have the definition of the template class Foo, with almost everything in it: template type, typedefs, constructor, methods, etc. Even the comment associated with the field m_data is there. This may also be a chance to avoid a complex tool chain for incorporating the documentation in python and java.

However, it should be noted that some things are missing, at least in this dump:

  • the comment // just a comment is gone,
  • there is no trace of several characters, like ;, {, or } (they may be present internally but not dumped),
  • there is no detail about the type Foo<int> declared in the typedef at the last line.

The first two points are a bit problematic for vera++, but it should be possible to map the AST nodes back to the tokens and fill in the missing information. At the very least, it means vera++ can’t rely on clang alone and must keep its current tokenizer.

The last point is a problem for ITK wrappers, because this is exactly the information we need. It looks like a problem we had with gccxml though: unless it is used somewhere, the type is not resolved. In ITK, we force the type instantiation with a sizeof() of the type. So I add this at the end of my test.cpp:

void force_instantiate()
{
  sizeof(IntFoo);
}

and redump the AST:

$ clang -Xclang -ast-dump -fsyntax-only /tmp/test.cpp
/tmp/test.cpp:19:3: warning: expression result unused [-Wunused-value]
  sizeof(IntFoo);
  ^~~~~~~~~~~~~~
TranslationUnitDecl 0x1028242d0 <<invalid sloc>>
|-TypedefDecl 0x102824810 <<invalid sloc>> __int128_t '__int128'
|-TypedefDecl 0x102824870 <<invalid sloc>> __uint128_t 'unsigned __int128'
|-TypedefDecl 0x102824c30 <<invalid sloc>> __builtin_va_list '__va_list_tag [1]'
|-ClassTemplateDecl 0x102824dc0 </tmp/test.cpp:1:1, line:13:1> Foo
| |-TemplateTypeParmDecl 0x102824c80 <line:1:10, col:19> typename TData
| |-CXXRecordDecl 0x102824d30 <line:2:1, line:13:1> class Foo definition
| | |-CXXRecordDecl 0x102870c70 <line:2:1, col:7> class Foo
| | |-AccessSpecDecl 0x102870d00 <line:4:1, col:7> public
| | |-TypedefDecl 0x102870d40 <line:5:3, col:15> Self 'Foo<TData>'
| | |-TypedefDecl 0x102870da0 <line:6:3, col:17> Data 'TData'
| | |-CXXConstructorDecl 0x102870e60 <line:7:3, col:7> Foo<TData> 'void (void)'
| | |-CXXMethodDecl 0x102871100 <line:9:3, col:36> getData 'Data (int, const char *)'
| | | |-ParmVarDecl 0x102870f50 <col:16, col:20> i 'int'
| | | `-ParmVarDecl 0x102870ff0 <col:23, col:35> s 'const char *'
| | |-AccessSpecDecl 0x1028711e0 <line:10:1, col:8> private
| | `-FieldDecl 0x102871220 <line:12:3, col:8> m_data 'Data':'TData'
| |   `-FullComment 0x103000470 <line:11:6, col:22>
| |     `-ParagraphComment 0x103000440 <col:6, col:22>
| |       `-TextComment 0x103000410 <col:6, col:22> Text=" my precious data"
| `-ClassTemplateSpecializationDecl 0x102871280 <line:1:1, line:13:1> class Foo definition
|   |-TemplateArgument type 'int'
|   |-CXXRecordDecl 0x1028715d0 prev 0x102871280 <line:2:1, col:7> class Foo
|   |-AccessSpecDecl 0x102871660 <line:4:1, col:7> public
|   |-TypedefDecl 0x1028716a0 <line:5:3, col:15> Self 'class Foo<int>'
|   |-TypedefDecl 0x102871730 <line:6:3, col:17> Data 'int':'int'
|   |-CXXConstructorDecl 0x1028717c0 <line:7:3> Foo 'void (void)'
|   |-CXXMethodDecl 0x102871a20 <line:9:3, col:36> getData 'Data (int, const char *)'
|   | |-ParmVarDecl 0x1028718b0 <col:16, col:20> i 'int'
|   | `-ParmVarDecl 0x102871910 <col:23, col:35> s 'const char *'
|   |-AccessSpecDecl 0x102871ae0 <line:10:1, col:8> private
|   `-FieldDecl 0x102871b20 <line:12:3, col:8> m_data 'Data':'int'
|     `-FullComment 0x103000540 <line:11:6, col:22>
|       `-ParagraphComment 0x103000510 <col:6, col:22>
|         `-TextComment 0x1030004e0 <col:6, col:22> Text=" my precious data"
|-TypedefDecl 0x102871430 <line:15:1, col:18> IntFoo 'Foo<int>':'class Foo<int>'
`-FunctionDecl 0x1028714a0 <line:17:1, line:20:1> force_instantiate 'void (void)'
  `-CompoundStmt 0x102871b88 <line:18:1, line:20:1>
    `-UnaryExprOrTypeTraitExpr 0x102871b68 <line:19:3, col:16> 'unsigned long' sizeof 'IntFoo':'class Foo<int>'

So this time there are a few more things (in addition to the warning):

  • a FunctionDecl for the force_instantiate function,
  • a ClassTemplateSpecializationDecl corresponding to the IntFoo type, with every type resolved — great!
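Since the specialization subtree is the part I care about, here is a rough sketch of how it could be pulled out of the textual dump with a regular expression. The helper names are made up for illustration, and the parsing assumes the two-characters-per-level indentation seen in the dumps above; it does not try to interpret the rest of each line.

```python
import re

# One dump line: tree-drawing prefix, node kind, optional address, the rest.
NODE_RE = re.compile(
    r"^(?P<prefix>[|` -]*)(?P<kind>\w+)"
    r"(?: (?P<address>0x[0-9a-f]+))?(?P<rest>.*)$"
)


def parse_dump_line(line):
    """Split one dump line into (depth, kind, address, rest), or None."""
    match = NODE_RE.match(line)
    if match is None:
        return None
    # Every nesting level adds two characters of prefix ("| ", "|-", "`-").
    depth = len(match.group("prefix")) // 2
    return (depth, match.group("kind"), match.group("address"),
            match.group("rest").strip())


def subtree(lines, kind):
    """Return the first node of the given kind and all of its descendants."""
    nodes = []
    root_depth = None
    for line in lines:
        parsed = parse_dump_line(line)
        if parsed is None:
            continue
        if root_depth is None:
            if parsed[1] == kind:
                root_depth = parsed[0]
                nodes.append(parsed)
        elif parsed[0] > root_depth:
            nodes.append(parsed)
        else:
            break  # back at the root's level or above: subtree is over
    return nodes


# A few lines copied from the second dump above.
SAMPLE = """\
| `-ClassTemplateSpecializationDecl 0x102871280 <line:1:1, line:13:1> class Foo definition
|   |-TemplateArgument type 'int'
|   |-TypedefDecl 0x1028716a0 <line:5:3, col:15> Self 'class Foo<int>'
|   `-FieldDecl 0x102871b20 <line:12:3, col:8> m_data 'Data':'int'
|-TypedefDecl 0x102871430 <line:15:1, col:18> IntFoo 'Foo<int>':'class Foo<int>'
""".splitlines()

for depth, kind, address, rest in subtree(SAMPLE, "ClassTemplateSpecializationDecl"):
    print("  " * depth + kind, rest)
```

Everything after the address (source range, name, type) is left as one opaque string here, which is exactly the part that would be tedious to parse reliably.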

libclang

The dump seems to have everything needed, but is certainly not easy to parse. And clang is made to be used as a library, so there should be a good API to access everything needed in the AST. In fact there are two options: use the stable API called libclang, or use the internal clang AST, which may change in the future. Based only on that, I’m more tempted to try libclang.

There is even a python module for libclang, which is nice because the ITK code generator is already written in python. Using clang from python may avoid having to rewrite the generator from scratch.

To install the python module, I just run:

pip install clang

Using it in python requires a bit more work on Mac OS X. After the import clang.cindex, the path of libclang should be configured. Either:

clang.cindex.Config.set_library_path(
  "/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib")

with Xcode, or

clang.cindex.Config.set_library_path("/Library/Developer/CommandLineTools/usr/lib")

with the command line tools. Then create the index and read the file:

index = clang.cindex.Index.create()
tu = index.parse("/tmp/test.cpp")

Note that libclang is completely silent when the index is created this way. This is the expected behavior when using libclang from a script, because the diagnostics are accessible through the API. In the interactive python interpreter, however, it makes things more complicated, and it is very easy to miss that an error occurred while parsing the file. It is possible to make libclang display the problems by instantiating the index this way:

index = clang.cindex.Index(clang.cindex.conf.lib.clang_createIndex(False, True))

And of course it would be even better to be able to pass this option to the normal clang.cindex.Index.create().
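In a script, the diagnostics can also be read back through the API and printed explicitly. The sketch below relies on the diagnostics, severity, location and spelling attributes of the python bindings; the function names and the severity labels are mine, and I assume the usual 0 to 4 severity numbering of libclang.

```python
# Labels for libclang's diagnostic severities, indexed by their value (0 to 4).
SEVERITY_NAMES = ["ignored", "note", "warning", "error", "fatal"]


def format_diagnostics(translation_unit):
    """Return one 'file:line:column: severity: message' string per diagnostic."""
    lines = []
    for diag in translation_unit.diagnostics:
        loc = diag.location
        # Some diagnostics (e.g. command line problems) have no file attached.
        filename = loc.file.name if loc.file is not None else "<unknown>"
        lines.append("%s:%d:%d: %s: %s" % (
            filename, loc.line, loc.column,
            SEVERITY_NAMES[diag.severity], diag.spelling))
    return lines


def parse_and_report(path, args=("-x", "c++")):
    """Parse a file with libclang and print what the parser complained about."""
    import clang.cindex  # imported lazily: needs the clang python module

    index = clang.cindex.Index.create()
    translation_unit = index.parse(path, list(args))
    for line in format_diagnostics(translation_unit):
        print(line)
    return translation_unit
```

Calling parse_and_report("/tmp/test.cpp") after configuring the library path should then show the unused sizeof warning, without touching the way the index is created.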

The API lets us iterate over all the nodes very easily. Dumping some of the information in the AST can be done this way:

#!/usr/bin/env python

import clang.cindex, asciitree, sys

clang.cindex.Config.set_library_path("/Library/Developer/CommandLineTools/usr/lib")
index = clang.cindex.Index(clang.cindex.conf.lib.clang_createIndex(False, True))
translation_unit = index.parse(sys.argv[1], ['-x', 'c++'])

print(asciitree.draw_tree(translation_unit.cursor,
  lambda n: n.get_children(),
  lambda n: "%s (%s)" % (n.spelling or n.displayname, str(n.kind).split(".")[1])))

Called on the test.cpp file, it produces:

$ python /tmp/dump.py /tmp/test.cpp 
/tmp/test.cpp (TRANSLATION_UNIT)
  +--Foo (CLASS_TEMPLATE)
  |  +--TData (TEMPLATE_TYPE_PARAMETER)
  |  +-- (CXX_ACCESS_SPEC_DECL)
  |  +--Self (TYPEDEF_DECL)
  |  |  +--Foo<TData> (TYPE_REF)
  |  +--Data (TYPEDEF_DECL)
  |  |  +--TData (TYPE_REF)
  |  +--Foo<TData> (CONSTRUCTOR)
  |  +--getData (CXX_METHOD)
  |  |  +--Data (TYPE_REF)
  |  |  +--i (PARM_DECL)
  |  |  +--s (PARM_DECL)
  |  +-- (CXX_ACCESS_SPEC_DECL)
  |  +--m_data (FIELD_DECL)
  |     +--Data (TYPE_REF)
  +--IntFoo (TYPEDEF_DECL)
  |  +--Foo (TEMPLATE_REF)
  +--force_instantiate (FUNCTION_DECL)
     +-- (COMPOUND_STMT)
        +-- (UNEXPOSED_EXPR)
           +--IntFoo (TYPE_REF)

Unfortunately, the template specialization is not there… For some reason, it is not exposed at all, not even as an UNEXPOSED_EXPR.
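To check this programmatically rather than by eye, a small recursive search over the cursors could be used. The helper name is hypothetical; only get_children(), kind and spelling come from the python bindings.

```python
def find_cursors(cursor, predicate):
    """Yield every cursor under `cursor` (itself included) matching `predicate`."""
    if predicate(cursor):
        yield cursor
    for child in cursor.get_children():
        # Recurse depth-first, in the same order as the tree printed above.
        yield from find_cursors(child, predicate)
```

For instance, [c.kind.name for c in find_cursors(translation_unit.cursor, lambda c: "Foo" in c.spelling)] lists the kinds of all the nodes whose name mentions Foo, which makes it easy to see whether anything resembling a specialization shows up.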

Conclusion

So in the end, clang still seems to be a great tool, but its stable API, libclang, seems to lack some of the features I need. This is OK for vera++, and may be investigated more closely in the future. For ITK wrapping, I guess I have a few options:

  • fix libclang
  • use the internal AST to create the code generator
  • parse the AST dump
  • fix gccxml

I think I’ll go for the first.