Orange Ng, December 2023.

Mentor: Marin Veršić.

Mentorship article on Hyperledger Wiki.


Problem

The original task and eventual goal is to generate C/C++ Foreign Function Interface (FFI) bindings for the Iroha client library. Fortunately, cbindgen, an open-source project by Mozilla, accomplishes this by generating C/C++ header files from Rust libraries for FFI use.

Users can execute cbindgen as a stand-alone program or a Rust build script. However, cbindgen is limited because it does not support the trait-associated types. This limitation is a big problem as associated types are employed heavily in the Iroha 2 codebase.

Here is an example of a trait-associated type:

trait MyTrait{
    type SomeType;
    fn some_fn() -> Self::SomeType;
}

impl MyTrait for u8{
    type SomeType = bool;
    fn some_fn() -> bool{
        false
    }
}

As you can see above, SomeType is an associated type, and concrete implementations of the trait will replace the associated type with a concrete type. In this example, we let SomeType in our implementation of MyTrait to be a bool type. Currently, cbindgen does not support this use case (mozilla/cbindgen#106) and will fail to identify the associated type. Such situations could occur whenever the types are used, such as function signatures, struct definitions, etc.

Approach 1: MIR

The first approach we tried was to use “mid-level IR” or MIR. MIR is a (mid-level) Intermediate Representation of code by the compiler, coming between high-level IR or HIR (Rust Guide: HIR) and LLVM IR. Though using it ultimately failed for various reasons, it gave us insight into what was required to fulfil the task correctly.

The key idea is that MIR resolves the associated type for us. We are delegating the work to the compiler and using its output to get the necessary information. Here is an example of an original Rust source code and the MIR output:

#[no_mangle]
extern fn demo_fn(param1: <u8 as MyTrait>::SomeType){ // Rust input
    ...
}

fn demo_fn(_1: bool) -> () { // MIR output
    ...
}

Notice that the function signature no longer contains associated types in the MIR output but instead resolves it to the concrete type bool. We can use the --emit mir flag of rustc to generate the MIR output given a Rust source file. Given the original Rust source code and the MIR output, we can use regex to do pattern matching on the strings, thus resolving the associated type to a concrete type. In the example above, we will use regex to attempt a match for demo_fn. If found, we change the 1st parameter from the <u8 as MyTrait>::SomeType to bool.

There is no good way to invoke the rustc compiler inside the Rust source code, so we opted to add a flag to the cbindgen executable to pass in the MIR file. Thus, an example invocation of cbindgen with MIR will look like this:

cbindgen --mir lib.mir --output lib.h --lang c lib.rs

This means that the user will have to manually generate the MIR files and pass them into cbindgen, which we found to be an acceptable compromise on increasing user responsibility to resolve the problem with the associated types.

However, the MIR approach turned out to be infeasible after some testing.
For instance, the regex pattern matching would no longer be reliable when you have multiple functions of the same name, e.g., new is a pretty common function name. Though Rust does not support overloaded functions, functions with the same name can still appear in the MIR. This can be from functions defined in impl blocks, be it a method of a struct or a trait implementation of a type. It could also be functions generated from derive macros or procedural macros.

Here is an example of an original Rust source code:

impl MyTrait for u8{
    type SomeType = bool;
    fn some_fn() -> bool{
        false
    }
}

#[no_mangle]
extern fn some_fn(param1: <u8 as MyTrait>::SomeType){} // Rust input

And here we look at the MIR output:

fn <impl at src/lib.rs:6:1: 6:20>::some_fn(_1: i32) -> bool {
    ...
}

fn some_fn(_1: bool) -> (){
    ...
}

There are two functions now with the same name and the same number of arguments. Also, please note that MIR removes and replaces the parameter names with numbers, making it impossible to match by parameter names.

Furthermore, it was also impractical and error-prone to write regex statements to match all possible places where types could be used. Therefore, we had to rethink our entire approach and discard the previous work in favour of something more robust.

Approach 2: Intercepting cbindgen parsing to extract associated types

The second approach we tried was to “properly” parse the associated type implementations. It required more involvement with the cbindgen source code, but it turned out to be much more robust, and as a result, we eventually succeeded in parsing associated types this way.

Essentially, we do what we assume the compiler would do in this case – we keep track of feature implementations and replace the associated types when we come across them. The source code of cbindgen is divided into several phases (internals.mdcbindgen source), but most notable for us are the “loading” phase and the “transformation” phase. cbindgen maintains an internal IR that is roughly mapped to the corresponding C types. In the “loading” phase, the syn elements are parsed and converted into cbindgen Intermediate Representation (IR) elements. In the “transformation” phase, we perform several passes to make further changes to the IR elements.

The key idea is to use the loading phase and extend the code to keep track of the associated types when they are defined in trait implementations. Each trait-associated type can be uniquely identified with a combination of 3 keys: trait nameself typeassociated type name. We maintain a map where the key is the mentioned unique key combination for each associated type, and the value is the concrete type defined in the source code. Then, during the transformation phase, we add a pass to resolve all pending associated types by checking the map and replacing the unknown type with a concrete type.

The following block shows an example of the structures:

pub struct Parse{
    ...
    assoc: HashMap<AssocTypeID, Type> // This hashmap contains the mapping for each ID key and a concrete type
    ...
}

// This ID is used to uniquely identify each associated type
pub struct AssocTypeId {
    pub ty: Box<Type>, // Type is the cbindgen ir for types
    pub trait_: Path, // Path is essentially a String
    pub ident: Path,
}

// An example of this ID for the associated type shown on top
AssocTypeID{
    ty: u8,
    trait_: "MyTrait",
    ident: "SomeType",
}

The “loading” phase of cbindgen builds heavily on top of syn, a parsing library that parses a stream of Rust tokens into a syntax tree of Rust source code. Each source file that cbindgen parses will be parsed through syn to generate the respective syn items. An example would be that parsing a file produces Vec<syn::Item>syn::Item is an enum that enumerates all possible statements in Rust source code. At the top level, cbindgen then iterates through this list to convert all relevant statements into cbindgen IR items.

We tap into this to extend the functionality that cbindgen carries out when it encounters a syn::Item::Impl enum variant. For every impl statement in the source code, we parse through the syntax tree contained to identify if this is a trait implementation, and if it is, whether there are associated types. If we encounter any, we add them to the map we defined earlier, along with the concrete type that this associated type refers to.

After this, we do the actual replacements from associated to concrete types in cbindgen’s “transformation” phase.
We go through every conststaticenumstructuniontype and fn (type here refers to type aliases, or typedefs in C), parsing through the items to replace all instances of associated types with the concrete types.

Some challenges

Familiarity with Rust

I’ve implemented some projects using Rust before but encountered more advanced concepts like trait-associated types, procedural macros, etc., during the mentorship (Mentorship program on Hyperledger Wiki). It was also challenging to write idiomatic Rust code, for example:

  1. Iterate through a vector the ‘clean’ way - using iterators rather than indexing.
  2. Using functional programming styles with closures in Rust when it makes more sense.
  3. The verbosity of comments: balancing being too verbose (which is what I was used to) while commenting where it matters while letting good code explain itself.

My mentor, Marin Veršić, helped me the most with his insight. There are also good resources that helped me along the way:

  • The Rust Book is an excellent resource for jumping into almost any of the topics in Rust. It provides a good overview and covers most of what it thinks people need to get started.
  • I use the Rust Reference for a more in-depth understanding of statements, e.g., possible variants of certain statements. It is comprehensive. I think it is most appropriate as a reference for specific topics; think about this reference like a dictionary.
  • The Rust Compiler Development Guide was pretty helpful when I had to check out specific things like borrow-checking to understand why I was getting borrow-checker errors.

Handling edge cases in identifying types to resolve

It wasn’t easy to catch several essential edge cases, and I only knew about them because my mentor pointed them out to me. An example would be associated types nested in other types.

In the following example, we have an arbitrary struct. The first member is a classic case of associated types, while the rest contains nested associated types.

pub struct TestStruct{
    classic: <u8 as MyTrait>::SomeType, // Classic example of an associated type
    array: [<u8 as MyTrait>::SomeType; 50], // Nested within an array type
    fn_ptr: extern fn(<u8 as MyTrait>::SomeType, bool), // Nested within a fn pointer
    raw_ptr: *const <u8 as MyTrait>::SomeType, // Nested within a raw pointer
}

In my original approach, I only looked for the classic or obvious ones and replaced those, missing out on other associated types nested inside the other types. Ultimately, I adopted a recursive solution to check through all types in a depth-first manner, resolving associated types nested at all levels.

Handling associated type that depends on other associated types

The following code block illustrates another interesting edge case:

pub trait Inner{
    type inner;
}

pub trait Outer{
    type outer;
}

impl Outer for u32{
    type outer = <u32 as Inner>::inner;
}

impl Inner for u32{
    type inner = bool;
}

Notice that the implementation of the Outer trait depends on the implementation of the Inner trait. Furthermore, when we first encountered the implementation of Outer, we might not have resolved Inner yet. This is solved by maintaining an additional structure of pending associated types and resolving them when we resolve their dependencies.

Conclusion

By carrying out the second approach outlined above and taking care of the edge cases, we managed to get cbindgen to “understand” associated types and replace them with concrete types when generating bindings. This approach makes heavy use of cbindgen’s existing functionality, extending its ir for Rust types and making minimal changes to the way the code was parsed before. We narrowed our scope to support only fully-qualified trait annotations, such as <MyStruct as MyTrait>::SomeType, instead of other associated type annotations. This simplifies our parsing and makes the associated type replacement more robust. Let me demonstrate with an example.

Rust input:


pub trait MyTrait{
    type SomeType;
}

#[repr(C)]
pub struct MyStruct{
    a: u8,
}

impl MyTrait for MyStruct{
    type SomeType = i64;
}

#[no_mangle]
pub extern fn test_fn(struct_: &<MyStruct as MyTrait>::SomeType) -> <MyStruct as MyTrait>::SomeType

C header file:

int64_t test_fn(const int64_t *struct_);

We had to change our approach midway, initially using MIR and then making cbindgen understand the associated types, which took some time. We had to scrap all our work on MIR to start from scratch again. However, this process was ultimately needed, and we built a more robust and complete solution.

This experience taught me much new information about Rust: its advanced features, type system, and many more. Overall, it was an enriching experience to have worked on this.

  • No labels