Table of contents

Flambda2 Ep. 6: Reducing Allocations, Why Unboxing Matters




unboxing much ?

unboxing much ?

Welcome to a new episode of The Flambda2 Snippets!

In this episode, we explore why unboxing is such a critical optimisation in functional languages like OCaml, and how Flambda2 takes it to the next level.

The F2S blog posts aim at gradually introducing the world to the inner-workings of a complex piece of software engineering: The Flambda2 Optimising Compiler for OCaml, a technical marvel born from a 10 year-long effort in Research & Development and Compilation; with many more years of expertise in all aspects of Computer Science and Formal Methods.

All feedback is welcome, thank you for staying tuned and happy reading!

The Context of Unboxing

Where do (most) allocations come from in languages with automatic memory management?

In OCaml, as for many other functional languages, like Scheme and Haskell, all values are boxed. Know that this is not a property of the language, or its semantics, but rather an implementation design choice that provides reasonable tradeoffs. Some other languages make another choice, for instance F# and MLton.

For reference: The effectiveness of type-based unboxing (Xavier Leroy, 1997), explores the benefit of using type information to drive unboxing optimisations in a language similar to OCaml (Gallium).

Why do we have boxing?

In functional languages, uniform representations greatly simplify the implementation of polymorphism and garbage collection.

The GC must be able to study the graph of live values by following every pointer. This means that it has to know, for each value, which parts of them are pointers. If the representation is uniform, there is only one single piece of code to write to be able to do that. If it wasn't, each type would have a different way to store its pointers, implying a new way of traversing values for each type.

In polymorphic languages, by default, when looking at a pointer, we can't know the type of its value. Thus, to know how to traverse it, if the representation is not uniform, you must be provided with some meta information about the types, which entails more complexity and more runtime costs.

Nota Bene: The OCaml runtime values don't have any type information. Contrary to dynamic languages, like Python and Javascript, there is no way to find type information at runtime.

The OCaml uniform representation.

Values in OCaml are represented as either a tagged integer, or a pointer.

Pointers are assumed to be aligned on 2, so the last bit is 0. Therefore, we can use 1 in that place to identify the other kind of value : the 63 bits tagged integers.

Values that are pointers point to something that we call a Block. A block is a contiguous area of memory which contains some metadata used by the GC and either a list of (uniform) values or some data that will not be scanned (such data cannot contain values). The metadata distinguish between these two cases.

Blocks are dynamically allocated using the GC. Almost all allocated memory in OCaml programs is comprised of blocks. Therefore, when we talk about "allocations" in OCaml, we talk about "allocated blocks".

The compiled code can either work with the contents of blocks, that we call "unboxed values", or with pointers to blocks, that we call "boxed values". Unboxing is the process of transforming code that works on boxed values into code that works on unboxed values.

More details about the representation of values in OCaml can be found in the OCaml manual, see: https://ocaml.org/docs/memory-representation

Historical note: Before Multicore, up until OCaml version 4.14, it was possible to encounter pointers that pointed outside of the GC-managed heap, often called naked pointers. These have since been deprecated as of OCaml 5.

Another strategy to handle polymorphism: monomorphisation

As said previously, using uniform representation makes polymorphism simple to implement since you basically have nothing to do for it. However, it's still possible to take the hard path and forgo uniform representation altogether.

One such path is monomorphisation. It consists in duplicating all polymorphic code; one version per type that this code can encounter. That way, each version can work on a single concrete representation, tailored to that type. For instance in ML world, MLton and F# use this strategy. This duplication can either be done at compile time like MLton, or at runtime like F#.

This wouldn't be reasonable to do for languages like OCaml and Haskell.

It's impossible to do at compile time because their type systems are too complex for enumerating all possible types that a given code needs to handle.

Additionally, runtime monomorphisation would require to include a JIT in the runtime which isn't planned for either language.

Unboxing optimisations

Manual vs Automatic

Some languages allow the user to change their code to handle unboxed values instead of boxed ones, which amounts to doing manual unboxing. For instance, Haskell has specific types to represent unboxed values.

For reference, checkout the Haskell documentation

Note that in OCaml, this isn't the case.

There are on-going endeavours to add unboxed values in OCaml: OxCaml

This is not today's topic. We want to talk about optimisations that the compiler does on its own. So from now on, when we say "unboxing", keep in mind that we mean "automatic unboxing".

Automatic unboxing in OCaml

For the nth time in this article so far: all values are boxed in OCaml.

There are often runtime costs that come with handling boxed values.

Consider the following example:

(* f adds three floats *)
let f (a : float) b c =
  let t = Float.add a b in
  Float.add t c

Which is roughly translated by the compiler into:

(* Unboxed_float.add is the raw CPU float addition (~1 CPU cycle) *)
let f a b c =
  let a' = unbox a in
  let b' = unbox b in
  let t' = Unboxed_float.add a' b' in
  let t = box t' in
  let c' = unbox c in
  let t'' = unbox t in
  let z' = Unboxed_float.add t'' c' in
  box z'

Boxing values is costly and in particular in this example. The cost of boxing exceeds that of the actual computation performed which is two additions of floating point values.

In general the costs of boxing are:

  • The cost of allocating (box);
  • The cost of reading from memory rather than from a register (unbox);
  • Increased pressure on the Garbage Collector;
  • Increased variability in latency because an allocation might trigger some GC work.

The OCaml runtime is heavily optimised to minimise these specific costs. In practice, in this example, the overhead is larger, but not by much, than the cost of the additions.

Side note: Jane Street has developed an extension featuring local allocations that don't go on the heap like regular allocations, but instead use a stack that roughly follows the lexical scope of values. These allocations have a lower cost than normal ones since they don't put pressure on the GC—reducing overhead somewhat—but they still cost more than a fully unboxed version. See: Stack Allocation in OxCaml

Notice that, in this example, we can divide that overhead by roughly two. The reasoning is that the t value is not returned by this function and that the computation of z' could directly reuse t' instead of t''. We could remove t and thus avoid one box and one unbox. Doing so would lead to the following code being generated:

(* Same example as above, but with one fewer call to box *)
let f a b c =
  let a' = unbox a in
  let b' = unbox b in
  let t' = Unboxed_float.add a' b' in
  (* The box operation has been removed *)
  let c' = unbox c in
  let t''= t' in
  let z' = Unboxed_float.add t'' c' in
  box z'

Furthermore, another benefit of unboxing is that it can allow us to detect more dead code. For instance, unboxing a tuple will decouple the lifetimes of all components within which can be analysed independently and thus reveal potential dead code.

In order to unbox things, we need to decide which values are needed as boxed and which are not, the latter are the ones that we could consider for unboxing. Indeed, there are values at some point of the program that are required to be in the uniform representation. Those points are usually when the code loses control over a specific value:

  • at the interfaces of functions (arguments and returned values);
  • when a value needs to be stored on the heap;
  • when required by the OCaml ABI (for instance fields of modules);

The two first reasons above, which both require boxing, can disappear during compilation, specifically in the inlining phase. Indeed, function boundaries vanish and, based on code context, some values don't even need to be stored on the heap anymore. On the other side of this phenomenon, when that happens, it's a good sign that the inlining was beneficial. Thus explaining why we need to do both inlining and unboxing at the same time. If you recall, this was described in the article on Speculative Inlining in Flambda2.

Unboxing in the vanilla OCaml compiler

Of course, the OCaml compiler team are aware of the benefits of automatic unboxing and have supported some form of it for a while. The thing about this implementation is that it triggers only in some specific cases. There are two main configurable middle-ends for the upstream OCaml compiler: with Flambda or without (which we call clambda). Both share the same code for automatic unboxing operations on boxed numbers (int32, float, int64, nativeint), but the flambda mode also allows unboxing on some other values. However the unboxing of numbers is not accounted for in the speculative inlining metrics.

Read about how unboxing is handled:

When we apply the upstream version of automatic unboxing of numbers to the previous example, the one with the float additions, we get the same result.

There's a catch, however. The context in which the unboxing transformation happens allows for little analyses. This can make it tricky to detect if unboxing is actually beneficial, or if it isn't.

Below, an example of when it's detrimental:

(* Step 1: Source code *)
let f g h a b =
  let t = Float.add a b in
  g t;
  h t

Then, the Float.add primitive is translated to a lower level with explicit boxing and unboxing:

(* Step 2: An intermediate step in the transformation *)
let f g h a b =
  let t' = Unboxed_float.add (unbox a) (unbox b) in
  let t = box t' in
  g t;
  h t

Eventually, the unboxing procedure has been performed on the let t = box t' declaration:

(* Step 3: Resulting code *)
let f g h a b =
  let t' = Unboxed_float.add (unbox a) (unbox b) in
  g (box t');
  h (box t')

What am I supposed to see?

The first thing to note is that the compiler will identify boxed numbers based on two sources: the type of the expression, or the contents of the expression (e.g., which primitive is the head of the expression?). Once that expression has been identified as returning boxed numbers, unboxing is always performed. In this example, the expression that was identified as returning a boxed value was let t = box t'.

To unbox this expression, the compiler will substitute every use of t with box t' and replace every occurrence of unbox (box expr) with expr. In this example, since there are no occurrences of unbox (box t') there is no direct benefit. Moreover, since there are multiple occurrences of t, pushing box t' everywhere duplicates it, which is ultimately detrimental.

This example is intentionally pathological, in general the code found in the wild is closer to our first example where this transformation is an actual optimisation.


The limitations of this procedure led to designing a new one which triggers only when an Unboxing operation is a net benefit and never a detriment. Flambda 2 goes further and generalizes this by applying the same procedure to all kinds of values. It also registers the benefits into the Speculative Inlining metrics. This increases the likeliness of inlining when this creates new opportunities for unboxing.

Unboxing algorithms in Flambda2

There are different situations in which noticing that there is an unboxing opportunity requires to look at different parts of the code. Each of these situations require a different algorithm that we will breakdown for you in upcoming blogposts on the matter.

In the next episodes on Unboxing:

  1. We will see how we can identify an unboxing opportunity when looking at the declaration of a value. This is the kind of case we've used to illustrate today's topic.

In some situations, we need to track all the uses of a value to know whether unboxing is even possible:

  1. First we will explain this in the context of values local to functions (examples of such values are local references and other mutable structures);

  2. Second, we will delve into dealing with this optimisation for global values (for instance when a value is captured in a closure).

Conclusion

In this article, we've explored the fundamental reasons why unboxing matters in OCaml: reducing allocation overhead, minimising GC pressure, and improving overall performance. We've seen how the vanilla OCaml compiler handles unboxing in specific cases, and hinted at the limitations that motivated Flambda2's more sophisticated approach.

In upcoming episodes on Unboxing, we'll dive into how Flambda2 implements these optimisations in practice—covering the different mechanisms and the various situations where they apply.

Stay tuned!



About OCamlPro:

OCamlPro is a R&D lab founded in 2011, with the mission to help industrial users benefit from experts with a state-of-the-art knowledge of programming languages theory and practice.

  • We provide audit, support, custom developer tools and training for both the most modern languages, such as Rust, Wasm and OCaml, and for legacy languages, such as COBOL or even home-made domain-specific languages;
  • We design, create and implement software with great added-value for our clients. High complexity is not a problem for our PhD-level experts. For example, we helped the French Income Tax Administration re-adapt and improve their internally kept M language, we designed a DSL to model and express revenue streams in the Cinema Industry, codename Niagara, and we also developed the prototype of the Tezos proof-of-stake blockchain from 2014 to 2018.
  • We have a long history of creating open-source projects, such as the Opam package manager, the LearnOCaml web platform, and contributing to other ones, such as the Flambda optimizing compiler, or the GnuCOBOL compiler.
  • We are also experts of Formal Methods, developing tools such as our SMT Solver Alt-Ergo (check our Alt-Ergo Users' Club) and using them to prove safety or security properties of programs.

Please reach out, we'll be delighted to discuss your challenges: contact@ocamlpro.com or book a quick discussion.