Loopifying Tail-Recursive Functions




Welcome to a new episode of The Flambda2 Snippets!

The F2S blog posts aim, in part, at gradually introducing the world to the inner workings of a complex piece of software engineering: The Flambda2 Optimising Compiler, a technical marvel born from a ten-year-long effort in Research & Development and Compilation, backed by many more years of expertise in all aspects of Computer Science and Formal Methods.

Today's topic is Loopify, one of Flambda2's many optimisation algorithms, which deals specifically with optimising purely tail-recursive functions as well as functions annotated with the [@@loop] attribute in OCaml.

A lazy explanation of its utility would be to say that it simply aims at reducing the number of memory allocations incurred by recursive and tail-recursive function calls in OCaml. However, we will see that this is only part of the point, so we will address the broader context: what tail calls are, how they are optimised and how they fit into the functional programming world, what dilemma exactly Loopify nullifies and, in time, many details on how it is all implemented!

If you are stumbling upon this article and wish to get a bird's-eye view of the entire F2S series, be sure to refer to Episode 0, which contextualises and summarises, as well as points to, all subsequent episodes.

All feedback is welcome, thank you for staying tuned and happy reading!

Tail Call Optimisation

As far as we know, tail call optimisation (TCO) has been a reality since at least the 1970s. Some LISP implementations used it, and Scheme made it part of its language specification around 1975.

The debate about supporting TCO still happens regularly today. Nowadays, it's a given that most functional languages support it (Scala, OCaml, Haskell, Scheme and so on). Other languages and compilers have supported it for some time too, either optionally, as with some C compilers (gcc and clang) that perform TCO in specific compilation scenarios; or systematically, as with Lua which, despite not usually being considered a functional language, specifies that TCO occurs whenever possible (section 3.4.10 of the manual).

So what exactly is Tail Call Optimisation?

A good place to start is the Wikipedia page. You may also find some precious insight about the link between the semantics of GOTO and tail calls here, in a course (in French) by Xavier Leroy at the Collège de France.

In addition to these resources, here are some images to help you visualise how TCO improves stack memory consumption. Assume that g is a recursive function called from f:
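To make the diagrams concrete, here is a minimal OCaml sketch of the situation; the names f and g match the figures, but the bodies are hypothetical:

```ocaml
(* [g] is recursive but NOT tail-recursive: after each recursive call
   returns, a multiplication remains to be done, so every pending call
   keeps its stackframe alive. *)
let rec g n =
  if n = 0 then 1
  else n * g (n - 1)

(* [f] is the outer caller from the diagrams. *)
let f n = g n
```

Calling f with a large enough argument would exhaust the stack, which is exactly the scenario the figures below illustrate.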

A representation of the textbook behaviour for recursive functions stackframe allocations. You can see here that the stackframes of non-tail-recursive functions are allocated sequentially on decreasing memory addresses which may eventually lead to a stack overflow.


Now, let's consider a tail-recursive implementation of the g function in a context where TCO is not supported. Tail recursion means that the last thing g does before returning is call itself. The key point is that we still have a frame for the calling version of g, but we know that it will only be used to return to the parent: the frame itself no longer holds any relevant information besides the return address, so the corresponding memory space is mostly wasted.

A representation of the textbook behaviour for tail-recursive functions stackframe allocations without Tail Call Optimisation (TCO). When TCO is not implemented the behaviour for these allocations and the potential for a stack overflow are the same as with non-tail-recursive functions.


And finally, let's look at the same function in a context where TCO is supported. It is now apparent that memory consumption is much improved: we reuse the space of the previous stackframe to allocate the next one, all the while preserving its return address:

A representation of the textbook behaviour for tail-recursive functions stackframe allocations with TCO. Since TCO is implemented, we can see that the stack memory consumption is now constant, and that the potential that this specific tail-recursive function will lead to a stack overflow is diminished.


Tail calls in OCaml

The List data structure is fundamental to, and ubiquitous in, functional programming. Therefore, it's important not to have an arbitrary limit on the size of the lists one can manipulate. Indeed, most list manipulation functions are naturally expressed as recursive functions, and can most of the time be implemented as tail-recursive functions. Without guaranteed TCO, a programmer could not be sure that their program would not stack overflow at some point. The same reasoning applies to many other recursive data structures that commonly occur in programs and libraries.

In OCaml, TCO is guaranteed. Ever since the language's inception, Cameleers have unanimously agreed to guarantee the optimisation of tail calls. While the compiler has supported TCO from the beginning, the [@tailcall] attribute was later added to help users ensure that their calls are in tail position.
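For instance, annotating a call with [@tailcall] makes the compiler emit a warning if that call is not actually in tail position. A small sketch:

```ocaml
(* The annotated call is in tail position, so this compiles silently;
   had it not been, the compiler would have warned us. *)
let rec length acc = function
  | [] -> acc
  | _ :: tl -> (length [@tailcall]) (acc + 1) tl
```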

Recently, TCO was also extended with the Tail Mod Cons optimisation, which allows the compiler to generate tail calls in more cases.
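For example, since OCaml 4.14, a map written in the natural, non-tail-recursive style can be compiled into a tail-recursive one by annotating it with [@tail_mod_cons]:

```ocaml
(* The recursive call is not a tail call in the source (the [::]
   constructor is applied to its result), but the Tail Mod Cons
   transformation turns it into one, making [map] safe on very
   long lists. *)
let[@tail_mod_cons] rec map f = function
  | [] -> []
  | x :: r -> f x :: map f r
```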

Reducing allocations VS writing clean code

One of the main purposes for the existence of Loopify can be found in the following conversation: a Discuss post about the unboxing of floating-point values in OCaml and performance.

This specific comment sparked a secondary conversation that you may want to read for yourself, but you will find a quick breakdown of it below; it will be a nice starting point for understanding today's subject.

Consider the following code:

let sum l =
  let rec loop s l =
    match l with
    | [] -> s
    | hd :: tl ->
      (* This allocates a boxed float *)
      let s = s +. hd in
      loop s tl 
  in
  loop 0. l

This is a simple tail-recursive implementation of a sum function for a list of floating-point numbers. However, it is not as efficient as we would like it to be.

Indeed, OCaml needs a uniform representation of its values in order to implement polymorphic functions. In the case of floating-point numbers, this means that the numbers are boxed whenever they need to be used as generic values.

Moreover, every time we call a function, all parameters have to be considered as generic values. Thus we cannot avoid an allocation at each recursive call in this function.

If we were to optimise it in order to get every last bit of performance out of it, we could try something like:

Warning: The following was coded by trained professionals, do NOT try this at home.

let sum l = 
  (* Local references *)
  let s = ref 0. in
  let cur = ref l in
  try
    while true do
      match !cur with
      | [] -> raise Exit
      | hd :: tl ->
        (* Unboxed floats -> No allocation *)
        s := !s +. hd;
        cur := tl
    done; assert false
  with Exit -> !s (* The only allocation *)

While references generally introduce an allocation and a layer of indirection, when the compiler can prove that a reference is strictly local to a given function, it will use mutable variables instead of reference cells.

In our case, s and cur do not escape the function and are therefore eligible for this optimisation.

After this optimisation, s is now a mutable variable of type float and so it can also trigger another optimisation: float unboxing.

You can see more details here; note that, in this specific example, all occurrences of boxing operations disappear except a single one at the end of the function.

We like to think that not forcing the user to write such code is a benefit, to say the least.


A camel enjoying a bit of Loopifying once in a while


Loopify

Concept

There is a general concept of transforming function-level control flow into direct IR continuations in order to benefit from "basic block-level" optimisations. One such pattern is present in the local-function optimisation triggered by the [@local] attribute. Loopify is an attempt to extend the range of this kind of optimisation to proper (meaning self) tail-recursive calls.

As you saw previously, in some cases (e.g. numerical computation), recursive functions can hurt performance because they introduce allocations.

That lost performance can be recovered by hand-writing loops using local references; however, it is unfortunate to encourage non-functional code in a language such as OCaml.

One of Flambda and Flambda2's goals is to avoid such situations and to allow good-looking, functional code to be as performant as code written and optimised by hand at the user level.

Therefore, we introduce a solution to the specific problem described above with Loopify, which, in a nutshell, transforms tail-recursive functions into non-recursive functions containing a loop, hence the name.

Deciding to Loopify or not

The decision to loopify a given function is made during the conversion from the Lambda IR to the Flambda2 IR. The transformation is triggered in two cases:

  • when a function is purely tail-recursive, meaning that all the recursive calls within its body are self tail calls (we call these proper calls);
  • when the user annotates the function in the source code with the [@@loop] attribute.

Let's see an example of each:

(* Not a tail-rec function: is not loopified *)
let rec map f = function
  | [] -> []
  | x :: r -> f x :: map f r

(* Is tail-rec: is loopified *)
let rec fold_left f acc = function
  | [] -> acc
  | x :: r -> fold_left f (f acc x) r

Here, the decision to loopify is automatic and requires no input from the user. Quite straightforward.


Onto the second case now:

(* Helper function, not recursive, nothing to do. *)
let log dbg f arg =
  if dbg then
    print_endline "Logging...";
  f arg
[@@inline]

(* 
  Not tail-rec in the source, but may become
  tail-rec after inlining of the [log] function.
  At this point we can loopify, provided that the
  user specified a [@@loop] attribute.
*)
let rec iter_with_log dbg f = function
  | [] -> ()
  | x :: r ->
    f x;
    log dbg (iter_with_log dbg f) r
[@@loop]

The recursive function iter_with_log is not initially purely tail-recursive.

However, after the inlining of the log function and subsequent simplification, the new code for iter_with_log becomes purely tail-recursive.

At that point we have the ability to loopify the function, but we refrain from doing so unless the user has put the [@@loop] attribute on the function definition.

The nature of the transformation

Onto the details of the transformation.

First, we introduce a recursive continuation at the start of the function. Let's call it k_self.

Then, at each tail-recursive call, we replace the function call with a continuation call to self with the same arguments as the original call.

let rec iter_with_log dbg f l =
  let_cont rec k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      log dbg (iter_with_log dbg f) r
  in
  apply_cont k_self (dbg, f, l)

Then, we inline the log function:

let rec iter_with_log dbg f l =
  let_cont rec k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      (* Here the inlined code starts *)
      (*
        We first start by binding the arguments of the
        original call to the parameters of the function's code
       *)
      let dbg = dbg in
      let f = iter_with_log dbg f in
      let arg = r in
      if dbg then
        print_endline "Logging...";
      f arg
  in
  apply_cont k_self (dbg, f, l)

Following these transformations, a proper tail-recursive call appears, and we replace it with the adequate continuation call.

let rec iter_with_log dbg f l =
  let_cont rec k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      (* Here the inlined code starts *)
      (*
        Here, the let bindings have been substituted
        by the simplification.
       *)
      if dbg then
        print_endline "Logging...";
      apply_cont k_self (dbg, f, l)
  in
  apply_cont k_self (dbg, f, l)

In this context, the benefit of transforming a function call to a continuation call is mainly about allowing other optimisations to take place. As shown in the previous section, one of these optimisations is unboxing which can be important in some cases like numerical calculus. Such optimisations can take place because continuations allow more freedom than function calls which must respect the OCaml calling conventions.

One could think that a continuation call is intrinsically cheaper than a function call. However, the OCaml compiler already optimises self tail calls such that they are as cheap as continuation calls (i.e., a single jump instruction).

An astute reader will realise that this transformation can apply to any function, and that it results in one of three outcomes:

  • if the function is not tail-recursive, or not recursive at all, nothing happens: the transformation does nothing;
  • if the function is purely tail-recursive, then all recursive calls are replaced by continuation calls, and after optimisation the function is no longer recursive. This allows us to later inline it and even specialise some of its arguments. This is precisely the case where we automatically decide to loopify a function;
  • if the function is not purely tail-recursive but contains some tail-recursive calls, then the transformation rewrites those calls but not the other ones. This may result in better code, but it's hard to be sure in advance. In such cases (and in cases where functions become purely tail-recursive only after inlining), users can force the transformation by using the [@@loop] attribute.
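As a sketch of the third case, here is a hypothetical function mixing a tail call with a non-tail call; only the former would be rewritten into a continuation call, and the [@@loop] attribute forces the transformation anyway:

```ocaml
type tree = Leaf of int | Node of tree * tree

(* The call on [l] is not a tail call (its result is still needed),
   while the call on [r] is: only the latter can become a
   continuation call. *)
let rec sum_tree acc = function
  | Leaf n -> acc + n
  | Node (l, r) ->
    let acc = sum_tree acc l in (* not a tail call *)
    sum_tree acc r              (* proper tail call *)
[@@loop]
```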

Conclusion

  • It is better to have this kind of pass so that users who care about performance are not forced to write imperative code; one way to achieve this is to make sure that all the different ways of writing the same thing, notably for/while loops versus recursive functions, have similar performance.

  • This is one of the first articles to describe an optimisation from a user's point of view. It does not yet go into the juicy details of the implementation; that is a subject for another time.

  • Inverted funnel: in the next episode, we will dive deeper into the implementation side of things.

  • This transformation may look naive at first, but the interesting thing is that, since we perform it during (in the same place as) inlining, we can discover a posteriori cases of effectively purely tail-recursive calls, and thus cases, not obvious up front, where the optimisation is worthwhile.

This is a fundamental principle of the philosophy and design behind Flambda2: performing many small optimisations at the same time which, combined, achieve great things in terms of smart code optimisation. Throughout this blog post series, this principle will come up often, starting with the next episode, which will present the general structure that makes this philosophy a reality.

Stay tuned, and thank you for reading, until next time, see you Space Cowboy.



About OCamlPro:

OCamlPro is an R&D lab founded in 2011, with the mission to help industrial users benefit from experts with state-of-the-art knowledge of programming language theory and practice.

  • We provide audit, support, custom developer tools and training for both the most modern languages, such as Rust, Wasm and OCaml, and for legacy languages, such as COBOL or even home-made domain-specific languages;
  • We design, create and implement software with great added value for our clients. High complexity is not a problem for our PhD-level experts. For example, we helped the French Income Tax Administration re-adapt and improve their internally maintained M language, we designed a DSL, codenamed Niagara, to model and express revenue streams in the cinema industry, and we also developed the prototype of the Tezos proof-of-stake blockchain from 2014 to 2018.
  • We have a long history of creating open-source projects, such as the Opam package manager, the LearnOCaml web platform, and contributing to other ones, such as the Flambda optimizing compiler, or the GnuCOBOL compiler.
  • We are also experts of Formal Methods, developing tools such as our SMT Solver Alt-Ergo (check our Alt-Ergo Users' Club) and using them to prove safety or security properties of programs.

Please reach out, we'll be delighted to discuss your challenges: contact@ocamlpro.com or book a quick discussion.