Loopifying Tail-Recursive Functions
Welcome to a new episode of The Flambda2 Snippets!
The F2S blog posts aim partly at gradually introducing the world to the inner workings of a complex piece of software engineering: the Flambda2 Optimising Compiler, a technical marvel born from a 10-year-long effort in Research & Development and Compilation, backed by many more years of expertise in all aspects of Computer Science and Formal Methods.
Today's topic is Loopify, one of Flambda2's many optimisation algorithms, which specifically deals with optimising purely tail-recursive functions as well as functions annotated with the [@@loop] attribute in OCaml.
A lazy explanation of its purpose would be to say that it simply aims at reducing the number of memory allocations in the context of recursive and tail-recursive function calls in OCaml. However, we will see that this is only part of the point, and we will therefore address the broader context: what tail calls are, how they are optimised and how they fit into the functional programming world, what dilemma Loopify nullifies exactly and, in time, many details on how it is all implemented!
If you happen to be stumbling upon this article and wish to get a bird's-eye view of the entire F2S series, be sure to refer to Episode 0 which does a good amount of contextualising as well as summarising of, and pointing to, all subsequent episodes.
All feedback is welcome, thank you for staying tuned and happy reading!
Tail Call Optimisation
As far as we know, tail-call optimisation (TCO) has been a reality since at least the 1970s. Some LISP implementations used it, and Scheme made it part of its language specification around 1975.
The debate over whether to support TCO still comes up regularly today. Nowadays, it is a given that most functional languages support it (Scala, OCaml, Haskell, Scheme and so on). Other languages and compilers have supported it for some time too, either optionally, as with some C compilers (gcc and clang) that perform TCO in specific compilation scenarios, or systematically, like Lua which, despite not usually being considered a functional language, specifies that TCO occurs whenever possible (section 3.4.10 of the manual).
So what exactly is Tail Call Optimisation?
A good place to start is the Wikipedia page. You may also find some precious insight into the link between the semantics of GOTO and tail calls here, in a course from Xavier Leroy at the Collège de France (in French).
In addition to these resources, here are some images to help you visualise how TCO improves stack memory consumption. Assume that g is a recursive function called from f:
Now, let's consider a tail-recursive implementation of the g function in a context where TCO is not supported. Tail recursion means that the last thing g does before returning is call itself. The key point is that we still have a frame for the calling version of g, but we know it will only be used to return to the parent: the frame no longer holds any relevant information besides the return address, so the corresponding memory space is mostly wasted.
And finally, let's look at the same function in a context where TCO is supported. It is now apparent that memory consumption is much improved: we reuse the space of the previous stack frame for the next one, all the while preserving its return address:
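As a concrete sketch of the pictured situation (the names f, g, and the accumulator are ours, chosen for illustration): g is tail-recursive and is called from f. With guaranteed TCO, g runs in constant stack space no matter how large n is.

```ocaml
(* [g] sums the integers from [n] down to 1 into [acc]. The recursive
   call is the very last thing [g] does, so with TCO its stack frame
   is reused at each iteration. *)
let rec g acc n =
  if n = 0 then acc
  else g (acc + n) (n - 1) (* tail call: the stack frame is reused *)

let f n = g 0 n
```

Calling f 1_000_000 returns without growing the stack, where a non-tail-recursive version would risk a stack overflow.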
Tail calls in OCaml
The List data structure is fundamental to, and ubiquitous in, functional programming. It is therefore important not to have an arbitrary limit on the size of the lists that one can manipulate. Indeed, most List manipulation functions are naturally expressed as recursive functions, and can most of the time be implemented as tail-recursive functions. Without guaranteed TCO, a programmer could not be sure that their program would not stack overflow at some point. The same reasoning applies to many other recursive data structures that commonly occur in programs and libraries.
In OCaml, TCO is guaranteed. Ever since the language's inception, Cameleers have unanimously agreed on guaranteeing the optimisation of tail calls.
While the compiler has supported TCO from the beginning, an attribute, [@tailcall], was later added to help users ensure that their calls are in tail position. More recently, TCO was also extended with the Tail Modulo Cons (TMC) optimisation, which allows the compiler to generate tail calls in more cases.
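As a quick illustration (our own example), the [@tailcall] attribute is attached to a call site and makes the compiler emit a warning if that call turns out not to be a tail call, e.g. after a refactoring:

```ocaml
(* [length] with an accumulator: the [@tailcall] annotation documents
   and checks that the recursive call to [aux] is in tail position. *)
let length l =
  let rec aux acc = function
    | [] -> acc
    | _ :: tl -> (aux [@tailcall]) (acc + 1) tl
  in
  aux 0 l
```

If someone later rewrote the second branch as `1 + aux acc tl`, the annotation would trigger a compiler warning instead of silently losing the constant-stack guarantee.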
Reducing allocations vs writing clean code
One of the main motivations for the existence of Loopify can be found in the following conversation: a Discuss post about the unboxing of floating-point values in OCaml and performance. This specific comment sparks a secondary conversation that you may want to read for yourself, but a quick breakdown of it follows and will be a nice starting point for understanding today's subject.
Consider the following code:
let sum l =
  let rec loop s l =
    match l with
    | [] -> s
    | hd :: tl ->
      (* This allocates a boxed float *)
      let s = s +. hd in
      loop s tl
  in
  loop 0. l
This is a simple tail-recursive implementation of a sum function for a list of floating-point numbers. However, it is not as efficient as we would like it to be.
Indeed, OCaml needs a uniform representation of its values in order to implement polymorphic functions. In the case of floating-point numbers, this means that the numbers are boxed whenever they need to be used as generic values.
Besides, every time we call a function, all parameters have to be treated as generic values. Thus we cannot avoid an allocation at each recursive call in this function.
If we were to optimise it in order to get every last bit of performance out of it, we could try something like:
Warning: The following was coded by trained professionals, do NOT try this at home.
let sum l =
  (* Local references *)
  let s = ref 0. in
  let cur = ref l in
  try
    while true do
      match !cur with
      | [] -> raise Exit
      | hd :: tl ->
        (* Unboxed floats -> No allocation *)
        s := !s +. hd;
        cur := tl
    done;
    assert false
  with Exit -> !s (* The only allocation *)
While, in general, references introduce one allocation and a layer of indirection, when the compiler can prove that a reference is strictly local to a given function, it will use mutable variables instead of reference cells.
In our case, s and cur do not escape the function and are therefore eligible for this optimisation.
After this optimisation, s is now a mutable variable of type float, so it can also trigger another optimisation: float unboxing.
You can see more details here but note that, in this specific example, all occurrences of boxing operations disappear except a single one at the end of the function.
We like to think that not forcing the user to write such code is a benefit, to say the least.
Loopify
Concept
There is a general concept of transforming function-level control flow into direct IR continuations in order to benefit from "basic block-level" optimisations. One such pattern is present in the local-function optimisation triggered by the [@local] attribute. Loopify is an attempt to extend the range of this kind of optimisation to proper (meaning self) tail-recursive calls.
As you saw previously, in some cases (e.g. numerical computation), recursive functions can hurt performance because they introduce allocations. That lost performance can be recovered by hand-writing loops using local references; however, it is unfortunate to encourage non-functional code in a language such as OCaml.
One of Flambda and Flambda2's goals is to avoid such situations and to allow good-looking, functional code to be as performant as code that is hand-written and hand-optimised at the user level.
Therefore, we introduce a solution to the specific problem described above with Loopify, which, in a nutshell, transforms tail-recursive functions into non-recursive functions containing a loop, hence the name.
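To build some intuition in source terms, here is a hand-written sketch of what loopifying a function like fold_left amounts to; the real transformation happens on the Flambda2 IR, not on OCaml source, so this is only an analogy:

```ocaml
(* Hand-written sketch of a loopified [fold_left]: each self tail call
   becomes one iteration of a local loop over mutable state. *)
let fold_left f acc l =
  let acc = ref acc in
  let cur = ref l in
  (try
     while true do
       match !cur with
       | [] -> raise Exit
       | x :: r ->
         acc := f !acc x;
         cur := r
     done
   with Exit -> ());
  !acc
```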
Deciding to Loopify or not
The decision to loopify a given function is made during the conversion from the Lambda IR to the Flambda2 IR. The transformation is triggered in two cases:
- when a function is purely tail-recursive, meaning all the calls to it within its own body are self tail calls (we call these proper calls);
- when the user annotates the function in the source code with the [@loop] attribute.
Let's look at an example of each:
(* Not a tail-rec function: is not loopified *)
let rec map f = function
| [] -> []
| x :: r -> f x :: map f r
(* Is tail-rec: is loopified *)
let rec fold_left f acc = function
| [] -> acc
| x :: r -> fold_left f (f acc x) r
Here, the decision to loopify is automatic and requires no input from the user. Quite straightforward.
Onto the second case now:
(* Helper function, not recursive, nothing to do. *)
let log dbg f arg =
  if dbg then
    print_endline "Logging...";
  f arg
  [@@inline]

(*
  Not tail-rec in the source, but may become
  tail-rec after inlining of the [log] function.
  At this point we can loopify, provided that the
  user specified a [@@loop] attribute.
*)
let rec iter_with_log dbg f = function
  | [] -> ()
  | x :: r ->
    f x;
    log dbg (iter_with_log dbg f) r
  [@@loop]
The recursive function iter_with_log is not initially purely tail-recursive. However, after the inlining of the log function and subsequent simplification, the new code for iter_with_log becomes purely tail-recursive. At that point, we have the ability to loopify the function, but we refrain from doing so unless the user has specified the [@@loop] attribute on the function definition.
The nature of the transformation
Onto the details of the transformation.
First, we introduce a recursive continuation at the start of the function; let's call it self. Then, at each tail-recursive call site, we replace the function call with a continuation call to self, passing the same arguments as the original call.
let rec iter_with_log dbg f l =
  let_cont rec k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      log dbg (iter_with_log dbg f) r
  in
  apply_cont k_self (dbg, f, l)
Then, we inline the log function:
let rec iter_with_log dbg f l =
  let_cont k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      (* Here the inlined code starts *)
      (*
        We first start by binding the arguments of the
        original call to the parameters of the function's code
      *)
      let dbg = dbg in
      let f = iter_with_log dbg f in
      let arg = r in
      if dbg then
        print_endline "Logging...";
      f arg
  in
  apply_cont k_self (dbg, f, l)
These transformations then reveal a proper self tail call, which we replace with the corresponding continuation call.
let rec iter_with_log dbg f l =
  let_cont k_self dbg f l =
    match l with
    | [] -> ()
    | x :: r ->
      f x;
      (* Here the inlined code starts *)
      (*
        Here, the let bindings have been substituted
        by the simplification.
      *)
      if dbg then
        print_endline "Logging...";
      apply_cont k_self (dbg, f, l)
  in
  apply_cont k_self (dbg, f, l)
In this context, the benefit of transforming a function call into a continuation call is mainly that it allows other optimisations to take place. As shown in the previous section, one of these optimisations is unboxing, which can be important in some cases such as numerical computation. Such optimisations can take place because continuations allow more freedom than function calls, which must respect the OCaml calling conventions.
One could think that a continuation call is intrinsically cheaper than a function call. However, the OCaml compiler already optimises self tail calls so that they are as cheap as continuation calls (i.e., a single jump instruction).
An astute reader might realise that this transformation can apply to any function, with one of three outcomes:
- if the function is not tail-recursive, or not recursive at all, nothing happens: the transformation is a no-op;
- if the function is purely tail-recursive, then all recursive calls are replaced with continuation calls, and the function is no longer recursive after the optimisation. This allows us to later inline it and even specialise some of its arguments. This is precisely the case in which we automatically decide to loopify a function;
- if the function is not purely tail-recursive but contains some self tail calls, then the transformation rewrites those calls but not the others. This may result in better code, but it is hard to be sure in advance. In such cases (and in cases where functions become purely tail-recursive only after inlining), users can force the transformation with the [@@loop] attribute.
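To illustrate the third outcome with a hypothetical example of ours: in a function mixing tail and non-tail recursive calls, only the self tail calls would be rewritten into continuation calls.

```ocaml
type 'a tree = Leaf | Node of 'a tree * 'a * 'a tree

(* [iter_tree] makes two recursive calls: the first is not a tail call
   (work remains after it returns), while the second is a proper self
   tail call, and would be the only one rewritten by the
   transformation. *)
let rec iter_tree f = function
  | Leaf -> ()
  | Node (l, x, r) ->
    iter_tree f l; (* not a tail call *)
    f x;
    iter_tree f r (* proper self tail call *)
```

Whether the partially rewritten version is actually faster depends on the function, which is why this case is left to the user via [@@loop].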
Conclusion
- It is better to have this kind of pass so that users who care about performance are not forced to write imperative code. One way to achieve that is to make sure that all the different ways of writing the same thing, in particular for/while loops versus recursive functions, have similar performance.
- This is one of the first articles in the series to describe an optimisation from a user's point of view. It does not yet go into the juicy details of the implementation; that is a subject for another time.
- Inverted funnel: the next episode will dig further into the implementation side.
- This transformation may look naive a priori, but the interesting part is that it is performed during (in the same place as) inlining, which lets us discover, a posteriori, calls that have effectively become purely tail-recursive, and thus cases, not obvious up front, where the optimisation is worthwhile.

This is a fundamental principle of the philosophy and design behind Flambda2: many small optimisations performed together which, combined, achieve great things in terms of intelligent code optimisation. This principle will come up often throughout the blog-post series, starting with the next episode, which will present the general structure that makes this philosophy a reality.
Stay tuned, and thank you for reading. Until next time, see you Space Cowboy!