I recently finished up the basics for a feature in Benchee that I had been pretty excited about
for a while now - reduction counting! But after trying it out on my own for a bit to look for bugs
and see how it could be best used so I could document the feature well, I pretty quickly ran into
some confusing results. As it turns out, the results were correct and the implementation of the
feature is totally fine, but my understanding of how reductions could be used to measure
performance was flawed, and so that’s what I wanted to share today.
This is a pretty low-level post, and to keep it short I’m not going to explain what all these
terms mean in depth, but there are plenty of other good descriptions of these concepts out there,
so if you run into a word or a concept you aren’t really familiar with, you would be well served
by doing a quick Google search to learn more about it before moving on.
The reason I was so excited about having reduction counting in Benchee was because it would give
us a consistent measure of performance that one could use to write really accurate performance
tests for highly performance sensitive functions. Wall-clock time is exceedingly difficult to use
for this because there are a nearly infinite number of factors that influence it, but reductions
are constant for a given function (assuming it’s deterministic) on a given architecture. This sort
of thing is exceptionally valuable for many applications, and not something that I’m really aware
of in any other language or runtime, so I thought this could be a really great thing for us to
have in our community.
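Reductions are observable directly from standard OTP, independent of Benchee. Here’s a minimal sketch of counting the reductions consumed by a single call using Process.info/2 (the helper function is my own, just for illustration):

```elixir
# Count the reductions consumed by a single function call by sampling the
# current process's reduction counter before and after. Process.info/2 is
# standard OTP; the helper itself is just for illustration.
count_reductions = fn fun ->
  {:reductions, reds_before} = Process.info(self(), :reductions)
  fun.()
  {:reductions, reds_after} = Process.info(self(), :reductions)
  reds_after - reds_before
end

# A deterministic function should burn a stable number of reductions on
# every call (plus a small, constant measurement overhead).
reds = count_reductions.(fn -> Enum.map(1..100, &(&1 * 2)) end)
IO.puts("reductions: #{reds}")
```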
When I got the feature implemented I ran our sort-of-canonical benchmark, comparing
Enum.flat_map/2 to Enum.map/2 |> List.flatten/1, and I saw the following results:
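(The exact benchmark code isn’t shown here; the comparison boils down to two equivalent jobs like these, with an input list I’ve made up for illustration, which then get handed to Benchee.run/2 with reduction measurement enabled:)

```elixir
# The two equivalent approaches under comparison (the input list is my own
# example, not the one from the original benchmark). Both flatten one level
# of nesting; flat_map does it in a single pass, while map |> flatten
# builds an intermediate list of lists first.
list = Enum.map(1..1_000, fn i -> [i, i * 2] end)

flat_mapped = Enum.flat_map(list, fn x -> x end)
mapped_then_flattened = list |> Enum.map(fn x -> x end) |> List.flatten()

# Same result either way; only the work done to get there differs.
true = flat_mapped == mapped_then_flattened
```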
What I expected to see was a tight correlation between runtime and reduction count - essentially,
I expected the percentage difference in reduction count to match the percentage difference in runtime. But as
you can see above, the reduction count for our slower version is 2.10x higher, but the wall-clock
time is only 1.36x slower. This means that my belief that we’d be able to use reduction count for
consistent performance testing was flawed in some way. It looks like some reductions are much
faster than others.
But why?! I thought that on average all function calls would take roughly the same amount of time
to execute! Well, it turns out that couldn’t have been farther from the truth.
What’s inside a reduction?
I was thinking about this as a lambda calculus problem, basically, where all we have is
functions. I assumed that every function was mostly made up of calls to other functions, and that
they would all take roughly the same amount of time to execute. Even though I knew that BIFs exist
in the BEAM, and that these BIFs wouldn’t be counted as reductions because they’re implemented in
C in the BEAM itself, I figured that the distribution of these BIFs would be roughly equal.
So I dug a bit deeper. I used Michał Muskała’s excellent decompile package to take a look at the BEAM assembly generated
by a couple of functions, and I was frankly blown away by how much some functions rely on
instructions that aren’t call and call_ext (which actually call functions).
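If you don’t want to pull in a package, OTP also ships a built-in disassembler that gets you similar information; as far as I know it writes the instructions to a .dis file in the current directory (a sketch, run from iex):

```elixir
# OTP's built-in disassembler: writes the BEAM instructions for the Enum
# module to a file named "Elixir.Enum.dis" in the current working directory.
:erts_debug.df(Enum)
```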
For example, here are the BEAM instructions for Enum.flat_map/2:
That’s a lot of non-function calls happening in there! All those instructions for move, bif,
kill, get_map_elements, test, jump, allocate, make_fun2, etc. are not counted as a
reduction but still add to the function’s actual wall-clock execution time. This means that if you
have a function that does a lot of stuff like this, you’re going to see that function take a lot
longer to execute even though it still counts as only one reduction.
Here’s another simpler example that I think really drives this point home. Below are two functions
that each count as one reduction when they’re called:
and here is the BEAM assembly for those functions:
And here’s the benchmark I’m running:
When I run the benchmark for these two functions we get the following results:
As we can see, our first function a_from_map/1 only has 4 instructions in its assembly (ignoring
the function info in label 7), but the second function a_from_map2/4 has 14 instructions (again
ignoring the function info in label 9), which explains why it runs ~3.5x slower - because there
are 3.5x as many instructions! And we see that again there are exactly the same number of
reductions for these two functions - 1.
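The Test module itself isn’t reproduced above, but a hypothetical pair of functions with the same character (the bodies and arities here are my guesses, not the originals) could look like this:

```elixir
defmodule Test do
  # A bare head match compiles down to only a handful of BEAM instructions.
  def a_from_map(%{a: a}), do: a

  # Matching several keys and shuffling them through a tuple emits many more
  # move/test/get_map_elements instructions, yet calling this function still
  # bumps the reduction counter exactly once - no call or call_ext inside.
  def a_from_map2(%{a: a, b: b, c: c, d: d}) do
    {x, _, _, _} = {a, b, c, d}
    x
  end
end
```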
What does this mean for this feature?
In short, I think this means that for the purpose I had in mind - performance testing - reduction
counting is basically useless. We can clearly see that the correlation between reductions and
runtime isn’t there, and there is the potential for really bad false positives where the reduction
counts are equal (or fewer!) but the runtime is significantly higher.
Also, the functions where this sort of testing would be most important are the most likely to
rely more directly on BIFs and other non-reduction operations since those are significantly faster
to execute given that they’re implemented in C directly, so using reduction counting to test those
functions wouldn’t be helpful at all and potentially misleading.
This doesn’t mean that this information is all entirely without merit, though! If you’re trying to
optimize a function like this that is on an extremely hot path and pretty far down the stack, you
could use the reduction count and look at things like the BEAM instructions to replace
reductions with BIFs and other, faster ways of doing things. Essentially, you could look at the
reduction count as a guide in your optimization work - but that’s also not something that most
developers will be doing in their production applications. That sort of optimization really only
belongs at the language level, and maybe at the library level to a very limited extent.
But then again, there could be some other way that this might be helpful to folks that I haven’t
considered, so for now I’m generally leaning towards documenting this behavior and releasing it. I
would be interested to see what this gets used for, and hopefully it adds some nice value to the
community and solves someone’s problem.
So even though it doesn’t look like this is going to work out the way that I had hoped, it’s
still cool to learn some new things about the BEAM and how we can analyze our code to help make it
better and faster, so it wasn’t a total waste!
Want to see something really crazy, but also kind of cool?! Let’s expand a bit on our previous
Test module above and add a third function that will take two reductions:
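Based on that description - the same lookup wrapped in one extra local call - the third function would be shaped roughly like this (a sketch, not the post’s exact code):

```elixir
defmodule Test do
  def a_from_map(%{a: a}), do: a

  # The same lookup routed through one extra local call, so invoking this
  # function counts as two reductions instead of one.
  def a_from_map3(map), do: a_from_map(map)
end
```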
Now, here’s the assembly for those functions:
And here’s the benchmark I’m running:
It would be totally reasonable to think that the new a_from_map3 function would probably come in
between the two previous functions, right? It’s doing the same thing as a_from_map but just
adding an extra function call around it, which means it should take more time because it’s doing
more computation, right?
Wrong! Check this out:
The BEAM has dozens of ways of optimizing applications both at compile-time and at runtime, so
when it comes to performance optimization on the BEAM, never guess and always measure! There are
tons of ways the BEAM can surprise us by making our code run way faster than we think it will.