C++ meta programming: recursive template functions

C++ provides a few mechanisms for meta programming through its automatic type deduction tools and its templating engine.  This allows, for example, compile time evaluation of constants and loop unpacking.

We’ll demonstrate a way to use recursive template functions to evaluate a factorial at compile time, and to do loop unpacking. A rough evaluation of the efficacy of these hand micro-optimisations is also considered.

Loop unpacking of a sum

This means substituting a sum using a for-loop with the full summand on a single line. For example, replacing

for(int i=1; i<=3; i++)
  sum += A[i];

with

sum += A[1] + A[2] + A[3];

In principle, you might save on the overhead of maintaining the loop counter and branch, and take advantage of any optimisations for single-line evaluations.

Approach 1 (template class):

#include <iostream>
using std::cout;

template<int n>
class sum {
public:
  static inline int add(int *A) {
    return A[n] + sum<n-1>::add(A);
  }
};

// Base case: terminates the recursion at index 0.
template<>
class sum<0> {
public:
  static inline int add(int *A) {
    return A[0];
  }
};

int main() {
  int A[3] = {2, 5, 10};
  cout << "adding " << sum<2>::add(A) << "\n";  // prints "adding 17"
  return 0;
}

Approach 2 (template function):

#include <iostream>
using std::cout;

template<int n>
inline int add(int *A) {
  return A[n] + add<n-1>(A);
}

// Full specialisation as the base case.
template<>
inline int add<0>(int *A) {
  return A[0];
}

int main() {
  int A[3] = {2, 5, 10};
  cout << "adding " << add<2>(A) << "\n";  // prints "adding 17"
  return 0;
}

Evaluating literals

As long as a value can be computed at compile time, this allows constant time access to it at run time.

Approach 1 (function recursion):

constexpr int metafactorial(int n) {
  return n <= 1 ? 1 : (n * metafactorial(n - 1));
}

Approach 2 (enum recursion):

#include <iostream>
using std::cout;

template<int i>
class Factorial {
public:
  enum {factorial = i * Factorial<i-1>::factorial};
};

// Base case: stops the recursion at 1.
template<>
class Factorial<1> {
public:
  enum {factorial = 1};
};

int main() {
  cout << Factorial<10>::factorial << "\n";
  return 0;
}

Benchmarks

So is this actually a good idea? As with almost everything in life, context is everything.  Development time aside, let's look at some metrics.

Test environment

LinuxMint 18 with 4.4.0 kernel, running on an Intel i7 4770K with 32 GB RAM, and gcc/g++ 5.4.0.

The benchmark program is not particularly interesting (I can provide the source if people want it). It does a run time calculation on the factorial computation and on the summation computation.

The factorial computation uses n = 200000, and compares a regular recursive factorial function against the meta programming approach.

For the loop unpacking, it compares the for-loop against the two approaches listed above. Stack and heap allocated arrays are tested. Elements are dynamically initialised to values between 0 and 2. Each array has 90000 elements.

The enum recursion could not be tested for anything meaningfully large (even n = 20 was too much), as integer overflows were treated as compile errors, and attempting to evaluate really large factorials also caused a compiler segmentation fault.

In principle, one should ideally run an isolated benchmark for each of those modules individually. I ran them in a batch, only with the aim of giving a ballpark comparison.

The calculations are averaged over 5 repeated samples.

Compile time

  • for-loop and normal recursive factorial function :  negligible compile time.
  • Full benchmark with -O0 optimisation: ~ real 8m16.409s, user 8m15.776s, sys 0m0.684s
  • Full benchmark with -O2 optimisation: ~ real 9m9.529s, user 9m11.992s, sys 0m3.296s

That’s a huge difference in compile time. The templating engine does a whole lot of extra work.

Binary size

  • for-loop and normal recursive factorial function : ~ 13 KB
  • Full benchmark with -O0 optimisation: ~ 23.9 MB.
  • Full benchmark with -O2 optimisation: ~ 1.1 MB.

The difference in binary sizes is stark, with 13 KB on the low end, and almost 24MB at the high end!

Run time

This was a lot harder to analyse, since modern compilers are magical unicorns that spread magic, sometimes even when you ask them not to. But let’s try anyway.

Full benchmark with -O0 optimisation

The factorial calculation had the most noticeable gain.

  • Normal recursive function:  ~ 7.1 e-3 s
  • Approach 1 (function recursion): ~ 1.5 e-3 s

The constexpr function looks to be almost 5 times faster with these parameters. One would expect the meta programming approach to stay about constant as n got larger, whereas the normal recursive function would scale with n.

Now let’s try the loop unpacking.

Stack array

  • for-loop: ~3.6 e-4 s
  • Approach 1(template class): ~  5.9 e-3 s
  • Approach 2(template function): ~ 8.4 e-3 s

Heap array

  • for-loop: ~ 3.3 e-4 s
  • Approach 1 (template class): ~ 5.1 e-3 s
  • Approach 2 (template function): ~ 6.0 e-3 s

What’s interesting is that the for loop created faster code than the template methods. I speculate that there could be several reasons for this, like the compiler being able to recognise the for loops better for optimisation, or extra baggage at OS or hardware from the templating itself (memory use, etc).

What’s also interesting is the slight decrease in time taken to access the heap array.

Full benchmark with -O2 optimisation

Since this is the most commonly used optimisation level in gcc/g++, these results are probably of most interest to us.

Factorial calculation:

  • Normal recursive function: ~ 5.1 e-5 s
  • Approach 1 (function recursion): ~ 0 s (below measurement resolution)

While meta programming is still noticeably faster, the normal recursive function has closed the gap significantly (it is now two orders of magnitude faster than at -O0), and for our choice of n, the difference is negligible.

Now let’s try the loop unpacking.

Stack array

  • for-loop: ~ 7.1 e-5 s
  • Approach 1 (template class): ~  1.5 e-4 s
  • Approach 2 (template function): ~ 1.5 e-4 s

Heap array

  • for-loop: ~ 5.7 e-5 s
  • Approach 1 (template class): ~ 4.9 e-5 s
  • Approach 2 (template function): ~ 7.0 e-5 s

It seems like the compiler does a better job optimising for-loops than templates, with the exception of the template class method that is accessing the heap array. Approach 1 appears to outperform Approach 2.

Conclusions

This little experiment has raised some more questions about heap vs stack access times with this particular test setup.  Interestingly, the slightly faster heap access does not seem to be a fluke, as it was seen consistently across all tests.

The take home message, by and large, is that attempting to beat compilers at micro-optimisations is a fool’s errand for these examples.  You should spend your time doing something else instead (unless of course, long compile times are your objective).

Computing literals may be an exception to this, as it consistently beat the run time of the naive method. However, as more compiler optimisations were enabled, the gap closed significantly. Nevertheless, if you are computing constants with non-trivial algorithmic run times for fast lookup, then it is worth considering.  You could also just cache that information in a file and load it into your program.

At the end of it all, if you still insist on using template meta programming for something like loop unpacking, you should favour using Approach 1 (template classes).

In C++14, one might try using STL tuples with variadic templates instead, to gain efficiency. Then again, it might not help.
