Playing with SIMD and definite integrals in C++
I have been reading about vectorization and SIMD through SSE and AVX2 instructions in C++ for some time, and I have always wanted to try coding something with them, but I needed a sufficiently interesting problem to experiment on. Recently, I found an optimization problem that could benefit from SIMD vectorization: the approximation of definite integrals of generic real functions. The most popular approximation approaches involve partitioning the function domain into intervals and summing partial quantities computed from those intervals. This second part of the process can be easily vectorized, so I looked at this problem as a chance to improve my knowledge of SIMD instructions and of general code optimization techniques in C++. This challenge also gave me the chance to look into some MSVC-generated assembly code. The result, as usual, is on my GitHub.
The problem
Before looking at the code, let’s describe clearly what we want to do. We have a generic real function \(f(x)\) and we want to compute \(\int_a^b f(x) dx\), i.e., the definite integral of the function over the \([a,b]\) interval. We cannot compute the exact integral of a generic function, but we can approximate its value. Some of the most popular approximation techniques are the Riemann sum, the Trapezoidal rule and the Cavalieri-Simpson rule. Questions: which is the fastest? Which one provides the best result in the least time? How can we improve the performance of the best method? We will try to answer those questions.
Preparing benchmarks
Since I come from the game programming field, I tend to prefer fast results over accurate ones, so I can totally live with a not-extremely-accurate approximation as long as it arrives quickly. For this reason, I decided to discard the Trapezoidal rule from the beginning: the trapezoid area computation involves more mathematical operations than the simple rectangle area formula required by the Riemann rule.
As a general benchmark, I chose the Gaussian probability density function, since it has some useful applications and its integral is not easy to compute. This function is defined as:
$$ f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2\right) $$
with \( \mu \) and \(\sigma\) being respectively the mean and the standard deviation of the function.
The Riemann approximation rule exploits a very useful property of definite integrals: \(\int_a^b f(x)dx\) is the area of the region between the function and the X axis, delimited by \(a\) and \(b\). The rule does what the first naive approach that comes to mind would do: divide the region into rectangles and sum their areas. This of course introduces an error, because the shape of the computed area does not match the shape of the function, but we can reduce that error by increasing the number of rectangles. Increasing this number requires more CPU time to compute all those areas, so time optimization is crucial here.
The second approach I am going to implement is the Cavalieri-Simpson rule, whose mathematical derivation is a lot more complicated, but which can be summarized in a simple algorithm:
- Partition the \([a,b]\) interval into \(2n\) intervals of size \(h\);
- Compute the integral as \(\frac{h}{3}(f(a) + 4f(a + h) + 2f(a + 2h) + \dots + 2f(b - 2h) + 4f(b - h) + f(b))\).
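For reference, written out in full with its standard error term (a textbook result, not derived in this post), the composite Cavalieri-Simpson rule reads:

$$ \int_a^b f(x) dx = \frac{h}{3} \sum_{k=1}^{n} \left[ f(x_{2k-2}) + 4 f(x_{2k-1}) + f(x_{2k}) \right] - \frac{(b-a) h^4}{180} f^{(4)}(\xi) $$

with \(x_j = a + jh\), \(h = \frac{b-a}{2n}\) and some \(\xi \in (a, b)\): halving \(h\) reduces the error by a factor of 16.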
The Simpson rule is way more accurate than both the Riemann and the Trapezoidal rule, but it will be a little slower. So, let’s try to optimize the Riemann sum method and see how close the Simpson rule method can get to its time.
The naive approach
Before putting some SIMD into the code, I wrote the first easy solution that came to my mind, in order to get a baseline for later optimizations. First, I defined some types that would make my life a lot easier if, for instance, I decided to use doubles instead of floats: in that case, I would have one simple line to edit.
typedef float real_t;
typedef unsigned long precision_t;
typedef real_t(*function_t)(real_t);
The Riemann sum is really simple to implement, since it’s the sum of all the \(f(x)\) in the interval, multiplied by the step size.
// Riemann sum
real_t definite_integral_rectangles(function_t function,
const real_t a, const real_t b, precision_t nRects)
{
const real_t step = (b - a) / (real_t)nRects;
const real_t half_step = step * 0.5f;
real_t i = 0.0f;
for (real_t s = a + half_step; s < b; s += step)
i += function(s);
return i * step;
}
The Simpson rule is less trivial. First, we need to ensure that the number of intervals we divide the \([a,b]\) range into is an even number: we use the ROUND_2 macro, which simply clears the least significant bit. The y0 variable is updated at the end of the loop so that we are not forced to compute the same value twice: this would otherwise happen because the terms of the Simpson rule sum overlap. For example, in \(\frac{h}{3}[(f(x) + 4f(x + h) + f(x + 2h)) + (f(x + 2h) + 4f(x + 3h) + f(x + 4h)) + \dots]\) we would compute the \(f(x + 2h)\) value twice without this simple adjustment.
#define ROUND_2(x) ((x) & ~1ul) // clear the least significant bit
// Simpson rule
real_t definite_integral_cs(function_t function,
const real_t a, const real_t b, precision_t precision)
{
// Prevents interval overrun.
precision = ROUND_2(precision);
const real_t h = (b - a) / (real_t)precision;
const real_t double_h = h * 2;
real_t y0, y1, y2;
real_t i = 0.0f;
y0 = function(a);
for (real_t s = a; s < b; s += double_h)
{
y1 = function(s + h);
y2 = function(s + double_h);
i += y0 + (4 * y1) + y2;
y0 = y2;
}
return i * h * 0.33333f;
}
And now, let’s run the first test. To validate the algorithm, we define a gauss function which uses the standard normal distribution parameters (\(\mu = 0, \sigma = 1\)) and we compute the integral over the \([-3,3]\) interval. According to a reference calculator, this configuration should output a result of 0.9973. For each algorithm I test, I compute the integral 10000 times, dividing the region into 100000 sections. I save the average, minimum and maximum execution times, measured inside the function call (i.e., timestamps are taken before the first instruction of the function and right before the return statement). Results:
Starting performance tests...
Precision: 100000
Iterations for each test: 10000
int(gauss) with rectangles approximation
Result = 0.997647
Average execution time = 648.098 ms
int(gauss) with Cavalieri-Simpson rule approximation
Result = 0.997465
Average execution time = 651.353 ms
As we expected, the Simpson rule gives the most accurate result. Average execution times are quite similar, even if, in some contexts, a 3 ms difference would not be considered so irrelevant.
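For reference, the timing methodology described above can be sketched with a minimal harness like the following (the harness and the names benchmark and timing_stats are mine, not part of the original project; the integration function is the scalar Riemann sum from earlier):

```cpp
#include <chrono>
#include <cmath>

typedef float real_t;
typedef unsigned long precision_t;
typedef real_t (*function_t)(real_t);

const real_t sqrt_2pi = std::sqrt(2.0f * 3.14159265f);

// Standard normal density (mu = 0, sigma = 1).
real_t gauss(real_t x)
{
    return std::exp(-0.5f * x * x) / sqrt_2pi;
}

// Riemann (midpoint) sum, as in the post.
real_t definite_integral_rectangles(function_t function,
    const real_t a, const real_t b, precision_t nRects)
{
    const real_t step = (b - a) / (real_t)nRects;
    const real_t half_step = step * 0.5f;
    real_t i = 0.0f;
    for (real_t s = a + half_step; s < b; s += step)
        i += function(s);
    return i * step;
}

struct timing_stats { double avg_ms, min_ms, max_ms; real_t result; };

// Runs the integration `iterations` times over [-3, 3] and records
// average, minimum and maximum wall-clock time, measured around the call.
timing_stats benchmark(function_t f, precision_t precision, int iterations)
{
    using clock = std::chrono::steady_clock;
    timing_stats st = { 0.0, 1e300, 0.0, 0.0f };
    for (int it = 0; it < iterations; ++it)
    {
        const auto t0 = clock::now();
        st.result = definite_integral_rectangles(f, -3.0f, 3.0f, precision);
        const std::chrono::duration<double, std::milli> dt = clock::now() - t0;
        st.avg_ms += dt.count();
        if (dt.count() < st.min_ms) st.min_ms = dt.count();
        if (dt.count() > st.max_ms) st.max_ms = dt.count();
    }
    st.avg_ms /= iterations;
    return st;
}
```

With a precision of 100000 and 10000 iterations, this reproduces the measurement setup behind all the numbers in this post.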
Adding some SIMD
I started with SSE, since I found some good tutorials about it. The optimization of the Riemann sum approach with SSE is really trivial: we now have registers that can store and do math on 4 floats simultaneously, so we can speed up the sum loop by processing four floats per iteration.
const __m128 m128_INC = _mm_set_ps(0.0f, 1.0f, 2.0f, 3.0f);
real_t definite_integral_rectangles_sse(function_t function,
const real_t a, const real_t b, precision_t precision)
{
// Intervals must be multiple of 4
precision = ROUND_4(precision);
const real_t h = (b - a) / (real_t)precision;
const real_t h_2 = h * 0.5f;
__m128 mm_int = _mm_set_ps1(0.0f);
const __m128 mm_h = _mm_set_ps1(h);
const __m128 mm_h_inc = _mm_mul_ps(mm_h, m128_INC);
__m128 heights;
float __declspec(align(16)) v[4];
for (real_t s = a + h_2; s < b; s += h * 4)
{
_mm_store_ps(v, _mm_add_ps(_mm_set_ps1(s), mm_h_inc));
heights = _mm_set_ps(function(v[0]), function(v[1]),
function(v[2]), function(v[3]));
mm_int = _mm_add_ps(mm_int, heights);
}
mm_int = _mm_mul_ps(mm_h, mm_int);
mm_int = _mm_hadd_ps(mm_int, mm_int);
mm_int = _mm_hadd_ps(mm_int, mm_int);
return mm_int.m128_f32[0];
}
It is quite a bit more difficult to fit SSE instructions into the Simpson rule loop: basically, we can compute the sum of four terms at a time, but each term is made of three function values. In other words, we will compute \(y_0 + 4y_1 + y_2 + y_2 + 4y_3 + y_4 + y_4 + 4y_5 + y_6 + y_6 + 4y_7 + y_8\) in the same iteration. We need three __m128 registers to parallelize this sum. The data layout will be something like this:
register 0: |y0|y2|y4|y6| +
register 1: |y1|y3|y5|y7| * 4 +
register 2: |y2|y4|y6|y8|
const __m128 m128_INC2 = _mm_set_ps(0.0f, 2.0f, 4.0f, 6.0f);
real_t definite_integral_cs_sse(function_t function, real_t a,
real_t b, precision_t precision)
{
precision = ROUND_8(precision);
// Utility data (compute once to save CPU time)
real_t h = (b - a) / (real_t)precision;
__m128 mm_int = _mm_set_ps1(0.0f);
__m128 p, a1, a2, a3;
const __m128 mm_h_3 = _mm_set_ps1(h * 0.33333f);
const __m128 mm_4 = _mm_set_ps1(4.0f);
__m128 mm_x[3];
const __m128 mm_h = _mm_set_ps1(h);
const __m128 mm_h_inc2 = _mm_mul_ps(mm_h, m128_INC2);
float __declspec(align(16)) x[12];
float y0;
y0 = function(a);
for (real_t s = a; s < b; s += h * 8)
{
mm_x[0] = _mm_add_ps(_mm_set_ps1(s), mm_h_inc2);
mm_x[1] = _mm_add_ps(mm_x[0], mm_h);
mm_x[2] = _mm_add_ps(mm_x[1], mm_h);
_mm_store_ps(&x[0], mm_x[0]);
_mm_store_ps(&x[4], mm_x[1]);
_mm_store_ps(&x[8], mm_x[2]);
// Computes current term of CS rule sum.
a1 = _mm_set_ps(y0, function(x[1]), function(x[2]),
function(x[3]));
a2 = _mm_set_ps(function(x[4]), function(x[5]),
function(x[6]), function(x[7]));
a3 = _mm_set_ps(a1.m128_f32[1], a1.m128_f32[2],
a1.m128_f32[3], function(x[11]));
p = _mm_add_ps(a1,
_mm_add_ps(a3,
_mm_mul_ps(a2, mm_4)
)
);
mm_int = _mm_add_ps(mm_int, p);
y0 = a3.m128_f32[3];
}
mm_int = _mm_mul_ps(mm_int, mm_h_3);
mm_int = _mm_hadd_ps(mm_int, mm_int);
mm_int = _mm_hadd_ps(mm_int, mm_int);
return mm_int.m128_f32[0];
}
And now the results:
Starting performance tests...
Precision: 100000
Iterations for each test: 10000
int(gauss) with rectangles approximation
Result = 0.997647
Average execution time = 651.039 ms
int(gauss) with rectangles approximation + SSE
Result = 0.99747
Average execution time = 610.109 ms
int(gauss) with Cavalieri-Simpson rule approximation
Result = 0.997465
Average execution time = 658.161 ms
int(gauss) with Cavalieri-Simpson rule approximation + SSE
Result = 0.99747
Average execution time = 710.516 ms
The Riemann sum rule receives a 6% speedup thanks to the introduction of SSE SIMD instructions, but the Simpson rule surprisingly gets slower. The reason is easy to understand: just look at the for loop body. There are a lot of operations that only prepare the data layout on which the SIMD instructions will operate. If we take a look at the generated assembly, we can see that the non-SIMD function’s for loop is translated into 22 instructions, while the SIMD-based function’s for loop takes 137 instructions. Yes, the number of iterations is halved, but this does not compensate for the overhead.
Generalization is bad
With the introduction of SSE instructions we obtained a speedup of 6%, which is not very satisfying, since we are iterating half as many times. This makes me think that the real bottleneck is somewhere else. It is, and it is also easy to spot. Have a look at the SSE-flavoured Riemann sum function: so many calls to the same function, which is, by the way, the following:
const real_t sqrt_2pi = sqrt(2.0f * 3.14f);
real_t gauss(real_t x)
{
// Fixed mean and stdev, for test purposes
real_t mean = 0.0f;
real_t stdev = 1.0f;
real_t e = (x - mean) / stdev;
return exp(-0.5f * e * e) / (stdev * sqrt_2pi);
}
We call this function once for each interval we divided \([a,b]\) into, which makes 100000 calls for a single integral computation. This reminds me of Mike Acton’s talks about data-oriented design, which made me realize I always underestimate the impact of a function call on the program. Function calls push and pop data on the stack, perform jumps and are likely to repeat the same operations multiple times when it would be enough to perform them once: think about the division at the end of the gauss function. We wanted a generalized function that could compute the integral of any function, but this is not only damaging for performance, it is also a decision of doubtful value. To quote Mike Acton: what problem are we trying to solve? We originally wanted to compute the integral of the Gaussian function. Do we need to generalize this problem? Well, not at the moment. So let’s write an ad-hoc version of the Riemann sum approximation function that avoids calling the gauss function entirely.
real_t gaussian_prob_sse(const float mean, const float stdev,
const real_t a, const real_t b, precision_t precision)
{
precision = ROUND_4(precision);
const real_t h = (b - a) / (real_t)precision;
const real_t h_2 = h * 0.5f;
__m128 mm_int = _mm_set_ps1(0.0f);
__m128 mm;
// Computes once the common coefficient of the Gauss
// function.
const real_t coeff = 1.0f / (stdev * sqrt(2.0f * 3.141f));
const real_t var = stdev * stdev;
const __m128 mm_neg_half = _mm_set_ps1(-0.5f);
const __m128 mm_var = _mm_set_ps1(var);
const __m128 mm_mean = _mm_set_ps1(mean);
__m128 mm_error;
const __m128 mm_h_inc = _mm_mul_ps(_mm_set_ps1(h),
m128_INC);
__m128 mm_s;
for (real_t s = a + h_2; s < b; s += h * 4)
{
// Computing errors
mm_error = _mm_add_ps(_mm_set_ps1(s), mm_h_inc);
mm_error = _mm_sub_ps(mm_error, mm_mean);
// Computing exponentials
mm = fmath::exp_ps(
_mm_div_ps(
_mm_mul_ps(mm_neg_half,
_mm_mul_ps(mm_error, mm_error)
),
mm_var)
);
// Computing area of rects
mm_int = _mm_add_ps(mm_int, mm);
}
// Retrieves result.
mm_int = _mm_hadd_ps(mm_int, mm_int);
mm_int = _mm_hadd_ps(mm_int, mm_int);
return coeff * mm_int.m128_f32[0] * h;
}
Some notes:
- Knowing which function I am computing lets me compute the constant values I am going to use repeatedly only once, before entering the loop: mm_mean, mm_var, mm_neg_half and the most expensive one, coeff.
- fmath::exp_ps is a very good SSE-based implementation of the exp function, which made this optimization possible. You can find it on herumi’s GitHub.
And now, the results:
Starting performance tests...
Precision: 100000
Iterations for each test: 10000
int(gauss) with rectangles approximation + SSE
Result = 0.99747
Average execution time = 638.962 ms
int(gauss) with Cavalieri-Simpson rule approximation
Result = 0.997465
Average execution time = 651.559 ms
int(gauss) with ad-hoc optimization (SSE)
Result = 0.99739
Average execution time = 127.576 ms
Now, that’s the kind of optimization I was looking for! 127.576 ms against the 638.962 ms of the generalized approximation, which makes a 5x speedup. I am now going to do even better by writing an AVX2-based implementation of the function instead of the SSE one. The only difference is that we are going to use 256-bit registers instead of 128-bit ones. There is also the possibility of using 512-bit registers, but herumi’s code does not provide an mm512 implementation of the exp function, so I decided to stop at mm256 registers.
real_t gaussian_prob_avx2(const float mean, const float stdev,
const real_t a, const real_t b, precision_t precision)
{
precision = ROUND_8(precision);
const real_t h = (b - a) / (real_t)precision;
const real_t h_2 = h * 0.5f;
__m256 mm_int = _mm256_set1_ps(0.0f);
__m256 mm;
const real_t coeff = 1.0f / (stdev * sqrt(2.0f * 3.141f));
const real_t var = stdev * stdev;
const __m256 mm_h_inc = _mm256_mul_ps(_mm256_set1_ps(h),
m256_INC);
const __m256 mm_neg_half = _mm256_set1_ps(-0.5f);
const __m256 mm_var = _mm256_set1_ps(var);
const __m256 mm_mean = _mm256_set1_ps(mean);
__m256 mm_error;
for (real_t s = a + h_2; s < b; s += h * 8)
{
// Computing errors
mm_error = _mm256_sub_ps(_mm256_add_ps(
_mm256_set1_ps(s), mm_h_inc), mm_mean);
// Computing exponentials
mm = fmath::exp_ps256(
_mm256_div_ps(
_mm256_mul_ps(mm_neg_half,
_mm256_mul_ps(mm_error, mm_error)
),
mm_var)
);
// Computing area of rects
mm_int = _mm256_add_ps(mm_int, mm);
}
mm_int = _mm256_hadd_ps(mm_int, mm_int);
mm_int = _mm256_hadd_ps(mm_int, mm_int);
mm_int = _mm256_hadd_ps(mm_int, mm_int);
return coeff * mm_int.m256_f32[0] * h;
}
Which generates the following output:
Starting performance tests...
Precision: 100000
Iterations for each test: 10000
int(gauss) with rectangles approximation + SSE
Result = 0.99747
Average execution time = 647.258 ms
int(gauss) with Cavalieri-Simpson rule approximation
Result = 0.997465
Average execution time = 708.083 ms
int(gauss) with ad-hoc optimization (SSE)
Result = 0.99739
Average execution time = 134.093 ms
int(gauss) with ad-hoc optimization (AVX2 with m256 registers)
Result = 0.997377
Average execution time = 70.5615 ms
Great! Now that we have eliminated the bottleneck given by function calls, we can fully benefit from SIMD instructions. By using __m256 registers, we halved the execution time, obtaining a 9-10x speedup over the generalized SSE Riemann sum function.
But I want to generalize!
We did not need generalization for this problem, but we might need it in the future. Should we write an ad-hoc integral approximation method for each function? Well, it depends on the importance we give to performance. The speedup of ad-hoc integration is significant, so I would totally go with the ad-hoc function. But if we really wanted a generalized integration function (maybe something to export in a DLL), we can still do something. Let’s define a new type of function pointer, two new function prototypes and a new version of the gauss function.
typedef __m256(*function256_t)(__m256);
real_t definite_integral_rectangles_avx_256(function256_t function,
const real_t a, const real_t b, precision_t precision);
real_t definite_integral_cs_avx2_256(function256_t function,
const real_t a, const real_t b, precision_t precision);
__m256 gauss256(__m256 x)
{
float mean = 0.0f;
__m256 mm_mean = _mm256_set1_ps(mean);
float stdev = 1.0f;
__m256 mm_stdev = _mm256_set1_ps(stdev);
__m256 mm_e = _mm256_div_ps(_mm256_sub_ps(x,
mm_mean), mm_stdev);
__m256 mm_coeff = _mm256_set1_ps(1 / (stdev * sqrt_2pi));
return _mm256_mul_ps(
fmath::exp_ps256(
_mm256_mul_ps(_mm256_set1_ps(-0.5f),
_mm256_mul_ps(mm_e, mm_e)
)
), mm_coeff
);
}
If we cannot avoid calling a function multiple times, let’s at least call it less often. The new implementations of the Riemann sum and the Simpson rule accept pointers to functions that compute 8 real values at a time, so that the number of function calls is reduced by a factor of 8.
real_t definite_integral_rectangles_avx_256(function256_t function,
const real_t a, const real_t b, precision_t precision)
{
precision = ROUND_8(precision);
const real_t h = (b - a) / (real_t)precision;
const real_t h_2 = h * 0.5f;
__m256 mm_int = _mm256_set1_ps(0.0f);
const __m256 mm_h = _mm256_set1_ps(h);
const __m256 mm_h_inc = _mm256_mul_ps(mm_h, m256_INC);
__m256 heights;
for (real_t s = a + h_2; s < b; s = h * 8 + s)
{
heights = function(_mm256_add_ps(_mm256_set1_ps(s),
mm_h_inc));
mm_int = _mm256_add_ps(mm_int, _mm256_mul_ps(heights,
mm_h));
}
mm_int = _mm256_hadd_ps(mm_int, mm_int);
mm_int = _mm256_hadd_ps(mm_int, mm_int);
mm_int = _mm256_hadd_ps(mm_int, mm_int);
return mm_int.m256_f32[0];
}
real_t definite_integral_cs_avx2_256(function256_t function,
const real_t a, const real_t b, precision_t precision)
{
precision = ROUND_16(precision);
const float h = (b - a) / (real_t)precision;
const __m256 mm_h = _mm256_set1_ps(h);
const __m256 mm_h_3 = _mm256_set1_ps(h * 0.33333f);
const __m256 mm_h_inc2 = _mm256_mul_ps(mm_h, m256_INC2);
const __m256 mm_4 = _mm256_set1_ps(4.0f);
__m256 mm_x[3];
__m256 mm_y[3];
__m256 mm_int = _mm256_set1_ps(0.0f);
for (real_t s = a; s < b; s += h * 16)
{
mm_x[0] = _mm256_add_ps(_mm256_set1_ps(s), mm_h_inc2);
mm_x[1] = _mm256_add_ps(mm_x[0], mm_h);
mm_x[2] = _mm256_add_ps(mm_x[1], mm_h);
// Computes function value
mm_y[0] = function(mm_x[0]);
mm_y[1] = _mm256_mul_ps(function(mm_x[1]), mm_4);
mm_y[2] = function(mm_x[2]);
// Updates accumulator
mm_int = _mm256_add_ps(mm_int, _mm256_mul_ps(mm_h_3,
_mm256_add_ps(mm_y[0], _mm256_add_ps(mm_y[1],
mm_y[2])))
);
}
// Horizontal sum
mm_int = _mm256_hadd_ps(mm_int, mm_int);
mm_int = _mm256_hadd_ps(mm_int, mm_int);
mm_int = _mm256_hadd_ps(mm_int, mm_int);
return mm_int.m256_f32[0];
}
And, for the last time, some results:
Starting performance tests...
Precision: 100000
Iterations for each test: 10000
int(gauss) with rectangles approximation + SSE
Result = 0.99747
Average execution time = 628.781 ms
int(gauss) with rectangles approximation (AVX2 + 256 bit function vectorization)
Result = 0.99747
Average execution time = 81.2207 ms
int(gauss) with Cavalieri-Simpson rule approximation
Result = 0.997465
Average execution time = 638.87 ms
int(gauss) with Cavalieri-Simpson rule approximation (AVX2 + 256-bit vectorized function)
Result = 0.997471
Average execution time = 103.438 ms
int(gauss) with ad-hoc optimization (AVX2 with m256 registers)
Result = 0.997377
Average execution time = 68.0454 ms
Well, not bad. Of course, the ad-hoc function remains the fastest approach, but the generalized functions benefited a lot from the function vectorization and scored some good execution times.
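One more option I did not benchmark here, sketched below under the assumption of an optimizing compiler: passing the integrand as a template parameter instead of a function pointer. The compiler then sees the concrete callable at the call site and can inline it into the loop, keeping the code generic while removing the per-interval call overhead (the name definite_integral_rectangles_t is mine):

```cpp
#include <cmath>

typedef float real_t;
typedef unsigned long precision_t;

// Same midpoint Riemann sum as before, but the integrand is a
// template parameter: no indirect call per interval, since the
// compiler can inline the callable into the loop body.
template <typename F>
real_t definite_integral_rectangles_t(F function,
    const real_t a, const real_t b, precision_t nRects)
{
    const real_t step = (b - a) / (real_t)nRects;
    const real_t half_step = step * 0.5f;
    real_t i = 0.0f;
    for (real_t s = a + half_step; s < b; s += step)
        i += function(s);
    return i * step;
}
```

Called with a lambda, e.g. `definite_integral_rectangles_t([](real_t x) { return std::exp(-0.5f * x * x) / std::sqrt(2.0f * 3.14159265f); }, -3.0f, 3.0f, 100000)`, this keeps the constant Gaussian coefficients foldable by the optimizer without writing one ad-hoc integrator per function.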
Conclusion
So, this was a long tale of my first experiments with C++ CPU optimization. It was a lot of fun and opened my eyes to some mistakes I unconsciously make when I write code. I am very curious to know whether my solutions can still be improved, so, if you can solve the problem with better times, let me know.
2019-05-18 09:37
Did you like this post? Did you hate it? Wanna talk about it? Contact me at alessandro.tironi7 at gmail dot com.