How can I improve performance via a high-level approach when implementing long equations in C++

C++, Performance, Optimization, Floating Point, G++

C++ Problem Overview


I am developing some engineering simulations. This involves implementing some long equations, such as this one, which calculates the stress in a rubber-like material:

T = (
	mu * (
			pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
			* (
				pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
				- l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
			) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l1
			- pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l1 / 0.3e1
			- pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l1 / 0.3e1
		) / a
	+ K * (l1 * l2 * l3 - 0.1e1) * l2 * l3
) * N1 / l2 / l3

+ (
	mu * (
		- pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l2 / 0.3e1
		+ pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
		* (
			pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
			- l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
		) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l2
		- pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l2 / 0.3e1
	) / a
	+ K * (l1 * l2 * l3 - 0.1e1) * l1 * l3
) * N2 / l1 / l3

+ (
	mu * (
		- pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l3 / 0.3e1
		- pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l3 / 0.3e1
		+ pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
		* (
			pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
			- l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
		) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l3
	) / a
+ K * (l1 * l2 * l3 - 0.1e1) * l1 * l2
) * N3 / l1 / l2;

I use Maple to generate the C++ code to avoid mistakes (and save time with tedious algebra). As this code is executed thousands (if not millions) of times, the performance is a concern. Unfortunately the math only simplifies so far; the long equations are unavoidable.

What approach can I take to optimize this implementation? I'm looking for high-level strategies that I should be applying when implementing such equations, not necessarily specific optimizations for the example shown above.

I'm compiling using g++ with --enable-optimize=-O3.

Update:

I know there are a lot of repeated expressions. I am working on the assumption that the compiler will handle these; my tests so far suggest it does.

l1, l2, l3, mu, a, K are all positive real numbers (not zero).

I have replaced l1*l2*l3 with an equivalent variable: J. This did help improve performance.

Replacing pow(x, 0.1e1/0.3e1) with cbrt(x) was a good suggestion.

This will be run on CPUs. In the near future it would likely run better on GPUs, but for now that option is not available.

C++ Solutions


Solution 1 - C++

Edit summary

  • My original answer merely noted that the code contained a lot of replicated computations and that many of the powers involved factors of 1/3. For example, pow(x, 0.1e1/0.3e1) is the same as cbrt(x).
  • My second edit was just wrong, and my third extrapolated on this wrongness. This is what makes people afraid to change the oracle-like results from symbolic math programs that start with the letter 'M'. I've struck out those edits and pushed them to the bottom of the current revision of this answer. However, I did not delete them. I'm human. It's easy for us to make a mistake.
  • My fourth edit developed a very compact expression that correctly represents the convoluted expression in the question IF the parameters l1, l2, and l3 are positive real numbers and if a is a non-zero real number. (We have yet to hear from the OP regarding the specific nature of these coefficients. Given the nature of the problem, these are reasonable assumptions.)
  • This edit attempts to answer the generic problem of how to simplify these expressions.

First things first

> I use Maple to generate the C++ code to avoid mistakes.

Maple and Mathematica sometimes miss the obvious. Even more importantly, the users of Maple and Mathematica sometimes make mistakes. Substituting "oftentimes", or maybe even "almost always", for "sometimes" is probably closer to the mark.

You could have helped Maple simplify that expression by telling it about the parameters in question. In the example at hand, I suspect that l1, l2, and l3 are positive real numbers and that a is a non-zero real number. If that's the case, tell it that. Those symbolic math programs typically assume the quantities at hand are complex. Restricting the domain lets the program make assumptions that are not valid in the complex numbers.


How to simplify those big messes from symbolic math programs (this edit)

Symbolic math programs typically provide the ability to supply information about the various parameters. Use that ability, particularly if your problem involves division or exponentiation. In the example at hand, you could have helped Maple simplify that expression by telling it that l1, l2, and l3 are positive real numbers and that a is a non-zero real number. If that's the case, tell it that. Those symbolic math programs typically assume the quantities at hand are complex. Restricting the domain lets the program make assumptions such as a^x * b^x = (a*b)^x. This identity holds only if a and b are positive real numbers and x is real; it is not valid for general complex numbers.

Ultimately, those symbolic math programs follow algorithms. Help them along. Try playing with expanding, collecting, and simplifying before you generate code. In this case, you could have collected those terms involving a factor of mu and those involving a factor of K. Reducing an expression to its "simplest form" remains a bit of an art.

When you get an ugly mess of generated code, don't accept it as a truth that you must not touch. Try to simplify it yourself. Look at what the symbolic math program had before it generated code. Look at how I reduced your expression to something much simpler and much faster, and how Walter's answer took mine several steps further. There is no magic recipe. If there was a magical recipe, Maple would have applied it and given the answer that Walter gave.


About the specific question

You are doing a lot of addition and subtraction in that calculation. You can get in deep trouble if you have terms that nearly cancel one another. You are wasting a lot of CPU if you have one term that dominates over the others.

Next, you are wasting a lot of CPU by performing repeated calculations. Unless you have enabled -ffast-math, which lets the compiler break some of the rules of IEEE floating point, the compiler will not (in fact, must not) simplify that expression for you. It will instead do exactly what you told it to do. At a minimum, you should calculate l1 * l2 * l3 prior to computing that mess.

Finally, you are making a lot of calls to pow, which is extremely slow. Note that several of those calls are of the form (l1*l2*l3)^(1/3). Many of those calls to pow could be performed with a single call to std::cbrt:

l123 = l1 * l2 * l3;
l123_pow_1_3 = std::cbrt(l123);
l123_pow_4_3 = l123 * l123_pow_1_3;

With this,

  • X * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) becomes X * l123_pow_1_3.
  • X * pow(l1 * l2 * l3, -0.1e1 / 0.3e1) becomes X / l123_pow_1_3.
  • X * pow(l1 * l2 * l3, 0.4e1 / 0.3e1) becomes X * l123_pow_4_3.
  • X * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) becomes X / l123_pow_4_3.


Maple did miss the obvious.
For example, there's a much easier way to write

(pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)

Assuming that l1, l2, and l3 are real rather than complex numbers, and that the real cube root (rather than the principal complex root) is to be extracted, the above reduces to

2.0/(3.0 * pow(l1 * l2 * l3, 1.0/3.0))

or

2.0/(3.0 * l123_pow_1_3)
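
A quick numerical spot check of that reduction (not part of the original answer; the sample values below are arbitrary positive numbers):

#include <cmath>
#include <cstdio>

int main() {
    const double l1 = 1.3, l2 = 0.7, l3 = 2.1;   // arbitrary positive sample values
    const double l123 = l1 * l2 * l3;
    const double original = std::pow(l123, -1.0 / 3.0)
                          - l123 * std::pow(l123, -4.0 / 3.0) / 3.0;
    const double reduced  = 2.0 / (3.0 * std::cbrt(l123));
    std::printf("%.17g  %.17g\n", original, reduced);  // should agree to rounding error
}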

Using cbrt_l123 instead of l123_pow_1_3, the nasty expression in the question reduces to

l123 = l1 * l2 * l3; 
cbrt_l123 = cbrt(l123);
T = 
  mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                 + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                 + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
 +K*(l123-1.0)*(N1+N2+N3);

Always double check, but always simplify as well.


Here are some of my steps in arriving at the above:

// Step 0: Trim all whitespace.
T=(mu*(pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1+pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l2-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1+pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l3)/a+K*(l1*l2*l3-0.1e1)*l1*l2)*N3/l1/l2;

// Step 1:
//   l1*l2*l3 -> l123
//   0.1e1 -> 1.0
//   0.4e1 -> 4.0
//   0.3e1 -> 3
l123 = l1 * l2 * l3;
T=(mu*(pow(l1*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l1-pow(l2*pow(l123,-1.0/3),a)*a/l1/3-pow(l3*pow(l123,-1.0/3),a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l2/3+pow(l2*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l2-pow(l3*pow(l123,-1.0/3),a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l3/3-pow(l2*pow(l123,-1.0/3),a)*a/l3/3+pow(l3*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 2:
//   pow(l123,1.0/3) -> cbrt_l123
//   l123*pow(l123,-4.0/3) -> pow(l123,-1.0/3)
//   (pow(l123,-1.0/3)-pow(l123,-1.0/3)/3) -> 2.0/(3.0*cbrt_l123)
//   *pow(l123,-1.0/3) -> /cbrt_l123
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T=(mu*(pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1-pow(l2/cbrt_l123,a)*a/l1/3-pow(l3/cbrt_l123,a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l2/3+pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2-pow(l3/cbrt_l123,a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l3/3-pow(l2/cbrt_l123,a)*a/l3/3+pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 3:
//   Whitespace is nice.
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1
       -pow(l2/cbrt_l123,a)*a/l1/3
       -pow(l3/cbrt_l123,a)*a/l1/3)/a
   +K*(l123-1.0)*l2*l3)*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)*a/l2/3
       +pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2
       -pow(l3/cbrt_l123,a)*a/l2/3)/a
   +K*(l123-1.0)*l1*l3)*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)*a/l3/3
       -pow(l2/cbrt_l123,a)*a/l3/3
       +pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a
   +K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 4:
//   Eliminate the 'a' in (term1*a + term2*a + term3*a)/a
//   Expand (mu_term + K_term)*something to mu_term*something + K_term*something
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l1
       -pow(l2/cbrt_l123,a)/l1/3
       -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3
 +K*(l123-1.0)*l2*l3*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l2/3
       +pow(l2/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l2
       -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3
 +K*(l123-1.0)*l1*l3*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l3/3
       -pow(l2/cbrt_l123,a)/l3/3
       +pow(l3/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l3))*N3/l1/l2
 +K*(l123-1.0)*l1*l2*N3/l1/l2;

// Step 5:
//   Rearrange
//   Reduce l2*l3*N1/l2/l3 to N1 (and similar)
//   Reduce 2.0/(3.0*cbrt_l123)*cbrt_l123/l1 to 2.0/3.0/l1 (and similar)
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1
       -pow(l2/cbrt_l123,a)/l1/3
       -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l2/3
       +pow(l2/cbrt_l123,a)*2.0/3.0/l2
       -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l3/3
       -pow(l2/cbrt_l123,a)/l3/3
       +pow(l3/cbrt_l123,a)*2.0/3.0/l3))*N3/l1/l2
 +K*(l123-1.0)*N1
 +K*(l123-1.0)*N2
 +K*(l123-1.0)*N3;

// Step 6:
//   Factor out mu and K*(l123-1.0)
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu*(  ( pow(l1/cbrt_l123,a)*2.0/3.0/l1
         -pow(l2/cbrt_l123,a)/l1/3
         -pow(l3/cbrt_l123,a)/l1/3)*N1/l2/l3
      + (-pow(l1/cbrt_l123,a)/l2/3
         +pow(l2/cbrt_l123,a)*2.0/3.0/l2
         -pow(l3/cbrt_l123,a)/l2/3)*N2/l1/l3
      + (-pow(l1/cbrt_l123,a)/l3/3
         -pow(l2/cbrt_l123,a)/l3/3
         +pow(l3/cbrt_l123,a)*2.0/3.0/l3)*N3/l1/l2)
 +K*(l123-1.0)*(N1+N2+N3);

// Step 7:
//   Expand
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1*N1/l2/l3
      -pow(l2/cbrt_l123,a)/l1/3*N1/l2/l3
      -pow(l3/cbrt_l123,a)/l1/3*N1/l2/l3
      -pow(l1/cbrt_l123,a)/l2/3*N2/l1/l3
      +pow(l2/cbrt_l123,a)*2.0/3.0/l2*N2/l1/l3
      -pow(l3/cbrt_l123,a)/l2/3*N2/l1/l3
      -pow(l1/cbrt_l123,a)/l3/3*N3/l1/l2
      -pow(l2/cbrt_l123,a)/l3/3*N3/l1/l2
      +pow(l3/cbrt_l123,a)*2.0/3.0/l3*N3/l1/l2)
 +K*(l123-1.0)*(N1+N2+N3);

// Step 8:
//   Simplify.
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                 + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                 + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
 +K*(l123-1.0)*(N1+N2+N3);


Wrong answer, intentionally kept for humility

Note that this is stricken. It's wrong.

Update

Maple did miss the obvious. For example, there's a much easier way to write

> (pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)

Assuming that l1, l2, and l3 are real rather than complex numbers, and that the real cube root (rather than the principal complex root) is to be extracted, the above reduces to zero. This calculation of zero is repeated many times over.

Second update

If I've done the math right (there is no guarantee that I've done the math right), the nasty expression in the question reduces to

l123 = l1 * l2 * l3; 
cbrt_l123_inv = 1.0 / cbrt(l123);
nasty_expression =
    K * (l123 - 1.0) * (N1 + N2 + N3) 
    - (  pow(l1 * cbrt_l123_inv, a) * (N2 + N3) 
       + pow(l2 * cbrt_l123_inv, a) * (N1 + N3) 
       + pow(l3 * cbrt_l123_inv, a) * (N1 + N2)) * mu / (3.0*l123);

The above assumes that l1, l2, and l3 are positive real numbers.

Solution 2 - C++

First thing to note is that pow is really expensive, so you should get rid of this as much as possible. Scanning through the expression I see many repetitions of pow(l1 * l2 * l3, -0.1e1 / 0.3e1) and pow(l1 * l2 * l3, -0.4e1 / 0.3e1). So I would expect a big gain from pre-computing those:

const double c1 = pow(l1 * l2 * l3, -0.1e1 / 0.3e1);
const double c2 = boost::math::pow<4>(c1);

where I am using the boost pow function.

Furthermore, you have some more pow calls with exponent a. If a is an integer and known at compile time, you can also replace those with boost::math::pow<a>(...) to gain further performance. I would also suggest replacing terms like a / l1 / 0.3e1 with a / (l1 * 0.3e1), as multiplication is faster than division.
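
As a small illustration of that suggestion (a sketch only, assuming purely for the example that a is known at compile time to be 3; c1 is the precomputed value from above):

#include <boost/math/special_functions/pow.hpp>

// boost::math::pow<3>(x) expands to x*x*x at compile time instead of calling pow(x, 3.0)
double term(double l1, double c1) {
    return boost::math::pow<3>(l1 * c1);   // instead of std::pow(l1 * c1, 3.0)
}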

Finally, if you use g++ you can use the -ffast-math flag, which allows the optimizer to be more aggressive in transforming equations. Read about what this flag actually does first, though, as it has side effects.
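
For reference, both flags are simply passed on the g++ command line (the file name here is illustrative):

g++ -O3 -ffast-math simulation.cpp -o simulation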

Solution 3 - C++

Whoa, what a hell of an expression. Creating the expression with Maple was actually a suboptimal choice here. The result is simply unreadable.

  1. Choose descriptive variable names (not l1, l2, l3, but e.g. height, width, depth, if that is what they mean). Then it is easier for you to understand your own code.
  2. Calculate subterms that you use multiple times up front, and store the results in variables with descriptive names.
  3. You mention that the expression is evaluated very many times. I guess only a few parameters vary in the innermost loop. Calculate all invariant subterms before that loop. Repeat for the next loop out, and so on, until all invariants are outside the loops (see the sketch below).

Theoretically the compiler should be able to do all of that for you, but sometimes it can't - e.g. when the loop nesting spreads over multiple functions in different compilation units. In any case, doing it yourself will give you much more readable, understandable, and maintainable code.
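
A minimal sketch of point 3, with illustrative names and a made-up inner expression (not the OP's equation): everything that does not depend on the loop variable is computed once, before the loop.

#include <cmath>
#include <vector>

void fill_stress(const std::vector<double>& widths, double height, double depth,
                 double mu, double K, double a, std::vector<double>& out) {
    // loop invariants: depend only on height, depth, mu and a
    const double height_depth = height * depth;
    const double mu_over_a    = mu / a;
    // out is assumed to be pre-sized to widths.size()
    for (std::size_t i = 0; i < widths.size(); ++i) {
        const double J = widths[i] * height_depth;   // varies per iteration
        out[i] = mu_over_a * std::pow(widths[i] / std::cbrt(J), a) + K * (J - 1.0);
    }
}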

Solution 4 - C++

David Hammen's answer is good, but still far from optimal. Let's continue with his last expression (at the time of writing this)

auto l123 = l1 * l2 * l3;
auto cbrt_l123 = cbrt(l123);
T = mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                   + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                   + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
  + K*(l123-1.0)*(N1+N2+N3);

which can be optimised further. In particular, we can avoid the call to cbrt() and one of the calls to pow() by exploiting some mathematical identities. Let's do this again, step by step.

// step 1 eliminate cbrt() by taking the exponent into pow()
auto l123 = l1 * l2 * l3;
auto athird = 0.33333333333333333 * a; // avoid division
T = mu/(3.0*l123)*(  (N1+N1-N2-N3)*pow(l1*l1/(l2*l3),athird)
                   + (N2+N2-N3-N1)*pow(l2*l2/(l1*l3),athird)
                   + (N3+N3-N1-N2)*pow(l3*l3/(l1*l2),athird))
  + K*(l123-1.0)*(N1+N2+N3);
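
For reference, the identity used in this step (valid because l1, l2, and l3 are positive reals) is

l1 / cbrt(l1*l2*l3) == cbrt(l1*l1*l1 / (l1*l2*l3)) == cbrt(l1*l1 / (l2*l3))

so pow(l1/cbrt_l123, a) == pow(l1*l1/(l2*l3), a/3.0) == pow(l1*l1/(l2*l3), athird), and similarly for the l2 and l3 terms.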

Note that I have also optimised 2.0*N1 to N1+N1 etc. Next, we can make do with only two calls to pow().

// step 2  eliminate one call to pow
auto l123 = l1 * l2 * l3;
auto athird = 0.33333333333333333 * a;
auto pow_l1l2_athird = pow(l1/l2,athird);
auto pow_l1l3_athird = pow(l1/l3,athird);
auto pow_l2l3_athird = pow_l1l3_athird/pow_l1l2_athird;
T = mu/(3.0*l123)*(  (N1+N1-N2-N3)* pow_l1l2_athird*pow_l1l3_athird
                   + (N2+N2-N3-N1)* pow_l2l3_athird/pow_l1l2_athird
                   + (N3+N3-N1-N2)/(pow_l1l3_athird*pow_l2l3_athird))
  + K*(l123-1.0)*(N1+N2+N3);

Since the calls to pow() are by far the most costly operations here, it is worth reducing them as far as possible (the next most costly operation was the call to cbrt(), which we eliminated).

If by any chance a is an integer, the calls to pow could be optimised to calls to cbrt (plus integer powers), or if athird is a half-integer, we can use sqrt (plus integer powers). Furthermore, if by any chance l1==l2 or l1==l3 or l2==l3, one or both calls to pow can be eliminated. So it's worth considering these as special cases if such chances realistically exist.
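
A tiny sketch of the integer-a special case (assuming, purely for illustration, that a == 2, so athird == 2.0/3.0):

#include <cmath>

// pow(x, 2.0/3.0) computed without a call to pow(); valid for x > 0
double pow_two_thirds(double x) {
    const double c = std::cbrt(x);
    return c * c;
}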

Solution 5 - C++

  1. How many is "many many"?

  2. How long does it take?

  3. Do ALL parameters change between recalculation of this formula? Or can you cache some pre-calculated values?

  4. I've attempted a manual simplification of that formula; I'd like to know if it saves anything.

    C1 = -0.1e1 / 0.3e1;
    C2 =  0.1e1 / 0.3e1;
    C3 = -0.4e1 / 0.3e1;
    
    X0 = l1 * l2 * l3;
    X1 = pow(X0, C1);
    X2 = pow(X0, C2);
    X3 = pow(X0, C3);
    X4 = pow(l1 * X1, a);
    X5 = pow(l2 * X1, a);
    X6 = pow(l3 * X1, a);
    X7 = a / 0.3e1;
    X8 = X3 / 0.3e1;
    X9 = mu / a;
    XA = X0 - 0.1e1;
    XB = K * XA;
    XC = X1 - X0 * X8;
    XD = a * XC * X2;
    
    XE = X4 * X7;
    XF = X5 * X7;
    XG = X6 * X7;
    
    T = (X9 * ( X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3 
      + (X9 * (-XE + X5 * XD - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3 
      + (X9 * (-XE - XF + X6 * XD) / l3 + XB * l1 * l2) * N3 / l1 / l2;
    

[ADDED] I've worked some more on the last three-line formula and got it down to this beauty:

T = X9 / X0 * (
      (X4 * XD - XF - XG) * N1 + 
      (X5 * XD - XE - XG) * N2 + 
      (X6 * XD - XE - XF) * N3)
  + XB * (N1 + N2 + N3)

Let me show my work, step by step:

T = (X9 * (X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3 
  + (X9 * (X5 * XD - XE - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3 
  + (X9 * (X6 * XD - XE - XF) / l3 + XB * l1 * l2) * N3 / l1 / l2;


T = (X9 * (X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / (l2 * l3) 
  + (X9 * (X5 * XD - XE - XG) / l2 + XB * l1 * l3) * N2 / (l1 * l3) 
  + (X9 * (X6 * XD - XE - XF) / l3 + XB * l1 * l2) * N3 / (l1 * l2);

T = (X9 * (X4 * XD - XF - XG) + XB * l1 * l2 * l3) * N1 / (l1 * l2 * l3) 
  + (X9 * (X5 * XD - XE - XG) + XB * l1 * l2 * l3) * N2 / (l1 * l2 * l3) 
  + (X9 * (X6 * XD - XE - XF) + XB * l1 * l2 * l3) * N3 / (l1 * l2 * l3);

T = (X9 * (X4 * XD - XF - XG) + XB * X0) * N1 / X0 
  + (X9 * (X5 * XD - XE - XG) + XB * X0) * N2 / X0 
  + (X9 * (X6 * XD - XE - XF) + XB * X0) * N3 / X0;

T = X9 * (X4 * XD - XF - XG) * N1 / X0 + XB * N1 
  + X9 * (X5 * XD - XE - XG) * N2 / X0 + XB * N2
  + X9 * (X6 * XD - XE - XF) * N3 / X0 + XB * N3;


T = X9 * (X4 * XD - XF - XG) * N1 / X0 
  + X9 * (X5 * XD - XE - XG) * N2 / X0
  + X9 * (X6 * XD - XE - XF) * N3 / X0
  + XB * (N1 + N2 + N3)

Solution 6 - C++

This may be a little terse, but I've actually found good speedups for polynomials (interpolation of energy functions) by using Horner form, which basically rewrites ax^3 + bx^2 + cx + d as d + x*(c + x*(b + x*a)). This avoids a lot of repeated calls to pow() and stops you from doing silly things like separately calling pow(x,6) and pow(x,7) instead of just doing x*pow(x,6).
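
A minimal sketch of Horner form for a cubic (names and coefficients are illustrative):

// evaluates a*x^3 + b*x^2 + c*x + d with three multiplies and three adds, no pow()
double horner_cubic(double x, double a, double b, double c, double d) {
    return d + x * (c + x * (b + x * a));
}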

This is not directly applicable to your current problem, but if you have high order polynomials with integer powers it can help. You might have to watch out for numerical stability and overflow issues since the order of operations is important for that (although in general I actually think Horner Form helps for this, since x^20 and x are usually many orders of magnitude apart).

Also, as a practical tip, if you haven't done so already, try to simplify the expression in Maple first. You can probably get it to do most of the common subexpression elimination for you. I don't know how much it affects the code generator in that program in particular, but I know that in Mathematica doing a FullSimplify before generating the code can result in a huge difference.

Solution 7 - C++

It looks like you have a lot of repeated operations going on.

pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
pow(l1 * l2 * l3, -0.4e1 / 0.3e1)

You could pre-calculate those so you are not repeatedly calling the pow function, which can be expensive.

You could also pre-calculate

l1 * l2 * l3

as you use that term repeatedly.

Solution 8 - C++

If you have an Nvidia CUDA graphics card, you could consider offloading the calculations to the graphics card, which is itself more suitable for computationally complicated calculations.

https://developer.nvidia.com/how-to-cuda-c-cpp

If not, you may want to consider using multiple threads for the calculations.
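
A minimal sketch of the multi-threading suggestion using OpenMP (compile with -fopenmp; compute_T() is a hypothetical wrapper around the long expression, not something from the question):

#include <cstddef>
#include <vector>

// hypothetical wrapper around the long stress expression; the real body would go here
double compute_T(double l1, double l2, double l3);

void evaluate_all(const std::vector<double>& l1, const std::vector<double>& l2,
                  const std::vector<double>& l3, std::vector<double>& T) {
    // each evaluation is independent, so the loop parallelises trivially
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(T.size()); ++i)
        T[i] = compute_T(l1[i], l2[i], l3[i]);
}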

Solution 9 - C++

By any chance, could you supply the calculation symbolically? If there are vector operations, you might really want to investigate using BLAS or LAPACK, which in some cases can run operations in parallel.

It is conceivable (at the risk of being off-topic?) that you might be able to use Python with NumPy and/or SciPy. To the extent that this is possible, your calculations might be more readable.

Solution 10 - C++

As you explicitly asked about high-level optimizations, it might be worth trying different C++ compilers. Nowadays, compilers are very complex optimization beasts, and CPU vendors may implement very powerful, target-specific optimizations. But please note that some of them are not free (though there might be a free academic program).

  • GNU compiler collection is free, flexible, and available on many architectures
  • Intel compilers are very fast, very expensive, and may also produce good results for AMD architectures (I believe there is an academic program)
  • Clang compilers are fast, free, and might produce similar results to GCC (some people say they are faster, better, but this might differ for each application case, I suggest to make your own experiences)
  • PGI (Portland Group) compilers are, like the Intel compilers, not free.
  • PathScale compilers might produce good results on AMD architectures

I've seen code snippets differ in execution speed by a factor of 2, just by changing the compiler (with full optimizations, of course). But do check that the output stays identical: too aggressive optimization might lead to different output, which is something you definitely want to avoid.

Good Luck!

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        Original Author    Original Content on Stackoverflow
Question            user602095         View Question on Stackoverflow
Solution 1 - C++    David Hammen       View Answer on Stackoverflow
Solution 2 - C++    mariomulansky      View Answer on Stackoverflow
Solution 3 - C++    cdonat             View Answer on Stackoverflow
Solution 4 - C++    Walter             View Answer on Stackoverflow
Solution 5 - C++    Vlad Feinstein     View Answer on Stackoverflow
Solution 6 - C++    neocpp             View Answer on Stackoverflow
Solution 7 - C++    NathanOliver       View Answer on Stackoverflow
Solution 8 - C++    user3791372        View Answer on Stackoverflow
Solution 9 - C++    Fred Mitchell      View Answer on Stackoverflow
Solution 10 - C++   math               View Answer on Stackoverflow