What is the fastest way to compute sin and cos together?

C# Problem Overview

I would like to compute both the sine and co-sine of a value together (for example to create a rotation matrix). Of course I could compute them separately one after another like a = cos(x); b = sin(x);, but I wonder if there is a faster way when needing both values.

Edit: To summarize the answers so far:

Vlad said, that there is the asm command FSINCOS computing both of them (in almost the same time as a call to FSIN alone)
Like Chi noticed, this optimization is sometimes already done by the compiler (when using optimization flags).
caf pointed out, that functions sincos and sincosf are probably available and can be called directly by just including math.h
tanascius approach of using a look-up table is discussed controversial. (However on my computer and in a benchmark scenario it runs 3x faster than sincos with almost the same accuracy for 32-bit floating points.)
Joel Goodwin linked to an interesting approach of an extremly fast approximation technique with quite good accuray (for me, this is even faster then the table look-up)

C# Solutions

Solution 1 - C#

Modern Intel/AMD processors have instruction FSINCOS for calculating sine and cosine functions simultaneously. If you need strong optimization, perhaps you should use it.

Here is a small example: http://home.broadpark.no/~alein/fsincos.html

Here is another example (for MSVC): http://www.codeguru.com/forum/showthread.php?t=328669

Here is yet another example (with gcc): http://www.allegro.cc/forums/thread/588470

Hope one of them helps. (I didn't use this instruction myself, sorry.)

As they are supported on processor level, I expect them to be way much faster than table lookups.

Edit:
Wikipedia suggests that FSINCOS was added at 387 processors, so you can hardly find a processor which doesn't support it.

Edit:
Intel's documentation states that FSINCOS is just about 5 times slower than FDIV (i.e., floating point division).

Edit:
Please note that not all modern compilers optimize calculation of sine and cosine into a call to FSINCOS. In particular, my VS 2008 didn't do it that way.

Edit:
The first example link is dead, but there is still a version at the Wayback Machine.

Solution 2 - C#

Modern x86 processors have a fsincos instruction which will do exactly what you're asking - calculate sin and cos at the same time. A good optimizing compiler should detect code which calculates sin and cos for the same value and use the fsincos command to execute this.

It took some twiddling of compiler flags for this to work, but:

$ gcc --version
i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5488)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ cat main.c
#include <math.h> 

struct Sin_cos {double sin; double cos;};

struct Sin_cos fsincos(double val) {
  struct Sin_cos r;
  r.sin = sin(val);
  r.cos = cos(val);
  return r;
}

$ gcc -c -S -O3 -ffast-math -mfpmath=387 main.c -o main.s

$ cat main.s
	.text
	.align 4,0x90
.globl _fsincos
_fsincos:
	pushl	%ebp
	movl	%esp, %ebp
	fldl	12(%ebp)
	fsincos
	movl	8(%ebp), %eax
	fstpl	8(%eax)
	fstpl	(%eax)
	leave
	ret	$4
	.subsections_via_symbols

Tada, it uses the fsincos instruction!

Solution 3 - C#

When you need performance, you could use a precalculated sin/cos table (one table will do, stored as a Dictionary). Well, it depends on the accuracy you need (maybe the table would be to big), but it should be really fast.

Solution 4 - C#

Technically, you’d achieve this by using complex numbers and Euler’s Formula. Thus, something like (C++)

complex<double> res = exp(complex<double>(0, x));
// or equivalent
complex<double> res = polar<double>(1, x);
double sin_x = res.imag();
double cos_x = res.real();

should give you sine and cosine in one step. How this is done internally is a question of the compiler and library being used. It could (and might) well take longer to do it this way (just because Euler’s Formula is mostly used to compute the complex exp using sin and cos – and not the other way round) but there might be some theoretical optimisation possible.

Edit

The headers in <complex> for GNU C++ 4.2 are using explicit calculations of sin and cos inside polar, so it doesn’t look too good for optimisations there unless the compiler does some magic (see the -ffast-math and -mfpmath switches as written in Chi’s answer).

Solution 5 - C#

You could compute either and then use the identity:

cos(x)² = 1 - sin(x)²

but as @tanascius says, a precomputed table is the way to go.

Solution 6 - C#

If you use the GNU C library, then you can do:

#define _GNU_SOURCE
#include <math.h>

and you will get declarations of the sincos(), sincosf() and sincosl() functions that calculate both values together - presumably in the fastest way for your target architecture.

Solution 7 - C#

There is very interesting stuff on this forum page, which is focused on finding good approximations that are fast: http://www.devmaster.net/forums/showthread.php?t=5784

Disclaimer: Not used any of this stuff myself.

Update 22 Feb 2018: Wayback Machine is the only way to visit the original page now: https://web.archive.org/web/20130927121234/http://devmaster.net/posts/9648/fast-and-accurate-sine-cosine

Solution 8 - C#

Many C math libraries, as caf indicates, already have sincos(). The notable exception is MSVC.

Sun has had sincos() since at least 1987 (twenty-three years; I have a hard-copy man page)
HPUX 11 had it in 1997 (but isn't in HPUX 10.20)
Added to glibc in version 2.1 (Feb 1999)
Became a built-in in gcc 3.4 (2004), __builtin_sincos().

And regarding look-up, Eric S. Raymond in the Art of Unix Programming (2004) (Chapter 12) says explicitly this a Bad Idea (at the present moment in time):

> "Another example is precomputing small tables--for example, a table of > sin(x) by degree for optimizing rotations in a 3D graphics engine will > take 365 × 4 bytes on a modern machine. Before processors got enough > faster than memory to demand caching, this was an obvious speed > optimization. Nowadays it may be faster to recompute each time rather > than pay for the percentage of additional cache misses caused by the > table. > > "But in the future, this might turn around again as caches grow larger. > More generally, many optimizations are temporary and can easily turn > into pessimizations as cost ratios change. The only way to know is to > measure and see." (from the Art of Unix Programming)

But, judging from the discussion above, not everyone agrees.

Solution 9 - C#

I don't believe that lookup tables are necessarily a good idea for this problem. Unless your accuracy requirements are very low the table needs to be very large. And modern CPUs can do a lot of computation while a value is fetched from main memory. This is not one of those questions which can be properly answered by argument (not even mine), test and measure and consider the data.

But I'd look to the fast implementations of SinCos that you find in libraries such as AMD's ACML and Intel's MKL.

Solution 10 - C#

If you are willing to use a commercial product, and are calculating a number of sin/cos calculations at the same time (so you can use vectored functions), you should check out Intel's Math Kernel Library.

It has a sincos function

According to that documentation, it averages 13.08 clocks/element on core 2 duo in high accuracy mode, which i think will be even faster than fsincos.

Solution 11 - C#

This article shows how to construct a parabolic algorithm that generates both the sine and the cosine:

DSP Trick: Simultaneous Parabolic Approximation of Sin and Cos

http://www.dspguru.com/dsp/tricks/parabolic-approximation-of-sin-and-cos

Solution 12 - C#

When performance is critical for this kind of thing it is not unusual to introduce a lookup table.

Solution 13 - C#

For a creative approach, how about expanding the Taylor series? Since they have similar terms, you could do something like the following pseudo:

numerator = x
denominator = 1
sine = x
cosine = 1
op = -1
fact = 1

while (not enough precision) {
    fact++
    denominator *= fact
    numerator *= x

    cosine += op * numerator / denominator

    fact++
    denominator *= fact
    numerator *= x

    sine += op * numerator / denominator

    op *= -1
}

This means you do something like this: starting at x and 1 for sin and cosine, follow the pattern - subtract x^2 / 2! from cosine, subtract x^3 / 3! from sine, add x^4 / 4! to cosine, add x^5 / 5! to sine...

I have no idea whether this would be performant. If you need less precision than the built in sin() and cos() give you, it may be an option.

Solution 14 - C#

There is a nice solution in the CEPHES library which can be pretty fast and you can add/remove accuracy quite flexibly for a bit more/less CPU time.

Remember that cos(x) and sin(x) are the real and imaginary parts of exp(ix). So we want to calculate exp(ix) to get both. We precalculate exp(iy) for some discrete values of y between 0 and 2pi. We shift x to the interval [0, 2pi). Then we select the y that is closest to x and write
exp(ix)=exp(iy+(ix-iy))=exp(iy)exp(i(x-y)).

We get exp(iy) from the lookup table. And since |x-y| is small (at most half the distance between the y-values), the Taylor series will converge nicely in just a few terms, so we use that for exp(i(x-y)). And then we just need a complex multiplication to get exp(ix).

Another nice property of this is that you can vectorize it using SSE.

Solution 15 - C#

You may want to have a look at http://gruntthepeon.free.fr/ssemath/, which offers an SSE vectorized implementation inspired from CEPHES library. It has good accuracy (maximum deviation from sin/cos on the order of 5e-8) and speed (slightly outperforms fsincos on a single call basis, and a clear winner over multiple values).

Solution 16 - C#

I have posted a solution involving inline ARM assembly capable of computing both the sine and cosine of two angles at a time here: [Fast sine/cosine for ARMv7+NEON][1]

[1]: http://blog.julien.cayzac.name/2009/12/fast-sinecosine-for-armv7neon.html "Fast sine/cosine for ARMv7+NEON"

Solution 17 - C#

An accurate yet fast approximation of sin and cos function simultaneously, in javascript, can be found here: http://danisraelmalta.github.io/Fmath/ (easily imported to c/c++)

Solution 18 - C#

Have you thought of declaring lookup tables for the two functions? You'd still have to "calculate" sin(x) and cos(x), but it'd be decidedly faster, if you don't need a high degree of accuracy.

Solution 19 - C#

The MSVC compiler may use the (internal) SSE2 functions

 ___libm_sse2_sincos_ (for x86)
 __libm_sse2_sincos_  (for x64)

in optimized builds if appropriate compiler flags are specified (at minimum /O2 /arch:SSE2 /fp:fast). The names of these functions seem to imply that they do not compute separate sin and cos, but both "in one step".

For example:

void sincos(double const x, double & s, double & c)
{
  s = std::sin(x);
  c = std::cos(x);
}

Assembly (for x86) with /fp:fast:

movsd   xmm0, QWORD PTR _x$[esp-4]
call    ___libm_sse2_sincos_
mov     eax, DWORD PTR _s$[esp-4]
movsd   QWORD PTR [eax], xmm0
mov     eax, DWORD PTR _c$[esp-4]
shufpd  xmm0, xmm0, 1
movsd   QWORD PTR [eax], xmm0
ret     0

Assembly (for x86) without /fp:fast but with /fp:precise instead (which is the default) calls separate sin and cos:

movsd   xmm0, QWORD PTR _x$[esp-4]
call    __libm_sse2_sin_precise
mov     eax, DWORD PTR _s$[esp-4]
movsd   QWORD PTR [eax], xmm0
movsd   xmm0, QWORD PTR _x$[esp-4]
call    __libm_sse2_cos_precise
mov     eax, DWORD PTR _c$[esp-4]
movsd   QWORD PTR [eax], xmm0
ret     0

So /fp:fast is mandatory for the sincos optimization.

But please note that

___libm_sse2_sincos_

is maybe not as precise as

__libm_sse2_sin_precise
__libm_sse2_cos_precise

due to the missing "precise" at the end of its name.

On my "slightly" older system (Intel Core 2 Duo E6750) with the latest MSVC 2019 compiler and appropriate optimizations, my benchmark shows that the sincos call is about 2.4 times faster than separate sin and cos calls.

Solution 20 - C#

The fastest way is the approximation of functions with single precision by Chebyshev polynomials. It is better to set the argument in degrees, so less resources are spent on reducing the argument. The following code is proposed for Visual Studio. This is an example in C++, but the assembly code can be adapted to other environments.

void _declspec(naked) _vectorcall SinCosD(float x, float &s, float &c)
{
  _declspec(align(16)) static const float ct[8] = // Таблица констант
  {                            
    -1/180.0f,                 // Множитель для приведения x
    -0.0f,                     // 80000000h
    1.74532924E-2f,            // b0/90 = c0
    90.0f,                     // Константа для перехода от cos к sin
    1.34955580E-11f,           // b2/90^5 = c2
    3.91499677E-22f,           // b4/90^9 = c4
    -8.86095677E-7f,           // b1/90^3 = c1
    -9.77249307E-17f           // b3/90^7 = c3
  };
  _asm
  {                            
    mov eax,offset ct          // В eax - адрес таблицы констант
    vmovaps xmm1,[eax]         // xmm1 = 90 # c0 : 80000000h # -1/180
    vmovddup xmm4,[eax+16]     // xmm4 = c4 # c2 : c4 # c2
    vmulss xmm1,xmm1,xmm0      // xmm1 = 90 # c0 : 80000000h # -x/180
    vmovddup xmm5,[eax+24]     // xmm5 = c3 # c1 : c3 # c1
    vcvtss2si eax,xmm1         // eax = -k, где k - округлённое до целых значение x/180
    vshufps xmm2,xmm1,xmm1,93  // xmm2 = 90 # 80000000h
    imul eax,180               // eax = -180*k; of=1, если переполнение
    jno sc_cont                // В случае слишком большого |x| считать, как при x=0
    sub eax,eax                // Для этого обнулить eax
    vxorps xmm0,xmm0,xmm0      // и обнулить xmm0
      sc_cont:                 // Продолжаем для корректного значения x
    vcvtsi2ss xmm1,xmm1,eax    // xmm1 = -180*k в позиции 0
    vaddss xmm1,xmm1,xmm0      // xmm1 = x-k*180 = 90*t - число в диапазоне [-90; 90]
    shl eax,29                 // При нечётном k установить знаковый бит eax
    vmovd xmm0,eax             // В xmm0 - знаковая маска результата
    vorps xmm2,xmm2,xmm1       // xmm2 = -90 # -|90*t|
    vmovlhps xmm0,xmm0,xmm0    // Знаковую маску скопировать в старшую половину xmm0
    vhsubps xmm2,xmm2,xmm1     // xmm2 = 90*t : 90-|90*t| - приведённые аргументы
    vxorps xmm0,xmm0,xmm2      // В xmm0 - приведённые аргументы с учётом знака
    vmovsldup xmm2,xmm2        // xmm2 = 90*t # 90*t : 90-|90*t| # 90-|90*t|
    vmulps xmm2,xmm2,xmm2      // xmm2 = p # p : q # q - аргументы многочлена
    vmovhlps xmm1,xmm1,xmm1    // xmm1 = c0 : с0 (свободный член)
    vfmadd231ps xmm5,xmm4,xmm2 // xmm5 = c3+c4*p # c1+c2*p : c3+c4*q # c1+c2*q
    vmulps xmm3,xmm2,xmm2      // xmm3 = p^2 : q^2
    vmovshdup xmm4,xmm5        // xmm4 = c3+c4*p : c3+c4*q
    vfmadd231ps xmm5,xmm4,xmm3 // xmm5 = c1+c2*p+c3*p^2+c4*p^3 : c1+c2*q+с3*q^2+с4*q^3
    vfmadd231ps xmm1,xmm5,xmm2 // xmm1 = сумма для синуса : сумма для косинуса
    vmulps xmm0,xmm0,xmm1      // xmm0 = sin x : cos x - готовый результат (-1)^k*t*f(t)
    vmovss [edx],xmm0          // Сохранить косинус в переменной c
    vextractps [ecx],xmm0,2    // Сохранить синус в переменной s
    ret                        // Вернуться
  }
}

Content Type	Original Author	Original Content on Stackoverflow
Question	Danvil	View Question on Stackoverflow
Solution 1 - C#	Vlad	View Answer on Stackoverflow
Solution 2 - C#	Chi	View Answer on Stackoverflow
Solution 3 - C#	tanascius	View Answer on Stackoverflow
Solution 4 - C#	Debilski	View Answer on Stackoverflow
Solution 5 - C#	Mitch Wheat	View Answer on Stackoverflow
Solution 6 - C#	caf	View Answer on Stackoverflow
Solution 7 - C#	Joel Goodwin	View Answer on Stackoverflow
Solution 8 - C#	Joseph Quinsey	View Answer on Stackoverflow
Solution 9 - C#	High Performance Mark	View Answer on Stackoverflow
Solution 10 - C#	Chi	View Answer on Stackoverflow
Solution 11 - C#	Probes	View Answer on Stackoverflow
Solution 12 - C#	Tom Cabanski	View Answer on Stackoverflow
Solution 13 - C#	Tesserex	View Answer on Stackoverflow
Solution 14 - C#	Jsl	View Answer on Stackoverflow
Solution 15 - C#	SleuthEye	View Answer on Stackoverflow
Solution 16 - C#	jcayzac	View Answer on Stackoverflow
Solution 17 - C#	user2781980	View Answer on Stackoverflow
Solution 18 - C#	Frank Shearar	View Answer on Stackoverflow
Solution 19 - C#	x y	View Answer on Stackoverflow
Solution 20 - C#	aenigma	View Answer on Stackoverflow