Save time with parallel FOR loop

C# | .NET | Parallel Extensions

C# Problem Overview


I have a question concerning parallel for loops. I have the following code:

	public static void MultiplicateArray(double[] array, double factor)
	{
		for (int i = 0; i < array.Length; i++)
		{
			array[i] = array[i] * factor;
		}
	}

	public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
	{
		for (int i = 0; i < arrayToChange.Length; i++)
		{
			arrayToChange[i] = arrayToChange[i] * multiplication[i];
		}
	}

	public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
	{
		for (int i = 0; i < arrayToChange.Length; i++)
		{
			arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
		}
	}

Now I try to add parallel versions of these functions:

	public static void MultiplicateArray(double[] array, double factor)
	{
		Parallel.For(0, array.Length, i =>
			{
				array[i] = array[i] * factor;
			});
	}

	public static void MultiplicateArray(double[] arrayToChange, double[] multiplication)
	{
		Parallel.For(0, arrayToChange.Length, i =>
		{
			arrayToChange[i] = arrayToChange[i] * multiplication[i];
		});
	}

	public static void MultiplicateArray(double[] arrayToChange, double[,] multiArray, int dimension)
	{
		Parallel.For(0, arrayToChange.Length, i =>
		{
			arrayToChange[i] = arrayToChange[i] * multiArray[i, dimension];
		});
	}

The issue is that I want to save time, not waste it. With the standard for loops the computation takes about 2 minutes, but with the parallel for loops it takes 3 minutes. Why?

C# Solutions


Solution 1 - C#

Parallel.For() can improve performance a lot by parallelizing your code, but it also has overhead (synchronization between threads, invoking the delegate on each iteration). And since in your code, each iteration is very short (basically, just a few CPU instructions), this overhead can become prominent.

Because of this, I don't think Parallel.For() is the right solution for you. Instead, if you parallelize your code manually (which in this case is very simple), you may see the performance improve.

To verify this, I performed some measurements: I ran different implementations of MultiplicateArray() on an array of 200 000 000 items (the code I used is below). On my machine, the serial version consistently took 0.21 s and Parallel.For() usually took something around 0.45 s, but from time to time, it spiked to 8–9 s!

First, I'll try to improve the common case, and I'll come back to those spikes later. We want to process the array with N CPUs, so we split it into N equally sized parts and process each part separately. The result? 0.35 s. That's still worse than the serial version. But a for loop over each item of an array is one of the most optimized constructs, so can't we do something to help the compiler? It turns out that extracting the computation of the loop bound does help: 0.18 s. That's better than the serial version, but not by much. And, interestingly, changing the degree of parallelism from 4 to 2 on my 4-core machine (no HyperThreading) doesn't change the result: still 0.18 s. This makes me conclude that the CPU is not the bottleneck here; memory bandwidth is.

Now, back to the spikes: my custom parallelization doesn't have them, but Parallel.For() does. Why? Parallel.For() does use range partitioning, which means each thread processes its own part of the array. But if one thread finishes early, it will try to help process the range of another thread that hasn't finished yet. If that happens, you will get a lot of false sharing, which can slow down the code a lot. My own test with forced false sharing seems to indicate this could indeed be the problem. Forcing the degree of parallelism of the Parallel.For() seems to help with the spikes a little.

Of course, all those measurements are specific to the hardware on my computer and will be different for you, so you should make your own measurements.

The code I used:

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static void Main()
{
	double[] array = new double[200 * 1000 * 1000];

	for (int i = 0; i < array.Length; i++)
		array[i] = 1;

	for (int i = 0; i < 5; i++)
	{
		Stopwatch sw = Stopwatch.StartNew();
		Serial(array, 2);
		Console.WriteLine("Serial: {0:f2} s", sw.Elapsed.TotalSeconds);

		sw = Stopwatch.StartNew();
		ParallelFor(array, 2);
		Console.WriteLine("Parallel.For: {0:f2} s", sw.Elapsed.TotalSeconds);

		sw = Stopwatch.StartNew();
		ParallelForDegreeOfParallelism(array, 2);
		Console.WriteLine("Parallel.For (degree of parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

		sw = Stopwatch.StartNew();
		CustomParallel(array, 2);
		Console.WriteLine("Custom parallel: {0:f2} s", sw.Elapsed.TotalSeconds);

		sw = Stopwatch.StartNew();
		CustomParallelExtractedMax(array, 2);
		Console.WriteLine("Custom parallel (extracted max): {0:f2} s", sw.Elapsed.TotalSeconds);

		sw = Stopwatch.StartNew();
		CustomParallelExtractedMaxHalfParallelism(array, 2);
		Console.WriteLine("Custom parallel (extracted max, half parallelism): {0:f2} s", sw.Elapsed.TotalSeconds);

		sw = Stopwatch.StartNew();
		CustomParallelFalseSharing(array, 2);
		Console.WriteLine("Custom parallel (false sharing): {0:f2} s", sw.Elapsed.TotalSeconds);
	}
}

static void Serial(double[] array, double factor)
{
	for (int i = 0; i < array.Length; i++)
	{
		array[i] = array[i] * factor;
	}
}

static void ParallelFor(double[] array, double factor)
{
	Parallel.For(
		0, array.Length, i => { array[i] = array[i] * factor; });
}

static void ParallelForDegreeOfParallelism(double[] array, double factor)
{
	Parallel.For(
		0, array.Length, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
		i => { array[i] = array[i] * factor; });
}

static void CustomParallel(double[] array, double factor)
{
	var degreeOfParallelism = Environment.ProcessorCount;

	var tasks = new Task[degreeOfParallelism];

	for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
	{
		// capturing taskNumber in lambda wouldn't work correctly
		int taskNumberCopy = taskNumber;

		tasks[taskNumber] = Task.Factory.StartNew(
			() =>
			{
				for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
					i < array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
					i++)
				{
					array[i] = array[i] * factor;
				}
			});
	}

	Task.WaitAll(tasks);
}

static void CustomParallelExtractedMax(double[] array, double factor)
{
	var degreeOfParallelism = Environment.ProcessorCount;

	var tasks = new Task[degreeOfParallelism];

	for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
	{
		// capturing taskNumber in lambda wouldn't work correctly
		int taskNumberCopy = taskNumber;

		tasks[taskNumber] = Task.Factory.StartNew(
			() =>
			{
				var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
				for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
					i < max;
					i++)
				{
					array[i] = array[i] * factor;
				}
			});
	}

	Task.WaitAll(tasks);
}

static void CustomParallelExtractedMaxHalfParallelism(double[] array, double factor)
{
	var degreeOfParallelism = Environment.ProcessorCount / 2;

	var tasks = new Task[degreeOfParallelism];

	for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
	{
		// capturing taskNumber in lambda wouldn't work correctly
		int taskNumberCopy = taskNumber;

		tasks[taskNumber] = Task.Factory.StartNew(
			() =>
			{
				var max = array.Length * (taskNumberCopy + 1) / degreeOfParallelism;
				for (int i = array.Length * taskNumberCopy / degreeOfParallelism;
					i < max;
					i++)
				{
					array[i] = array[i] * factor;
				}
			});
	}

	Task.WaitAll(tasks);
}

static void CustomParallelFalseSharing(double[] array, double factor)
{
	var degreeOfParallelism = Environment.ProcessorCount;

	var tasks = new Task[degreeOfParallelism];

	int i = -1;

	for (int taskNumber = 0; taskNumber < degreeOfParallelism; taskNumber++)
	{
		tasks[taskNumber] = Task.Factory.StartNew(
			() =>
			{
				int j = Interlocked.Increment(ref i);
				while (j < array.Length)
				{
					array[j] = array[j] * factor;
					j = Interlocked.Increment(ref i);
				}
			});
	}

	Task.WaitAll(tasks);
}

Example output:

Serial: 0,20 s
Parallel.For: 0,50 s
Parallel.For (degree of parallelism): 8,90 s
Custom parallel: 0,33 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,53 s
Serial: 0,21 s
Parallel.For: 0,52 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,19 s
Custom parallel (false sharing): 7,59 s
Serial: 0,21 s
Parallel.For: 11,21 s
Parallel.For (degree of parallelism): 0,36 s
Custom parallel: 0,32 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,76 s
Serial: 0,21 s
Parallel.For: 0,46 s
Parallel.For (degree of parallelism): 0,35 s
Custom parallel: 0,31 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s
Serial: 0,21 s
Parallel.For: 0,45 s
Parallel.For (degree of parallelism): 0,40 s
Custom parallel: 0,38 s
Custom parallel (extracted max): 0,18 s
Custom parallel (extracted max, half parallelism): 0,18 s
Custom parallel (false sharing): 7,58 s

Solution 2 - C#

svick already provided a great answer, but I'd like to emphasize that the key point is not to "parallelize your code manually" instead of using Parallel.For(), but that you have to process larger chunks of data.

This can still be done using Parallel.For() like this:

static void My(double[] array, double factor)
{
    int degreeOfParallelism = Environment.ProcessorCount;

    Parallel.For(0, degreeOfParallelism, workerId =>
    {
        var max = array.Length * (workerId + 1) / degreeOfParallelism;
        for (int i = array.Length * workerId / degreeOfParallelism; i < max; i++)
            array[i] = array[i] * factor;
    });
}

which does the same thing as svick's CustomParallelExtractedMax() but is shorter, simpler, and (on my machine) performs even slightly faster:

Serial: 3,94 s
Parallel.For: 9,28 s
Parallel.For (degree of parallelism): 9,58 s
Custom parallel: 2,05 s
Custom parallel (extracted max): 1,19 s
Custom parallel (extracted max, half parallelism): 1,49 s
Custom parallel (false sharing): 27,88 s
My: 0,95 s

By the way, the keyword for this, which is missing from all the other answers, is granularity.
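
As an illustration only (not part of the original answer), the granularity can also be made explicit through the Partitioner.Create overload that takes a range size. chunkSize below is a hypothetical tuning parameter you would have to measure for your own hardware:

// Sketch: control the chunk size (granularity) explicitly via the
// Partitioner.Create(fromInclusive, toExclusive, rangeSize) overload.
// Requires System.Collections.Concurrent and System.Threading.Tasks.
// chunkSize is a hypothetical tuning parameter, e.g. 100000.
public static void MultiplicateArrayChunked(double[] array, double factor, int chunkSize)
{
    var rangePartitioner = Partitioner.Create(0, array.Length, chunkSize);

    Parallel.ForEach(rangePartitioner, range =>
    {
        // One delegate invocation per chunk, then a tight serial loop inside it.
        for (int i = range.Item1; i < range.Item2; i++)
            array[i] = array[i] * factor;
    });
}

Larger chunks mean fewer delegate invocations but coarser load balancing; the sweet spot depends on the machine, so measure.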

Solution 3 - C#

See Custom Partitioners for PLINQ and TPL:

> In a For loop, the body of the loop is provided to the method as a delegate. The cost of invoking that delegate is about the same as a virtual method call. In some scenarios, the body of a parallel loop might be small enough that the cost of the delegate invocation on each loop iteration becomes significant. In such situations, you can use one of the Create overloads to create an IEnumerable<T> of range partitions over the source elements. Then, you can pass this collection of ranges to a ForEach method whose body consists of a regular for loop. The benefit of this approach is that the delegate invocation cost is incurred only once per range, rather than once per element.

In your loop body, you are performing a single multiplication, and the overhead of the delegate call will be very noticeable.

Try this (Partitioner lives in the System.Collections.Concurrent namespace):

public static void MultiplicateArray(double[] array, double factor)
{
    var rangePartitioner = Partitioner.Create(0, array.Length);

    Parallel.ForEach(rangePartitioner, range =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            array[i] = array[i] * factor;
        }
    });
}

See also: Parallel.ForEach documentation and Partitioner.Create documentation.

Solution 4 - C#

Parallel.For involves more complex memory management. The results can vary depending on CPU specs, such as the number of cores and the sizes of the L1 and L2 caches...

Please take a look at this interesting article:

http://msdn.microsoft.com/en-us/magazine/cc872851.aspx

Solution 5 - C#

From http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.aspx and http://msdn.microsoft.com/en-us/library/dd537608.aspx:

You are not creating three threads/processes that each execute one of your loops; instead, the iterations of a single for loop are what get run in parallel. So even with only one loop you are using multiple threads, which means that the iterations with index = 0 and index = 1 may execute at the same time.

You are probably forcing the use of too many threads, and the overhead of creating and scheduling them is bigger than the speed gain.

Try running the three normal for loops on three different threads instead; if your system is multicore (at least 3 cores), it should take less than a minute.
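
For illustration only, a minimal sketch of that suggestion, assuming the three serial MultiplicateArray overloads from the question are in scope; the parameter names below are placeholders, not from the question:

// Hypothetical sketch: run each of the three independent serial loops
// from the question on its own thread-pool thread.
// Requires System.Threading.Tasks.
public static void MultiplicateAllInParallel(
    double[] arr1, double factor,
    double[] arr2, double[] multiplication,
    double[] arr3, double[,] multiArray, int dimension)
{
    Task t1 = Task.Run(() => MultiplicateArray(arr1, factor));
    Task t2 = Task.Run(() => MultiplicateArray(arr2, multiplication));
    Task t3 = Task.Run(() => MultiplicateArray(arr3, multiArray, dimension));

    // Each task runs a plain serial for loop, so there is no per-iteration
    // delegate overhead; we only pay for three task creations and one wait.
    Task.WaitAll(t1, t2, t3);
}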

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | tro | View Question on Stackoverflow
Solution 1 - C# | svick | View Answer on Stackoverflow
Solution 2 - C# | Roman Reiner | View Answer on Stackoverflow
Solution 3 - C# | Kris Vandermotten | View Answer on Stackoverflow
Solution 4 - C# | Jordi | View Answer on Stackoverflow
Solution 5 - C# | Lesto | View Answer on Stackoverflow