Standard deviation of generic list?

Tags: C#, Math, Statistics, Standard Deviation

C# Problem Overview


I need to calculate the standard deviation of a generic list. The list holds data that is mostly floats and ints. Here is the relevant part of my code, without getting into too much detail:

namespace ValveTesterInterface
{
    public class ValveDataResults
    {
        private List<ValveData> m_ValveResults;

        public ValveDataResults()
        {
            if (m_ValveResults == null)
            {
                m_ValveResults = new List<ValveData>();
            }
        }

        public void AddValveData(ValveData valve)
        {
            m_ValveResults.Add(valve);
        }

Here is the function where the standard deviation needs to be calculated:

        public float LatchStdev()
        {
            
            float sumOfSqrs = 0;
            float meanValue = 0;
            foreach (ValveData value in m_ValveResults)
            {
                meanValue += value.LatchTime;
            }
            meanValue = (meanValue / m_ValveResults.Count) * 0.02f;
            
            for (int i = 0; i <= m_ValveResults.Count; i++) 
            {   
                sumOfSqrs += Math.Pow((m_ValveResults - meanValue), 2);  
            }
            return Math.Sqrt(sumOfSqrs /(m_ValveResults.Count - 1));
            
        }
    }
}

Ignore what's inside the LatchStdev() function because I'm sure it's not right. It's just my poor attempt to calculate the standard deviation. I know how to do it for a list of doubles, but not for a generic list of custom data objects. If someone has experience with this, please help.

C# Solutions


Solution 1 - C#

The example above is slightly incorrect: it divides by zero if your data set contains only one element (with floating-point values this yields NaN rather than an exception). The following code is somewhat simpler and gives the "population standard deviation" result (http://en.wikipedia.org/wiki/Standard_deviation).

using System;
using System.Linq;
using System.Collections.Generic;

public static class Extend
{
    public static double StandardDeviation(this IEnumerable<double> values)
    {
        double avg = values.Average();
        return Math.Sqrt(values.Average(v=>Math.Pow(v-avg,2)));
    }
}
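
To make the single-element case concrete, here is a minimal sketch that puts the population formula next to a sample (n - 1) variant. The class name StdDevDemo and both method names are made up for this illustration and are not part of any answer here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StdDevDemo
{
    // Population standard deviation: divide by n.
    public static double PopulationStdDev(IEnumerable<double> values)
    {
        double avg = values.Average();
        return Math.Sqrt(values.Average(v => Math.Pow(v - avg, 2)));
    }

    // Sample standard deviation: divide by n - 1 (deliberately unguarded).
    public static double SampleStdDev(IEnumerable<double> values)
    {
        double avg = values.Average();
        double sum = values.Sum(v => Math.Pow(v - avg, 2));
        return Math.Sqrt(sum / (values.Count() - 1));
    }

    public static void Main()
    {
        var single = new List<double> { 42.0 };
        Console.WriteLine(PopulationStdDev(single));           // 0
        Console.WriteLine(double.IsNaN(SampleStdDev(single))); // True: 0.0 / 0 is NaN
    }
}
```

Which of the two you want depends on whether the list is the whole population or a sample drawn from it.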

Solution 2 - C#

This article should help you. It creates a function that computes the deviation of a sequence of double values. All you have to do is supply a sequence of appropriate data elements.

The resulting function is:

private double CalculateStandardDeviation(IEnumerable<double> values)
{   
  double standardDeviation = 0;

  if (values.Any()) 
  {      
     // Compute the average.     
     double avg = values.Average();

     // Compute the sum of (value - avg)^2.
     double sum = values.Sum(d => Math.Pow(d - avg, 2));

     // Put it all together.      
     standardDeviation = Math.Sqrt((sum) / (values.Count()-1));   
  }  

  return standardDeviation;
}

This is easy enough to adapt for any generic type, as long as we provide a selector for the value being computed. LINQ is great for that: the Select function lets you project your generic list of custom types into a sequence of numeric values, and then compute the standard deviation of that sequence:

List<ValveData> list = ...
var result = CalculateStandardDeviation(list.Select(v => (double)v.SomeField));
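
If you would rather pass the selector directly, something along these lines might work. This is only a sketch: the ValveData class below is a hypothetical stand-in with a single LatchTime property (taken from the question), and the extension method, its class name, and the choice to return 0 for fewer than two elements are assumptions of mine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's ValveData; only LatchTime is assumed.
public class ValveData
{
    public float LatchTime { get; set; }
}

public static class StdDevExtensions
{
    // Sample standard deviation of the values picked out by 'selector'.
    // Returns 0 for fewer than two elements (an assumption of this sketch).
    public static double StandardDeviation<T>(this IEnumerable<T> source, Func<T, double> selector)
    {
        List<double> values = source.Select(selector).ToList(); // enumerate the source once
        if (values.Count < 2)
            return 0.0;

        double avg = values.Average();
        double sum = values.Sum(d => Math.Pow(d - avg, 2));
        return Math.Sqrt(sum / (values.Count - 1));
    }
}

public static class Program
{
    public static void Main()
    {
        var results = new List<ValveData>
        {
            new ValveData { LatchTime = 1f },
            new ValveData { LatchTime = 2f },
            new ValveData { LatchTime = 3f },
        };
        Console.WriteLine(results.StandardDeviation(v => v.LatchTime)); // 1
    }
}
```

The float returned by the lambda is widened to double implicitly, so no cast is needed at the call site.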

Solution 3 - C#

Even though the accepted answer seems mathematically correct, it is wrong from a programming perspective: it enumerates the same sequence 4 times. This might be OK if the underlying object is a list or an array, but if the input is a filtered/aggregated/etc. LINQ expression, or if the data is coming directly from a database or network stream, it will perform much worse.
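
One way to avoid the repeated enumeration, if you want to stay with plain LINQ, is to copy the input into a list first. The sketch below is only an illustration: EnumerationDemo, LazySource, and the Enumerations counter are names I made up, and returning 0 for fewer than two elements is my own assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerationDemo
{
    public static int Enumerations; // how many times the lazy source was started

    // A lazy sequence standing in for a LINQ query, database read, or network stream.
    public static IEnumerable<double> LazySource()
    {
        Enumerations++;
        yield return 2.0;
        yield return 4.0;
        yield return 6.0;
    }

    // Sample (n - 1) standard deviation; the input is copied into a list first.
    public static double SampleStdDev(IEnumerable<double> values)
    {
        List<double> list = values.ToList(); // the only pass over 'values'
        if (list.Count < 2)
            return 0.0; // assumption of this sketch

        double avg = list.Average();
        double sum = list.Sum(d => Math.Pow(d - avg, 2));
        return Math.Sqrt(sum / (list.Count - 1));
    }

    public static void Main()
    {
        double sd = SampleStdDev(LazySource());
        Console.WriteLine(sd);           // 2
        Console.WriteLine(Enumerations); // 1
    }
}
```

The trade-off is memory: the whole sequence is held in the list, which may not be acceptable for very large inputs.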

I would highly recommend not reinventing the wheel and instead using one of the better open-source math libraries, Math.NET. We have been using that library in our company and are very happy with the performance.

PM> Install-Package MathNet.Numerics

using MathNet.Numerics.Statistics;

var populationStdDev = new List<double> { 1d, 2d, 3d, 4d, 5d }.PopulationStandardDeviation();

var sampleStdDev = new List<double> { 2d, 3d, 4d }.StandardDeviation();

See http://numerics.mathdotnet.com/docs/DescriptiveStatistics.html for more information.

Lastly, for those who want the fastest possible result and are willing to sacrifice some precision, read about the "one-pass" algorithm: https://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods
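
As a rough sketch of what a single-pass calculation can look like, the code below uses Welford's running update. That is one single-pass technique, not necessarily the exact formula the linked section presents. The class name OnePass and the choice to return 0 for fewer than two elements are assumptions of this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class OnePass
{
    // Sample standard deviation computed in a single pass (Welford's update).
    public static double StandardDeviation(IEnumerable<double> values)
    {
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the current mean

        foreach (double x in values)
        {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean); // uses the already-updated mean
        }

        // Assumption of this sketch: report 0 when there are fewer than two values.
        return n < 2 ? 0.0 : Math.Sqrt(m2 / (n - 1));
    }

    public static void Main()
    {
        Console.WriteLine(StandardDeviation(new[] { 2.0, 4.0, 6.0 })); // 2
    }
}
```

Because the sequence is visited only once, this also works for inputs that cannot be enumerated a second time.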

Solution 4 - C#

I see what you're doing, and I use something similar. It seems to me you're not going far enough. I tend to encapsulate all data processing into a single class; that way I can cache the calculated values until the list changes. For instance:

using System.Collections.Generic;
using System.Linq;

public class StatProcessor
{
    private readonly List<double> _data = new List<double>(); // this holds the current data
    private double _avg;    // we cache the average here
    private bool _avgValid; // a flag to say whether the cached average is still valid

    // Calculate the average of the list, cache it in _avg, and set _avgValid.
    private void _calcAvg()
    {
        _avg = _data.Average(); // throws if the list is empty
        _avgValid = true;
    }

    public double average
    {
        get
        {
            if (!_avgValid) // only recalculate when the cache is stale
                _calcAvg(); // calculate it, cache it, and set the flag
            return _avg;    // now _avg is guaranteed to be good, so return it
        }
    }

    // ...more stuff
    public void Add(double value)
    {
        _data.Add(value);  // add to the list here...
        _avgValid = false; // ...and reset the flag
    }
}

You'll notice that with this method, only the first request for the average actually computes it. After that, as long as we don't add anything to the list (or remove or modify anything, but those operations aren't shown), we can get the average for basically nothing.

Additionally, since the average is used in the algorithm for the standard deviation, computing the standard deviation first will give us the average for free, and computing the average first will give us a little performance boost in the standard deviation calculation, assuming we remember to check the flag.

Furthermore, places like the average function, where you're already looping through every value, are a great time to cache things like the minimum and maximum values. Of course, requests for this information need to first check whether they've been cached, and that can cause a relative slowdown compared to just finding the max using the list, since it does all the extra work of setting up all the concerned caches, not just the one you're accessing.
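
To show how the average, standard deviation, minimum, and maximum could share one cache, here is a hedged sketch. The class name CachedStats, the single shared flag, and the use of the sample (n - 1) formula are my own assumptions, not something prescribed above.

```csharp
using System;
using System.Collections.Generic;

public class CachedStats
{
    private readonly List<double> _data = new List<double>();
    private bool _valid; // false whenever the list has changed
    private double _avg, _stdDev, _min, _max;

    public void Add(double value)
    {
        _data.Add(value);
        _valid = false; // invalidate every cached statistic at once
    }

    public double Average { get { Refresh(); return _avg; } }
    public double StandardDeviation { get { Refresh(); return _stdDev; } }
    public double Min { get { Refresh(); return _min; } }
    public double Max { get { Refresh(); return _max; } }

    // Recompute all cached values, but only if the data changed since last time.
    private void Refresh()
    {
        if (_valid) return;
        if (_data.Count == 0)
            throw new InvalidOperationException("No data.");

        // First loop: sum, minimum, and maximum together.
        double sum = 0.0;
        _min = double.PositiveInfinity;
        _max = double.NegativeInfinity;
        foreach (double x in _data)
        {
            sum += x;
            if (x < _min) _min = x;
            if (x > _max) _max = x;
        }
        _avg = sum / _data.Count;

        // Second loop: squared deviations, reusing the average just computed.
        double sumSq = 0.0;
        foreach (double x in _data)
            sumSq += (x - _avg) * (x - _avg);

        // Sample (n - 1) form; 0 for a single element is an assumption of this sketch.
        _stdDev = _data.Count > 1 ? Math.Sqrt(sumSq / (_data.Count - 1)) : 0.0;
        _valid = true;
    }
}
```

A single flag keeps the bookkeeping simple, at the cost described above: asking for just the maximum still pays for refreshing every cached value.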

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Tom Hangler | View Question on Stackoverflow
Solution 1 - C# | Jonathan DeMarks | View Answer on Stackoverflow
Solution 2 - C# | LBushkin | View Answer on Stackoverflow
Solution 3 - C# | Yuri Astrakhan | View Answer on Stackoverflow
Solution 4 - C# | Benjamin | View Answer on Stackoverflow