Why is statistics.mean() so slow?
PythonPerformanceMeanPython Problem Overview
I compared the performance of the mean
function of the statistics
module with the simple sum(l)/len(l)
method and found the mean
function to be very slow for some reason. I used timeit
with the two code snippets below to compare them, does anyone know what causes the massive difference in execution speed? I'm using Python 3.5.
from timeit import repeat
print(min(repeat('mean(l)',
'''from random import randint; from statistics import mean; \
l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))
The code above executes in about 0.043 seconds on my machine.
from timeit import repeat
print(min(repeat('sum(l)/len(l)',
'''from random import randint; from statistics import mean; \
l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))
The code above executes in about 0.000565 seconds on my machine.
Python Solutions
Solution 1 - Python
Python's statistics
module is not built for speed, but for precision
In the specs for this module, it appears that
> The built-in sum can lose accuracy when dealing with floats of wildly
> differing magnitude. Consequently, the above naive mean fails this
> "torture test"
>
> assert mean([1e30, 1, 3, -1e30]) == 1
>
> returning 0 instead of 1, a purely computational error of 100%.
>
> Using math.fsum inside mean will make it more accurate with float
> data, but it also has the side-effect of converting any arguments to
> float even when unnecessary. E.g. we should expect the mean of a list
> of Fractions to be a Fraction, not a float.
Conversely, if we take a look at the implementation of _sum()
in this module, the first lines of the method's docstring seem to confirm that:
def _sum(data, start=0):
"""_sum(data [, start]) -> (type, sum, count)
Return a high-precision sum of the given numeric data as a fraction,
together with the type to be converted to and the count of items.
[...] """
So yeah, statistics
implementation of sum
, instead of being a simple one-liner call to Python's built-in sum()
function, takes about 20 lines by itself with a nested for
loop in its body.
This happens because statistics._sum
chooses to guarantee the maximum precision for all types of number it could encounter (even if they widely differ from one another), instead of simply emphasizing speed.
Hence, it appears normal that the built-in sum
proves a hundred times faster. The cost of it being a much lower precision in you happen to call it with exotic numbers.
Other options
If you need to prioritize speed in your algorithms, you should have a look at Numpy instead, the algorithms of which being implemented in C.
NumPy mean is not as precise as statistics
by a long shot but it implements (since 2013) a routine based on pairwise summation which is better than a naive sum/len
(more info in the link).
However...
import numpy as np
import statistics
np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])
print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))
> NumPy mean: 0.0
> Statistics mean: 1.0
Solution 2 - Python
if you do care of speed use numpy/scipy/pandas instead:
In [119]: from random import randint; from statistics import mean; import numpy as np;
In [122]: l=[randint(0, 10000) for i in range(10**6)]
In [123]: mean(l)
Out[123]: 5001.992355
In [124]: %timeit mean(l)
1 loop, best of 3: 2.01 s per loop
In [125]: a = np.array(l)
In [126]: np.mean(a)
Out[126]: 5001.9923550000003
In [127]: %timeit np.mean(a)
100 loops, best of 3: 2.87 ms per loop
Conclusion: it will be orders of magnitude faster - in my example it was 700 times faster, but maybe not that precise (as numpy doesn't use Kahan summation algorithm).
Solution 3 - Python
I asked the same question a while back but once I noticed the _sum
function called in mean on line 317 in the source I understood why:
def _sum(data, start=0):
"""_sum(data [, start]) -> (type, sum, count)
Return a high-precision sum of the given numeric data as a fraction,
together with the type to be converted to and the count of items.
If optional argument ``start`` is given, it is added to the total.
If ``data`` is empty, ``start`` (defaulting to 0) is returned.
Examples
--------
>>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
(<class 'float'>, Fraction(11, 1), 5)
Some sources of round-off error will be avoided:
>>> _sum([1e50, 1, -1e50] * 1000) # Built-in sum returns zero.
(<class 'float'>, Fraction(1000, 1), 3000)
Fractions and Decimals are also supported:
>>> from fractions import Fraction as F
>>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
(<class 'fractions.Fraction'>, Fraction(63, 20), 4)
>>> from decimal import Decimal as D
>>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
>>> _sum(data)
(<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)
Mixed types are currently treated as an error, except that int is
allowed.
"""
count = 0
n, d = _exact_ratio(start)
partials = {d: n}
partials_get = partials.get
T = _coerce(int, type(start))
for typ, values in groupby(data, type):
T = _coerce(T, typ) # or raise TypeError
for n,d in map(_exact_ratio, values):
count += 1
partials[d] = partials_get(d, 0) + n
if None in partials:
# The sum will be a NAN or INF. We can ignore all the finite
# partials, and just look at this special one.
total = partials[None]
assert not _isfinite(total)
else:
# Sum all the partial sums using builtin sum.
# FIXME is this faster if we sum them in order of the denominator?
total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
return (T, total, count)
There is a multitude of operations happening in comparison to just calling the builtin sum
, as per the doc strings mean
calculates a high-precision sum.
You can see using mean vs sum can give you different output:
In [7]: l = [.1, .12312, 2.112, .12131]
In [8]: sum(l) / len(l)
Out[8]: 0.6141074999999999
In [9]: mean(l)
Out[9]: 0.6141075
Solution 4 - Python
Both len() and sum() are Python builtin functions (with limited functionality), that are written in C and, more importantly, are optimized to work fast with certain types or objects (list).
You can look at the implementation of builtin functions here:
https://hg.python.org/sandbox/python2.7/file/tip/Python/bltinmodule.c
The statistics.mean() is a high level function written in Python. Take a look here at how it is implemented:
https://hg.python.org/sandbox/python2.7/file/tip/Lib/statistics.py
You can see that later uses internally another function called _sum(), which does a few additional checks compared to the builtin functions.
Solution 5 - Python
If you want a faster mean function, the statistics
module introduced the fmean
function in python 3.8. It converts its data to float
before calculating the mean.
(Implementation here)
A quick comparison :
import timeit, statistics
def test_basic_mean(): return sum(range(10000)) / 10000
def test_mean(): return statistics.mean(range(10000))
def test_fmean(): return statistics.fmean(range(10000))
print("basic mean:", min(timeit.repeat(stmt=test_basic_mean, setup="from __main__ import test_basic_mean", repeat=20, number=10)))
print("statistics.mean:", min(timeit.repeat(stmt=test_mean, setup="from __main__ import statistics, test_mean", repeat=20, number=10)))
print("statistics.fmean:", min(timeit.repeat(stmt=test_fmean, setup="from __main__ import statistics, test_fmean", repeat=20, number=10)))
gives me:
basic mean: 0.0013072469737380743
statistics.mean: 0.025932796066626906
statistics.fmean: 0.001833588001318276
Solution 6 - Python
According to that post: https://stackoverflow.com/questions/7716331/calculating-arithmetic-mean-average-in-python
It should be "due to a particularly precise implementation of the sum operator in statistics".
The mean function is coded with a inner _sum function that is supposed to be more precise than the normal addition but which is a lot slower (code available here: https://hg.python.org/cpython/file/3.5/Lib/statistics.py).
It's specified in the PEP: https://www.python.org/dev/peps/pep-0450/ Accuracy is considered more important as speed for that module.