How can I get descriptive statistics of a NumPy array?


Python Problem Overview


I use the following code to create a NumPy ndarray. The file has 9 columns, and I explicitly specify the type of each column:

dataset = np.genfromtxt("data.csv", delimiter=",", dtype=('|S1', float, float, float, float, float, float, float, int))

Now I would like to get some descriptive statistics for each column (min, max, stdev, mean, median, etc.). Shouldn't there be an easy way to do this?

I tried this:

from scipy import stats
stats.describe(dataset)

but this raises an error: TypeError: cannot perform reduce with flexible type

How can I get descriptive statistics of the created NumPy array?

Python Solutions


Solution 1 - Python

import pandas as pd
import numpy as np

df_describe = pd.DataFrame(dataset)
df_describe.describe()

Note that dataset is the NumPy array you want to describe.

Solution 2 - Python

This is not a pretty solution, but it gets the job done. The problem is that by specifying multiple dtypes, you essentially create a 1D array of tuples (actually np.void records), which stats.describe cannot reduce because the records mix several types, including strings.

This could be resolved by either reading it in two rounds, or using pandas with read_csv.
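
The pandas route mentioned above might look like the following sketch. The inline sample data and the column names are made up for illustration; with a real file you would pass its path to read_csv instead of a StringIO object.

```python
import io

import pandas as pd

# Hypothetical sample standing in for data.csv: one letter column, then numbers.
csv_text = """A,0.35,1.0,2
B,0.71,3.0,4
C,0.50,2.0,6
"""

# header=None because the sample has no header row; the names are illustrative only.
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=["label", "x", "y", "n"])

# describe() summarizes the numeric columns by default
# (count, mean, std, min, quartiles, max), so "label" is left out.
summary = df.describe()
print(summary)

# The median is the 50% row of the summary.
medians = summary.loc["50%"]
```

Letting pandas infer the column types avoids the structured-array problem entirely, at the cost of an extra dependency.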

If you decide to stick to numpy:

import numpy as np
from scipy import stats

# Read the eight numeric columns; unpack=True returns one row per file column.
a = np.genfromtxt('sample.txt', delimiter=",", unpack=True, usecols=range(1, 9))
# Read the single-character first column separately.
s = np.genfromtxt('sample.txt', delimiter=",", unpack=True, usecols=0, dtype='|S1')

for arr in a:  # the loop is not strictly needed here, but the output looks prettier
    print(stats.describe(arr))
# Output per print:
# DescribeResult(nobs=6, minmax=(0.34999999999999998, 0.70999999999999996), mean=0.54500000000000004, variance=0.016599999999999997, skewness=-0.3049304880932534, kurtosis=-0.9943046886340534)

Note that in this example the last column is loaded as float rather than int; if necessary, it can easily be converted to int using arr.astype(int).
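
If the median is also needed (stats.describe does not report it), plain NumPy reductions work on the numeric block. A minimal sketch, using a small made-up array in place of the file; it is laid out like `a` above, with one row per file column because of unpack=True:

```python
import numpy as np

# Made-up stand-in for `a`: each row holds one column of the file.
a = np.array([[0.35, 0.71, 0.50, 0.62],
              [1.0, 2.0, 3.0, 4.0]])

# Reduce along axis=1 so each row (file column) gets its own statistic.
col_stats = {
    "min": a.min(axis=1),
    "max": a.max(axis=1),
    "mean": a.mean(axis=1),
    "std": a.std(axis=1, ddof=1),   # ddof=1 gives the sample standard deviation
    "median": np.median(a, axis=1),
}

for name, values in col_stats.items():
    print(name, values)
```

Without unpack=True the columns would run along the other axis, and axis=0 would be the one to reduce over.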

Solution 3 - Python

The question of how to deal with mixed data from genfromtxt comes up often. People expect a 2d array, and instead get a 1d array that they can't index by column. That's because they get a structured array, with a different dtype for each column.

All the examples in the genfromtxt doc show this:

>>> from io import StringIO
>>> import numpy as np
>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, b'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

But let me demonstrate how to access this kind of data:

In [361]: txt=b"""A, 1,2,3
     ...: B,4,5,6
     ...: """
In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
In [363]: data
Out[363]: 
array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)], 
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])

So my array has 2 records (check the shape), which are displayed as tuples in a list.

You access fields by name, not by column number (do I need to add a structured array documentation link?)

In [364]: data['f0']
Out[364]: 
array([b'A', b'B'], 
      dtype='|S1')
In [365]: data['f1']
Out[365]: array([1, 4])
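
Because each field is an ordinary 1d array, one way to summarize a structured array is to loop over its field names and skip the non-numeric ones. A minimal sketch along those lines, reusing the two-row example (the choice of statistics is just for illustration):

```python
import numpy as np

txt = b"""A, 1,2,3
B,4,5,6
"""
data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype='S1,int,float,int')

summary = {}
for name in data.dtype.names:
    field = data[name]
    # Skip string fields; only integer and float fields can be reduced.
    if not np.issubdtype(field.dtype, np.number):
        continue
    summary[name] = (field.min(), field.max(), field.mean(), np.median(field))

for name, (lo, hi, mean, median) in summary.items():
    print(name, lo, hi, mean, median)
```

This keeps the original per-column dtypes, so integer columns stay integers.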

In a case like this it might be more useful to choose a dtype with 'subarrays'. This is a more advanced dtype topic:

In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
In [368]: data
Out[368]: 
array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])], 
      dtype=[('f0', 'S1'), ('f1', '<f8', (3,))])
In [369]: data['f1']
Out[369]: 
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

The character column is still loaded as S1, but the numbers are now in a 3 column array. Note that they all share one dtype (all float here; all int would also work).

In [371]: from scipy import stats
In [372]: stats.describe(data['f1'])
Out[372]: DescribeResult(nobs=2, 
   minmax=(array([ 1.,  2.,  3.]), array([ 4.,  5.,  6.])),
   mean=array([ 2.5,  3.5,  4.5]), 
   variance=array([ 4.5,  4.5,  4.5]), 
   skewness=array([ 0.,  0.,  0.]), 
   kurtosis=array([-2., -2., -2.]))

Solution 4 - Python

Official Scipy Documentation Example

#INPUT
import numpy as np
from scipy import stats
a = np.arange(10)
stats.describe(a)

#OUTPUT
DescribeResult(nobs=10, minmax=(0, 9), mean=4.5, variance=9.166666666666666,
               skewness=0.0, kurtosis=-1.2242424242424244)

#INPUT
b = [[1, 2], [3, 4]]
stats.describe(b)

#OUTPUT
DescribeResult(nobs=2, minmax=(array([1, 2]), array([3, 4])),
               mean=array([2., 3.]), variance=array([2., 2.]),
               skewness=array([0., 0.]), kurtosis=array([-2., -2.]))
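
The individual statistics can be picked out of the result by attribute. describe does not report the standard deviation or the median directly; the sketch below derives the former from the variance (which describe computes with ddof=1 by default) and takes the latter from NumPy.

```python
import numpy as np
from scipy import stats

a = np.arange(10)
result = stats.describe(a)

# Fields are available by name.
lowest, highest = result.minmax
mean = result.mean

# Standard deviation from the reported (ddof=1) variance.
std = np.sqrt(result.variance)

# describe() has no median field, so compute it separately.
median = np.median(a)

print(lowest, highest, mean, std, median)
```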

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | beta | View Question on Stackoverflow
Solution 1 - Python | INNO TECH | View Answer on Stackoverflow
Solution 2 - Python | M.T | View Answer on Stackoverflow
Solution 3 - Python | hpaulj | View Answer on Stackoverflow
Solution 4 - Python | sogu | View Answer on Stackoverflow