How do I calculate percentiles with python/numpy?

PythonNumpyStatisticsNumpy NdarrayPercentile

Python Problem Overview


Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array?

I am looking for something similar to Excel's percentile function.

I looked in NumPy's statistics reference, and couldn't find this. All I could find is the median (50th percentile), but not something more specific.

Python Solutions


Solution 1 - Python

You might be interested in the SciPy Stats package. It has the percentile function you're after and many other statistical goodies.

percentile() is available in numpy too.

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50) # return 50th percentile, e.g median.
print p
3.0

This ticket leads me to believe they won't be integrating percentile() into numpy anytime soon.

Solution 2 - Python

By the way, there is a pure-Python implementation of percentile function, in case one doesn't want to depend on scipy. The function is copied below:

## {{{ http://code.activestate.com/recipes/511478/ (r1)
import math
import functools

def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.

    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.

    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1

# median is 50th percentile.
median = functools.partial(percentile, percent=0.5)
## end of http://code.activestate.com/recipes/511478/ }}}

Solution 3 - Python

import numpy as np
a = [154, 400, 1124, 82, 94, 108]
print np.percentile(a,95) # gives the 95th percentile

Solution 4 - Python

Here's how to do it without numpy, using only python to calculate the percentile.

import math

def percentile(data, perc: int):
    size = len(data)
    return sorted(data)[int(math.ceil((size * perc) / 100)) - 1]

percentile([10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0], 90)
# 9.0
percentile([142, 232, 290, 120, 274, 123, 146, 113, 272, 119, 124, 277, 207], 50)
# 146

Solution 5 - Python

Starting Python 3.8, the standard library comes with the quantiles function as part of the statistics module:

from statistics import quantiles

quantiles([1, 2, 3, 4, 5], n=100)
# [0.06, 0.12, 0.18, 0.24, 0.3, 0.36, 0.42, 0.48, 0.54, 0.6, 0.66, 0.72, 0.78, 0.84, 0.9, 0.96, 1.02, 1.08, 1.14, 1.2, 1.26, 1.32, 1.38, 1.44, 1.5, 1.56, 1.62, 1.68, 1.74, 1.8, 1.86, 1.92, 1.98, 2.04, 2.1, 2.16, 2.22, 2.28, 2.34, 2.4, 2.46, 2.52, 2.58, 2.64, 2.7, 2.76, 2.82, 2.88, 2.94, 3.0, 3.06, 3.12, 3.18, 3.24, 3.3, 3.36, 3.42, 3.48, 3.54, 3.6, 3.66, 3.72, 3.78, 3.84, 3.9, 3.96, 4.02, 4.08, 4.14, 4.2, 4.26, 4.32, 4.38, 4.44, 4.5, 4.56, 4.62, 4.68, 4.74, 4.8, 4.86, 4.92, 4.98, 5.04, 5.1, 5.16, 5.22, 5.28, 5.34, 5.4, 5.46, 5.52, 5.58, 5.64, 5.7, 5.76, 5.82, 5.88, 5.94]
quantiles([1, 2, 3, 4, 5], n=100)[49] # 50th percentile (e.g median)
# 3.0

quantiles returns for a given distribution dist a list of n - 1 cut points separating the n quantile intervals (division of dist into n continuous intervals with equal probability):

> statistics.quantiles(dist, *, n=4, method='exclusive')

where n, in our case (percentiles) is 100.

Solution 6 - Python

The definition of percentile I usually see expects as a result the value from the supplied list below which P percent of values are found... which means the result must be from the set, not an interpolation between set elements. To get that, you can use a simpler function.

def percentile(N, P):
    """
    Find the percentile of a list of values
    
    @parameter N - A list of values.  N must be sorted.
    @parameter P - A float value from 0.0 to 1.0

    @return - The percentile of the values.
    """
    n = int(round(P * len(N) + 0.5))
    return N[n-1]

# A = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# B = (15, 20, 35, 40, 50)
#
# print percentile(A, P=0.3)
# 4
# print percentile(A, P=0.8)
# 9
# print percentile(B, P=0.3)
# 20
# print percentile(B, P=0.8)
# 50

If you would rather get the value from the supplied list at or below which P percent of values are found, then use this simple modification:

def percentile(N, P):
    n = int(round(P * len(N) + 0.5))
    if n > 1:
        return N[n-2]
    else:
        return N[0]

Or with the simplification suggested by @ijustlovemath:

def percentile(N, P):
    n = max(int(round(P * len(N) + 0.5)), 2)
    return N[n-2]

Solution 7 - Python

check for scipy.stats module:

 scipy.stats.scoreatpercentile

Solution 8 - Python

To calculate the percentile of a series, run:

from scipy.stats import rankdata
import numpy as np

def calc_percentile(a, method='min'):
    if isinstance(a, list):
        a = np.asarray(a)
    return rankdata(a, method=method) / float(len(a))

For example:

a = range(20)
print {val: round(percentile, 3) for val, percentile in zip(a, calc_percentile(a))}
>>> {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.25, 5: 0.3, 6: 0.35, 7: 0.4, 8: 0.45, 9: 0.5, 10: 0.55, 11: 0.6, 12: 0.65, 13: 0.7, 14: 0.75, 15: 0.8, 16: 0.85, 17: 0.9, 18: 0.95, 19: 1.0}

Solution 9 - Python

A convenient way to calculate percentiles for a one-dimensional numpy sequence or matrix is by using numpy.percentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html>. Example:

import numpy as np

a = np.array([0,1,2,3,4,5,6,7,8,9,10])
p50 = np.percentile(a, 50) # return 50th percentile, e.g median.
p90 = np.percentile(a, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.0  and p90 =  9.0

However, if there is any NaN value in your data, the above function will not be useful. The recommended function to use in that case is the numpy.nanpercentile <https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html> function:

import numpy as np

a_NaN = np.array([0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.])
a_NaN[0] = np.nan
print('a_NaN',a_NaN)
p50 = np.nanpercentile(a_NaN, 50) # return 50th percentile, e.g median.
p90 = np.nanpercentile(a_NaN, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.5  and p90 =  9.1

In the two options presented above, you can still choose the interpolation mode. Follow the examples below for easier understanding.

import numpy as np

b = np.array([1,2,3,4,5,6,7,8,9,10])
print('percentiles using default interpolation')
p10 = np.percentile(b, 10) # return 10th percentile.
p50 = np.percentile(b, 50) # return 50th percentile, e.g median.
p90 = np.percentile(b, 90) # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "linear")
p10 = np.percentile(b, 10,interpolation='linear') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='linear') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='linear') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "lower")
p10 = np.percentile(b, 10,interpolation='lower') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='lower') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='lower') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1 , median =  5  and p90 =  9

print('percentiles using interpolation = ', "higher")
p10 = np.percentile(b, 10,interpolation='higher') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='higher') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='higher') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  6  and p90 =  10

print('percentiles using interpolation = ', "midpoint")
p10 = np.percentile(b, 10,interpolation='midpoint') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='midpoint') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='midpoint') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.5 , median =  5.5  and p90 =  9.5

print('percentiles using interpolation = ', "nearest")
p10 = np.percentile(b, 10,interpolation='nearest') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='nearest') # return 50th percentile, e.g median.
p90 = np.percentile(b, 90,interpolation='nearest') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  5  and p90 =  9

If your input array only consists of integer values, you might be interested in the percentil answer as an integer. If so, choose interpolation mode such as ‘lower’, ‘higher’, or ‘nearest’.

Solution 10 - Python

In case you need the answer to be a member of the input numpy array:

Just to add that the percentile function in numpy by default calculates the output as a linear weighted average of the two neighboring entries in the input vector. In some cases people may want the returned percentile to be an actual element of the vector, in this case, from v1.9.0 onwards you can use the "interpolation" option, with either "lower", "higher" or "nearest".

import numpy as np
x=np.random.uniform(10,size=(1000))-5.0

np.percentile(x,70) # 70th percentile

2.075966046220879

np.percentile(x,70,interpolation="nearest")

2.0729677997904314

The latter is an actual entry in the vector, while the former is a linear interpolation of two vector entries that border the percentile

Solution 11 - Python

for a series: used describe functions

suppose you have df with following columns sales and id. you want to calculate percentiles for sales then it works like this,

df['sales'].describe(percentiles = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])

0.0: .0: minimum
1: maximum 
0.1 : 10th percentile and so on

Solution 12 - Python

I bootstrap the data and then plotted out the confidence interval for 10 samples. The confidence interval shows the range where the probabilities will fall between 5 percent and 95 percent probability.

 import pandas as pd
 import matplotlib.pyplot as plt
 import seaborn as sns
 import numpy as np
 import json
 import dc_stat_think as dcst

 data = [154, 400, 1124, 82, 94, 108]
 #print (np.percentile(data,[0.5,95])) # gives the 95th percentile

 bs_data = dcst.draw_bs_reps(data, np.mean, size=6*10)

 #print(np.reshape(bs_data,(24,6)))

 x= np.linspace(1,6,6)
 print(x)
 for (item1,item2,item3,item4,item5,item6) in bs_data.reshape((10,6)):
     line_data=[item1,item2,item3,item4,item5,item6]
     ci=np.percentile(line_data,[.025,.975])
     mean_avg=np.mean(line_data)
     fig, ax = plt.subplots()
     ax.plot(x,line_data)
     ax.fill_between(x, (line_data-ci[0]), (line_data+ci[1]), color='b', alpha=.1)
     ax.axhline(mean_avg,color='red')
     plt.show()

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionUriView Question on Stackoverflow
Solution 1 - PythonJon WView Answer on Stackoverflow
Solution 2 - PythonBoris GorelikView Answer on Stackoverflow
Solution 3 - PythonrichieView Answer on Stackoverflow
Solution 4 - PythonAshkanView Answer on Stackoverflow
Solution 5 - PythonXavier GuihotView Answer on Stackoverflow
Solution 6 - PythonmpounsettView Answer on Stackoverflow
Solution 7 - PythonEvertView Answer on Stackoverflow
Solution 8 - PythonRoei BahumiView Answer on Stackoverflow
Solution 9 - PythonItalo GervasioView Answer on Stackoverflow
Solution 10 - PythonAdrian TompkinsView Answer on Stackoverflow
Solution 11 - PythonashwiniView Answer on Stackoverflow
Solution 12 - PythonGolden LionView Answer on Stackoverflow