sorting by a custom list in pandas

PythonSortingPandas

Python Problem Overview


After reading through: http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.sort.html

I still can't seem to figure out how to sort a column by a custom list. Obviously, the default sort is alphabetical. I'll give an example. Here is my (very abridged) dataframe:

             Player	     Year	Age	  Tm	 G
2967	 Cedric Hunter	 1991	 27	 CHH	 6
5335	 Maurice Baker	 2004	 25	 VAN	 7
13950	 Ratko Varda	 2001	 22	 TOT	 60
6141	 Ryan Bowen	     2009	 34	 OKC     52
6169	 Adrian Caldwell 1997	 31	 DAL	 81

I want to be able to sort by Player, Year and then Tm. The default sort by Player and Year is fine for me, in normal order. However, I do not want Team sorted alphabetically b/c I want TOT always at the top.

Here is the list I created:

sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
   'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
   'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
   'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
   'WAS', 'WSB']

After reading through the link above, I thought this would work but it didn't:

df.sort(['Player', 'Year', 'Tm'], ascending = [True, True, sorter])

It still has ATL at the top, meaning that it sorted alphabetically and not according to my custom list. Any help would really be greatly appreciated, I just can't figure this out.

Python Solutions


Solution 1 - Python

I just discovered that with pandas 15.1 it is possible to use categorical series (http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#categoricals)

As for your example, lets define the same data-frame and sorter:

import pandas as pd

data = {
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker', 
               'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']

With the data-frame and sorter, which is a category-order, we can do the following in pandas 15.1:

# Convert Tm-column to category and in set the sorter as categories hierarchy
# Youc could also do both lines in one just appending the cat.set_categories()
df.Tm = df.Tm.astype("category")
df.Tm.cat.set_categories(sorter, inplace=True)

print(df.Tm)
Out[48]: 
0    CHH
1    VAN
2    TOT
3    OKC
4    DAL
Name: Tm, dtype: category
Categories (38, object): [TOT < ATL < BOS < BRK ... UTA < VAN < WAS < WSB]

df.sort_values(["Tm"])  ## 'sort' changed to 'sort_values'
Out[49]: 
   Age   G           Player   Tm  Year     id
2   22  60      Ratko Varda  TOT  2001  13950
0   27   6    Cedric Hunter  CHH  1991   2967
4   31  81  Adrian Caldwell  DAL  1997   6169
3   34  52       Ryan Bowen  OKC  2009   6141
1   25   7    Maurice Baker  VAN  2004   5335


Solution 2 - Python

Below is an example that performs lexicographic sort on a dataframe. The idea is to create an numerical index based on the specific sort. Then to perform a numerical sort based on the index. A column is added to the dataframe to do so, and is then removed.

import pandas as pd

# Create DataFrame
df = pd.DataFrame(
{'id':[2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker',
               'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH' ,'VAN' ,'TOT' ,'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81]})

# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL','DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
          'WAS', 'WSB']

# Create the dictionary that defines the order for sorting
sorterIndex = dict(zip(sorter, range(len(sorter))))

# Generate a rank column that will be used to sort
# the dataframe numerically
df['Tm_Rank'] = df['Tm'].map(sorterIndex)

# Here is the result asked with the lexicographic sort
# Result may be hard to analyze, so a second sorting is
# proposed next
## NOTE: 
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Player', 'Year', 'Tm_Rank'],
        ascending = [True, True, True], inplace = True)
df.drop('Tm_Rank', 1, inplace = True)
print(df)

# Here is an example where 'Tm' is sorted first, that will 
# give the first row of the DataFrame df to contain TOT as 'Tm'
df['Tm_Rank'] = df['Tm'].map(sorterIndex)
## NOTE: 
## Newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Tm_Rank', 'Player', 'Year'],
        ascending = [True , True, True], inplace = True)
df.drop('Tm_Rank', 1, inplace = True)
print(df)

Solution 3 - Python

df1 = df.set_index('Tm')
df1.loc[sorter]

Solution 4 - Python

According to pandas 1.1.0 documentation, it has become possible to sort with key parameter like in sorted function (finally!). Here how we can sort by Tm

import pandas as pd


data = {
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker', 
               'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81]
}

# Create DataFrame
df = pd.DataFrame(data)


def tm_sorter(column):
    """Sort function"""
    teams = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
       'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
       'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
       'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
       'WAS', 'WSB']
    correspondence = {team: order for order, team in enumerate(teams)}
    return column.map(correspondence)

df.sort_values(by='Tm', key=tm_sorter)

Sadly, it looks like we can use this feature only in sorting by 1 column (list with keys is not acceptable). It can be circumvented by groupby

df.sort_values(['Player', 'Year']) \
  .groupby(['Player', 'Year']) \
  .apply(lambda x: x.sort_values(by='Tm', key=tm_sorter)) \
  .reset_index(drop=True)

If you know how to use key in sort_values with multiple columns, tell me please

Solution 5 - Python

This does the job in just a couple of lines

# Create a dummy df with the required list and the col name to sort on
dummy = pd.Series(sort_list, name = col_name).to_frame()

# Use left merge on the dummy to return a sorted df
sorted_df = pd.merge(dummy, df, on = col_name, how = 'left')

Solution 6 - Python

Setting the index then DataFrame.loc is useful when you need to order by a single custom list. Because loc will create NaN rows for values in sorter that aren't in the DataFrame we'll first find the intersection. This prevents any unwanted upcasting. Any rows with values not in the list are removed.

true_sort = [s for s in sorter if s in df.Tm.unique()]
df = df.set_index('Tm').loc[true_sort].reset_index()

    Tm     id           Player  Year  Age   G
0  TOT  13950      Ratko Varda  2001   22  60
1  CHH   2967    Cedric Hunter  1991   27   6
2  DAL   6169  Adrian Caldwell  1997   31  81
3  OKC   6141       Ryan Bowen  2009   34  52
4  VAN   5335    Maurice Baker  2004   25   7

Starting Data:

print(df)
      id           Player  Year  Age   Tm   G
0   2967    Cedric Hunter  1991   27  CHH   6
1   5335    Maurice Baker  2004   25  VAN   7
2  13950      Ratko Varda  2001   22  TOT  60
3   6141       Ryan Bowen  2009   34  OKC  52
4   6169  Adrian Caldwell  1997   31  DAL  81

sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']

Solution 7 - Python

Partial solution for those only interested in sorting by categorical columns:

You can do this with a helper function that creates a sort order mapper from a custom list.

This example only includes values from one column, however it could be extended to include other columns by creating a custom order list that includes values that occur in all columns. Naturally, since you must construct your custom list with all possible values in your sort field, this is good mostly for categorical sorting and would not be suitable for continuous variables (unless the possible values are known up front) and columns with a very high cardinality.

import pandas as pd

# set up a dummy dataframe
df = pd.DataFrame({'a':list('abcde'), 'b':range(5)})

# helper function
def make_sorter(l):
    """
    Create a dict from the list to map to 0..len(l)
    Returns a mapper to map a series to this custom sort order
    """
    sort_order = {k:v for k,v in zip(l, range(len(l)))}
    return lambda s: s.map(lambda x: sort_order[x])

# define a custom sort order
my_order = list('bdeca')

df.sort_values('a', key=make_sorter(my_order))

   a b
1  b 1
3  d 3
4  e 4
2  c 2
0  a 0

With OP's data:

df = pd.DataFrame({
    'id':[2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker',
               'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH' ,'VAN' ,'TOT' ,'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81]
})

# Define the sorter
sorter = [
    'TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL',
    'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA',
    'MIL', 'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL',
    'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
    'WAS', 'WSB'
]

df.sort_values('Tm', key=make_sorter(sorter))

      id           Player  Year  Age   Tm   G
2  13950      Ratko Varda  2001   22  TOT  60
0   2967    Cedric Hunter  1991   27  CHH   6
4   6169  Adrian Caldwell  1997   31  DAL  81
3   6141       Ryan Bowen  2009   34  OKC  52
1   5335    Maurice Baker  2004   25  VAN   7

Solution 8 - Python

My idea is generate sort number by index, then merge sort number into original dataframe

import pandas as pd

df = pd.DataFrame(
{'id':[2967, 5335, 13950, 6141, 6169],\
 'Player': ['Cedric Hunter', 'Maurice Baker' ,\
            'Ratko Varda' ,'Ryan Bowen' ,'Adrian Caldwell'],\
 'Year': [1991 ,2004 ,2001 ,2009 ,1997],\
 'Age': [27 ,25 ,22 ,34 ,31],\
 'Tm':['CHH' ,'VAN' ,'TOT' ,'OKC' ,'DAL'],\
 'G':[6 ,7 ,60 ,52 ,81]})

sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
   'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
   'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
   'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
   'WAS', 'WSB']

x = pd.DataFrame({'Tm': sorter})
x.index = x.index.set_names('number')
x = x.reset_index()

df = pd.merge(df, x, how='left', on='Tm')

df.sort_values(['Player', 'Year', 'number'], \
        ascending = [True, True, True], inplace = True)
df.drop('number', 1, inplace = True)

Solution 9 - Python

Since version 1.1.0 you can use the key attribute to sort values:

df.sort_values(by="Tm", key=lambda column: column.map(lambda e: sorter.index(e)), inplace=True)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionitjcms18View Question on Stackoverflow
Solution 1 - PythondmeuView Answer on Stackoverflow
Solution 2 - PythonGuillaume JacquenotView Answer on Stackoverflow
Solution 3 - PythonkaiView Answer on Stackoverflow
Solution 4 - PythonArlegView Answer on Stackoverflow
Solution 5 - PythonG-HigginsView Answer on Stackoverflow
Solution 6 - PythonALollzView Answer on Stackoverflow
Solution 7 - PythonscignView Answer on Stackoverflow
Solution 8 - PythonMithrilView Answer on Stackoverflow
Solution 9 - PythonMarius DKMView Answer on Stackoverflow