Pandas: convert categories to numbers

Tags: Python, Pandas, Series, Categorical Data, Binning

Python Problem Overview


Suppose I have a dataframe of countries that looks like this:

cc | temp
US | 37.0
CA | 12.0
US | 35.0
AU | 20.0

I know that there is a pd.get_dummies function to convert the countries to 'one-hot encodings'. However, I wish to convert them to integer indices instead, so that I get cc_index = [1,2,1,3].

I'm assuming that there is a faster way than using the get_dummies along with a numpy where clause as shown below:

[np.where(x) for x in pd.get_dummies(df.cc).values]

This is somewhat easier to do in R using 'factors' so I'm hoping pandas has something similar.

Python Solutions


Solution 1 - Python

First, change the type of the column:

df.cc = pd.Categorical(df.cc)

Now the data look similar but are stored categorically. To capture the category codes:

df['code'] = df.cc.cat.codes

Now you have:

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0

If you don't want to modify your DataFrame but simply get the codes:

df.cc.astype('category').cat.codes

Or use the categorical column as an index:

df2 = pd.DataFrame(df.temp)
df2.index = pd.CategoricalIndex(df.cc)
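
As a minimal, self-contained sketch of the steps above, the following rebuilds the example frame from the question and reads the codes and the category labels side by side. The names cc_index and restored are only illustrative. Note that the codes follow the sorted category order, so adding 1 gives [3, 2, 3, 1] rather than the [1, 2, 1, 3] asked for in the question.

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

cat = df["cc"].astype("category")
codes = cat.cat.codes              # 0-based codes; categories are sorted: AU=0, CA=1, US=2
categories = cat.cat.categories    # the labels, in code order

# Shift by one for 1-based indices (illustrative name)
cc_index = (codes + 1).tolist()

# Look the labels up again from the codes
restored = categories[codes.to_numpy()].tolist()

print(cc_index)
print(restored)
```

Keeping the categories around is what makes the encoding reversible: a code is just a position in that list of labels.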

Solution 2 - Python

If you wish only to transform your series into integer identifiers, you can use pd.factorize.

Note that this solution, unlike pd.Categorical, does not sort alphabetically by default: codes are assigned in order of first appearance, so the first country seen gets 0. If you wish to start from 1, you can add a constant:

df['code'] = pd.factorize(df['cc'])[0] + 1

print(df)

   cc  temp  code
0  US  37.0     1
1  CA  12.0     2
2  US  35.0     1
3  AU  20.0     3

If you wish to sort alphabetically, specify sort=True:

df['code'] = pd.factorize(df['cc'], sort=True)[0] + 1 
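
pd.factorize returns two things, the codes and the unique values, and the second is easy to overlook. The sketch below, using the countries from the question, shows both return values with and without sort=True. The variable names are only illustrative.

```python
import pandas as pd

cc = pd.Series(["US", "CA", "US", "AU"])

# Default: codes follow the order of first appearance
codes, uniques = pd.factorize(cc)

# sort=True: codes follow the sorted order of the unique values
sorted_codes, sorted_uniques = pd.factorize(cc, sort=True)

# 1-based indices, as asked for in the question
cc_index = (codes + 1).tolist()

# The uniques let you map codes back to the original labels
restored = uniques.take(codes).tolist()

print(cc_index)
print(sorted_codes.tolist(), list(sorted_uniques))
```

Without sorting, this reproduces the cc_index = [1,2,1,3] from the question directly.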

Solution 3 - Python

If you are using scikit-learn (sklearn), you can use LabelEncoder. Like pd.Categorical, it sorts the input strings alphabetically before encoding.

from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
df['code'] = LE.fit_transform(df['cc'])

print(df)

   cc  temp  code
0  US  37.0     2
1  CA  12.0     1
2  US  35.0     2
3  AU  20.0     0

Solution 4 - Python

One-line code:

df[['cc']] = df[['cc']].apply(lambda col:pd.Categorical(col).codes)

This works also if you have a list_of_columns:

df[list_of_columns] = df[list_of_columns].apply(lambda col:pd.Categorical(col).codes)

pd.Categorical encodes missing values as -1. If you want to keep them as NaN instead, you can apply a replace:

df[['cc']] = df[['cc']].apply(lambda col:pd.Categorical(col).codes).replace(-1,np.nan)
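
To see the missing-value behaviour in isolation, here is a small sketch on a hypothetical series with one missing country. It uses Series.where instead of replace, which is a swapped-in but equivalent way to turn the -1 code back into NaN.

```python
import pandas as pd

# Hypothetical input with one missing value
s = pd.Series(["US", None, "CA"])

# Missing values get the code -1; categories are sorted: CA=0, US=1
codes = pd.Series(pd.Categorical(s).codes, index=s.index)

# Keep the codes where they are valid, NaN elsewhere.
# The result is a float column, because NaN cannot live in an integer dtype.
codes_with_nan = codes.where(codes != -1)

print(codes.tolist())
print(codes_with_nan)
```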

Solution 5 - Python

To encode by frequency, so that more frequent values get higher numbers (here col is the name of the column to encode):

labels = df[col].value_counts(ascending=True).index.tolist()
codes = range(1, len(labels) + 1)
df[col] = df[col].map(dict(zip(labels, codes)))
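
The frequency-based encoding above can be sketched on a small frame as follows. The data here is hypothetical and was chosen so that every country has a different count, since the order of tied counts is not something to rely on. This version writes the result to a new column through an explicit mapping rather than replacing in place.

```python
import pandas as pd

# Hypothetical data: US appears 3 times, CA twice, AU once
df = pd.DataFrame({"cc": ["US", "CA", "US", "AU", "US", "CA"]})
col = "cc"

# Least frequent label first, most frequent last
labels = df[col].value_counts(ascending=True).index.tolist()

# Rank 1 for the rarest label, up to len(labels) for the most common
mapping = {label: code for code, label in enumerate(labels, start=1)}

df["code"] = df[col].map(mapping)
print(df)
```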

Solution 6 - Python

This converts any of the given columns to numbers. It does not create a new column; it replaces the values in each column with their category codes.

def characters_to_numb(*args):
    # Each argument is the name of a column in the global DataFrame df
    for arg in args:
        df[arg] = pd.Categorical(df[arg])
        df[arg] = df[arg].cat.codes
    return df
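
The function above relies on a global df and changes it in place. As one possible variation, the sketch below takes the frame as a parameter and works on a copy. The name encode_columns is hypothetical and not part of the original answer.

```python
import pandas as pd

def encode_columns(frame, *columns):
    """Return a copy of frame with the named columns replaced by category codes."""
    out = frame.copy()
    for column in columns:
        # Codes follow the sorted order of the column's unique values
        out[column] = pd.Categorical(out[column]).codes
    return out

df = pd.DataFrame({"cc": ["US", "CA", "US", "AU"],
                   "temp": [37.0, 12.0, 35.0, 20.0]})

encoded = encode_columns(df, "cc")
print(encoded)
```

Working on a copy leaves the original strings available, which helps if you later need to map codes back to countries.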

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | sachinruk | View Question on Stackoverflow
Solution 1 - Python | John Zwinck | View Answer on Stackoverflow
Solution 2 - Python | jpp | View Answer on Stackoverflow
Solution 3 - Python | jpp | View Answer on Stackoverflow
Solution 4 - Python | Piotro | View Answer on Stackoverflow
Solution 5 - Python | Palepalli Surendra Reddy | View Answer on Stackoverflow
Solution 6 - Python | Denis Kalyan | View Answer on Stackoverflow