Check which columns in DataFrame are Categorical

PythonPandas

Python Problem Overview


I am new to Pandas... I want to a simple and generic way to find which columns are categorical in my DataFrame, when I don't manually specify each column type, unlike in this SO question. The df is created with:

import pandas as pd
df = pd.read_csv("test.csv", header=None)

e.g.

           0         1         2         3        4
0   1.539240  0.423437 -0.687014   Chicago   Safari
1   0.815336  0.913623  1.800160    Boston   Safari
2   0.821214 -0.824839  0.483724  New York   Safari

.

UPDATE (2018/02/04) The question assumes numerical columns are NOT categorical, @Zero's accepted answer solves this.

BE CAREFUL - As @Sagarkar's comment points out that's not always true. The difficulty is that Data Types and Categorical/Ordinal/Nominal types are orthogonal concepts, thus mapping between them isn't straightforward. @Jeff's answer below specifies the precise manner to achieve the manual mapping.

Python Solutions


Solution 1 - Python

You could use df._get_numeric_data() to get numeric columns and then find out categorical columns

In [66]: cols = df.columns

In [67]: num_cols = df._get_numeric_data().columns

In [68]: num_cols
Out[68]: Index([u'0', u'1', u'2'], dtype='object')

In [69]: list(set(cols) - set(num_cols))
Out[69]: ['3', '4']

Solution 2 - Python

The way I found was updating to Pandas v0.16.0, then excluding number dtypes with:

df.select_dtypes(exclude=["number","bool_","object_"])

Which works, providing no types are changed and no more are added to NumPy. The suggestion in the question's comments by @Jeff suggests include=["category"], but that didn't seem to work.

NumPy Types: link

Numpy Types

Solution 3 - Python

For posterity. The canonical method to select dtypes is .select_dtypes. You can specify an actual numpy dtype or convertible, or 'category' which not a numpy dtype.

In [1]: df = DataFrame({'A' : Series(range(3)).astype('category'), 'B' : range(3), 'C' : list('abc'), 'D' : np.random.randn(3) })

In [2]: df
Out[2]: 
   A  B  C         D
0  0  0  a  0.141296
1  1  1  b  0.939059
2  2  2  c -2.305019

In [3]: df.select_dtypes(include=['category'])
Out[3]: 
   A
0  0
1  1
2  2

In [4]: df.select_dtypes(include=['object'])
Out[4]: 
   C
0  a
1  b
2  c

In [5]: df.select_dtypes(include=['object']).dtypes
Out[5]: 
C    object
dtype: object

In [6]: df.select_dtypes(include=['category','int']).dtypes
Out[6]: 
A    category
B       int64
dtype: object

In [7]: df.select_dtypes(include=['category','int','float']).dtypes
Out[7]: 
A    category
B       int64
D     float64
dtype: object

Solution 4 - Python

You can get the list of categorical columns using this code :

dfName.select_dtypes(include=['object']).columns.tolist()

And intuitively for numerical columns :

dfName.select_dtypes(exclude=['object']).columns.tolist()

Hope that helps.

Solution 5 - Python

select categorical column names

cat_features=[i for i in df.columns if df.dtypes[i]=='object']

Solution 6 - Python

Get categorical and numerical variables

numCols = X.select_dtypes("number").columns
catCols = X.select_dtypes("object").columns
numCols= list(set(numCols))
catCols= list(set(catCols))

Solution 7 - Python

numeric_var = [key for key in dict(df.dtypes)
                   if dict(pd.dtypes)[key]
                       in ['float64','float32','int32','int64']] # Numeric Variable

cat_var = [key for key in dict(df.dtypes)
             if dict(df.dtypes)[key] in ['object'] ] # Categorical Varible

Solution 8 - Python

Use .dtypes

In [10]: df.dtypes
Out[10]: 
0    float64
1    float64
2    float64
3     object
4     object
dtype: object

Solution 9 - Python

Use pandas.DataFrame.select_dtypes. There are categorical dtypes that can be found by 'categorical' flag. For Strings you might use the numpy object dtype

More Info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html

Exemple:

import pandas as pd
df = pd.DataFrame({'Integer': [1, 2] * 3,'Bool': [True, False] * 3,'Float': [1.0, 2.0] * 3,'String': ['Dog', 'Cat'] * 3})
df

Out[1]:    
    Integer	Bool	Float	String
0	1	    True	1.0	    Dog
1	2	    False	2.0	    Cat
2	1	    True	1.0	    Dog
3	2	    False	2.0	    Cat
4	1	    True	1.0	    Dog
5	2	    False	2.0	    Cat

df.select_dtypes(include=['category', object]).columns

Out[2]:
Index(['String'], dtype='object')

Solution 10 - Python

I have faced similar obstacle where categorizing variables was a challenge. However I came up with some approaches based on the nature of the data. This would give a general and flexible answer to your issue as well as to future data.

Normally while categorization of data is done on the basis of its datatype which sometimes may result in wrong analysis. (Usually done by df.select_dtypes(include = ['object', 'category'])

Approach:

The approach is of viewing the data not on a column level but on a row level. This approach would give the number of distinct values which would automatically distinguish categorical variables from numerical types.

That is if count of unique values in a row exceed more than certain number of values (This is for you to decide how many categorical variables you presume in your column)

for eg: if ['Dog', 'Cat', 'Bird', 'Fish', 'Reptile'] makes up for five unique categorical values for a particular column and if number of distinct values don't exceed more than those five unique categorical values in that column then that column falls under categorical variables.

elif ['Dog', 'Cat', 'Bird', 'Fish', 'Reptile'] makes up for five unique categorical values for a particular column and if number of distinct values exceed more than those five unique categorical values in that column then they fall under numerical variables.

if [col for col in df.columns if len(df[col].unique()) <=5]:
     cat_var = [col for col in df.columns if len(df[col].unique()) <=5]  
elif [col for col in df.columns if len(df[col].unique()) > 5]:
     num_var = [col for col in df.columns if len(df[col].unique()) > 5]

# where 5 : presumed number of categorical variables and may be flexible for user to decide. 

I have used if and elif for better illustration. There is no need for that you can directly go for lines inside the condition.

Solution 11 - Python

You don't need to query the data if you are just interested in which columns are of what type.

The fastest method (when %%timeit-ing it) is:

df.dtypes[df.dtypes == 'category'].index

(this will give you a pandas' Index. You can .tolist() to get a list out of it, if you need that.)

This works because df.dtypes is a pd.Series of strings (its own dtype is 'object'), so you can actually just select for the type that you need with normal pandas querying.

You don't have your categorical types as 'category' but as simple strings ('object')? Then just:

df.dtypes[df.dtypes == 'object'].index

Do you have a mix of 'object' and 'category'? Then use isin like you would do normally to query for multiple matches:

df.dtypes[df.dtypes.isin(['object','category'])].index

Solution 12 - Python

`categorical_values  = (df.dtypes == 'object')
categorical_variables = categorical_variables =[categorical_values.index[ind] 
for ind, val in enumerate(categorical_values) if val == True]

In the first line of code, we obtain a series which gives information regarding all the columns. The series gives information on which column is an object type and which column is not of the object type by representing it with a Boolean value.

In the second line, we use a list comprehension using enumeration(iterating through index and value), so that we could easily find the column which is of categorical type and append it to the categorical_variables list

Solution 13 - Python

Often columns get pandas dtype of string (or "object") or category. Better to include both incase the columns you look for don't get listed under category dtype.

dataframe.select_dtypes(include=['object','category']).columns.tolist()

Solution 14 - Python

First we can segregate the data frame with the default types available when we read the datasets. This will list out all the different types and the corresponding data.

for types in data.dtypes.unique():
    print(types)
    print(data.select_dtypes(types).columns)

Solution 15 - Python

This code will get all categorical variables:

cat_cols = [col for col in df.columns if col not in df.describe().columns]

Solution 16 - Python

This will give an array of all the categorical variables in a dataframe.

dataset.select_dtypes(include=['O']).columns.values

Solution 17 - Python

# Import packages
import numpy as np
import pandas as pd

# Data
df = pd.DataFrame({"Country" : ["France", "Spain", "Germany", "Spain", "Germany", "France"], 
                   "Age" : [34, 27, 30, 32, 42, 30], 
                   "Purchased" : ["No", "Yes", "No", "No", "Yes", "Yes"]})
df

Out[1]:
  Country Age Purchased
0  France  34        No
1   Spain  27       Yes
2 Germany  30        No
3   Spain  32        No
4 Germany  42       Yes
5  France  30       Yes

# Checking data type
df.dtypes

Out[2]: 
Country      object
Age           int64
Purchased    object
dtype: object

# Saving CATEGORICAL Variables
cat_col = [c for i, c in enumerate(df.columns) if df.dtypes[i] in [np.object]]
cat_col
Out[3]: ['Country', 'Purchased']

Solution 18 - Python

This might help. But you need to check the columns with slightly less than 10 characters, or you need to check columns with unique values that are slightly more than 10 characters manually.

def find_cate(df):
cols=df.columns
i=0
for col in cols:
    if len(df[col].unique())<=10:
        print(col,len(df[col].unique()))
        i=i+1
print(i)

Solution 19 - Python

df.select_dtypes(exclude=["number"]).columns

This will help you to directly display all the non numerical rows

Solution 20 - Python

This always worked pretty well for me :

categorical_columns = list(set(df.columns) - set(df.describe().columns))

Solution 21 - Python

Sklearn gives you a one liner (or a 2 liner if you want to use it on many DataFrames). Lets say your DataFrame object is df then:

## good example in https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html
from sklearn.compose import make_column_selector
        
cat_cols = make_column_selector(dtype_include=object) (df)
print (cat_cols)
## OR to use with many DataFrames, create one _selector object first
num_selector = make_column_selector(dtype_include=np.number) 
num_cols = num_selector (df)
print (num_cols)  

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionpdsView Question on Stackoverflow
Solution 1 - PythonZeroView Answer on Stackoverflow
Solution 2 - PythonpdsView Answer on Stackoverflow
Solution 3 - PythonJeffView Answer on Stackoverflow
Solution 4 - PythonShikhar OmarView Answer on Stackoverflow
Solution 5 - PythonGucci148View Answer on Stackoverflow
Solution 6 - PythonRanjita ShettyView Answer on Stackoverflow
Solution 7 - PythonSudhir TiwariView Answer on Stackoverflow
Solution 8 - PythonLiam FoleyView Answer on Stackoverflow
Solution 9 - PythondCrystalView Answer on Stackoverflow
Solution 10 - PythonAztec-3x5View Answer on Stackoverflow
Solution 11 - PythonMichele PiccoliniView Answer on Stackoverflow
Solution 12 - PythonLakshman DunnaView Answer on Stackoverflow
Solution 13 - Pythond2a2dView Answer on Stackoverflow
Solution 14 - PythonChethanView Answer on Stackoverflow
Solution 15 - PythonBhanu SinghView Answer on Stackoverflow
Solution 16 - Pythonankit2saxenaView Answer on Stackoverflow
Solution 17 - PythonHamza ChennaqView Answer on Stackoverflow
Solution 18 - PythonVenkatView Answer on Stackoverflow
Solution 19 - PythonHemanth HarikrishnanView Answer on Stackoverflow
Solution 20 - PythonTaha MahjoubiView Answer on Stackoverflow
Solution 21 - Pythonuser96265View Answer on Stackoverflow