Selecting Pandas Columns by dtype


Python Problem Overview


I was wondering if there is an elegant, shorthand way to select columns of a Pandas DataFrame by data type (dtype), e.g. to select only the int64 columns of a DataFrame.

To elaborate, something along the lines of

df.select_columns(dtype=float64)

Thanks in advance for the help

Python Solutions


Solution 1 - Python

Since pandas 0.14.1 there's a select_dtypes method, so you can do this more elegantly and more generally.

In [11]: df = pd.DataFrame([[1, 2.2, 'three']], columns=['A', 'B', 'C'])

In [12]: df.select_dtypes(include=['int'])
Out[12]:
   A
0  1

> To select all numeric types use the numpy dtype numpy.number

In [13]: df.select_dtypes(include=[np.number])
Out[13]:
   A    B
0  1  2.2

In [14]: df.select_dtypes(exclude=[object])
Out[14]:
   A    B
0  1  2.2
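
As a small sketch of how include and exclude can be combined in one call, the example below selects every numeric column except float64. The frame and its column names are made up for illustration, and the dtypes are set explicitly so the result does not depend on platform or pandas defaults.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; dtypes are pinned explicitly for reproducibility.
df = pd.DataFrame({
    "A": pd.Series([1, 2], dtype="int64"),
    "B": pd.Series([2.2, 3.3], dtype="float64"),
    "C": pd.Series(["three", "four"], dtype=object),
})

# include and exclude together: numeric columns, minus the float64 ones.
numeric_not_float = df.select_dtypes(include=[np.number], exclude=[np.float64])
print(list(numeric_not_float.columns))  # ['A']

# select_dtypes returns a new DataFrame; df still has all three columns.
print(list(df.columns))  # ['A', 'B', 'C']
```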

Solution 2 - Python

df.loc[:, df.dtypes == np.float64]
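
To make the one-liner above easier to follow, here is a minimal sketch that builds the boolean mask separately. The column names are invented for the example. Note that the comparison is an exact match, so a float32 column is not selected.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two different float widths and one integer column.
df = pd.DataFrame({
    "x": np.array([0.5, 1.5], dtype="float64"),
    "y": np.array([0.5, 1.5], dtype="float32"),
    "n": np.array([1, 2], dtype="int64"),
})

# df.dtypes is a Series indexed by column name; comparing it to a dtype
# gives a boolean Series that .loc can use to pick columns.
mask = df.dtypes == np.float64
only_float64 = df.loc[:, mask]

print(list(only_float64.columns))  # ['x']  (float32 column 'y' is not matched)
```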

Solution 3 - Python

df.select_dtypes(include=[np.float64])

Solution 4 - Python

I'd like to extend the existing answers by adding options for selecting all floating-point dtypes or all integer dtypes:

Demo:

import numpy as np
import pandas as pd

np.random.seed(1234)

df = pd.DataFrame({
        'a':np.random.rand(3), 
        'b':np.random.rand(3).astype('float32'), 
        'c':np.random.randint(10,size=(3)).astype('int16'),
        'd':np.arange(3).astype('int32'), 
        'e':np.random.randint(10**7,size=(3)).astype('int64'),
        'f':np.random.choice([True, False], 3),
        'g':pd.date_range('2000-01-01', periods=3)
     })

yields:

In [2]: df
Out[2]:
          a         b  c  d        e      f          g
0  0.191519  0.785359  6  0  7578569  False 2000-01-01
1  0.622109  0.779976  8  1  7981439   True 2000-01-02
2  0.437728  0.272593  0  2  2558462   True 2000-01-03

In [3]: df.dtypes
Out[3]:
a           float64
b           float32
c             int16
d             int32
e             int64
f              bool
g    datetime64[ns]
dtype: object

Selecting all floating number columns:

In [4]: df.select_dtypes(include=['floating'])
Out[4]:
          a         b
0  0.191519  0.785359
1  0.622109  0.779976
2  0.437728  0.272593

In [5]: df.select_dtypes(include=['floating']).dtypes
Out[5]:
a    float64
b    float32
dtype: object

Selecting all integer number columns:

In [6]: df.select_dtypes(include=['integer'])
Out[6]:
   c  d        e
0  6  0  7578569
1  8  1  7981439
2  0  2  2558462

In [7]: df.select_dtypes(include=['integer']).dtypes
Out[7]:
c    int16
d    int32
e    int64
dtype: object

Selecting all numeric columns:

In [8]: df.select_dtypes(include=['number'])
Out[8]:
          a         b  c  d        e
0  0.191519  0.785359  6  0  7578569
1  0.622109  0.779976  8  1  7981439
2  0.437728  0.272593  0  2  2558462

In [9]: df.select_dtypes(include=['number']).dtypes
Out[9]:
a    float64
b    float32
c      int16
d      int32
e      int64
dtype: object
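
The same string aliases also cover non-numeric columns. The sketch below is an illustration on a smaller, hand-built frame (not the random one above) and assumes the 'bool' and 'datetime' aliases accepted by select_dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column per kind of dtype.
df = pd.DataFrame({
    "a": np.array([0.1, 0.2, 0.3], dtype="float64"),
    "c": np.array([6, 8, 0], dtype="int16"),
    "f": np.array([False, True, True]),
    "g": pd.date_range("2000-01-01", periods=3),
})

bool_cols = df.select_dtypes(include=["bool"])          # boolean columns only
datetime_cols = df.select_dtypes(include=["datetime"])  # datetime64 columns only
not_numeric = df.select_dtypes(exclude=["number"])      # everything non-numeric

print(list(bool_cols.columns))      # ['f']
print(list(datetime_cols.columns))  # ['g']
print(list(not_numeric.columns))    # ['f', 'g']
```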

Solution 5 - Python

You can pass a list of types to include to select columns of several dtypes at once, for example float64 and int64:

df_numeric = df.select_dtypes(include=[np.float64,np.int64])
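
A short sketch of what such a list matches, on an invented frame: only the exact dtypes listed are kept, and string names can be mixed with NumPy types in the same list.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'small' is int32 on purpose.
df = pd.DataFrame({
    "price": np.array([1.5, 2.5], dtype="float64"),
    "qty": np.array([1, 2], dtype="int64"),
    "small": np.array([1, 2], dtype="int32"),
    "flag": np.array([True, False]),
})

# Exact types only: the int32 and bool columns are left out.
df_numeric = df.select_dtypes(include=[np.float64, np.int64])
print(list(df_numeric.columns))  # ['price', 'qty']

# String names and NumPy types can be mixed in one list.
df_mixed = df.select_dtypes(include=["float64", np.int64, "bool"])
print(list(df_mixed.columns))  # ['price', 'qty', 'flag']
```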

Solution 6 - Python

If you want to select int64 columns and then update "in place", you can use:

int64_cols = [col for col in df.columns if is_int64_dtype(df[col].dtype)]
df[int64_cols]

For example, notice that I update all the int64 columns in df to zero below:

In [1]:

    import pandas as pd
    from pandas.api.types import is_int64_dtype
    
    df = pd.DataFrame({'a': [1, 2] * 3,
                       'b': [True, False] * 3,
                       'c': [1.0, 2.0] * 3,
                       'd': ['red','blue'] * 3,
                       'e': pd.Series(['red','blue'] * 3, dtype="category"),
                       'f': pd.Series([1, 2] * 3, dtype="int64")})
    
    int64_cols = [col for col in df.columns if is_int64_dtype(df[col].dtype)] 
    print('int64 Cols: ',int64_cols)
    
    print(df[int64_cols])
    
    df[int64_cols] = 0
    
    print(df[int64_cols]) 
    
Out [1]:

    int64 Cols:  ['a', 'f']

           a  f
        0  1  1
        1  2  2
        2  1  1
        3  2  2
        4  1  1
        5  2  2
           a  f
        0  0  0
        1  0  0
        2  0  0
        3  0  0
        4  0  0
        5  0  0

Just for completeness:

Selecting columns with df.loc[:, mask] or df.select_dtypes() gives you a copy of part of the DataFrame, not a view into it. This means that if you try to update values in the result of df.select_dtypes(), you may get a SettingWithCopyWarning, and no updates will happen to df itself.

For example, notice that when I try to update df through columns selected with .loc or .select_dtypes(), nothing happens:

In [2]:

    import numpy as np

    df = pd.DataFrame({'a': [1, 2] * 3,
                       'b': [True, False] * 3,
                       'c': [1.0, 2.0] * 3,
                       'd': ['red','blue'] * 3,
                       'e': pd.Series(['red','blue'] * 3, dtype="category"),
                       'f': pd.Series([1, 2] * 3, dtype="int64")})
    
    df_bool = df.select_dtypes(include='bool')
    df_bool.b[0] = False
    
    print(df_bool.b[0])
    print(df.b[0])
    
    df.loc[:, df.dtypes == np.int64].a[0]=7
    print(df.a[0])

Out [2]:
    
    False
    True
    1
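
One way to combine the two approaches, sketched here on a small made-up frame: use select_dtypes only to find the column labels, then assign through df itself so the original frame is the one that changes.

```python
import pandas as pd

# Hypothetical frame with dtypes pinned explicitly.
df = pd.DataFrame({
    "a": pd.Series([1, 2, 3], dtype="int64"),
    "b": [True, False, True],
    "c": [1.0, 2.0, 3.0],
})

# Take only the labels from the selection ...
int64_cols = df.select_dtypes(include=["int64"]).columns

# ... and assign on the original frame, not on the selected copy.
df[int64_cols] = 0

# A single cell is updated with one .loc call on df (no chained indexing).
df.loc[0, "c"] = 9.5

print(df["a"].tolist())  # [0, 0, 0]
print(df["c"].tolist())  # [9.5, 2.0, 3.0]
```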

Solution 7 - Python

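df.select_dtypes(include=[int])

Note that np.int was only an alias for the built-in int and was removed in NumPy 1.24, so use int (or np.integer for all integer widths) instead.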

Solution 8 - Python

Alternatively, if you don't want to create a subset of the DataFrame along the way, you can iterate over the columns and check each column's dtype directly.

I haven't benchmarked the code below, but I assume it will be faster on a very large dataset.

[col for col in df.columns.tolist() if df[col].dtype not in ['object','<M8[ns]']] 
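
A variant of the same idea, offered as a sketch rather than a benchmarked replacement: the predicates in pandas.api.types test the dtype directly, so the check does not rely on string spellings such as '<M8[ns]'. The frame below is invented for the example.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_object_dtype

# Hypothetical frame: one float, one datetime, one object column.
df = pd.DataFrame({
    "num": [1.0, 2.0],
    "when": pd.date_range("2000-01-01", periods=2),
    "label": pd.Series(["x", "y"], dtype=object),
})

# Iterate over (column name, dtype) pairs without slicing the frame.
keep = [
    col
    for col, dtype in df.dtypes.items()
    if not (is_object_dtype(dtype) or is_datetime64_any_dtype(dtype))
]

print(keep)  # ['num']
```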

Solution 9 - Python

You can use:

for i in x.columns[x.dtypes == 'object']:
    print(i)

in case you only want to display the column names of a particular dtype rather than a sliced DataFrame (here x is the DataFrame). I don't know if a dedicated function for this exists in Python.

PS: replace 'object' with the dtype you want.
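
If a plain list of names is more convenient than a loop, the following sketch shows two ways to get one. The frame is hypothetical, and the string columns are created with dtype=object explicitly so the 'object' comparison applies.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two object columns and one float column.
df = pd.DataFrame({
    "name": pd.Series(["a", "b"], dtype=object),
    "score": np.array([1.0, 2.0]),
    "city": pd.Series(["x", "y"], dtype=object),
})

# Index the column labels with the boolean dtype mask ...
object_names = df.columns[df.dtypes == "object"].tolist()

# ... or take the labels of the select_dtypes result.
same_names = df.select_dtypes(include=["object"]).columns.tolist()

print(object_names)  # ['name', 'city']
```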

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type          Original Author              Original Content on Stackoverflow
Question              caner                        View Question on Stackoverflow
Solution 1 - Python   Andy Hayden                  View Answer on Stackoverflow
Solution 2 - Python   Dan Allan                    View Answer on Stackoverflow
Solution 3 - Python   normonics                    View Answer on Stackoverflow
Solution 4 - Python   MaxU - stop genocide of UA   View Answer on Stackoverflow
Solution 5 - Python   Gurubux                      View Answer on Stackoverflow
Solution 6 - Python   Jake Drew                    View Answer on Stackoverflow
Solution 7 - Python   Anjan Prasad                 View Answer on Stackoverflow
Solution 8 - Python   hui chen                     View Answer on Stackoverflow
Solution 9 - Python   Rahul Bordoloi               View Answer on Stackoverflow