df.drop if it exists

PythonPandas

Python Problem Overview


Below is a function that takes a file and drops column names 'row_num", 'start_date', 'end_date.'

The problem is not every file has each of these column names, so the function returns an error.

My goal is to alter code so that it removes these columns if it exists but does not return an error if the column does not exist.

def read_df(file):
    df = pd.read_csv(file, na_values=['', ' '])
    # Drop useless junk and fill empty values with zero 
    df = df.drop(['row_num','start_date','end_date','symbol'], axis=1).fillna(0)
    df=df[df!=0][:-1].dropna().append(df.iloc[-1])
    return df

Python Solutions


Solution 1 - Python

Add parameter errors to DataFrame.drop:

>errors : {'ignore', 'raise'}, default 'raise' > >If 'ignore', suppress error and only existing labels are dropped.

df = df.drop(['row_num','start_date','end_date','symbol'], axis=1, errors='ignore')

Sample:

df = pd.DataFrame({'row_num':[1,2], 'w':[3,4]})
df = df.drop(['row_num','start_date','end_date','symbol'], axis=1, errors='ignore')
print (df)
   w
0  3
1  4

Solution 2 - Python

In my tests the following was at least as fast as any of the given answers:

candidates=['row_num','start_date','end_date','symbol']
df = df.drop([x for x in candidates if x in df.columns], axis=1)

It has the benefit of readability and (with a small tweak to the code) the ability to record exactly which columns existed/were dropped when.

Some reasons this might be more desireable than the previous solutions:

  • Looping over the items and dropping each column individually if it exists is functional, but quite slow (see benchmarks below).
  • jezrael's answer is very nice, but made me nervous at first (ignoring errors feels bad!). Further looking at the documentation makes it sounds like this is OK though, and simply ignores the error of the column not existing (not other errors that might be undesireable to ignore). My solution may be more readable, especially for those less familiar with optional kwargs in pandas.

Benchmark Results:

![benchmark results

Code for benchmark tests (credit to an answer in this question for how to create this sort of benchmark):

import math
from simple_benchmark import benchmark
import pandas as pd

# setting up the toy df:
def df_creator(length):
    c1=list(range(0,10))
    c2=list('a,b,c,d,e'.split(','))
    c3=list(range(0,5))
    c4=[True,False]
    lists=[c1,c2,c3,c4]
    df=pd.DataFrame()
    count=0
    for x in lists:
        count+=1
        df['col'+str(count)]=x*math.floor(length/len(x))
    return df

# setting up benchmark test:
def list_comp(df,candidates=['col1','col2','col5','col8']):
    return df.drop([x for x in candidates if x in df.columns], axis=1)

def looper(df,candidates=['col1','col2','col5','col8']):
    for col in candidates:
        if col in df.columns:
            out = df.drop(columns=col, axis=1)
    return out

def ignore_error(df,candidates=['col1','col2','col5','col8']):
    return df.drop(candidates, axis=1, errors='ignore')

functions=[list_comp,looper,ignore_error]

args={n : df_creator(n) for n in [10,100,1000,10000,100000]}
argname='df_length'
b=benchmark(functions,args,argname)
b.plot()

Solution 3 - Python

Just use Pandas Filter, the Pythonic Way

Oddly, No answers use the pandas dataframe filter method

thisFilter = df.filter(drop_list)
df.drop(thisFilter, inplace=True, axis=1)

This will create a filter from the drop_list that exists in df, then drop thisFilter from the df inplace on axis=1

i.e., drop the columns that match the drop_list and don't error if they are nonexistent

Solution 4 - Python

I just had to do this; here's what I did:

# Drop these columns if they exist
cols = ['Billing Address Street 1', 'Billing Address Street 2','Billing Company']
for col in cols:
    if col in df.columns:
        df = df.drop(columns=col, axis=1)

Might not be the best way, but it served it's purpose.

Solution 5 - Python

x = ['row_num','start_date','end_date','symbol']

To check if column exists then You can do:

for i in x:
    if i in df:
        df = df.drop(['row_num','start_date','end_date','symbol'], axis=1).fillna(0)

or

for i in x:
    if i in df.columns:
        df = df.drop(['row_num','start_date','end_date','symbol'], axis=1).fillna(0)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionZJAYView Question on Stackoverflow
Solution 1 - PythonjezraelView Answer on Stackoverflow
Solution 2 - PythonamquackView Answer on Stackoverflow
Solution 3 - PythonTimothy L.J. StewartView Answer on Stackoverflow
Solution 4 - PythonCoffee and CodeView Answer on Stackoverflow
Solution 5 - PythonKhakhar ShyamView Answer on Stackoverflow