df.drop if it exists
PythonPandasPython Problem Overview
Below is a function that takes a file and drops column names 'row_num", 'start_date', 'end_date.'
The problem is not every file has each of these column names, so the function returns an error.
My goal is to alter code so that it removes these columns if it exists but does not return an error if the column does not exist.
def read_df(file):
df = pd.read_csv(file, na_values=['', ' '])
# Drop useless junk and fill empty values with zero
df = df.drop(['row_num','start_date','end_date','symbol'], axis=1).fillna(0)
df=df[df!=0][:-1].dropna().append(df.iloc[-1])
return df
Python Solutions
Solution 1 - Python
Add parameter errors
to DataFrame.drop
:
>errors : {'ignore', 'raise'}, default 'raise' > >If 'ignore', suppress error and only existing labels are dropped.
df = df.drop(['row_num','start_date','end_date','symbol'], axis=1, errors='ignore')
Sample:
df = pd.DataFrame({'row_num':[1,2], 'w':[3,4]})
df = df.drop(['row_num','start_date','end_date','symbol'], axis=1, errors='ignore')
print (df)
w
0 3
1 4
Solution 2 - Python
In my tests the following was at least as fast as any of the given answers:
candidates=['row_num','start_date','end_date','symbol']
df = df.drop([x for x in candidates if x in df.columns], axis=1)
It has the benefit of readability and (with a small tweak to the code) the ability to record exactly which columns existed/were dropped when.
Some reasons this might be more desireable than the previous solutions:
- Looping over the items and dropping each column individually if it exists is functional, but quite slow (see benchmarks below).
- jezrael's answer is very nice, but made me nervous at first (ignoring errors feels bad!). Further looking at the documentation makes it sounds like this is OK though, and simply ignores the error of the column not existing (not other errors that might be undesireable to ignore). My solution may be more readable, especially for those less familiar with optional kwargs in pandas.
Benchmark Results:
Code for benchmark tests (credit to an answer in this question for how to create this sort of benchmark):
import math
from simple_benchmark import benchmark
import pandas as pd
# setting up the toy df:
def df_creator(length):
c1=list(range(0,10))
c2=list('a,b,c,d,e'.split(','))
c3=list(range(0,5))
c4=[True,False]
lists=[c1,c2,c3,c4]
df=pd.DataFrame()
count=0
for x in lists:
count+=1
df['col'+str(count)]=x*math.floor(length/len(x))
return df
# setting up benchmark test:
def list_comp(df,candidates=['col1','col2','col5','col8']):
return df.drop([x for x in candidates if x in df.columns], axis=1)
def looper(df,candidates=['col1','col2','col5','col8']):
for col in candidates:
if col in df.columns:
out = df.drop(columns=col, axis=1)
return out
def ignore_error(df,candidates=['col1','col2','col5','col8']):
return df.drop(candidates, axis=1, errors='ignore')
functions=[list_comp,looper,ignore_error]
args={n : df_creator(n) for n in [10,100,1000,10000,100000]}
argname='df_length'
b=benchmark(functions,args,argname)
b.plot()
Solution 3 - Python
Just use Pandas Filter, the Pythonic Way
Oddly, No answers use the pandas
dataframe filter method
thisFilter = df.filter(drop_list)
df.drop(thisFilter, inplace=True, axis=1)
This will create a filter from the drop_list
that exists in df
, then drop thisFilter
from the df
inplace
on axis=1
i.e., drop the columns that match the drop_list
and don't error if they are nonexistent
Solution 4 - Python
I just had to do this; here's what I did:
# Drop these columns if they exist
cols = ['Billing Address Street 1', 'Billing Address Street 2','Billing Company']
for col in cols:
if col in df.columns:
df = df.drop(columns=col, axis=1)
Might not be the best way, but it served it's purpose.
Solution 5 - Python
x = ['row_num','start_date','end_date','symbol']
To check if column exists then You can do:
for i in x:
if i in df:
df = df.drop(['row_num','start_date','end_date','symbol'], axis=1).fillna(0)
or
for i in x:
if i in df.columns:
df = df.drop(['row_num','start_date','end_date','symbol'], axis=1).fillna(0)