How to keep leading zeros in a column when reading CSV with Pandas?

Python Problem Overview

I am importing study data into a Pandas data frame using read_csv.

My subject codes are 6-digit numbers that encode, among other things, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").

When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.

Is there a way to import this column unchanged, maybe as a string?

I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.

Python Solutions

Solution 1 - Python

As indicated in this question/answer by Lev Landau, a simple solution is to use the converters option of the read_csv function for the column in question:

converters={'column_name': lambda x: str(x)}

You can find more options of the read_csv function in the pandas.io.parsers.read_csv documentation.

Let's say I have a CSV file projects.csv like the one below:

project_name,project_id
Some Project,000245
Another Project,000478

For example, the code below trims the leading zeros:

from pandas import read_csv

# Without extra options, project_id is inferred as an integer column
dataframe = read_csv('projects.csv')
print(dataframe)

Result:

me@ubuntu:~$ python test_dataframe.py 
      project_name  project_id
0     Some Project         245
1  Another Project         478
me@ubuntu:~$

Solution code example:

from pandas import read_csv

# The converter keeps each project_id cell as the string found in the file
dataframe = read_csv('projects.csv', converters={'project_id': lambda x: str(x)})
print(dataframe)

Required result:

me@ubuntu:~$ python test_dataframe.py 
      project_name project_id
0     Some Project     000245
1  Another Project     000478
me@ubuntu:~$

Update as it helps others:

To have all columns as str, one can do this (from the comment):

pd.read_csv('sample.csv', dtype = str)

To read only selected columns as str, one can do this:

# list of column names that need to be strings
lst_str_cols = ['prefix', 'serial']
# use a dictionary comprehension to build the dict of dtypes
dict_dtypes = {x: 'str' for x in lst_str_cols}
# pass the dict as dtype
pd.read_csv('sample.csv', dtype=dict_dtypes)
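
The difference between the two calls above can be sketched as follows. The inline CSV text and the extra count column are made up for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical sample data; 'prefix' and 'serial' carry leading zeros.
csv_text = "prefix,serial,count\n007,000245,3\n012,000478,10\n"

# dtype=str: every column is read as text, including the numeric 'count'.
all_str = pd.read_csv(io.StringIO(csv_text), dtype=str)

# dtype as a dict: only the listed columns are text; 'count' is still inferred.
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
selective = pd.read_csv(io.StringIO(csv_text), dtype=dict_dtypes)

print(all_str['count'].tolist())     # values are strings
print(selective['count'].tolist())   # values are integers
print(selective['serial'].tolist())  # leading zeros are kept
```

With dtype=str the count column would have to be converted back before doing arithmetic, which is the main reason to prefer the dict when only a few columns hold codes.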

Solution 2 - Python

Here is a shorter, robust and fully working solution:

Simply define a mapping (dictionary) between variable names and desired data types:

dtype_dic = {'subject_id': str,
             'subject_number': 'float'}

Use that mapping with pd.read_csv():

df = pd.read_csv(yourdata, dtype=dtype_dic)

et voila!
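
A runnable version of this, assuming a small in-memory CSV modelled on the subject codes from the question (the data values are invented, and io.StringIO plays the role of yourdata):

```python
import io

import pandas as pd

# Hypothetical data: 6-digit subject codes, one of them with a leading zero.
yourdata = io.StringIO("subject_id,subject_number\n010816,1\n231190,2\n")

dtype_dic = {'subject_id': str,
             'subject_number': 'float'}

df = pd.read_csv(yourdata, dtype=dtype_dic)

print(df['subject_id'].tolist())   # codes stay as text
print(df['subject_number'].dtype)  # numeric column is read as float
```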

Solution 3 - Python

If you have a lot of columns and don't know which ones contain leading zeros that might get lost, or you just need to automate your code, you can do the following:

df = pd.read_csv("your_file.csv", nrows=1)  # Read only the first row to get the column names
col_str_dic = {column: str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic)  # Now you can read the complete file

You could also do:

df = pd.read_csv("your_file.csv", dtype=str)

By doing this you will have all your columns as strings and you won't lose any leading zeros.
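
One thing to keep in mind with all-string columns is that empty cells are still read as NaN by default. A small sketch of this, assuming an invented two-column CSV held in memory; it also uses nrows=0, which parses only the header line when all you need is the column names:

```python
import io

import pandas as pd

# Hypothetical data; the second row has no value in 'code'.
csv_text = "code,site\n000245,A\n,B\n"

# nrows=0 parses only the header, which is enough to get the column names.
columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
col_str_dic = {column: str for column in columns}

# Empty cells are still turned into NaN, even in string columns.
df = pd.read_csv(io.StringIO(csv_text), dtype=col_str_dic)
print(df['code'].tolist())

# keep_default_na=False keeps empty cells as empty strings instead.
df_no_nan = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
print(df_no_nan['code'].tolist())
```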

Solution 4 - Python

You can do this; it works on all versions of pandas:

pd.read_csv('filename.csv', dtype={'zero_column_name': object})

Solution 5 - Python

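You can use converters to pad the values to a fixed width if you know the width. read_csv passes each cell to the converter as a string, and the zero-padding format spec is meant for numbers, so the value is converted to int first (this assumes every cell in the column holds digits).

For example, if the width is 5, then

data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{int(x):05}"})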

This will do the trick. It works for pandas==0.23.0 and also with read_excel.

Python 3.6 or higher is required for the f-string.
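
If the column has already been loaded as integers, the same fixed-width idea can be applied afterwards. This is a minimal sketch using the pandas string method zfill rather than a converter; the frame contents and the width variable are made up for illustration:

```python
import pandas as pd

# Hypothetical frame where the zeros are already gone (column read as int64).
data = pd.DataFrame({'column1': [245, 478, 12345]})

width = 5  # assumed known width of the codes

# Convert to text, then left-pad with zeros up to the known width.
data['column1'] = data['column1'].astype(str).str.zfill(width)
print(data['column1'].tolist())
```

This only restores the zeros correctly when every code really has the same width.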

Solution 6 - Python

I don't think you can specify a column type the way you want (unless there have been changes recently, or the 6-digit number is a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
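
A rough sketch of the np.genfromtxt() route, assuming the projects.csv content from Solution 1 held in memory and assuming that reading every field as text is acceptable; the column names are passed by hand because the header row is skipped:

```python
import io

import numpy as np
import pandas as pd

# Same content as the projects.csv example, kept in memory for the sketch.
csv_text = "project_name,project_id\nSome Project,000245\nAnother Project,000478\n"

# dtype=str keeps every field as text; skip_header=1 drops the header row.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=',', dtype=str, skip_header=1)

# Build the DataFrame from the 2-D array of strings.
dataframe = pd.DataFrame(raw, columns=['project_name', 'project_id'])
print(dataframe)
```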

EDIT: Take a look at Wes McKinney's blog, there might be something for you. It seems that a new parser is coming with pandas 0.10 in November.

Content Type	Original Author	Original Content on Stackoverflow
Question	user1802883	View Question on Stackoverflow
Solution 1 - Python	baltasvejas	View Answer on Stackoverflow
Solution 2 - Python	ℕʘʘḆḽḘ	View Answer on Stackoverflow
Solution 3 - Python	Erick Rodriguez	View Answer on Stackoverflow
Solution 4 - Python	user11669928	View Answer on Stackoverflow
Solution 5 - Python	secsilm	View Answer on Stackoverflow
Solution 6 - Python	root	View Answer on Stackoverflow