Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign

Python, Csv, Utf-8, Pandas

Python Problem Overview


I've read something about a Python 2 limitation with respect to Pandas' to_csv( ... etc ...). Have I hit it? I'm on Python 2.7.3.

The call below produces trash characters for ≥ and the minus sign when they appear in strings. Aside from that, the export is perfect.

df.to_csv("file.csv", encoding="utf-8") 

Is there any workaround?

df.head() looks like this:

demography  Adults ≥49 yrs  Adults 18−49 yrs at high risk||  \
state                                                           
Alabama                 32.7                             38.6   
Alaska                  31.2                             33.2   
Arizona                 22.9                             38.8   
Arkansas                31.2                             34.0   
California              29.8                             38.8  

The CSV output is this:

state,Adults ≥49 yrs,Adults 18−49 yrs at high risk||
0,Alabama,32.7,38.6
1,Alaska,31.2,33.2
2,Arizona,22.9,38.8
3,Arkansas,31.2,34
4,California,29.8,38.8

The whole code is this:

import pandas
import xlrd
import csv
import json

df = pandas.DataFrame()
dy = pandas.DataFrame()
# first merge all this xls together


workbook = xlrd.open_workbook('csv_merger/vaccoverage.xls')
worksheets = workbook.sheet_names()


for i in range(3,len(worksheets)):
    dy = pandas.io.excel.read_excel(workbook, i, engine='xlrd', index=None)
    i = i+1
    df = df.append(dy)

df.index.name = "index"

df.columns = ['demography', 'area','state', 'month', 'rate', 'moe']

#Then just grab month = 'May'

may_mask = df['month'] == "May"
may_df = (df[may_mask])

#then delete some columns we dont need

may_df = may_df.drop('area', 1)
may_df = may_df.drop('month', 1)
may_df = may_df.drop('moe', 1)


print may_df.dtypes #uh oh, it sees 'rate' as type 'object', not 'float'.  Better change that.

may_df = may_df.convert_objects('rate', convert_numeric=True)

print may_df.dtypes #that's better

res = may_df.pivot_table('rate', 'state', 'demography')
print res.head()


#and this is going to spit out an array of Objects, each Object a state containing its demographics
res.reset_index().to_json("thejson.json", orient='records')
#and a .csv for good measure
res.reset_index().to_csv("thecsv.csv", orient='records', encoding="utf-8")

Python Solutions


Solution 1 - Python

Your "bad" output is UTF-8 displayed as CP1252.

On Windows, many editors assume the default ANSI encoding (CP1252 on US Windows) instead of UTF-8 if there is no byte order mark (BOM) character at the start of the file. While a BOM is meaningless to the UTF-8 encoding, its UTF-8-encoded presence serves as a signature for some programs. For example, Microsoft Office's Excel requires it even on non-Windows OSes. Try:

df.to_csv('file.csv', encoding='utf-8-sig')

That encoder will add the BOM.
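
As a quick check, 'utf-8-sig' is ordinary UTF-8 with the three BOM bytes EF BB BF prepended, which you can verify by reading the file back in binary mode (a minimal sketch, assuming write access to the working directory; the toy column name is only an illustration):

import pandas as pd

df = pd.DataFrame({u'Adults \u226549 yrs': [32.7, 31.2]})
df.to_csv('file.csv', encoding='utf-8-sig', index=False)

# The first three bytes are the UTF-8 BOM that Excel uses as a signature.
with open('file.csv', 'rb') as f:
    print(repr(f.read(3)))   # '\xef\xbb\xbf'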

Solution 2 - Python

encoding='utf-8-sig' does not work for me. Excel now reads the special characters fine, but the Tab separators are gone! However, encoding='utf-16' does work correctly: the special characters are OK and the Tab separators survive. This is the solution for me.
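
For reference, a minimal sketch of that approach (not part of the original answer; the file name, the toy data, and sep='\t' are assumptions, since the answer implies a tab-separated export):

import pandas as pd

df = pd.DataFrame({u'state': [u'Alabama', u'Alaska'],
                   u'Adults \u226549 yrs': [32.7, 31.2]})
# UTF-16 keeps the special characters, and the explicit tab separator
# survives when Excel opens the file directly.
df.to_csv('file.txt', sep='\t', encoding='utf-16', index=False)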

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type         Original Author   Original Content on Stackoverflow
Question             Maggie            View Question on Stackoverflow
Solution 1 - Python  Mark Tolonen      View Answer on Stackoverflow
Solution 2 - Python  germ              View Answer on Stackoverflow