pandas.read_csv from string or package data

PythonCsvPython 3.xNumpyPandas

Python Problem Overview


I have some csv text data in a package which I want to read using read_csv. I was doing this by

from pkgutil import get_data
from StringIO import StringIO

data = read_csv(StringIO(get_data('package.subpackage', 'path/to/data.csv')))

However, StringIO.StringIO disappears in Python 3, and io.StringIO only accepts Unicode. Is there a simple way to do this?

Edit: the following does not appear to work

import pandas as pd

import pkgutil
from io import StringIO

def get_data_file(pkg, path):
    f = StringIO()
    contents = unicode(pkgutil.get_data('pymc.examples', 'data/wells.dat'))
    f.write(contents)
    return f

wells = get_data_file('pymc.examples', 'data/wells.dat')

data = pd.read_csv(wells, delimiter=' ', index_col='id',
                   dtype={'switch': np.int8})

failing with

  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 209, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 509, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 611, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 893, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 441, in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3940)
  File "parser.pyx", line 551, in pandas._parser.TextReader._get_header (pandas/src/parser.c:5096)
pandas._parser.CParserError: Passed header=0 but only 0 lines in file

Python Solutions


Solution 1 - Python

To pass a string to pandas read_csv(), you can use io.StringIO, i.e.:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("csv string..."))

Solution 2 - Python

The following worked for me in 3.3:

>>> import numpy as np, pandas as pd
>>> import io, pkgutil
>>> wells = pkgutil.get_data('pymc.examples', 'data/wells.dat')
>>> type(wells)
<class 'bytes'>
>>> df = pd.read_csv(io.BytesIO(wells), encoding='utf8', sep=" ", index_col="id", dtype={"switch": np.int8})
>>> df.head()
    switch  arsenic       dist  assoc  educ
id                                         
1        1     2.36  16.826000      0     0
2        1     0.71  47.321999      0     0
3        0     2.07  20.966999      0    10
4        1     1.15  21.486000      0    12
5        1     1.10  40.874001      1    14

[5 rows x 5 columns]

N.B. I had to manually put wells.dat in that location, so I can't swear I copied it correctly and that there isn't terminal whitespace, because I deleted some. But passing read_csv a BytesIO object and an encoding parameter should work. (Actually, you can probably get away without it, but it's a good habit. io.TextIOWrapper might be another option.)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionJohn SalvatierView Question on Stackoverflow
Solution 1 - PythonPedro LobitoView Answer on Stackoverflow
Solution 2 - PythonDSMView Answer on Stackoverflow