Read csv from Google Cloud storage to pandas dataframe

Python, Pandas, Csv, Google-Cloud-Platform, Google-Cloud-Storage

Python Problem Overview


I am trying to read a csv file from a Google Cloud Storage bucket into a pandas DataFrame.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from io import BytesIO

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('createbucket123')
blob = bucket.blob('my.csv')
path = "gs://createbucket123/my.csv"
df = pd.read_csv(path)

It shows this error message:

FileNotFoundError: File b'gs://createbucket123/my.csv' does not exist

What am I doing wrong? I am not able to find any solution that does not involve Google Datalab.

Python Solutions


Solution 1 - Python

UPDATE

As of pandas version 0.24, read_csv supports reading directly from Google Cloud Storage. Simply provide the gs:// path to the file like this:

df = pd.read_csv('gs://bucket/your_path.csv')

read_csv will then use the gcsfs module to read the DataFrame, which means gcsfs has to be installed (otherwise you will get an exception pointing at the missing dependency).

I leave three other options for the sake of completeness.

  • Home-made code
  • gcsfs
  • dask

I will cover them below.

The hard way: do-it-yourself code

I have written some convenience functions to read from Google Storage. To make them more readable I added type annotations. If you happen to be on Python 2, simply remove these and the code will work all the same.

It works equally well on public and private data sets, assuming you are authorised. With this approach you don't need to download the data to your local drive first.

How to use it:

fileobj = get_byte_fileobj('my-project', 'my-bucket', 'my-path')
df = pd.read_csv(fileobj)

The code:

from io import BytesIO, StringIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project: str,
                     bucket: str,
                     path: str,
                     service_account_credentials_path: str = None) -> BytesIO:
    """
    Retrieve data from a given blob on Google Storage and pass it as a file object.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: file object (BytesIO)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def get_bytestring(project: str,
                   bucket: str,
                   path: str,
                   service_account_credentials_path: str = None) -> bytes:
    """
    Retrieve data from a given blob on Google Storage and pass it as a byte-string.
    :param path: path within the bucket
    :param project: name of the project
    :param bucket: name of the bucket
    :param service_account_credentials_path: path to credentials.
           TIP: can be stored as env variable, e.g. os.getenv('GOOGLE_APPLICATION_CREDENTIALS_DSPLATFORM')
    :return: byte-string (needs to be decoded)
    """
    blob = _get_blob(bucket, path, project, service_account_credentials_path)
    s = blob.download_as_string()
    return s


def _get_blob(bucket_name, path, project, service_account_credentials_path):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
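
For completeness, here is a quick sketch of how the second helper, get_bytestring, could be used as well (the project, bucket and path names are placeholders):

csv_bytes = get_bytestring('my-project', 'my-bucket', 'my-path')
# The byte-string needs to be decoded before pandas can parse it
df = pd.read_csv(StringIO(csv_bytes.decode('utf-8')))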

gcsfs

gcsfs is a "Pythonic file-system for Google Cloud Storage".

How to use it:

import pandas as pd
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)

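If you need to authenticate explicitly (for example when running outside GCP), gcsfs accepts a token argument; a minimal sketch, assuming a service account key file at a placeholder path:

import pandas as pd
import gcsfs

# 'token' can point to a service account key file (the path below is a placeholder)
fs = gcsfs.GCSFileSystem(project='my-project', token='path/to/credentials.json')
with fs.open('bucket/path.csv') as f:
    df = pd.read_csv(f)
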
dask

Dask "provides advanced parallelism for analytics, enabling performance at scale for the tools you love". It's great when you need to deal with large volumes of data in Python. Dask tries to mimic much of the pandas API, making it easy to use for newcomers.

Dask provides its own read_csv, which accepts gs:// paths directly.

How to use it:

import dask.dataframe as dd

df = dd.read_csv('gs://bucket/data.csv')
df2 = dd.read_csv('gs://bucket/path/*.csv') # nice!

# df is now a Dask dataframe, ready for distributed processing
# If you want the pandas version, simply call:
df_pd = df.compute()

Solution 2 - Python

Another option is to use TensorFlow which comes with the ability to do a streaming read from Google Cloud Storage:

from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://bucket/file.csv', 'r') as f:
  df = pd.read_csv(f)

Using tensorflow also gives you a convenient way to handle wildcards in the filename. For example:

Reading wildcard CSV into Pandas

Here is code that will read all CSVs that match a specific pattern (e.g. gs://bucket/some/dir/train-*) into a Pandas dataframe:

import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io
import pandas as pd

def read_csv_file(filename):
  with file_io.FileIO(filename, 'r') as f:
    df = pd.read_csv(f, header=None, names=['col1', 'col2'])
    return df
   
def read_csv_files(filename_pattern):
  filenames = tf.gfile.Glob(filename_pattern)
  dataframes = [read_csv_file(filename) for filename in filenames]
  return pd.concat(dataframes)

Usage:

DATADIR='gs://my-bucket/some/dir'
traindf = read_csv_files(os.path.join(DATADIR, 'train-*'))
evaldf = read_csv_files(os.path.join(DATADIR, 'eval-*'))

Solution 3 - Python

As of pandas==0.24.0 this is supported natively if you have gcsfs installed: https://github.com/pandas-dev/pandas/pull/22704.

Until the official release you can try it out with pip install pandas==0.24.0rc1.
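
Once gcsfs is installed, usage is a plain read_csv call with a gs:// path; a minimal sketch with a placeholder bucket and file name:

import pandas as pd

# Requires pandas >= 0.24 and the gcsfs package
df = pd.read_csv('gs://my-bucket/path/to/file.csv')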

Solution 4 - Python

I was taking a look at this question and didn't want to go through the hassle of installing another library, gcsfs, which literally says in its documentation, "This software is beta, use at your own risk"... but I found a great workaround that I wanted to post here in case it is helpful to anyone else, using just the google.cloud.storage library and some native Python libraries. Here's the function:

import pandas as pd
from google.cloud import storage
import os
import io
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path/to/creds.json'


def gcp_csv_to_df(bucket_name, source_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_file_name)
    data = blob.download_as_string()
    df = pd.read_csv(io.BytesIO(data))
    print(f'Pulled down file from bucket {bucket_name}, file name: {source_file_name}')
    return df

Further, although it is outside of the scope of this question, if you would like to upload a pandas dataframe to GCP using a similar function, here is the code to do so:

def df_to_gcp_csv(df, dest_bucket_name, dest_file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(dest_bucket_name)
    blob = bucket.blob(dest_file_name)
    blob.upload_from_string(df.to_csv(), 'text/csv')
    print(f'DataFrame uploaded to bucket {dest_bucket_name}, file name: {dest_file_name}')

Hope this is helpful! I know I'll be using these functions for sure.
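
As a quick usage sketch of the two functions above (bucket and file names are placeholders):

# Download a csv from GCS into a dataframe
df = gcp_csv_to_df('my-bucket', 'data/my.csv')

# Upload the (possibly modified) dataframe back to GCS
df_to_gcp_csv(df, 'my-bucket', 'data/my_processed.csv')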

Solution 5 - Python

read_csv (in pandas versions before 0.24) does not support gs://

From the documentation:

> The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/table.csv

You can download the file or fetch it as a string in order to manipulate it.
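
A minimal sketch of the fetch-as-a-string approach with the google-cloud-storage client (the bucket and object names are placeholders):

from io import StringIO

import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my-bucket')
blob = bucket.blob('path/to/my.csv')

# Download the object contents and hand them to pandas as an in-memory buffer
csv_text = blob.download_as_string().decode('utf-8')
df = pd.read_csv(StringIO(csv_text))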

Solution 6 - Python

Since pandas 1.2 it's super easy to load files from Google Cloud Storage into a DataFrame.

If you work on your local machine it looks like this:

df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "credentials.json"})

It's important that you pass the credentials.json file from Google as the token.

If you work on Google Cloud, do this:

df = pd.read_csv('gcs://your-bucket/path/data.csv.gz',
                 storage_options={"token": "cloud"})

Solution 7 - Python

There are three ways of accessing files in GCS:

  1. Downloading the client library (this is the one for you)
  2. Using Cloud Storage Browser in the Google Cloud Platform Console
  3. Using gsutil, a command-line tool for working with files in Cloud Storage.

Using option 1, set up GCS for your work. After that you have to:

import cloudstorage as gcs
from google.appengine.api import app_identity

Then you have to specify the Cloud Storage bucket name and create read/write functions to access your bucket:
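
A rough sketch of what such functions could look like with the cloudstorage client (the bucket and file names below are placeholders, and this only illustrates the idea rather than replacing the tutorial):

import cloudstorage as gcs
import pandas as pd
from google.appengine.api import app_identity

# Use the app's default bucket, or hard-code your own bucket name
bucket_name = app_identity.get_default_gcs_bucket_name()
filename = '/' + bucket_name + '/my.csv'

def read_csv_from_gcs(filename):
    # cloudstorage paths have the form /bucket_name/object_name
    gcs_file = gcs.open(filename)
    df = pd.read_csv(gcs_file)
    gcs_file.close()
    return df

def write_csv_to_gcs(df, filename):
    gcs_file = gcs.open(filename, 'w', content_type='text/csv')
    gcs_file.write(df.to_csv())
    gcs_file.close()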

You can find the remaining read/write tutorial here:

Solution 8 - Python

Using the pandas and google-cloud-storage Python packages:

First, we upload a file to the bucket in order to get a fully working example:

import pandas as pd
from sklearn.datasets import load_iris

dataset = load_iris()

data_df = pd.DataFrame(
    dataset.data,
    columns=dataset.feature_names)

data_df.head()
Out[1]: 
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

Upload a csv file to the bucket (GCP credentials setup is required, read more in here):

from io import StringIO
from google.cloud import storage

bucket_name = 'my-bucket-name' # Replace it with your own bucket name.
data_path = 'somepath/data.csv'

# Get Google Cloud client
client = storage.Client()

# Get bucket object
bucket = client.get_bucket(bucket_name)

# Get blob object (this is pointing to the data_path)
data_blob = bucket.blob(data_path)

# Upload a csv to google cloud storage
data_blob.upload_from_string(
    data_df.to_csv(), 'text/csv')

Now that we have a csv in the bucket, use pd.read_csv and pass it the content of the file.

# Read from bucket
data_str = data_blob.download_as_text()

# Instantiate dataframe
data_downloaded_df = pd.read_csv(StringIO(data_str))

data_downloaded_df.head()
Out[2]: 
   Unnamed: 0  sepal length (cm)  ...  petal length (cm)  petal width (cm)
0           0                5.1  ...                1.4               0.2
1           1                4.9  ...                1.4               0.2
2           2                4.7  ...                1.3               0.2
3           3                4.6  ...                1.5               0.2
4           4                5.0  ...                1.4               0.2

[5 rows x 5 columns]

When comparing this approach with the pd.read_csv('gs://my-bucket/file.csv') approach, I found that the approach described here makes it more explicit that client = storage.Client() is the one taking care of the authentication (which can be very handy when working with multiple credentials). Also, storage.Client comes already installed if you run this code on a resource from Google Cloud Platform, whereas for pd.read_csv('gs://my-bucket/file.csv') you need to have installed the gcsfs package, which allows pandas to access Google Storage.
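
As an illustration of the multiple-credentials point, the client can be built with explicit service account credentials (the key file path below is a placeholder):

from google.cloud import storage
from google.oauth2 import service_account

# Each client can carry its own credentials, e.g. for a different project or account
credentials = service_account.Credentials.from_service_account_file('path/to/key.json')
client = storage.Client(project='my-project', credentials=credentials)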

Solution 9 - Python

If I understood your question correctly then maybe this link can help you get a better URL for your read_csv() function:

https://cloud.google.com/storage/docs/access-public-data
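
For publicly readable objects, one option (a sketch with a placeholder bucket and object name) is to point read_csv at the object's public HTTPS URL:

import pandas as pd

# Works only if the object is publicly accessible
df = pd.read_csv('https://storage.googleapis.com/my-bucket/path/to/my.csv')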

Solution 10 - Python

You will still need to have gcsfs installed when loading compressed files.

Trying pd.read_csv('gs://your-bucket/path/data.csv.gz') with pandas version 0.25.3 gave the following error:

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    438     # See https://github.com/python/mypy/issues/1297
    439     fp_or_buf, _, compression, should_close = get_filepath_or_buffer(
--> 440         filepath_or_buffer, encoding, compression
    441     )
    442     kwds["compression"] = compression

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    211 
    212     if is_gcs_url(filepath_or_buffer):
--> 213         from pandas.io import gcs
    214 
    215         return gcs.get_filepath_or_buffer(

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/io/gcs.py in <module>
      3 
      4 gcsfs = import_optional_dependency(
----> 5     "gcsfs", extra="The gcsfs library is required to handle GCS files"
      6 )
      7 

/opt/conda/anaconda/lib/python3.6/site-packages/pandas/compat/_optional.py in import_optional_dependency(name, extra, raise_on_missing, on_version)
     91     except ImportError:
     92         if raise_on_missing:
---> 93             raise ImportError(message.format(name=name, extra=extra)) from None
     94         else:
     95             return None

ImportError: Missing optional dependency 'gcsfs'. The gcsfs library is required to handle GCS files Use pip or conda to install gcsfs.
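
With gcsfs installed (for example via pip install gcsfs), the same call should work, including for compressed files; a minimal sketch with the same placeholder path:

import pandas as pd

# pandas infers gzip compression from the .gz extension and uses gcsfs for the gs:// path
df = pd.read_csv('gs://your-bucket/path/data.csv.gz')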

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | user1838940 | View Question on Stackoverflow
Solution 1 - Python | Lukasz Tracewski | View Answer on Stackoverflow
Solution 2 - Python | Lak | View Answer on Stackoverflow
Solution 3 - Python | bnaul | View Answer on Stackoverflow
Solution 4 - Python | Lle.4 | View Answer on Stackoverflow
Solution 5 - Python | Burhan Khalid | View Answer on Stackoverflow
Solution 6 - Python | MarkusOdenthal | View Answer on Stackoverflow
Solution 7 - Python | Ahmad M. | View Answer on Stackoverflow
Solution 8 - Python | Raul | View Answer on Stackoverflow
Solution 9 - Python | shubham | View Answer on Stackoverflow
Solution 10 - Python | Ashwin Kasilingam | View Answer on Stackoverflow