Importing data into R from a Google spreadsheet

R, Google Sheets

R Problem Overview


There seems to be a change in the Google Spreadsheets publishing options. It is no longer possible to publish to the web as a CSV or tab file (see this recent post). Thus the usual way of using RCurl to import data into R from a Google spreadsheet no longer works:

require(RCurl)
# Published-to-web CSV URL of the spreadsheet
u <- "https://docs.google.com/spreadsheet/pub?hl=en_GB&hl=en_GB&key=0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE&single=true&gid=4&output=csv"
# Fetch the raw CSV text, skipping SSL certificate verification
tc <- getURL(u, ssl.verifypeer=FALSE)
net <- read.csv(textConnection(tc))

Does anyone have a work-around?

R Solutions


Solution 1 - R

I just wrote a simple package to solve exactly this problem: downloading a Google sheet using just the URL.

install.packages('gsheet')
library(gsheet)
gsheet2tbl('docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo')

More detail is here: https://github.com/maxconway/gsheet

Solution 2 - R

Use the googlesheets4 package, a Google Sheets R API by Jenny Bryan. It is the best way to analyze and edit Google Sheets data in R. Not only can it pull data from Google Sheets, it can also edit the data in Google Sheets, create new sheets, and so on.

The package can be installed with install.packages("googlesheets4").

There's a vignette for getting started; see her GitHub repository for more. You can also install the latest development version of the package from that GitHub page, if desired.
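For example, here is a minimal sketch for reading a publicly shared sheet without authenticating (the URL is a placeholder; substitute your own sheet's URL or ID):

library(googlesheets4)

# For a public ("anyone with the link") sheet, skip OAuth entirely
gs4_deauth()

# read_sheet() accepts a full sheet URL or just the sheet ID
df <- read_sheet("https://docs.google.com/spreadsheets/d/<sheet-id>/edit")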

Solution 3 - R

I am working on a solution for this. Here is a function that works on your data as well as a few of my own Google Spreadsheets.

First, we need a function to read from Google sheets. readGoogleSheet() will return a list of data frames, one for each table found on the Google sheet:

readGoogleSheet <- function(url, na.string="", header=TRUE){
  stopifnot(require(XML))
  # Suppress warnings because Google docs seems to have an incomplete final line
  suppressWarnings({
    doc <- paste(readLines(url), collapse=" ")
  })
  if(nchar(doc) == 0) stop("No content found")
  # Extract the <table> markup from the published HTML page
  htmlTable <- gsub("^.*?(<table.*</table).*$", "\\1>", doc)
  ret <- readHTMLTable(htmlTable, header=header, stringsAsFactors=FALSE, as.data.frame=TRUE)
  # Convert the designated NA string to real NA values
  lapply(ret, function(x){ x[x == na.string] <- NA; x})
}

Next, we need a function to clean the individual tables. cleanGoogleTable() removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:

cleanGoogleTable <- function(dat, table=1, skip=0, ncols=NA, nrows=-1, header=TRUE, dropFirstCol=NA){
  if(!is.data.frame(dat)){
    dat <- dat[[table]]
  }

  # Drop the first column if it is just row numbering inserted by Google
  if(is.na(dropFirstCol)) {
    firstCol <- na.omit(dat[[1]])
    if(all(firstCol == ".") || all(firstCol == as.character(seq_along(firstCol)))) {
      dat <- dat[, -1]
    }
  } else if(dropFirstCol) {
    dat <- dat[, -1]
  }

  # Skip leading rows before the table starts
  if(skip > 0){
    dat <- dat[-seq_len(skip), ]
  }

  if(nrow(dat) == 1) return(dat)

  # Remove the empty row Google inserts below a frozen header row
  if(nrow(dat) >= 2){
    if(all(is.na(dat[2, ]))) dat <- dat[-2, ]
  }

  # Promote the first row to column names
  if(header && nrow(dat) > 1){
    header <- as.character(dat[1, ])
    names(dat) <- header
    dat <- dat[-1, ]
  }

  # Keep only desired columns
  if(!is.na(ncols)){
    ncols <- min(ncols, ncol(dat))
    dat <- dat[, seq_len(ncols)]
  }

  # Keep only desired rows
  if(nrows > 0){
    nrows <- min(nrows, nrow(dat))
    dat <- dat[seq_len(nrows), ]
  }

  # Renumber the rows
  rownames(dat) <- seq_len(nrow(dat))
  dat
}

Now we are ready to read your Google sheet:

> u <- "https://docs.google.com/spreadsheets/d/0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE/pubhtml"
> g <- readGoogleSheet(u)
> cleanGoogleTable(g, table=1)


         2012-Jan Mobile internet Tanzania
1 Airtel Zantel Vodacom Tigo TTCL Combined


> cleanGoogleTable(g, table=2, skip=1)

                           BUNDLE       FEE VALIDITY     MB    Cost Sh/MB
1             Daily Bundle (20MB)     500/=    1 day     20     500  25.0
2            1 Day bundle (300MB)   3,000/=    1 day    300   3,000  10.0
3             Weekly bundle (3GB)  15,000/=   7 days  3,000  15,000   5.0
4            Monthly bundle (8GB)  70,000/=  30 days  8,000  70,000   8.8
5         Quarterly Bundle (24GB) 200,000/=  90 days 24,000 200,000   8.3
6            Yearly Bundle (96GB) 750,000/= 365 days 96,000 750,000   7.8
7 Handset Browsing Bundle(400 MB)   2,500/=  30 days    400   2,500   6.3
8                        STANDARD      <NA>     <NA>      1    <NA>  <NA>

Solution 4 - R

I'm not sure whether other use cases are more complex or whether something changed in the meantime, but after publishing the spreadsheet in CSV format, this simple one-liner worked for me:

myCSV <- read.csv("http://docs.google.com/spreadsheets/d/1XKeAajiH47jAP0bPkCtS4OdOGTSsjleOXImDrFzxxZQ/pub?output=csv")

R version 3.3.2 (2016-10-31)

Solution 5 - R

There is an easy way to fetch a Google sheet even if you're behind a proxy:

require(RCurl)
# Direct CSV export URL; replace [ID] with your spreadsheet's document ID
fileUrl <- "https://docs.google.com/spreadsheets/d/[ID]/export?format=csv"
fileCSV <- getURL(fileUrl, .opts=list(ssl.verifypeer=FALSE))
fileCSVDF <- read.csv(textConnection(fileCSV))
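
If the proxy is not picked up automatically, you can point RCurl at it explicitly through libcurl options (a sketch; the proxy host and port are placeholders for your own settings):

fileCSV <- getURL(fileUrl,
                  .opts=list(ssl.verifypeer=FALSE,
                             proxy="http://proxy.example.com:8080"))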

Solution 6 - R

A simpler way.

Be sure to match your URL carefully to the format of the example one here. You can get all but the /export?format=csv piece from the Google Spreadsheets edit page. Then manually append this piece to the URL and use it as shown here.

library(RCurl)
library(mosaic)
mydat2 <- fetchGoogle(paste0("https://docs.google.com/spreadsheets/d/",
  "1mAxpSTrjdFv1UrpxwDTpieVJP16R9vkSQrpHV8lVTA8/export?format=csv"))
mydat2

Solution 7 - R

Scrape the HTML table using the httr and XML packages.

library(XML)
library(httr)

url <- "https://docs.google.com/spreadsheets/d/12MK9EFmPww4Vw9P6BShmhOolH1C45Irz0jdzE0QR3hs/pubhtml"

readSpreadsheet <- function(url, sheet = 1){
  library(httr)
  r <- GET(url)
  html <- content(r)
  # Each worksheet is published as an HTML table
  sheets <- readHTMLTable(html, header=FALSE, stringsAsFactors=FALSE)
  df <- sheets[[sheet]]
  dfClean <- function(df){
    # Use the first row as column names, then drop it and the row-number column
    nms <- t(df[1, ])
    names(df) <- nms
    df <- df[-1, -1]
    row.names(df) <- seq(1, nrow(df))
    df
  }
  dfClean(df)
}
df <- readSpreadsheet(url)
df

Solution 8 - R

Publish as CSV doesn't seem to be supported (or at least isn't currently supported) in the new Google Sheets, which is the default for any new sheet you create. You can, though, create a sheet in the old Google Sheets format, which does support publish as CSV, through this link: https://g.co/oldsheets

More details on the new vs. old Sheets are here: https://support.google.com/drive/answer/3541068?p=help_new_sheets&rd=1

Solution 9 - R

Thanks for this solution! It works as well as the old one. I used another fix to get rid of the blank first line: when you simply exclude it, you might accidentally delete a valid observation once the line is unfrozen. The extra instruction in the function deletes any rows that have no time stamp.

readSpreadsheet <- function(url, sheet = 1){
  library(httr)
  r <- GET(url)
  html <- content(r)
  sheets <- readHTMLTable(html, header=FALSE, stringsAsFactors=FALSE)
  df <- sheets[[sheet]]
  dfClean <- function(df){
    nms <- t(df[1, ])
    names(df) <- nms
    df <- df[-1, -1]
    df <- df[df[, 1] != "", ]   ## only keep rows with time stamps
    row.names(df) <- seq(1, nrow(df))
    df
  }
  dfClean(df)
}

Solution 10 - R

It is still (as of May 2015) possible to get a CSV file out of Google Spreadsheets, using the hidden <sheeturl>/export?format=csv URL trick.

However, after solving this problem, one encounters another: numbers are formatted according to the locale of the sheet, e.g. you may get 1,234.15 in a "US" sheet or 1.234,15 in a "German" sheet. To check a sheet's locale, go to File > Spreadsheet Settings in Google Docs.

Now you need to remove the thousands separator from the numeric columns so that R can parse them; depending on how large your numbers are, the separator may occur more than once per value, which is why the function below loops over each line. A simple function I wrote to accomplish this:

# helper function to load a Google sheet and strip the thousands separator (,)
getGoogleDataset <- function(id) {
  download.file(paste0('https://docs.google.com/spreadsheets/d/', id, '/export?format=csv'),
                'google-ds.csv', 'curl')
  lines <- scan('google-ds.csv', character(0), sep="\n")

  # Strip one comma at a time from quoted numbers such as "1,234,567"
  pattern <- "\"([0-9]+),([0-9]+)"
  for (i in seq_along(lines)) {
    while (length(grep(pattern, lines[i])) > 0) {
      lines[i] <- gsub(pattern, "\"\\1\\2", lines[i])
    }
  }
  read.csv(textConnection(lines))
}

You will need to require(utils) and have curl installed, but no other extra packages.
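
Usage is then a one-liner (the id below is a placeholder for your sheet's document id):

df <- getGoogleDataset('<sheet-id>')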

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | jokel | View Question on Stackoverflow
Solution 1 - R | Max Conway | View Answer on Stackoverflow
Solution 2 - R | Sam Firke | View Answer on Stackoverflow
Solution 3 - R | Andrie | View Answer on Stackoverflow
Solution 4 - R | rob2universe | View Answer on Stackoverflow
Solution 5 - R | Senthil | View Answer on Stackoverflow
Solution 6 - R | JumpIn | View Answer on Stackoverflow
Solution 7 - R | jpmarindiaz | View Answer on Stackoverflow
Solution 8 - R | DartPrivateer | View Answer on Stackoverflow
Solution 9 - R | user3819568 | View Answer on Stackoverflow
Solution 10 - R | flexponsive | View Answer on Stackoverflow