Importing data into R from google spreadsheet
RGoogle SheetsR Problem Overview
There seems to be a change in the google spreadsheet publishing options. It is no longer possible to publish to the web as csv or tab file (see this recent post). Thus the usual way to use RCurl to import data into R from a google spreadsheed does not work anymore:
require(RCurl)
u <- "https://docs.google.com/spreadsheet/pub?hl=en_GB&hl=en_GB&key=0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE&single=true&gid=4&output=csv"
tc <- getURL(u, ssl.verifypeer=FALSE)
net <- read.csv(textConnection(tc))
Does anyone have a work-around?
R Solutions
Solution 1 - R
I just wrote a simple package to solve exactly this problem: downloading a Google sheet using just the URL.
install.packages('gsheet')
library(gsheet)
gsheet2tbl('docs.google.com/spreadsheets/d/1I9mJsS5QnXF2TNNntTy-HrcdHmIF9wJ8ONYvEJTXSNo')
More detail is here: https://github.com/maxconway/gsheet
Solution 2 - R
Use the googlesheets4 package, a Google Sheets R API by Jenny Bryan. It is the best way to analyze and edit Google Sheets data in R. Not only can it pull data from Google Sheets, but you can edit the data in Google Sheets, create new sheets, etc.
The package can be installed with install.packages("googlesheets4")
.
There's a vignette for getting started; see her GitHub repository for more. And you also can install the latest development version of the package from that GitHub page, if desired.
Solution 3 - R
I am working on a solution for this. Here is a function that works on your data as well as a few of my own Google Spreadsheets.
First, we need a function to read from Google sheets. readGoogleSheet()
will return a list of data frames, one for each table found on the Google sheet:
readGoogleSheet <- function(url, na.string="", header=TRUE){
stopifnot(require(XML))
# Suppress warnings because Google docs seems to have incomplete final line
suppressWarnings({
doc <- paste(readLines(url), collapse=" ")
})
if(nchar(doc) == 0) stop("No content found")
htmlTable <- gsub("^.*?(<table.*</table).*$", "\\1>", doc)
ret <- readHTMLTable(htmlTable, header=header, stringsAsFactors=FALSE, as.data.frame=TRUE)
lapply(ret, function(x){ x[ x == na.string] <- NA; x})
}
Next, we need a function to clean the individual tables. cleanGoogleTable()
removes empty lines inserted by Google, removes the row names (if they exist) and allows you to skip empty lines before the table starts:
cleanGoogleTable <- function(dat, table=1, skip=0, ncols=NA, nrows=-1, header=TRUE, dropFirstCol=NA){
if(!is.data.frame(dat)){
dat <- dat[[table]]
}
if(is.na(dropFirstCol)) {
firstCol <- na.omit(dat[[1]])
if(all(firstCol == ".") || all(firstCol== as.character(seq_along(firstCol)))) {
dat <- dat[, -1]
}
} else if(dropFirstCol) {
dat <- dat[, -1]
}
if(skip > 0){
dat <- dat[-seq_len(skip), ]
}
if(nrow(dat) == 1) return(dat)
if(nrow(dat) >= 2){
if(all(is.na(dat[2, ]))) dat <- dat[-2, ]
}
if(header && nrow(dat) > 1){
header <- as.character(dat[1, ])
names(dat) <- header
dat <- dat[-1, ]
}
# Keep only desired columns
if(!is.na(ncols)){
ncols <- min(ncols, ncol(dat))
dat <- dat[, seq_len(ncols)]
}
# Keep only desired rows
if(nrows > 0){
nrows <- min(nrows, nrow(dat))
dat <- dat[seq_len(nrows), ]
}
# Rename rows
rownames(dat) <- seq_len(nrow(dat))
dat
}
Now we are ready to read you Google sheet:
> u <- "https://docs.google.com/spreadsheets/d/0AmFzIcfgCzGFdHQ0eEU0MWZWV200RjgtTXVMY1NoQVE/pubhtml"
> g <- readGoogleSheet(u)
> cleanGoogleTable(g, table=1)
2012-Jan Mobile internet Tanzania
1 Airtel Zantel Vodacom Tigo TTCL Combined
> cleanGoogleTable(g, table=2, skip=1)
BUNDLE FEE VALIDITY MB Cost Sh/MB
1 Daily Bundle (20MB) 500/= 1 day 20 500 25.0
2 1 Day bundle (300MB) 3,000/= 1 day 300 3,000 10.0
3 Weekly bundle (3GB) 15,000/= 7 days 3,000 15,000 5.0
4 Monthly bundle (8GB) 70,000/= 30 days 8,000 70,000 8.8
5 Quarterly Bundle (24GB) 200,000/= 90 days 24,000 200,000 8.3
6 Yearly Bundle (96GB) 750,000/= 365 days 96,000 750,000 7.8
7 Handset Browsing Bundle(400 MB) 2,500/= 30 days 400 2,500 6.3
8 STANDARD <NA> <NA> 1 <NA> <NA>
Solution 4 - R
Not sure if other use cases have a higher complexity or if something changed in the meantime. After publishing the spreadsheet in CSV format this simple 1-liner worked for me:
myCSV<-read.csv("http://docs.google.com/spreadsheets/d/1XKeAajiH47jAP0bPkCtS4OdOGTSsjleOXImDrFzxxZQ/pub?output=csv")
R version 3.3.2 (2016-10-31)
Solution 5 - R
There is an easiest way to fetch the google sheets even if you're behind the proxy
require(RCurl)
fileUrl <- "https://docs.google.com/spreadsheets/d/[ID]/export?format=csv"
fileCSV <- getURL(fileUrl,.opts=list(ssl.verifypeer=FALSE))
fileCSVDF <- read.csv(textConnection(fileCSV))
Solution 6 - R
A simpler way.
Be sure to match your URL carefully to the format of the example one here. You can get all but the /export?format=csv
piece from the Google Spreadsheets edit page. Then, just manually add this piece to the URL and then use as shown here.
library(RCurl)
library(mosaic)
mydat2 <- fetchGoogle(paste0("https://docs.google.com/spreadsheets/d/",
"1mAxpSTrjdFv1UrpxwDTpieVJP16R9vkSQrpHV8lVTA8/export?format=csv"))
mydat2
Solution 7 - R
Scrape the html table using httr and XML packages.
library(XML)
library(httr)
url <- "https://docs.google.com/spreadsheets/d/12MK9EFmPww4Vw9P6BShmhOolH1C45Irz0jdzE0QR3hs/pubhtml"
readSpreadsheet <- function(url, sheet = 1){
library(httr)
r <- GET(url)
html <- content(r)
sheets <- readHTMLTable(html, header=FALSE, stringsAsFactors=FALSE)
df <- sheets[[sheet]]
dfClean <- function(df){
nms <- t(df[1,])
names(df) <- nms
df <- df[-1,-1]
row.names(df) <- seq(1,nrow(df))
df
}
dfClean(df)
}
df <- readSpreadsheet(url)
df
Solution 8 - R
Publish as CSV doesn't seem to be supported (or at least isn't currently supported) in the new Google Sheets, which is the default for any new sheet you create. You can, though, create a sheet in the old Google Sheets format, which does support publish as CSV, through this link... https://g.co/oldsheets.
More details on the new vs. old Sheets is here... https://support.google.com/drive/answer/3541068?p=help_new_sheets&rd=1
Solution 9 - R
Thanks for this solution! Works as good as the old one. I used another fix to get rid of the blank first line. When you just exclude it, you might per accident delete a valid observation when the line is 'unfreezed'. The extra instruction in the function deletes any rows which have no time stamp.
readSpreadsheet <- function(url, sheet = 1){
library(httr)
r <- GET(url)
html <- content(r)
sheets <- readHTMLTable(html, header=FALSE, stringsAsFactors=FALSE)
df <- sheets[[sheet]]
dfClean <- function(df){
nms <- t(df[1,])
names(df) <- nms
df <- df[-1,-1]
df <- df[df[,1] != "",] ## only select rows with time stamps
row.names(df) <- seq(1,nrow(df))
df
}
dfClean(df)
}
Solution 10 - R
It is still (as of May 2015) possible to get a CSV file out of Google Spreadsheets, using the hidden URL <sheeturl>/export?format=csv
trick 1.
However, after solving this problem, one encounters another problem - numbers are formatted according to the locale of the sheet, e.g. you may get 1,234.15 in a "US" sheet or 1.234,15 in a "German" sheet. To decide on a sheet locale, go to File > Spreadsheet Settings in Google Docs.
Now you need to remove the decimal mark from the numeric columns so that R can parse them; depending on how large your numbers are, this may need to be done several times for each column. A simple function I wrote to accomplish this:
# helper function to load google sheet and adjust for thousands separator (,)
getGoogleDataset <- function(id) {
download.file(paste0('https://docs.google.com/spreadsheets/d/', id, '/export?format=csv'),'google-ds.csv','curl');
lines <- scan('google-ds.csv', character(0), sep="\n");
pattern<-"\"([0-9]+),([0-9]+)";
for (i in 0:length(lines)) {
while (length(grep(pattern,lines[i]))> 0) {
lines[i] <- gsub(pattern,"\"\\1\\2",lines[i]);
}
}
return(read.csv(textConnection(lines)));
}
You will need to require(utils)
and have curl installed, but no other extra packages.