How can you read a CSV file in R with a varying number of columns?


R Problem Overview


I have a sparse data set in CSV format, one whose number of columns varies from row to row. Here is a sample of the file text.

12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco

When I use

read.csv("data.txt", header = F)

R will interpret the data set as having 3 columns because the number of columns is determined from the first 5 rows. Is there any way to force R to put the data in more columns?

R Solutions


Solution 1 - R

Deep in the ?read.table documentation there is the following:

> The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the ‘Examples’).

Therefore, let's define col.names to be length X (where X is the max number of fields in your dataset), and set fill = TRUE:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

read.table(dat, header = FALSE, sep = ",", 
  col.names = paste0("V",seq_len(7)), fill = TRUE)

     V1             V2             V3      V4           V5     V6             V7
1 12223     University                                                          
2 12227         bridge            Sky                                           
3 12828         Sunset                                                          
4 13801         Ground                                                          
5 14853  Tranceamerica                                                          
6 14854  San Francisco                                                          
7 15595        shibuya         Shrine                                           
8 16126            fog  San Francisco                                           
9 16520     California          ocean  summer  golden gate  beach  San Francisco

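If the maximum number of fields is unknown, you can use the utility function count.fields (which I found in the read.table example code). A text connection is used up once it has been read, so re-create dat (or pass the file path instead) before each call: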

count.fields(dat, sep = ',')
# [1] 2 3 2 2 2 2 3 3 7
max(count.fields(dat, sep = ','))
# [1] 7
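The two steps can be combined into one pass over a file. This is a minimal sketch, assuming the data lives on disk (a temporary file stands in for data.txt here); the names path and n_col are illustrative, and strip.white = TRUE is an optional extra that trims the space following each comma:

```r
# Sample rows from the question, written to a temporary file
# (a hypothetical stand-in for the real data.txt).
lines <- c("12223, University",
           "12227, bridge, Sky",
           "12828, Sunset",
           "13801, Ground",
           "14853, Tranceamerica",
           "14854, San Francisco",
           "15595, shibuya, Shrine",
           "16126, fog, San Francisco",
           "16520, California, ocean, summer, golden gate, beach, San Francisco")
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# Step 1: find the widest row in the file.
n_col <- max(count.fields(path, sep = ","))

# Step 2: supply that many column names so read.table keeps every field.
dat <- read.table(path, header = FALSE, sep = ",",
                  col.names = paste0("V", seq_len(n_col)),
                  fill = TRUE, strip.white = TRUE)
```

Passing a file path rather than a connection means each call opens the file afresh, so there is no consumed connection to worry about.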

Possibly helpful related reading: Only read limited number of columns in R

Solution 2 - R

You could read the data like this:

dat <- textConnection("12223, University
12227, bridge, Sky
12828, Sunset
13801, Ground
14853, Tranceamerica
14854, San Francisco
15595, shibuya, Shrine
16126, fog, San Francisco
16520, California, ocean, summer, golden gate, beach, San Francisco")

dat <- readLines(dat)
dat <- strsplit(dat, ",")

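This results in a list with one character vector per line. Note that the fields keep the leading space that follows each comma.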
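If a rectangular structure is wanted after all, the list can be padded to a common width. The following is only a sketch under assumptions: it uses three of the sample rows, the helper names (fields, padded, df) are made up for illustration, and the naive split on "," does not respect quoted fields:

```r
# Three rows taken from the sample data in the question.
lines <- c("12223, University",
           "12227, bridge, Sky",
           "16520, California, ocean, summer, golden gate, beach, San Francisco")

# Split on commas and trim the surrounding whitespace of each field.
fields <- lapply(strsplit(lines, ","), trimws)

# Pad every row with NA up to the width of the longest row.
n_col <- max(lengths(fields))
padded <- lapply(fields, function(x) {
  length(x) <- n_col
  x
})

# Bind the padded rows into a character matrix, then a data frame.
df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(df) <- paste0("V", seq_len(n_col))
```

Every column in df is character, including the ID in V1; convert it with as.integer() if numbers are needed.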

Solution 3 - R

This does seem to work (following @BlueMagister's suggestion):

tt <- read.table("~/Downloads/tmp.csv", fill=TRUE, header=FALSE, 
          sep=",", colClasses=c("numeric", rep("character", 6)))
names(tt) <- paste("V", 1:7, sep="")

     V1             V2             V3      V4           V5     V6             V7
1 12223     University                                                          
2 12227         bridge            Sky                                           
3 12828         Sunset                                                          
4 13801         Ground                                                          
5 14853  Tranceamerica                                                          
6 14854  San Francisco                                                          
7 15595        shibuya         Shrine                                           
8 16126            fog  San Francisco                                           
9 16520     California          ocean  summer  golden gate  beach  San Francisco

Solution 4 - R

I faced a similar challenge, but count.fields from Blue Magister's answer didn't work, probably because commas inside fields conflicted with sep = ",". In addition, the number of columns varied from file to file. So I just defined surplus col.names in read.table (100 was enough in my case) and then used is.na() to drop the columns that remained entirely empty.

dat <- read.table("path/to/file.csv", col.names = paste0("V", 1:100), fill = TRUE, sep = ",")
# Keep only the columns that contain at least one non-missing value
dat <- dat[, which(colSums(!is.na(dat)) > 0)]
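If the embedded commas sit inside double-quoted fields, it may be worth trying count.fields with an explicit quote argument that matches the one given to read.table. This is a sketch on made-up data, assuming the fields really are quoted; whether it applies to the files described above is not known:

```r
# Hypothetical data: the second field of the first row contains a quoted comma.
lines <- c('1,"Francisco, San"',
           '2,x,y,z')
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# Treat only the double quote as a quoting character when counting fields.
n_fields <- count.fields(path, sep = ",", quote = "\"")

# Use the same quoting rule when reading, with enough column names
# to hold the widest row.
dat <- read.table(path, header = FALSE, sep = ",", quote = "\"",
                  col.names = paste0("V", seq_len(max(n_fields))),
                  fill = TRUE)
```

Using the same quote setting in both calls keeps the field count and the parsed result consistent with each other.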

Solution 5 - R

Try this; it is a bit more dynamic:

readVariableWidthFile <- function(filePath) {
  # Read the raw lines and count the comma-separated fields on each one
  lines <- readLines(filePath)
  slines <- strsplit(lines, ",")
  colCount <- max(lengths(slines))

  # Supply enough column names so that read.csv keeps every field
  read.csv(filePath,
           header = FALSE,
           col.names = paste0("V", seq_len(colCount)),
           fill = TRUE)
}

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type     Original Author   Original Content on Stack Overflow
Question         CompChemist       View Question on Stack Overflow
Solution 1 - R   Blue Magister     View Answer on Stack Overflow
Solution 2 - R   Roland            View Answer on Stack Overflow
Solution 3 - R   Arun              View Answer on Stack Overflow
Solution 4 - R   OndroV            View Answer on Stack Overflow
Solution 5 - R   user12756182      View Answer on Stack Overflow