Split delimited strings in a column and insert as new rows

RDataframeReshapeData ManipulationStrsplit

R Problem Overview


I have a data frame as follow:

+-----+-------+
|  V1 |  V2   |
+-----+-------+
|  1  | a,b,c |
|  2  | a,c   |
|  3  | b,d   |
|  4  | e,f   |
|  .  | .     |
+-----+-------+

Each of the alphabet is a character separated by comma. I would like to split V2 on each comma and insert the split strings as new rows. For instance, the desired output will be:

+----+----+
| V1 | V2 |
+----+----+
|  1 |  a |
|  1 |  b |
|  1 |  c |
|  2 |  a |
|  2 |  c |
|  3 |  b |
|  3 |  d |
|  4 |  e |
|  4 |  f |
+----+----+

I am trying to use strsplit() to spit V2 first, then cast the list into a data frame. It didn't work. Any help will be appreciated.

R Solutions


Solution 1 - R

As of Dec 2014, this can be done using the unnest function from Hadley Wickham's tidyr package (see release notes http://blog.rstudio.org/2014/12/08/tidyr-0-2-0/)

> library(tidyr)
> library(dplyr)
> mydf

  V1    V2
2  1 a,b,c
3  2   a,c
4  3   b,d
5  4   e,f
6  .     .


> mydf %>% 
    mutate(V2 = strsplit(as.character(V2), ",")) %>% 
    unnest(V2)

   V1 V2
1   1  a
2   1  b
3   1  c
4   2  a
5   2  c
6   3  b
7   3  d
8   4  e
9   4  f
10  .  .

Update 2017: note the separate_rows function as described by @Tif below.

It works so much better, and it allows to "unnest" multiple columns in a single statement:

> head(mydf)
geneid              chrom	 start	end	strand	length	gene_count
ENSG00000223972.5	chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1	11869;12010;12179;12613;12613;12975;13221;13221;13453	12227;12057;12227;12721;12697;13052;13374;14409;13670	+;+;+;+;+;+;+;+;+	1735	11
ENSG00000227232.5	chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1	14404;15005;15796;16607;16858;17233;17606;17915;18268;24738;29534	14501;15038;15947;16765;17055;17368;17742;18061;18366;24891;29570	-;-;-;-;-;-;-;-;-;-;-	1351	380
ENSG00000278267.1	chr1	17369	17436	-	68	14
ENSG00000243485.4	chr1;chr1;chr1;chr1;chr1	29554;30267;30564;30976;30976	30039;30667;30667;31097;31109	+;+;+;+;+	1021	22
ENSG00000237613.2	chr1;chr1;chr1	34554;35277;35721	35174;35481;36081	-;-;-	1187	24
ENSG00000268020.3	chr1	52473	53312	+	840	14


> mydf %>% separate_rows(strand, chrom, gene_start, gene_end)
geneid	length	gene_count	strand	chrom	start	end
ENSG00000223972.5	1735	11	+	chr1	11869	12227
ENSG00000223972.5	1735	11	+	chr1	12010	12057
ENSG00000223972.5	1735	11	+	chr1	12179	12227
ENSG00000223972.5	1735	11	+	chr1	12613	12721
ENSG00000223972.5	1735	11	+	chr1	12613	12697
ENSG00000223972.5	1735	11	+	chr1	12975	13052
ENSG00000223972.5	1735	11	+	chr1	13221	13374
ENSG00000223972.5	1735	11	+	chr1	13221	14409
ENSG00000223972.5	1735	11	+	chr1	13453	13670
ENSG00000227232.5	1351	380	-	chr1	14404	14501
ENSG00000227232.5	1351	380	-	chr1	15005	15038
ENSG00000227232.5	1351	380	-	chr1	15796	15947
ENSG00000227232.5	1351	380	-	chr1	16607	16765
ENSG00000227232.5	1351	380	-	chr1	16858	17055
ENSG00000227232.5	1351	380	-	chr1	17233	17368
ENSG00000227232.5	1351	380	-	chr1	17606	17742
ENSG00000227232.5	1351	380	-	chr1	17915	18061
ENSG00000227232.5	1351	380	-	chr1	18268	18366
ENSG00000227232.5	1351	380	-	chr1	24738	24891
ENSG00000227232.5	1351	380	-	chr1	29534	29570
ENSG00000278267.1	68	5	-	chr1	17369	17436
ENSG00000243485.4	1021	8	+	chr1	29554	30039
ENSG00000243485.4	1021	8	+	chr1	30267	30667
ENSG00000243485.4	1021	8	+	chr1	30564	30667
ENSG00000243485.4	1021	8	+	chr1	30976	31097
ENSG00000243485.4	1021	8	+	chr1	30976	31109
ENSG00000237613.2	1187	24	-	chr1	34554	35174
ENSG00000237613.2	1187	24	-	chr1	35277	35481
ENSG00000237613.2	1187	24	-	chr1	35721	36081
ENSG00000268020.3	840	0	+	chr1	52473	53312

Solution 2 - R

Here is another way of doing it..

df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = F, sep = "|", stringsAsFactors = F)

df
##   V1    V2
## 1  1 a,b,c
## 2  2   a,c
## 3  3   b,d
## 4  4   e,f

s <- strsplit(df$V2, split = ",")
data.frame(V1 = rep(df$V1, sapply(s, length)), V2 = unlist(s))
##   V1 V2
## 1  1  a
## 2  1  b
## 3  1  c
## 4  2  a
## 5  2  c
## 6  3  b
## 7  3  d
## 8  4  e
## 9  4  f

Solution 3 - R

Now you can use tidyr 0.5.0's separate_rows is in place of strsplit + unnest.

For example:

library(tidyr)
(df <- read.table(textConnection("1|a,b,c\n2|a,c\n3|b,d\n4|e,f"), header = F, sep = "|", stringsAsFactors = F))

> V1 V2 > 1 1 a,b,c > 2 2 a,c > 3 3 b,d > 4 4 e,f

separate_rows(df, V2)

Gives:

> V1 V2 > 1 1 a > 2 1 b > 3 1 c > 4 2 a > 5 2 c > 6 3 b > 7 3 d > 8 4 e > 9 4 f

See reference: https://blog.rstudio.org/2016/06/13/tidyr-0-5-0/

Solution 4 - R

Here's a data.table solution:

d.df <- read.table(header=T, text="V1 | V2
1 | a,b,c
2 | a,c
3 | b,d
4 | e,f", stringsAsFactors=F, sep="|", strip.white = TRUE)
require(data.table)
d.dt <- data.table(d.df, key="V1")
out <- d.dt[, list(V2 = unlist(strsplit(V2, ","))), by=V1]

#    V1 V2
# 1:  1  a
# 2:  1  b
# 3:  1  c
# 4:  2  a
# 5:  2  c
# 6:  3  b
# 7:  3  d
# 8:  4  e
# 9:  4  f

> sapply(out$V2, nchar) # (or simply nchar(out$V2))
# a b c a c b d e f 
# 1 1 1 1 1 1 1 1 1 

Solution 5 - R

You can consider cSplit with direction = "long" from my "splitstackshape" package.

Usage would be:

cSplit(mydf, "V2", ",", "long")
##    V1 V2
## 1:  1  a
## 2:  1  b
## 3:  1  c
## 4:  2  a
## 5:  2  c
## 6:  3  b
## 7:  3  d
## 8:  4  e
## 9:  4  f

Old answer....

Here is one approach using base R. It assumes we're starting with a data.frame named "mydf". It uses read.csv to read in the second column as a separate data.frame, which we combine with the first column from your source data. Finally, you use reshape to convert the data into a long form.

temp <- data.frame(Ind = mydf$V1, 
                   read.csv(text = as.character(mydf$V2), header = FALSE))
temp1 <- reshape(temp, direction = "long", idvar = "Ind", 
                 timevar = "time", varying = 2:ncol(temp), sep = "")
temp1[!temp1$V == "", c("Ind", "V")]
#     Ind  V
# 1.1   1  a
# 2.1   2  a
# 3.1   3  b
# 4.1   4  e
# 1.2   1  b
# 2.2   2  c
# 3.2   3  d
# 4.2   4  f
# 1.3   1  c

Another fairly direct alternative is:

stack(
  setNames(
    sapply(strsplit(mydf$V2, ","), 
           function(x) gsub("^\\s|\\s$", "", x)), mydf$V1))
  values ind
1      a   1
2      b   1
3      c   1
4      a   2
5      c   2
6      b   3
7      d   3
8      e   4
9      f   4

Solution 6 - R

Another data.table solution, which doesn't rely on the existence of any unique fields in the original data.

DT = data.table(read.table(header=T, text="blah | splitme
    T | a,b,c
    T | a,c
    F | b,d
    F | e,f", stringsAsFactors=F, sep="|", strip.white = TRUE))

DT[,.( blah     , splitme     , splitted=unlist(strsplit(splitme, ","))     ),by=seq_len(nrow(DT))]

The important thing is by=seq_len(nrow(DT)), this is the 'fake' uniqueID that the splitting occurs on. It's tempting to use by=.I instead, as it should be defined the same, but .I appears to be a magical thing that changes its value, better to stick with by=seq_len(nrow(DT))

There are three columns in the output. We simply name the two existing columns, and then compute the third as a split

.( blah       # first column of original
 , splitme    # second column of original
 , splitted = unlist(strsplit(splitme, ","))
 )

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionBoxuanView Question on Stackoverflow
Solution 1 - RdalloliogmView Answer on Stackoverflow
Solution 2 - RCHPView Answer on Stackoverflow
Solution 3 - RTifView Answer on Stackoverflow
Solution 4 - RArunView Answer on Stackoverflow
Solution 5 - RA5C1D2H2I1M1N2O1R2T1View Answer on Stackoverflow
Solution 6 - RAaron McDaidView Answer on Stackoverflow