Reshaping multiple sets of measurement columns (wide format) into single columns (long format)

Tags: R, Reshape, Tidyr, Reshape2, R Faq

R Problem Overview


I have a dataframe in a wide format, with repeated measurements taken within different date ranges. In my example there are three different periods, all with their corresponding values. E.g. the first measurement (Value1) was measured in the period from DateRange1Start to DateRange1End:

ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3 

I'm looking to reshape the data to a long format such that the DateRangeXStart and DateRangeXEnd columns are grouped. Thus, what was 1 row in the original table becomes 3 rows in the new table:

ID DateRangeStart DateRangeEnd Value
1 1/1/90 3/1/90 4.4
1 4/5/91 6/7/91 6.2
1 5/5/95 6/6/96 3.3

I know there must be a way to do this with reshape2/melt/recast/tidyr, but I can't figure out how to map the multiple sets of measure variables into single sets of value columns in this particular way.
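To reproduce the example, the wide table shown above can be built in base R. This is a minimal sketch: the object name dat is an assumption, chosen because Solutions 1 and 2 refer to a data frame of that name without defining it.

```r
# Build the one-row wide table from the question.
# The name `dat` is an assumption matching the object used in Solutions 1 and 2.
dat <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3",
                  header = TRUE, stringsAsFactors = FALSE)

# One observation of ten variables; the dates are still plain character strings.
str(dat)
```

With stringsAsFactors = FALSE the date columns stay character, so none of the reshaping approaches has to deal with factor levels.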

R Solutions


Solution 1 - R

reshape(dat, idvar = "ID", direction = "long",
        varying = list(Start = c(2, 5, 8), End = c(3, 6, 9), Value = c(4, 7, 10)),
        v.names = c("DateRangeStart", "DateRangeEnd", "Value"))
#-------------
    ID time DateRangeStart DateRangeEnd Value
1.1  1    1          1/1/90        3/1/90    4.4
1.2  1    2          4/5/91        6/7/91    6.2
1.3  1    3          5/5/95        6/6/96    3.3

(Added the v.names per Josh's suggestion.)
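The call above selects the measure columns by position. A variant that selects them by name may be easier to maintain if columns are ever reordered. The sketch below assumes the column naming from the question; the data frame dat, the timevar name "group", and the result name long are illustrative choices, not part of the original answer.

```r
# Example data; `dat` is an assumed name for the question's wide table.
dat <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3",
                  header = TRUE, stringsAsFactors = FALSE)

long <- reshape(
  dat,
  idvar     = "ID",
  direction = "long",
  # One character vector of column names per output column.
  varying   = list(
    grep("Start$", names(dat), value = TRUE),
    grep("End$",   names(dat), value = TRUE),
    grep("^Value", names(dat), value = TRUE)
  ),
  v.names   = c("DateRangeStart", "DateRangeEnd", "Value"),
  timevar   = "group"   # hypothetical name for the period index column
)

# Drop the "1.1", "1.2", ... row names that reshape() generates.
rownames(long) <- NULL
long
```

Each element of varying lists the wide columns that collapse into the v.names entry at the same position, so the two arguments must be given in the same order.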

Solution 2 - R

data.table's melt function can melt into multiple columns. Using that, we can simply do:

require(data.table)
melt(setDT(dat), id=1L,
     measure=patterns("Start$", "End$", "^Value"), 
     value.name=c("DateRangeStart", "DateRangeEnd", "Value"))

#    ID variable DateRangeStart DateRangeEnd Value
# 1:  1        1         1/1/90       3/1/90   4.4
# 2:  1        2         4/5/91       6/7/91   6.2
# 3:  1        3         5/5/95       6/6/96   3.3

Alternatively, you can also reference the three sets of measure columns by the column position:

melt(setDT(dat), id = 1L, 
     measure = list(c(2,5,8), c(3,6,9), c(4,7,10)), 
     value.name = c("DateRangeStart", "DateRangeEnd", "Value"))
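The two calls above are equivalent because each regular expression picks out the same columns as the corresponding position vector. The base R sketch below illustrates this with grep(), under the assumption that patterns() matches its regular expressions against the column names in the same way; the vector nms and the list idx are names introduced only for this illustration.

```r
# Column names of the wide table from the question.
nms <- c("ID",
         "DateRange1Start", "DateRange1End", "Value1",
         "DateRange2Start", "DateRange2End", "Value2",
         "DateRange3Start", "DateRange3End", "Value3")

# For each pattern, find the positions of the matching column names.
idx <- lapply(c(Start = "Start$", End = "End$", Value = "^Value"),
              grep, x = nms)
idx
```

The resulting positions are 2, 5, 8 for Start, 3, 6, 9 for End and 4, 7, 10 for Value, which are the vectors passed to measure in the position-based call.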

Solution 3 - R

Reshaping from wide to long format with multiple value/measure columns is possible with the function pivot_longer() of the tidyr package since version 1.0.0.

This is superior to the previous tidyr strategy of gather() followed by spread() (see the answer by @AndrewMacDonald), because the attributes are no longer dropped (dates remain dates and numerics remain numerics in the example below).

library("tidyr")
library("magrittr")

a <- structure(list(ID = 1L, 
                    DateRange1Start = structure(7305, class = "Date"), 
                    DateRange1End = structure(7307, class = "Date"), 
                    Value1 = 4.4, 
                    DateRange2Start = structure(7793, class = "Date"),
                    DateRange2End = structure(7856, class = "Date"), 
                    Value2 = 6.2, 
                    DateRange3Start = structure(9255, class = "Date"), 
                    DateRange3End = structure(9653, class = "Date"), 
                    Value3 = 3.3),
               row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))
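The numbers inside the structure() call above are Date values stored as days since 1970-01-01. They agree with the question's strings when those are read as day/month/year, which is an assumption here, since the question does not state the date format. A minimal sketch of that conversion for the three start dates:

```r
# Start dates as they appear in the question.
starts <- c("1/1/90", "4/5/91", "5/5/95")

# Assumption: the strings are day/month/two-digit-year.
as_dates <- as.Date(starts, format = "%d/%m/%y")
as_dates

# The underlying day counts, comparable to the numbers in structure().
unclass(as_dates)
```

If the data were actually month/day/year, the format string would need to be "%m/%d/%y" and the day counts would differ.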

pivot_longer() (counterpart: pivot_wider()) works similarly to gather(). However, it offers additional functionality such as multiple value columns. With only one value column, all column names of the wide data set would go into one long column with the name given in names_to. For multiple value columns, names_to may receive multiple new names.

This is easiest if all column names follow a specific pattern like Start_1, End_1, Start_2, etc. Therefore, I renamed the columns in the first step.

(names(a) <- sub("(\\d)(\\w*)", "\\2_\\1", names(a)))
#>  [1] "ID"               "DateRangeStart_1" "DateRangeEnd_1"  
#>  [4] "Value_1"          "DateRangeStart_2" "DateRangeEnd_2"  
#>  [7] "Value_2"          "DateRangeStart_3" "DateRangeEnd_3"  
#> [10] "Value_3"

pivot_longer(a, 
             cols = -ID, 
             names_to = c(".value", "group"),
             # names_prefix = "DateRange",
             names_sep = "_")
#> # A tibble: 3 x 5
#>      ID group DateRangeEnd DateRangeStart Value
#>   <int> <chr> <date>       <date>         <dbl>
#> 1     1 1     1990-01-03   1990-01-01       4.4
#> 2     1 2     1991-07-06   1991-05-04       6.2
#> 3     1 3     1996-06-06   1995-05-05       3.3

Alternatively, the reshape may be done using a pivot spec that offers finer control (see link below):

spec <- a %>%
    build_longer_spec(cols = -ID) %>%
    dplyr::transmute(.name = .name,
                     group = readr::parse_number(name),
                     .value = stringr::str_extract(name, "Start|End|Value"))

pivot_longer(a, spec = spec)

Created on 2019-03-26 by the reprex package (v0.2.1)

See also: https://tidyr.tidyverse.org/articles/pivot.html

Solution 4 - R

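Here is an approach to the problem using tidyr. This is an interesting use case for its function extract_numeric(), which I used to pull out the group from the column names. (Note that extract_numeric() is deprecated in current tidyr releases in favor of readr::parse_number().)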

library(dplyr)
library(tidyr)

a <- read.table(textConnection("
ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3 
"),header=TRUE)

a %>%
  gather(variable,value,-ID) %>%
  mutate(group = extract_numeric(variable)) %>%
  mutate(variable =  gsub("\\d","",x = variable)) %>%
  spread(variable,value)

  ID group DateRangeEnd DateRangeStart Value
1  1     1       3/1/90         1/1/90   4.4
2  1     2       6/7/91         4/5/91   6.2
3  1     3       6/6/96         5/5/95   3.3
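The two mutate() steps above split each column name into a numeric group and a name with the digit removed. The same split can be sketched in base R, which may be useful where extract_numeric() is unavailable. The vector vars is made up for illustration, and the gsub() call assumes that the only digits in a column name are the group index.

```r
# A few of the wide column names, as they would appear after gather().
vars <- c("DateRange1Start", "DateRange1End", "Value1",
          "DateRange2Start", "DateRange2End", "Value2")

# Keep only the digits to get the group index.
group <- as.integer(gsub("\\D", "", vars))

# Remove the digits to get the name of the long-format column.
variable <- gsub("\\d", "", vars)

data.frame(vars, group, variable)
```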

Solution 5 - R

Three additional options (with an example dataframe with more than one row to better show the working of the code):

1) with base R:

l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
            setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1], do.call(rbind, l), row.names = NULL)

which gives:

  ID DateRangeStart DateRangeEnd Value
1  1         1/1/90       3/1/90   4.4
2  2         1/2/90       3/2/90   6.1
3  1         4/5/91       6/7/91   6.2
4  2         4/6/91       6/8/91   3.2
5  1         5/5/95       6/6/96   3.3
6  2         5/5/97       6/6/98   1.3

2) with the tidyverse:

library(dplyr)
library(purrr)

split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
  map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>% 
  bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)), .)

3) with the sjmisc-package:

library(sjmisc)
to_long(d, keys = 'group',
        values = c('DateRangeStart','DateRangeEnd','Value'), 
        c('DateRange1Start','DateRange2Start','DateRange3Start'),
        c('DateRange1End','DateRange2End','DateRange3End'),
        c('Value1','Value2','Value3'))[,-2]

If you also want a group/time column, you can adapt the approaches above to:

1) with base R:

l <- lapply(split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))),
            setNames, c('DateRangeStart','DateRangeEnd','Value'))
data.frame(ID = d[,1],
           group = rep(seq_along(l), each = nrow(d)),
           do.call(rbind, l), row.names = NULL)

which gives:

  ID group DateRangeStart DateRangeEnd Value
1  1     1         1/1/90       3/1/90   4.4
2  2     1         1/2/90       3/2/90   6.1
3  1     2         4/5/91       6/7/91   6.2
4  2     2         4/6/91       6/8/91   3.2
5  1     3         5/5/95       6/6/96   3.3
6  2     3         5/5/97       6/6/98   1.3

2) with the tidyverse:

split.default(d[-1], cumsum(grepl('Start$', names(d)[-1]))) %>%
  map_dfr(~set_names(., c('DateRangeStart','DateRangeEnd','Value'))) %>% 
  bind_cols(ID = rep(d$ID, nrow(.)/nrow(d)),
            group = rep(1:(nrow(.)/nrow(d)), each = nrow(d)), .)

3) with the sjmisc-package:

library(sjmisc)
to_long(d, keys = 'group', recode.key = TRUE,
        values = c('DateRangeStart','DateRangeEnd','Value'), 
        c('DateRange1Start','DateRange2Start','DateRange3Start'),
        c('DateRange1End','DateRange2End','DateRange3End'),
        c('Value1','Value2','Value3'))

Used data:

d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3", header = TRUE, stringsAsFactors = FALSE)

Solution 6 - R

Using recycling of logical column indices (with d as defined in Solution 5):

data.frame(ID = d[, 1],
           DateRangeStart = unlist(d[, -1][, c(TRUE, FALSE, FALSE)]),
           DateRangeEnd  = unlist(d[, -1][, c(FALSE, TRUE, FALSE)]),
           Value =  unlist(d[, -1][, c(FALSE, FALSE, TRUE)]))
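The approach above relies on R recycling a length-3 logical index across the nine measurement columns, and on ID being recycled to the length of the longer vectors. The sketch below makes the selection visible. It reuses the data from Solution 5; the intermediate names measure, start_cols, end_cols, value_cols and start_long are introduced only for illustration.

```r
d <- read.table(text = "ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3
2 1/2/90 3/2/90 6.1 4/6/91 6/8/91 3.2 5/5/97 6/6/98 1.3",
                header = TRUE, stringsAsFactors = FALSE)

measure <- d[, -1]  # the nine measurement columns, without ID

# Each length-3 mask is recycled three times over the nine columns.
start_cols <- names(measure)[c(TRUE, FALSE, FALSE)]
end_cols   <- names(measure)[c(FALSE, TRUE, FALSE)]
value_cols <- names(measure)[c(FALSE, FALSE, TRUE)]

# unlist() stacks the selected columns one after another (period by period).
start_long <- unlist(measure[, start_cols], use.names = FALSE)
start_long
```

This only works if every period contributes its columns in the same Start, End, Value order; a missing or reordered column would silently shift the masks.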

Solution 7 - R

Another solution in the tidyverse, which leverages the names_pattern argument to tidyr::pivot_longer().

> names_pattern takes the same specification as extract(), a regular expression containing matching groups (()).

As a single command that requires no preliminary string manipulations, this might present an improvement over @hplieninger's answer (Solution 3).

Solution

library(tidyverse)


# ...
# Code to generate dataset 'ds'.
# ...


ds %>% pivot_longer(
  # Target only those column names with a numeric index, possibly suffixed by "Start"
  # or "End".
  cols = matches("^(.*)(\\d+)(Start|End)?$"),
  # Break each name into its prefix, index, and suffix.
  names_pattern = "^(.*)(\\d+)(Start|End)?$",
  # Pivot by index and reassemble the other components.
  names_to = c(".value", "group_id", ".value")
)

You can further streamline the code into a one-liner, like so:

pivot_longer(ds, !ID, names_pattern = "^(.*)(\\d+)(Start|End)?$", names_to = c(".value", NA, ".value"))

where !ID simply selects every column except ID for pivoting, and NA omits the group index (group_id), as in your example output.
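To see how the regular expression splits each column name, the three capture groups can be inspected with base R's sub(). This is only an illustrative sketch: the helper names nms, pattern, parts and lazy_pattern are made up here, and perl = TRUE is used so that the lazy quantifier in the last step is available.

```r
nms     <- c("DateRange1Start", "DateRange1End", "Value1")
pattern <- "^(.*)(\\d+)(Start|End)?$"

# Pull out each capture group separately.
parts <- data.frame(
  name   = nms,
  prefix = sub(pattern, "\\1", nms, perl = TRUE),
  index  = sub(pattern, "\\2", nms, perl = TRUE),
  suffix = sub(pattern, "\\3", nms, perl = TRUE),
  stringsAsFactors = FALSE
)

# Prefix and suffix (the two ".value" pieces) are pasted back together
# to form the long-format column name.
parts$value_col <- paste0(parts$prefix, parts$suffix)
parts

# Caveat: with two-digit indices the greedy (.*) keeps the leading digit.
sub(pattern, "\\2", "DateRange12Start", perl = TRUE)       # "2"
lazy_pattern <- "^(.*?)(\\d+)(Start|End)?$"
sub(lazy_pattern, "\\2", "DateRange12Start", perl = TRUE)  # "12"
```

For the single-digit indices in the question both patterns give the same split, so the caveat only matters for data with ten or more periods.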

Result

Given a ds like your sample dataset

ds <- structure(
  list(
    ID = 1L,
    DateRange1Start = structure(7305, class = "Date"), 
    DateRange1End = structure(7307, class = "Date"),
    Value1 = 4.4, 
    DateRange2Start = structure(7793, class = "Date"),
    DateRange2End = structure(7856, class = "Date"), 
    Value2 = 6.2,
    DateRange3Start = structure(9255, class = "Date"), 
    DateRange3End = structure(9653, class = "Date"),
    Value3 = 3.3
  ),
  row.names = c(NA, -1L),
  class = c("tbl_df", "tbl", "data.frame")
)

this solution should yield the following result

# A tibble: 3 x 5
     ID group_id DateRangeStart DateRangeEnd Value
  <int> <chr>    <date>         <date>       <dbl>
1     1 1        1990-01-01     1990-01-03     4.4
2     1 2        1991-05-04     1991-07-06     6.2
3     1 3        1995-05-05     1996-06-06     3.3

or alternatively for the streamlined command:

# A tibble: 3 x 4
     ID DateRangeStart DateRangeEnd Value
  <int> <date>         <date>       <dbl>
1     1 1990-01-01     1990-01-03     4.4
2     1 1991-05-04     1991-07-06     6.2
3     1 1995-05-05     1996-06-06     3.3

Solution 8 - R

You don't need anything fancy; base R functions will do.

a <- read.table(textConnection("
ID DateRange1Start DateRange1End Value1 DateRange2Start DateRange2End Value2 DateRange3Start DateRange3End Value3
1 1/1/90 3/1/90 4.4 4/5/91 6/7/91 6.2 5/5/95 6/6/96 3.3 
"),header=TRUE)
b1 <- a[,c(1:4)]; b2 <- a[,c(1,5:7)]; b3 <- a[,c(1,8:10)]
colnames(b1) <- colnames(b2) <- colnames(b3) <- c("ID","DateRangeStart","DateRangeEnd","Value")
b <- rbind(b1,b2,b3)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type      Original Author    Original Content on Stackoverflow
Question          daj                View Question on Stackoverflow
Solution 1 - R    IRTFM              View Answer on Stackoverflow
Solution 2 - R    Arun               View Answer on Stackoverflow
Solution 3 - R    hplieninger        View Answer on Stackoverflow
Solution 4 - R    AndrewMacDonald    View Answer on Stackoverflow
Solution 5 - R    Jaap               View Answer on Stackoverflow
Solution 6 - R    zx8754             View Answer on Stackoverflow
Solution 7 - R    Greg               View Answer on Stackoverflow
Solution 8 - R    Blue Magister      View Answer on Stackoverflow