Repeat rows of a data.frame

RDataframeRowsRepeat

R Problem Overview


I want to repeat the rows of a data.frame, each N times. The result should be a new data.frame (with nrow(new.df) == nrow(old.df) * N) keeping the data types of the columns.

Example for N = 2:

                        A B   C
  A B   C             1 j i 100
1 j i 100     -->     2 j i 100
2 K P 101             3 K P 101
                      4 K P 101

So, each row is repeated 2 times and characters remain characters, factors remain factors, numerics remain numerics, ...

My first attempt used apply: apply(old.df, 2, function(co) rep(co, each = N)), but this one transforms my values to characters and I get:

     A   B   C    
[1,] "j" "i" "100"
[2,] "j" "i" "100"
[3,] "K" "P" "101"
[4,] "K" "P" "101"

R Solutions


Solution 1 - R

df <- data.frame(a = 1:2, b = letters[1:2]) 
df[rep(seq_len(nrow(df)), each = 2), ]

Solution 2 - R

A clean dplyr solution, taken from [here][1]

library(dplyr)
df <- tibble(x = 1:2, y = c("a", "b"))
df %>% slice(rep(1:n(), each = 2))

[1]: https://stackoverflow.com/questions/38237350/repeating-rows-of-data-frame-in-dplyr#comment63896747_38237805 "here"

Solution 3 - R

There is a lovely vectorized solution that repeats only certain rows n-times each, possible for example by adding an ntimes column to your data frame:

  A B   C ntimes
1 j i 100      2
2 K P 101      4
3 Z Z 102      1

Method:

df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2,4,1))
df <- as.data.frame(lapply(df, rep, df$ntimes))

Result:

  A B   C ntimes
1 Z Z 102      1
2 j i 100      2
3 j i 100      2
4 K P 101      4
5 K P 101      4
6 K P 101      4
7 K P 101      4

This is very similar to Josh O'Brien and Mark Miller's method:

df[rep(seq_len(nrow(df)), df$ntimes),]

However, that method appears quite a bit slower:

df <- data.frame(A=c("j","K","Z"), B=c("i","P","Z"), C=c(100,101,102), ntimes=c(2000,3000,4000))

microbenchmark::microbenchmark(
  df[rep(seq_len(nrow(df)), df$ntimes),],
  as.data.frame(lapply(df, rep, df$ntimes)),
  times = 10
)

Result:

Unit: microseconds
                                      expr      min       lq      mean   median       uq      max neval
   df[rep(seq_len(nrow(df)), df$ntimes), ] 3563.113 3586.873 3683.7790 3613.702 3657.063 4326.757    10
 as.data.frame(lapply(df, rep, df$ntimes))  625.552  654.638  676.4067  668.094  681.929  799.893    10

Solution 4 - R

If you can repeat the whole thing, or subset it first then repeat that, then this similar question may be helpful. Once again:

library(mefa)
rep(mtcars,10) 

or simply

mefa:::rep.data.frame(mtcars)

Solution 5 - R

Adding to what @dardisco mentioned about mefa::rep.data.frame(), it's very flexible.

You can either repeat each row N times:

rep(df, each=N)

or repeat the entire dataframe N times (think: like when you recycle a vectorized argument)

rep(df, times=N)

Two thumbs up for mefa! I had never heard of it until now and I had to write manual code to do this.

Solution 6 - R

For reference and adding to answers citing mefa, it might worth to take a look on the implementation of mefa::rep.data.frame() in case you don't want to include the whole package:

> data <- data.frame(a=letters[1:3], b=letters[4:6])
> data
  a b
1 a d
2 b e
3 c f
> as.data.frame(lapply(data, rep, 2))
  a b
1 a d
2 b e
3 c f
4 a d
5 b e
6 c f

Solution 7 - R

The rep.row function seems to sometimes make lists for columns, which leads to bad memory hijinks. I have written the following which seems to work well:

library(plyr)
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}

Solution 8 - R

My solution similar as mefa:::rep.data.frame, but a little faster and cares about row names:

rep.data.frame <- function(x, times) {
    rnames <- attr(x, "row.names")
    x <- lapply(x, rep.int, times = times)
    class(x) <- "data.frame"
    if (!is.numeric(rnames))
        attr(x, "row.names") <- make.unique(rep.int(rnames, times))
    else
        attr(x, "row.names") <- .set_row_names(length(rnames) * times)
    x
}

Compare solutions:

library(Lahman)
library(microbenchmark)
microbenchmark(
    mefa:::rep.data.frame(Batting, 10),
    rep.data.frame(Batting, 10),
    Batting[rep.int(seq_len(nrow(Batting)), 10), ],
    times = 10
)
#> Unit: milliseconds
#>                                            expr       min       lq     mean   median        uq       max neval cld
#>              mefa:::rep.data.frame(Batting, 10) 127.77786 135.3480 198.0240 148.1749  278.1066  356.3210    10  a 
#>                     rep.data.frame(Batting, 10)  79.70335  82.8165 134.0974  87.2587  191.1713  307.4567    10  a 
#>  Batting[rep.int(seq_len(nrow(Batting)), 10), ] 895.73750 922.7059 981.8891 956.3463 1018.2411 1127.3927    10   b

Solution 9 - R

try using for example

N=2
rep(1:4, each = N) 

as an index

Solution 10 - R

Another way to do this would to first get row indices, append extra copies of the df, and then order by the indices:

df$index = 1:nrow(df)
df = rbind(df,df)
df = df[order(df$index),][,-ncol(df)]

Although the other solutions may be shorter, this method may be more advantageous in certain situations.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionStefanView Question on Stackoverflow
Solution 1 - RJosh O'BrienView Answer on Stackoverflow
Solution 2 - RDavid RubingerView Answer on Stackoverflow
Solution 3 - RAdam EricksonView Answer on Stackoverflow
Solution 4 - RdardiscoView Answer on Stackoverflow
Solution 5 - RsmciView Answer on Stackoverflow
Solution 6 - RFabio GabrielView Answer on Stackoverflow
Solution 7 - RjebyrnesView Answer on Stackoverflow
Solution 8 - RArtem KlevtsovView Answer on Stackoverflow
Solution 9 - RshhhhimhuntingrabbitsView Answer on Stackoverflow
Solution 10 - RcrazjoView Answer on Stackoverflow