# Create a group number for each consecutive sequence

## R Problem Overview

I have the data.frame below. I want to add a column `g` that classifies my data according to consecutive sequences in column `h_no`. That is, the first sequence of `h_no` (`1, 2, 3, 4`) is group 1, the second series of `h_no` (1 to 7) is group 2, and so on, as indicated in the last column `g`.

```
h_no h_freq h_freqsq g
1 0.09091 0.008264628 1
2 0.00000 0.000000000 1
3 0.04545 0.002065702 1
4 0.00000 0.000000000 1
1 0.13636 0.018594050 2
2 0.00000 0.000000000 2
3 0.00000 0.000000000 2
4 0.04545 0.002065702 2
5 0.31818 0.101238512 2
6 0.00000 0.000000000 2
7 0.50000 0.250000000 2
1 0.13636 0.018594050 3
2 0.09091 0.008264628 3
3 0.40909 0.167354628 3
4 0.04545 0.002065702 3
```

## R Solutions

## Solution 1 - R

You can add a column to your data using various techniques. The quotes below come from the "Details" section of the relevant help text, `[[.data.frame`.

> Data frames can be indexed in several modes. When `[` and `[[` are used with a single vector index (`x[i]` or `x[[i]]`), they index the data frame as if it were a list.

```
my.dataframe["new.col"] <- a.vector
my.dataframe[["new.col"]] <- a.vector
```

> The data.frame method for `$`, treats `x` as a list

```
my.dataframe$new.col <- a.vector
```

> When `[` and `[[` are used with two indices (`x[i, j]` and `x[[i, j]]`) they act like indexing a matrix

```
my.dataframe[ , "new.col"] <- a.vector
```

With a single index, the method for `data.frame` has no way of knowing whether you mean rows or columns, so it assumes you mean columns.
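As a quick check of the forms quoted above, here is a minimal sketch on a made-up two-row data frame. The names `df`, `v`, and the column names `a` to `d` are illustrative assumptions, not part of the question.

```r
# hypothetical toy data, only for illustration
df <- data.frame(x = 1:2)
v <- c(10, 20)

df["a"] <- v     # list-style indexing with a single bracket
df[["b"]] <- v   # list-style indexing with a double bracket
df$c <- v        # the `$` method
df[, "d"] <- v   # matrix-style indexing with an explicit column index

# TRUE if all four new columns hold exactly the values in v
all_same <- all(sapply(df[c("a", "b", "c", "d")], identical, v))
```

All four assignments should leave the same values in the new column, so which one to use is mostly a matter of taste.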

For your example, this should work:

```
# make some fake data
your.df <- data.frame(no = c(1:4, 1:7, 1:5), h_freq = runif(16), h_freqsq = runif(16))
# find where each sequence starts (where a 1 appears)
from <- which(your.df$no == 1)
# and where each sequence ends: the row before the next start, or the last row
to <- c((from - 1)[-1], nrow(your.df))
# for each sequence, compute its length (len) and repeat the group number len times
get.seq <- mapply(from, to, seq_along(from), FUN = function(x, y, z) {
  len <- length(seq(from = x[1], to = y[1]))
  return(rep(z, times = len))
})
# when we unlist, we get a vector, which we append to the original data.frame
your.df$group <- unlist(get.seq)
# since this is designating a group, it makes sense to make it a factor
your.df$group <- as.factor(your.df$group)
your.df
no h_freq h_freqsq group
1 1 0.40998238 0.06463876 1
2 2 0.98086928 0.33093795 1
3 3 0.28908651 0.74077119 1
4 4 0.10476768 0.56784786 1
5 1 0.75478995 0.60479945 2
6 2 0.26974011 0.95231761 2
7 3 0.53676266 0.74370154 2
8 4 0.99784066 0.37499294 2
9 5 0.89771767 0.83467805 2
10 6 0.05363139 0.32066178 2
11 7 0.71741529 0.84572717 2
12 1 0.10654430 0.32917711 3
13 2 0.41971959 0.87155514 3
14 3 0.32432646 0.65789294 3
15 4 0.77896780 0.27599187 3
16 5 0.06100008 0.55399326 3
```

## Solution 2 - R

Easily. Suppose your data frame is `A`:

```
b <- A[,1]
b <- b==1
b <- cumsum(b)
```

`b` is then the group column; you can attach it with `A$g <- b`.
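To see what this produces, here is a small sketch on the `h_no` values from the question. It assumes, as the snippet above does, that every sequence starts at exactly 1; the vector is retyped here rather than read from the original data frame.

```r
# h_no column from the question: 1..4, then 1..7, then 1..4
h_no <- c(1:4, 1:7, 1:4)

# TRUE wherever a new sequence starts; the running total is the group number
g <- cumsum(h_no == 1)
g
# [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
```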

## Solution 3 - R

If I understand the question correctly, you want to detect when the `h_no` doesn't increase and then increment the `class`. (I'm going to walk through how I solved this problem; there is a self-contained function at the end.)

#### Working

We only care about the `h_no` column for the moment, so we can extract that from the data frame:

```
> h_no <- data$h_no
```

We want to detect when `h_no` doesn't go up, which we can do by working out when the difference between successive elements is either negative or zero. R provides the `diff` function, which gives us the vector of differences:

```
> d.h_no <- diff(h_no)
> d.h_no
[1] 1 1 1 -3 1 1 1 1 1 1 -6 1 1 1
```

Once we have that, it is a simple matter to find the ones that are non-positive:

```
> nonpos <- d.h_no <= 0
> nonpos
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[13] FALSE FALSE
```

In R, `TRUE` and `FALSE` are basically the same as `1` and `0`, so if we get the cumulative sum of `nonpos`, it will increase by 1 in (almost) the appropriate spots. The `cumsum` function (which is basically the opposite of `diff`) can do this.

```
> cumsum(nonpos)
[1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2
```

But, there are two problems: the numbers are one too small; and, we are missing the first element (there should be four in the first class).

The first problem is simply solved: `1+cumsum(nonpos)`. And the second just requires adding a `1` to the front of the vector, since the first element is always in class `1`:

```
> classes <- c(1, 1 + cumsum(nonpos))
> classes
[1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
```

Now, we can attach it back onto our data frame with `cbind` (by using the `class=` syntax, we can give the column the `class` heading):

```
> data_w_classes <- cbind(data, class=classes)
```

And `data_w_classes` now contains the result.

#### Final result

We can compress the lines together and wrap it all up into a function to make it easier to use:

```
classify <- function(data) {
  cbind(data, class=c(1, 1 + cumsum(diff(data$h_no) <= 0)))
}
```

Or, since it makes sense for the `class` to be a factor:

```
classify <- function(data) {
  cbind(data, class=factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}
```

You use either function like:

```
> classified <- classify(data) # doesn't overwrite data
> data <- classify(data) # data now has the "class" column
```

(This method of solving this problem is good because it avoids explicit iteration, which is generally recommended for R, and avoids generating lots of intermediate vectors, lists, etc. And also it's kinda neat how it can be written on one line :) )
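One consequence of detecting "doesn't increase" rather than looking for a literal `1` is that the sequences don't have to start at 1. The sketch below illustrates this on a made-up vector (not from the question) whose runs start at 2.

```r
# hypothetical h_no whose runs start at 2 instead of 1
h_no <- c(2, 3, 4, 2, 3)

# rule used in this answer: start a new class whenever the value fails to increase
by_reset <- c(1, 1 + cumsum(diff(h_no) <= 0))
# rule that looks for a literal 1 at the start of each sequence
by_ones <- cumsum(h_no == 1)

by_reset
# [1] 1 1 1 2 2
by_ones
# [1] 0 0 0 0 0
```

On the question's data both rules give the same groups, since every sequence there starts at 1.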

## Solution 4 - R

In addition to Roman's answer, something like this might be even simpler. Note that I haven't tested it because I do not have access to R right now.

```
# Note that I use a global variable here
# normally not advisable, but I liked the
# use here to make the code shorter
index <- 0
new_column <- sapply(df$h_no, function(x) {
  # `<<-` updates the outer `index`; a plain `=` would only create a local copy
  if (x == 1) index <<- index + 1
  return(index)
})
```

The function iterates over the values in `h_no` and always returns the category that the current value belongs to. If a value of `1` is detected, we increase the global variable `index` and continue.
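One detail worth checking in a snippet like this is which assignment operator is used inside the function. The sketch below, on the `h_no` values from the question (retyped here), compares a plain `=` with the superassignment operator `<<-`; the names `local_assign` and `super_assign` are illustrative.

```r
h_no <- c(1:4, 1:7, 1:4)

index <- 0
local_assign <- sapply(h_no, function(x) {
  if (x == 1) index = index + 1    # creates a local `index`; the outer one stays 0
  index
})

index <- 0
super_assign <- sapply(h_no, function(x) {
  if (x == 1) index <<- index + 1  # updates the outer `index`
  index
})

local_assign
# [1] 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0
super_assign
# [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
```

Only the `<<-` version carries the counter from one element to the next, which is what the grouping relies on.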

## Solution 5 - R

I believe that using `cbind` is the simplest way to add a column to a data frame in R. Below is an example:

```
myDf = data.frame(index=seq(1,10,1), Val=seq(1,10,1))
newCol= seq(2,20,2)
myDf = cbind(myDf,newCol)
```

## Solution 6 - R

An approach based on identifying the group numbers (`x` in `mapply`) and the length of each group (`y` in `mapply`):

```
mytb<-read.table(text="h_no h_freq h_freqsq group
1 0.09091 0.008264628 1
2 0.00000 0.000000000 1
3 0.04545 0.002065702 1
4 0.00000 0.000000000 1
1 0.13636 0.018594050 2
2 0.00000 0.000000000 2
3 0.00000 0.000000000 2
4 0.04545 0.002065702 2
5 0.31818 0.101238512 2
6 0.00000 0.000000000 2
7 0.50000 0.250000000 2
1 0.13636 0.018594050 3
2 0.09091 0.008264628 3
3 0.40909 0.167354628 3
4 0.04545 0.002065702 3", header=T, stringsAsFactors=F)
mytb$group<-NULL
positionsof1s <- which(mytb$h_no == 1) # row where each group starts (grep(1, ...) would also match 10, 11, ...)
mytb$newgroup <- unlist(mapply(function(x, y)
  rep(x, y),                      # repeat x number y times
  x = seq_along(positionsof1s),   # x is the group number: 1 to 3
  y = c(diff(positionsof1s),      # y is the size of each group except the last: 4, 7
        nrow(mytb) -              # this line and the following give the size of the last group (g3)
          (positionsof1s[length(positionsof1s)] - 1) # number of rows minus the rows before the last group's start
  )))
mytb
```

## Solution 7 - R

The `data.table` function `rleid` is handy for things like this. We subtract the sequence `1:nrow(data)` to transform consecutive sequences to constants, and then use `rleid` to create the group IDs:

```
data$g = data.table::rleid(data$h_no - 1:nrow(data))
```
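To make the "transform consecutive sequences to constants" step visible, here is a sketch on the question's `h_no` values (retyped here). It uses base R's `rle` as a stand-in for `data.table::rleid`, so it is an illustration of the idea rather than the one-liner above; `offset`, `runs`, and `g` are illustrative names.

```r
h_no <- c(1:4, 1:7, 1:4)

# within a run of consecutive values, value minus position is constant
offset <- h_no - seq_along(h_no)
offset
# [1]   0   0   0   0  -4  -4  -4  -4  -4  -4  -4 -11 -11 -11 -11

# number each run of equal offsets 1, 2, 3, ...
runs <- rle(offset)
g <- rep(seq_along(runs$lengths), times = runs$lengths)
g
# [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
```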

## Solution 8 - R

```
# as.integer() does not take a `breaks` argument; interval breaks belong to cut()
Data.frame[,'h_new_column'] <- as.integer(Data.frame[,'h_no'], breaks=c(1, 4, 7))
```