Using `.N`...

    DT[ , `:=`( COUNT = .N , IDX = 1:.N ) , by = VAL ]
    #    VAL COUNT IDX
    # 1:   1     3   1
    # 2:   2     4   1
    # 3:   2     4   2
    # 4:   3     3   1
    # 5:   1     3   2
    # 6:   3     3   2
    # 7:   3     3   3
    # 8:   2     4   3
    # 9:   2     4   4
    #10:   1     3   3

`.N` is the number of records in each group, with groups defined by `"VAL"`, so `1:.N` numbers the rows within each group.

I'm aware of the `$in` operator, which appears to search for an item's presence in an array, but I only want to find a match if the item is in the first position in the array.

For instance:

    {
        "_id" : ObjectId("0"),
        "imgs" : [
            "http://foo.jpg",
            "http://bar.jpg",
            "http://moo.jpg"
        ]
    },
    {
        "_id" : ObjectId("1"),
        "imgs" : [
            "http://bar.jpg",
            "http://foo.jpg",
            "http://moo.jpg"
        ]
    }

I'm looking for a query akin to:

    db.products.find({"imgs[0]": "http://foo.jpg"})

This would/should return the `ObjectId("0")` but not `ObjectId("1")`, as it's only checking against the first image in the array.

How can this be achieved? I'm aware I could just create a separate field which contains a single string for `firstImg`, but that's not really what I'm after here.

Querying MongoDB to match in the first item in an array

I know that in order to be picklable, a class has to override the `__reduce__` method, and it has to return a string or a tuple.

How does this function work?
What is the exact usage of `__reduce__`? When will it be used?
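
As a point of reference for the question, here is a minimal sketch of a class that defines `__reduce__`. The `Point` class and its fields are hypothetical, chosen only for illustration; the tuple form shown (a callable plus a tuple of its arguments) is one of the return shapes described in the `pickle` documentation.

```python
import pickle


class Point:
    """Hypothetical example class, not taken from the question."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __reduce__(self):
        # Return (callable, args): on unpickling, pickle calls Point(x, y)
        # to rebuild the object.
        return (Point, (self.x, self.y))


# pickle.dumps() uses the tuple from __reduce__ when serialising the
# instance; pickle.loads() calls the recorded callable with the recorded args.
original = Point(1, 2)
restored = pickle.loads(pickle.dumps(original))
```

Calling `original.__reduce__()` directly returns the same tuple that pickle records, which can be a convenient way to inspect what will be stored.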

What's the exact usage of `__reduce__` in Pickler

I have the following data.table

    library(data.table)
    set.seed(1)
    DT <- data.table(VAL = sample(c(1, 2, 3), 10, replace = TRUE))
        VAL
     1:   1
     2:   2
     3:   2
     4:   3
     5:   1
     6:   3
     7:   3
     8:   2
     9:   2
    10:   1

_Within_ each number in `VAL` I want to:

 1. Count the number of records/rows
 2. Create a row index (counter) of the first, second, third occurrence, etc.

At the end I want the result

        VAL COUNT IDX
     1:   1     3   1
     2:   2     4   1
     3:   2     4   2
     4:   3     3   1
     5:   1     3   2
     6:   3     3   2
     7:   3     3   3
     8:   2     4   3
     9:   2     4   4
    10:   1     3   3

where "COUNT" is the number of records/rows for each "VAL", and "IDX" is the row index within each "VAL".  

I tried to work with `which` and `length` using `.I`:

    DT[, list(COUNT = length(VAL == VAL[.I]),
              IDX = which(which(VAL == VAL[.I]) == .I))]

but this does not work, as `.I` refers to a vector of row indices, so I guess one must use `.I[]`. Inside `.I[]`, though, I face the same problem: I do not have the row index, and I know (from reading the `data.table` FAQ and following the posts here) that looping through rows should be avoided if possible.

So, what's the `data.table` way?

Count number of records and generate row number within each group in a data.table

In this boxplot we can see the mean, but how can we also show the numeric value of the mean on the plot for every box?

    ggplot(data=PlantGrowth, aes(x=group, y=weight, fill=group)) + geom_boxplot() +
        stat_summary(fun.y=mean, colour="darkred", geom="point",
                     shape=18, size=3, show_guide = FALSE)

Boxplot show the value of mean

**Update:**

I've written a brief walkthrough guide to [installing Rtools on Windows](http://thecoatlessprofessor.com/programming/rcpp/install-rtools-for-rcpp/).

**Original:**

I am trying to build an R package using RStudio on Windows 7. When I attempt to build the package via RStudio's Build panel I receive:

    WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
    
    http://cran.rstudio.com/bin/windows/Rtools/

Loading `library(devtools)` and running `find_rtools(T)` gives:

    Scanning path...
    ls : F:\Rtools\bin\ls.exe 
    Scanning registry...
    Found F:/Rtools for 3.1 
    VERSION.txt
    Rtools version 3.1.0.1936 
    [1] TRUE

The Path variable is set as:

    F:\Rtools\bin;F:\Rtools\gcc-4.6.3\bin;F:\Rtools\perl\bin;F:\Rtools\MinGW\bin;F:\Program Files\R\R-3.0.2\bin\x64;F:\Program Files (x86)\HTML Help Workshop;F:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\Common Files\Microsoft Shared\Windows Live;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Windows Live\Shared;C:\Program Files\Microsoft Network Monitor 3\;F:\Program Files (x86)\QuickTime\QTSystem\

I've also restarted several times, yet the error persists. I'm a bit confused as to why this is occurring.

Output when R accesses the system variable `Path`:

    > Sys.getenv()['PATH']
    PATH 
    "F:\\Program Files\\R\\R-3.0.2\\bin\\x64;F:\\Rtools\\bin;F:\\Rtools\\gcc-4.6.3\\bin;F:\\Rtools\\perl\\bin;F:\\Rtools\\MinGW\\bin;F:\\Program Files\\R\\R-3.0.2\\bin\\x64;F:\\Program Files (x86)\\HTML Help Workshop;F:\\Program Files\\MiKTeX 2.9\\miktex\\bin\\x64\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\Program Files\\Common Files\\Microsoft Shared\\Windows Live;C:\\Program Files (x86)\\Common Files\\Microsoft Shared\\Windows Live;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Program Files (x86)\\Windows Live\\Shared;C:\\Program Files\\Microsoft Network Monitor 3\\;F:\\Program Files (x86)\\QuickTime\\QTSystem\\" 

The R version I am using is: R version 3.0.2 (2013-09-25) -- "Frisbee Sailing".

The RStudio version I am using is 0.97.551. When I check for updates, I'm told that this is the latest patch.

    > Sys.which("ls.exe")
                       ls.exe 
    "F:\\Rtools\\bin\\ls.exe" 
    > Sys.which("gcc.exe")
    gcc.exe 
         "" 

Rtools not being detected by R

I searched for this question and found some answers, but none of them seem to work. This is the script that I'm using in Python to run my R script.

    import subprocess
    retcode = subprocess.call("/usr/bin/Rscript --vanilla -e 'source(\"/pathto/MyrScript.r\")'", shell=True)

and I get this error:

    Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
      no lines available in input
    Calls: source ... withVisible -> eval -> eval -> read.csv -> read.table
    Execution halted

and here is the content of my R script (pretty simple!)

    data = read.csv('features.csv')
    data1 = read.csv("BagofWords.csv")
    merged = merge(data,data1)
    write.table(merged, "merged.csv",quote=FALSE,sep=",",row.names=FALSE)
    for (i in 1:length(merged$fileName))
    {
            fileConn<-file(paste("output/",toString(merged$fileName[i]),".txt",sep=""))
            writeLines((toString(merged$BagofWord[i])),fileConn)
            close(fileConn)
    }

The R script works fine when I use `source('MyrScript.r')` in the R command line. Moreover, when I run the exact command that I pass to the `subprocess.call` function (i.e., `/usr/bin/Rscript --vanilla -e 'source("/pathto/MyrScript.r")'`) in my command line, it works fine. I don't really get what the problem is.
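
One thing that might be worth checking, offered as an assumption rather than a diagnosis, is the working directory: the R script reads `features.csv` and `BagofWords.csv` by relative path, so the result depends on where the process is started. The sketch below passes the command as an argument list and sets `cwd` explicitly. The helper names `rscript_command` and `run_in_dir` are hypothetical, and `/pathto` stands in for the elided directory from the question.

```python
import subprocess
import sys


def rscript_command(script_path):
    # Build the argument list for Rscript; a list avoids shell quoting issues.
    return ["/usr/bin/Rscript", "--vanilla", script_path]


def run_in_dir(command, workdir):
    # Run the command with an explicit working directory, so relative paths
    # inside the script resolve against workdir. Returns the exit code.
    return subprocess.call(command, cwd=workdir)


# Demonstration with the Python interpreter itself, so the sketch runs
# even where R is not installed.
exit_code = run_in_dir([sys.executable, "-c", "pass"], ".")
```

For the R script, the call would then look like `run_in_dir(rscript_command("/pathto/MyrScript.r"), "/pathto")`, assuming the csv files live next to the script.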

Running R script from python

I have a data frame with two variables, Date and Taxa, and want to get the date on which each taxon first occurs.  There are 9 different dates and 40 different taxa in the data frame, which consists of 172 rows, but my answer should only have 40 rows.  

Taxa is a factor and Date is a date.

For example, my data frame (called 'species') is set up like this:

    Date          Taxa
    2013-07-12    A
    2011-08-31    B
    2012-09-06    C
    2012-05-17    A
    2013-07-12    C
    2012-09-07    B

and I would be looking for an answer like this:

    Date          Taxa
    2012-05-17    A
    2011-08-31    B
    2012-09-06    C

I tried using:

    t.first <-  species[unique(species$Taxa),]

and it gave me the correct number of rows, but some Taxa were repeated.  If I just use `unique(species$Taxa)` it appears to give me the right answer, but then I don't know the date when each one first occurred.

Thanks for any help.

Extract rows for the first occurrence of a variable in a data frame

I want to save data into an `.RData` file.

For instance, I'd like to save two csv files and some additional information into `1.RData`.

Here, **I have two csv files** 

    1) file_1.csv contains object city[[1]]
    2) file_2.csv contains object city[[2]]

and I would additionally like to save other values, country and population, as follows.
So I guess I first need to make the object 'city' from the two csv files.

The structure of 1.RData might look like this:

    > data = load("1.RData")

    > data
    [1] "city"  "country"  "population"

    > city
      [[1]]               
      NEW YORK         1.1
      SAN FRANCISCO    3.1
      
      [[2]]
      TEXAS            1.3
      SEATTLE          1.4

    > class(city)
      [1] "list"

    > country
      [1] "east"  "west"  "north"

    > class(country)
      [1] "character"

    > population
      [1] 10  11  13  14   

    > class(population)
      [1] "integer"

`file_1.csv` and `file_2.csv` have a bunch of rows and columns.

How can I create this type of RData with csv files and values?

How to save data file into .RData?

I am using `data.table` and there are many functions which require me to set a key (e.g. `X[Y]`). As such, I wish to understand what a key does in order to properly set keys in my data tables.

---
One source I read was `?setkey`.

> `setkey()` sorts a `data.table` and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference. No copy is made at all, other than temporary working memory as large as one column.

My takeaway here is that a key would "sort" the data.table, resulting in a very similar effect to `order()`. However, it doesn't explain the purpose of having a key.

---
The data.table FAQs 3.2 and 3.3 explain:

> 3.2 I don't have a key on a large table, but grouping is still really quick. Why is that?
> 
> data.table uses radix sorting. This is significantly faster than other
> sort algorithms. Radix is specifically for integers only, see
> `?base::sort.list(x,method="radix")`. This is also one reason why
> `setkey()` is quick. When no key is set, or we group in a different order
> from that of the key, we call it an ad hoc by.
> 
> 3.3 Why is grouping by columns in the key faster than an ad hoc by?
> 
> Because each group is contiguous in RAM, thereby minimising page
> fetches, and memory can be copied in bulk (`memcpy` in C) rather than
> looping in C.

From here, I guess that setting a key somehow allows R to use "radix sorting" over other algorithms, and that's why it is faster.

---
The 10 minute quick start guide also has a guide on keys.

> 1. Keys
> 
> Let's start by considering data.frame, specifically rownames (or in
> English, row names). That is, the multiple names belonging to a single
> row. The multiple names belonging to the single row? That is not what
> we are used to in a data.frame. We know that each row has at most one
> name. A person has at least two names, a first name and a second name.
> That is useful to organise a telephone directory, for example, which
> is sorted by surname, then first name. However, each row in a
> data.frame can only have one name. 
> 
> A key consists of one or more
> columns of rownames, which may be integer, factor, character or some
> other class, not simply character. Furthermore, the rows are sorted by
> the key. Therefore, a data.table can have at most one key, because it
> cannot be sorted in more than one way. 
> 
> Uniqueness is not enforced,
> i.e., duplicate key values are allowed. Since the rows are sorted by
> the key, any duplicates in the key will appear consecutively

The telephone directory analogy was helpful in understanding what a key is, but it seems that a key is no different from a factor column. Furthermore, it does not explain why a key is needed (especially to use certain functions) or how to choose the column to set as the key. Also, it seems that in a data.table with time as a column, setting any other column as the key would probably mess up the order of the time column too, which makes it even more confusing, as I do not know whether I am allowed to set any other column as the key. Can someone enlighten me please?


What is the purpose of setting a key in data.table?

How do I extract a column from a data.table as a vector by its position? Below are some code snippets I have tried:

    DT<-data.table(x=c(1,2),y=c(3,4),z=c(5,6))
    DT
    #   x y z
    #1: 1 3 5
    #2: 2 4 6

I want to get this output using column position
    
    DT$y 
    #[1] 3 4
    is.vector(DT$y)
    #[1] TRUE

Another way to get the output that I want to reproduce using column position:

    DT[,y] 
    #[1] 3 4
    is.vector(DT[,y])
    #[1] TRUE

This doesn't give a vector:

    DT[,2,with=FALSE]
    #   y
    #1: 3
    #2: 4
    is.vector(DT[,2,with=FALSE])
    #[1] FALSE

These two don't work:

    DT$noquote(names(DT)[2]) # Doesn't work
    #Error: attempt to apply non-function
    
    DT[,noquote(names(DT)[2])] # Doesn't work
    #[1] y

And this doesn't give a vector:

    DT[,noquote(names(DT)[2]),with=FALSE] # Not a vector
    #   y
    #1: 3
    #2: 4
    is.vector(DT[,noquote(names(DT)[2]),with=FALSE])
    #[1] FALSE

Extract a column from a data.table as a vector, by position

I have a large data frame (on the order of several GB) that I'd like to convert to a `data.table`. Using `as.data.table` creates a copy of the data frame, which means I need available memory to be at least twice the size of the data. Is there a way to do the conversion without a copy?

Here's a simple example to demonstrate:

    library(data.table)
    N <- 1e6
    K <- 1e2
    data <- as.data.frame(rep(data.frame(rnorm(N)), K))
    
    gc(reset=TRUE)
    tracemem(data)
    data <- as.data.table(data)
    gc()
    
With output:

    library(data.table)
    # data.table 1.8.10  For help type: help("data.table")
    N <- 1e6
    K <- 1e2
    data <- as.data.frame(rep(data.frame(rnorm(N)), K))
    
    gc(reset=TRUE)
    # used  (Mb) gc trigger   (Mb)  max used  (Mb)
    # Ncells    303759  16.3     597831   32.0    303759  16.3
    # Vcells 100442572 766.4  402928632 3074.2 100442572 766.4
    tracemem(data)
    # [1] "<0x363fda0>"
    data <- as.data.table(data)
    # tracemem[0x363fda0 -> 0x31e4260]: copy as.data.table.data.frame as.data.table 
    gc()
    # used  (Mb) gc trigger   (Mb)  max used   (Mb)
    # Ncells    304519  16.3     597831   32.0    306162   16.4
    # Vcells 100444242 766.4  322342905 2459.3 200933219 1533.0


Convert a data frame to a data.table without copy

### Overview ###

I'm relatively familiar with `data.table`, not so much with `dplyr`.  I've read through some [`dplyr` vignettes][1] and examples that have popped up on SO, and so far my conclusions are that:

 1. `data.table` and `dplyr` are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
 2. `dplyr` has more accessible syntax
 3. `dplyr` abstracts (or will) potential DB interactions
 4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with `data.table`, though I understand that for users new to both it will be a big factor.  I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question, asked from the perspective of someone already familiar with `data.table`.  I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

### Question ###

What I want to know is:

 1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
 2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another?

One [recent SO question][3] got me thinking about this a bit more, because up until that point I didn't think `dplyr` would offer much beyond what I can already do in `data.table`.  Here is the `dplyr` solution (data at end of Q):

    dat %.%
      group_by(name, job) %.%
      filter(job != "Boss" | year == min(year)) %.%
      mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a `data.table` solution.  That said, good `data.table` solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):

    setDT(dat)[,
      .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
      by=list(id, job)
    ]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to `data.table` (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the `dplyr` or `data.table` way is substantially more concise or performs substantially better.

### Examples ###

#### Usage ####

 - `dplyr` does not allow grouped operations that return arbitrary number of rows (from **[eddi's question][6]**, note: this looks like it will be implemented in **[dplyr 0.5](https://github.com/hadley/dplyr/issues/154)**, also, @beginneR shows a potential work-around using `do` in the answer to @eddi's question).
 - `data.table` supports **[rolling joins](https://stackoverflow.com/questions/12030932/rolling-joins-data-table-in-r)** (thanks @dholstius) as well as **[overlap joins](https://stackoverflow.com/questions/23371747/range-join-data-frames-specific-date-column-with-date-ranges-intervals-in-r/23377309#23377309)**
 - `data.table` internally optimises expressions of the form `DT[col == value]` or `DT[col %in% values]` for *speed* through *automatic indexing* which uses *binary search* while using the same base R syntax. [See here](https://gist.github.com/arunsrinivasan/dacb9d1cac301de8d9ff) for some more details and a tiny benchmark.
 - `dplyr` offers standard evaluation versions of functions (e.g. `regroup`, `summarize_each_`) that can simplify the programmatic use of `dplyr` (note programmatic use of `data.table` is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)

#### Benchmarks ####

- I ran **[my own benchmarks](https://www.brodieg.com/2014/04/18/datatable-vs-dplyr-in-split-apply-comgine/)** and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K) at which point `data.table` becomes substantially faster.
- @Arun ran some **[benchmarks on joins](https://gist.github.com/arunsrinivasan/db6e1ce05227f120a2c9)**, showing that `data.table` scales better than `dplyr` as the number of groups increase (updated with recent enhancements in both packages and recent version of R).  Also, a benchmark when trying to get **[unique values](https://stackoverflow.com/questions/23668593/using-fastmatch-package-in-r/23680560?noredirect=1#comment36389040_23680560)** has `data.table` ~6x faster.
- (Unverified) has `data.table` 75% faster on larger versions of a group/apply/sort while `dplyr` was 40% faster on the smaller ones (**[another SO question from comments][7]**, thanks danas).
- Matt, the main author of `data.table`, has [**benchmarked grouping operations on `data.table`, `dplyr` and python `pandas` on up to 2 billion rows (~100GB in RAM)**](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping).
- An **[older benchmark on 80K groups][2]** has `data.table` ~8x faster

### Data ###

This is for the first example I showed in the question section.

    dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
    "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
    "Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
    1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
    1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
    "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
    "Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
    1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
    "name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
    -16L))


  [1]: http://rpubs.com/hadley/dplyr-intro
  [2]: http://www.r-statistics.com/2013/09/a-speed-test-comparison-of-plyr-data-table-and-dplyr/
  [3]: https://stackoverflow.com/questions/21421004/how-to-cumulatively-add-values-in-one-vector-in-r
  [4]: https://stackoverflow.com/questions/21295936/can-dplyr-summarise-over-several-variables-without-listing-each-one
  [5]: https://stackoverflow.com/questions/22644804/how-can-i-use-dplyr-to-apply-a-function-to-all-non-group-by-columns
  [6]: https://stackoverflow.com/questions/21737815/grouped-operations-that-result-in-length-not-equal-to-1-or-length-of-group-in-dp
  [7]: https://stackoverflow.com/questions/21477525/fast-frequency-and-percentage-table-with-dplyr/

data.table vs dplyr: can one do something well the other can't or does poorly?

Would someone please explain to me the correct usage of `.I` for returning the row numbers of a data.table?

I have data like this:

    require(data.table)
    DT <- data.table(X=c(5, 15, 20, 25, 30))
    DT
    #     X
    # 1:  5
    # 2: 15
    # 3: 20
    # 4: 25
    # 5: 30

I want to return a vector of row indices where a condition in `i` is `TRUE`, e.g. which rows have an `X` greater than 20.

    DT[X > 20]
    # rows 4 & 5 are greater than 20

To get the indices, I tried:

    DT[X > 20, .I]
    # [1] 1 2 

...but clearly I am doing it wrong, because that simply returns a vector containing 1 to the number of returned rows. (Which I thought was pretty much what `.N` was for?).

Sorry if this seems extremely basic, but all I have been able to find in the data.table documentation is WHAT `.I` and `.N` do, not HOW to use them.



Content Type	Original Author	Original Content on Stackoverflow
Question	Simon Z.	View Question on Stackoverflow
Solution 1 - R	Simon O'Hanlon	View Answer on Stackoverflow
