In this scenario it is not so different from `data.frame`:

    data <- data[menuitem != 'coffee' & amount > 0]

Deleting or adding rows by reference is yet to be implemented. You can find more info in [this question][1].

Regarding speed:

First, you can benefit from keys by doing something like:

    setkey(data, menuitem)
    data <- data[!"coffee"]

which will be faster than `data <- data[menuitem != 'coffee']`. However, to apply both filters you asked about in the question you'll need a rolling join (I've finished my lunch break; I can add something later :-)).

Second, even without a key, data.table is much faster on relatively big tables (the speed is similar for a handful of rows):

    library(data.table)
    library(microbenchmark)

    # One million rows each; id is a random lowercase letter
    dt <- data.table(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))
    df <- data.frame(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))

    > microbenchmark(dt[id == "a"], df[df$id == "a", ])
    Unit: milliseconds
                   expr       min        lq    median        uq       max neval
          dt[id == "a"]  24.42193  25.74296  26.00996  26.35778  27.36355   100
     df[df$id == "a", ] 138.17500 146.46729 147.38646 149.06766 154.10051   100
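
As a rough illustration of how the keyed lookup can be combined with the second condition from the question, the sketch below chains a not-join on the key with an ordinary filter on `amount`. The sample table and its values are made up for this example; only the column names `id`, `menuitem` and `amount` come from the question.

```r
library(data.table)

# Hypothetical sample data using the fields described in the question
data <- data.table(
  id       = 1:6,
  menuitem = c("coffee", "tea", "cake", "coffee", "tea", "juice"),
  amount   = c(2.5, 0, 3.0, -1, 1.5, 4.0)
)

# Plain vector-scan filter: keep rows that are not coffee AND have a positive amount
kept <- data[menuitem != "coffee" & amount > 0]

# Keyed variant: the not-join drops the coffee rows, then a second filter handles amount
setkey(data, menuitem)               # sorts data by menuitem, by reference
kept_keyed <- data[!"coffee"][amount > 0]

# Both approaches keep the same rows (ids 3, 5 and 6), possibly in a different order
sort(kept$id)
sort(kept_keyed$id)
```

Note that `setkey()` reorders the table by reference, so the keyed result comes back sorted by `menuitem` rather than in the original row order.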

  [1]: https://stackoverflow.com/questions/16792001/add-a-row-by-reference-at-the-end-of-a-data-table-object

Try this:

```
data <- data[!(menuitem == 'coffee' | amount <= 0), ]
```
More generally:

```
library(data.table)

dt <- data.table(a = c(1,1,1,2,2,2,3,3,3), b = c(4,2,3,1,5,3,4,7,6))
dt
#>    a b
#> 1: 1 4
#> 2: 1 2
#> 3: 1 3
#> 4: 2 1
#> 5: 2 5
#> 6: 2 3
#> 7: 3 4
#> 8: 3 7
#> 9: 3 6
dt[a != 1, ]
#>    a b
#> 1: 2 1
#> 2: 2 5
#> 3: 2 3
#> 4: 3 4
#> 5: 3 7
#> 6: 3 6
```
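
As a quick sanity check, the negated-OR filter can also be written as a plain AND of the two "keep" conditions (De Morgan's law). The sketch below compares the two forms on a small made-up table; the values are hypothetical and only the column names follow the question.

```r
library(data.table)

# Hypothetical transaction data; values are made up for illustration
data <- data.table(
  id       = 1:5,
  menuitem = c("coffee", "tea", "coffee", "cake", "tea"),
  amount   = c(3, -2, 0, 5, 1)
)

# Drop rows that are coffee OR have a non-positive amount
dropped_form <- data[!(menuitem == "coffee" | amount <= 0), ]

# Equivalent: keep rows that are not coffee AND have a positive amount
kept_form <- data[menuitem != "coffee" & amount > 0, ]

# Both forms select the same rows (ids 4 and 5)
dropped_form$id
kept_form$id
```

Which form to use is mostly a matter of readability: the first mirrors the wording "remove rows where ...", the second states what is kept.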



Environment is Nginx + uwsgi.

We are getting a 502 Bad Gateway error from Nginx on certain GET requests. It seems to be related to the length of the URL: in our particular case, the request carried a long list of GET parameters. Shortening the GET parameters makes the 502 error go away.

From the nginx/error.log:

    [error] 22113#0: *1 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.1.100, server: server.domain.com, request: "GET <long_url_here>"

No information in the uwsgi error log.

Nginx uwsgi (104: Connection reset by peer) while reading response header from upstream

I need to write an SPI Linux character device driver for omap4 from scratch.
I know some basics of writing device drivers, but I don't know how to start writing a platform-specific device driver from scratch.

I've written some basic char drivers, and I thought writing an SPI device driver would be similar. Char drivers have a structure `file_operations` which contains the functions implemented in the driver.

    struct file_operations Fops = {
    	.read = device_read,
    	.write = device_write,
    	.ioctl = device_ioctl,
    	.open = device_open,
    	.release = device_release,	/* a.k.a. close */
    };

Now, I am going through the [spi-omap2-mcspi.c][1] code as a reference to get an idea of how to start developing an SPI driver from scratch.

But I don't see functions such as open, read, write, etc.,
and I don't know where the program starts.


  [1]: http://lxr.free-electrons.com/source/drivers/spi/spi-omap2-mcspi.c

How to write a simple Linux device driver?

I have a data.table with fields {id, menuitem, amount}.

This is transaction data, so ids are unique, but menuitem repeats. Now, I want to remove all entries where `menuitem == 'coffee'`.

I also want to delete all rows where `amount <= 0`.

What is the right way to do this in data.table?

I can use `data$menuitem != 'coffee'` and then index it into `data[]`, but that is not necessarily efficient and does not take advantage of data.table.

Any pointers in the right direction are appreciated.

Remove rows conditionally from a data.table in R

I would like to fit a model for each hour (the factor variable) using dplyr. I'm getting an error, and I'm not quite sure what's wrong.

    library(dplyr)

    df.h <- data.frame( 
      hour     = factor(rep(1:24, each = 21)),
      price    = runif(504, min = -10, max = 125),
      wind     = runif(504, min = 0, max = 2500),
      temp     = runif(504, min = - 10, max = 25)  
    )
    
    df.h <- tbl_df(df.h)
    df.h <- group_by(df.h, hour)
    
    group_size(df.h) # checks out, 21 obs. for each factor variable

    # different attempts:
    reg.models <- do(df.h, formula = price ~ wind + temp)
    
    reg.models <- do(df.h, .f = lm(price ~ wind + temp, data = df.h))

I've tried several variations, but I can't get it to work.


Fitting several regression models with dplyr

I am using the `mtcars` dataset. I want to find the number of records for a particular combination of data. Something very similar to the `count(*)` group by clause in SQL. `ddply()` from *plyr* is working for me 

    library(plyr)
    ddply(mtcars, .(cyl,gear),nrow)

has output

      cyl gear V1
    1   4    3  1
    2   4    4  8
    3   4    5  2
    4   6    3  2
    5   6    4  4
    6   6    5  1
    7   8    3 12
    8   8    5  2

Using this code

    library(dplyr)
    g <- group_by(mtcars, cyl, gear)
    summarise(g, length(gear))

has output

      length(cyl)
    1          32

I found various functions to pass in to `summarise()` but none seem to work for me. One function I found is `sum(G)`, which returned

    Error in eval(expr, envir, enclos) : object 'G' not found

Tried using `n()`, which returned 

    Error in n() : This function should not be called directly

What am I doing wrong? How can I get `group_by()` / `summarise()` to work for me?


Count number of rows by group using dplyr

I am trying to predict weekly sales using <strike>ARMA</strike> ARIMA models. I could not find a function for tuning the order (p,d,q) in `statsmodels`. R currently has a function, `forecast::auto.arima()`, which tunes the (p,d,q) parameters.

How do I go about choosing the right order for my model? Are there any libraries available in Python for this purpose?

auto.arima() equivalent for python

I am attempting to reproduce one of the examples in the dplyr package but am getting this error message. I am expecting to see a new column `n` produced with the frequency of each combination. What am I missing? I triple-checked that the package is loaded.

    library(dplyr)
    # summarise peels off a single layer of grouping
    by_vs_am <- group_by(mtcars, vs, am)

    by_vs <- summarise(by_vs_am, n = n())

> Error in n() : This function should not be called directly

dplyr: "Error in n(): function should not be called directly"

    df <- structure(list(`a a` = 1:3, `a b` = 2:4), .Names = c("a a", "a b"
    ), row.names = c(NA, -3L), class = "data.frame")

and the data looks like

      a a a b
    1   1   2
    2   2   3
    3   3   4

The following call to `select`

    select(df, 'a a')

gives

    Error in abs(ind[ind < 0]) : 
      non-numeric argument to mathematical function

How can I select "a a" and/or rename it to something without a space using `select`? I know the following approaches:

1. `names(df)[1] <- "a"`
2. `select(df, a=1)`
3. `select(df, ends_with("a"))`

but if I am working on a large data set, how can I get an exact match without knowing the index number or similar column names?


dplyr: nonstandard column names (white space, punctuation, starts with numbers)

I have a problem with **inconsistent encoding of character vector** in R. 

The text file which I read a table from is encoded (via `Notepad++`) in `UTF-8` (I tried with `UTF-8 without BOM`, too.). 

I want to read a table from this text file, convert it to a `data.table`, set a `key` and make use of binary search. When I tried to do so, the following appeared: 

> Warning message:
>     In `[.data.table`(poli.dt, "żżonymi", mult = "first") :
>     A known encoding (latin1 or UTF-8) was detected in a join column. data.table compares the bytes currently, so doesn't support
> *mixed* encodings well; i.e., using both latin1 and UTF-8, or if any unknown encodings are non-ascii and some of those are marked known and
> others not. But if either latin1 or UTF-8 is used exclusively, and all
> unknown encodings are ascii, then the result should be ok. In future
> we will check for you and avoid this warning if everything is ok. The
> tricky part is doing this without impacting performance for ascii-only
> cases.

and binary search **does not work**. 

I realised that my `data.table` `key` column consists of both "unknown" and "UTF-8" Encoding types: 

    > table(Encoding(poli.dt$word))
    unknown   UTF-8 
    2061312 2739122 

I tried to convert this column (before creating a `data.table` object) with the use of: 

 * `Encoding(word) <- "UTF-8"`
 * `word <- enc2utf8(word)`

but with no effect. 

I also tried a few different ways of reading a file into R (setting all helpful parameters, e.g. `encoding = "UTF-8"`):

 * `data.table::fread` 
 * `utils::read.table`
 * `base::scan`
 * `colbycol::cbc.read.table`

but with no effect. 

# ==================================================

My R.version: 

    > R.version
               _                           
    platform       x86_64-w64-mingw32          
    arch           x86_64                      
    os             mingw32                     
    system         x86_64, mingw32             
    status                                     
    major          3                           
    minor          0.3                         
    year           2014                        
    month          03                          
    day            06                          
    svn rev        65126                       
    language       R                           
    version.string R version 3.0.3 (2014-03-06)
    nickname       Warm Puppy  


My session info: 

    > sessionInfo()
    R version 3.0.3 (2014-03-06)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250                LC_MONETARY=Polish_Poland.1250
    [4] LC_NUMERIC=C                   LC_TIME=Polish_Poland.1250    

    base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
    [1] data.table_1.9.2 colbycol_0.8     filehash_2.2-2   rJava_0.9-6     

    loaded via a namespace (and not attached):
    [1] plyr_1.8.1     Rcpp_0.11.1    reshape2_1.2.2 stringr_0.6.2  tools_3.0.3   

Force character vector encoding from "unknown" to "UTF-8" in R

I wish to (1) group data by one variable (`State`), (2) within each group find the row of minimum value of another variable (`Employees`), and (3) extract the entire row.

(1) and (2) are easy one-liners, and I feel like (3) should be too, but I can't get it.

Here is a sample data set:

    > data
      State Company Employees
    1    AK       A        82
    2    AK       B       104
    3    AK       C        37
    4    AK       D        24
    5    RI       E        19
    6    RI       F       118
    7    RI       G        88
    8    RI       H        42

    data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
            2L), .Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A", 
            "B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L, 
            104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c("State", "Company", 
            "Employees"), class = "data.frame", row.names = c(NA, -8L))

Calculating `min` by group is easy using `aggregate`:

    > aggregate(Employees ~ State, data, function(x) min(x))
      State Employees
    1    AK        24
    2    RI        19

...or `data.table`:
    
    > library(data.table)
    > DT <- data.table(data)
    > DT[ , list(Employees = min(Employees)), by = State]
       State Employees
    1:    AK        24
    2:    RI        19

But how do I extract the entire row corresponding to these `min` values, i.e. also including `Company` in the result? 

Extract row corresponding to minimum value of a variable by group

I have the following data.table:

    dt <- data.table(col1 = rep("a",6), col2 = c(1,1,1,2,3,1))

Now I want to replace all the 1s in col2 with the value "bigDog". I can do it in the data.frame spirit:

    dt$col2[dt$col2==1] <- "bigDog"

But I wonder if there is a different way, more *"data.table oriented"*?

Conditionally replacing column values with data.table

First of all: thanks to @MattDowle; `data.table` is among the best things that
ever happened to me since I started using `R`.

Second: I am aware of many workarounds for various use cases of variable column
names in `data.table`, including:

0. https://stackoverflow.com/questions/12391950/variably-selecting-assigning-to-fields-in-a-data-table
1. https://stackoverflow.com/questions/12603890/pass-column-name-in-data-table-using-variable-in-r
2. https://stackoverflow.com/questions/16617226/referring-to-data-table-columns-by-names-saved-in-variables
3. https://stackoverflow.com/questions/15009669/passing-column-names-to-data-table-programmatically
4. https://stackoverflow.com/questions/15790743/data-table-meta-programming
5. https://stackoverflow.com/questions/14837902/how-to-write-a-function-that-calls-a-function-that-calls-data-table
6. https://stackoverflow.com/questions/14937165/using-dynamic-column-names-in-data-table
7. https://stackoverflow.com/questions/11745169/dynamic-column-names-in-data-table-r
8. https://stackoverflow.com/questions/11680579/assign-multiple-columns-using-in-data-table-by-group
9. https://stackoverflow.com/questions/13525793/setting-column-name-in-group-by-operation-with-data-table
10. https://stackoverflow.com/questions/16513827/r-summarizing-multiple-columns-with-data-table

and probably more I haven't referenced.

But: even if I learned all the tricks documented above to the point that I
never had to look them up to remind myself how to use them, I still would find
that working with column names that are passed as parameters to a function is
an extremely tedious task.

What I'm looking for is a "best-practices-approved" alternative
to the following workaround / workflow. Consider
that I have a bunch of columns of similar data, and would like to perform a sequence of similar operations on these columns or sets of them, where the operations are of arbitrarily high complexity, and the groups of column names passed to each operation are specified in a variable.

I realize this issue _sounds_ contrived, but I run into it with surprising frequency. The examples are usually so messy that it is difficult to separate out the features relevant to this question, but I recently stumbled across one that was fairly straightforward to simplify for use as a MWE here:

    library(data.table)
    library(lubridate)
    library(zoo)
     
    the.table <- data.table(year=1991:1996,var1=floor(runif(6,400,1400)))
    the.table[,`:=`(var2=var1/floor(runif(6,2,5)),
                    var3=var1/floor(runif(6,2,5)))]
     
    # Replicate data across months
    new.table <- the.table[, list(asofdate=seq(from=ymd((year)*10^4+101),
                                               length.out=12,
                                               by="1 month")),by=year]
     
    # Do a complicated procedure to each variable in some group.
    var.names <- c("var1","var2","var3")
     
    for(varname in var.names) {
        #As suggested in an answer to Link 3 above
        #Convert the column name to a 'quote' object
        quote.convert <- function(x) eval(parse(text=paste0('quote(',x,')')))
     
        #Do this for every column name I'll need
        varname <- quote.convert(varname)
        anntot <- quote.convert(paste0(varname,".annual.total"))
        monthly <- quote.convert(paste0(varname,".monthly"))
        rolling <- quote.convert(paste0(varname,".rolling"))
        scaled <- quote.convert(paste0(varname,".scaled"))
     
        #Perform the relevant tasks, using eval()
        #around every variable columnname I may want
        new.table[,eval(anntot):=
                   the.table[,rep(eval(varname),each=12)]]
        new.table[,eval(monthly):=
                   the.table[,rep(eval(varname)/12,each=12)]]
        new.table[,eval(rolling):=
                   rollapply(eval(monthly),mean,width=12,
                             fill=c(head(eval(monthly),1),
                                    tail(eval(monthly),1)))]
        new.table[,eval(scaled):=
                   eval(anntot)/sum(eval(rolling))*eval(rolling),
                  by=year]
    }


Of course, the particular effect on the data and variables here is irrelevant, so please do not focus on it or suggest improvements to accomplishing what it accomplishes in this particular case. What I am looking for, rather, is a generic strategy for the workflow of repeatedly applying an arbitrarily complicated procedure of `data.table` actions to a list of columns or list of lists-of-columns, specified in a variable or passed as an argument to a function, where the procedure must refer programmatically to columns named in the variable/argument, and possibly includes updates, joins, groupings, calls to the `data.table` special objects `.I`, `.SD`, etc.; BUT one which is simpler, more elegant, shorter, or easier to design or implement or understand than the one above or others that require frequent `quote`-ing and `eval`-ing. 

In particular, please note that because the procedures can be fairly complex and involve repeatedly updating the `data.table` and then referencing the updated columns, the standard `lapply(.SD,...), ... .SDcols = ...` approach is usually not a workable substitute. Also, replacing each call of `eval(a.column.name)` with `DT[[a.column.name]]` neither simplifies much nor works completely in general, since that doesn't play nice with the other `data.table` operations, as far as I am aware.

How can one work fully generically in data.table in R with column names in variables

I have a data.table:

    require(data.table)

    set.seed(1)
    data <- data.table(time = c(1:3, 1:4),
                       groups = c(rep(c("b", "a"), c(3, 4))),
                       value = rnorm(7))

    data
    #    groups time      value
    # 1:      b    1 -0.6264538
    # 2:      b    2  0.1836433
    # 3:      b    3 -0.8356286
    # 4:      a    1  1.5952808
    # 5:      a    2  0.3295078
    # 6:      a    3 -0.8204684
    # 7:      a    4  0.4874291

I want to compute a lagged version of the "value" column, _within_ each level of "groups".

The result should look like

    #   groups time      value  lag.value
    # 1      a    1  1.5952808         NA
    # 2      a    2  0.3295078  1.5952808
    # 3      a    3 -0.8204684  0.3295078
    # 4      a    4  0.4874291 -0.8204684
    # 5      b    1 -0.6264538         NA
    # 6      b    2  0.1836433 -0.6264538
    # 7      b    3 -0.8356286  0.1836433


I have tried to use `lag` directly: 

    data$lag.value <- lag(data$value)

...which clearly wouldn't work.

I have also tried:

    unlist(tapply(data$value, data$groups, lag))
     a1         a2         a3         a4         b1         b2         b3 
     NA -0.1162932  0.4420753  2.1505440         NA  0.5894583 -0.2890288 

Which is almost what I want. However the vector generated is ordered differently from the ordering in the data.table which is problematic.

What is the most efficient way to do this in base R, plyr, dplyr, and data.table?

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | Gopalakrishna Palem | View Question on Stackoverflow |
| Solution 1 - R | Michele | View Answer on Stackoverflow |
| Solution 2 - R | song.xiao | View Answer on Stackoverflow |
