    my.data.frame <- subset(data, V1 > 2 | V2 < 4)

An alternative solution that mimics the behavior of this function and would be more appropriate for inclusion within a function body:

    new.data <- data[ which( data$V1 > 2 | data$V2 < 4) , ]

Some people criticize the use of `which` as unnecessary, but it does prevent `NA` values from throwing back unwanted results. The equivalent of the two options demonstrated above without the `which` (i.e. not returning NA-rows when the combined condition evaluates to `NA`) would be:

     new.data <- data[ !is.na(data$V1 > 2 | data$V2 < 4) & ( data$V1 > 2 | data$V2 < 4) , ]

Note: I want to thank the anonymous contributor that attempted to fix the error in the code immediately above, a fix that got rejected by the moderators. There was actually an additional error that I noticed when I was correcting the first one. The conditional clause that checks for NA values needs to be first if it is to be handled as I intended, since ...

    > NA & 1
    [1] NA
    > 0 & NA
    [1] FALSE

Order of arguments may matter when using `&`.
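
To make the `NA` behaviour discussed above concrete, here is a minimal sketch on a made-up data frame (the values are hypothetical, not from the original post). It compares indexing with the raw logical vector, with `which`, and with `subset`:

```r
# Hypothetical toy data with NAs in both columns
data <- data.frame(V1 = c(1, 3, NA, NA, 1),
                   V2 = c(5, 5, 0, 9, NA))

# Row 3 is TRUE because NA | TRUE is TRUE; rows 4 and 5 stay NA
cond <- data$V1 > 2 | data$V2 < 4

raw_idx   <- data[cond, ]                   # each NA in the index yields an all-NA row
which_idx <- data[which(cond), ]            # which() keeps only the TRUE positions
subset_df <- subset(data, V1 > 2 | V2 < 4)  # subset() also treats NA as FALSE

nrow(raw_idx)    # 4 (two real rows plus two NA rows)
nrow(which_idx)  # 2
```

On this data, `which` and `subset` return the same two rows, while the raw logical index also returns two rows filled with `NA`.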

You are looking for `|`. See http://cran.r-project.org/doc/manuals/R-intro.html#Logical-vectors

    my.data.frame <- data[(data$V1 > 2) | (data$V2 < 4), ]

Just for the sake of completeness, we can use the operators `[` and `[[`:

    set.seed(1)
    df <- data.frame(v1 = runif(10), v2 = letters[1:10])

Several options:

    df[df[1] < 0.5 | df[2] == "g", ]
    df[df[[1]] < 0.5 | df[[2]] == "g", ]
    df[df["v1"] < 0.5 | df["v2"] == "g", ]

`df$name` is [equivalent to][1] `df[["name", exact = FALSE]]`.
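
As a small illustration of how these extraction operators differ, the sketch below checks what each one returns. The second data frame `df2` and its column name `value` are made up for the partial-matching example:

```r
set.seed(1)
df <- data.frame(v1 = runif(10), v2 = letters[1:10])

cls_single  <- class(df[1])    # "data.frame": single brackets keep the data frame
cls_double  <- class(df[[1]])  # "numeric": double brackets extract the column itself
same_column <- identical(df$v1, df[["v1"]])

# `$` allows partial name matching, like [[ ..., exact = FALSE ]]
df2 <- data.frame(value = 1:3)  # hypothetical example column
partial_same  <- identical(df2$val, df2[["val", exact = FALSE]])
exact_is_null <- is.null(df2[["val"]])  # exact matching finds no column "val"
```

All three logical results should come out `TRUE`.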

Using `dplyr`:

    library(dplyr)
    filter(df, v1 < 0.5 | v2 == "g")

Using `sqldf`:

    library(sqldf)
    sqldf("SELECT *
          FROM df
          WHERE v1 < 0.5 OR v2 = 'g'")

Output for the above options:

              v1 v2
    1 0.26550866  a
    2 0.37212390  b
    3 0.20168193  e
    4 0.94467527  g
    5 0.06178627  j

  [1]: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.html

I am wondering if there is any way to declare a byte variable in a short way like floats or doubles? I mean like `5f` and `5d`. Sure I could write `byte x = 5`, but that's a bit inconsistent if you use `var` for local variables.

Literal suffix for byte in .NET?

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.

I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:

*What is better, a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) with regard to performance when querying data for graphing.*

## Relational

Given a relational database, I would use a `data_instances` table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:

Fields: `id` `fk_to_device` `fk_to_metric` `metric_value` `timestamp`

When I want to draw a graph for a particular metric on a particular device, I must query this singular table *filtering out* the other devices, and the other metrics being analysed for this device:

    SELECT metric_value, timestamp FROM data_instances
        WHERE fk_to_device=1 AND fk_to_metric=2

The number of rows in this table would be:

    d * m_d * f * t

where `d` is the number of **devices**, `m_d` is the accumulative **number of metrics** being recorded for all devices, `f` is the **frequency** at which data is polled for and `t` is the total amount of **time** the system has been collecting data.

For a user recording 10 metrics for each of 3 devices every 5 minutes for a year, we would have roughly **3.15 million** records (3 × 10 × 105,120 polls per year).

### Indexes

Without indexes on `fk_to_device` and `fk_to_metric` scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also `timestamp` (for creating graphs with localised periods) is a requirement.

## Non-Relational (NoSQL)

MongoDB has the concept of a *collection*; unlike tables, these can be created programmatically without setup. With these I could partition the storage of data for each device, or even each metric recorded for each device.

I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.

## Undecided

Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?

Storing time-series data, relational or non?

I have a data.frame in R. I want to try two different conditions on two different columns, but I want these conditions to be inclusive. Therefore, I would like to use "OR" to combine the conditions. I have used the following syntax before with a lot of success when I wanted to use the "AND" condition.

    my.data.frame <- data[(data$V1 > 2) & (data$V2 < 4), ]

But I don't know how to use an 'OR' in the above.

How to combine multiple conditions to subset a data-frame using "OR"?

I have a data frame where I would like to add an additional row that totals up the values for each column. For example, let's say I have this data:

    x <- data.frame(Language=c("C++", "Java", "Python"), 
                    Files=c(4009, 210, 35), 
                    LOC=c(15328,876, 200), 
                    stringsAsFactors=FALSE)

Data looks like this:
    
      Language Files   LOC
    1      C++  4009 15328
    2     Java   210   876
    3   Python    35   200

My instinct is to do this: 

    y <- rbind(x, c("Total", colSums(x[,2:3])))

And this works, it computes the totals:

    > y
      Language Files   LOC
    1      C++  4009 15328
    2     Java   210   876
    3   Python    35   200
    4    Total  4254 16404

The problem is that the Files and LOC columns have all been converted to strings:

    > y$LOC
    [1] "15328" "876"   "200"   "16404"

I understand that this is happening because I created a vector `c("Total", colSums(x[,2:3]))` with inputs that are both numbers and strings, and it's converting all the elements to a common type so that all of the vector elements are the same. Then the same thing happens to the Files and LOC columns.

What's a better way to do this?

Add row to a data frame with total sum for each column
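
The coercion described in this question can be reproduced directly. The sketch below also shows one possible workaround, binding a one-row data frame instead of a character vector. This is an assumption for illustration, not taken from the original answers; the names `total_vec` and `totals` are made up.

```r
x <- data.frame(Language = c("C++", "Java", "Python"),
                Files = c(4009, 210, 35),
                LOC = c(15328, 876, 200),
                stringsAsFactors = FALSE)

# c() coerces mixed inputs to a common type, here character
total_vec <- c("Total", colSums(x[, 2:3]))
class(total_vec)  # "character"

# Possible workaround: keep the totals in a one-row data frame so that
# each column retains its own type when the rows are bound together
totals <- data.frame(Language = "Total", t(colSums(x[, 2:3])),
                     stringsAsFactors = FALSE)
y <- rbind(x, totals)
class(y$LOC)  # "numeric"
```

Because `rbind` then combines two data frames column by column, `Files` and `LOC` stay numeric.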

I'm using `lapply` to run a complex function on a large number of items, and I'd like to save the output from each item (if any) together with any warnings/errors that were produced so that I can tell which item produced which warning/error. 

I found a way to catch warnings using `withCallingHandlers` ([described here](https://stackoverflow.com/questions/4947528)).  However, I need to catch errors as well.  I can do it by wrapping it in a `tryCatch` (as in the code below), but is there a better way to do it?

    catchToList <- function(expr) {
      val <- NULL
      myWarnings <- NULL
      wHandler <- function(w) {
        myWarnings <<- c(myWarnings, w$message)
        invokeRestart("muffleWarning")
      }
      myError <- NULL
      eHandler <- function(e) {
        myError <<- e$message
        NULL
      }
      val <- tryCatch(withCallingHandlers(expr, warning = wHandler), error = eHandler)
      list(value = val, warnings = myWarnings, error=myError)
    } 

Sample output of this function is:

    > catchToList({warning("warning 1");warning("warning 2");1})
    $value
    [1] 1
    
    $warnings
    [1] "warning 1" "warning 2"
    
    $error
    NULL

    > catchToList({warning("my warning");stop("my error")})
    $value
    NULL
    
    $warnings
    [1] "my warning"
    
    $error
    [1] "my error"

There are several questions here on SO that discuss `tryCatch` and error handling, but none that I found that address this particular issue.  See https://stackoverflow.com/questions/3903157, https://stackoverflow.com/questions/4020239, and https://stackoverflow.com/questions/2589275 for the most relevant ones.

How do I save warnings and errors as output from a function?

Suppose there is a data.frame **foo_data_frame** and one wants to regress the target column **Y** on some other columns. Usually a formula and a model are used for that purpose. For example:

    linear_model <- lm(Y ~ FACTOR_NAME_1 + FACTOR_NAME_2, foo_data_frame)

That does the job well if the formula is coded statically. To loop over several models with a constant number of explanatory variables (say, 2), it can be handled like this:

    for (i in seq_len(factor_number - 1)) {
      for (j in seq(i + 1, factor_number)) {
        linear_model <- lm(Y ~ F1 + F2, list(Y=foo_data_frame$Y,
                                             F1=foo_data_frame[[i]],
                                             F2=foo_data_frame[[j]]))
        # linear_model further analyzing...
      }
    }

My question is how to achieve the same effect when the number of variables changes dynamically at run time?

    for (number_of_factors in seq_len(5)) {
       # Then loop over subsets with #number_of_factors cardinality.
       for (factors_subset in all_subsets_with_fixed_cardinality) {
         # Here I want to fit model with factors from factors_subset.
         linear_model <- lm(Does R provide smth to write here?)
       }
    }



Formula with dynamic number of variables
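
One way the loop in this question could be filled in is sketched below, assuming `reformulate()` from the `stats` package to build the formula from a character vector and `combn()` to enumerate the subsets. The data frame contents and the column names `A`, `B`, `C` are made up for illustration.

```r
# Hypothetical stand-in for foo_data_frame
set.seed(42)
foo_data_frame <- data.frame(Y = rnorm(20), A = rnorm(20),
                             B = rnorm(20), C = rnorm(20))
factor_names <- setdiff(names(foo_data_frame), "Y")

models <- list()
for (number_of_factors in seq_along(factor_names)) {
  # combn() returns every subset of the requested size
  subsets <- combn(factor_names, number_of_factors, simplify = FALSE)
  for (factors_subset in subsets) {
    # reformulate() builds a formula such as Y ~ A + B from the names
    f <- reformulate(factors_subset, response = "Y")
    key <- paste(factors_subset, collapse = "+")
    models[[key]] <- lm(f, data = foo_data_frame)
  }
}

length(models)  # 7 models for 3 candidate columns
```

Each fitted model is stored under a key naming its columns, so `models[["A+B"]]` holds the fit for `Y ~ A + B`.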

I would like to calculate the area under a curve to do integration without defining a function such as in `integrate()`.

My data looks as this:

    Date          Strike     Volatility
    2003-01-01    20         0.2
    2003-01-01    30         0.3
    2003-01-01    40         0.4
    etc.

I plotted `plot(strike, volatility)` to look at the volatility smile. Is there a way to integrate this plotted &quot;curve&quot;?

Calculate the Area under a Curve
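
A common way to integrate tabulated points without defining a function is the trapezoidal rule. The sketch below applies it to the three rows shown in the question. It is an illustration under that assumption, not the answer from the original thread.

```r
# The three sample points from the question
strike     <- c(20, 30, 40)
volatility <- c(0.2, 0.3, 0.4)

# Sort by x in case the data are not already ordered
ord <- order(strike)
x <- strike[ord]
y <- volatility[ord]

# Trapezoidal rule: interval width times the mean of the two endpoint heights
area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
area  # 6
```

Here the two trapezoids contribute 2.5 and 3.5, giving an area of 6.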

Consider this simple example: 

    labNames <- c('xLab','yLab')
    plot(c(1:10),xlab=expression(paste(labNames[1], x^2)),ylab=expression(paste(labNames[2], y^2)))

What I want is for the character entry defined by the variable `labNames`, 'xLab' or 'yLab', to appear next to the x^2 or y^2 defined by the `expression()`. As it is, the actual text `labNames` with a subscript is joined to the superscripted expression.

Any thoughts?

Combining paste() and expression() functions in plot labels
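
One possible approach, sketched here as an assumption and not taken from the original thread, is to build the label with `bquote()`, which substitutes the value of anything wrapped in `.()` while leaving the rest as a plotmath expression:

```r
labNames <- c("xLab", "yLab")

# .() inserts the value of labNames[i]; ~ adds a space in plotmath
xlab_expr <- bquote(.(labNames[1]) ~ x^2)
ylab_expr <- bquote(.(labNames[2]) ~ y^2)

deparse(xlab_expr)  # "\"xLab\" ~ x^2"
```

The resulting objects can then be passed to the plot call, e.g. `plot(1:10, xlab = xlab_expr, ylab = ylab_expr)`.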

Hi, I'm trying to learn SASS/SCSS and am trying to refactor my own mixin for clearfix.

What I'd like is for the mixin to be based on whether I pass the mixin a width.

Thoughts so far (pseudo code only as I will be including other mixins):

<!-- language: lang-sass -->

    @mixin clearfix($width) {
    
       @if !$width {
    
      	// if width is not passed, or empty do this
    
       } @else {
     
            display: inline-block;
            width: $width;
       }
    }

Here's how I thought I might call it, but it's not working.

`@include clearfix();`

or

`@include clearfix(100%)`

or

`@include clearfix(960px)`

I'd appreciate any help on the best or right way to do this!

Syntax for if/else condition in SCSS mixin

Is there any way to use inline conditions in Lua?

Such as:

    print("blah: " .. (a == true ? "blah" : "nahblah"))

Inline conditions in Lua (a == b ? "yes" : "no")?

Here's a snippet from my csproj file:


    <ProjectReference Include="..\program_data\program_data.csproj" Condition="'$(Configuration)'=='Debug'">
          <Project>{4F9034E0-B8E3-448E-8794-CF9B9A5E7D46}</Project>
          <Name>program_data</Name>
    </ProjectReference>

What I'd like to do is include `program_data.dll` for **multiple** build configurations, for example, both Release and Debug.

I tried doing the following

    Condition="'$(Configuration)'=='Debug' || '$(Configuration)'=='Release'"

but Visual Studio chokes on this. 

Is there a way I can do this, or must I have a separate `<ProjectReference>` for each build config?

Project reference conditional include with multiple conditions

Does a Python equivalent to the Ruby `||=` operator ("set the variable if the variable is not set") exist?

Example in Ruby:


     variable_not_set ||= 'bla bla'
     variable_not_set == 'bla bla'

     variable_set = 'pi pi'
     variable_set ||= 'bla bla'
     variable_set == 'pi pi'

Python conditional assignment operator

I've had a colleague that told me he once worked for a company that had as a policy to never have conditionals ("if" and "switch" statements) in the code and that they let all the decisions in the code be done using polymorphism and (I'm guessing) some other OO principles.

I *sort of* understand the reasoning behind this, of having code that is more DRY and easier to update, but I'm looking for a more in-depth explanation of this concept. Or maybe it's part of a more general design approach.

If anyone has any resources for this or would be willing to explain or even have some more terms related to this I can use to find more answers I'd be much obliged.

I found [one question on SO][1] that was kind of related but I'm unfamiliar with C++ so I don't understand too much of the answers there.

(I'm no OO guru btw but I can manage)

I'm most proficient in PHP, and after that Python so I'd prefer info that uses those languages.

Update: I'll ask my colleague for more info on what he meant exactly.

Update 2015: after some more years of experience in programming I see now that the aim of this policy was probably to prevent programmers from adding functionality in a haphazard way by just adding conditionals (if statements) in certain places. A better way to extend software is to use the ["Open/Closed principle"][2] where software is extended by using inheritance and polymorphism. I strongly doubt whether the policy was super strict on all conditionals as it's kinda hard to go completely without them.


  [1]: https://stackoverflow.com/questions/234458/do-polymorphism-or-conditionals-promote-better-design
  [2]: https://en.wikipedia.org/wiki/Open/closed_principle

If-less programming (basically without conditionals)

When should one use a `data.frame`, and when is it better to use a `matrix`?

Both keep data in a rectangular format, so sometimes it's unclear.

Are there any general rules of thumb for when to use which data type?





Should I use a data.frame or a matrix?

I am just starting with R and encountered a strange behaviour: when inserting the first row in an empty data frame, the original column names get lost.

example:

    a<-data.frame(one = numeric(0), two = numeric(0))
    a
    #[1] one two
    #<0 rows> (or 0-length row.names)
    names(a)
    #[1] "one" "two"
    a<-rbind(a, c(5,6))
    a
    #  X5 X6
    #1  5  6
    names(a)
    #[1] "X5" "X6"

As you can see, the column names **one** and **two** were replaced by **X5** and **X6**.

Could somebody please tell me why this happens and is there a right way to do this without losing column names?

A shotgun solution would be to save the names in an auxiliary vector and then add them back when finished working on the data frame.


Thanks

Context:

I created a function which gathers some data and adds them as a new row to a data frame received as a parameter.
I create the data frame, iterate through my data sources, passing the data.frame to each function call to be filled up with its results.

R: losing column names when adding rows to an empty data frame
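
One possible way to keep the names, sketched here as an assumption and not taken from the original answers, is to bind a one-row data frame that carries the same column names instead of a bare vector:

```r
a <- data.frame(one = numeric(0), two = numeric(0))

# The new row is itself a data frame, so rbind() matches columns by name
a <- rbind(a, data.frame(one = 5, two = 6))

names(a)  # "one" "two"
```

The same pattern should carry over to the function described in the context above, as long as each call builds its new row with the column names of the data frame it receives.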

I have been reading about how read.table is not efficient for large data files. Also how R is not suited for large data sets. So I was wondering where I can find what the practical limits are and any performance charts for (1) Reading in data of various sizes (2) working with data of varying sizes. 

In effect, I want to know when the performance deteriorates and when I hit a road block. Also any comparison against C++/MATLAB or other languages would be really helpful. Finally, if there is any special performance comparison for Rcpp and RInside, that would be great!


Practical limits of R data frame

I have a large data set and I would like to read specific columns or drop all the others.

    data <- read.dta("file.dta")

I select the columns that I'm not interested in:

    var.out <- names(data)[!names(data) %in% c("iden", "name", "x_serv", "m_serv")]

and then I'd like to do something like:

    for(i in 1:length(var.out)) {
       paste("data$", var.out[i], sep="") <- NULL
    }
    }

to drop all the unwanted columns. Is this the optimal solution?
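
The loop above cannot work as written, because `paste()` returns a character string and not something that can be assigned to. A minimal sketch of two loop-free alternatives follows. The small data frame is a made-up stand-in for the result of `read.dta`, and the columns `extra1` and `extra2` are hypothetical.

```r
# Hypothetical stand-in for the data read from "file.dta"
data <- data.frame(iden = 1:3, name = c("a", "b", "c"),
                   x_serv = c(0.1, 0.2, 0.3), m_serv = c(1, 0, 1),
                   extra1 = 7:9, extra2 = c(TRUE, FALSE, TRUE))

keep <- c("iden", "name", "x_serv", "m_serv")

# Option 1: select the wanted columns by name
kept <- data[, names(data) %in% keep, drop = FALSE]

# Option 2: compute the unwanted names and exclude them
var.out <- setdiff(names(data), keep)
dropped <- data[, !(names(data) %in% var.out), drop = FALSE]

names(kept)  # "iden" "name" "x_serv" "m_serv"
```

Both options index columns with a logical vector, so no per-column loop is needed.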


| Content Type | Original Author |
|---|---|
| Question | Sam |
| Solution 1 - R | IRTFM |
| Solution 2 - R | ncray |
| Solution 3 - R | mpalanco |
