Use of ~ (tilde) in the R programming language


R Problem Overview


I saw in a tutorial about regression modeling the following command:

myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

What exactly does this command do, and what is the role of ~ (tilde) in the command?

R Solutions


Solution 1 - R

The thing on the right of <- is a formula object. It is often used to denote a statistical model, where the thing on the left of the ~ is the response and the things on the right of the ~ are the explanatory variables. So in English you'd say something like "Species depends on Sepal Length, Sepal Width, Petal Length and Petal Width".

The myFormula <- part of that line stores the formula in an object called myFormula so you can use it in other parts of your R code.
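As a quick check of the description above, the stored object can be inspected directly. This is only a sketch using base R functions; nothing here fits a model or needs the iris data to be loaded.

```r
# Formula from the question; creating it does not touch any data
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

class(myFormula)      # "formula"
all.vars(myFormula)   # every variable name mentioned in the formula
length(myFormula)     # 3: the `~` itself, the left side, the right side
```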


Other common uses of formula objects in R

The lattice package uses them to specify the variables to plot.
The ggplot2 package uses them to specify panels for plotting.
The dplyr package uses them for non-standard evaluation.
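The packages listed above are not required to see the same pattern. As a hedged illustration, base R's aggregate() (not mentioned in the answer, chosen here only because it needs no extra packages) also accepts a formula, read as "summarise the left side, grouped by the right side". The name speciesMeans is made up for this sketch.

```r
library(datasets)

# Mean sepal length per species, specified with a formula
speciesMeans <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
speciesMeans   # one row per species
```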

Solution 2 - R

R defines a ~ (tilde) operator for use in formulas. Formulas have all sorts of uses, but perhaps the most common is for regression:

library(datasets)
# Species is a factor, so lm() is expected to warn here; a numeric response suits lm() better
lm(myFormula, data = iris)

help("~") or help("formula") will teach you more.
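A minimal sketch of the regression use with a numeric response follows. The formula numFormula is a hypothetical one chosen for illustration, not the one from the question.

```r
library(datasets)

# Hypothetical formula with a numeric response (not from the question)
numFormula <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width

fit <- lm(numFormula, data = iris)
coef(fit)   # intercept plus one coefficient per predictor
```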

@Spacedman has covered the basics. Let's discuss how it works.

First, because ~ is an operator, it is essentially shorthand for a call to a function with two arguments:

> `~`(lhs,rhs)
lhs ~ rhs
> lhs ~ rhs
lhs ~ rhs

That can be helpful to know for use in e.g. apply family commands.
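One way this might look in practice is sketched below: lapply() builds one single-predictor formula per column name by calling the tilde function through do.call(). The predictor list is an arbitrary choice for illustration, and other approaches would work just as well.

```r
# Column names to turn into single-predictor formulas (illustrative choice)
predictors <- c("Sepal.Length", "Petal.Length")

# do.call("~", ...) calls the tilde function with symbols as its two arguments
formulas <- lapply(predictors, function(p) {
  do.call("~", list(as.name("Species"), as.name(p)))
})

formulas[[1]]   # a formula object for the first predictor
```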

Second, you can manipulate the formula as text:

oldform <- as.character(myFormula) # Get components
myFormula <- as.formula( paste( oldform[2], "Sepal.Length", sep="~" ) )
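As an alternative to pasting strings by hand, base R also provides reformulate() and update(). The answer does not use them; this sketch only shows that the same kind of edit can be done without paste(). The names newForm and shortForm are made up for the example.

```r
# reformulate() builds a formula from character vectors
newForm <- reformulate(c("Sepal.Length", "Sepal.Width"), response = "Species")
newForm     # Species ~ Sepal.Length + Sepal.Width

# update() edits an existing formula; "." stands for the part being kept
shortForm <- update(newForm, . ~ . - Sepal.Width)
shortForm   # Species ~ Sepal.Length
```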

Third, you can manipulate it as a list:

myFormula[[2]]  # left-hand side (the response)
myFormula[[3]]  # right-hand side (the predictors)
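To make the list view concrete, here is a small sketch that inspects each element and then swaps out the right-hand side. The copy named f and the replacement expression are illustrative assumptions.

```r
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

myFormula[[1]]          # the operator itself: `~`
myFormula[[2]]          # left-hand side: the symbol Species
class(myFormula[[3]])   # right-hand side is an unevaluated call

# Work on a copy and replace the right-hand side with a shorter expression
f <- myFormula
f[[3]] <- quote(Petal.Length + Petal.Width)
all.vars(f)             # variables now used by the modified formula
```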

Finally, there are some helpful tricks with formulae (see help("formula") for more):

myFormula <- Species ~ . 

For example, the version above is the same as the original version, since the dot means "all variables not yet used." This looks at the data.frame you use in your eventual model call, sees which variables exist in the data.frame but aren't explicitly mentioned in your formula, and replaces the dot with those missing variables.
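The dot expansion can be observed without fitting a model. As a sketch, terms() with a data argument resolves the dot against the columns of iris; the names dotFormula and expanded are made up for the example.

```r
library(datasets)

dotFormula <- Species ~ .

# terms() with a data argument replaces the dot with the remaining columns
expanded <- terms(dotFormula, data = iris)
attr(expanded, "term.labels")
# "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
```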

Solution 3 - R

In short,

The tilde (~) separates the left side of a formula from the right side.

For example, in a linear model, it separates the dependent variable from the independent variables and can be read as "as a function of." So, to model a person's wages (wages) as a function of their years of education (years_of_education), we write something like,

wages ~ years_of_education

Here,

 Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

This means that Species is modeled as a function of Sepal Length, Sepal Width, Petal Length and Petal Width.

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Ankita | View Question on Stack Overflow
Solution 1 - R | Spacedman | View Answer on Stack Overflow
Solution 2 - R | Ari B. Friedman | View Answer on Stack Overflow
Solution 3 - R | ashraful16 | View Answer on Stack Overflow