Extract a regular expression match

Regex Problem Overview

I'm trying to extract a number from a string.

I'd like to apply something like [0-9]+ to the string "aaa12xxx" and get "12".

I thought it would be something like:

> grep("[0-9]+", "aaa12xxx", value=TRUE)
[1] "aaa12xxx"

And then I figured...

> sub("[0-9]+", "\\1", "aaa12xxx")
[1] "aaaxxx"

But I got some form of response doing:

> sub("[0-9]+", "ARGH!", "aaa12xxx")
[1] "aaaARGH!xxx"

There's a small detail I'm missing.

Regex Solutions

Solution 1 - Regex

Use the new stringr package, which wraps all the existing regular expression operations in a consistent syntax and adds a few that are missing:

library(stringr)
str_locate("aaa12xxx", "[0-9]+")
#      start end
# [1,]     4   5
str_extract("aaa12xxx", "[0-9]+")
# [1] "12"

Solution 2 - Regex

It is probably a bit hasty to say 'ignore the standard functions'. The help file for ?gsub even points to this specifically under 'See Also':

> ‘regmatches’ for extracting matched substrings based on the results of ‘regexpr’, ‘gregexpr’ and ‘regexec’.

So this will work, and is fairly simple:

txt <- "aaa12xxx"
regmatches(txt,regexpr("[0-9]+",txt))
#[1] "12"
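
The help text quoted above also mentions gregexpr. As a minimal sketch (not part of the original answer; the input string and variable names are made up for illustration), it can be swapped in for regexpr to pull out every match rather than only the first:

```r
txt <- "aaa12xxx34"

# gregexpr() finds every match; regmatches() then returns a list
# with one character vector per input string
all_matches <- regmatches(txt, gregexpr("[0-9]+", txt))

all_matches[[1]]  # "12" "34"
```

Note that the result is a list, so a single input string still needs [[1]] to get at its matches.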

Solution 3 - Regex

For your specific case you could remove all non-digits:

gsub("[^0-9]", "", "aaa12xxxx")
# [1] "12"

It won't work in more complex cases, because separate runs of digits get merged:

gsub("[^0-9]", "", "aaa12xxxx34")
# [1] "1234"

Solution 4 - Regex

You can use Perl-style lazy matching:

> sub(".*?([0-9]+).*", "\\1", "aaa12xx99",perl=TRUE)
[1] "12"

Trying to substitute out non-digits would give the wrong result in this case ("1299").
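
To see why the lazy quantifier matters here, a minimal sketch comparing a greedy and a lazy version of the same pattern (the variable names are only for illustration, and perl=TRUE is used in both calls so that the quantifier is the only difference):

```r
x <- "aaa12xx99"

# Greedy: the leading .* grabs as much as it can and only gives back
# the single digit the group needs in order to match
greedy <- sub(".*([0-9]+).*", "\\1", x, perl = TRUE)

# Lazy: .*? stops at the first position where the group can match
lazy <- sub(".*?([0-9]+).*", "\\1", x, perl = TRUE)

greedy  # "9"
lazy    # "12"
```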

Solution 5 - Regex

Use capturing parentheses in the regular expression and group references in the replacement. Anything in parentheses gets remembered, and the groups are then accessed by number: \\2 refers to the second group, here the digits. The first backslash escapes the second one in R so that a literal \2 gets passed to the regular expression parser.

gsub('([[:alpha:]]+)([0-9]+)([[:alpha:]]+)', '\\2', "aaa12xxx")
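
If you want the captured groups themselves rather than a substituted string, base R's regexec can be combined with regmatches. This is a sketch that goes beyond the original answer; the variable names are illustrative:

```r
x <- "aaa12xxx"
pattern <- "([[:alpha:]]+)([0-9]+)([[:alpha:]]+)"

# regexec() records the position of the full match and of each group;
# regmatches() turns those positions into strings
groups <- regmatches(x, regexec(pattern, x))[[1]]

groups     # "aaa12xxx" "aaa" "12" "xxx"
groups[3]  # "12": element 1 is the full match, so group 2 is element 3
```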

Solution 6 - Regex

One way would be this:

test <- regexpr("[0-9]+","aaa12456xxx")

Now, notice regexpr gives you the starting index of the match and, as an attribute, the length of the match:

> test
[1] 4
attr(,"match.length")
[1] 5

So you can use that info with the substr function:

substr("aaa12456xxx",test,test+attr(test,"match.length")-1)
# [1] "12456"

I'm sure there is a more elegant way to do this, but this was the fastest way I could find. Alternatively, you can use sub/gsub to strip out what you don't want to leave what you do want.

Solution 7 - Regex

One important difference between these approaches is the behaviour with any non-matches. For example, the regmatches method may not return a vector of the same length as the input if there is not a match in all positions:

> txt <- c("aaa12xxx","xyz")

> regmatches(txt,regexpr("[0-9]+",txt)) # could cause problems

[1] "12"

> gsub("[^0-9]", "", txt)

[1] "12" ""  

> str_extract(txt, "[0-9]+")

[1] "12" NA
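
If you want to stay with base R but keep the output aligned with the input, one possible workaround (a sketch, not taken from the answers here) is to pre-fill the result with NA and assign only at the positions that matched:

```r
txt <- c("aaa12xxx", "xyz")
m <- regexpr("[0-9]+", txt)

# Start with NA everywhere, then fill in only the positions that matched
# (regexpr() returns -1 where there is no match)
out <- rep(NA_character_, length(txt))
out[m != -1] <- regmatches(txt, m)

out  # "12" NA
```

This gives the same shape of result as str_extract without the extra dependency.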

Solution 8 - Regex

A solution for this question

library(stringr)
str_extract_all("aaa12xxx", regex("[[:digit:]]{1,}"))
# [[1]]
# [1] "12"

[[:digit:]]: digit [0-9]

{1,}: matches at least once

Solution 9 - Regex

Using strapply in the gsubfn package. strapply is like apply in that the args are object, modifier and function except that the object is a vector of strings (rather than an array) and the modifier is a regular expression (rather than a margin):

library(gsubfn)
x <- c("xy13", "ab 12 cd 34 xy")
strapply(x, "\\d+", as.numeric)
# list(13, c(12, 34))

This says to match one or more digits (\d+) in each component of x, passing each match through as.numeric. It returns a list whose components are vectors of matches of the respective components of x. Looking at the output, we see that the first component of x has one match, which is 13, and the second component of x has two matches, which are 12 and 34. See http://gsubfn.googlecode.com for more info.
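
For comparison, roughly the same result can be had without gsubfn. The following is a base R sketch of the idea, not how strapply is implemented:

```r
x <- c("xy13", "ab 12 cd 34 xy")

# Extract every run of digits per string, then convert each vector to numeric
nums <- lapply(regmatches(x, gregexpr("[0-9]+", x)), as.numeric)

nums  # list(13, c(12, 34))
```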

Solution 10 - Regex

Another solution:

temp <- regexpr('\\d+', "aaa12xxx")
# the match ends at start + match.length - 1
substr("aaa12xxx", temp[1], temp[1] + attr(temp, "match.length")[1] - 1)

Solution 11 - Regex

Using the package unglue we would do the following:

# install.packages("unglue")
library(unglue)
unglue_vec(c("aaa12xxx", "aaaARGH!xxx"), "{prefix}{number=\\d+}{suffix}", var = "number")
#> [1] "12" NA

Created on 2019-11-06 by the reprex package (v0.3.0)

Use the convert argument to convert to a number automatically:

unglue_vec(
  c("aaa12xxx", "aaaARGH!xxx"), 
  "{prefix}{number=\\d+}{suffix}", 
  var = "number", 
  convert = TRUE)
#> [1] 12 NA

Solution 12 - Regex

You could write your regex functions with C++, compile them into a DLL and call them from R.

    #include <cstring>
    #include <regex>
    #include <string>

    extern "C" {

    // Sets *_bool to 1 if the whole string matches the regex, 0 otherwise.
    __declspec(dllexport)
    void regex_match(const char **first, char **regexStr, int *_bool)
    {
        std::cmatch _cmatch;
        const char *last = *first + std::strlen(*first);
        std::regex rx(*regexStr);
        bool found = std::regex_match(*first, last, _cmatch, rx);
        *_bool = found;
    }

    // Copies up to *N successive matches into out.
    // Assumes each out[i] buffer is large enough to hold its match.
    __declspec(dllexport)
    void regex_search_results(const char **str, const char **regexStr, int *N, char **out)
    {
        std::string s(*str);
        std::regex rgx(*regexStr);
        std::smatch m;

        int i = 0;
        while (std::regex_search(s, m, rgx) && i < *N) {
            std::strcpy(out[i], m[0].str().c_str());
            i++;
            s = m.suffix().str();
        }
    }

    } // extern "C"

Call them from R as:

dyn.load("C:\\YourPath\\RegTest.dll")

regex_match <- function(str, regstr) {
  .C("regex_match", x = as.character(str), y = as.character(regstr), z = as.logical(1))$z
}

regex_match("abc", "a(b)c")

regex_search_results <- function(x, y, n) {
  .C("regex_search_results", x = as.character(x), y = as.character(y), i = as.integer(n), z = character(n))$z
}

regex_search_results("aaa12aa34xxx", "[0-9]+", 5)

Content Type	Original Author	Original Content on Stackoverflow
Question	tovare	View Question on Stackoverflow
Solution 1 - Regex	hadley	View Answer on Stackoverflow
Solution 2 - Regex	thelatemail	View Answer on Stackoverflow
Solution 3 - Regex	Marek	View Answer on Stackoverflow
Solution 4 - Regex	Jyotirmoy Bhattacharya	View Answer on Stackoverflow
Solution 5 - Regex	Ragy Isaac	View Answer on Stackoverflow
Solution 6 - Regex	Robert	View Answer on Stackoverflow
Solution 7 - Regex	andyyy	View Answer on Stackoverflow
Solution 8 - Regex	Tho Vu	View Answer on Stackoverflow
Solution 9 - Regex	G. Grothendieck	View Answer on Stackoverflow
Solution 10 - Regex	pari	View Answer on Stackoverflow
Solution 11 - Regex	moodymudskipper	View Answer on Stackoverflow
Solution 12 - Regex	user2074102	View Answer on Stackoverflow