Extracting a string between two other strings in R

Regex, R, Stringr

Regex Problem Overview


I am trying to find a simple way to extract an unknown substring (it could be anything) that appears between two known substrings. For example, I have a string:

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

I need to extract the string GET_ME which is between STR1 and STR2 (without the white spaces).

I am trying str_extract(a, "STR1 (.+) STR2"), but I am getting the entire match

[1] "STR1 GET_ME STR2"

I can of course strip the known strings, to isolate the substring I need, but I think there should be a cleaner way to do it by using a correct regular expression.

Regex Solutions


Solution 1 - Regex

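You may use str_match with the pattern STR1 (.*?) STR2. Note that the spaces are "meaningful": to match anything between STR1 and STR2, use STR1(.*?)STR2, or use STR1\\s*(.*?)\\s*STR2 to trim surrounding whitespace from the value you need. Unlike str_extract, which returns only the whole match, str_match also returns each capture group, so the captured value is in the second column of the result. If you have multiple occurrences, use str_match_all.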

Also, if you need to match strings that span across line breaks/newlines add (?s) at the start of the pattern: (?s)STR1(.*?)STR2 / (?s)STR1\\s*(.*?)\\s*STR2.

library(stringr)
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1\\s*(.*?)\\s*STR2")
res[,2]
[1] "GET_ME"
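
The two variations mentioned above (multiple occurrences and matches that span line breaks) can be sketched as follows. This is a minimal illustration, not part of the original answer; the input strings multi and multiline are made up for the example.

```r
library(stringr)

# Hypothetical input with two STR1 ... STR2 pairs.
multi <- "x STR1 GET_ME STR2 y STR1 GET_ME2 STR2"

# str_match_all() returns a list with one matrix per input string;
# column 1 holds the full match, column 2 the first capture group.
all_hits <- str_match_all(multi, "STR1\\s*(.*?)\\s*STR2")[[1]][, 2]
print(all_hits)

# Hypothetical input whose target spans a line break.
multiline <- "STR1 line one\nline two STR2"

# Without (?s), "." does not match the newline, so there is no match (NA).
no_dotall <- str_match(multiline, "STR1\\s*(.*?)\\s*STR2")[, 2]
print(no_dotall)

# With (?s), "." also matches newlines, so the capture crosses the line break.
with_dotall <- str_match(multiline, "(?s)STR1\\s*(.*?)\\s*STR2")[, 2]
print(with_dotall)
```

Indexing with [[1]] selects the matrix for the first (here, only) input string; for a vector of inputs, each element has its own matrix.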

Another way using base R regexec (to get the first match):

test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"
pattern <- "STR1\\s*(.*?)\\s*STR2"
result <- regmatches(test, regexec(pattern, test))
result[[1]][2]
[1] "GET_ME"
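
To get every match in base R rather than only the first, one possible sketch swaps the capture group for lookarounds and uses gregexpr. This is a different technique from the regexec call above, and it assumes the boundaries are exactly "STR1 " and " STR2" with single spaces, since PCRE lookbehinds must have a fixed length.

```r
test <- " anything goes here, STR1 GET_ME STR2, anything goes here STR1 GET_ME2 STR2"

# The lookbehind (?<=STR1 ) and lookahead (?= STR2) assert the boundaries
# without including them in the match; perl = TRUE enables this syntax.
pattern <- "(?<=STR1 ).*?(?= STR2)"

# gregexpr() finds all matches; regmatches() extracts them as a list
# with one character vector per input string.
all_matches <- regmatches(test, gregexpr(pattern, test, perl = TRUE))[[1]]
print(all_matches)
```

Because the boundaries are not part of the match, no capture-group indexing is needed on the result.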

Solution 2 - Regex

Here's another way, using base R:

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

gsub(".*STR1 (.+) STR2.*", "\\1", a)

Output:

[1] "GET_ME"
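
Two properties of this substitution approach are worth checking before relying on it: which pair is returned when there are several, and what happens when nothing matches. The sketch below is illustrative only; the strings x and y are hypothetical, sub is used because only one replacement is needed, and perl = TRUE is an assumption made here so that the backtracking behavior is that of PCRE.

```r
x <- "x STR1 A STR2 y STR1 B STR2"  # hypothetical input with two pairs

# Greedy: the leading ".*" consumes as much as it can, so the last pair wins.
greedy <- sub(".*STR1 (.+) STR2.*", "\\1", x, perl = TRUE)
print(greedy)

# Lazy quantifiers make the first pair win instead.
lazy <- sub("^.*?STR1 (.+?) STR2.*$", "\\1", x, perl = TRUE)
print(lazy)

# No match: sub() returns the input unchanged, which can be mistaken
# for a result, so test for a match with grepl() first.
y <- "no markers here"
unchanged <- sub(".*STR1 (.+) STR2.*", "\\1", y, perl = TRUE)
safe <- if (grepl("STR1 .+ STR2", y)) unchanged else NA_character_
print(safe)
```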

Solution 3 - Regex

Another option is to use qdapRegex::ex_between to extract strings between left and right boundaries:

qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"

It also works with multiple occurrences

a <- "anything STR1 GET_ME STR2, anything goes here, STR1 again get me STR2"

qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"       "again get me"

Or multiple left and right boundaries

a <- "anything STR1 GET_ME STR2, anything goes here, STR4 again get me STR5"
qdapRegex::ex_between(a, c("STR1", "STR4"), c("STR2", "STR5"))[[1]]
#[1] "GET_ME"       "again get me"

The first capture is between "STR1" and "STR2", and the second is between "STR4" and "STR5".

Solution 4 - Regex

We could use {unglue}; in that case we don't need regex at all:

library(unglue)
unglue::unglue_vec(
  " anything goes here, STR1 GET_ME STR2, anything goes here", 
  "{}STR1 {x} STR2{}")
#> [1] "GET_ME"

{} matches anything without keeping it, and {x} captures its match (any variable name other than x could be used). The syntax "{}STR1 {x} STR2{}" is short for "{=.*?}STR1 {x=.*?} STR2{=.*?}".

If you wanted to extract the sides too you could do:

unglue::unglue_data(
  " anything goes here, STR1 GET_ME STR2, anything goes here", 
  "{left}, STR1 {x} STR2, {right}")
#>                  left      x              right
#> 1  anything goes here GET_ME anything goes here

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Sasha | View Question on Stackoverflow
Solution 1 - Regex | Wiktor Stribiżew | View Answer on Stackoverflow
Solution 2 - Regex | Ulises Rosas-Puchuri | View Answer on Stackoverflow
Solution 3 - Regex | Ronak Shah | View Answer on Stackoverflow
Solution 4 - Regex | moodymudskipper | View Answer on Stackoverflow