Extract a substring according to a pattern

RegexRSubstr

Regex Problem Overview


Suppose I have a list of string:

string = c("G1:E001", "G2:E002", "G3:E003")

Now I hope to get a vector of string that contains only the parts after the colon ":", i.e substring = c(E001,E002,E003).

Is there a convenient way in R to do this? Using substr?

Regex Solutions


Solution 1 - Regex

Here are a few ways:

1) sub

sub(".*:", "", string)
## [1] "E001" "E002" "E003"

2) strsplit

sapply(strsplit(string, ":"), "[", 2)
## [1] "E001" "E002" "E003"

3) read.table

read.table(text = string, sep = ":", as.is = TRUE)$V2
## [1] "E001" "E002" "E003"

4) substring

This assumes second portion always starts at 4th character (which is the case in the example in the question):

substring(string, 4)
## [1] "E001" "E002" "E003"

4a) substring/regex

If the colon were not always in a known position we could modify (4) by searching for it:

substring(string, regexpr(":", string) + 1)

5) strapplyc

strapplyc returns the parenthesized portion:

library(gsubfn)
strapplyc(string, ":(.*)", simplify = TRUE)
## [1] "E001" "E002" "E003"

6) read.dcf

This one only works if the substrings prior to the colon are unique (which they are in the example in the question). Also it requires that the separator be colon (which it is in the question). If a different separator were used then we could use sub to replace it with a colon first. For example, if the separator were _ then string <- sub("_", ":", string)

c(read.dcf(textConnection(string)))
## [1] "E001" "E002" "E003"

7) separate

7a) Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for after, and then extract the latter.

library(dplyr)
library(tidyr)
library(purrr)

DF <- data.frame(string)
DF %>% 
  separate(string, into = c("pre", "post")) %>% 
  pull("post")
## [1] "E001" "E002" "E003"

7b) Alternately separate can be used to just create the post column and then unlist and unname the resulting data frame:

library(dplyr)
library(tidyr)

DF %>% 
  separate(string, into = c(NA, "post")) %>% 
  unlist %>%
  unname
## [1] "E001" "E002" "E003"

8) trimws We can use trimws to trim word characters off the left and then use it again to trim the colon.

trimws(trimws(string, "left", "\\w"), "left", ":")
## [1] "E001" "E002" "E003"

Note

The input string is assumed to be:

string <- c("G1:E001", "G2:E002", "G3:E003")

Solution 2 - Regex

For example using gsub or sub

    gsub('.*:(.*)','\\1',string)
    [1] "E001" "E002" "E003"

Solution 3 - Regex

Here is another simple answer

gsub("^.*:","", string)

Solution 4 - Regex

Late to the party, but for posterity, the stringr package (part of the popular "tidyverse" suite of packages) now provides functions with harmonised signatures for string handling:

string <- c("G1:E001", "G2:E002", "G3:E003")
# match string to keep
stringr::str_extract(string = string, pattern = "E[0-9]+")
# [1] "E001" "E002" "E003"

# replace leading string with ""
stringr::str_remove(string = string, pattern = "^.*:")
# [1] "E001" "E002" "E003"

Solution 5 - Regex

This should do:

gsub("[A-Z][1-9]:", "", string)

gives

[1] "E001" "E002" "E003"

Solution 6 - Regex

If you are using data.table then tstrsplit() is a natural choice:

tstrsplit(string, ":")[[2]]
[1] "E001" "E002" "E003"

Solution 7 - Regex

The unglue package provides an alternative, no knowledge about regular expressions is required for simple cases, here we'd do :

# install.packages("unglue")
library(unglue)
string = c("G1:E001", "G2:E002", "G3:E003")
unglue_vec(string,"{x}:{y}", var = "y")
#> [1] "E001" "E002" "E003"

Created on 2019-11-06 by the reprex package (v0.3.0)

More info : https://github.com/moodymudskipper/unglue/blob/master/README.md

Solution 8 - Regex

Another method to extract a substring

library(stringr)
substring <- str_extract(string, regex("(?<=:).*"))
#[1] "E001" "E002" "E003
  • (?<=:): look behind the colon (:)

Solution 9 - Regex

Surprisingly, a very "base R" solution has not yet been added:

string = c("G1:E001", "G2:E002", "G3:E003")

regmatches(string, regexpr('E[0-9]+', string))

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionalittleboyView Question on Stackoverflow
Solution 1 - RegexG. GrothendieckView Answer on Stackoverflow
Solution 2 - RegexagstudyView Answer on Stackoverflow
Solution 3 - RegexRagy IsaacView Answer on Stackoverflow
Solution 4 - RegexCSJCampbellView Answer on Stackoverflow
Solution 5 - Regexuser1981275View Answer on Stackoverflow
Solution 6 - Regexsindri_baldurView Answer on Stackoverflow
Solution 7 - RegexmoodymudskipperView Answer on Stackoverflow
Solution 8 - RegexTho VuView Answer on Stackoverflow
Solution 9 - RegexandscharView Answer on Stackoverflow