Using gsub to extract character string before white space in R

R Problem Overview

I have a list of birthdays that look something like this:

dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

I just want to grab the calendar date from this variable (i.e., drop everything after the first occurrence of white space).

Here's what I have tried so far:

> dob.abridged <- substring(dob, 1, 8)
> dob.abridged
[1] "9/9/43 1" "9/17/88 " "11/21/48"
> dob.abridged <- gsub(" $", "", dob.abridged, perl = TRUE)
> dob.abridged
[1] "9/9/43 1" "9/17/88"  "11/21/48"

So my code works for calendar dates of length 7 or 8, but not length 6 (the first element still carries a trailing " 1"). Any pointers on a more effective regex to use with gsub that can handle calendar dates of length 6, 7 or 8?

Thank you.

R Solutions

Solution 1 - R

No need for substring, just use gsub:

gsub( " .*$", "", dob )
# [1] "9/9/43"   "9/17/88"  "11/21/48"

A space ( ), then any character (.) any number of times (*) until the end of the string ($). See ?regex to learn regular expressions.
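As a small variation on the pattern above (a sketch, not part of the original answer): because " .*$" already runs to the end of the string, there is at most one match per element, so sub should give the same result as gsub here. If the separator might be a tab rather than a literal space, "\\s" can stand in for the space. The names date_sub and date_ws are only illustrative.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# sub() replaces only the first match in each element; the pattern consumes
# everything from the first space to the end, so one replacement is enough.
date_sub <- sub(" .*$", "", dob)
print(date_sub)
# [1] "9/9/43"   "9/17/88"  "11/21/48"

# "\\s" matches any whitespace character (space, tab, ...), not only " ".
date_ws <- sub("\\s.*$", "", c("9/9/43\t12:00", "9/17/88 12:00"))
print(date_ws)
# [1] "9/9/43"  "9/17/88"
```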

Solution 2 - R

I often use strsplit for these sorts of problems but liked how simple Romain's answer was. I thought it would be interesting to compare Romain's solution to a strsplit answer:

Here's a strsplit solution:

sapply(strsplit(dob, "\\s+"), "[", 1)
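To see what the strsplit call is doing, it may help to look at the intermediate list it returns. The sketch below does that and then picks the first piece of each element; vapply is swapped in for sapply here only because it checks the return type, and the names parts and first are illustrative.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# strsplit() returns a list with one character vector per input element,
# split wherever one or more whitespace characters occur.
parts <- strsplit(dob, "\\s+")
print(parts[[1]])
# [1] "9/9/43" "12:00"  "AM/PM"

# Take the first piece of every element; vapply() insists on exactly one
# character string per element.
first <- vapply(parts, `[`, character(1), 1)
print(first)
# [1] "9/9/43"   "9/17/88"  "11/21/48"
```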

Using the microbenchmark package and dob <- rep(dob, 1000) with the original data:

Unit: milliseconds
                                  expr       min        lq    median        uq       max neval
                 gsub(" .*$", "", dob)  4.228843  4.247969  4.258232  4.268029  5.081608  1000
 sapply(strsplit(dob, "\\s+"), "[", 1) 14.438241 14.558832 14.634638 14.756628 53.344984  1000

The clear winner on a Win 7 machine is the gsub regex from Romain. Thanks for the answer and explanation Romain.
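For anyone wanting to rerun the comparison, a setup along these lines should work. This is a sketch under assumptions: the microbenchmark package must be installed, times is lowered to 100 to keep the run short (the figures above used 1000 evaluations), and timings will differ by machine and R version.

```r
dob <- rep(c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM"), 1000)

# Check that both approaches agree before timing them.
res_gsub <- gsub(" .*$", "", dob)
res_split <- sapply(strsplit(dob, "\\s+"), "[", 1)
stopifnot(identical(res_gsub, res_split))

# Run the benchmark only if the package is available.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    gsub(" .*$", "", dob),
    sapply(strsplit(dob, "\\s+"), "[", 1),
    times = 100
  ))
}
```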

Solution 3 - R

The stringr package contains a function tailored to this problem: word(), which by default splits each string on single spaces and returns the requested word.

library(stringr)
word(dob,1)
# [1] "9/9/43"   "9/17/88"  "11/21/48"

Solution 4 - R

Another way to extract the alphabetic characters before a white space uses the stringr package, which has to be installed first:

stringr::str_extract(c("juan carlos", "miguel angel"), stringr::regex(pattern = "[a-z]+(?=\\s)", ignore_case = FALSE))

[a-z]: matches every character between a and z (in Unicode code point order).

+: 1 or more.

(?=\\s): Lookahead. The match must be followed by \s (white space), but the \s itself is not included in the match.

More info: https://stringr.tidyverse.org/articles/regular-expressions.html
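Note that [a-z]+ only matches letters, so it would not pick up the dates in dob, which consist of digits and slashes. The same lookahead idea can be applied to the question's data by matching non-whitespace characters instead. The sketch below uses base R (regmatches and regexpr with perl = TRUE) rather than stringr; the name dates is illustrative.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# "^\\S+" takes the leading run of non-whitespace characters; the lookahead
# "(?=\\s)" requires whitespace right after it without consuming it.
# Lookaheads need the PCRE engine, hence perl = TRUE.
dates <- regmatches(dob, regexpr("^\\S+(?=\\s)", dob, perl = TRUE))
print(dates)
# [1] "9/9/43"   "9/17/88"  "11/21/48"
```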

Solution 5 - R

Another regex pattern, which extracts only the date:

library(stringr)
str_extract(dob, regex("\\d{1,}\\/\\d{1,}\\/\\d{1,}"))
#[1] "9/9/43"   "9/17/88"  "11/21/48"

\\d{1,}: Matches one or more digits (equivalent to \\d+)
\\/: An escaped forward slash; the escape is harmless but not required, since / is not a regex metacharacter
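A shorter form of the same pattern, sketched in base R instead of stringr, with \\d+ in place of \\d{1,} and the slashes left unescaped. The name dates is illustrative.

```r
dob <- c("9/9/43 12:00 AM/PM", "9/17/88 12:00 AM/PM", "11/21/48 12:00 AM/PM")

# regexpr() locates the first digits/digits/digits run in each element and
# regmatches() pulls out the matched text.
dates <- regmatches(dob, regexpr("\\d+/\\d+/\\d+", dob))
print(dates)
# [1] "9/9/43"   "9/17/88"  "11/21/48"
```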

Content Type	Original Author	Original Content on Stackoverflow
Question	Anupa Fabian	View Question on Stackoverflow
Solution 1 - R	Romain Francois	View Answer on Stackoverflow
Solution 2 - R	Tyler Rinker	View Answer on Stackoverflow
Solution 3 - R	tiago	View Answer on Stackoverflow
Solution 4 - R	Juan Carlos Herrera Burbano	View Answer on Stackoverflow
Solution 5 - R	Tho Vu	View Answer on Stackoverflow