Extract string from string using RegEx in the Terminal

Regex, Bash, Grep

Regex Problem Overview


I have a string like "first url, second url, third url" and would like to extract only the url that follows the word "second" (first occurrence only) in the OS X Terminal. How can I do it?

In my favorite editor I used the regex /second (url)/ and then $1 to extract it; I just don't know how to do the same in the Terminal.

Keep in mind that url stands for an actual URL. I'll be using one of these expressions to match it: <https://stackoverflow.com/questions/1141848/regex-to-match-url>

Regex Solutions


Solution 1 - Regex

echo 'first url, second url, third url' | sed 's/.*second//'

Edit: I misunderstood. Better:

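echo 'first url, second url, third url' | sed 's/.*second \([^, ]*\).*/\1/'

or:

echo 'first url, second url, third url' | perl -nle 'print $1 if /second ([^, ]*)/'

The character class [^, ]* stops at the first comma or space, so the separator that follows the URL is not captured.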

Solution 2 - Regex

Piping to an external process (such as sed or perl, suggested above) can be expensive, especially when the operation runs many times, for example inside a loop. Bash has built-in regular expression matching:

[[ "string" =~ regex ]]

Much as your favourite editor exposes matches as $1, $2, etc., Bash fills the BASH_REMATCH array: element 0 holds the entire match, and element n holds the text matched by the nth parenthesized group.

In your particular example:

str="first url1, second url2, third url3"
if [[ $str =~ (second )([^,]*) ]]; then
  echo "match: '${BASH_REMATCH[2]}'"
else
  echo "no match found"
fi

Output:

match: 'url2'

Specifically, =~ supports extended regular expressions as defined by POSIX, but with platform-specific extensions (which vary in extent and can be incompatible).
On Linux platforms (GNU userland), see man grep; on macOS/BSD platforms, see man re_format.
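As a minimal sketch of how this could be wrapped for reuse, the function below keeps the pattern in a variable and expands it unquoted on the right-hand side of =~, which avoids the quoting pitfalls of writing the regex inline. The name url_after and the example hosts are hypothetical, and the keyword is interpolated into the pattern as-is, so any regex metacharacters in it are interpreted as such.

```shell
#!/usr/bin/env bash

# url_after KEYWORD TEXT  (hypothetical helper, not part of the answer above)
# Prints the token that follows the first occurrence of KEYWORD in TEXT.
url_after() {
  local keyword=$1 text=$2
  # Keep the pattern in a variable and expand it unquoted after =~;
  # quoting it there would make Bash treat it as a literal string.
  local re="${keyword} ([^, ]+)"
  if [[ $text =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"   # group 1: the text after the keyword
  else
    return 1                              # no match: print nothing, report failure
  fi
}

url_after second 'first http://a.example/1, second http://b.example/2, third ftp://c.example/3'
```

No external process is started, so calling this in a loop stays cheap.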

Solution 3 - Regex

With the other answer you are still left with everything after the desired URL, so I propose the following solution:

echo 'first url, second url, third url' | sed 's/.*second \(url\).*/\1/'

In sed's default basic regular expressions (POSIX BRE), you group an expression by escaping the parentheses around it: \( and \).

Solution 4 - Regex

What you probably forgot when trying this is sed's -E option, which enables extended regular expressions.

From GNU sed --help:

  -E, -r, --regexp-extended
                 use extended regular expressions in the script
                 (for portability use POSIX -E).

You don't have to change your regex significantly, but you do need to add .* on both sides so that the pattern consumes the whole line and the substitution discards everything except the captured group.

This works fine for me:

echo "first url, second url, third url" | sed -E 's/.*second (url).*/\1/'

Output:

url

Here the output "url" is the second instance in the string. If you know that the entries are separated by a comma followed by a space, and your URLs never contain a comma, then [^,]* is sufficient to match the URL.

Optionally:

echo "first http://test.url/1, second ://test.url/with spaces/2, third ftp://test.url/3" \
     | sed -E 's/.*second ([a-zA-Z]*:\/\/[^,]*).*/\1/'

Which correctly outputs:

://test.url/with spaces/2
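One thing to watch with the substitution approach: when the pattern does not match, sed prints the input line unchanged, which is easy to mistake for a result. A possible guard, sketched below, is to combine -n (suppress automatic printing) with the p flag (print only when the substitution succeeded). The function name extract_second_url is hypothetical, and the pattern here requires at least one letter in the scheme ([a-zA-Z]+), which is an assumption on top of the answer's expression.

```shell
# extract_second_url: hypothetical helper that reads lines on stdin and prints
# the URL after "second", or nothing at all when a line does not match.
extract_second_url() {
  # -n suppresses sed's default output; the trailing p prints only on a match.
  sed -n -E 's/.*second ([a-zA-Z]+:\/\/[^,]*).*/\1/p'
}

# Matching line: prints the captured URL, spaces included.
echo 'first http://test.url/1, second http://test.url/with spaces/2, third ftp://test.url/3' \
  | extract_second_url

# Non-matching line: prints nothing instead of echoing the input back.
echo 'no keyword here' | extract_second_url
```

Both -n and -E are accepted by GNU sed and by the BSD sed shipped with macOS.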

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | fregante | View Question on Stack Overflow
Solution 1 - Regex | Sjoerd | View Answer on Stack Overflow
Solution 2 - Regex | Dmitry Shevkoplyas | View Answer on Stack Overflow
Solution 3 - Regex | mhitza | View Answer on Stack Overflow
Solution 4 - Regex | Yeti | View Answer on Stack Overflow