Capturing Groups From a Grep RegEx

Bash, Shell, Grep

Bash Problem Overview


I've got this little script in sh (Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:

files="*.jpg"
for f in $files
    do
        echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
        name=$?
        echo $name
    done

So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that in a variable.

I'd like to use grep only, if possible. If not, please no Python or Perl, etc. sed or something like it – I'm new to shell and would like to attack this from the *nix purist angle.

Also, as a super-cool bonus, I'm curious as to how I can concatenate strings in shell. If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?

Please explain what's going on, if you've got the time.

Bash Solutions


Solution 1 - Bash

If you're using Bash, you don't even have to use grep:

files="*.jpg"
regex="[0-9]+_([a-z]+)_[0-9a-z]*"
for f in $files    # unquoted in order to allow the glob to expand
do
    if [[ $f =~ $regex ]]
    then
        name="${BASH_REMATCH[1]}"
        echo "${name}.jpg"    # concatenate strings
        name="${name}.jpg"    # same thing stored in a variable
    else
        echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
    fi
done

It's better to put the regex in a variable. Some patterns won't work if included literally.
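One case where this matters, shown as a minimal sketch that assumes Bash 3.2 or later: quoting the right-hand side of =~ makes Bash compare it as a literal string instead of a regex. The sample filename and the variable names via_variable and via_quoted are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: an unquoted variable on the right of =~ is treated as an extended
# regex, while a quoted one is compared as a literal string (Bash 3.2+).
f="123_abc_d4e5.jpg"            # illustrative filename
regex='[0-9]+_([a-z]+)_'

if [[ $f =~ $regex ]]; then
    via_variable="matched, group 1 is ${BASH_REMATCH[1]}"
else
    via_variable="no match"
fi

if [[ $f =~ "$regex" ]]; then   # the quotes force a literal comparison
    via_quoted="matched"
else
    via_quoted="no match"
fi

echo "variable: $via_variable"
echo "quoted:   $via_quoted"
```

The quoted form fails here because the filename does not contain the literal text of the pattern.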

This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.

You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:

123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz

To eliminate the second and fourth examples, make your regex like this:

^[0-9]+_([a-z]+)_[0-9a-z]*

which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:

^[0-9]+_([a-z]+)_[0-9a-z]*$

then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
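The effect of the anchors can be checked with a small sketch like the one below. It assumes Bash; count_matches is a hypothetical helper introduced only for this illustration, and the sample strings are the four examples listed above.

```shell
#!/usr/bin/env bash
# Sketch: count how many sample strings each version of the regex accepts.
# count_matches is a hypothetical helper, not part of the original answer.
count_matches() {
    local regex=$1 n=0 s
    shift
    for s in "$@"; do
        if [[ $s =~ $regex ]]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

samples=(123_abc_d4e5 xyz123_abc_d4e5 123_abc_d4e5.xyz xyz123_abc_d4e5.xyz)

count_matches '[0-9]+_([a-z]+)_[0-9a-z]*'   "${samples[@]}"   # unanchored: all four
count_matches '^[0-9]+_([a-z]+)_[0-9a-z]*'  "${samples[@]}"   # only names starting with digits
count_matches '^[0-9]+_([a-z]+)_[0-9a-z]*$' "${samples[@]}"   # only 123_abc_d4e5
```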

If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):

name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg

The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=) - the pattern would be included before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).

The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.

In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it so its position is significant.

The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
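To compare \K with a fixed-length look-behind side by side, here is a rough sketch. It assumes GNU grep built with PCRE support (the -P option); the function names capture_with_k and capture_with_lookbehind are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch, assuming GNU grep with -P (PCRE) available.
capture_with_k() {
    # \K discards everything matched so far, so the variable-length
    # prefix [0-9]+_ is required but not printed.
    printf '%s\n' "$1" | grep -oP '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)'
}

capture_with_lookbehind() {
    # (?<=_) is a fixed-length look-behind: it can require the single
    # underscore, but not the variable-length run of digits before it.
    printf '%s\n' "$1" | grep -oP '(?i)(?<=_)[a-z]+(?=_)'
}

capture_with_k 123_abc_d4e5.jpg
capture_with_lookbehind 123_abc_d4e5.jpg
```

Both calls print abc for this sample name, but the look-behind version is looser because it no longer checks for the leading digits.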

Solution 2 - Bash

This isn't really possible with pure grep, at least not generally.

But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this).

Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_ You could extract this like so:

echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'

The first grep would remove any lines that didn't match your overall pattern, the second grep (which has --only-matching specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.

(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1)).
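The grep + cut aside could be sketched roughly as follows. The function name second_field and the sample filenames are made up for the illustration, and the pattern is the simplified one from this answer with a start anchor added.

```shell
# Sketch of the grep + cut approach; second_field is a hypothetical name.
second_field() {
    # grep keeps only names of the expected shape; cut then splits on "_"
    # and prints field 2. Non-matching input produces no output.
    printf '%s\n' "$1" | grep -Ei '^[0-9]+_[a-z]+_' | cut -d _ -f 2
}

second_field 123_abc_d4e5.jpg    # prints: abc
second_field holiday.jpg         # prints nothing
```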

Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)

Solution 3 - Bash

I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep, which doesn't seem to have been mentioned yet. Try changing the lines:

    echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
    name=$?

to the following:

    name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')

to get only the contents of the capturing group 1.

The pcregrep tool utilizes all of the same syntax you've already used with grep, but implements the functionality that you need.

The parameter -o works just like the grep version if it is bare, but it also accepts a numeric parameter in pcregrep, which indicates which capturing group you want to show.

With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.

Interesting Note: You can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.

Solution 4 - Bash

Not possible in just grep I believe

for sed:

name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`

I'll take a stab at the bonus though:

echo "$name.jpg"
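A variation on the sed approach, offered as a sketch rather than a drop-in replacement: it uses -n together with the p flag so that only names where the substitution succeeded are printed, which avoids depending on how sed resolves the |.* alternation. It assumes a sed that supports -E (GNU and BSD sed both do); capture_with_sed is a hypothetical name, and the character classes list both cases because a case-insensitive flag is not portable.

```shell
# Sketch, assuming sed supports -E; capture_with_sed is a hypothetical name.
capture_with_sed() {
    # -n suppresses default output; the p flag prints only when the
    # substitution succeeded, so non-matching names print nothing.
    printf '%s\n' "$1" | sed -n -E 's/^[0-9]+_([a-zA-Z]+)_.*$/\1/p'
}

name=$(capture_with_sed 123_abc_d4e5.jpg)
echo "${name}.jpg"    # concatenation, prints: abc.jpg
```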

Solution 5 - Bash

This is a solution that uses gawk. It's something I find I need to use often so I created a function for it

function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }

to use just do

$ echo 'hello world' | regex1 'hello\s(.*)'
world

Solution 6 - Bash

str="1w 2d 1h"
regex="([0-9])w ([0-9])d ([0-9])h"
if [[ $str =~ $regex ]]
then
    week="${BASH_REMATCH[1]}"
    day="${BASH_REMATCH[2]}"
    hour="${BASH_REMATCH[3]}"
    echo $week --- $day ---- $hour
fi

output: 1 --- 2 ---- 1

Solution 7 - Bash

A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:

f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}

Then name will have the value abc.
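For reference, a short sketch of how the four removal forms differ on the same sample filename. The variable names are illustrative; the expansions themselves are standard POSIX shell.

```shell
# Sketch of the four prefix/suffix removal forms on the sample filename.
f=001_abc_0za.jpg

short_suffix=${f%_*}     # shortest "_*" removed from the end:   001_abc
long_suffix=${f%%_*}     # longest "_*" removed from the end:    001
short_prefix=${f#*_}     # shortest "*_" removed from the start: abc_0za.jpg
long_prefix=${f##*_}     # longest "*_" removed from the start:  0za.jpg

printf '%s\n' "$short_suffix" "$long_suffix" "$short_prefix" "$long_prefix"
```

Combining the first and third forms, as above, is what isolates the middle field.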

See Apple developer docs, search forward for 'Parameter Expansion'.

Solution 8 - Bash

if you have bash, you can use extended globbing

shopt -s extglob
shopt -s nullglob
shopt -s nocaseglob
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
   IFS="_"
   set -- $file
   echo "This is your captured output : $2"
done

or

ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
   IFS="_"
   set -- $file
   echo "This is your captured output : $2"
done
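One caveat: in the first loop above, IFS stays set to "_" after the loop ends, which can affect later word splitting in the same script. A possible way around that, sketched here under the assumption that Bash is available (the here-string needs it), is to scope the IFS change to a single read. The function name middle_part is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: split on "_" without changing IFS for the rest of the script.
# middle_part is a hypothetical name; the here-string (<<<) requires Bash.
middle_part() {
    local first middle rest
    # The IFS assignment applies only to this one read command.
    IFS=_ read -r first middle rest <<< "$1"
    echo "$middle"
}

middle_part 123_abc_d4e5.jpg    # prints: abc
```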

Solution 9 - Bash

I prefer a one-line python or perl command, both of which are often included in major Linux distributions

echo $'
<a href="http://stackoverflow.com">
</a>
<a href="http://google.com">
</a>
' |  python -c $'
import re
import sys
for i in sys.stdin:
  g=re.match(r\'.*href="(.*)"\',i);
  if g is not None:
    print(g.group(1))
'

and to handle files:

ls *.txt | python -c $'
import sys
import re
for i in sys.stdin:
  i=i.strip()
  f=open(i,"r")
  for j in f:
    g=re.match(r\'.*href="(.*)"\',j);
    if g is not None:
      print(g.group(1))
  f.close()
'

Solution 10 - Bash

The following example shows how to extract the 3-character sequence from a filename using a regex capture group:

for f in 123_abc_123.jpg 123_xyz_432.jpg
do
    echo "f:    " $f
    name=$( perl -ne 'if (/[0-9]+_([a-z]+)_[0-9a-z]*/) { print $1 . "\n" }' <<< $f )
    echo "name: " $name
done

Outputs:

f:     123_abc_123.jpg
name:  abc
f:     123_xyz_432.jpg
name:  xyz

So the if-regex conditional in perl filters out all non-matching lines. For the lines that do match, it applies the capture group(s), which you can access with $1, $2, ... respectively.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Isaac | View Question on Stackoverflow
Solution 1 - Bash | Dennis Williamson | View Answer on Stackoverflow
Solution 2 - Bash | RobM | View Answer on Stackoverflow
Solution 3 - Bash | John Sherwood | View Answer on Stackoverflow
Solution 4 - Bash | cobbal | View Answer on Stackoverflow
Solution 5 - Bash | opsb | View Answer on Stackoverflow
Solution 6 - Bash | chirag nayak | View Answer on Stackoverflow
Solution 7 - Bash | martin clayton | View Answer on Stackoverflow
Solution 8 - Bash | ghostdog74 | View Answer on Stackoverflow
Solution 9 - Bash | towith | View Answer on Stackoverflow
Solution 10 - Bash | Stephen Quan | View Answer on Stackoverflow