AWK: Access captured group from line pattern

Regex Problem Overview

If I have an awk command

pattern { ... }

and pattern uses a capturing group, how can I access the string so captured in the block?

Regex Solutions

Solution 1 - Regex

With gawk, you can use the match function to capture parenthesized groups.

gawk 'match($0, pattern, ary) {print ary[1]}'

example:

echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}'

outputs cd.

Note that this requires gawk: the third (array) argument to match() is a GNU extension.

For a portable alternative, you can achieve similar results with match() and substr(), using the RSTART and RLENGTH variables that match() sets.

example:

echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'

outputs cd.
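When the part you want sits after a fixed literal prefix, the same match()/substr() idea can be applied in two steps: take the whole match, then drop the prefix. A minimal sketch, assuming a POSIX awk; the function name extract_digits and the sample pattern test[0-9]+ are only illustrations, not part of the original answer:

```shell
# Hypothetical helper: print the digits that follow the literal "test".
extract_digits() {
  awk '
    match($0, /test[0-9]+/) {
      whole = substr($0, RSTART, RLENGTH)   # text matched by the full pattern
      print substr(whole, 5)                # skip the 4-character prefix "test"
    }'
}

printf '%s\n' 'run test42 ok' 'no match here' 'test7' | extract_digits
```

Lines that do not match print nothing, which is the same behaviour as using the pattern as a line filter.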

Solution 2 - Regex

That was a stroll down memory lane...

I replaced awk with Perl a long time ago.

Standard (POSIX) awk regular expressions do not expose their capture groups to the program.

You might consider using something like:

perl -n -e'/test(\d+)/ && print $1'

The -n flag causes perl to loop over every line of input, like awk does.

Solution 3 - Regex

This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.

Definition

Add this to your .bash_profile etc.

function regex { gawk 'match($0,/'"$1"'/, ary) {print ary['"${2:-0}"']}'; }

Usage

Print the whole match for each line in the file

$ cat filename | regex '.*'

Print the 1st capture group for each line in the file

$ cat filename | regex '(.*)' 1
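If you only need the whole match (group 0), a variant that passes the regex with -v instead of splicing it into the awk program text avoids problems with slashes or quotes in the pattern, and works without gawk. This is a sketch rather than a drop-in replacement; regex0 is a made-up name, and note that -v processes escape sequences, so a backslash in the pattern may need to be doubled:

```shell
# Hypothetical helper: print the text matched by the regex in $1, one result per matching line.
regex0() {
  # The pattern is handed to awk as data (variable re), not as part of the program.
  awk -v re="$1" 'match($0, re) { print substr($0, RSTART, RLENGTH) }'
}

printf '%s\n' 'took 120ms' 'no timing' | regex0 '[0-9]+ms'
```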

Solution 4 - Regex

You can use GNU awk:

$ cat hta
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]

$ gawk 'match($0, /.*(http.*)\$/, m) { print m[1]; }' < hta
http://www.mysite.net/

Solution 5 - Regex

You can simulate capturing without match()'s array argument too, though this still needs gawk, because gensub() is a GNU extension. It's not intuitive though:

Step 1. Use gensub to surround matches with some character that doesn't appear in your string.
Step 2. Use split against that character.
Step 3. Every other element in the resulting array is one of your matches.

$ echo 'ab cb ad' | gawk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
ab|ad
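For an awk without gensub, a similar result can be had by calling match() in a loop and consuming the line as you go. A minimal sketch, assuming a POSIX awk; all_matches is a hypothetical name and the | separator simply mirrors the output above:

```shell
# Hypothetical helper: print every match of the regex in $1, joined by "|".
all_matches() {
  awk -v re="$1" '{
    rest = $0; out = ""
    # Stop on an empty match so the loop always makes progress.
    while (match(rest, re) && RLENGTH > 0) {
      out = out (out == "" ? "" : "|") substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)   # continue after this match
    }
    print out
  }'
}

echo 'ab cb ad' | all_matches 'a.'
```

For the sample input this prints ab|ad, the same as the gensub version.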

Solution 6 - Regex

I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:

function regex { perl -n -e "/$1/ && printf \"%s\n\", "'$1'; }

I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.

'([0-9]*)ms$'

Solution 7 - Regex

I think gawk's match()-to-array form only captures the first occurrence of the pattern in the line.

If there are multiple things you'd like to capture, and then perform complex operations on them, perhaps try:

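gawk 'BEGIN { S = SUBSEP }
      {
          # Replace each match with its groups joined by S, then split on S.
          nx = split(gensub(/(..(..)..(..))/,
                            "\\1" S "\\2" S "\\3", "g", $0),
                     arr, S)
          # split() returns the element count, so loop by index.
          for (x = 1; x <= nx; x++) { print x, arr[x] }   # put your own operations here
      }'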

This way you aren't constrained by either gensub(), which limits the complexity of your modifications, or by match().

Through pure trial and error, I've noted one caveat about gawk in Unicode mode. Take the valid Unicode string 뀇꿬, whose six bytes appear as octal escapes in the examples below:

Scenario 1: matching an individual byte works, but match() reports the character-level RSTART of 1 rather than the byte-level position of 3. It also gives no indication of whether \207 is the first continuation byte or the second, since RLENGTH will always be 1 here.

$ gawk 'BEGIN{ print match("\353\200\207\352\277\254", "\207") }'
1

Scenario 2: match() also works with patterns that are not valid Unicode on their own, like this one:

$ gawk 'BEGIN{ match("\353\200\207\352\277\254", "\207\352");
               print RSTART, RLENGTH }'
1 2

Scenario 3: you can test for the existence of a pattern in a string that is not valid UTF-8 (\300, i.e. \xC0, is invalid in UTF-8 whatever byte follows it):

$ gawk 'BEGIN{ print ("\300\353\200\207\352\277\254" ~ /\200/) }'
1

Scenarios 4/5/6: the warning shows up for either (a) match() with a Unicode-invalid string, or (b) index() when either argument is Unicode-invalid or incomplete.

$ gawk 'BEGIN{ match("\300\353\200\207\352\277\254", "\207\352"); print RSTART, RLENGTH }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
2 2

$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\352") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0

$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\200") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0
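If what you actually want is byte-level positions, one option is to run awk in the C locale, where each byte counts as one character. A small sketch under that assumption; byte_index is a hypothetical name, and this is my suggestion rather than something from the tests above:

```shell
# Hypothetical helper: report byte-level length and positions for the sample string.
byte_index() {
  # LC_ALL=C makes awk treat the string as six single-byte characters.
  LC_ALL=C awk 'BEGIN {
    s = "\353\200\207\352\277\254"            # the six bytes of the sample string
    print length(s), index(s, "\207"), index(s, "\352")
  }'
}

byte_index
```

In the C locale this should print 6 3 4: six bytes in total, \207 at byte 3 and \352 at byte 4.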

Content Type	Original Author	Original Content on Stackoverflow
Question	rampion	View Question on Stackoverflow
Solution 1 - Regex	glenn jackman	View Answer on Stackoverflow
Solution 2 - Regex	Peter Tillemans	View Answer on Stackoverflow
Solution 3 - Regex	opsb	View Answer on Stackoverflow
Solution 4 - Regex	Isvara	View Answer on Stackoverflow
Solution 5 - Regex	ydrol	View Answer on Stackoverflow
Solution 6 - Regex	wytten	View Answer on Stackoverflow
Solution 7 - Regex	RARE Kpop Manifesto	View Answer on Stackoverflow