AWK: Access captured group from line pattern

RegexAwk

Regex Problem Overview


If I have an awk command

pattern { ... }

and pattern uses a capturing group, how can I access the string so captured in the block?

Regex Solutions


Solution 1 - Regex

With gawk, you can use the match function to capture parenthesized groups.

gawk 'match($0, pattern, ary) {print ary[1]}' 

example:

echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}' 

outputs cd.

Note the specific use of gawk which implements the feature in question.

For a portable alternative you can achieve similar results with match() and substr.

example:

echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'

outputs cd.

Solution 2 - Regex

That was a stroll down memory lane...

I replaced awk by perl a long time ago.

Apparently the AWK regular expression engine does not capture its groups.

you might consider using something like :

perl -n -e'/test(\d+)/ && print $1'

the -n flag causes perl to loop over every line like awk does.

Solution 3 - Regex

This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.

Definition

Add this to your .bash_profile etc.

function regex { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'0'}']}'; }

Usage

Capture regex for each line in file

$ cat filename | regex '.*'

Capture 1st regex capture group for each line in file

$ cat filename | regex '(.*)' 1

Solution 4 - Regex

You can use GNU awk:

$ cat hta
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]

$ gawk 'match($0, /.*(http.*?)\$/, m) { print m[1]; }' < hta
http://www.mysite.net/

Solution 5 - Regex

You can simulate capturing in vanilla awk too, without extensions. Its not intuitive though:

step 1. use gensub to surround matches with some character that doesnt appear in your string. step 2. Use split against the character. step 3. Every other element in the splitted array is your capture group.

$ echo 'ab cb ad' | awk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
ab|ad

Solution 6 - Regex

I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:

> function regex { perl -n -e "/$1/ && printf "%s\n", "'$1' }

I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.

'([0-9]*)ms$'

Solution 7 - Regex

i think gawk match()-to-array is only for first instance of the capture group.

if there are multiple things you'd like to capture, and perform any complex operations upon them, perhaps

gawk 'BEGIN { S = SUBSEP 
          } { 
              nx=split(gensub(/(..(..)..(..))/, 
                              "\\1"(S)"\\2"(S)"\\3", "g", str), 
                       arr, S)
              for(x in nx) { perform-ops-over arr[x] } }'

This way you aren't constrained by either gensub(), which limits the complexity if your modifications, or by match().

by pure trial-and-error, one caveat i've noted about gawk in unicode mode : for a valid unicode string 뀇꿬 with the 6 octal codes listed below :

> Scenario 1 : matching individual bytes are fine, but will also report you the multi-byte RSTART of 1 instead of a byte-level answer of 2. It also won't provide info on whether \207 is the 1st continuation byte, or the second one, since RLENGTH will always be 1 here.

$ gawk 'BEGIN{ print match("\353\200\207\352\277\254", "\207") }' 
$ 1 

> Scenario 2 : Match also works against unicode-invalid patterns like this

$ gawk 'BEGIN{ match("\353\200\207\352\277\254", "\207\352"); 
$                print RSTART, RLENGTH }' 
$ 1 2

> Scenario 3 : you can check for existence of a pattern against a unicode-illegal string (\300 \xC0 is UTF8-invalid for all possible byte pairings)

$ gawk 'BEGIN{ print ("\300\353\200\207\352\277\254" ~ /\200/) }' 
$ 1

> Scenarios 4/5/6 : the error message will show up for either (a) match() with unicode-invalid string, index() for either argument to be unicode-invalid/incomplete.

$ gawk 'BEGIN{ match("\300\353\200\207\352\277\254", "\207\352"); print RSTART, RLENGTH }' gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale. 2 2

$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\352") }' gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale. 0

$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\200") }' gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale. 0

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionrampionView Question on Stackoverflow
Solution 1 - Regexglenn jackmanView Answer on Stackoverflow
Solution 2 - RegexPeter TillemansView Answer on Stackoverflow
Solution 3 - RegexopsbView Answer on Stackoverflow
Solution 4 - RegexIsvaraView Answer on Stackoverflow
Solution 5 - RegexydrolView Answer on Stackoverflow
Solution 6 - RegexwyttenView Answer on Stackoverflow
Solution 7 - RegexRARE Kpop ManifestoView Answer on Stackoverflow