grep: group capturing

Regex, Linux, Bash, Grep

Regex Problem Overview


I have the following string:

{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}

and I need to get the value of "scheme_version", which is 1234 in this example.

I have tried

grep -Eo "\"scheme_version\":(\w*)"

However, it returns the key as well:

"scheme_version":1234

How can I get only the number? I know I can add a sed call, but I would prefer to do it with a single grep.

Regex Solutions


Solution 1 - Regex

You'll need to use a lookbehind assertion so that the key isn't included in the match. This relies on the -P (Perl-compatible regex) option of GNU grep:

grep -Po '(?<=scheme_version":)[0-9]+'
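A minimal end-to-end sketch of the lookbehind approach, assuming GNU grep built with PCRE support. The variable names json and version are only illustrative and are not part of the original answer:

```shell
#!/bin/sh
# Sample document from the question.
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# -P enables Perl-compatible regexes (GNU grep); -o prints only the matched text.
# The lookbehind requires the key and colon to precede the digits
# without making them part of the match.
version=$(printf '%s\n' "$json" | grep -Po '(?<=scheme_version":)[0-9]+')

echo "$version"
```

The "_id" value is also the text scheme_version, but it is followed by a quote and a comma rather than a quote and a colon, so the lookbehind does not match there.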

Solution 2 - Regex

This might work for you:

echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' |
sed -n 's/.*"scheme_version":\([^}]*\)}/\1/p'
1234

Sorry it's not grep, so disregard this solution if you like.

Or stick with grep and pipe the result through cut:

grep -Eo "\"scheme_version\":(\w*)" | cut -d: -f2

Solution 3 - Regex

I would recommend that you use jq for the job. jq is a command-line JSON processor.

$ cat tmp
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}

$ jq .scheme_version tmp
1234
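As a rough sketch of how this could be wrapped for reuse in a script, assuming jq is installed: the helper name get_field is hypothetical, while -r (raw output) and --arg (pass a shell value in as a jq variable) are standard jq options.

```shell
#!/bin/sh
# Sample document from the question.
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# get_field KEY: read JSON on stdin and print the value stored under KEY.
# Hypothetical helper. -r prints strings without their JSON quotes;
# --arg k "$1" exposes the key name to the jq program as $k.
get_field() {
    jq -r --arg k "$1" '.[$k]'
}

printf '%s\n' "$json" | get_field scheme_version
printf '%s\n' "$json" | get_field _id
```

Because jq actually parses the JSON, this keeps working if the keys are reordered or whitespace is added, which the regex-based answers do not guarantee.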

Solution 4 - Regex

As an alternative to the positive lookbehind method suggested by SiegeX, you can reset the match starting point to directly after scheme_version": with the \K escape sequence. E.g.,

$ grep -Po 'scheme_version":\K[0-9]+'

This restarts the matching process after having matched scheme_version":, and tends to have far better performance than the positive lookbehind. Comparing the two on regex101 shows that the reset-match-start method takes 37 steps and 1 ms, while the positive lookbehind method takes 194 steps and 21 ms.

You can compare the performance yourself on regex101 and you can read more about resetting the match starting point in the PCRE documentation.
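Beyond speed, \K has a practical advantage: the text before it may have variable length, whereas a lookbehind containing an unbounded quantifier such as \s* is generally rejected by PCRE. A small sketch, assuming GNU grep with -P; the input with spaces around the colon is made up for illustration:

```shell
#!/bin/sh
# Hypothetical variant of the document with whitespace around the colon.
spaced='{"_id": "scheme_version", "scheme_version" : 1234}'

# Everything before \K must match, but is dropped from the reported match,
# so optional whitespace (\s*) on either side of the colon is tolerated.
version=$(printf '%s\n' "$spaced" | grep -Po '"scheme_version"\s*:\s*\K[0-9]+')

echo "$version"
```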

Solution 5 - Regex

To avoid relying on grep's PCRE feature, which is available in GNU grep but not in the BSD version, another option is to use ripgrep, e.g.

$ rg -o 'scheme_version.?:(\d+)' -r '$1' <file.json
1234

> -r, --replace: Capture group indices (e.g., $5) and names (e.g., $foo) are supported in the replacement string.

Another example uses Python's json.tool module, which validates and pretty-prints the JSON, before filtering with ripgrep:

$ python -mjson.tool file.json | rg -o 'scheme_version[^\d]+(\d+)' -r '$1'
1234

Related: Can grep output only specified groupings that match?
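If neither PCRE nor ripgrep is available, one rough portable fallback is to chain two plain grep -o calls. This is only a sketch: -o is not required by POSIX, but both GNU and BSD grep support it, and it does use two grep processes rather than one.

```shell
#!/bin/sh
# Sample document from the question.
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# The first grep isolates the key together with its numeric value;
# the second keeps only the digits at the end of that fragment.
version=$(printf '%s\n' "$json" \
    | grep -o '"scheme_version":[0-9][0-9]*' \
    | grep -o '[0-9][0-9]*$')

echo "$version"
```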

Solution 6 - Regex

You can do this, which works because the value sits in the fourth colon-separated field:

$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | awk -F ':' '{print $4}' | tr -d '}'
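Splitting on colons depends on the position of the value in the line. A sketch that searches for the key instead, using awk's standard match() function with RSTART and RLENGTH (the variable name hit is arbitrary):

```shell
#!/bin/sh
# Sample document from the question.
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

version=$(printf '%s\n' "$json" | awk '
    # match() sets RSTART and RLENGTH to the position and length of the match.
    match($0, /"scheme_version":[0-9]+/) {
        hit = substr($0, RSTART, RLENGTH)   # key, colon and digits
        sub(/^.*:/, "", hit)                # drop the key and the colon
        print hit
    }')

echo "$version"
```

This no longer breaks if another key containing a colon is added before scheme_version, although it still assumes a numeric value with no whitespace after the colon.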

Solution 7 - Regex

To generalize @potong's answer, which only extracts "scheme_version", you can use the following expressions to pull out the value of any key:

$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_id":["]*\([^(",})]*\)[",}].*/\1/p'
scheme_version

$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_rev":["]*\([^(",})]*\)[",}].*/\1/p'
4-cad1842a7646b4497066e09c3788e724

$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"scheme_version":["]*\([^(",})]*\)[",}].*/\1/p'
1234
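Since the three commands differ only in the key name, they could be folded into one small function. This is a sketch, not part of the original answer: get_value is a hypothetical helper, and it assumes flat single-line JSON, key names without regex metacharacters, and values that contain no quotes, commas, or braces.

```shell
#!/bin/sh
# Sample document from the question.
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# get_value KEY: read one JSON line on stdin and print the value under KEY.
# The key is spliced into the sed expression; an optional opening quote is
# skipped, then everything up to the next quote, comma, or brace is captured.
get_value() {
    sed -n 's/.*"'"$1"'":"*\([^",}]*\)[",}].*/\1/p'
}

for key in _id _rev scheme_version; do
    printf '%s=%s\n' "$key" "$(printf '%s\n' "$json" | get_value "$key")"
done
```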

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | lstipakov | View Question on Stackoverflow
Solution 1 - Regex | SiegeX | View Answer on Stackoverflow
Solution 2 - Regex | potong | View Answer on Stackoverflow
Solution 3 - Regex | Marc O'Morain | View Answer on Stackoverflow
Solution 4 - Regex | ClarkZinzow | View Answer on Stackoverflow
Solution 5 - Regex | kenorb | View Answer on Stackoverflow
Solution 6 - Regex | kris.zhang | View Answer on Stackoverflow
Solution 7 - Regex | Alexandre Hamon | View Answer on Stackoverflow