How to get the part of a file after the first line that matches a regular expression

BashShellScriptingGrep

Bash Problem Overview


I have a file with about 1000 lines. I want the part of my file after the line which matches my grep statement.

That is:

cat file | grep 'TERMINATE'     # It is found on line 534

So, I want the file from line 535 to line 1000 for further processing.

How can I do that?

Bash Solutions


Solution 1 - Bash

The following will print the line matching TERMINATE till the end of the file:

sed -n -e '/TERMINATE/,$p'

Explained: -n disables default behavior of sed of printing each line after executing its script on it, -e indicated a script to sed, /TERMINATE/,$ is an address (line) range selection meaning the first line matching the TERMINATE regular expression (like grep) to the end of the file ($), and p is the print command which prints the current line.

This will print from the line that follows the line matching TERMINATE till the end of the file: (from AFTER the matching line to EOF, NOT including the matching line)

sed -e '1,/TERMINATE/d'

Explained: 1,/TERMINATE/ is an address (line) range selection meaning the first line for the input to the 1st line matching the TERMINATE regular expression, and d is the delete command which delete the current line and skip to the next line. As sed default behavior is to print the lines, it will print the lines after TERMINATE to the end of input.

If you want the lines before TERMINATE:

sed -e '/TERMINATE/,$d'

And if you want both lines before and after TERMINATE in two different files in a single pass:

sed -e '1,/TERMINATE/w before
/TERMINATE/,$w after' file

The before and after files will contain the line with terminate, so to process each you need to use:

head -n -1 before
tail -n +2 after

IF you do not want to hard code the filenames in the sed script, you can:

before=before.txt
after=after.txt
sed -e "1,/TERMINATE/w $before
/TERMINATE/,\$w $after" file

But then you have to escape the $ meaning the last line so the shell will not try to expand the $w variable (note that we now use double quotes around the script instead of single quotes).

I forgot to tell that the new line is important after the filenames in the script so that sed knows that the filenames end.

How would you replace the hardcoded TERMINATE by a variable?

You would make a variable for the matching text and then do it the same way as the previous example:

matchtext=TERMINATE
before=before.txt
after=after.txt
sed -e "1,/$matchtext/w $before
/$matchtext/,\$w $after" file

to use a variable for the matching text with the previous examples:

## Print the line containing the matching text, till the end of the file:
## (from the matching line to EOF, including the matching line)
matchtext=TERMINATE
sed -n -e "/$matchtext/,\$p"

## Print from the line that follows the line containing the
## matching text, till the end of the file:
## (from AFTER the matching line to EOF, NOT including the matching line)
matchtext=TERMINATE
sed -e "1,/$matchtext/d"

## Print all the lines before the line containing the matching text:
## (from line-1 to BEFORE the matching line, NOT including the matching line)
matchtext=TERMINATE
sed -e "/$matchtext/,\$d"

The important points about replacing text with variables in these cases are:

  1. Variables ($variablename) enclosed in single quotes ['] won't "expand" but variables inside double quotes ["] will. So, you have to change all the single quotes to double quotes if they contain text you want to replace with a variable.
  2. The sed ranges also contain a $ and are immediately followed by a letter like: $p, $d, $w. They will also look like variables to be expanded, so you have to escape those $ characters with a backslash [\] like: \$p, \$d, \$w.

Solution 2 - Bash

As a simple approximation you could use

grep -A100000 TERMINATE file

which greps for TERMINATE and outputs up to 100,000 lines following that line.

From the man page:

> -A NUM, --after-context=NUM
> > Print NUM lines of trailing context after matching lines. > Places a line containing a group separator (--) between > contiguous groups of matches. With the -o or --only-matching > option, this has no effect and a warning is given.

Solution 3 - Bash

A tool to use here is AWK:

cat file | awk 'BEGIN{ found=0} /TERMINATE/{found=1}  {if (found) print }'

How does this work:

  1. We set the variable 'found' to zero, evaluating false
  2. if a match for 'TERMINATE' is found with the regular expression, we set it to one.
  3. If our 'found' variable evaluates to True, print :)

The other solutions might consume a lot of memory if you use them on very large files.

Solution 4 - Bash

If I understand your question correctly you do want the lines after TERMINATE, not including the TERMINATE-line. AWK can do this in a simple way:

awk '{if(found) print} /TERMINATE/{found=1}' your_file

Explanation:

  1. Although not best practice, you could rely on the fact that all variables defaults to 0 or the empty string if not defined. So the first expression (if(found) print) will not print anything to start off with.
  2. After the printing is done, we check if this is the starter-line (that should not be included).

This will print all lines after the TERMINATE-line.


Generalization:

  • You have a file with start- and end-lines and you want the lines between those lines excluding the start- and end-lines.
  • start- and end-lines could be defined by a regular expression matching the line.

Example:

$ cat ex_file.txt
not this line
second line
START
A good line to include
And this line
Yep
END
Nope more
...
never ever
$ awk '/END/{found=0} {if(found) print} /START/{found=1}' ex_file.txt
A good line to include
And this line
Yep
$

Explanation:

  1. If the end-line is found no printing should be done. Note that this check is done before the actual printing to exclude the end-line from the result.
  2. Print the current line if found is set.
  3. If the start-line is found then set found=1 so that the following lines are printed. Note that this check is done after the actual printing to exclude the start-line from the result.

Notes:

  • The code rely on the fact that all AWK variables defaults to 0 or the empty string if not defined. This is valid, but it may not be best practice so you could add a BEGIN{found=0} to the start of the AWK expression.

  • If multiple start-end-blocks are found, they are all printed.

Solution 5 - Bash

grep -A 10000000 'TERMINATE' file       

is much, much faster than sed, especially working on really a big file. It works up to 10M lines (or whatever you put in), so there isn't any harm in making this big enough to handle about anything you hit.

Solution 6 - Bash

Use Bash parameter expansion like the following:

content=$(cat file)
echo "${content#*TERMINATE}"

Solution 7 - Bash

There are many ways to do it with sed or awk:

sed -n '/TERMINATE/,$p' file

This looks for TERMINATE in your file and prints from that line up to the end of the file.

awk '/TERMINATE/,0' file

This is exactly the same behaviour as sed.

In case you know the number of the line from which you want to start printing, you can specify it together with NR (number of record, which eventually indicates the number of the line):

awk 'NR>=535' file
Example
$ seq 10 > a        #generate a file with one number per line, from 1 to 10
$ sed -n '/7/,$p' a
7
8
9
10
$ awk '/7/,0' a
7
8
9
10
$ awk 'NR>=7' a
7
8
9
10

Solution 8 - Bash

If for any reason, you want to avoid using sed, the following will print the line matching TERMINATE till the end of the file:

tail -n "+$(grep -n 'TERMINATE' file | head -n 1 | cut -d ":" -f 1)" file

And the following will print from the following line matching TERMINATE till the end of the file:

tail -n "+$(($(grep -n 'TERMINATE' file | head -n 1 | cut -d ":" -f 1)+1))" file

It takes two processes to do what sed can do in one process, and if the file changes between the execution of grep and tail, the result can be incoherent, so I recommend using sed. Moreover, if the file doesn’t not contain TERMINATE, the first command fails.

Solution 9 - Bash

Alternatives to the excellent sed answer by jfg956, and which don't include the matching line:

Solution 10 - Bash

This could be one way of doing it. If you know in what line of the file you have your grep word and how many lines you have in your file:

grep -A466 'TERMINATE' file

Solution 11 - Bash

sed is a much better tool for the job:

sed -n '/re/,$p' file

where re is a regular expression.

Another option is grep's --after-context flag. You need to pass in a number to end at, using wc on the file should give the right value to stop at. Combine this with -n and your match expression.

Solution 12 - Bash

This will print all lines from the last found line "TERMINATE" till the end of the file:

LINE_NUMBER=`grep -o -n TERMINATE $OSCAM_LOG | tail -n 1 | sed "s/:/ \\'/g" | awk -F" " '{print $1}'`
tail -n +$LINE_NUMBER $YOUR_FILE_NAME

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionYugal JindleView Question on Stackoverflow
Solution 1 - Bashjfg956View Answer on Stackoverflow
Solution 2 - BashaioobeView Answer on Stackoverflow
Solution 3 - BashJos De GraeveView Answer on Stackoverflow
Solution 4 - BashUlfRView Answer on Stackoverflow
Solution 5 - Bashuser8910163View Answer on Stackoverflow
Solution 6 - BashMu QiaoView Answer on Stackoverflow
Solution 7 - BashfedorquiView Answer on Stackoverflow
Solution 8 - Bashjfg956View Answer on Stackoverflow
Solution 9 - BashmivkView Answer on Stackoverflow
Solution 10 - BashMariahView Answer on Stackoverflow
Solution 11 - BashckwangView Answer on Stackoverflow
Solution 12 - BasheasyyuView Answer on Stackoverflow