Deleting lines from one file which are in another file

Bash, Scripting, Sh

Bash Problem Overview


I have a file f1:

line1
line2
line3
line4
..
..

I want to delete all the lines which are in another file f2:

line2
line8
..
..

I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?

Bash Solutions


Solution 1 - Bash

grep -v -x -f f2 f1 should do the trick.

Explanation:

  • -v to select non-matching lines
  • -x to match whole lines only
  • -f f2 to get patterns from f2

One can instead use grep -F (or the older fgrep) to match fixed strings from f2 rather than patterns, in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns.
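The difference between pattern matching and fixed-string matching shows up as soon as f2 contains regex metacharacters. A minimal sketch, assuming made-up sample contents (the lines a.c and abc are illustrative, not from the question) and a throwaway temporary directory:

```shell
# Work in a throwaway directory so the demo leaves no files behind.
tmpdir=$(mktemp -d)

# f1 is the file to filter; f2 holds the lines to delete (hypothetical data).
printf '%s\n' 'a.c' 'abc' 'line3' > "$tmpdir/f1"
printf '%s\n' 'a.c' > "$tmpdir/f2"

# Regex matching: the "." in "a.c" matches any character, so "abc" goes too.
regex_result=$(grep -v -x -f "$tmpdir/f2" "$tmpdir/f1")

# Fixed-string matching: only the literal line "a.c" is removed.
fixed_result=$(grep -v -x -F -f "$tmpdir/f2" "$tmpdir/f1")

printf 'regex:\n%s\n' "$regex_result"
printf 'fixed:\n%s\n' "$fixed_result"

rm -rf "$tmpdir"
```

With this data the regex variant keeps only line3, while the -F variant keeps abc and line3.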

Solution 2 - Bash

Try comm instead (assuming f1 and f2 are already sorted):

comm -2 -3 f1 f2
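If the files are not sorted yet, one option is to sort copies first. The sketch below assumes small sample files in a temporary directory (the contents are illustrative) and uses plain temporary files so it does not rely on Bash-only syntax:

```shell
tmpdir=$(mktemp -d)

# Unsorted sample input (hypothetical data modelled on the question).
printf '%s\n' line3 line1 line4 line2 > "$tmpdir/f1"
printf '%s\n' line8 line2 > "$tmpdir/f2"

# comm needs both inputs sorted the same way, so sort into copies first.
sort "$tmpdir/f1" > "$tmpdir/f1.sorted"
sort "$tmpdir/f2" > "$tmpdir/f2.sorted"

# -2 suppresses lines found only in f2, -3 suppresses lines found in both,
# which leaves the lines found only in f1.
comm_result=$(comm -2 -3 "$tmpdir/f1.sorted" "$tmpdir/f2.sorted")
printf '%s\n' "$comm_result"

rm -rf "$tmpdir"
```

Note that the result comes out in sorted order, not in the original order of f1.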

Solution 3 - Bash

For exclude files that aren't too huge, you can use AWK's associative arrays.

awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt 

The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.

The algorithmic complexity will probably be O(n + m), where n is the size of exclude-these.txt and m is the size of from-this.txt.
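To make the effect of tolower() concrete, here is a small sketch that runs the same idea with and without it. The file contents are invented for illustration, and the `in` test is used instead of reading the array value so that lookups do not create new entries:

```shell
tmpdir=$(mktemp -d)

# Hypothetical sample data with mixed case.
printf '%s\n' Line2 line8 > "$tmpdir/exclude-these.txt"
printf '%s\n' line3 line2 LINE8 line1 > "$tmpdir/from-this.txt"

# Case-insensitive: both files are lowercased before comparing.
insensitive=$(awk 'NR == FNR { list[tolower($0)] = 1; next }
                   !(tolower($0) in list)' \
              "$tmpdir/exclude-these.txt" "$tmpdir/from-this.txt")

# Case-sensitive: lines are compared exactly as written.
sensitive=$(awk 'NR == FNR { list[$0] = 1; next }
                 !($0 in list)' \
            "$tmpdir/exclude-these.txt" "$tmpdir/from-this.txt")

printf 'insensitive:\n%s\n' "$insensitive"
printf 'sensitive:\n%s\n' "$sensitive"

rm -rf "$tmpdir"
```

With this data the case-insensitive run keeps line3 and line1, while the case-sensitive run keeps all four lines because neither Line2 nor line8 matches exactly.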

Solution 4 - Bash

Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):

awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt

Accessing r[$0] creates the entry for that line, no need to set a value.

Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
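One practical difference between the two awk variants appears when the exclude file is empty: NR == FNR then stays true while the second file is read, so nothing is printed, whereas the explicit f=1 / f=2 assignments still work. A sketch under that assumption, with invented file contents:

```shell
tmpdir=$(mktemp -d)

: > "$tmpdir/exclude-these.txt"                      # empty exclude file
printf '%s\n' line1 line2 > "$tmpdir/from-this.txt"  # hypothetical data

# NR == FNR trick: the empty first file contributes no records, so the
# condition holds for every line of the second file and nothing is printed.
nr_fnr=$(awk 'NR == FNR { r[$0]; next } !($0 in r)' \
         "$tmpdir/exclude-these.txt" "$tmpdir/from-this.txt")

# Explicit file number: f is set from the command line before each file is
# read, so the second file is always treated as the one to filter.
explicit=$(awk '{ if (f == 1) { r[$0] } else if (!($0 in r)) { print $0 } }' \
           f=1 "$tmpdir/exclude-these.txt" f=2 "$tmpdir/from-this.txt")

printf 'NR == FNR output: [%s]\n' "$nr_fnr"
printf 'explicit output:\n%s\n' "$explicit"

rm -rf "$tmpdir"
```

Here the NR == FNR variant prints nothing, and the explicit variant prints line1 and line2 unchanged.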

Solution 5 - Bash

If you have Ruby (1.9+):

#!/usr/bin/env ruby 
b=File.read("file2").split
open("file1").each do |x|
  x.chomp!
  puts x if !b.include?(x)
end

This has O(N^2) complexity. If you care about performance, here's another version:

b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}

This uses a hash to effect the subtraction, so the complexity is O(n) (size of a) + O(n) (size of b).

Here's a little benchmark of the above, courtesy of user576875, but with 100K lines:

$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test

real    0m0.639s
user    0m0.554s
sys     0m0.021s

$ time sort file1 file2 | uniq -u > sort.test

real    0m2.311s
user    0m1.959s
sys     0m0.040s

$ diff <(sort -n ruby.test) <(sort -n sort.test)
$

diff was used to show there are no differences between the 2 files generated.

Solution 6 - Bash

Some timing comparisons between various other answers:

$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null

real    0m0.019s
user    0m0.023s
sys     0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null

real    0m0.026s
user    0m0.018s
sys     0m0.007s
$ time grep -xvf f2 f1 > /dev/null

real    0m43.197s
user    0m43.155s
sys     0m0.040s

sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file.
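A small sketch of that pitfall, assuming invented sample data in which f1 contains a duplicated line and f2 contains a line that f1 lacks:

```shell
tmpdir=$(mktemp -d)

# Hypothetical data: "b" appears twice in f1, "d" appears only in f2.
printf '%s\n' a b b c > "$tmpdir/f1"
printf '%s\n' c d > "$tmpdir/f2"

# uniq -u keeps only lines that occur exactly once in the combined input,
# so the duplicated "b" disappears and the f2-only "d" sneaks in.
uniq_result=$(sort "$tmpdir/f1" "$tmpdir/f2" | uniq -u)

# comm -23 on sorted copies keeps the lines that are only in f1,
# including the duplicate.
sort "$tmpdir/f1" > "$tmpdir/f1.sorted"
sort "$tmpdir/f2" > "$tmpdir/f2.sorted"
comm_result=$(comm -23 "$tmpdir/f1.sorted" "$tmpdir/f2.sorted")

printf 'sort | uniq -u:\n%s\n' "$uniq_result"
printf 'comm -23:\n%s\n' "$comm_result"

rm -rf "$tmpdir"
```

The uniq -u pipeline yields a and d, while comm -23 yields a, b and b, which is what removing the lines of f2 from f1 should give.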

comm can also be used with stdin and here strings:

echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a

Solution 7 - Bash

Seems to be a job suitable for the SQLite shell:

create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify “ .separator ××any_improbable_string×× ”
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
-- keep the lines of file1 that do not appear in file2
select * from file1 where line not in (select line from file2);
.q

Solution 8 - Bash

Did you try this with sed?

sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh

sed -i 's#$#%%g'"'"' f1#g' f2.sh

sed -i '1i#!/bin/bash' f2.sh

sh f2.sh

Solution 9 - Bash

Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.

Obviously won't work for huge files but it did the trick for me. A few notes:

  • I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
  • The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data

Solution 10 - Bash

A Python way of filtering one list using another list.

Load files:

>>> f1 = open('f1').readlines()
>>> f2 = open('f2.txt').readlines()

Remove '\n' string at the end of each line:

>>> f1 = [i.replace('\n', '') for i in f1]
>>> f2 = [i.replace('\n', '') for i in f2]

Keep only the f1 lines that do not contain any f2 line (note that `b not in a` is a substring test, not a whole-line comparison):

>>> [a for a in f1 if all(b not in a for b in f2)]

Solution 11 - Bash

$ cat values.txt
apple
banana
car
taxi

$ cat source.txt
fruits
mango
king
queen
number
23
43
sentence is long
so what
...
...

I made a small shell script to "weed" out the values in the source file which are present in the values.txt file.

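$ cat weed_out.sh
#!/bin/sh
# Usage: ./weed_out.sh source.txt (writes the result to source.txt.final)
from=$1
cp -p "$from" "$from.final"
# Read values.txt line by line so that values containing spaces stay intact
while IFS= read -r x
do
  # Drop every line that matches the value, which grep treats as a regex
  grep -v -e "$x" "$from.final" > "$from.final.tmp"
  mv "$from.final.tmp" "$from.final"
done < values.txt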

Executing:

$ ./weed_out.sh source.txt

and you get a nicely cleaned-up file, source.txt.final.

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | lalli | View Question on Stack Overflow
Solution 1 - Bash | gabuzo | View Answer on Stack Overflow
Solution 2 - Bash | Ignacio Vazquez-Abrams | View Answer on Stack Overflow
Solution 3 - Bash | Dennis Williamson | View Answer on Stack Overflow
Solution 4 - Bash | jcsahnwaldt Reinstate Monica | View Answer on Stack Overflow
Solution 5 - Bash | kurumi | View Answer on Stack Overflow
Solution 6 - Bash | Lri | View Answer on Stack Overflow
Solution 7 - Bash | Benoit | View Answer on Stack Overflow
Solution 8 - Bash | Ruan | View Answer on Stack Overflow
Solution 9 - Bash | youngrrrr | View Answer on Stack Overflow
Solution 10 - Bash | KSVelArc | View Answer on Stack Overflow
Solution 11 - Bash | rajeev | View Answer on Stack Overflow