How do I find and remove duplicate lines from a file using Regular Expressions?

Regex Problem Overview


This question is meant to be language agnostic. Using only Regular Expressions, can I find and replace duplicate lines in a file?

Please consider the following example input and the output that I want:

Input>>

11
22
22  <-duplicate
33
44
44  <-duplicate
55

Output>>

11
22
33
44
55

Regex Solutions


Solution 1 - Regex

Regular-expressions.info has a page on Deleting Duplicate Lines From a File.

This basically boils down to searching for this one-liner:

    ^(.*)(\r?\n\1)+$

... And replacing with \1.
Note: The dot must not match newlines, and ^ and $ must match at line breaks (multi-line mode).

Explanation:

> The caret will match only at the start of a line. So the regex engine will only attempt to match the remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The parentheses store the matched line into the first backreference.
>
> Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.
>
> Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference which holds the line we matched. The backreference will match that very same text.
>
> If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.
>
> The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
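As a rough illustration of the explanation above, here is a minimal Python sketch that applies the same pattern to the sample input from the question. The function name `remove_adjacent_duplicates` is made up for this example. It assumes Python's `re` module, where `re.MULTILINE` lets `^` and `$` match at line breaks and the dot does not match `\n` by default.

```python
import re

# The pattern from the answer. re.MULTILINE makes ^ and $ match at line
# boundaries; "." does not match "\n" unless re.DOTALL is set.
DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def remove_adjacent_duplicates(text: str) -> str:
    """Collapse each run of identical consecutive lines into a single line."""
    return DUPLICATE_LINES.sub(r"\1", text)


if __name__ == "__main__":
    sample = "11\n22\n22\n33\n44\n44\n55"
    print(remove_adjacent_duplicates(sample))
```

Only consecutive duplicates are removed; a line that reappears later, after other lines, is left alone.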

Solution 2 - Regex

See my request for more information; for now, I'm answering the easy way.

  1. If the order doesn't matter, a simple

    sort -u

will do the trick.

  2. If the order does matter but you don't mind running multiple passes (this is vim syntax), you can use:

    %s/\(.*\)\(\_.*\)\(\1\)/\2\1/g

to preserve the last occurrence, or

    %s/\(.*\)\(\_.*\)\(\1\)/\1\2/g

to preserve the first occurrence.
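The multi-pass idea can be sketched in Python as follows. This is not the vim command itself but a hypothetical line-anchored variant of the "preserve the first occurrence" substitution. The pattern, the constant `LATER_DUPLICATE`, and the function `remove_all_duplicates` are assumptions made for illustration. It assumes `\n` line endings and may backtrack heavily on large files.

```python
import re

# Hypothetical anchored variant of "keep the first occurrence":
# group 1 is one whole line, group 2 is the shortest run of whole lines
# that precedes a later identical line; that later line is dropped.
LATER_DUPLICATE = re.compile(r"^(.*)$((?:\n.*)*?)\n\1$", re.MULTILINE)


def remove_all_duplicates(text: str) -> str:
    """Re-run the substitution until a pass no longer changes the text."""
    while True:
        reduced = LATER_DUPLICATE.sub(r"\1\2", text)
        if reduced == text:
            return text
        text = reduced
```

Each pass removes at most one later copy per match, which is why the substitution is repeated until the text stops changing.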

If you do mind running multiple passes, then it's more difficult, so before we work on that, please say so in the question!

EDIT: Your edit wasn't very clear, but it looks like you just want single-pass removal of ADJACENT duplicate lines! Well, that's much easier!

A simple:

    /(.*)\1*/\1/

(/\(.*\)\1*/\1/ in vim), i.e. searching for (.*)\1* and replacing it with just \1, will do the trick.

Solution 3 - Regex

In RegexBuddy you can do this as follows:

  1. On the Library tab, load the RegexBuddy.rbl library if not loaded by default.
  2. In the lookup box, type "duplicate".
  3. Click the Use button to load the "delete duplicate lines" regex.
  4. On the GREP tab, specify the folder and file mask of the files you want to delete duplicates from.
  5. In the drop-down menu of the GREP button, select Execute.

If you're only doing this on one file, you can use the Test tab instead of the GREP tab. Load the file on the Test tab, and then click the Replace button in the main toolbar.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stackoverflow |
| --- | --- | --- |
| Question | ebattulga | View Question on Stackoverflow |
| Solution 1 - Regex | Ben James | View Answer on Stackoverflow |
| Solution 2 - Regex | Davide | View Answer on Stackoverflow |
| Solution 3 - Regex | Jan Goyvaerts | View Answer on Stackoverflow |