Neatest way to remove linebreaks in Perl

PerlLine Breaks

Perl Problem Overview


I'm maintaining a script that can get its input from various sources, and works on it per line. Depending on the actual source used, linebreaks might be Unix-style, Windows-style or even, for some aggregated input, mixed(!).

When reading from a file it goes something like this:

@lines = <IN>;
process(\@lines);

...

sub process {
    @lines = shift;
    foreach my $line (@{$lines}) {
        chomp $line;
        #Handle line by line
    }
}

So, what I need to do is replace the chomp with something that removes either Unix-style or Windows-style linebreaks. I'm coming up with way too many ways of solving this, one of the usual drawbacks of Perl :)

What's your opinion on the neatest way to chomp off generic linebreaks? What would be the most efficient?

Edit: A small clarification - the method 'process' gets a list of lines from somewhere, not nessecarily read from a file. Each line might have

  • No trailing linebreaks
  • Unix-style linebreaks
  • Windows-style linebreaks
  • Just Carriage-Return (when original data has Windows-style linebreaks and is read with $/ = '\n')
  • An aggregated set where lines have different styles

Perl Solutions


Solution 1 - Perl

After digging a bit through the perlre docs a bit, I'll present my best suggestion so far that seems to work pretty good. Perl 5.10 added the \R character class as a generalized linebreak:

$line =~ s/\R//g;

It's the same as:

(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])

I'll keep this question open a while yet, just to see if there's more nifty ways waiting to be suggested.

Solution 2 - Perl

Whenever I go through input and want to remove or replace characters I run it through little subroutines like this one.

sub clean {

    my $text = shift;

	$text =~ s/\n//g;
	$text =~ s/\r//g;

	return $text;
}

It may not be fancy but this method has been working flawless for me for years.

Solution 3 - Perl

$line =~ s/[\r\n]+//g;

Solution 4 - Perl

Reading perlport I'd suggest something like

$line =~ s/\015?\012?$//;

to be safe for whatever platform you're on and whatever linefeed style you may be processing because what's in \r and \n may differ through different Perl flavours.

Solution 5 - Perl

Note from 2017: File::Slurp is not recommended due to design mistakes and unmaintained errors. Use File::Slurper or Path::Tiny instead.

extending on your answer

use File::Slurp ();
my $value = File::Slurp::slurp($filename);
$value =~ s/\R*//g;

File::Slurp abstracts away the File IO stuff and just returns a string for you.

NOTE

  1. Important to note the addition of /g , without it, given a multi-line string, it will only replace the first offending character.

  2. Also, the removal of $, which is redundant for this purpose, as we want to strip all line breaks, not just line-breaks before whatever is meant by $ on this OS.

  3. In a multi-line string, $ matches the end of the string and that would be problematic ).

  4. Point 3 means that point 2 is made with the assumption that you'd also want to use /m otherwise '$' would be basically meaningless for anything practical in a string with >1 lines, or, doing single line processing, an OS which actually understands $ and manages to find the \R* that proceed the $

Examples

while( my $line = <$foo> ){
      $line =~ $regex;
}

Given the above notation, an OS which does not understand whatever your files '\n' or '\r' delimiters, in the default scenario with the OS's default delimiter set for $/ will result in reading your whole file as one contiguous string ( unless your string has the $OS's delimiters in it, where it will delimit by that )

So in this case all of these regex are useless:

  • /\R*$// : Will only erase the last sequence of \R in the file

  • /\R*// : Will only erase the first sequence of \R in the file

  • /\012?\015?// : When will only erase the first 012\015 , \012 , or \015 sequence, \015\012 will result in either \012 or \015 being emitted.

  • /\R*$// : If there happens to be no byte sequences of '\015$OSDELIMITER' in the file, then then NO linebreaks will be removed except for the OS's own ones.

It would appear nobody gets what I'm talking about, so here is example code, that is tested to NOT remove line feeds. Run it, you'll see that it leaves the linefeeds in.

#!/usr/bin/perl 

use strict;
use warnings;

my $fn = 'TestFile.txt';

my $LF = "\012";
my $CR = "\015";

my $UnixNL = $LF;
my $DOSNL  = $CR . $LF;
my $MacNL  = $CR;

sub generate { 
    my $filename = shift;
    my $lineDelimiter = shift;

    open my $fh, '>', $filename;
    for ( 0 .. 10 )
    {
        print $fh "{0}";
        print $fh join "", map { chr( int( rand(26) + 60 ) ) } 0 .. 20;
        print $fh "{1}";
        print $fh $lineDelimiter->();
        print $fh "{2}";
    }
    close $fh;
}

sub parse { 
    my $filename = shift;
    my $osDelimiter = shift;
    my $message = shift;
    print "Parsing $message File $filename : \n";

    local $/ = $osDelimiter;

    open my $fh, '<', $filename;
    while ( my $line = <$fh> )
    {

        $line =~ s/\R*$//;
        print ">|" . $line . "|<";

    }
    print "Done.\n\n";
}


my @all = ( $DOSNL,$MacNL,$UnixNL);
generate 'Windows.txt' , sub { $DOSNL }; 
generate 'Mac.txt' , sub { $MacNL };
generate 'Unix.txt', sub { $UnixNL };
generate 'Mixed.txt', sub {
    return @all[ int(rand(2)) ];
};


for my $os ( ["$MacNL", "On Mac"], ["$DOSNL", "On Windows"], ["$UnixNL", "On Unix"]){
    for ( qw( Windows Mac Unix Mixed ) ){
        parse $_ . ".txt", @{ $os };
    }
}

For the CLEARLY Unprocessed output, see here: http://pastebin.com/f2c063d74

Note there are certain combinations that of course work, but they are likely the ones you yourself naívely tested.

Note that in this output, all results must be of the form >|$string|<>|$string|< with NO LINE FEEDS to be considered valid output.

and $string is of the general form {0}$data{1}$delimiter{2} where in all output sources, there should be either :

  1. Nothing between {1} and {2}
  2. only |<>| between {1} and {2}

Solution 6 - Perl

In your example, you can just go:

chomp(@lines);

Or:

$_=join("", @lines);
s/[\r\n]+//g;

Or:

@lines = split /[\r\n]+/, join("", @lines);

Using these directly on a file:

perl -e '$_=join("",<>); s/[\r\n]+//g; print' <a.txt |less

perl -e 'chomp(@a=<>);print @a' <a.txt |less

Solution 7 - Perl

To extend Ted Cambron's answer above and something that hasn't been addressed here: If you remove all line breaks indiscriminately from a chunk of entered text, you will end up with paragraphs running into each other without spaces when you output that text later. This is what I use:

sub cleanLines{

    my $text = shift;

    $text =~ s/\r/ /; #replace \r with space
    $text =~ s/\n/ /; #replace \n with space
    $text =~ s/  / /g; #replace double-spaces with single space

    return $text;
}

The last substitution uses the g 'greedy' modifier so it continues to find double-spaces until it replaces them all. (Effectively substituting anything more that single space)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionChristofferView Question on Stackoverflow
Solution 1 - PerlChristofferView Answer on Stackoverflow
Solution 2 - PerlTed CambronView Answer on Stackoverflow
Solution 3 - PerldsmView Answer on Stackoverflow
Solution 4 - PerlOlfanView Answer on Stackoverflow
Solution 5 - PerlKent FredricView Answer on Stackoverflow
Solution 6 - PerlCurtis YallopView Answer on Stackoverflow
Solution 7 - PerlColin R. TurnerView Answer on Stackoverflow