Replace all whitespace with a line break/paragraph mark to make a word list

RegexSed

Regex Problem Overview


I am trying to vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.

Regex Solutions


Solution 1 - Regex

For reasonably modern versions of sed, edit the standard input to yield the standard output with

$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος

If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with

sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab

What it means:

  • The character class [[:blank:]] matches either a single space character or a single tab character.
    • Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
    • The + quantifier means match one or more of the previous pattern.
    • So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
  • The \n in the replacement is the newline that you want.
  • The /g modifier on the end means perform the substitution as many times as possible rather than just once.
  • The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
Perl Compatible Regexes

For those familiar with Perl-compatible regexes and a PCRE-capable sed, use \s+ to match runs of at least one whitespace character, as in

sed -E -e 's/\s+/\n/g' old > new

or

sed -e 's/\s\+/\n/g' old > new

These commands read input from the file old and write the result to a file named new in the current directory.

Maximum portability, maximum cruftiness

Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.

$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος

Notes:

  • Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*).
  • Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
    • The \ and the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
      • Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
    • This error prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
Note on backslashes and quoting

The commands above all used single quotes ('') rather than double quotes (""). Consider:

$ echo '\\\\' "\\\\"
\\\\ \\

That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.

Solution 2 - Regex

The portable way to do this is:

sed -e 's/[ \t][ \t]*/\
/g'

That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (in sed scripts the commands are normally terminated by newlines)

With GNU sed you can use \n in the substitution, and \s in the regex:

sed -e 's/\s\s*/\n/g'

GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:

sed -r -e 's/\s+/\n/g'

If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.

Solution 3 - Regex

All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.

However, Perl's regex works the same on any machine with Perl installed:

perl -pe 's/\s+/\n/g' file.txt

If you want to save the output:

perl -pe 's/\s+/\n/g' file.txt > newfile.txt

If you want only unique occurrences of words:

perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt

Solution 4 - Regex

  1. option 1

    echo $(cat testfile)
    
  2. Option 2

    tr ' ' '\n' < testfile
    

Solution 5 - Regex

This should do the work:

sed -e 's/[ \t]+/\n/g'

[ \t] means a space OR an tab. If you want any kind of space, you could also use \s.

[ \t]+ means as many spaces OR tabs as you want (but at least one)

s/x/y/ means replace the pattern x by y (here \n is a new line)

The g at the end means that you have to repeat as many times it occurs in every line.

Solution 6 - Regex

You could use POSIX [[:blank:]] to match a horizontal white-space character.

sed 's/[[:blank:]]\+/\n/g' file

or you may use [[:space:]] instead of [[:blank:]] also.

Example:

$ echo 'this  is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence

Solution 7 - Regex

You can also do it with xargs:

cat old | xargs -n1 > new

or

xargs -n1 < old > new

Solution 8 - Regex

Using gawk:

gawk '{$1=$1}1' OFS="\n" file

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionDavidRView Question on Stackoverflow
Solution 1 - RegexGreg BaconView Answer on Stackoverflow
Solution 2 - RegexLaurence GonsalvesView Answer on Stackoverflow
Solution 3 - RegexEarl RubyView Answer on Stackoverflow
Solution 4 - RegexRAJView Answer on Stackoverflow
Solution 5 - RegexTristram GräbenerView Answer on Stackoverflow
Solution 6 - RegexAvinash RajView Answer on Stackoverflow
Solution 7 - RegexFranMowinckelView Answer on Stackoverflow
Solution 8 - Regexghostdog74View Answer on Stackoverflow