Substitute multiple whitespace with single whitespace in Python
Python Problem Overview
I have this string:
mystring = 'Here is  some   text   I      wrote   '
How can I substitute the double, triple (...) whitespace characters with a single space, so that I get:
mystring = 'Here is some text I wrote'
Python Solutions
Solution 1 - Python
A simple possibility (if you'd rather avoid REs) is
' '.join(mystring.split())
The split and join perform the task you're explicitly asking about -- plus, they also do the extra one that you don't talk about but is seen in your example, removing trailing spaces;-).
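As a quick check of that behaviour, here is a minimal sketch. The helper name `collapse_whitespace` is only illustrative and is not part of the original answer. Called with no arguments, `str.split()` splits on any run of whitespace (tabs and newlines included) and drops the empty pieces at both ends, which is why leading and trailing whitespace disappears as well.

```python
def collapse_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    # split() with no argument splits on any whitespace run and
    # discards empty strings, so the ends are trimmed for free.
    return " ".join(text.split())


print(collapse_whitespace("Here is  some   text\tI \n wrote   "))
```

If tabs or newlines should be preserved rather than turned into spaces, this approach is too aggressive and one of the other solutions may fit better.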
Solution 2 - Python
A regular expression can be used to offer more control over the whitespace characters that are combined.
To match Unicode whitespace:
import re
_RE_COMBINE_WHITESPACE = re.compile(r"\s+")
my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str).strip()
To match ASCII whitespace only:
import re
_RE_COMBINE_WHITESPACE = re.compile(r"(?a:\s+)")
_RE_STRIP_WHITESPACE = re.compile(r"(?a:^\s+|\s+$)")
my_str = _RE_COMBINE_WHITESPACE.sub(" ", my_str)
my_str = _RE_STRIP_WHITESPACE.sub("", my_str)
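Matching only ASCII whitespace is sometimes essential for keeping control characters such as \x1c, \x1d, \x1e and \x1f, which count as whitespace in Unicode mode but not in ASCII mode. Note that \x0b (\v) and \x0c (\f) are part of the ASCII set [ \t\n\r\f\v], so they are still matched.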
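To make the difference between the two modes concrete, here is a small sketch. The names `UNICODE_WS`, `ASCII_WS` and `sample` are made up for this illustration, and `re.ASCII` is passed as a flag, which should have the same effect on `\s` as the inline `(?a:...)` form used above.

```python
import re

# Hypothetical names for this sketch; the patterns mirror the ones above.
UNICODE_WS = re.compile(r"\s+")
ASCII_WS = re.compile(r"\s+", re.ASCII)

# Two non-breaking spaces, two \x1c control characters, two ordinary spaces.
sample = "a\u00a0\u00a0b\x1c\x1cc  d"

unicode_result = UNICODE_WS.sub(" ", sample)
ascii_result = ASCII_WS.sub(" ", sample)

print(repr(unicode_result))  # every run is collapsed to one space
print(repr(ascii_result))    # only the run of ordinary spaces is collapsed
```

In ASCII mode the non-breaking spaces and the `\x1c` characters pass through untouched, which is the behaviour you want when those characters carry meaning in your data.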
Reference:
About \s:
> For Unicode (str) patterns:
> Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages).
> If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
About re.ASCII:
> Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
strip() will remove any leading and trailing whitespace.
Solution 3 - Python
For completeness, you can also use:
# The while loop alone would leave a single leading/trailing space,
# so strip the ends before (or after) the loop.
mystring = mystring.strip()
while '  ' in mystring:  # repeat while any double space remains
    mystring = mystring.replace('  ', ' ')
which will work quickly on strings with relatively few spaces (faster than re in these situations).
In any scenario, Alex Martelli's split/join solution (Solution 1) performs at least as quickly (usually significantly faster).
In your example, using the default values of timeit.Timer.repeat(), I get the following times:
str.replace: [1.4317800167340238, 1.4174888149192384, 1.4163512401715934]
re.sub: [3.741931446594549, 3.8389395858970374, 3.973777672860706]
split/join: [0.6530919432498195, 0.6252146571700905, 0.6346594329726258]
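If you want to reproduce a comparison like this yourself, the following is a rough sketch rather than the exact script used for the numbers above. The function names, the spacing in the sample string, and the `repeat`/`number` values are all assumptions chosen for illustration, so absolute times will differ from machine to machine.

```python
import re
import timeit

# Assumed input; the exact spacing is illustrative.
sample = "Here is  some   text   I      wrote   "
_ws = re.compile(r"\s+")


def with_replace(s):
    # Strip first, then halve runs of spaces until none remain.
    s = s.strip()
    while "  " in s:
        s = s.replace("  ", " ")
    return s


def with_re(s):
    return _ws.sub(" ", s).strip()


def with_split_join(s):
    return " ".join(s.split())


if __name__ == "__main__":
    for func in (with_replace, with_re, with_split_join):
        # repeat=3 and number=100_000 are arbitrary choices for this sketch.
        times = timeit.repeat(lambda: func(sample), repeat=3, number=100_000)
        print(func.__name__, times)
```

All three functions return the same string for this input, so the timings compare like with like. Keep in mind that the `str.replace` loop only handles ordinary spaces, whereas the other two also collapse tabs and newlines.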
EDIT:
Just came across this post which provides a rather long comparison of the speeds of these methods.