Remove all special characters, punctuation and spaces from string

Python Problem Overview

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.

Python Solutions

Solution 1 - Python

This can be done without regex:

>>> string = "Special $#! characters   spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'

You can use str.isalnum:

> > S.isalnum() -> bool >
> Return True if all characters in S are alphanumeric > and there is at least one character in S, False otherwise.

If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.

Solution 2 - Python

Here is a regex to match a string of characters that are not a letters or numbers:

[^A-Za-z0-9]+

Here is the Python command to do a regex substitution:

re.sub('[^A-Za-z0-9]+', '', mystring)

Solution 3 - Python

Shorter way :

import re
cleanString = re.sub('\W+','', string )

If you want spaces between words and numbers substitute '' with ' '

Solution 4 - Python

TLDR

I timed the provided answers.

import re
re.sub('\W+','', string)

is typically 3x faster than the next fastest provided top answer.

Caution should be taken when using this option. Some special characters (e.g. ø) may not be striped using this method.

After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'

Example 1

'.join(e for e in string if e.isalnum())

string1 - Result: 10.7061979771
string2 - Result: 7.78372597694

Example 2

import re
re.sub('[^A-Za-z0-9]+', '', string)

string1 - Result: 7.10785102844
string2 - Result: 4.12814903259

Example 3

import re
re.sub('\W+','', string)

string1 - Result: 3.11899876595
string2 - Result: 2.78014397621

The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)

Example 3 can be 3x faster than Example 1.

Solution 5 - Python

#!/usr/bin/python
import re

strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print nestr

you can add more special character and that will be replaced by '' means nothing i.e they will be removed.

Solution 6 - Python

###Python 2.*

I think just filter(str.isalnum, string) works

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

###Python 3.*

In Python3, filter( ) function would return an itertable object (instead of string unlike in above). One has to join back to get a string from itertable:

''.join(filter(str.isalnum, string))

or to pass list in join use (not sure but can be fast a bit)

''.join([*filter(str.isalnum, string)])

note: unpacking in [*args] valid from Python >= 3.5

Solution 7 - Python

Differently than everyone else did using regex, I would try to exclude every character that is not what I want, instead of enumerating explicitly what I don't want.

For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".

In fact, if you insert the special character ^ at the first place of your regex, you will get the negation.

Extra tip: if you also need to lowercase the result, you can make the regex even faster and easier, as long as you won't find any uppercase now.

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

Solution 8 - Python

string.punctuation contains following characters:

> '!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'

You can use translate and maketrans functions to map punctuations to empty values (replace)

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

Output:

'This is A test'

Solution 9 - Python

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)

Solution 10 - Python

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

Solution 11 - Python

The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:

import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien

PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))

def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')

Look at the given URL above for all related categories. You also can of course filter by the punctuation categories.

Solution 12 - Python

For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü, ä, ö) simply add these to the regex search string:

Example for German:

re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)

Solution 13 - Python

This will remove all special characters, punctuation, and spaces from a string and only have numbers and letters.

import re

sample_str = "Hel&&lo %% Wo$#rl@d"

# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))


# using regex
op2 = re.sub("[^A-Za-z]", "", sample_str)
print(f"op2 = ", op2)


special_char_list = ["$", "@", "#", "&", "%"]

# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print(f"op1 = ", op1)


# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print(f"op3 = ", op3)

Solution 14 - Python

Use translate:

import string

def clean(instr):
    return instr.translate(None, string.punctuation + ' ')

Caveat: Only works on ascii strings.

Solution 15 - Python

import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the

same as double quotes."""

# if we need to count the word python that ends with or without ',' or '.' at end

count = 0
for i in text:
    if i.endswith("."):
        text[count] = re.sub("^([a-z]+)(.)?$", r"\1", i)
    count += 1
print("The count of Python : ", text.count("python"))

Solution 16 - Python

This will remove all non-alphanumeric characters except spaces.

string = "Special $#! characters   spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))

>Special characters spaces 888323

Solution 17 - Python

After 10 Years, below I wrote there is the best solution. You can remove/clean all special characters, punctuation, ASCII characters and spaces from the string.

from clean_text import clean

string = 'Special $#! characters   spaces 888323'
new = clean(string,lower=False,no_currency_symbols=True, no_punct = True,replace_with_currency_symbol='')
print(new)
Output ==> 'Special characters spaces 888323'
you can replace space if you want.
update = new.replace(' ','')
print(update)
Output ==> 'Specialcharactersspaces888323'

Solution 18 - Python

function regexFuntion(st) {
  const regx = /[^\w\s]/gi; // allow : [a-zA-Z0-9, space]
  st = st.replace(regx, ''); // remove all data without [a-zA-Z0-9, space]
  st = st.replace(/\s\s+/g, ' '); // remove multiple space

  return st;
}

console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67

Solution 19 - Python

import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)

and you shall see your result as

'askhnlaskdjalsdk

Content Type	Original Author	Original Content on Stackoverflow
Question	user664546	View Question on Stackoverflow
Solution 1 - Python	user225312	View Answer on Stackoverflow
Solution 2 - Python	Andy White	View Answer on Stackoverflow
Solution 3 - Python	tuxErrante	View Answer on Stackoverflow
Solution 4 - Python	mbeacom	View Answer on Stackoverflow
Solution 5 - Python	pkm	View Answer on Stackoverflow
Solution 6 - Python	Grijesh Chauhan	View Answer on Stackoverflow
Solution 7 - Python	Andrea	View Answer on Stackoverflow
Solution 8 - Python	Vlad Bezden	View Answer on Stackoverflow
Solution 9 - Python	sneha	View Answer on Stackoverflow
Solution 10 - Python	John Machin	View Answer on Stackoverflow
Solution 11 - Python	Andreas Jung	View Answer on Stackoverflow
Solution 12 - Python	petezurich	View Answer on Stackoverflow
Solution 13 - Python	Shubham Shewdikar	View Answer on Stackoverflow
Solution 14 - Python	jjmurre	View Answer on Stackoverflow
Solution 15 - Python	Vinay Kumar Kuresi	View Answer on Stackoverflow
Solution 16 - Python	Gobinath Viswanathan	View Answer on Stackoverflow
Solution 17 - Python	Viren Ramani	View Answer on Stackoverflow
Solution 18 - Python	Art Bindu	View Answer on Stackoverflow
Solution 19 - Python	Dsw Wds	View Answer on Stackoverflow