Best way to strip punctuation from a string

PythonStringPunctuation

Python Problem Overview


It seems like there should be a simpler way than:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?

Python Solutions


Solution 1 - Python

From an efficiency perspective, you're not going to beat

s.translate(None, string.punctuation)

For higher versions of Python use the following code:

s.translate(str.maketrans('', '', string.punctuation))

It's performing raw string operations in C with a lookup table - there's not much that will beat that but writing your own C code.

If speed isn't a worry, another option though is:

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

This is faster than s.replace with each char, but won't perform as well as non-pure python approaches such as regexes or string.translate, as you can see from the below timings. For this type of problem, doing it at as low a level as possible pays off.

Timing code:

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802

Solution 2 - Python

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

Solution 3 - Python

For the convenience of usage, I sum up the note of striping punctuation from a string in both Python 2 and Python 3. Please refer to other answers for the detailed description.


Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)                          # Output: string without punctuation

Solution 4 - Python

myString.translate(None, string.punctuation)

Solution 5 - Python

I usually use something like this:

>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'

Solution 6 - Python

string.punctuation is ASCII only! A more correct (but also much slower) way is to use the unicodedata module:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s

You can generalize and strip other types of characters as well:

''.join(ch for ch in s if category(ch)[0] not in 'SP')

It will also strip characters like ~*+§$ which may or may not be "punctuation" depending on one's point of view.

Solution 7 - Python

Not necessarily simpler, but a different way, if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

Solution 8 - Python

For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; codepoints (integers) are looked up in that mapping and anything mapped to None is removed.

To remove (some?) punctuation then, use:

import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping, setting all values to None based on the sequence of keys.

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger; see J.F. Sebastian's answer (Python 3 version):

import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))

Solution 9 - Python

string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

import regex
s = u"string. With. Some・Really Weird、Non?ASCII。 「(Punctuation)」?"
remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()

Personally, I believe this is the best way to remove punctuation from a string in Python because:

  • It removes all Unicode punctuation
  • It's easily modifiable, e.g. you can remove the \{S} if you want to remove punctuation, but keep symbols like $.
  • You can get really specific about what you want to keep and what you want to remove, for example \{Pd} will only remove dashes.
  • This regex also normalizes whitespace. It maps tabs, carriage returns, and other oddities to nice, single spaces.

This uses Unicode character properties, which you can read more about on Wikipedia.

Solution 10 - Python

I haven't seen this answer yet. Just use a regex; it removes all characters besides word characters (\w) and number characters (\d), followed by a whitespace character (\s):

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(ur'[^\w\d\s]+', '', s)

Solution 11 - Python

Here's a one-liner for Python 3.5:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

Solution 12 - Python

This might not be the best solution however this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

Solution 13 - Python

Here is a function I wrote. It's not very efficient, but it is simple and you can add or remove any punctuation that you desire:

def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        for word in wordList:
            wordList=[word.replace(punc,'') for word in wordList]
    return wordList

Solution 14 - Python

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)

Solution 15 - Python

A one-liner might be helpful in not very strict cases:

''.join([c for c in s if c.isalnum() or c.isspace()])

Solution 16 - Python

Just as an update, I rewrote the @Brian example in Python 3 and made changes to it to move regex compile step inside of the function. My thought here was to time every single step needed to make the function work. Perhaps you are using distributed computing and can't have regex object shared between your workers and need to have re.compile step at each worker. Also, I was curious to time two different implementations of maketrans for Python 3

table = str.maketrans({key: None for key in string.punctuation})

vs

table = str.maketrans('', '', string.punctuation)

Plus I added another method to use set, where I take advantage of intersection function to reduce number of iterations.

This is the complete code:

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return(s.translate(table))


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

This is my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565

Solution 17 - Python

>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s*', s)


['string', 'With', 'Punctuation']

Solution 18 - Python

Here's a solution without regex.

import string

input_text = "!where??and!!or$$then:)"
punctuation_replacer = string.maketrans(string.punctuation, ' '*len(string.punctuation))    
print ' '.join(input_text.translate(punctuation_replacer).split()).strip()

Output>> where and or then
  • Replaces the punctuations with spaces
  • Replace multiple spaces in between words with a single space
  • Remove the trailing spaces, if any with strip()

Solution 19 - Python

Why none of you use this?

 ''.join(filter(str.isalnum, s)) 

Too slow?

Solution 20 - Python

# FIRST METHOD
# Storing all punctuations in a variable    
punctuation='!?,.:;"\')(_-'
newstring ='' # Creating empty string
word = raw_input("Enter string: ")
for i in word:
     if(i not in punctuation):
                  newstring += i
print ("The string without punctuation is", newstring)

# SECOND METHOD
word = raw_input("Enter string: ")
punctuation = '!?,.:;"\')(_-'
newstring = word.translate(None, punctuation)
print ("The string without punctuation is",newstring)


# Output for both methods
Enter string: hello! welcome -to_python(programming.language)??,
The string without punctuation is: hello welcome topythonprogramminglanguage

Solution 21 - Python

Here's one other easy way to do it using RegEx

import re

punct = re.compile(r'(\w+)')

sentence = 'This ! is : a # sample $ sentence.' # Text with punctuation
tokenized = [m.group() for m in punct.finditer(sentence)]
sentence = ' '.join(tokenized)
print(sentence) 
'This is a sample sentence'

Solution 22 - Python

I was looking for a really simple solution. here's what I got:

import re 

s = "string. With. Punctuation?" 
s = re.sub(r'[\W\s]', ' ', s)

print(s)
'string  With  Punctuation '

Solution 23 - Python

with open('one.txt','r')as myFile:

    str1=myFile.read()

    print(str1)


    punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"] 
 
for i in punctuation:

        str1 = str1.replace(i," ") 
        myList=[]
        myList.extend(str1.split(" "))
print (str1) 
for i in myList:
   
    print(i,end='\n')
    print ("____________")

Solution 24 - Python

Try that one :)

regex.sub(r'\p{P}','', s)

Solution 25 - Python

The question does not have a lot of specifics, so the approach I took is to come up with a solution with the simplest interpretation of the problem: just remove the punctuation.

Note that solutions presented don't account for contracted words (e.g., you're) or hyphenated words (e.g., anal-retentive)...which is debated as to whether they should or shouldn't be treated as punctuations...nor to account for non-English character set or anything like that...because those specifics were not mentioned in the question. Someone argued that space is punctuation, which is technically correct...but to me it makes zero sense in the context of the question at hand.

# using lambda
''.join(filter(lambda c: c not in string.punctuation, s))

# using list comprehension
''.join('' if c in string.punctuation else c for c in s)

Solution 26 - Python

Apparently I can't supply edits to the selected answer, so here's an update which works for Python 3. The translate approach is still the most efficient option when doing non-trivial transformations.

Credit for the original heavy lifting to @Brian above. And thanks to @ddejohn for his excellent suggestion for improvement to the original test.

#!/usr/bin/env python3

"""Determination of most efficient way to remove punctuation in Python 3.

Results in Python 3.8.10 on my system using the default arguments:

set       : 51.897
regex     : 17.901
translate :  2.059
replace   : 13.209
"""

import argparse
import re
import string
import timeit

parser = argparse.ArgumentParser()
parser.add_argument("--filename", "-f", default=argparse.__file__)
parser.add_argument("--iterations", "-i", type=int, default=10000)
opts = parser.parse_args()
with open(opts.filename) as fp:
    s = fp.read()
exclude = set(string.punctuation)
table = str.maketrans("", "", string.punctuation)
regex = re.compile(f"[{re.escape(string.punctuation)}]")

def test_set(s):
    return "".join(ch for ch in s if ch not in exclude)

def test_regex(s):  # From Vinko's solution, with fix.
    return regex.sub("", s)

def test_translate(s):
    return s.translate(table)

def test_replace(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

opts = dict(globals=globals(), number=opts.iterations)
solutions = "set", "regex", "translate", "replace"
for solution in solutions:
    elapsed = timeit.timeit(f"test_{solution}(s)", **opts)
    print(f"{solution:<10}: {elapsed:6.3f}")

Solution 27 - Python

Considering unicode. Code checked in python3.

from unicodedata import category
text = 'hi, how are you?'
text_without_punc = ''.join(ch for ch in text if not category(ch).startswith('P'))

Solution 28 - Python

You can also do this:

import string
' '.join(word.strip(string.punctuation) for word in 'text'.split())

Solution 29 - Python

When you deal with the Unicode strings, I suggest using PyPi regex module because it supports both Unicode property classes (like \p{X} / \P{X}) and POSIX character classes (like [:name:]).

Just install the package by typing pip install regex (or pip3 install regex) in your terminal and hit ENTER.

In case you need to remove punctuation and symbols of any kind (that is, anything other than letters, digits and whitespace) you can use

regex.sub(r'[\p{P}\p{S}]', '', text)  # to remove one by one
regex.sub(r'[\p{P}\p{S}]+', '', text) # to remove all consecutive punctuation/symbols with one go
regex.sub(r'[[:punct:]]+', '', text)  # Same with a POSIX character class

See a Python demo online:

import regex

text = 'भारत India <><>^$.,,! 002'
new_text = regex.sub(r'[\p{P}\p{S}\s]+', ' ', text).lower().strip()
# OR
# new_text = regex.sub(r'[[:punct:]\s]+', ' ', text).lower().strip()

print(new_text)
# => भारत india 002

Here, I added a whitespace \s pattern to the character class

Solution 30 - Python

For serious natural language processing (NLP), you should let a library like SpaCy handle punctuation through tokenization, which you can then manually tweak to your needs.

For example, how do you want to handle hyphens in words? Exceptional cases like abbreviations? Begin and end quotes? URLs? IN NLP it's often useful to separate out a contraction like "let's" into "let" and "'s" for further processing.

SpaCy example tokenization

Solution 31 - Python

Remove stop words from the text file using Python

print('====THIS IS HOW TO REMOVE STOP WORS====')

with open('one.txt','r')as myFile:

    str1=myFile.read()

    stop_words ="not", "is", "it", "By","between","This","By","A","when","And","up","Then","was","by","It","If","can","an","he","This","or","And","a","i","it","am","at","on","in","of","to","is","so","too","my","the","and","but","are","very","here","even","from","them","then","than","this","that","though","be","But","these"
   
    myList=[]

    myList.extend(str1.split(" "))

    for i in myList:

        if i not in stop_words:

            print ("____________")

            print(i,end='\n')

Solution 32 - Python

I like to use a function like this:

def scrub(abc):
    while abc[-1] is in list(string.punctuation):
        abc=abc[:-1]
    while abc[0] is in list(string.punctuation):
        abc=abc[1:]
    return abc

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionLawrence JohnstonView Question on Stackoverflow
Solution 1 - PythonBrianView Answer on Stackoverflow
Solution 2 - PythonEratosthenesView Answer on Stackoverflow
Solution 3 - PythonSparkAndShineView Answer on Stackoverflow
Solution 4 - PythonpyrouView Answer on Stackoverflow
Solution 5 - PythonS.LottView Answer on Stackoverflow
Solution 6 - PythonBjörn LindqvistView Answer on Stackoverflow
Solution 7 - PythonVinko VrsalovicView Answer on Stackoverflow
Solution 8 - PythonMartijn PietersView Answer on Stackoverflow
Solution 9 - PythonZachView Answer on Stackoverflow
Solution 10 - PythonBlairg23View Answer on Stackoverflow
Solution 11 - PythonTim PView Answer on Stackoverflow
Solution 12 - PythonDavid VuongView Answer on Stackoverflow
Solution 13 - PythonDr.TautologyView Answer on Stackoverflow
Solution 14 - PythonHaythem HADHABView Answer on Stackoverflow
Solution 15 - PythonDom GreyView Answer on Stackoverflow
Solution 16 - PythonkrinkerView Answer on Stackoverflow
Solution 17 - PythonPablo Rodriguez BertorelloView Answer on Stackoverflow
Solution 18 - Pythonngub05View Answer on Stackoverflow
Solution 19 - PythonDehua LiView Answer on Stackoverflow
Solution 20 - PythonAnimeartistView Answer on Stackoverflow
Solution 21 - PythonZain SarwarView Answer on Stackoverflow
Solution 22 - PythonalohaView Answer on Stackoverflow
Solution 23 - PythonIsayas Wakgari KelbessaView Answer on Stackoverflow
Solution 24 - PythonVivianView Answer on Stackoverflow
Solution 25 - PythonDexter LegaspiView Answer on Stackoverflow
Solution 26 - PythonBob KlineView Answer on Stackoverflow
Solution 27 - PythonRajan saha RajuView Answer on Stackoverflow
Solution 28 - PythonmohannatdView Answer on Stackoverflow
Solution 29 - PythonWiktor StribiżewView Answer on Stackoverflow
Solution 30 - PythonqwrView Answer on Stackoverflow
Solution 31 - PythonIsayas Wakgari KelbessaView Answer on Stackoverflow
Solution 32 - PythonDisk GiantView Answer on Stackoverflow