Stripping non printable characters from a string in python

PythonStringNon Printable

Python Problem Overview


I use to run

$s =~ s/[^[:print:]]//g;

on Perl to get rid of non printable characters.

In Python there's no POSIX regex classes, and I can't write [:print:] having it mean what I want. I know of no way in Python to detect if a character is printable or not.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable way will happily strip them out of the output. curses.ascii.isprint will return false for any unicode character.

Python Solutions


Solution 1 - Python

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python2

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use-cases, additional categories (e.g. all from the control group might be preferable, although this might slow down the processing time and increase memory usage significantly. Number of characters per category:

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601

Edit Adding suggestions from the comments.

Solution 2 - Python

As far as I know, the most pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)

Solution 3 - Python

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 in the Unicode database character properties for the available categories

Solution 4 - Python

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

See this StackOverflow post on removing punctuation for how .translate() compares to regex & .replace()

The ranges can be generated via nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc') using the Unicode character database categories as shown by @Ants Aasma.

Solution 5 - Python

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Replace non-printable characters in a string."""
    
    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.

Solution 6 - Python

This function uses list comprehensions and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))

Solution 7 - Python

Yet another option in python 3:

re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

Solution 8 - Python

Based on @Ber's answer, I suggest removing only control characters as defined in the Unicode character database categories:

import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))

Solution 9 - Python

The best I've come up with now is (thanks to the python-izers above)

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found out that works with Unicode characters/strings

Any better options?

Solution 10 - Python

The one below performs faster than the others above. Take a look

''.join([x if x in string.printable else '' for x in Str])

Solution 11 - Python

> In Python there's no POSIX regex classes

There are when using the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, Posix regex and many more. The usage (method signatures) is very similar to Python's re.

From the documentation:

> [[:alpha:]]; [[:^alpha:]] > > POSIX character classes are supported. These > are normally treated as an alternative form of \p{...}.

(I'm not affiliated, just a user.)

Solution 12 - Python

An elegant pythonic solution to stripping 'non printable' characters from a string in python is to use the isprintable() string method together with a generator expression or list comprehension depending on the use case ie. size of the string:

    ''.join(c for c in my_string if c.isprintable())

str.isprintable() Return True if all characters in the string are printable or the string is empty, False otherwise. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)

Solution 13 - Python

To remove 'whitespace',

import re
t = """
\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))

Solution 14 - Python

Adapted from answers by Ants Aasma and shawnrad:

nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
    return text.translate(ord_dict)

#use
str = "this” is’ my“ string"
str = filter_nonprintable(str)
print(str)

tested on Python 3.7.7

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionVinko VrsalovicView Question on Stackoverflow
Solution 1 - PythonAnts AasmaView Answer on Stackoverflow
Solution 2 - PythonWilliam KellerView Answer on Stackoverflow
Solution 3 - PythonBerView Answer on Stackoverflow
Solution 4 - PythonshawnradView Answer on Stackoverflow
Solution 5 - PythonChrisPView Answer on Stackoverflow
Solution 6 - PythonKirk StrauserView Answer on Stackoverflow
Solution 7 - Pythonc6401View Answer on Stackoverflow
Solution 8 - PythondarkdragonView Answer on Stackoverflow
Solution 9 - PythonVinko VrsalovicView Answer on Stackoverflow
Solution 10 - PythonNilav Baran GhoshView Answer on Stackoverflow
Solution 11 - PythonRisadinhaView Answer on Stackoverflow
Solution 12 - PythonThomas Juul DyhrView Answer on Stackoverflow
Solution 13 - PythonknowingparkView Answer on Stackoverflow
Solution 14 - PythonJoeView Answer on Stackoverflow