Removing control characters from a string in python

PythonStringPython 3.x

Python Problem Overview


I currently have the following code

def removeControlCharacters(line):
    i = 0
    for c in line:
        if (c < chr(32)):
            line = line[:i - 1] + line[i+1:]
            i += 1
    return line

This is just does not work if there are more than one character to be deleted.

Python Solutions


Solution 1 - Python

There are hundreds of control characters in unicode. If you are sanitizing data from the web or some other source that might contain non-ascii characters, you will need Python's unicodedata module. The unicodedata.category(…) function returns the unicode category code (e.g., control character, whitespace, letter, etc.) of any character. For control characters, the category always starts with "C".

This snippet removes all control characters from a string.

import unicodedata
def remove_control_characters(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

Examples of unicode categories:

>>> from unicodedata import category
>>> category('\r')      # carriage return --> Cc : control character
'Cc'
>>> category('\0')      # null character ---> Cc : control character
'Cc'
>>> category('\t')      # tab --------------> Cc : control character
'Cc'
>>> category(' ')       # space ------------> Zs : separator, space
'Zs'
>>> category(u'\u200A') # hair space -------> Zs : separator, space
'Zs'
>>> category(u'\u200b') # zero width space -> Cf : control character, formatting
'Cf'
>>> category('A')       # letter "A" -------> Lu : letter, uppercase
'Lu'
>>> category(u'\u4e21') # 両 ---------------> Lo : letter, other
'Lo'
>>> category(',')       # comma  -----------> Po : punctuation
'Po'
>>>

Solution 2 - Python

You could use str.translate with the appropriate map, for example like this:

>>> mpa = dict.fromkeys(range(32))
>>> 'abc\02de'.translate(mpa)
'abcde'

Solution 3 - Python

Anyone interested in a regex character class that matches any Unicode control character may use [\x00-\x1f\x7f-\x9f].

You may test it like this:

>>> import unicodedata, re, sys
>>> all_chars = [chr(i) for i in range(sys.maxunicode)]
>>> control_chars = ''.join(c for c in all_chars if unicodedata.category(c) == 'Cc')
>>> expanded_class = ''.join(c for c in all_chars if re.match(r'[\x00-\x1f\x7f-\x9f]', c))
>>> control_chars == expanded_class
True

So to remove the control characters using re just use the following:

>>> re.sub(r'[\x00-\x1f\x7f-\x9f]', '', 'abc\02de')
'abcde'

Solution 4 - Python

This is the easiest, most complete, and most robust way I am aware of. It does require an external dependency, however. I consider it to be worth it for most projects.

pip install regex

import regex as rx
def remove_control_characters(str):
    return rx.sub(r'\p{C}', '', 'my-string')

\p{C} is the unicode character property for control characters, so you can leave it up to the unicode consortium which ones of the millions of unicode characters available should be considered control. There are also other extremely useful character properties I frequently use, for example \p{Z} for any kind of whitespace.

Solution 5 - Python

Your implementation is wrong because the value of i is incorrect. However that's not the only problem: it also repeatedly uses slow string operations, meaning that it runs in O(n2) instead of O(n). Try this instead:

return ''.join(c for c in line if ord(c) >= 32)

Solution 6 - Python

And for Python 2, with the builtin translate:

import string
all_bytes = string.maketrans('', '')  # String of 256 characters with (byte) value 0 to 255

line.translate(all_bytes, all_bytes[:32])  # All bytes < 32 are deleted (the second argument lists the bytes to delete)

Solution 7 - Python

You modify the line during iterating over it. Something like ''.join([x for x in line if ord(x) >= 32])

Solution 8 - Python

filter(string.printable[:-5].__contains__,line)

Solution 9 - Python

I've tried all the above and it didn't help. In my case, I had to remove Unicode 'LRM' chars:

Finally I found this solution that did the job:

df["AMOUNT"] = df["AMOUNT"].str.encode("ascii", "ignore")
df["AMOUNT"] = df["AMOUNT"].str.decode('UTF-8')

Reference here.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionDavidView Question on Stackoverflow
Solution 1 - PythonAlex QuinnView Answer on Stackoverflow
Solution 2 - PythonSilentGhostView Answer on Stackoverflow
Solution 3 - PythonAXOView Answer on Stackoverflow
Solution 4 - PythoncmcView Answer on Stackoverflow
Solution 5 - PythonMark ByersView Answer on Stackoverflow
Solution 6 - PythonEric O LebigotView Answer on Stackoverflow
Solution 7 - PythonkhachikView Answer on Stackoverflow
Solution 8 - PythonKabieView Answer on Stackoverflow
Solution 9 - PythonOded LView Answer on Stackoverflow