(grep) Regex to match non-ASCII characters?

Tags: Regex, Unicode, Grep, Ascii, Non-Ascii-Characters

Regex Problem Overview


On Linux, I have a directory with lots of files. Some of them have non-ASCII characters in their names, but they are all valid UTF-8. One program has a bug that prevents it from working with non-ASCII filenames, and I have to find out how many files are affected. I was going to do this with find, then grep to print the names containing non-ASCII characters, and then wc -l to count them. It doesn't have to be grep; I can use any standard Unix tool that supports regular expressions, such as Perl, sed, or AWK.

However, is there a regular expression for 'any character that's not an ASCII character'?

Regex Solutions


Solution 1 - Regex

This will match a single non-ASCII character:

[^\x00-\x7F]

This is a valid PCRE (Perl-Compatible Regular Expression).

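You can also use the POSIX-style bracket classes (note that [:ascii:] is a Perl/PCRE extension rather than one of the standard POSIX classes, so plain grep without -P may reject it):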

  • [[:ascii:]] - matches a single ASCII char
  • [^[:ascii:]] - matches a single non-ASCII char

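[^[:print:]] will probably suffice for you, although it matches non-printable characters (including ASCII control characters) rather than strictly non-ASCII ones, and what counts as printable depends on the locale.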
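As a rough illustration of how this character class could answer the original question, here is a minimal Python sketch that counts names containing at least one non-ASCII character. The function name count_non_ascii and the use of os.listdir to collect names are assumptions made for the example, not part of the original answer.

```python
import os
import re

# Same character class as above: any character outside U+0000..U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def count_non_ascii(names):
    """Return how many names contain at least one non-ASCII character."""
    return sum(1 for name in names if NON_ASCII.search(name))

if __name__ == "__main__":
    # Hypothetical usage: count affected entries in the current directory.
    print(count_non_ascii(os.listdir(".")))
```

Passing the names in as a list keeps the matching logic separate from how the names are gathered, so the same function would work on output collected from find.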

Solution 2 - Regex

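No, [^\x20-\x7E] is not the complement of ASCII; it only excludes the printable ASCII range.

This is the complement of the full ASCII range:

 [^\x00-\x7F]

Otherwise, the pattern will also match (and trim out) newlines and other control characters that are part of the ASCII table!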
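The difference between the two classes can be seen on a small sample. This is only a sketch in Python; the variable names and the sample string are made up for illustration.

```python
import re

# Complement of printable ASCII: also matches ASCII control characters.
PRINTABLE_COMPLEMENT = re.compile(r"[^\x20-\x7E]")
# Complement of the full ASCII range: matches only non-ASCII characters.
ASCII_COMPLEMENT = re.compile(r"[^\x00-\x7F]")

sample = "line one\nline two\tcaf\u00e9"

printable_hits = PRINTABLE_COMPLEMENT.findall(sample)
ascii_hits = ASCII_COMPLEMENT.findall(sample)

print(printable_hits)  # newline, tab, and the accented e
print(ascii_hits)      # only the accented e
```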

Solution 3 - Regex

You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:

\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
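Not every regex engine supports \p{...} classes; Python's built-in re module, for instance, does not. A comparable check can be sketched with the standard unicodedata module, where the general category "Cc" corresponds to control characters. The helper name control_chars is hypothetical.

```python
import unicodedata

def control_chars(text):
    """Return the characters of `text` whose Unicode general category is Cc (control)."""
    return [ch for ch in text if unicodedata.category(ch) == "Cc"]

print(control_chars("a\tb\x85c"))
```

Note that unicodedata also reports U+007F (DEL) as a control character.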

Solution 4 - Regex

You can use this regex:

[^\w \xC0-\xFF]

If your tool asks for regex options, use Multiline.

Solution 5 - Regex

[^\x00-\x7F] and [^[:ascii:]] do not match ASCII control bytes (they are ASCII, after all), so strings can sometimes be the better option. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
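For readers without the strings utility at hand, its general idea (keep only runs of printable bytes) can be approximated in Python. This is a loose sketch and not a reimplementation of strings; the function name printable_runs and the default minimum length of 4 are assumptions.

```python
import re

def printable_runs(data, min_len=4):
    """Return runs of at least `min_len` printable ASCII bytes (tab included) found in `data`."""
    pattern = re.compile(rb"[\t\x20-\x7E]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]

# Control bytes and high bytes act as separators and never reach the output.
print(printable_runs(b"\x00\x01announce\xff\xfeabc\x00info\x1b"))
```

Because escape bytes such as 0x1B are dropped rather than passed through, the output is safe to print to a terminal.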

Solution 6 - Regex

To validate that a text box accepts only ASCII, use this pattern:

[\x00-\x7F]+
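When used for validation, the pattern has to cover the whole input; an unanchored search would succeed on any input that merely contains an ASCII character. A minimal Python sketch, with is_ascii_only as a hypothetical helper name:

```python
import re

ASCII_ONLY = re.compile(r"[\x00-\x7F]+")

def is_ascii_only(text):
    """Return True if `text` is non-empty and consists solely of ASCII characters."""
    # fullmatch requires the entire string to match, which is equivalent
    # to anchoring the pattern as ^[\x00-\x7F]+$ in other engines.
    return ASCII_ONLY.fullmatch(text) is not None

print(is_ascii_only("hello"), is_ascii_only("h\u00e9llo"))
```

In Python 3.7 and later, the built-in str.isascii() is a regex-free alternative, though it treats the empty string as ASCII while the + quantifier here rejects it.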

Solution 7 - Regex

I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.

Solution 8 - Regex

You don't really need a regex.

printf "%s\n" *[!\ -~]*

This will show file names with control characters in their names, too, but I consider that a feature.

If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
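The same glob can be tried outside the shell. Python's fnmatch module accepts the [!...] syntax, so the pattern can be applied to a list of names; the sketch below assumes the names have already been collected, and the function name is made up for illustration.

```python
import fnmatch

def names_outside_printable_ascii(names):
    """Return names containing any character outside the range space..tilde."""
    # Same glob as above; no backslash is needed before the space in Python.
    return [name for name in names if fnmatch.fnmatchcase(name, "*[! -~]*")]

print(names_outside_printable_ascii(["plain.txt", "caf\u00e9.txt", "with space.txt"]))
```

As with the shell version, names containing control characters such as a tab are reported too.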

Solution 9 - Regex

This turned out to be very flexible and extensible:

$field =~ s/[^\x00-\x7F]//g;

Thus all non-ASCII characters, or specific items in question, can be cleaned out. It is very nice either for selection or for pre-processing items that will eventually become hash keys.
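The same substitution can be sketched in Python with re.sub; the function name strip_non_ascii is hypothetical and only mirrors the Perl one-liner above.

```python
import re

def strip_non_ascii(field):
    """Remove every character outside the ASCII range from `field`."""
    return re.sub(r"[^\x00-\x7F]", "", field)

print(strip_non_ascii("na\u00efve caf\u00e9"))
```

Note that stripping can make two different inputs collapse to the same key, which matters if the results are used as dictionary or hash keys.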

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Amandasaurus | View Question on Stack Overflow
Solution 1 - Regex | Alix Axel | View Answer on Stack Overflow
Solution 2 - Regex | Peter L | View Answer on Stack Overflow
Solution 3 - Regex | Rubens Farias | View Answer on Stack Overflow
Solution 4 - Regex | CypherPotato | View Answer on Stack Overflow
Solution 5 - Regex | user1133275 | View Answer on Stack Overflow
Solution 6 - Regex | Othman Mahmoud | View Answer on Stack Overflow
Solution 7 - Regex | SolidSnakeUk89 | View Answer on Stack Overflow
Solution 8 - Regex | tripleee | View Answer on Stack Overflow
Solution 9 - Regex | Don Turnblade | View Answer on Stack Overflow