(grep) Regex to match non-ASCII characters?
Regex Problem Overview
On Linux, I have a directory with lots of files. Some of them have non-ASCII characters in their names, but they are all valid UTF-8. One program has a bug that prevents it from working with non-ASCII filenames, and I have to find out how many files are affected. I was going to do this with find, then a grep to print the names containing non-ASCII characters, and then wc -l to count them. It doesn't have to be grep; I can use any standard Unix tool with regular expressions, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
Regex Solutions
Solution 1 - Regex
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use the POSIX-style bracket classes where the regex engine supports them ([:ascii:] is a Perl/PCRE extension, not one of the standard POSIX classes):
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.
Solution 2 - Regex
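No, [^\x20-\x7E] is not "non-ASCII"; it excludes only the printable ASCII range, so it also matches ASCII control characters.
This is the real non-ASCII class:
[^\x00-\x7F]
Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!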
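To see the difference between the two classes in practice, here is a minimal shell sketch. It swaps in tr for a regex engine, since tr accepts the same byte ranges written in octal. bytes_outside is a hypothetical helper name used only for illustration.

```shell
# bytes_outside RANGE: count the bytes on stdin that fall outside RANGE.
# RANGE is an octal tr range, e.g. '\000-\177' for the full ASCII table.
bytes_outside() {
    LC_ALL=C tr -d "$1" | wc -c | tr -d ' '
}

# 'a', TAB, 'b', then U+00E9 encoded as the two UTF-8 bytes C3 A9
sample=$(printf 'a\tb\303\251')

printf '%s' "$sample" | bytes_outside '\000-\177'   # 2: only the two non-ASCII bytes
printf '%s' "$sample" | bytes_outside '\040-\176'   # 3: the TAB is counted as well
```

The second call shows why the printable-only range over-reports: the TAB is ASCII, yet it falls outside \x20-\x7E.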
Solution 3 - Regex
You could also check this page: Unicode Regular Expressions, as it contains some useful Unicode character classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
Solution 4 - Regex
You can use this regex:
[^\w \xC0-\xFF]
In case you ask, the option is Multiline.
Solution 5 - Regex
[^\x00-\x7F] and [^[:ascii:]] miss some control bytes, so strings can be the better option sometimes. For example, cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, whereas strings test.torrent will behave.
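If strings is not available, a rough alternative is to neutralise every byte that is neither a newline nor printable ASCII before it reaches the terminal. This is only a sketch of that idea using tr, not what strings itself does (strings extracts runs of printable characters). safe_view is a hypothetical name.

```shell
# safe_view: filter stdin so it is safe to print on a terminal.
# Every byte that is not a newline (\012) or printable ASCII (\040-\176)
# is replaced by a dot, so escape sequences cannot reach the terminal.
safe_view() {
    LC_ALL=C tr -c '\012\040-\176' '.'
}

# ESC and the two UTF-8 bytes of U+00E9 each become a dot
printf 'ok\033[31m\303\251\n' | safe_view   # prints: ok.[31m..
```

Because ASCII control bytes such as ESC are replaced too, this avoids the terminal corruption that a non-ASCII-only substitution leaves behind.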
Solution 6 - Regex
To validate that a text box accepts ASCII only, use this pattern:
[\x00-\x7F]+
Solution 7 - Regex
I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
Solution 8 - Regex
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
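Since the original goal was a count rather than a listing, the same glob can be used inside a case statement. The following is a minimal sketch, assuming only the direct, non-hidden entries of one directory need checking. count_flagged_names is a hypothetical helper name.

```shell
# count_flagged_names DIR: print how many entries directly inside DIR have a
# name containing any byte outside printable ASCII (0x20-0x7E).
# The body runs in a subshell, so the LC_ALL change does not leak out.
count_flagged_names() (
    LC_ALL=C    # compare byte by byte so the " -~" range means 0x20-0x7E
    count=0
    for path in "$1"/*; do
        [ -e "$path" ] || continue    # skips the unexpanded pattern in an empty directory
        name=${path##*/}              # strip the directory part
        case $name in
            *[!\ -~]*) count=$((count + 1)) ;;
        esac
    done
    printf '%s\n' "$count"
)
```

Calling count_flagged_names . prints the number of affected names in the current directory. As with the printf one-liner, names containing control characters are counted too.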
Solution 9 - Regex
This turned out to be very flexible and extensible:
$field =~ s/[^\x00-\x7F]//g;  # strip every non-ASCII character
This way all non-ASCII characters, or specific items in question, can be cleaned out. It is very nice both for selection and for pre-processing items that will eventually become hash keys.
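For pipelines where Perl is not at hand, roughly the same clean-up can be sketched with tr, which deletes the bytes 0x80-0xFF and leaves the ASCII range untouched. strip_non_ascii is a hypothetical name, and this is a byte-level approximation of the substitution above rather than an exact equivalent.

```shell
# strip_non_ascii: delete every byte in the range 0x80-0xFF from stdin,
# leaving only ASCII bytes (including control characters and newlines).
strip_non_ascii() {
    LC_ALL=C tr -d '\200-\377'
}

# U+00E9 and U+00EF are each two UTF-8 bytes; both bytes are removed
printf 'caf\303\251 na\303\257ve\n' | strip_non_ascii   # prints: caf nave
```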