How can non-ASCII characters be removed from a string?
JavaRegexStringReplaceCharJava Problem Overview
I have strings "A função"
, "Ãugent"
in which I need to replace characters like ç
, ã
, and Ã
with empty strings.
How can I remove those non-ASCII characters from my string?
I have attempted to implement this using the following function, but it is not working properly. One problem is that the unwanted characters are getting replaced by the space character.
public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
String newsrcdta = null;
char array[] = Arrays.stringToCharArray(tmpsrcdta);
if (array == null)
return newsrcdta;
for (int i = 0; i < array.length; i++) {
int nVal = (int) array[i];
boolean bISO =
// Is character ISO control
Character.isISOControl(array[i]);
boolean bIgnorable =
// Is Ignorable identifier
Character.isIdentifierIgnorable(array[i]);
// Remove tab and other unwanted characters..
if (nVal == 9 || bISO || bIgnorable)
array[i] = ' ';
else if (nVal > 255)
array[i] = ' ';
}
newsrcdta = Arrays.charArrayToString(array);
return newsrcdta;
}
Java Solutions
Solution 1 - Java
This will search and replace all non ASCII letters:
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
Solution 2 - Java
FailedDev's answer is good, but can be improved. If you want to preserve the ascii equivalents, you need to normalize first:
String subjectString = "öäü";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");
=> will produce "oau"
That way, characters like "öäü" will be mapped to "oau", which at least preserves some information. Without normalization, the resulting String will be blank.
Solution 3 - Java
This would be the Unicode solution
String s = "A função, Ãugent";
String r = s.replaceAll("\\P{InBasic_Latin}", "");
\p{InBasic_Latin}
is the Unicode block that contains all letters in the Unicode range U+0000..U+007F (see regular-expression.info)
\P{InBasic_Latin}
is the negated \p{InBasic_Latin}
Solution 4 - Java
You can try something like this. Special Characters range for alphabets starts from 192, so you can avoid such characters in the result.
String name = "A função";
StringBuilder result = new StringBuilder();
for(char val : name.toCharArray()) {
if(val < 192) result.append(val);
}
System.out.println("Result "+result.toString());
Solution 5 - Java
Or you can use the function below for removing non-ascii character from the string. You will get know internal working.
private static String removeNonASCIIChar(String str) {
StringBuffer buff = new StringBuffer();
char chars[] = str.toCharArray();
for (int i = 0; i < chars.length; i++) {
if (0 < chars[i] && chars[i] < 127) {
buff.append(chars[i]);
}
}
return buff.toString();
}
Solution 6 - Java
[Updated solution]
can be used with "Normalize" (Canonical decomposition) and "replaceAll", to replace it with the appropriate characters.
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;
public final class NormalizeUtils {
public static String normalizeASCII(final String string) {
final String normalize = Normalizer.normalize(string, Form.NFD);
return Pattern.compile("\\p{InCombiningDiacriticalMarks}+")
.matcher(normalize)
.replaceAll("");
} ...
Solution 7 - Java
The ASCII table contains 128 codes, with a total of 95 printable characters, of which only 52 characters are letters:
[0-127]
ASCII codes[32-126]
printable characters[48-57]
digits[0-9]
[65-90]
uppercase letters[A-Z]
[97-122]
lowercase letters[a-z]
You can use String.codePoints
method to get a stream over int
values of characters of this string and filter
out non-ASCII characters:
String str1 = "A função, Ãugent";
String str2 = str1.codePoints()
.filter(ch -> ch < 128)
.mapToObj(Character::toString)
.collect(Collectors.joining());
System.out.println(str2); // A funo, ugent
Or you can explicitly specify character ranges. For example filter out everything except letters:
String str3 = str1.codePoints()
.filter(ch -> ch >= 'A' && ch <= 'Z'
|| ch >= 'a' && ch <= 'z')
.mapToObj(Character::toString)
.collect(Collectors.joining());
System.out.println(str3); // Afunougent
See also: How do I not take Special Characters in my Password Validation (without Regex)?
Solution 8 - Java
String s = "A função";
String stripped = s.replaceAll("\\P{ASCII}", "");
System.out.println(stripped); // Prints "A funo"
or
private static final Pattern NON_ASCII_PATTERN = Pattern.compile("\\P{ASCII}");
public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
return NON_ASCII_PATTERN.matcher(s).replaceAll("");
}
public static void main(String[] args) {
matchAndReplaceNonEnglishChar("A função"); // Prints "A funo"
}
Explanation
The method String.replaceAll(String regex, String replacement)
replaces all instances of a given regular expression (regex) with a given replacement string.
> Replaces each substring of this string that matches the given regular expression with the given replacement.
Java has the "\p{ASCII}"
regular expression construct which matches any ASCII character, and its inverse, "\P{ASCII}"
, which matches any non-ASCII character. The matched characters can then be replaced with the empty string, effectively removing them from the resulting string.
String s = "A função";
String stripped = s.replaceAll("\\P{ASCII}", "");
System.out.println(stripped); // Prints "A funo"
The full list of valid regex constructs is documented in the Pattern
class.
Note: If you are going to be calling this pattern multiple times within a run, it will be more efficient to use a compiled Pattern
directly, rather than String.replaceAll
. This way the pattern is compiled only once and reused, rather than each time replaceAll
is called:
public class AsciiStripper {
private static final Pattern NON_ASCII_PATTERN = Pattern.compile("\\P{ASCII}");
public static String stripNonAscii(String s) {
return NON_ASCII_PATTERN.matcher(s).replaceAll("");
}
}
Solution 9 - Java
An easily-readable, ascii-printable, streams solution:
String result = str.chars()
.filter(c -> isAsciiPrintable((char) c))
.mapToObj(c -> String.valueOf((char) c))
.collect(Collectors.joining());
private static boolean isAsciiPrintable(char ch) {
return ch >= 32 && ch < 127;
}
To convert to "_": .map(c -> isAsciiPrintable((char) c) ? c : '_')
32 to 127 is equivalent to the regex [^\\x20-\\x7E]
(from comment on the regex solution)
Source for isAsciiPrintable: http://www.java2s.com/Code/Java/Data-Type/ChecksifthestringcontainsonlyASCIIprintablecharacters.htm
Solution 10 - Java
CharMatcher.retainFrom
can be used, if you're using the Google Guava library:
String s = "A função";
String stripped = CharMatcher.ascii().retainFrom(s);
System.out.println(stripped); // Prints "A funo"