Remove all non-"word characters" from a String in Java, leaving accented characters?

Java Problem Overview

Apparently Java's regex flavor counts umlauts and other accented characters as non-"word characters" by default:

        "TESTÜTEST".replaceAll( "\\W", "" )

returns "TESTTEST" for me. What I want is to remove only the characters that truly are non-"word characters". Is there any way to do this without resorting to something along the lines of

         "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"

only to realize I forgot ô?

Java Solutions

Solution 1 - Java

Use [^\p{L}\p{Nd}]+ - this matches all (Unicode) characters that are neither letters nor (decimal) digits.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");

Edit:

I changed \p{N} to \p{Nd} because the former also matches some number symbols like ¼; the latter doesn't. See it on regex101.com.
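To make the difference between the two character classes concrete, here is a minimal sketch. The class and method names are made up for illustration; only the regular expressions come from the answer above. Non-ASCII characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
final class UnicodeWordFilter {

    // Removes every character that is neither a Unicode letter (L)
    // nor a decimal digit (Nd).
    static String keepLettersAndDigits(String input) {
        return input.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    // Variant with the broader N category, which also keeps number
    // symbols such as U+00BC (vulgar fraction one quarter).
    static String keepLettersAndNumbers(String input) {
        return input.replaceAll("[^\\p{L}\\p{N}]+", "");
    }

    public static void main(String[] args) {
        // U+00DC is the capital U with diaeresis from the question.
        System.out.println(keepLettersAndDigits("TEST\u00DCTEST!")); // U-umlaut is kept, '!' is dropped
        System.out.println(keepLettersAndDigits("\u00BC cup"));      // prints "cup"
        System.out.println(keepLettersAndNumbers("\u00BC cup"));     // the fraction sign is kept
    }
}
```

Note that a letter stored in decomposed form (base letter plus combining mark) would lose its mark with either pattern, because combining marks belong to category M rather than L.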

Solution 2 - Java

I was trying to achieve the exact opposite when I stumbled upon this thread. I know it's quite old, but here's my solution nonetheless. You can use Unicode blocks. In this case, compile the following code (it needs java.util.regex.Pattern and java.util.regex.Matcher):

> String s = "äêìóblah"; 
> Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a block
> Matcher m = p.matcher(s);
> System.out.println(m.find());
> System.out.println(s.replaceAll(p.pattern(), "#"));

You should see the following output:

> true
> #blah

Best,
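One thing to keep in mind with blocks: a block is a code point range, not a category. The sketch below (class and method names are hypothetical) checks single characters against the same block as above. It illustrates that Latin-1 Supplement also contains symbols and that some accented letters live in other blocks.

```java
import java.util.regex.Pattern;

final class Latin1BlockCheck {

    // Matches exactly one character from the Latin-1 Supplement block
    // (U+0080 to U+00FF).
    private static final Pattern LATIN1_SUPPLEMENT =
            Pattern.compile("\\p{InLatin-1Supplement}");

    static boolean inLatin1Supplement(String singleChar) {
        return LATIN1_SUPPLEMENT.matcher(singleChar).matches();
    }

    public static void main(String[] args) {
        System.out.println(inLatin1Supplement("\u00E4")); // a with diaeresis: true
        System.out.println(inLatin1Supplement("\u00D7")); // multiplication sign: true
        System.out.println(inLatin1Supplement("\u0151")); // o with double acute, Latin Extended-A: false
        System.out.println(inLatin1Supplement("a"));      // Basic Latin: false
    }
}
```

So a block-based pattern is a reasonable fit for a known range of characters, but on its own it neither covers every accented letter nor excludes every symbol.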

Solution 3 - Java

At times you do not want to remove the characters entirely, but only strip their accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL:

import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {
	private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
	private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
			"sz" };

	/**
	 * Normalizes a String by stripping all accents, reducing it to the 128
	 * US-ASCII characters. German umlauts and the "sharp s" are transliterated
	 * first (for example "Ä" becomes "Ae") instead of being stripped.
	 * 
	 * @param s
	 *            The String to normalize
	 * @return The normalized String
	 */
	public static String normalize(String s) {
		if (s == null)
			return null;

		String n = null;

		n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
		n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

		return n;
	}

	/**
	 * Returns a clean representation of a String which might be used safely
	 * within a URL. Slugs are a more human-friendly form of URL encoding a
	 * String.
	 * <p>
	 * The method first normalizes a String, then converts it to lowercase and
	 * removes ASCII characters, which might be problematic in URLs:
	 * <ul>
	 * <li>all whitespaces
	 * <li>dots ('.')
	 * <li>(semi-)colons (';' and ':')
	 * <li>equals ('=')
	 * <li>ampersands ('&')
	 * <li>slashes ('/')
	 * <li>angle brackets ('<' and '>')
	 * </ul>
	 * 
	 * @param s
	 *            The String to slugify
	 * @return The slugified String
	 * @see #normalize(String)
	 */
	public static String slugify(String s) {

		if (s == null)
			return null;

		String n = normalize(s);
		n = StringUtils.lowerCase(n);
		n = n.replaceAll("[\\s.:;&=<>/]", "");

		return n;
	}
}

Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.

HTH

EDIT: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.
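For readers who only need the accent-stripping step and not the Commons Lang dependency, here is a minimal JDK-only sketch. It is not the class above: the name AccentStripper is made up, there is no German transliteration, and it removes combining marks (\p{M}) instead of everything outside ASCII, so characters without a decomposition are left untouched.

```java
import java.text.Normalizer;

final class AccentStripper {

    // NFD splits an accented letter into its base letter followed by
    // combining marks; removing category M then leaves the base letter.
    static String stripAccents(String s) {
        if (s == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // a-diaeresis, e-circumflex, i-grave, o-acute followed by "blah"
        System.out.println(stripAccents("\u00E4\u00EA\u00EC\u00F3blah")); // prints "aeioblah"
        // U+00DF (sharp s) has no canonical decomposition, so it is kept.
        System.out.println(stripAccents("Stra\u00DFe"));
    }
}
```

Whether to drop or keep characters such as the sharp s is a design choice; the original class drops whatever is still non-ASCII after its own replacements.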

Solution 4 - Java

Well, here is one solution I ended up with, but I hope there's a more elegant one...

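// name is the input String; only letters, digits and underscores are kept.
StringBuilder result = new StringBuilder();
for (int i = 0; i < name.length(); i++) {
    char tmpChar = name.charAt(i);
    // Character.isLetterOrDigit is Unicode-aware, so accented letters pass the check.
    if (Character.isLetterOrDigit(tmpChar) || tmpChar == '_') {
        result.append(tmpChar);
    }
}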

result then holds the desired string...
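The char-based loop treats each half of a surrogate pair separately, so letters outside the Basic Multilingual Plane are dropped. If that matters, a code-point-based variant could look like the sketch below. The class and method names are hypothetical; the filter condition is the same as above.

```java
final class CodePointFilter {

    // Same rule as the loop above (letter, digit or underscore),
    // but applied to whole code points instead of single chars.
    static String keepLettersDigitsUnderscore(String name) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < name.length(); ) {
            int codePoint = name.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint) || codePoint == '_') {
                result.appendCodePoint(codePoint);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(codePoint);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // U+00DC is the capital U with diaeresis; space and '!' are removed.
        System.out.println(keepLettersDigitsUnderscore("TEST\u00DC_TEST 42!"));
    }
}
```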

Solution 5 - Java

You might want to remove the accents and diacritical marks first, then check at each character position whether the "simplified" string contains an ASCII letter. If it does, the original character at that position is a word character; if not, it can be removed.
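One possible reading of this idea, as a rough sketch: simplify each character on its own, so that positions cannot get out of step, and keep the original character only if its simplified form matches \w. Everything here is an assumption made for illustration, including the names, the use of java.text.Normalizer for the simplification, and the choice to accept ASCII digits and the underscore along with letters.

```java
import java.text.Normalizer;

final class SimplifiedWordFilter {

    static String keepWordCharacters(String input) {
        // Compose first so that an accented letter is a single code point
        // wherever a precomposed form exists.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < composed.length(); ) {
            int codePoint = composed.codePointAt(i);
            String original = new String(Character.toChars(codePoint));
            // "Simplify" this one character: decompose it and drop combining marks.
            String simplified = Normalizer.normalize(original, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}+", "");
            // Without flags, \w in Java means [a-zA-Z_0-9].
            if (simplified.matches("\\w+")) {
                result.append(original);
            }
            i += Character.charCount(codePoint);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // U+00F4 is o with circumflex; the space and the apostrophe are removed.
        System.out.println(keepWordCharacters("c\u00F4te d'Azur"));
    }
}
```

A limitation of this approach is that letters without an ASCII base letter, such as U+00DF (sharp s) or anything from a non-Latin script, are removed as well.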

Content Type	Original Author	Original Content on Stackoverflow
Question	Epaga	View Question on Stackoverflow
Solution 1 - Java	Tim Pietzcker	View Answer on Stackoverflow
Solution 2 - Java	Mena	View Answer on Stackoverflow
Solution 3 - Java	Stefan Haberl	View Answer on Stackoverflow
Solution 4 - Java	Epaga	View Answer on Stackoverflow
Solution 5 - Java	István	View Answer on Stackoverflow