Remove all non-"word characters" from a String in Java, leaving accented characters?

Java, Regex, String

Java Problem Overview


Apparently Java's regex flavor treats umlauts and other accented characters as non-"word characters":

        "TESTÜTEST".replaceAll( "\\W", "" )

returns "TESTTEST" for me. What I want is for only the truly non-"word characters" to be removed. Is there a way to do this without resorting to something along the lines of

         "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"

only to realize I forgot ô?

Java Solutions


Solution 1 - Java

Use [^\p{L}\p{Nd}]+ - this matches all (Unicode) characters that are neither letters nor (decimal) digits.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");

Edit:

I changed \p{N} to \p{Nd} because the former also matches some number symbols like ¼; the latter doesn't. See it on regex101.com.
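
As a quick check of the pattern above, here is a minimal, self-contained sketch. The class and method names (UnicodeWordFilter, stripNonLettersOrDigits, stripNonWordUnicode) are made up for illustration. The second method is not part of the original answer: it assumes Java 7 or later and uses the embedded (?U) flag (Pattern.UNICODE_CHARACTER_CLASS), which makes \W itself Unicode-aware. Unlike the pattern above, it keeps the underscore.

```java
final class UnicodeWordFilter {

    // Removes everything that is neither a Unicode letter nor a decimal digit.
    static String stripNonLettersOrDigits(String input) {
        return input.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    // Assumption: Java 7+. (?U) enables UNICODE_CHARACTER_CLASS, so \W follows
    // the Unicode definition of a word character (the underscore is kept).
    static String stripNonWordUnicode(String input) {
        return input.replaceAll("(?U)\\W+", "");
    }

    public static void main(String[] args) {
        String sample = "TEST_\u00DC TEST!"; // U+00DC is the letter U with diaeresis
        System.out.println(stripNonLettersOrDigits(sample));
        System.out.println(stripNonWordUnicode(sample));
    }
}
```

For the sample input TEST_Ü TEST!, the first method yields TESTÜTEST and the second yields TEST_ÜTEST.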

Solution 2 - Java

I was trying to achieve the exact opposite when I stumbled upon this thread. I know it's quite old, but here's my solution nonetheless. You can use Unicode blocks in the pattern. In this case, compile the following code (with the right imports):

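import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "äêìóblah";
Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a Unicode block
Matcher m = p.matcher(s);
System.out.println(m.find()); // is there at least one character from the block?
System.out.println(s.replaceAll(p.pattern(), "#")); // replace each run of block characters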

You should see the following output:

true
#blah
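
One caveat worth checking: the Latin-1 Supplement block (U+0080 to U+00FF) also holds symbols such as © and ÷, so a block-only pattern matches more than accented letters. The sketch below is an assumption about how one might narrow it down, not part of the original answer: it intersects the block with \p{L} using Java's && character-class syntax. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class Latin1LetterMatcher {

    // The whole block: accented letters, but also symbols like U+00A9 and U+00F7.
    static final Pattern BLOCK = Pattern.compile("[\\p{InLatin-1Supplement}]+");

    // Intersection: only those characters of the block that are letters.
    static final Pattern BLOCK_LETTERS =
            Pattern.compile("[\\p{InLatin-1Supplement}&&[\\p{L}]]+");

    // Replaces each run of block characters with '#'.
    static String maskBlock(String s) {
        return BLOCK.matcher(s).replaceAll("#");
    }

    // Replaces each run of letters from the block with '#', leaving symbols alone.
    static String maskBlockLetters(String s) {
        return BLOCK_LETTERS.matcher(s).replaceAll("#");
    }
}
```

With the input äêìóblah©, maskBlock also replaces the trailing copyright sign, while maskBlockLetters leaves it in place.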

Best,

Solution 3 - Java

At times you do not want to simply remove the characters, but just remove the accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL:

import java.text.Normalizer;
import java.text.Normalizer.Form;

import org.apache.commons.lang.StringUtils;

/**
 * Utility class for String manipulation.
 * 
 * @author Stefan Haberl
 */
public abstract class TextUtils {
	private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
	private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
			"sz" };

	/**
	 * Normalizes a String by stripping all accents, reducing it to the 128
	 * US-ASCII characters. This method handles German umlauts and "sharp-s"
	 * correctly.
	 * 
	 * @param s
	 *            The String to normalize
	 * @return The normalized String
	 */
	public static String normalize(String s) {
		if (s == null)
			return null;

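		// Transliterate German umlauts and sharp-s first, since NFD alone would
		// reduce "ä" to "a" and drop "ß" entirely.
		String n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
		// Decompose the remaining accented characters, then strip every non-ASCII
		// character (including the detached combining marks).
		n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");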

		return n;
	}

	/**
	 * Returns a clean representation of a String which might be used safely
	 * within an URL. Slugs are a more human friendly form of URL encoding a
	 * String.
	 * <p>
	 * The method first normalizes a String, then converts it to lowercase and
	 * removes ASCII characters, which might be problematic in URLs:
	 * <ul>
	 * <li>all whitespaces
	 * <li>dots ('.')
	 * <li>(semi-)colons (';' and ':')
	 * <li>equals ('=')
	 * <li>ampersands ('&')
	 * <li>slashes ('/')
	 * <li>angle brackets ('<' and '>')
	 * </ul>
	 * 
	 * @param s
	 *            The String to slugify
	 * @return The slugified String
	 * @see #normalize(String)
	 */
	public static String slugify(String s) {

		if (s == null)
			return null;

		String n = normalize(s);
		n = StringUtils.lowerCase(n);
		n = n.replaceAll("[\\s.:;&=<>/]", "");

		return n;
	}
}

Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.

HTH

EDIT: Note that it may be unsafe to include the returned String in a URL as-is. You should at least HTML-encode it to prevent XSS attacks.
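
If pulling in Commons Lang is not an option, the accent-stripping step can be sketched with the JDK alone. This is a simplified variant, not the author's class: it removes only combining marks (\p{M}) after NFD decomposition instead of every non-ASCII character, so characters without a decomposition, such as ß, pass through unchanged. The class and method names are hypothetical.

```java
import java.text.Normalizer;

final class AccentStripper {

    // Decomposes accented characters (NFD) and removes the combining marks,
    // leaving the base letters in place.
    static String stripAccents(String s) {
        if (s == null) {
            return null;
        }
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Input is "Crème Brûlée", written with escapes to stay encoding-safe.
        System.out.println(stripAccents("Cr\u00E8me Br\u00FBl\u00E9e")); // prints "Creme Brulee"
    }
}
```

Whether ß should become "ss", "sz", or stay as it is remains a per-language decision, which is why the class above uses an explicit replacement list.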

Solution 4 - Java

Well, here is one solution I ended up with, but I hope there's a more elegant one...

// name is the input String
StringBuilder result = new StringBuilder();
for (int i = 0; i < name.length(); i++) {
    char tmpChar = name.charAt(i);
    // keep Unicode letters and digits, plus the underscore
    if (Character.isLetterOrDigit(tmpChar) || tmpChar == '_') {
        result.append(tmpChar);
    }
}

result then holds the desired string.
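
The loop above inspects one char at a time, so letters outside the Basic Multilingual Plane (stored as surrogate pairs) are dropped. A possible code-point-based variant, assuming Java 8 or later; the class and method names are made up for illustration:

```java
final class LetterDigitFilter {

    // Keeps Unicode letters, digits and the underscore, iterating over code
    // points so that supplementary characters are tested as a whole.
    static String keepLettersDigitsUnderscore(String name) {
        StringBuilder result = new StringBuilder(name.length());
        name.codePoints()
            .filter(cp -> Character.isLetterOrDigit(cp) || cp == '_')
            .forEach(result::appendCodePoint);
        return result.toString();
    }

    public static void main(String[] args) {
        // Input is "TESTÜ_TEST 1!" with the umlaut written as an escape.
        System.out.println(keepLettersDigitsUnderscore("TEST\u00DC_TEST 1!"));
    }
}
```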

Solution 5 - Java

You might want to remove the accents and diacritics first, then check at each character position whether the "simplified" string has an ASCII letter there: if it does, the original character at that position is a word character and is kept; if not, it can be removed.
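
One way this idea could be sketched, under a few assumptions that are not in the answer itself: the input is first composed (NFC) so that each accented letter is a single code point, each code point is then simplified on its own (NFD, combining marks removed) so that positions cannot drift, and ASCII digits are accepted alongside ASCII letters. The class and method names are hypothetical, and Java 8 or later is assumed.

```java
import java.text.Normalizer;

final class AccentAwareFilter {

    // Keeps each original character whose accent-free form is an ASCII letter
    // or digit; everything else is dropped.
    static String keepWordChars(String input) {
        // Compose first so that "e" + combining acute becomes one code point.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder();
        composed.codePoints().forEach(cp -> {
            String original = new String(Character.toChars(cp));
            // Simplify this single character: decompose, then strip the marks.
            String simplified = Normalizer.normalize(original, Normalizer.Form.NFD)
                    .replaceAll("\\p{M}+", "");
            if (simplified.length() == 1 && isAsciiLetterOrDigit(simplified.charAt(0))) {
                out.append(original); // append the original, accent included
            }
        });
        return out.toString();
    }

    // Assumption: digits count as word characters here, in addition to letters.
    static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }
}
```

A limitation of this sketch: letters that have no decomposition into an ASCII base letter, such as ß, are removed, so it is narrower than the \p{L}-based pattern from Solution 1.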

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Epaga | View Question on Stackoverflow
Solution 1 - Java | Tim Pietzcker | View Answer on Stackoverflow
Solution 2 - Java | Mena | View Answer on Stackoverflow
Solution 3 - Java | Stefan Haberl | View Answer on Stackoverflow
Solution 4 - Java | Epaga | View Answer on Stackoverflow
Solution 5 - Java | István | View Answer on Stackoverflow