Regex to remove all special characters from string?

C#, Regex, String

C# Problem Overview


I have no experience with regular expressions, so I need some help with a problem that I think would best be solved by using them.

I have a list of strings in C#:

List<string> lstNames = new List<string>();
lstNames.Add("TRA-94:23");
lstNames.Add("TRA-42:101");
lstNames.Add("TRA-109:AD");

foreach (string n in lstNames) {
  // logic goes here that somehow uses regex to remove all special characters
  string regExp = "NO_IDEA";
  string tmp = Regex.Replace(n, regExp, "");
}

I need to loop over the list and return each item without any special characters. For example, item one would be "TRA9423", item two would be "TRA42101", and item three would be "TRA109AD".

Is there a regular expression that can accomplish this for me?

Also, the list contains more than 4000 items, so I need the search and replace to be efficient and quick if possible.

EDIT: I should have specified that any character besides a-z, A-Z and 0-9 counts as special in my case.

C# Solutions


Solution 1 - C#

It really depends on your definition of special characters. I find that a whitelist rather than a blacklist is the best approach in most situations:

tmp = Regex.Replace(n, "[^0-9a-zA-Z]+", "");

You should be careful with your current approach because the following two items will be converted to the same string and will therefore be indistinguishable:

"TRA-12:123"
"TRA-121:23"
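To make the collision concrete, here is a minimal sketch. The Clean helper is a hypothetical name introduced for illustration (it is not part of the original answer); it only wraps the whitelist pattern shown above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the whitelist pattern from this answer.
static string Clean(string s) => Regex.Replace(s, "[^0-9a-zA-Z]+", "");

string a = Clean("TRA-12:123");
string b = Clean("TRA-121:23");

// Both inputs collapse to the same text, so the originals can no longer be told apart.
Console.WriteLine(a);      // TRA12123
Console.WriteLine(b);      // TRA12123
Console.WriteLine(a == b); // True
```

If the cleaned names have to stay unique, one possible workaround is to replace each run of special characters with a separator such as "_" instead of the empty string. Whether that is acceptable depends on how the cleaned names are used.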

Solution 2 - C#

This should do it:

[^a-zA-Z0-9]

Basically it matches all non-alphanumeric characters.
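Plugged into the loop from the question, the pattern could be used roughly as follows. This is a sketch, not a benchmarked recommendation: the variable names nonAlphanumeric and cleaned are made up for the example, and RegexOptions.Compiled is an optional choice that may or may not pay off for a list of a few thousand short strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One Regex instance, built once and reused for every item.
// RegexOptions.Compiled is optional: it trades a higher construction cost for faster matching.
var nonAlphanumeric = new Regex("[^a-zA-Z0-9]", RegexOptions.Compiled);

var lstNames = new List<string> { "TRA-94:23", "TRA-42:101", "TRA-109:AD" };
var cleaned = new List<string>(lstNames.Count);

foreach (string n in lstNames)
{
    // Replace every non-alphanumeric character with nothing.
    cleaned.Add(nonAlphanumeric.Replace(n, ""));
}

Console.WriteLine(string.Join(", ", cleaned)); // TRA9423, TRA42101, TRA109AD
```

The static Regex.Replace method also caches recently used patterns, so for around 4000 items either form should be fast enough. Measure before optimizing further.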

Solution 3 - C#

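[^a-zA-Z0-9] is a character class that matches any non-alphanumeric character.

Alternatively, [^\w\d] is shorter but not identical: \w already includes digits, and in .NET it also matches the underscore and non-ASCII letters, so those characters are kept rather than removed.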

Usage:

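// Verbatim string (@): in a regular literal, "\w" is an unrecognized escape sequence and does not compile.
string regExp = @"[^\w\d]";
string tmp = Regex.Replace(n, regExp, "");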
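It is worth comparing the two patterns on an input that contains an underscore and a non-ASCII letter, because that is where they behave differently. The sample input below is invented for the comparison and does not come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input: "TRA_94:" followed by U+00E9 (e with acute accent) and "7".
string input = "TRA_94:\u00E97";

// ASCII whitelist: removes '_', ':' and the accented letter.
string ascii = Regex.Replace(input, "[^a-zA-Z0-9]", "");

// \w-based class: removes only ':', because '_' and the accented letter are word characters.
string word = Regex.Replace(input, @"[^\w\d]", "");

Console.WriteLine(ascii);                      // TRA947
Console.WriteLine(word == "TRA_94\u00E97");    // True
```

For the question's definition (only a-z, A-Z and 0-9 are kept), the explicit ASCII class is the one that matches the requirement.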

Solution 4 - C#

You can use:

string regExp = "\\W";

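This is close to Daniel's "[^a-zA-Z0-9]", but not identical: \W treats underscores and non-ASCII letters and digits as word characters, so it leaves them in place.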

\W matches any nonword character. Equivalent to the Unicode categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
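If the shorthand is preferred but only ASCII should count as a word character, .NET offers RegexOptions.ECMAScript, under which \w is limited to [a-zA-Z_0-9]. The sketch below assumes that option is acceptable for the pattern in use (it restricts which other options and constructs are allowed), and the sample input is invented.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input: contains '-', ':', '_' and U+00FC (u with diaeresis).
string input = "TRA-42:1_01\u00FC";

// Default behavior: \W is Unicode-aware, so '_' and the non-ASCII letter survive.
string unicodeAware = Regex.Replace(input, @"\W", "");

// ECMAScript behavior: \W is the complement of [a-zA-Z_0-9], so the non-ASCII letter is removed.
string ecma = Regex.Replace(input, @"\W", "", RegexOptions.ECMAScript);

// Adding '_' to the class removes the underscore as well.
string asciiOnly = Regex.Replace(input, @"[\W_]", "", RegexOptions.ECMAScript);

Console.WriteLine(unicodeAware == "TRA421_01\u00FC"); // True
Console.WriteLine(ecma);                              // TRA421_01
Console.WriteLine(asciiOnly);                         // TRA42101
```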

Solution 5 - C#

For my purposes I wanted to keep every ASCII character (punctuation included) and strip everything else, so this worked. Note that the ASCII range ends at \x7F:

html = Regex.Replace(html, @"[^\x00-\x7F]+", "");

Solution 6 - C#

Depending on your definition of "special character", I think "[^a-zA-Z0-9]" would probably do the trick. That would find anything that is not a lowercase letter, an uppercase letter, or a digit.

Solution 7 - C#

tmp = Regex.Replace(n, @"\W+", "");

\w matches letters, digits, and underscores; \W is the negated version. Underscores (and non-ASCII letters and digits) are therefore kept by this pattern.

Solution 8 - C#

If you don't want to use Regex then another option is to use

char.IsLetterOrDigit

You can loop through each char of the string and keep only the characters for which it returns true.
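A minimal sketch of that loop is shown here. The helper name KeepLettersAndDigits is hypothetical, and a StringBuilder is just one reasonable way to collect the kept characters.

```csharp
using System;
using System.Text;

// Hypothetical helper: keeps only the characters for which char.IsLetterOrDigit returns true.
static string KeepLettersAndDigits(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (char.IsLetterOrDigit(c))
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}

Console.WriteLine(KeepLettersAndDigits("TRA-94:23"));  // TRA9423
Console.WriteLine(KeepLettersAndDigits("TRA-109:AD")); // TRA109AD
```

Keep in mind that char.IsLetterOrDigit is Unicode-aware, so non-ASCII letters and digits are kept too. If only a-z, A-Z and 0-9 are allowed, the check would need explicit range comparisons instead.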

Solution 9 - C#

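// Extension method: declare it inside a static class and add "using System.Linq;".
// char.IsLetter is already false for symbols and white space, so one check suffices.
// Note that digits are dropped too; use char.IsLetterOrDigit to keep them.
public static string Letters(this string input)
{
    return string.Concat(input.Where(x => char.IsLetter(x)));
}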
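For the question's stricter definition (keep only a-z, A-Z and 0-9), the same LINQ idea can be adapted as sketched below. Both helper names are hypothetical and the range checks are an assumption about what "special" means, taken from the question's edit.

```csharp
using System;
using System.Linq;

// Hypothetical predicate: true only for ASCII letters and digits.
static bool IsAsciiAlphanumeric(char c) =>
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');

// Hypothetical helper: filters the string with LINQ and joins the kept characters.
static string KeepAsciiAlphanumerics(string input) =>
    string.Concat(input.Where(c => IsAsciiAlphanumeric(c)));

Console.WriteLine(KeepAsciiAlphanumerics("TRA-94:23"));  // TRA9423
Console.WriteLine(KeepAsciiAlphanumerics("TRA_109:AD")); // TRA109AD
```

Unlike the letter-only version, this keeps digits and drops underscores and non-ASCII letters.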

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author
Question | Jagd
Solution 1 - C# | Mark Byers
Solution 2 - C# | Daniel Egeberg
Solution 3 - C# | MikeD
Solution 4 - C# | Dan Diplo
Solution 5 - C# | BobC
Solution 6 - C# | Jay
Solution 7 - C# | Paul Creasey
Solution 8 - C# | Demarily
Solution 9 - C# | mattylantz