Remove HTML tags from string including &nbsp in C#

C#HtmlRegexString

C# Problem Overview


How can I remove all the HTML tags including   using regex in C#. My string looks like

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

C# Solutions


Solution 1 - C#

If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

You should ideally make another pass through a regex filter that takes care of multiple spaces as

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

Solution 2 - C#

I took @Ravi Thapliyal's code and made a method: It is simple and might not clean everything, but so far it is doing what I need it to do.

public static string ScrubHtml(string value) {
	var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
	var step2 = Regex.Replace(step1, @"\s{2,}", " ");
	return step2;
}

Solution 3 - C#

I've been using this function for a while. Removes pretty much any messy html you can throw at it and leaves the text intact.

        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

        //add characters that are should not be removed to this regex
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }   
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }

Solution 4 - C#

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();

Solution 5 - C#

I have used the @RaviThapliyal & @Don Rolling's code but made a little modification. Since we are replacing the   with empty string but instead   should be replaced with space, so added an additional step. It worked for me like a charm.

public static string FormatString(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");
    return step3;
}

Used &nbps without semicolon because it was getting formatted by the Stack Overflow.

Solution 6 - C#

this:

(<.+?> | &nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

then x = hello

Solution 7 - C#

Sanitizing an Html document involves a lot of tricky things. This package maybe of help: https://github.com/mganss/HtmlSanitizer

Solution 8 - C#

HTML is in its basic form just XML. You could Parse your text in an XmlDocument object, and on the root element call InnerText to extract the text. This will strip all HTML tages in any form and also deal with special characters like &lt; &nbsp; all in one go.

Solution 9 - C#

(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionrampuriyaaaView Question on Stackoverflow
Solution 1 - C#Ravi K ThapliyalView Answer on Stackoverflow
Solution 2 - C#Don RollingView Answer on Stackoverflow
Solution 3 - C#David S.View Answer on Stackoverflow
Solution 4 - C#MRPView Answer on Stackoverflow
Solution 5 - C#Sabique A KhanView Answer on Stackoverflow
Solution 6 - C#JonesopolisView Answer on Stackoverflow
Solution 7 - C#Ehsan88View Answer on Stackoverflow
Solution 8 - C#nivs1978View Answer on Stackoverflow
Solution 9 - C#Ananth RamView Answer on Stackoverflow