Regular expression to remove HTML tags from a string

HtmlRegex

Html Problem Overview


> Possible Duplicate:
> Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

<td class="played">0</td>

I am looking for an expression which will return 0, stripping the <td> tags.

Html Solutions


Solution 1 - Html

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.


The following examples are Java, but the regex will be similar -- if not identical -- for other languages.


String target = someString.replaceAll("<[^>]*>", "");

Assuming your non-html does not contain any < or > and that your input string is correctly structured.

If you know they're a specific tag -- for example you know the text contains only <td> tags, you could do something like this:

String target = someString.replaceAll("(?i)<td[^>]*>", "");

Edit: Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

String target = someString.replaceAll("(?i)<td[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces the HTML with a single space, then collapses whitespace, and then trims any on the ends.

Solution 2 - Html

A trivial approach would be to replace

<[^>]*>

with nothing. But depending on how ill-structured your input is that may well fail.

Solution 3 - Html

You could do it with jsoup http://jsoup.org/

Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(yourText, whitelist);

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestiondannyView Question on Stackoverflow
Solution 1 - HtmlRoddy of the Frozen PeasView Answer on Stackoverflow
Solution 2 - HtmlJoeyView Answer on Stackoverflow
Solution 3 - HtmlmihaisimiView Answer on Stackoverflow