Regex select all text between tags


Html Problem Overview


What is the best way to select all the text between 2 tags - ex: the text between all the '<pre>' tags on the page.

Html Solutions


Solution 1 - Html

You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group (for more specific instructions, specify a language). This assumes simple, valid HTML: no attributes on the tag, no nesting and, in most regex flavors, no line breaks inside the element, because . does not match a newline by default.

As other commenters have suggested, if you're doing something complex, use an HTML parser.
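
As a minimal sketch of the capture-group approach in Python (the sample string and variable names are made up for illustration), re.findall returns the first group of every match:

```python
import re

# Hypothetical sample input with two <pre> elements.
html = "<p>intro</p><pre>first block</pre><p>more</p><pre>second block</pre>"

# (.*?) is lazy, so each match stops at the nearest closing tag.
pre_texts = re.findall(r"<pre>(.*?)</pre>", html)
print(pre_texts)  # ['first block', 'second block']
```

With a greedy (.*) instead, the single match would run from the first <pre> to the last </pre>.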

Solution 2 - Html

The closing tag may be on a later line than the opening tag, and . does not match a newline by default. This is why \n needs to be added as an alternative:

<PRE>(.|\n)*?<\/PRE>
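
A small Python sketch of the same idea, assuming a made-up two-line sample string. It compares the plain pattern, the (.|\n) alternation (written here as a non-capturing group so the content lands in group 1) and the re.DOTALL flag:

```python
import re

# Hypothetical sample: the content of the element spans two lines.
html = "<PRE>line one\nline two</PRE>"

# Without DOTALL, "." stops at the newline, so nothing matches.
no_flag = re.search(r"<PRE>(.*?)</PRE>", html)

# (?:.|\n) matches any character or a newline, as in the pattern above.
alternation = re.search(r"<PRE>((?:.|\n)*?)</PRE>", html)

# re.DOTALL gives the same result without the alternation.
dotall = re.search(r"<PRE>(.*?)</PRE>", html, re.DOTALL)

print(no_flag)               # None
print(alternation.group(1))
print(dotall.group(1))
```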

Solution 3 - Html

This is what I would use.

(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))

Basically what it does is:

(?<=(<pre>)) The selection has to be preceded by the <pre> tag.

(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| ) This is just the regular expression I want to apply. In this case, it selects a letter, a digit, a newline character, one of the special characters listed in the square brackets, or a space. The pipe character | simply means "OR".

+? The plus character says to select one or more of the above, in any order. The question mark changes the default behavior from 'greedy' to 'ungreedy'.

(?=(</pre>)) The selection has to be followed by the </pre> tag.


Depending on your use case you might need to add some modifiers like (i or m)

  • i - case-insensitive
  • m - multi-line search (^ and $ match at the start and end of each line)

Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
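
For comparison, a hedged Python sketch of how modifiers change the result (the sample string is invented). In Python the i modifier is re.IGNORECASE; note that letting . cross newlines is re.DOTALL (the s modifier in many flavors), while re.MULTILINE only affects ^ and $:

```python
import re

# Hypothetical sample: one upper-case element spanning two lines, one lower-case element.
html = "<PRE>Upper\ncase tags</PRE> and <pre>lower case tags</pre>"

# No flags: only the lower-case, single-line element matches.
plain = re.findall(r"<pre>(.+?)</pre>", html)

# IGNORECASE matches <PRE> too; DOTALL lets "." match the newline inside it.
flagged = re.findall(r"<pre>(.+?)</pre>", html, re.IGNORECASE | re.DOTALL)

print(plain)    # ['lower case tags']
print(flagged)  # ['Upper\ncase tags', 'lower case tags']
```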

JavaScript and lookbehind

The above example should work fine with languages such as PHP, Perl, Java ...
JavaScript engines that predate ES2018, however, do not support lookbehind, so there we have to forget about using (?<=(<pre>)) and look for some kind of workaround. One option is to match the opening tag as well and strip it from the start of each result, as in https://stackoverflow.com/questions/11592033/regex-match-text-between-tags

Also look at the JavaScript regex documentation for non-capturing parentheses.

Solution 4 - Html

To exclude the delimiting tags:

(?<=<pre>)(.*?)(?=</pre>)

(?<=<pre>) looks for text after <pre>

(?=</pre>) looks for text before </pre>

The result is the text inside the pre tag.
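
A short Python sketch of this lookaround pattern, using an invented sample string. Because the tags are only asserted and never consumed, the whole match (group 0) is already the inner text:

```python
import re

# Hypothetical sample input.
html = "<pre>alpha</pre> text <pre>beta</pre>"

lookaround = re.compile(r"(?<=<pre>)(.*?)(?=</pre>)")

# group(0) is the entire match; it excludes both delimiting tags.
whole_matches = [m.group(0) for m in lookaround.finditer(html)]
print(whole_matches)  # ['alpha', 'beta']
```

Python's re module only accepts fixed-width lookbehind, which a literal tag such as <pre> satisfies; a tag with arbitrary attributes would not.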

Solution 5 - Html

Use the pattern below to get the content of an element. Replace [tag] with the actual element you wish to extract the content from.

<[tag]>(.+?)</[tag]>

Sometimes tags have attributes, like an anchor tag with an href. In that case, use the pattern below.

 <[tag][^>]*>(.+?)</[tag]>
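
One way to parameterize this in Python is sketched below. The helper name text_between is hypothetical, and the \b after the tag name is an addition that is not in the pattern above; it is there so that "a" does not also match "<abbr>":

```python
import re

def text_between(tag, html):
    """Return the text inside each <tag ...>...</tag> pair (hypothetical helper)."""
    name = re.escape(tag)
    # [^>]* skips any attributes; \b (an assumption, not in the original pattern)
    # keeps a short tag name from matching the start of a longer one.
    pattern = rf"<{name}\b[^>]*>(.+?)</{name}>"
    return re.findall(pattern, html)

# Hypothetical sample input mixing <a> with and without attributes, plus <abbr>.
links = text_between("a", '<a href="/docs">Docs</a> <abbr>HTML</abbr> <a>Home</a>')
print(links)  # ['Docs', 'Home']
```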

Solution 6 - Html

This answer supposes support for look around! This allowed me to identify all the text between pairs of opening and closing tags. That is all the text between the '>' and the '<'. It works because look around doesn't consume the characters it matches.

(?<=>)([\w\s]+)(?=<\/)

I tested it in https://regex101.com/ using this HTML fragment.

<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>

It's a game of three parts: the look behind, the content, and the look ahead.

(?<=>)    # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/)   # look ahead  (but don't consume/capture) for a '</'
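
The same test can be reproduced in Python as a rough sketch. One thing it surfaces: the newline before </table> also sits between a '>' and a '</', so it is returned as a whitespace-only match and is filtered out here:

```python
import re

html = """<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>"""

pattern = re.compile(r"(?<=>)([\w\s]+)(?=</)")
raw = pattern.findall(html)

# The newline before </table> also satisfies the pattern, so drop whitespace-only hits.
cells = [text for text in raw if text.strip()]
print(cells)  # ['Cell 1', 'Cell 2', 'Cell 3', 'Cell 4', 'Cell 5', 'Cell 6']
```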


I hope that serves as a starter for 10. Good luck.

Solution 7 - Html

This seems to be the simplest regular expression of all that I found

(?:<TAG>)([\s\S]*)(?:<\/TAG>)
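  1. Match the opening tag (?:<TAG>) without capturing it
  2. Capture any whitespace or non-whitespace characters ([\s\S]*), including newlines
  3. Match the closing tag (?:<\/TAG>) without capturing it

The tags are still part of the overall match; the content is in group 1. Because [\s\S]* is greedy, it runs to the last closing tag in the input.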
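
A quick Python sketch of what greediness means for this pattern when the input holds more than one element (the sample string is invented, and the lazy variant with *? is shown only for comparison):

```python
import re

# Hypothetical sample with two <TAG> elements.
html = "<TAG>one</TAG> gap <TAG>two</TAG>"

# Greedy: [\s\S]* runs to the LAST closing tag.
greedy = re.findall(r"(?:<TAG>)([\s\S]*)(?:</TAG>)", html)

# Lazy: [\s\S]*? stops at the nearest closing tag.
lazy = re.findall(r"(?:<TAG>)([\s\S]*?)(?:</TAG>)", html)

print(greedy)  # ['one</TAG> gap <TAG>two']
print(lazy)    # ['one', 'two']
```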

Solution 8 - Html

You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.

In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.

Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as anything between the opening and closing tag is not that tag itself, this will work:

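// Single quotes matter: in a double-quoted PHP string, "\1" is an octal escape, not a backreference.
preg_match('/<([\w]+)[^>]*>(.*?)<\/\1>/', $subject, $matches);
// $matches = array ( [0] => full matched string [1] => tag name [2] => tag content )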
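
The backreference idea carries over to Python almost unchanged. The sketch below uses invented sample strings and also shows the nesting limitation mentioned above:

```python
import re

pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>")

# \1 must repeat whatever tag name group 1 captured.
flat = pattern.findall('<h1 class="title">Heading</h1><p>Paragraph</p>')
print(flat)  # [('h1', 'Heading'), ('p', 'Paragraph')]

# With the same tag nested inside itself, the match ends at the first closing tag.
nested = pattern.findall("<div><div>a</div>b</div>")
print(nested)  # [('div', '<div>a')]
```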

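A better idea is to use a parser, like the native DOMDocument, to load your HTML, then select your tag and get its content, which might look something like this:

$obj = new DOMDocument();
$obj->loadHTML($html);                       // loadHTML() parses a string; load() expects a file path
$nodes = $obj->getElementsByTagName('el');   // returns a DOMNodeList
$value = $nodes->item(0)->nodeValue;         // text of the first match (item(0) is null if there is none)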

And since this is a proper parser it will be able to handle nesting tags etc.
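
For readers working in Python, a comparable parser-based sketch can be built on the standard library's html.parser. The class name PreTextCollector and the sample input are hypothetical; the class collects the text of each outermost <pre> element, including text inside nested tags:

```python
from html.parser import HTMLParser

class PreTextCollector(HTMLParser):
    """Collect the text inside every outermost <pre> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <pre> elements are currently open
        self.blocks = []    # finished text of each outermost <pre>
        self._parts = []    # text pieces of the <pre> being read

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        # Text of nested tags such as <b> is kept; the tags themselves are dropped.
        if self.depth > 0:
            self._parts.append(data)

parser = PreTextCollector()
parser.feed('<p>skip</p><pre class="code">a <b>bold</b>\nline</pre><pre>second</pre>')
parser.close()
blocks = parser.blocks
print(blocks)  # ['a bold\nline', 'second']
```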

Solution 9 - Html

Try this....

(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)

Solution 10 - Html

The accepted answer has no JavaScript code, so here is one way to do it:

var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });

Solution 11 - Html

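preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches);

This regex selects everything between the <pre> tags, even when the content spans several lines, because [^>] also matches newlines. It does not match if the content itself contains a > character.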

Solution 12 - Html

In Python, setting the DOTALL flag will capture everything, including newlines.

> If the DOTALL flag has been specified, this matches any character including a newline. docs.python.org

# example.py using Python 3.7.4
import re

# "text" rather than "str", so the built-in str type is not shadowed
text = """Everything is awesome! <pre>Hello,
World!
    </pre>
"""

# Normally (.*) will not capture newlines, but here re.DOTALL is set
pattern = re.compile(r"<pre>(.*)</pre>", re.DOTALL)
match = pattern.search(text)

print(match.group(1))

python example.py

Hello,
World!

Capturing text between all opening and closing tags in a document

To capture text between all opening and closing tags in a document, finditer is useful. In the example below, three opening and closing <pre> tags are present in the string.

# example2.py using Python 3.7.4
import re

# text contains three <pre>...</pre> elements
text = """In two different ex-
periments, the authors had subjects chat and solve the <pre>Desert Survival Problem</pre> with a
humorous or non-humorous computer. In both experiments the computer made pre-
programmed comments, but in study 1 subjects were led to believe they were interact-
ing with another person. In the <pre>humor conditions</pre> subjects received a number of funny
comments, for instance: “The mirror is probably too small to be used as a signaling
device to alert rescue teams to your location. Rank it lower. (On the other hand, it
offers <pre>endless opportunity for self-reflection</pre>)”."""

# Normally (.*) will not capture newlines, but here re.DOTALL is set
# The question mark in (.*?) indicates non-greedy matching.
pattern = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)

matches = pattern.finditer(text)

for i, match in enumerate(matches):
    print(f"tag {i}: ", match.group(1))

python example2.py

tag 0:  Desert Survival Problem
tag 1:  humor conditions
tag 2:  endless opportunity for self-reflection

Solution 13 - Html

To select all the text inside a pre tag, I prefer:

preg_match('#<pre>([\w\W\s]*)</pre>#',$str,$matches);

> $matches[0] holds the full match, including the <pre> tags.

> $matches[1] holds all the content inside <pre>.

DOMDocument does not help when you need the content together with its inner tags and attributes: nodeValue and textContent return only the text, with all tags and attributes stripped.

Solution 14 - Html

(?<=>)[^<]+

for Notepad++

>([^<]+)

for AutoIt (option Return array of global matches).

or

 (?=>([^<]+))

https://regex101.com/r/VtmEmY/

Solution 15 - Html

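You can use Pattern pattern = Pattern.compile( "[^<'tagname'/>]" ); Note that this is a negated character class: it matches any single character other than the ones listed, so on its own it does not select the text between a pair of tags.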

Solution 16 - Html

I use this solution:

preg_match_all( '/<((?!<)(.|\n))*?\>/si',  $content, $new);
var_dump($new);

Solution 17 - Html

const content = '<p class="title responsive">ABC</p>';
const blog = {content};
const re = /<([^> ]+)([^>]*)>([^<]+)(<\/\1>)/;
const matches = content.match(re);
console.log(matches[3]);

matches[3] is the content text. The pattern adapts to any tag name, with or without attributes such as classes. Nested structures are not supported.

Solution 18 - Html

For multiple lines:

<htmltag>(.+)((\s)+(.+))+</htmltag>

Solution 19 - Html

In Javascript (among others), this is simple. It covers attributes and multiple lines:

/<pre[^>]*>([\s\S]*?)<\/pre>/

Solution 20 - Html

<pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | basheps | View Question on Stackoverflow
Solution 1 - Html | PyKing | View Answer on Stackoverflow
Solution 2 - Html | zac | View Answer on Stackoverflow
Solution 3 - Html | DevWL | View Answer on Stackoverflow
Solution 4 - Html | Jean-Simon Collard | View Answer on Stackoverflow
Solution 5 - Html | Shravan Ramamurthy | View Answer on Stackoverflow
Solution 6 - Html | Clarius | View Answer on Stackoverflow
Solution 7 - Html | maqduni | View Answer on Stackoverflow
Solution 8 - Html | sg3s | View Answer on Stackoverflow
Solution 9 - Html | Heriberto Rivera | View Answer on Stackoverflow
Solution 10 - Html | Shishir Arora | View Answer on Stackoverflow
Solution 11 - Html | Krishna thakor | View Answer on Stackoverflow
Solution 12 - Html | John | View Answer on Stackoverflow
Solution 13 - Html | nirvana74v | View Answer on Stackoverflow
Solution 14 - Html | aptyp | View Answer on Stackoverflow
Solution 15 - Html | Ambrish Rajput | View Answer on Stackoverflow
Solution 16 - Html | T.Todua | View Answer on Stackoverflow
Solution 17 - Html | coosigma | View Answer on Stackoverflow
Solution 18 - Html | Dilip | View Answer on Stackoverflow
Solution 19 - Html | Jonathan | View Answer on Stackoverflow
Solution 20 - Html | user5988518 | View Answer on Stackoverflow