Regex select all text between tags
Html Problem Overview
What is the best way to select all the text between two tags, e.g. the text between all the '<pre>' tags on the page?
Html Solutions
Solution 1 - Html
You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group (for more specific instructions, specify a language), but this assumes the simplistic notion that you have very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use an HTML parser.
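As a minimal sketch of "extract the first group", here is the same pattern in Python. The sample string and variable names are made up for illustration, and the input is assumed to contain only simple, non-nested <pre> tags on a single line.

```python
import re

# Hypothetical sample input with two simple, non-nested <pre> elements.
html = "Intro <pre>first block</pre> middle <pre>second block</pre> end"

# (.*?) is lazy, so each match stops at the nearest closing tag.
pre_pattern = re.compile(r"<pre>(.*?)</pre>")

# With exactly one capturing group, findall returns that group's text only.
blocks = pre_pattern.findall(html)
print(blocks)
```

Because the dot does not match a newline by default, this version only finds elements that open and close on the same line.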
Solution 2 - Html
A tag can be closed on a later line than the one it was opened on. This is why \n needs to be added to the pattern:
<PRE>(.|\n)*?<\/PRE>
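A short Python sketch of why the \n alternative matters. The sample text is hypothetical, and the alternation is wrapped in an extra capturing group here (an adaptation, not part of the pattern above) so that the content can be read back.

```python
import re

# Hypothetical input where the element spans two lines.
text = "<PRE>line one\nline two</PRE>"

# A plain dot stops at the newline, so this search finds nothing.
dot_only = re.search(r"<PRE>(.*?)</PRE>", text)

# Allowing "any character or a newline" lets the match cross the line break.
dot_or_newline = re.search(r"<PRE>((?:.|\n)*?)</PRE>", text)

print(dot_only)
print(repr(dot_or_newline.group(1)))
```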
Solution 3 - Html
This is what I would use.
(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))
Basically, what it does is:
(?<=(<pre>))
The selection has to be preceded by the <pre> tag.
(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )
This is the expression that matches the content itself. In this case, it matches a letter, a digit, a newline character, one of the special characters listed in the square brackets, or a space. The pipe character | simply means "OR".
+?
The plus character says to match one or more of the above, in any order. The question mark changes the default behavior from 'greedy' to 'ungreedy'.
(?=(</pre>))
The selection has to be followed by the </pre> tag.
Depending on your use case, you might need to add some modifiers, such as i or m:
- i - case-insensitive
- m - multi-line search
Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
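How modifiers behave depends on the regex engine. As a sketch for one engine only, Python's re module (an assumption, since this answer was tested in Sublime Text): the i flag is re.IGNORECASE, the m flag is re.MULTILINE and only changes what ^ and $ match, and it is re.DOTALL that lets the dot cross line breaks. The sample string is hypothetical.

```python
import re

# Hypothetical input: upper-case tags and content spanning two lines.
sample = "<PRE>Hello,\nWorld!</PRE>"
pattern = r"<pre>(.*?)</pre>"

# No flags: fails because the tag case differs.
no_flags = re.search(pattern, sample)

# Case-insensitive only: the tags match now, but the dot still stops at "\n".
ignore_case = re.search(pattern, sample, re.IGNORECASE)

# MULTILINE affects ^ and $, not the dot, so this still fails.
case_and_multiline = re.search(pattern, sample, re.IGNORECASE | re.MULTILINE)

# DOTALL lets the dot match newlines, so the content is found.
case_and_dotall = re.search(pattern, sample, re.IGNORECASE | re.DOTALL)

print(no_flags, ignore_case, case_and_multiline)
print(repr(case_and_dotall.group(1)))
```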
The above example should work fine with languages such as PHP, Perl, Java ...
JavaScript engines that predate ES2018, however, do not support lookbehind, so there we have to forget about using
(?<=(<pre>))
and look for some kind of workaround, perhaps simply stripping the leading <pre> from each match, as in
https://stackoverflow.com/questions/11592033/regex-match-text-between-tags
Also see the JavaScript regex documentation on non-capturing parentheses.
Solution 4 - Html
To exclude the delimiting tags:
(?<=<pre>)(.*?)(?=</pre>)
(?<=<pre>)
looks for text after <pre>
(?=</pre>)
looks for text before </pre>
The result is the text inside the pre tag.
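A small Python sketch of what "excluding the delimiting tags" means in practice: with lookaround, the whole match is already the inner text, whereas the plain capturing pattern needs group 1 to get the same result. The sample string is hypothetical.

```python
import re

# Hypothetical input with two <pre> elements on one line.
html = "a <pre>one</pre> b <pre>two</pre> c"

lookaround = re.compile(r"(?<=<pre>)(.*?)(?=</pre>)")
capturing = re.compile(r"<pre>(.*?)</pre>")

# group(0) is the full match: tags excluded with lookaround, included without.
full_with_lookaround = [m.group(0) for m in lookaround.finditer(html)]
full_with_capturing = [m.group(0) for m in capturing.finditer(html)]

# group(1) is the same inner text in both cases.
inner_with_capturing = [m.group(1) for m in capturing.finditer(html)]

print(full_with_lookaround)
print(full_with_capturing)
print(inner_with_capturing)
```

Where lookbehind is unavailable, the capturing form gives the same inner text without any string trimming.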
Solution 5 - Html
Use the pattern below to get the content of an element. Replace [tag]
with the name of the element you wish to extract the content from.
<[tag]>(.+?)</[tag]>
Sometimes tags have attributes, such as an anchor tag with href; in that case use the pattern below.
<[tag][^>]*>(.+?)</[tag]>
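One way to do the [tag] substitution programmatically is sketched below in Python. The helper name tag_contents is hypothetical, and two details are additions rather than part of the pattern above: a \b after the tag name, so that "a" does not also match <abbr>, and re.DOTALL, so that content may span lines.

```python
import re

def tag_contents(html, tag):
    """Return the inner text of each <tag ...>...</tag> pair (non-nested only)."""
    name = re.escape(tag)  # guard against regex metacharacters in the name
    # \b (an addition) keeps "a" from matching "<abbr>"; [^>]* skips attributes.
    pattern = re.compile(rf"<{name}\b[^>]*>(.+?)</{name}>", re.DOTALL)
    return pattern.findall(html)

# Hypothetical sample inputs.
links = tag_contents('<a href="/x">first</a> and <a class="y" href="/z">second</a>', "a")
only_anchor = tag_contents("<abbr>x</abbr><a>y</a>", "a")

print(links)
print(only_anchor)
```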
Solution 6 - Html
This answer supposes support for look around! This allowed me to identify all the text between pairs of opening and closing tags. That is all the text between the '>' and the '<'. It works because look around doesn't consume the characters it matches.
(?<=>)([\w\s]+)(?=<\/)
I tested it in https://regex101.com/ using this HTML fragment.
<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>
It's a game of three parts: the look behind, the content, and the look ahead.
(?<=>) # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/) # look ahead (but don't consume/capture) for a '</'
I hope that serves as a starter for ten. Good luck.
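The same test can be reproduced outside regex101 with a few lines of Python, shown here as a sketch. One thing to watch for is that [\w\s]+ also accepts whitespace-only runs, such as a newline that sits directly before a closing tag, so the sketch filters those out; the filtering step is an addition.

```python
import re

# The HTML fragment from the answer, one row per line.
fragment = (
    "<table>\n"
    "<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>\n"
    "<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>\n"
    "</table>"
)

# Look behind for '>', capture word/whitespace characters, look ahead for '</'.
raw = re.findall(r"(?<=>)([\w\s]+)(?=</)", fragment)

# Drop matches that consist of whitespace only.
cells = [text for text in raw if text.strip()]
print(cells)
```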
Solution 7 - Html
This seems to be the simplest regular expression of all that I found:
(?:<TAG>)([\s\S]*)(?:<\/TAG>)
- (?:<TAG>) matches the opening tag without capturing it
- ([\s\S]*) captures any whitespace or non-whitespace characters
- (?:<\/TAG>) matches the closing tag without capturing it
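A short Python sketch of how this pattern behaves when a document has more than one element. Since [\s\S]* is greedy, it runs to the last closing tag; the lazy variant [\s\S]*? (a modification of the pattern above) stops at the nearest one. The sample text is hypothetical.

```python
import re

# Hypothetical input with two <TAG> elements.
text = "<TAG>one</TAG> gap <TAG>two</TAG>"

# Greedy: the capture runs from the first opening tag to the last closing tag.
greedy = re.findall(r"(?:<TAG>)([\s\S]*)(?:</TAG>)", text)

# Lazy: each capture stops at the nearest closing tag.
lazy = re.findall(r"(?:<TAG>)([\s\S]*?)(?:</TAG>)", text)

# The non-capturing groups still take part in the overall match (group 0).
whole_match = re.search(r"(?:<TAG>)([\s\S]*?)(?:</TAG>)", text).group(0)

print(greedy)
print(lazy)
print(whole_match)
```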
Solution 8 - Html
You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.
In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.
Having said that you can parse subsets of html when there are no similar tags nested. So as long as anything between
preg_match("/<([\w]+)[^>]*>(.*?)<\/\1>/", $subject, $matches);
$matches = array ( [0] => full matched string [1] => tag name [2] => tag content )
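For comparison, the same backreference idea can be sketched in Python, where \1 likewise forces the closing tag name to equal the captured opening tag name. The sample string is hypothetical, and elements are assumed not to contain another element of the same name.

```python
import re

# Hypothetical input: two different elements, one with an attribute.
subject = '<p class="note">hello</p> <b>bold</b>'

# Group 1 is the tag name, group 2 the content; \1 must repeat group 1.
pair_pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>")

pairs = [(m.group(1), m.group(2)) for m in pair_pattern.finditer(subject)]
print(pairs)
```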
A better idea is to use a parser, like the native DOMDocument, to load your html, then select your tag and get the inner html which might look something like this:
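$doc = new DOMDocument();
// loadHTML() parses an HTML string; load() would expect a file path.
$doc->loadHTML($html);
// getElementsByTagName() returns a list of all matching elements.
$nodes = $doc->getElementsByTagName('el');
// nodeValue is a property, read here from the first matching element.
$value = $nodes->item(0)->nodeValue;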
And since this is a proper parser it will be able to handle nesting tags etc.
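The parser route can be sketched in Python as well, using the standard library's html.parser instead of PHP's DOMDocument (a swapped-in tool, not the one named above). The class name TagTextCollector is hypothetical; it tracks nesting depth so that an element nested inside another of the same name is folded into the outer element's text.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text inside every outermost occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.buffer = []    # text pieces of the element being read
        self.results = []   # one entry per outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)

# Hypothetical input with a nested <div>, which a lazy regex would cut short.
collector = TagTextCollector("div")
collector.feed("<div>outer <div>inner</div> tail</div><p>skip</p><div>second</div>")
collector.close()
print(collector.results)
```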
Solution 9 - Html
Try this....
(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)
Solution 10 - Html
Since the accepted answer has no JavaScript code, here is some:
var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });
Solution 11 - Html
preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches);
This regex selects everything between the <pre> tags, even when the content spans several lines.
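A quick Python sketch of what the [^>] character class does and does not cover: a negated class matches newlines, so multi-line content works, but any '>' inside the content stops the match. Both sample strings are hypothetical.

```python
import re

pattern = re.compile(r"<pre>([^>]*?)</pre>")

# A negated class matches newlines, so content spanning lines is found.
multi_line = pattern.findall("<pre>line one\nline two</pre>")

# A literal '>' in the content cannot be matched, so nothing is found.
with_gt = pattern.findall("<pre>if a > b</pre>")

print(multi_line)
print(with_gt)
```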
Solution 12 - Html
In Python, setting the DOTALL
flag will capture everything, including newlines.
> If the DOTALL flag has been specified, this matches any character including a newline. docs.python.org
#example.py using Python 3.7.4
import re
str="""Everything is awesome! <pre>Hello,
World!
</pre>
"""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
pattern = re.compile(r"<pre>(.*)</pre>",re.DOTALL)
matches = pattern.search(str)
print(matches.group(1))
python example.py
Hello,
World!
Capturing text between all opening and closing tags in a document
To capture text between all opening and closing tags in a document, finditer
is useful. In the example below, three opening and closing <pre>
tags are present in the string.
#example2.py using Python 3.7.4
import re
# str contains three <pre>...</pre> tags
str = """In two different ex-
periments, the authors had subjects chat and solve the <pre>Desert Survival Problem</pre> with a
humorous or non-humorous computer. In both experiments the computer made pre-
programmed comments, but in study 1 subjects were led to believe they were interact-
ing with another person. In the <pre>humor conditions</pre> subjects received a number of funny
comments, for instance: “The mirror is probably too small to be used as a signaling
device to alert rescue teams to your location. Rank it lower. (On the other hand, it
offers <pre>endless opportunity for self-reflection</pre>)”."""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
# The question mark in (.*?) indicates non greedy matching.
pattern = re.compile(r"<pre>(.*?)</pre>",re.DOTALL)
matches = pattern.finditer(str)
for i, match in enumerate(matches):
    print(f"tag {i}: {match.group(1)}")
python example2.py
tag 0: Desert Survival Problem
tag 1: humor conditions
tag 2: endless opportunity for self-reflection
Solution 13 - Html
To select all the text between pre tags, I prefer
preg_match('#<pre>([\w\W\s]*)</pre>#',$str,$matches);
> $matches[0] will have results including <pre> tag
> $matches[1] will have all the content inside <pre>.
DOMDocument does not work when the requirement is to get the text together with the tags nested inside the searched tag, because it strips all tags: nodeValue and textContent return only the text, without tags or attributes.
Solution 14 - Html
(?<=>)[^<]+
for Notepad++
>([^<]+)
for AutoIt (option Return array of global matches).
or
(?=>([^<]+))
Solution 15 - Html
You can use Pattern pattern = Pattern.compile("<tagname>(.*?)</tagname>", Pattern.DOTALL); and read matcher.group(1), with tagname replaced by the tag you need.
Solution 16 - Html
I use this solution:
preg_match_all( '/<((?!<)(.|\n))*?\>/si', $content, $new);
var_dump($new);
Solution 17 - Html
const content = '<p class="title responsive">ABC</p>';
const blog = {content};
const re = /<([^> ]+)([^>]*)>([^<]+)(<\/\1>)/;
const matches = content.match(re);
console.log(matches[3]);
matches[3] is the content text. The pattern adapts to any tag name, with or without classes (nested structures are not supported).
Solution 18 - Html
For multiple lines:
<htmltag>(.+)((\s)+(.+))+</htmltag>
Solution 19 - Html
In Javascript (among others), this is simple. It covers attributes and multiple lines:
/<pre[^>]*>([\s\S]*?)<\/pre>/
Solution 20 - Html
<pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>