remove script tag from HTML content

Php, Regex, Htmlpurifier

Php Problem Overview


I am using HTML Purifier (http://htmlpurifier.org/)

I just want to remove <script> tags only. I don't want to remove inline formatting or any other things.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

Php Solutions


Solution 1 - Php

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases it's useful for quickly fixing some markup, but as with all quick fixes, forget about security. Use regex only on content/markup you trust.

Remember: anything a user inputs should be considered unsafe.
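
To make the warning concrete, here is a minimal sketch of how a single pass of the regex above can be defeated. The payload string is a made-up example, not something from the question; it nests an empty script pair inside the real tags so that removing the inner pairs reassembles a working outer one.

```php
<?php

// Hypothetical attacker-controlled input: an empty <script></script> pair
// is embedded inside the opening and closing tags of the real payload.
$payload = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

// The same pattern as above, applied once.
$cleaned = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $payload);

// Removing the inner pairs stitches the outer tag back together.
echo $cleaned, PHP_EOL; // <script>alert(1)</script>
```

Repeating the replacement until the string stops changing would close this particular hole, but it is only one of the ways hand-written patterns can be sidestepped.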

A better solution here would be to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have intentionally left out the sample HTML, because even this approach can break.
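
The snippet above copies the nodes into $remove before deleting them because the DOMNodeList returned by getElementsByTagName() is live: it shrinks as nodes are removed from the document. A minimal sketch of another way to cope with that, assuming a small made-up HTML string, is to keep removing the first item until the list is empty:

```php
<?php

// Made-up sample markup with two script tags.
$html = '<html><body><p>Hello</p><script>alert(1);</script><script>alert(2);</script></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$scripts = $dom->getElementsByTagName('script');

// The list is live, so its length drops by one after every removal.
while ($scripts->length > 0) {
    $node = $scripts->item(0);
    $node->parentNode->removeChild($node);
}

$clean = $dom->saveHTML();
echo $clean;
```

Either approach works; the point is not to rely on positions in the list staying fixed while nodes are being removed.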

Solution 2 - Php

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

$length = $script_tags->length;

// for each tag, remove it from the DOM; the node list is live, so
// iterate backwards to keep the remaining indexes valid
for ($i = $length - 1; $i >= 0; $i--) {
  $tag = $script_tags->item($i);
  $tag->parentNode->removeChild($tag);
}

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me using the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.

Solution 3 - Php

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
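foreach($tags_to_remove as $tag){
    // copy the live node list first; removing nodes while iterating over it can skip elements
    $element = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($element as $item){
        $item->parentNode->removeChild($item);
    }
}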
$html = $dom->saveHTML();

Solution 4 - Php

A simple way, by manipulating the string:

function stripStr($str, $ini, $fin)
{
    while (($pos = mb_stripos($str, $ini)) !== false) {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $str = mb_substr($str, 0, $pos);
        
        if (($pos2 = mb_stripos($aux, $fin)) !== false) {
            $str .= mb_substr($aux, $pos2 + mb_strlen($fin));
        }
    }

    return $str;
}
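
The function takes the text, a start marker and an end marker. A minimal usage sketch follows; the function body is repeated (behind a function_exists() guard) only so the example runs on its own, and the sample string is made up. Passing '<script' without the closing bracket as the start marker means tags with attributes are matched too.

```php
<?php

// Same function as above, repeated so this example is self-contained.
if (!function_exists('stripStr')) {
    function stripStr($str, $ini, $fin)
    {
        while (($pos = mb_stripos($str, $ini)) !== false) {
            $aux = mb_substr($str, $pos + mb_strlen($ini));
            $str = mb_substr($str, 0, $pos);

            if (($pos2 = mb_stripos($aux, $fin)) !== false) {
                $str .= mb_substr($aux, $pos2 + mb_strlen($fin));
            }
        }

        return $str;
    }
}

// Made-up sample input.
$html = '<p>a</p><script type="text/javascript">alert(1)</script><p>b</p>';

echo stripStr($html, '<script', '</script>'), PHP_EOL; // <p>a</p><p>b</p>
```

Note that when no end marker is found, everything after the start marker is dropped.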

Solution 5 - Php

  • This is a merge of the answers by ClandestineCoder and Binh WPO.

The problem with the angle brackets of the script tag is that they can appear in more than one variant:

> e.g. (< = &lt; = &amp;lt;) and (> = &gt; = &amp;gt;)

So instead of creating a pattern array with a bazillion variants, IMHO a better solution would be:

return preg_replace('/script.*?\/script/ius', '', $text)
       ? preg_replace('/script.*?\/script/ius', '', $text)
       : $text;

This will remove anything that looks like script.../script regardless of the bracket encoding/variant, and you can test it here: https://regex101.com/r/lK6vS8/1
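
One consequence is worth seeing before adopting this pattern: because it does not anchor on the brackets, it leaves them behind, and it can also swallow ordinary text that happens to contain the letters "script". A small sketch with made-up input strings:

```php
<?php

$pattern = '/script.*?\/script/ius';

// A real script tag: the tag name and body go, the outer brackets stay.
$a = preg_replace($pattern, '', '<p>hi</p><script>alert(1)</script>');
echo $a, PHP_EOL; // <p>hi</p><>

// Plain text: "description" contains "script", so everything from there up
// to the next "/script" is removed as well.
$b = preg_replace($pattern, '', 'A description of the docs/scripts folder.');
echo $b, PHP_EOL; // A des folder.
```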

Solution 6 - Php

Try this more complete and flexible solution. It is based in part on some of the previous answers, but it adds validation checks and gets rid of the extra HTML implied by the loadHTML(...) function. It is split into two functions (the second depends on the first, so don't reorder them), so you can use it to remove several kinds of HTML tags at once, not just 'script' tags. For example, the removeAllInstancesOfTag(...) function accepts an array of tag names, or just one tag name as a string. So, without further ado, here is the code:


/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [BEGIN] */

/* Usage Example: $scriptless_html = removeAllInstancesOfTag($html, 'script'); */

if (!function_exists('removeAllInstancesOfTag')) {
    function removeAllInstancesOfTag($html, $tag_nm)
    {
        if (empty($html)) {
            return '';
        }

        $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'); /* For UTF-8 Compatibility. */
        $doc = new DOMDocument();
        $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD | LIBXML_NOWARNING);

        /* Accept either an array of tag names or a single tag name as a string. */
        $tag_nms = [];
        if (!empty($tag_nm)) {
            if (is_array($tag_nm)) {
                $tag_nms = $tag_nm;
            } elseif (is_string($tag_nm)) {
                $tag_nms = [$tag_nm];
            }
        }

        foreach ($tag_nms as $nm) {
            /* Copy the live node list into an array before removing anything. */
            $rmvbl_itms_arr = iterator_to_array($doc->getElementsByTagName(strval($nm)));

            foreach ($rmvbl_itms_arr as $itm) {
                $itm->parentNode->removeChild($itm);
            }
        }

        return $doc->saveHTML();
    }
}

/* Remove all instances of a particular HTML tag (e.g. <script>...</script>) from a variable containing raw HTML data. [END] */

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [BEGIN] */

/* Prerequisites: 'removeAllInstancesOfTag(...)' */

if (!function_exists('removeAllScriptTags')) {
    function removeAllScriptTags($html)
    {
        return removeAllInstancesOfTag($html, 'script');
    }
}

/* Remove all instances of dangerous and pesky <script> tags from a variable containing raw user-input HTML data. [END] */


And here is a test usage example:


$html = 'This is a JavaScript retention test.<br><br><span id="chk_frst_scrpt">Congratulations! The first \'script\' tag was successfully removed!</span><br><br><span id="chk_secd_scrpt">Congratulations! The second \'script\' tag was successfully removed!</span><script>document.getElementById("chk_frst_scrpt").innerHTML = "Oops! The first \'script\' tag was NOT removed!";</script><script>document.getElementById("chk_secd_scrpt").innerHTML = "Oops! The second \'script\' tag was NOT removed!";</script>';
echo removeAllScriptTags($html);

I hope my answer really helps someone. Enjoy!

Solution 7 - Php

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

When using regex, things might go wrong, so it's safer to do it like this:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ? : $html;

That way, when an "accident" happens, we get the original $html instead of an empty string.
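
One caveat with the short ternary, sketched below with a made-up input: ?: falls back on any falsy result, and an empty string is falsy. If the whole input is a script tag, the replacement succeeds and returns '', so the original markup comes back. preg_replace() returns null on failure, so testing for null specifically (for example with ??, available from PHP 7) avoids this.

```php
<?php

$html = '<script>alert(1)</script>';

// Short ternary: the successful result '' is falsy, so $html is returned.
$withTernary = preg_replace("/<script.*?\/script>/s", "", $html) ?: $html;

// Null coalescing: only a null result (an error) falls back to $html.
$withCoalesce = preg_replace("/<script.*?\/script>/s", "", $html) ?? $html;

var_dump($withTernary === $html); // bool(true): the script is still there
var_dump($withCoalesce === '');   // bool(true): the script was removed
```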

Solution 8 - Php

An example modifying ctf0's answer. This should run preg_replace only once, while also checking for errors and handling the character codes for the forward slash.

$str = '<script> var a - 1; <&#47;script>'; 

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str); 
return ($replace !== null)? $replace : $str;  

If you are using PHP 7, you can use the null coalescing operator to simplify it even more.

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius'; 
return (preg_replace($pattern, '', $str) ?? $str); 

Solution 9 - Php

function remove_script_tags($html){
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $script = $dom->getElementsByTagName('script');

    $remove = [];
    foreach($script as $item){
        $remove[] = $item;
    }

    foreach ($remove as $item){
        $item->parentNode->removeChild($item);
    }

    $html = $dom->saveHTML();
    $html = preg_replace('/<!DOCTYPE.*?<html>.*?<body><p>/ims', '', $html);
    $html = str_replace('</p></body></html>', '', $html);
    return $html;
}

Dejan's answer was good, but saveHTML() adds unnecessary doctype and body tags. This should get rid of them. See https://3v4l.org/82FNP

Solution 10 - Php

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.

Don't try to do it with regexps. That way lies madness.

Solution 11 - Php

I had been struggling with this question, and I discovered that you only really need one function: explode('>', $html). The single common denominator of any tag is < and >. After that, it's usually quotation marks ( " ). You can extract information very easily once you find the common denominator. This is what I came up with:

$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);
$counter = null; // index of the piece that holds the opening <script tag

foreach ($h as $k => $v) {
    $v = trim($v); // clean it up a bit

    if ($counter === null && preg_match('/^<script\b/i', $v)) {
        $counter = $k; // match the opening tag and remember where it starts
    } elseif ($counter !== null && preg_match('/<\/script$/i', $v)) {
        // backtrace and clear everything from the opening tag to the closing tag
        for ($i = $counter; $i <= $k; $i++) {
            $h[$i] = null;
        }
        $counter = null;
    }
}

// drop only the cleared pieces, so implode() restores the remaining '>' characters
$ht = array_filter($h, function ($piece) {
    return $piece !== null;
});
$html = implode('>', $ht); // all scripts stripped


echo $html;

I see this really only working for script tags, because you will never have nested script tags. Of course, you can easily add more code that does the same check and gathers nested tags.

I call it accordion coding. implode() and explode() are the easiest ways to get your logic flowing if you have a common denominator.

Solution 12 - Php

This is a simplified variant of Dejan Marjanovic's answer:

function removeTags($html, $tag) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
        $item->parentNode->removeChild($item);
    }
    return $dom->saveHTML();
}

Can be used to remove any kind of tag, including <script>:

$scriptlessHtml = removeTags($html, 'script');

Solution 13 - Php

Use the str_replace function to replace the tags with an empty string:

$query = '<script>console.log("I should be banned")</script>';

$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);

echo $query; 
//this echoes console.log("I should be banned")

?>
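
Note what this does and does not cover. str_replace() is case-sensitive and matches the two literal strings only, so a tag with attributes or different casing passes through, and the script body is always left in place. A quick sketch with made-up inputs:

```php
<?php

$badChar = array('<script>', '</script>');

// Opening tag with an attribute: only the closing tag is removed.
$a = str_replace($badChar, '', '<script type="text/javascript">alert(1)</script>');
echo $a, PHP_EOL; // <script type="text/javascript">alert(1)

// Upper-case tag: nothing matches, so the input is returned unchanged.
$b = str_replace($badChar, '', '<SCRIPT>alert(1)</SCRIPT>');
echo $b, PHP_EOL; // <SCRIPT>alert(1)</SCRIPT>
```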

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | I-M-JM | View Question on Stackoverflow
Solution 1 - Php | Dejan Marjanović | View Answer on Stackoverflow
Solution 2 - Php | Alex | View Answer on Stackoverflow
Solution 3 - Php | prasanthnv | View Answer on Stackoverflow
Solution 4 - Php | José Carlos PHP | View Answer on Stackoverflow
Solution 5 - Php | ctf0 | View Answer on Stackoverflow
Solution 6 - Php | James Anderson Jr. | View Answer on Stackoverflow
Solution 7 - Php | Binh WPO | View Answer on Stackoverflow
Solution 8 - Php | tech-e | View Answer on Stackoverflow
Solution 9 - Php | relipse | View Answer on Stackoverflow
Solution 10 - Php | Michael Lorton | View Answer on Stackoverflow
Solution 11 - Php | ClandestineCoder | View Answer on Stackoverflow
Solution 12 - Php | mae | View Answer on Stackoverflow
Solution 13 - Php | Oliver Tembo | View Answer on Stackoverflow