Remove style attribute from HTML tags

Tags: PHP, Regex, TinyMCE

Php Problem Overview


I'm not very good with regular expressions, but I want to use PHP to remove the style attribute from HTML tags in a string that comes back from TinyMCE.

So change <p style="...">Text</p> to just vanilla <p>Text</p>.

How would I achieve this with something like the preg_replace() function?

Php Solutions


Solution 1 - Php

The pragmatic regex (<[^>]+) style=".*?" will solve this problem in all reasonable cases. The part of the match that is not the first captured group should be removed, like this:

$output = preg_replace('/(<[^>]+) style=".*?"/i', '$1', $input);

The pattern matches a < followed by one or more characters that are not >, up to a space and the style="..." part. The /i flag makes the match case-insensitive, so it also works with STYLE="...". The match is replaced with $1, the captured group, which drops the style attribute and keeps everything before it. A tag that doesn't include style="..." is left as is.
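To make the behaviour concrete, here is a small sketch that runs the same pattern over a made-up input (the sample string is an assumption, not from the question). Note that the pattern only looks for double-quoted values, so style='...' with single quotes would be left in place.

```php
<?php
// Hypothetical sample input with a lowercase style, an uppercase STYLE,
// and a tag that has no style attribute at all.
$input = '<p style="color: red" class="intro">Text</p>'
       . '<span STYLE="font-weight: bold">more</span><em>plain</em>';

// Same pattern as above: keep the captured "<tag ..." prefix, drop style="...".
$output = preg_replace('/(<[^>]+) style=".*?"/i', '$1', $input);

echo $output, PHP_EOL;
// prints: <p class="intro">Text</p><span>more</span><em>plain</em>
```

Other attributes such as class survive, and tags without a style attribute pass through untouched.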

Solution 2 - Php

Something like this should work (untested code warning):

<?php

$html = '<p style="asd">qwe</p><br /><p class="qwe">qweqweqwe</p>';

$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($html);
libxml_use_internal_errors(false);

$domx = new DOMXPath($domd);
$items = $domx->query("//p[@style]");

foreach($items as $item) {
  $item->removeAttribute("style");
}

echo $domd->saveHTML();
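The query above only selects <p> elements, and saveHTML() without arguments serializes the whole document, including the doctype, <html> and <body> that loadHTML() adds. A possible variation, sketched under a few assumptions: the function name removeStyleAttributes() and the wrapper id are hypothetical, the input is assumed to be an HTML fragment with balanced tags, and no special encoding handling is attempted, so non-ASCII input may need extra care.

```php
<?php
// Sketch: strip style attributes from every element, not only <p>.
// removeStyleAttributes() and the wrapper id are hypothetical names.
function removeStyleAttributes(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div id="style-strip-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // "//*[@style]" selects any element that carries a style attribute.
    foreach ($xpath->query('//*[@style]') as $node) {
        $node->removeAttribute('style');
    }

    // Serialize only the wrapper's children, so the doctype, <html> and
    // <body> added by loadHTML() do not end up in the result.
    $root = $xpath->query('//div[@id="style-strip-root"]')->item(0);
    $result = '';
    foreach ($root->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }

    return $result;
}

echo removeStyleAttributes('<p style="asd">qwe</p><br /><p class="qwe">qweqweqwe</p>');
```

Wrapping the fragment in a known element is one way to get a fragment back out; the serializer may still normalize markup, for example by rewriting <br /> in HTML form.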

Solution 3 - Php

I commented on @Maerlyn's answer. It does work, but DOMDocument really messes up the encoding. Here's my simplehtmldom version:

function stripAttributes($html,$attribs) {
    $dom = new simple_html_dom();
    $dom->load($html);
    foreach($attribs as $attrib)
        foreach($dom->find("*[$attrib]") as $e)
            $e->$attrib = null; 
    $dom->load($dom->save());
    return $dom->save();
}

Solution 4 - Php

Here you go:

<?php

$html = '<p style="border: 1px solid red;">Test</p>';
echo preg_replace('/<p style="(.+?)">(.+?)<\/p>/i', "<p>$2</p>", $html);

?>

By the way, as others have pointed out, regular expressions are not recommended for this.

Solution 5 - Php

I use this:

function strip_word_html($text, $allowed_tags = '<a><ul><li><b><i><sup><sub><em><strong><u><br><br/><br /><p><h2><h3><h4><h5><h6>')
{
	mb_regex_encoding('UTF-8');
	//replace MS special characters first
	$search = array('/&lsquo;/u', '/&rsquo;/u', '/&ldquo;/u', '/&rdquo;/u', '/&mdash;/u');
	$replace = array('\'', '\'', '"', '"', '-');
	$text = preg_replace($search, $replace, $text);
	//make sure _all_ html entities are converted to the plain ascii equivalents - it appears
	//in some MS headers, some html entities are encoded and some aren't
	//$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
	//try to strip out any C style comments first, since these, embedded in html comments, seem to
	//prevent strip_tags from removing html comments (MS Word introduced combination)
	if(mb_stripos($text, '/*') !== FALSE){
		//mb_ereg patterns take no delimiters; the 'm' option lets '.' match newlines
		$text = mb_eregi_replace('/\*.*?\*/', '', $text, 'm');
	}
	//introduce a space into any arithmetic expressions that could be caught by strip_tags so that they won't be stripped:
	//'<1' becomes '< 1' (note: somewhat application specific)
	$text = preg_replace(array('/<([0-9]+)/'), array('< $1'), $text);
	$text = strip_tags($text, $allowed_tags);
	//eliminate extraneous whitespace from start and end of line, or anywhere there are two or more spaces, convert it to one
	$text = preg_replace(array('/^\s\s+/', '/\s\s+$/', '/\s\s+/u'), array('', '', ' '), $text);
	//strip out inline css and simplify style tags
	$search = array('#<(strong|b)[^>]*>(.*?)</(strong|b)>#isu', '#<(em|i)[^>]*>(.*?)</(em|i)>#isu', '#<u[^>]*>(.*?)</u>#isu');
	$replace = array('<b>$2</b>', '<i>$2</i>', '<u>$1</u>');
	$text = preg_replace($search, $replace, $text);
	//on some of the ?newer MS Word exports, where you get conditionals of the form 'if gte mso 9', etc., it appears
	//that whatever is in one of the html comments prevents strip_tags from eradicating the html comment that contains
	//some MS Style Definitions - this last bit gets rid of any leftover comments */
	$num_matches = preg_match_all("/\<!--/u", $text, $matches);
	if($num_matches){
		//lazy match, so text between two separate comments is preserved
		$text = preg_replace('/<!--.*?-->/su', '', $text);
	}
	$text = preg_replace('/(<[^>]+) style=".*?"/i', '$1', $text);
	return $text;
}

Solution 6 - Php

At the moment I'm using the following to strip the style attribute (single- or double-quoted) out of tags while keeping the other attributes:

$output = preg_replace('/<([^>]+)(\sstyle=(?P<stq>["\'])(.*)\k<stq>)([^<]*)>/iUs', '<$1$5>', $input);
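As a quick illustration of what that pattern does, here is a sketch that applies it to a made-up tag (the input string is an assumption) with a single-quoted style attribute sitting between two other attributes.

```php
<?php
// Hypothetical input: style sits between two attributes and uses single quotes.
$input = '<p class="a" style=\'color:red\' id="x">T</p>';

// Group 1 is everything before style, group 5 everything after it.
// The named group "stq" remembers which quote opened the value, and
// \k<stq> requires the same quote to close it.
$output = preg_replace(
    '/<([^>]+)(\sstyle=(?P<stq>["\'])(.*)\k<stq>)([^<]*)>/iUs',
    '<$1$5>',
    $input
);

echo $output, PHP_EOL;
// prints: <p class="a" id="x">T</p>
```

Both neighbouring attributes are kept, and only the style attribute is removed.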

Solution 7 - Php

In addition to Lorenzo Marcon's answer:

Using preg_replace to select everything except style attribute:

$html = preg_replace('/(<p.+?)style=".+?"(>.+?)/i', "$1$2", $html);

Solution 8 - Php

$html = preg_replace('/\sstyle=("|\').*?("|\')/i', '', $html);

This replaces every style="..." or style='...' attribute with an empty string.
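One caveat with this pattern: the opening and closing quotes are matched independently, so a double-quoted value that contains a single quote (a font name, for instance) is cut short at that inner quote. A minimal sketch of a variant that uses a backreference to require the same quote character at both ends; the sample input is made up for illustration.

```php
<?php
// Hypothetical input: a double-quoted style value containing single quotes.
$html = '<p style="font-family: \'Arial\'" class="x">Hi</p>';

// (["\']) captures the opening quote; \1 requires the same quote to close.
$clean = preg_replace('/\sstyle\s*=\s*(["\']).*?\1/is', '', $html);

echo $clean, PHP_EOL;
// prints: <p class="x">Hi</p>
```

Like the original, this works on the raw string, so it would also remove text that merely looks like a style attribute outside of a tag.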

Solution 9 - Php

You could also handle it on the client side. The easiest way would be with jQuery, something like:

$("#tinyMce p").removeAttr("style");

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Martin Bean | View Question on Stackoverflow
Solution 1 - Php | Staffan Nöteberg | View Answer on Stackoverflow
Solution 2 - Php | Maerlyn | View Answer on Stackoverflow
Solution 3 - Php | JaseC | View Answer on Stackoverflow
Solution 4 - Php | Lorenzo Marcon | View Answer on Stackoverflow
Solution 5 - Php | DreschF | View Answer on Stackoverflow
Solution 6 - Php | Ashguard | View Answer on Stackoverflow
Solution 7 - Php | RafaSashi | View Answer on Stackoverflow
Solution 8 - Php | Zmael | View Answer on Stackoverflow
Solution 9 - Php | Daniel | View Answer on Stackoverflow