How to remove html special chars?


Php Problem Overview


I am creating an RSS feed file for my application, and I want to remove HTML tags from it, which strip_tags does. But strip_tags does not remove HTML special character codes such as:

  &nbsp; &amp; &copy;

etc.

Please tell me any function which I can use to remove these special code chars from my string.

Php Solutions


Solution 1 - Php

Either decode them using html_entity_decode or remove them using preg_replace:

$Content = preg_replace("/&#?[a-z0-9]+;/i","",$Content); 

(From here)

EDIT: Alternative according to Jacco's comment

> might be nice to replace the '+' with {2,8} or something. This will limit the chance of replacing entire sentences when an unencoded '&' is present.

$Content = preg_replace("/&#?[a-z0-9]{2,8};/i","",$Content); 
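To see what the {2,8} bound changes, here is a small comparison sketch. The sample string and variable names are made up for illustration; the string contains two real entities plus a stray, unencoded '&' followed by a long run of letters and a ';'.

```php
<?php
// Hypothetical sample: two real entities plus an unencoded '&' before a long word.
$sample = 'Fish &amp; Chips &copy; 2024, Q&Aforeveryone; ok';

// Unbounded: anything shaped like &word; is removed, including "&Aforeveryone;".
$greedy = preg_replace('/&#?[a-z0-9]+;/i', '', $sample);

// Bounded: only names of 2 to 8 characters are treated as entities.
$bounded = preg_replace('/&#?[a-z0-9]{2,8};/i', '', $sample);

echo $greedy, "\n";  // Fish  Chips  2024, Q ok
echo $bounded, "\n"; // Fish  Chips  2024, Q&Aforeveryone; ok
```

The bound is a trade-off: some HTML5 entity names are longer than eight characters and would be left in place. Either way, removing entities outright leaves doubled spaces behind.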

Solution 2 - Php

Use html_entity_decode to convert HTML entities.

You'll need to set charset to make it work correctly.
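A minimal sketch of what setting the charset can look like, assuming the feed is UTF-8 and PHP 7.0+ (for the "\u{...}" escape). The sample string and variable names are only for illustration.

```php
<?php
// Sample input using the three entities from the question.
$encoded = 'Fish&nbsp;&amp; Chips &copy; 2024';

// Pass the flags and the target charset explicitly instead of relying on defaults.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// &nbsp; decodes to U+00A0 (non-breaking space), not a plain space,
// so it is swapped for a regular space here.
$plain = str_replace("\u{00A0}", ' ', $decoded);

echo $plain, "\n"; // Fish & Chips © 2024
```

If the decoded text goes back into XML, the '&' has to be escaped again on output.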

Solution 3 - Php

In addition to the good answers above, PHP also has a built-in filter function that is quite useful: filter_var.

To remove HTML characters, use (note that FILTER_SANITIZE_STRING is deprecated as of PHP 8.1):

$cleanString = filter_var($dirtyString, FILTER_SANITIZE_STRING);

More info:

  1. function.filter-var
  2. filter_sanitize_string

Solution 4 - Php

You may want to take a look at htmlentities() and html_entity_decode():

$orig = "I'll \"walk\" the <b>dog</b> now";

$a = htmlentities($orig);

$b = html_entity_decode($a);

echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now

echo $b; // I'll "walk" the <b>dog</b> now

Solution 5 - Php

This might work well to remove special characters.

$modifiedString = preg_replace("/[^a-zA-Z0-9_.\s-]/", "", $content); // hyphen moved last so it is read as a literal, not a range
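One thing to watch with a whitelist like this: applied to text that still contains entities, it only drops the '&' and ';' and leaves the entity names behind. A small sketch, with a made-up sample string, comparing the whitelist alone against decoding first (the hyphen sits last in the character class so it is read literally):

```php
<?php
$content = 'Tom &amp; Jerry &copy; 1940';

// Whitelist alone: '&' and ';' are removed, but the entity names stay behind.
$whitelistOnly = preg_replace('/[^a-zA-Z0-9_.\s-]/', '', $content);

// Decoding first turns the entities into real characters, which the whitelist then drops.
$decoded = html_entity_decode($content, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$decodedThenFiltered = preg_replace('/[^a-zA-Z0-9_.\s-]/', '', $decoded);

echo $whitelistOnly, "\n";       // Tom amp Jerry copy 1940
echo $decodedThenFiltered, "\n"; // Tom  Jerry  1940
```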

Solution 6 - Php

If you want to convert the HTML special characters rather than just remove them, and also strip the markup to prepare plain text, this is the solution that worked for me:

function htmlToPlainText($str){
    $str = str_replace('&nbsp;', ' ', $str);
    $str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
    $str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
    $str = html_entity_decode($str);
    $str = htmlspecialchars_decode($str);
    $str = strip_tags($str);

    return $str;
}

$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>';

echo htmlToPlainText($string);
// this is ( ) a test
// Yes this is! & does it get "processed"?

html_entity_decode with ENT_QUOTES converts things like &#39;, htmlspecialchars_decode converts things like &amp;, html_entity_decode converts things like &lt;, and strip_tags removes any HTML tags left over.

EDIT - Added str_replace('&nbsp;', ' ', $str); and several other html_entity_decode() calls, as continued testing has shown a need for them.

Solution 7 - Php

A plain vanilla strings way to do it without engaging the preg regex engine:

function remEntities($str) {
  if(substr_count($str, '&') && substr_count($str, ';')) {
    // Find amper
    $amp_pos = strpos($str, '&');
    //Find the ;
    $semi_pos = strpos($str, ';');
    // Only if the ; is after the &
    if($semi_pos > $amp_pos) {
      //is a HTML entity, try to remove
      $tmp = substr($str, 0, $amp_pos);
      $tmp = $tmp. substr($str, $semi_pos + 1, strlen($str));
      $str = $tmp;
      //Has another entity in it?
      if(substr_count($str, '&') && substr_count($str, ';'))
        $str = remEntities($tmp);
    }
  }
  return $str;
}
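Note that the recursive version above stops as soon as the first ';' comes before the first '&', and it removes whatever lies between the two characters. Here is a loop-based sketch of the same plain-string idea that tries to avoid both. The function name, the alphanumeric check via ctype_alnum, and the 8-character limit (borrowed from the {2,8} bound in Solution 1) are my own assumptions, not part of the answer.

```php
<?php
// Hypothetical variant: scans for '&' ... ';' pairs without regular expressions.
function removeEntitiesIterative($str)
{
    $offset = 0;
    while (($amp = strpos($str, '&', $offset)) !== false) {
        $semi = strpos($str, ';', $amp);
        if ($semi === false) {
            break; // no terminator left, nothing more to remove
        }
        // Text between '&' and ';', without a leading '#' for numeric entities.
        $name = (string) substr($str, $amp + 1, $semi - $amp - 1);
        if (strpos($name, '#') === 0) {
            $name = (string) substr($name, 1);
        }
        if ($name !== '' && strlen($name) <= 8 && ctype_alnum($name)) {
            // Looks like an entity: cut it out and rescan from the same position.
            $str = substr($str, 0, $amp) . substr($str, $semi + 1);
            $offset = $amp;
        } else {
            // A stray '&': leave it alone and continue after it.
            $offset = $amp + 1;
        }
    }
    return $str;
}

echo removeEntitiesIterative('a; b &amp; c & d; e &#169;'), "\n"; // "a; b  c & d; e "
```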

Solution 8 - Php

What I did was use html_entity_decode first, then strip_tags to remove the tags.
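A minimal sketch of that two-step approach, assuming UTF-8 text. The helper name and sample strings are made up for illustration.

```php
<?php
// Hypothetical helper name; the order follows the answer: decode first, then strip tags.
function decodeThenStrip($html)
{
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return strip_tags($decoded);
}

echo decodeThenStrip('<p>Fish &amp; Chips &copy; 2024</p>'), "\n"; // Fish & Chips © 2024
echo decodeThenStrip('&lt;b&gt;bold&lt;/b&gt;'), "\n";             // bold
```

As the second call shows, decoding first means that markup which was escaped in the source becomes real tags and gets stripped as well. If escaped markup should survive as text, the two calls would have to run in the opposite order.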

Solution 9 - Php

Try this:

<?php
$str = "\x8F!!!";

// Outputs an empty string
echo htmlentities($str, ENT_QUOTES, "UTF-8");

// Outputs "!!!"
echo htmlentities($str, ENT_QUOTES | ENT_IGNORE, "UTF-8");
?>

Solution 10 - Php

It looks like what you really want is:

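function xmlEntities($string) {
    // Request the table in UTF-8 so multibyte characters map correctly.
    $translationTable = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');
    $from = array();
    $to = array();

    foreach ($translationTable as $char => $entity) {
        $from[] = $entity;
        // mb_ord() (PHP 7.2+, mbstring) returns the full code point; ord() only reads the first byte.
        $to[] = '&#'.mb_ord($char, 'UTF-8').';';
    }
    return str_replace($from, $to, $string);
}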

It replaces the named-entities with their number-equivalent.

Solution 11 - Php

<?php
function strip_only($str, $tags, $stripContent = false) {
    $content = '';
    if(!is_array($tags)) {
        // Check $tags (not $str) for the '<a><b>' list form.
        $tags = (strpos($tags, '>') !== false
                 ? explode('>', str_replace('<', '', $tags))
                 : array($tags));
        if(end($tags) == '') array_pop($tags);
    }
    foreach($tags as $tag) {
        if ($stripContent)
            $content = '(.+</'.$tag.'[^>]*>|)';
        $str = preg_replace('#</?'.$tag.'[^>]*>'.$content.'#is', '', $str);
    }
    return $str;
}

$str = '<font color="red">red</font> text';
$tags = 'font';
$a = strip_only($str, $tags); // red text
$b = strip_only($str, $tags, true); // text
?> 

Solution 12 - Php

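The function I used for the task, building on schnaader's answer, is:

    mysql_real_escape_string(
    	preg_replace_callback("/&#?[a-z0-9]+;/i", function($m) { 
    		// $m[0] is the full match; the pattern has no capture group, so $m[1] does not exist
    		return mb_convert_encoding($m[0], "UTF-8", "HTML-ENTITIES"); 
    	}, strip_tags($row['cuerpo'])))

This removes every HTML tag and converts every HTML entity to its UTF-8 character, ready to save in MySQL. Note that mysql_real_escape_string was removed in PHP 7; with mysqli or PDO, use mysqli_real_escape_string or prepared statements.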

Solution 13 - Php

You can try htmlspecialchars_decode($string). It works for me.

http://www.w3schools.com/php/func_string_htmlspecialchars_decode.asp
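Keep in mind that htmlspecialchars_decode only handles the special characters (&amp;, &lt;, &gt; and quotes), so entities like &copy; and &nbsp; from the question are left untouched. A short comparison sketch, with a made-up sample string:

```php
<?php
$input = '&lt;b&gt;Fish&lt;/b&gt; &amp; Chips &copy;&nbsp;2024';

// Only the special characters (&amp; &lt; &gt; and quotes) are decoded.
$special = htmlspecialchars_decode($input);

// html_entity_decode also handles named entities such as &copy; and &nbsp;.
$all = html_entity_decode($input, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $special, "\n"; // <b>Fish</b> & Chips &copy;&nbsp;2024
echo $all, "\n";     // <b>Fish</b> & Chips © followed by a non-breaking space and 2024
```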

Solution 14 - Php

If you are working in WordPress and are like me and simply need to check for an empty field (and there are a copious amount of random html entities in what seems like a blank string) then take a look at:

sanitize_title_with_dashes( string $title, string $raw_title = '', string $context = 'display' )

Link to wordpress function page

For people not working on WordPress, I found this function REALLY useful for creating my own sanitizer. Take a look at the full code; it's really in depth!

Solution 15 - Php

$string = "äáčé";

$convert = Array(
        'ä'=>'a',
        'Ä'=>'A',
        'á'=>'a',
        'Á'=>'A',
        'à'=>'a',
        'À'=>'A',
        'ã'=>'a',
        'Ã'=>'A',
        'â'=>'a',
        'Â'=>'A',
        'č'=>'c',
        'Č'=>'C',
        'ć'=>'c',
        'Ć'=>'C',
        'ď'=>'d',
        'Ď'=>'D',
        'ě'=>'e',
        'Ě'=>'E',
        'é'=>'e',
        'É'=>'E',
        'ë'=>'e',
    );

$string = strtr($string , $convert );

echo $string; //aace

Solution 16 - Php

What If By "Remove HTML Special Chars" You Meant "Replace Appropriately"?

After all, just look at your example...

&nbsp; &amp; &copy;

If you're stripping this for an RSS feed, shouldn't you want the equivalents?

" ", &, ©

Or maybe you don't exactly want the equivalents. Maybe you'd want to have &nbsp; just be ignored (to prevent too much space), but then have &copy; actually get replaced. Let's work out a solution that solves anyone's version of this problem...

How to SELECTIVELY-REPLACE HTML Special Chars

The logic is simple: preg_match_all('/(&#[0-9]+;)/', ...) grabs all of the matches, and then we simply build a list of matchables and replaceables, such as str_replace([searchlist], [replacelist], $term). Before we do this, we also need to convert named entities to their numeric counterparts, i.e., "&nbsp;" is unacceptable, but "&#160;" is fine. (Thanks to it-alien's solution to this part of the problem.)

Working Demo

In this demo, I replace &#123; with "HTML Entity #123". Of course, you can fine-tune this to any kind of find-replace you want for your case.

Why did I make this? I use it with generating Rich Text Format from UTF8-character-encoded HTML.

	function FixUTF8($args) {
		$output = $args['input'];
		
		$output = convertNamedHTMLEntitiesToNumeric(['input'=>$output]);
		
		preg_match_all('/(&#[0-9]+;)/', $output, $matches, PREG_OFFSET_CAPTURE);
		$full_matches = $matches[0];
		
		$found = [];
		$search = [];
		$replace = [];
		
		for($i = 0; $i < count($full_matches); $i++) {
			$match = $full_matches[$i];
			$word = $match[0];
			if(empty($found[$word])) { // empty() avoids an undefined-index warning on first sight
				$found[$word] = TRUE;
				$search[] = $word;
				$replacement = str_replace(['&#', ';'], ['HTML Entity #', ''], $word);
				$replace[] = $replacement;
			}
		}

		$new_output = str_replace($search, $replace, $output);
		
		return $new_output;
	}
	
	function convertNamedHTMLEntitiesToNumeric($args) {
		$input = $args['input'];
		return preg_replace_callback("/(&[a-zA-Z][a-zA-Z0-9]*;)/",function($m){
			$c = html_entity_decode($m[0],ENT_HTML5,"UTF-8");
			# return htmlentities($c,ENT_XML1,"UTF-8"); -- see update below
			
			$convmap = array(0x80, 0xffff, 0, 0xffff);
			return mb_encode_numericentity($c, $convmap, 'UTF-8');
		}, $input);
	}

print(FixUTF8(['input'=>"Oggi &egrave; un bel&nbsp;giorno"]));

Input:

>"Oggi &egrave; un bel&nbsp;giorno"

Output:

>Oggi HTML Entity #232 un belHTML Entity #160giorno
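For the "ignore some entities, replace others" case described at the start of this answer, a single-pass alternative sketch is shown here. It is not the demo code above: the function name and the $overrides map are hypothetical, and it leans on html_entity_decode for everything the caller does not override.

```php
<?php
// Hypothetical helper: $overrides maps an entity (as written) to its replacement text.
function replaceEntitiesSelectively($text, array $overrides)
{
    // Matches decimal (&#160;), hexadecimal (&#xA0;) and named (&nbsp;) entities.
    $pattern = '/&(?:#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i';

    return preg_replace_callback($pattern, function ($m) use ($overrides) {
        if (array_key_exists($m[0], $overrides)) {
            return $overrides[$m[0]]; // caller-specified replacement
        }
        // Everything else becomes its UTF-8 character; unknown names are left as-is.
        return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }, $text);
}

$result = replaceEntitiesSelectively(
    'Oggi &egrave; un bel&nbsp;giorno &copy; 2024',
    array('&nbsp;' => ' ', '&copy;' => '(c)')
);
echo $result, "\n"; // Oggi è un bel giorno (c) 2024
```

The override keys are matched exactly as written, so '&nbsp;' and '&#160;' would need separate entries.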

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | djmzfKnm | View Question on Stackoverflow
Solution 1 - Php | schnaader | View Answer on Stackoverflow
Solution 2 - Php | andi | View Answer on Stackoverflow
Solution 3 - Php | gpkamp | View Answer on Stackoverflow
Solution 4 - Php | 0xFF | View Answer on Stackoverflow
Solution 5 - Php | Vinit Kadkol | View Answer on Stackoverflow
Solution 6 - Php | Jay | View Answer on Stackoverflow
Solution 7 - Php | karim79 | View Answer on Stackoverflow
Solution 8 - Php | Gwapz Juan | View Answer on Stackoverflow
Solution 9 - Php | RaGu | View Answer on Stackoverflow
Solution 10 - Php | Jacco | View Answer on Stackoverflow
Solution 11 - Php | jahanzaib | View Answer on Stackoverflow
Solution 12 - Php | Lalala | View Answer on Stackoverflow
Solution 13 - Php | surabhivin | View Answer on Stackoverflow
Solution 14 - Php | Benjamin Vaughan | View Answer on Stackoverflow
Solution 15 - Php | Ivošš | View Answer on Stackoverflow
Solution 16 - Php | HoldOffHunger | View Answer on Stackoverflow