Getting DOM elements by classname

Php, Css, Dom

Php Problem Overview


I'm using PHP DOM and I'm trying to get an element within a DOM node that has a given class name. What's the best way to get that sub-element?

Update: I ended up using Mechanize for PHP which was much easier to work with.

Php Solutions


Solution 1 - Php

Update: XPath version of the *[class~='my-class'] CSS selector

So after my comment below in response to hakre's comment, I got curious and looked into the code behind Zend_Dom_Query. It looks like the above selector is compiled to the following XPath predicate (untested):

[contains(concat(' ', normalize-space(@class), ' '), ' my-class ')]

So the PHP would be:

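$dom = new DOMDocument();
// load() expects well-formed XML/XHTML; use loadHTMLFile() for ordinary HTML
$dom->load($filePath);
$finder = new DOMXPath($dom);
$classname = "my-class";
// match $classname only as a whole, space-delimited class token
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");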

Basically, all we do here is normalize the class attribute so that the whole class list, even when it holds a single class, is bounded by spaces. We then pad the class we are searching for with a space on each side. This way the query matches my-class only as a whole class name, never as part of a longer one.
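To see the space-padding trick at work, here is a minimal sketch run against an inline HTML string rather than a file. The sample markup and the $matched array are made up for illustration; the query itself is the one shown above.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<div class="my-class">exact</div>'
      . '<div class="  other   my-class ">among others</div>'
      . '<div class="my-class-wide">lookalike</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

$classname = 'my-class';
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

// Collect the text of every element whose class list holds the exact token.
$matched = [];
foreach ($finder->query($query) as $node) {
    $matched[] = $node->textContent;
}

print_r($matched);
```

The first two divs should be picked up, while the my-class-wide div is skipped because the padded token ' my-class ' never appears in its padded class list.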


Use an XPath selector?

$dom = new DomDocument();
$dom->load($filePath);
$finder = new DomXPath($dom);
$classname="my-class";
$nodes = $finder->query("//*[contains(@class, '$classname')]");

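If it is only ever one type of element, you can replace the * with that tag name. Note that contains(@class, ...) is a plain substring test, so it also matches longer class names such as my-class-2.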
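As a rough illustration of both points (restricting the query to one tag, and the substring behaviour of a bare contains()), the sketch below runs the simple query against a small inline document. The markup and the variable names $anyTag and $onlyP are assumptions for the example.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<p class="my-class">paragraph</p>'
      . '<span class="my-class">span</span>'
      . '<p class="not-my-class-either">lookalike</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);
$classname = 'my-class';

// Any element whose class attribute contains the substring "my-class".
$anyTag = $finder->query("//*[contains(@class, '$classname')]");

// Same test, restricted to <p> elements.
$onlyP = $finder->query("//p[contains(@class, '$classname')]");

echo $anyTag->length, "\n"; // 3: both <p> elements and the <span>
echo $onlyP->length, "\n";  // 2: the lookalike <p> still matches
```

If the lookalike match is a problem, the space-padded query from the update is the safer choice.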

If you need to do a lot of this with very complex selectors, I would recommend Zend_Dom_Query, which supports CSS selector syntax (a la jQuery):

$finder = new Zend_Dom_Query($html);
$classname = 'my-class';
$nodes = $finder->query("*[class~=\"$classname\"]");

Solution 2 - Php

If you want the HTML of the elements with a given class without using Zend, you could use this:

$dom = new DOMDocument();
$dom->load($filePath);
$classname = 'main-article';
$finder = new DOMXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
// copy every matching node into a scratch document so it can be serialized on its own
$tmp_dom = new DOMDocument();
foreach ($nodes as $node) {
    $tmp_dom->appendChild($tmp_dom->importNode($node, true));
}
// the serialized markup includes each matched element's own tag
$innerHTML = trim($tmp_dom->saveHTML());
echo $innerHTML;
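If only the markup inside the matched element is wanted, one possible variant is to serialize its child nodes one by one with saveHTML($node). This is a sketch, not the answer's method: the helper name innerHtmlOf and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: concatenate the serialized children of a node.
function innerHtmlOf(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() accepts a node and serializes just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Hypothetical sample markup, used only for illustration.
$source = '<div class="main-article"><h1>Title</h1><p>Body text</p></div>';

$dom = new DOMDocument();
$dom->loadHTML($source);
$finder = new DOMXPath($dom);
$classname = 'main-article';
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");

// Inner markup of the first match, without the wrapping <div>.
$innerHTML = innerHtmlOf($nodes->item(0));
echo $innerHTML;
```

This avoids the scratch document entirely, at the cost of handling one matched element at a time.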

Solution 3 - Php

I think the accepted way is better, but I guess this might work as well

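// Returns the ($offset + 1)-th descendant of $parentNode with the given tag
// whose class attribute contains $className, or false if there is none.
function getElementByClass($parentNode, $tagName, $className, $offset = 0) {
    $response = false;

    $childNodeList = $parentNode->getElementsByTagName($tagName);
    $tagCount = 0;
    for ($i = 0; $i < $childNodeList->length; $i++) {
        $temp = $childNodeList->item($i);
        // case-insensitive substring test, so "a" also matches class="ab"
        if (stripos($temp->getAttribute('class'), $className) !== false) {
            if ($tagCount == $offset) {
                $response = $temp;
                break;
            }

            $tagCount++;
        }
    }

    return $response;
}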

Solution 4 - Php

There is also another approach without the use of DomXPath or Zend_Dom_Query.

Based on dav's original function, I wrote the following function that returns all the children of the parent node whose tag and class match the parameters.

function getElementsByClass($parentNode, $tagName, $className) {
    $nodes = array();

    // getElementsByTagName() returns all descendants with this tag, not only direct children
    $childNodeList = $parentNode->getElementsByTagName($tagName);
    for ($i = 0; $i < $childNodeList->length; $i++) {
        $temp = $childNodeList->item($i);
        // case-insensitive substring test on the class attribute
        if (stripos($temp->getAttribute('class'), $className) !== false) {
            $nodes[] = $temp;
        }
    }

    return $nodes;
}

Suppose you have a variable $html containing the following HTML:

<html>
 <body>
  <div id="content_node">
    <p class="a">I am in the content node.</p>
    <p class="a">I am in the content node.</p>
    <p class="a">I am in the content node.</p>    
  </div>
  <div id="footer_node">
    <p class="a">I am in the footer node.</p>
  </div>
 </body>
</html>

Using getElementsByClass is then as simple as:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
$content_node=$dom->getElementById("content_node");

$p_a_class_nodes = getElementsByClass($content_node, 'p', 'a'); // will contain the three <p> nodes under "content_node"
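Because stripos() does a substring test, searching for the class a would also pick up an element with class="ab". A possible stricter variant, sketched here under the assumption that whole class tokens are wanted, splits the attribute on whitespace first. The function name getElementsByExactClass and the sample markup are made up for illustration.

```php
<?php
// Hypothetical variant: match $className only as a whole class token.
function getElementsByExactClass(DOMElement $parentNode, string $tagName, string $className): array
{
    $nodes = [];
    foreach ($parentNode->getElementsByTagName($tagName) as $element) {
        // Split the class attribute into its whitespace-separated tokens.
        $classes = preg_split('/\s+/', trim($element->getAttribute('class')));
        if (in_array($className, $classes, true)) {
            $nodes[] = $element;
        }
    }
    return $nodes;
}

// Hypothetical sample markup, used only for illustration.
$html = '<div id="content_node">'
      . '<p class="a">one</p><p class="ab">two</p><p class="b a">three</p>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$content_node = $dom->getElementById('content_node');

$exact = getElementsByExactClass($content_node, 'p', 'a');
echo count($exact), "\n"; // 2: "one" and "three", but not class="ab"
```

Unlike the stripos() version, this comparison is case-sensitive.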

Solution 5 - Php

DOMDocument is slow to type and phpQuery has bad memory leak issues. I ended up using:

https://github.com/wasinger/htmlpagedom

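To select a class (note that the snippet below uses str_get_html() from the Simple HTML DOM library's simple_html_dom.php):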

include 'includes/simple_html_dom.php';

$doc = str_get_html($html);
$href = $doc->find('.lastPage')[0]->href;

I hope this helps someone else as well.

Solution 6 - Php

I prefer using Symfony for this. Their libraries are pretty nice.

Use the DomCrawler component.

Example:

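use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

$browser = new HttpBrowser(HttpClient::create());
$crawler = $browser->request('GET', 'https://example.com');
// filter() takes a CSS selector and requires the symfony/css-selector package
$class = $crawler->filter('.class')->first();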

Solution 7 - Php

PHP's native DOM handling is so absurdly bad; do yourself a favour and use this or any other modern HTML parsing package, which can handle this within a few lines:

Install paquettg/php-html-parser with

composer require paquettg/php-html-parser

Then create a .php file in the same folder with this content:

<?php

// load dependencies via Composer
require __DIR__ . '/vendor/autoload.php';

use PHPHtmlParser\Dom;

$dom = new Dom;
$dom->loadFromUrl("https://example.com");
$links = $dom->find('.classname a');

foreach ($links as $link) {
    echo $link->getAttribute('href');
}

P.S. You'll find information on how to install Composer on Composer's homepage.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Ben G | View Question on Stackoverflow
Solution 1 - Php | prodigitalson | View Answer on Stackoverflow
Solution 2 - Php | Tschallacka | View Answer on Stackoverflow
Solution 3 - Php | dav | View Answer on Stackoverflow
Solution 4 - Php | oabarca | View Answer on Stackoverflow
Solution 5 - Php | iautomation | View Answer on Stackoverflow
Solution 6 - Php | Unicco | View Answer on Stackoverflow
Solution 7 - Php | Sliq | View Answer on Stackoverflow