Parsing Huge XML Files in PHP

PHP, XML, Parsing, Large Files, DMOZ

Php Problem Overview


I'm trying to parse the DMOZ content/structures XML files into MySQL, but all the existing scripts for this are very old and don't work well. How can I open a large (1 GB+) XML file in PHP for parsing?

Php Solutions


Solution 1 - Php

Only two PHP APIs are really suited to processing large files: the old expat API (the xml_* functions) and the newer XMLReader class. Both read a continuous stream rather than loading the entire tree into memory, which is what SimpleXML and DOM do.

For an example, you might want to look at this partial parser of the DMOZ-catalog:

<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        $this->_parser = xml_parser_create("UTF-8");
        xml_set_object($this->_parser, $this);
        xml_set_element_handler($this->_parser, "startTag", "endTag");
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);

        if ($name == "TOPIC" && count($attribs)) {
            $this->_currentId = $attribs["R:ID"];
        }

        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

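    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Could not open {$this->_file}\n");
        }

        // Feed the parser in 4 KB chunks so the whole file is never in memory
        while (!feof($fh)) {
            $data = fread($fh, 4096);
            // The third argument tells expat when the final chunk has arrived
            if (!xml_parse($this->_parser, $data, feof($fh))) {
                die(sprintf(
                    "XML error: %s at line %d\n",
                    xml_error_string(xml_get_error_code($this->_parser)),
                    xml_get_current_line_number($this->_parser)
                ));
            }
        }

        fclose($fh);
    }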
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
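
For comparison, here is a rough sketch of the same filter written against XMLReader, the second API mentioned above. It is only an illustration under stated assumptions: it assumes the dump spells the elements as Topic (with an r:id attribute) and link (with an r:resource attribute). The expat parser upper-cases names by default, which is why the class above tests for TOPIC, LINK and R:ID, whereas XMLReader reports names exactly as they are written in the file. The function name dmozLinksUnder is made up for this example.

```php
<?php
/**
 * Yield the r:resource value of every <link> that follows a <Topic>
 * whose r:id starts with $prefix.
 * The element and attribute spellings are assumptions about the dump.
 */
function dmozLinksUnder(string $file, string $prefix): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }

    $currentId = '';
    // read() advances one node at a time; only the current node is in memory
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        if ($reader->name === 'Topic') {
            // Remember which topic we are in (empty string if r:id is missing)
            $currentId = (string) $reader->getAttribute('r:id');
        } elseif ($reader->name === 'link' && strpos($currentId, $prefix) === 0) {
            yield (string) $reader->getAttribute('r:resource');
        }
    }

    $reader->close();
}
```

It could be used as foreach (dmozLinksUnder("content.rdf.u8", "Top/Home/Consumer_Information/Electronics/") as $url) { echo $url . "\n"; }. Because it is a generator, URLs are produced as the file is read rather than collected first.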

Solution 2 - Php

This is a very similar question to https://stackoverflow.com/questions/1167062/best-way-to-process-large-xml-in-php/8250302, but this one has a well-upvoted answer that addresses the specific problem of parsing the DMOZ catalogue. However, since this page is a good Google hit for large XML files in general, I will repost my answer from the other question as well:

My take on it:

https://github.com/prewk/XmlStreamer

A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.

class SimpleXmlStreamer extends XmlStreamer {
    public function processNode($xmlString, $elementName, $nodeIndex) {
        $xml = simplexml_load_string($xmlString);
        
        // Do something with your SimpleXML object
        
        return true;
    }
}

$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();

Solution 3 - Php

I've recently had to parse some pretty large XML documents, and needed a method to read one element at a time.

If you have the following file complex-test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<Complex>
  <Object>
    <Title>Title 1</Title>
    <Name>It's name goes here</Name>
    <ObjectData>
      <Info1></Info1>
      <Info2></Info2>
      <Info3></Info3>
      <Info4></Info4>
    </ObjectData>
    <Date></Date>
  </Object>
  <Object></Object>
  <Object>
    <AnotherObject></AnotherObject>
    <Data></Data>
  </Object>
  <Object></Object>
  <Object></Object>
</Complex>

and want to return each <Object/> element in turn:

PHP:

require_once('class.chunk.php');
 
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
 
while ($xml = $file->read()) {
  $obj = simplexml_load_string($xml);
  // do some parsing, insert to DB whatever
}

###########
Class File
###########

<?php
/**
 * Chunk
 * 
 * Reads a large file in as chunks for easier parsing.
 * 
 * The chunks returned are whole <$this->options['element']/>s found within file.
 * 
 * Each call to read() returns the whole element including start and end tags.
 * 
 * Tested with a 1.8MB file, extracted 500 elements in 0.11s
 * (with no work done, just extracting the elements)
 * 
 * Usage:
 * <code>
 *   // initialize the object
 *   $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
 *   
 *   // loop through the file until all lines are read
 *   while ($xml = $file->read()) {
 *     // do whatever you want with the string
 *     $o = simplexml_load_string($xml);
 *   }
 * </code>
 * 
 * @package default
 * @author Dom Hastings
 */
class Chunk {
  /**
   * options
   *
   * @var array Contains all major options
   * @access public
   */
  public $options = array(
    'path' => './',       // string The path to check for $file in
    'element' => '',      // string The XML element to return
    'chunkSize' => 512    // integer The amount of bytes to retrieve in each chunk
  );
  
  /**
   * file
   *
   * @var string The filename being read
   * @access public
   */
  public $file = '';
  /**
   * pointer
   *
   * @var integer The current position the file is being read from
   * @access public
   */
  public $pointer = 0;
  
  /**
   * handle
   *
   * @var resource The fopen() resource
   * @access private
   */
  private $handle = null;
  /**
   * reading
   *
   * @var boolean Whether the script is currently reading the file
   * @access private
   */
  private $reading = false;
  /**
   * readBuffer
   * 
   * @var string Used to make sure start tags aren't missed
   * @access private
   */
  private $readBuffer = '';
  
  /**
   * __construct
   * 
   * Builds the Chunk object
   *
   * @param string $file The filename to work with
   * @param array $options The options with which to parse the file
   * @author Dom Hastings
   * @access public
   */
  public function __construct($file, $options = array()) {
    // merge the options together
    $this->options = array_merge($this->options, (is_array($options) ? $options : array()));
    
    // check that the path ends with a /
    if (substr($this->options['path'], -1) != '/') {
      $this->options['path'] .= '/';
    }
    
    // normalize the filename
    $file = basename($file);
    
    // make sure chunkSize is an int
    $this->options['chunkSize'] = intval($this->options['chunkSize']);
    
    // check it's valid
    if ($this->options['chunkSize'] < 64) {
      $this->options['chunkSize'] = 512;
    }
    
    // set the filename
    $this->file = realpath($this->options['path'].$file);
    
    // check the file exists
    if (!file_exists($this->file)) {
      throw new Exception('Cannot load file: '.$this->file);
    }
    
    // open the file
    $this->handle = fopen($this->file, 'r');
    
    // check the file opened successfully
    if (!$this->handle) {
      throw new Exception('Error opening file for reading');
    }
  }
  
  /**
   * __destruct
   * 
   * Cleans up
   *
   * @return void
   * @author Dom Hastings
   * @access public
   */
  public function __destruct() {
    // close the file resource
    fclose($this->handle);
  }
  
  /**
   * read
   * 
   * Reads the first available occurrence of the XML element $this->options['element']
   *
   * @return string The XML string from $this->file
   * @author Dom Hastings
   * @access public
   */
  public function read() {
    // check we have an element specified
    if (!empty($this->options['element'])) {
      // trim it
      $element = trim($this->options['element']);
      
    } else {
      $element = '';
    }
    
    // initialize the buffer
    $buffer = false;
    
    // if the element is empty
    if (empty($element)) {
      // let the script know we're reading
      $this->reading = true;
      
      // read in the whole doc, cos we don't know what's wanted
      while ($this->reading) {
        $buffer .= fread($this->handle, $this->options['chunkSize']);
        
        $this->reading = (!feof($this->handle));
      }
      
      // return it all
      return $buffer;
      
    // we must be looking for a specific element
    } else {
      // set up the strings to find
      $open = '<'.$element.'>';
      $close = '</'.$element.'>';
      
      // let the script know we're reading
      $this->reading = true;
      
      // reset the global buffer
      $this->readBuffer = '';
      
      // this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
      $store = false;
      
      // seek to the position we need in the file
      fseek($this->handle, $this->pointer);
      
      // start reading
      while ($this->reading && !feof($this->handle)) {
        // store the chunk in a temporary variable
        $tmp = fread($this->handle, $this->options['chunkSize']);
        
        // update the global buffer
        $this->readBuffer .= $tmp;
        
        // check for the open string
        $checkOpen = strpos($tmp, $open);
        
        // if it wasn't in the new buffer
        if (!$checkOpen && !($store)) {
          // check the full buffer (in case it was only half in this buffer)
          $checkOpen = strpos($this->readBuffer, $open);
          
          // if it was in there
          if ($checkOpen) {
            // set it to the remainder
            $checkOpen = $checkOpen % $this->options['chunkSize'];
          }
        }
        
        // check for the close string
        $checkClose = strpos($tmp, $close);
        
        // if it wasn't in the new buffer
        if (!$checkClose && ($store)) {
          // check the full buffer (in case it was only half in this buffer)
          $checkClose = strpos($this->readBuffer, $close);
          
          // if it was in there
          if ($checkClose) {
            // set it to the remainder plus the length of the close string itself
            $checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
          }
          
        // if it was
        } elseif ($checkClose) {
          // add the length of the close string itself
          $checkClose += strlen($close);
        }
        
        // if we've found the opening string and we're not already reading another element
        if ($checkOpen !== false && !($store)) {
          // if we've found the end element too
          if ($checkClose !== false) {
            // append the string only between the start and end element
            $buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
            
            // update the pointer
            $this->pointer += $checkClose;
            
            // let the script know we're done
            $this->reading = false;
            
          } else {
            // append the data we know to be part of this element
            $buffer .= substr($tmp, $checkOpen);
            
            // update the pointer
            $this->pointer += $this->options['chunkSize'];
            
            // let the script know we're gonna be storing all the data until we find the close element
            $store = true;
          }
          
        // if we've found the closing element
        } elseif ($checkClose !== false) {
          // update the buffer with the data up to and including the close tag
          $buffer .= substr($tmp, 0, $checkClose);
          
          // update the pointer
          $this->pointer += $checkClose;
          
          // let the script know we're done
          $this->reading = false;
          
        // if we've found the closing element, but half in the previous chunk
        } elseif ($store) {
          // update the buffer
          $buffer .= $tmp;
          
          // and the pointer
          $this->pointer += $this->options['chunkSize'];
        }
      }
    }
    
    // return the element (or the whole file if we're not looking for elements)
    return $buffer;
  }
}

Solution 4 - Php

I would suggest using a SAX-based parser rather than DOM-based parsing.

Info on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm

Solution 5 - Php

This isn't a great solution, but just to throw another option out there:

You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with would be).

e.g., if your doc looks like:

<dmoz>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  ...
</dmoz>

You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in a root level tag, and then load them via simplexml/domxml (I used domxml, when taking this approach).

Frankly, I prefer this approach if you're using PHP < 5.1.2. From 5.1.2 onward, XMLReader is enabled by default and is probably the best option, but before that you're stuck with either the chunking strategy above or the old SAX/expat library. And I don't know about the rest of you, but I HATE writing and maintaining SAX/expat parsers.

Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works great for any sort of list of files, or URLs, etc., but wouldn't make sense for parsing a large HTML document)
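
The chunk-and-wrap idea described above could look roughly like the following. This is a minimal sketch, not the code I actually used: it parses with SimpleXML rather than domxml, the function name readListings and the wrapper element are made up, and it assumes the listing elements are never nested inside each other, always have an explicit closing tag, carry no namespace prefixes, and that no other element name starts with the same text.

```php
<?php
/**
 * Stream $path in fixed-size blocks and yield each complete <$tag>
 * element as a SimpleXMLElement.
 */
function readListings(string $path, string $tag = 'listing', int $blockSize = 1048576): Generator
{
    $open  = '<' . $tag;
    $close = '</' . $tag . '>';

    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $buffer = '';
    while (!feof($fh)) {
        $buffer .= fread($fh, $blockSize);

        // First opening tag and last complete closing tag seen so far
        $start = strpos($buffer, $open);
        $end   = strrpos($buffer, $close);
        if ($start === false || $end === false || $start > $end) {
            continue; // no complete element yet, keep reading
        }
        $end += strlen($close);

        // Wrap the complete elements in an artificial root so they parse
        $xml    = '<wrapper>' . substr($buffer, $start, $end - $start) . '</wrapper>';
        $buffer = substr($buffer, $end); // keep the incomplete tail for the next round

        $doc = simplexml_load_string($xml);
        if ($doc === false) {
            throw new RuntimeException('Chunk is not well-formed XML');
        }
        foreach ($doc->{$tag} as $element) {
            yield $element;
        }
    }

    fclose($fh);
}
```

Memory use is bounded by the block size plus the largest single element, since everything up to the last complete closing tag is parsed and discarded on each pass.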

Solution 6 - Php

This is an old post, but it's the first Google search result, so I thought I'd post another solution based on this post:

http://drib.tech/programming/parse-large-xml-files-php

This solution uses both XMLReader and SimpleXMLElement :

$xmlFile = 'the_LARGE_xml_file_to_load.xml';
$primEL  = 'the_name_of_your_element';

$xml     = new XMLReader();
$xml->open($xmlFile);

// finding first primary element to work with
while($xml->read() && $xml->name != $primEL){;}

// looping through elements
while($xml->name == $primEL) {
    // loading element data into simpleXML object
    $element = new SimpleXMLElement($xml->readOuterXML());

    // DO STUFF

    // moving pointer
    $xml->next($primEL);
    // clearing current element
    unset($element);
} // end while

$xml->close();

Solution 7 - Php

You can combine XMLReader with DOM for this. In PHP both APIs (and SimpleXML) are based on the same library, libxml2. Large XML files are typically a list of records, so you use XMLReader to iterate over the records, load a single record into DOM, and use DOM methods and XPath to extract values. The key is the method XMLReader::expand(). It loads the current node in an XMLReader instance and its descendants as DOM nodes.

Example XML:

<books>
  <book>
    <title isbn="978-0596100087">XSLT 1.0 Pocket Reference</title>
  </book>
  <book>
    <title isbn="978-0596100506">XML Pocket Reference</title>
  </book>
  <!-- ... -->
</books>

Example code:

// open the XML file
$reader = new XMLReader();
$reader->open('books.xml');

// prepare a DOM document
$document = new DOMDocument();
$xpath = new DOMXpath($document);

// find the first `book` element node at any depth
while ($reader->read() && $reader->localName !== 'book') {
  continue;
}

// as long as there is a node with the name "book"
while ($reader->localName === 'book') {
  // expand the node into the prepared DOM
  $book = $reader->expand($document);
  // use Xpath expressions to fetch values
  var_dump(
    $xpath->evaluate('string(title/@isbn)', $book),
    $xpath->evaluate('string(title)', $book)
  );
  // move to the next book sibling node
  $reader->next('book');
}
$reader->close();

Note that the expanded node is never appended to the DOM document. This allows the garbage collector to clean it up.

This approach works with XML namespaces as well.

$namespaceURI = 'urn:example-books';

$reader = new XMLReader();
$reader->open('books.xml');

$document = new DOMDocument();
$xpath = new DOMXpath($document);
// register a prefix for the Xpath expressions
$xpath->registerNamespace('b', $namespaceURI);

// compare local node name and namespace URI
while (
  $reader->read() &&
  (
    $reader->localName !== 'book' ||
    $reader->namespaceURI !== $namespaceURI
  )
) {
  continue;
}

// iterate the book elements 
while ($reader->localName === 'book') {
  // validate that they are in the namespace
  if ($reader->namespaceURI === $namespaceURI) {
    $book = $reader->expand($document);
    var_dump(
      $xpath->evaluate('string(b:title/@isbn)', $book),
      $xpath->evaluate('string(b:title)', $book)
    );
  }
  $reader->next('book');
}
$reader->close();

Solution 8 - Php

I've written a wrapper for XMLReader to (IMHO) make it easier to get just the bits you're after. The wrapper lets you associate a set of paths to data elements with callbacks that run when a path is found. A path can be a regular expression, and its capture groups can also be passed to the callback.

The library is at https://github.com/NigelRel3/XMLReaderReg and can also be installed using composer require nigelrel3/xml-reader-reg.

An example of how to use it...

$inputFile = __DIR__ ."/../tests/data/simpleTest1.xml";
$reader = new XMLReaderReg\XMLReaderReg();
$reader->open($inputFile);

$reader->process([
    '(.*/person(?:\[\d*\])?)' => function (SimpleXMLElement $data, $path): void {
        echo "1) Value for ".$path[1]." is ".PHP_EOL.
            $data->asXML().PHP_EOL;
    },
    '(.*/person3(\[\d*\])?)' => function (DOMElement $data, $path): void {
        echo "2) Value for ".$path[1]." is ".PHP_EOL.
            $data->ownerDocument->saveXML($data).PHP_EOL;
    },
    '/root/person2/firstname' => function (string $data): void {
        echo "3) Value for /root/person2/firstname is ". $data.PHP_EOL;
    }
    ]);

$reader->close();

As the example shows, the data can be passed to the callback as a SimpleXMLElement, a DOMElement, or (as in the last callback) a string. In each case it contains only the data that matches the path.

The paths also show how capture groups can be used - (.*/person(?:\[\d*\])?) looks for any person element (including arrays of elements) and $path[1] in the callback displays the path where this particular instance is found.

There is an expanded example in the library as well as unit tests.

Solution 9 - Php

I tested the following code with a 2 GB XML file:

<?php
set_time_limit(0);
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
    die("Failed to open 'data.xml'");
}
while($reader->read())
{
    $node = $reader->expand();
    // process $node...
}
$reader->close();
?>

Solution 10 - Php

My solution:

$reader = new XMLReader();
$reader->open($fileTMP);
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'xmltag' && $reader->isEmptyElement === false) {
        $item = simplexml_load_string($reader->readOuterXML(), null, LIBXML_NOCDATA);
        // operations on file
    }
}
$reader->close();

Solution 11 - Php

A very fast approach is

preg_split('/(<|>)/m', $xmlString);

After that, a single loop over the resulting pieces is all you need.
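
As a rough illustration of what that single loop might look like: because the string is split on every < and >, the pieces at odd indexes are tag contents and the pieces at even indexes are the text between tags. The sketch below is hypothetical (the function name tokenizeXml and the event format are made up). It assumes the XML string already fits in memory and contains no comments, CDATA sections, or literal < or > characters inside attribute values.

```php
<?php
/**
 * Turn an XML string into a flat list of [type, value] events
 * using preg_split and one pass over the pieces.
 */
function tokenizeXml(string $xmlString): array
{
    $parts  = preg_split('/(<|>)/m', $xmlString);
    $events = [];

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Even index: text between two tags (skip pure whitespace)
            if (trim($part) !== '') {
                $events[] = ['text', $part];
            }
            continue;
        }

        // Odd index: whatever stood between '<' and '>'
        if ($part === '' || $part[0] === '?' || $part[0] === '!') {
            continue; // XML declaration or DOCTYPE, ignored in this sketch
        }
        if ($part[0] === '/') {
            $events[] = ['close', substr($part, 1)];
            continue;
        }

        // The element name ends at the first whitespace or slash
        $name     = preg_split('/[\s\/]+/', $part)[0];
        $events[] = ['open', $name];
        if (substr($part, -1) === '/') {
            $events[] = ['close', $name]; // self-closing element
        }
    }

    return $events;
}
```

Attributes are left unparsed here; for an opening tag they are still available in $part after the element name.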

Attributions

All content on this page is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Ian | View Question on Stackoverflow
Solution 1 - Php | Emil H | View Answer on Stackoverflow
Solution 2 - Php | oskarth | View Answer on Stackoverflow
Solution 3 - Php | Mihir Rawal | View Answer on Stackoverflow
Solution 4 - Php | Tetsujin no Oni | View Answer on Stackoverflow
Solution 5 - Php | Frank Farmer | View Answer on Stackoverflow
Solution 6 - Php | Szekelygobe | View Answer on Stackoverflow
Solution 7 - Php | ThW | View Answer on Stackoverflow
Solution 8 - Php | Nigel Ren | View Answer on Stackoverflow
Solution 9 - Php | Alex | View Answer on Stackoverflow
Solution 10 - Php | Emil Dworniczak | View Answer on Stackoverflow
Solution 11 - Php | Nikolay Gechev | View Answer on Stackoverflow