Parse and Traverse HTML Documents

This section provides a tutorial example on how to parse an HTML document into a DOMDocument object and traverse it as a DOMNode object tree.

If you have an existing HTML document, you can use the DOMDocument::loadHTML() method to parse it into a DOMDocument object.
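As a minimal sketch of this step, the snippet below parses a small HTML string and looks at the root element. The sample markup and variable names are made up for illustration; libxml_use_internal_errors() is used only to keep parser warnings off the screen.

```php
<?php
// Hypothetical sample markup, used only for illustration
$html = '<html><head><title>Demo</title></head>'
      . '<body><p id="greet">Hello</p></body></html>';

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$ok = $doc->loadHTML($html);   // true on success, false on failure
libxml_clear_errors();

// documentElement is the root element of the parsed document
$root = $doc->documentElement;
print("Parsed: " . ($ok ? "yes" : "no") . "\n");
print("Root element: " . $root->nodeName . "\n");
```

If the HTML is stored in a file, DOMDocument::loadHTMLFile() can read and parse it in one call.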

A DOMDocument object is also a DOMNode object, which presents attributes and child elements as child DOMNode objects to form a DOMNode object tree.
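The relationship can be checked with a short sketch like the one below, which assumes a made-up one-paragraph document. It shows that the document, an attribute, and a text child are all DOMNode objects, distinguished by their nodeType values.

```php
<?php
// Hypothetical one-paragraph document, used only for illustration
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p class="note">Hi</p></body></html>');

// The document itself is a node
var_dump($doc instanceof DOMNode);

$p    = $doc->getElementsByTagName('p')->item(0);  // DOMElement for <p>
$attr = $p->attributes->item(0);                   // DOMAttr for class="note"
$text = $p->childNodes->item(0);                   // DOMText for "Hi"

// Attribute and text objects are nodes too, told apart by nodeType
var_dump($attr instanceof DOMNode, $attr->nodeType === XML_ATTRIBUTE_NODE);
var_dump($text instanceof DOMNode, $text->nodeType === XML_TEXT_NODE);
print($attr->nodeName . "=" . $attr->nodeValue
  . ", text=" . $text->nodeValue . "\n");
```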

If you want to traverse the DOMNode object tree, you can use DOMNode properties such as nodeName, nodeType, nodeValue, attributes, and childNodes.
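Before looking at a full recursive traversal, here is a small sketch that visits only one level of the tree. It assumes a made-up document with two paragraphs and uses the DOMNode properties nodeName, nodeType, and childNodes, plus hasAttributes(), firstChild, and nextSibling, which are also part of DOMNode but are not used elsewhere in this section.

```php
<?php
// Hypothetical two-paragraph document, used only for illustration
$doc = new DOMDocument();
$doc->loadHTML('<html><body><p id="a">One</p><p>Two</p></body></html>');
$body = $doc->getElementsByTagName('body')->item(0);

// Option 1: iterate over the childNodes list
foreach ($body->childNodes as $child) {
  print($child->nodeName . " type=" . $child->nodeType
    . " children=" . $child->childNodes->length
    . " hasAttributes=" . ($child->hasAttributes() ? "yes" : "no") . "\n");
}

// Option 2: follow the firstChild and nextSibling links
$names = array();
for ($n = $body->firstChild; $n !== null; $n = $n->nextSibling) {
  $names[] = $n->nodeName;
}
print(implode(", ", $names) . "\n");
```

Both loops visit the same nodes in document order; the foreach form is usually the shorter one to write.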

Here is a PHP script example, Traverses-HTML-DOM-Tree.php, that prints out information about all nodes in a given HTML document, including elements, their attributes, and text.

<?php
#  Traverses-HTML-DOM-Tree.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

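  # Read the HTML file named by the first command line argument
  $file = $argv[1];
  $html = file_get_contents($file);

  # Parse the HTML text and print the tree, starting at the document node
  $doc = new DOMDocument();
  $doc->loadHTML($html);
  printNode($doc, "");

# Prints one line for $node in this format:
#   name= attribute count, child count, node type kind: (value)
# then recurses into its attributes (prefixed with "@") and its children.
# A count of -1 means the property is not set on this node.
function printNode($node, $prefix) {
  $attriLen = -1;
  $attriList = $node->attributes;
  if (isset($attriList)) $attriLen = $attriList->length;

  $childLen = -1;
  $childList = $node->childNodes;
  if (isset($childList)) $childLen = $childList->length;

  print($prefix . $node->nodeName . "= "
    . $attriLen . ", " . $childLen . ", " . $node->nodeType . " ");
  if ($node->nodeType==XML_TEXT_NODE)
    print("text: (" . $node->nodeValue . ")\n");
  else if ($node->nodeType==XML_ATTRIBUTE_NODE)
    print("attribute: (" . $node->nodeValue . ")\n");
  else
    print("other: (...)\n");

  # Attributes first, then child nodes, each one level deeper
  if (isset($attriList)) foreach ($attriList as $n)
    printNode($n, $prefix." @");
  if (isset($childList)) foreach ($childList as $n)
    printNode($n, $prefix." ");
}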
?>

Here is an XHTML document example, XHTML-Example.html, which is also an HTML document.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   "DTD/xhtml1-strict.dtd">
<!-- XHTML-Example.html
 - Copyright (c) 2009 HerongYang.com. All Rights Reserved.
-->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Strict DTD XHTML Example </title>
</head>
<body>
<p>Please Choose a Day:<br/><br/>
<select name="day">
  <option selected="selected">Monday</option>
  <option>Tuesday</option>
  <option>Wednesday</option>
</select>
</p>
</body>
</html>

Now run the PHP script on the HTML document file. You should see the following output.

herong> php Traverses-HTML-DOM-Tree.php XHTML-Example.html

#document= -1, 4, 13 other: (...)
 html= -1, -1, 10 other: (...)
 xml= -1, -1, 7 other: (...)
 #comment= -1, -1, 8 other: (...)
 html= 3, 5, 1 other: (...)
  @xmlns= -1, 1, 2 attribute: (http://www.w3.org/1999/xhtml)
  @ #text= -1, -1, 3 text: (http://www.w3.org/1999/xhtml)
  @xml:lang= -1, 1, 2 attribute: (en)
  @ #text= -1, -1, 3 text: (en)
  @lang= -1, 1, 2 attribute: (en)
  @ #text= -1, -1, 3 text: (en)
  #text= -1, -1, 3 text: (
)
  head= 0, 3, 1 other: (...)
   #text= -1, -1, 3 text: (
)
   title= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (Strict DTD XHTML Example )
   #text= -1, -1, 3 text: (
)
  #text= -1, -1, 3 text: (
)
  body= 0, 3, 1 other: (...)
   #text= -1, -1, 3 text: (
)
   p= 0, 6, 1 other: (...)
    #text= -1, -1, 3 text: (Please Choose a Day:)
    br= 0, 0, 1 other: (...)
    br= 0, 0, 1 other: (...)
    #text= -1, -1, 3 text: (
)
    select= 1, 7, 1 other: (...)
     @name= -1, 1, 2 attribute: (day)
     @ #text= -1, -1, 3 text: (day)
     #text= -1, -1, 3 text: (
  )
     option= 1, 1, 1 other: (...)
      @selected= -1, 1, 2 attribute: (selected)
      @ #text= -1, -1, 3 text: (selected)
      #text= -1, -1, 3 text: (Monday)
     #text= -1, -1, 3 text: (
  )
     option= 0, 1, 1 other: (...)
      #text= -1, -1, 3 text: (Tuesday)
     #text= -1, -1, 3 text: (
  )
     option= 0, 1, 1 other: (...)
      #text= -1, -1, 3 text: (Wednesday)
     #text= -1, -1, 3 text: (
)
    #text= -1, -1, 3 text: (
)
   #text= -1, -1, 3 text: (
)
  #text= -1, -1, 3 text: (
)

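The output gives us a clear picture of how an HTML document is represented as a DOMNode object tree. Each line shows the node name, followed by the number of attributes, the number of child nodes, and the node type. A count of -1 means that the attributes or childNodes property is not set for that node. Note that each attribute node (type 2) has a #text child node holding its value, and that the line breaks between tags are kept as #text nodes (type 3).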
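If only the element structure is of interest, the same recursive idea can be reduced to a sketch like the one below. It assumes a made-up sample document, and elementOutline() is a hypothetical helper name, not part of the DOM extension. It skips every node whose nodeType is not XML_ELEMENT_NODE, so whitespace text nodes, comments, and the DOCTYPE do not appear.

```php
<?php
// elementOutline() is a hypothetical helper, not a DOM extension function.
// It returns one indented line per element below $node, in document order.
function elementOutline(DOMNode $node, int $depth = 0): array {
  $lines = array();
  foreach ($node->childNodes as $child) {
    // Skip text, comment, DOCTYPE, and other non-element nodes
    if ($child->nodeType !== XML_ELEMENT_NODE) continue;
    $lines[] = str_repeat(" ", $depth) . $child->nodeName;
    // Recurse into the element, one indentation level deeper
    $lines = array_merge($lines, elementOutline($child, $depth + 1));
  }
  return $lines;
}

// Hypothetical sample document, used only for illustration
$doc = new DOMDocument();
$doc->loadHTML('<html><head><title>T</title></head>'
  . '<body><p>Hi<br></p></body></html>');

$outline = elementOutline($doc);
print(implode("\n", $outline) . "\n");
```

Returning the lines as an array instead of printing inside the function keeps the traversal reusable, for example for counting elements or comparing two documents.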
