PHP Modules Tutorials - Herong's Tutorial Examples - v5.18, by Herong Yang
Parse and Traverse HTML Documents
This section provides a tutorial example on how to parse an HTML document into a DOMDocument object and traverse it as a DOMNode object tree.
If you have an existing HTML document, you can use the DOMDocument::loadHTML() method to parse it into a DOMDocument object.
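As a minimal sketch of that call, the snippet below parses a short HTML string and reads back its title. The sample markup and the variable names are made up for illustration; libxml_use_internal_errors() is used here only to keep parser warnings off the console.

```php
<?php
// Sample markup, invented for this sketch.
$html = '<html><head><title>Demo</title></head>'
      . '<body><p id="greet">Hello</p></body></html>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$ok = $doc->loadHTML($html);   // returns true on success
libxml_clear_errors();

// Read something back to confirm the document was parsed.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
print(($ok ? "parsed" : "failed") . ": " . $title . "\n");
```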
A DOMDocument object is also a DOMNode object. It exposes attributes and child nodes as DOMNode objects, which together form a DOMNode object tree.
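The relationship can be checked with a few lines of PHP. This is only a sketch with invented markup: it confirms that the document is a DOMNode and that an element reports its attributes and children through the attributes and childNodes properties.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><p class="note" id="p1">Hi <b>there</b></p></body></html>'
);

// DOMDocument extends DOMNode, so the document is itself a tree node.
$isNode = $doc instanceof DOMNode;

// Look at one element: its attributes and children are nodes too.
$p = $doc->getElementsByTagName('p')->item(0);
$attrCount  = $p->attributes->length;    // class and id
$childCount = $p->childNodes->length;    // text "Hi " and element b
$firstAttr  = $p->attributes->item(0);   // a DOMAttr, also a DOMNode

print("document is DOMNode: " . ($isNode ? "yes" : "no") . "\n");
print("p attributes: $attrCount, p children: $childCount\n");
print("first attribute: " . $firstAttr->nodeName . "\n");
```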
If you want to traverse the DOMNode object tree, you can use DOMNode properties such as nodeName, nodeType, nodeValue, attributes, and childNodes, all of which are used by the script in this section.
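A minimal sketch of such a traversal follows. It assumes the firstChild and nextSibling properties of DOMNode as an alternative to looping over childNodes; the helper name collectElementNames() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: collects element names in document order.
function collectElementNames(DOMNode $node, array &$names): void {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $names[] = $node->nodeName;
    }
    // firstChild is null for nodes without children, which ends the loop.
    for ($child = $node->firstChild; $child !== null;
         $child = $child->nextSibling) {
        collectElementNames($child, $names);
    }
}

$doc = new DOMDocument();
$doc->loadHTML(
    '<html><head><title>T</title></head>'
    . '<body><p>One<br>Two</p></body></html>'
);

$names = [];
collectElementNames($doc, $names);
print(implode(" ", $names) . "\n");
```

Checking nodeType before recording the name skips the document, DOCTYPE, and text nodes, so only elements end up in the list.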
Here is a PHP script example, Traverses-HTML-DOM-Tree.php, that prints out information about all elements and their attributes in a given HTML document.
<?php
# Traverses-HTML-DOM-Tree.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

   # Parse the HTML file named on the command line
   $file = $argv[1];
   $html = file_get_contents($file);
   $doc = new DOMDocument();
   $doc->loadHTML($html);

   # Walk the tree, starting at the document node
   printNode($doc, "");

# Prints one line per node: name= attribute count, child count, node type.
# A count of -1 means the node has no attribute list or child list.
function printNode($node, $prefix) {
   $attriLen = -1;
   $attriList = $node->attributes;
   if (isset($attriList)) $attriLen = $attriList->length;

   $childLen = -1;
   $childList = $node->childNodes;
   if (isset($childList)) $childLen = $childList->length;

   print($prefix . $node->nodeName . "= "
      . $attriLen . ", " . $childLen . ", " . $node->nodeType . " ");

   # Show the value only for text and attribute nodes
   if ($node->nodeType==XML_TEXT_NODE)
      print("text: (" . $node->nodeValue . ")\n");
   else if ($node->nodeType==XML_ATTRIBUTE_NODE)
      print("attribute: (" . $node->nodeValue . ")\n");
   else
      print("other: (...)\n");

   # Attributes are marked with "@"; children get one more space of indent
   if (isset($attriList)) foreach ($attriList as $n)
      printNode($n, $prefix." @");
   if (isset($childList)) foreach ($childList as $n)
      printNode($n, $prefix." ");
}
?>
Here is an XHTML document example, XHTML-Example.html, which is also an HTML document.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<!-- XHTML-Example.html - Copyright (c) 2009 HerongYang.com. All Rights Reserved. -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Strict DTD XHTML Example </title>
</head>
<body>
<p>Please Choose a Day:<br/><br/>
<select name="day">
<option selected="selected">Monday</option>
<option>Tuesday</option>
<option>Wednesday</option>
</select>
</p>
</body>
</html>
Now run the PHP script on the HTML document file. You should see the following output.
herong> php Traverses-HTML-DOM-Tree.php XHTML-Example.html
#document= -1, 4, 13 other: (...)
 html= -1, -1, 10 other: (...)
 xml= -1, -1, 7 other: (...)
 #comment= -1, -1, 8 other: (...)
 html= 3, 5, 1 other: (...)
  @xmlns= -1, 1, 2 attribute: (http://www.w3.org/1999/xhtml)
  @ #text= -1, -1, 3 text: (http://www.w3.org/1999/xhtml)
  @xml:lang= -1, 1, 2 attribute: (en)
  @ #text= -1, -1, 3 text: (en)
  @lang= -1, 1, 2 attribute: (en)
  @ #text= -1, -1, 3 text: (en)
  #text= -1, -1, 3 text: (
)
  head= 0, 3, 1 other: (...)
   #text= -1, -1, 3 text: (
)
   title= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (Strict DTD XHTML Example )
   #text= -1, -1, 3 text: (
)
  #text= -1, -1, 3 text: (
)
  body= 0, 3, 1 other: (...)
   #text= -1, -1, 3 text: (
)
   p= 0, 6, 1 other: (...)
    #text= -1, -1, 3 text: (Please Choose a Day:)
    br= 0, 0, 1 other: (...)
    br= 0, 0, 1 other: (...)
    #text= -1, -1, 3 text: (
)
    select= 1, 7, 1 other: (...)
     @name= -1, 1, 2 attribute: (day)
     @ #text= -1, -1, 3 text: (day)
     #text= -1, -1, 3 text: (
)
     option= 1, 1, 1 other: (...)
      @selected= -1, 1, 2 attribute: (selected)
      @ #text= -1, -1, 3 text: (selected)
      #text= -1, -1, 3 text: (Monday)
     #text= -1, -1, 3 text: (
)
     option= 0, 1, 1 other: (...)
      #text= -1, -1, 3 text: (Tuesday)
     #text= -1, -1, 3 text: (
)
     option= 0, 1, 1 other: (...)
      #text= -1, -1, 3 text: (Wednesday)
     #text= -1, -1, 3 text: (
)
    #text= -1, -1, 3 text: (
)
   #text= -1, -1, 3 text: (
)
  #text= -1, -1, 3 text: (
)
The output gives us a clear picture of how an HTML document is represented as a DOMNode object tree.
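Many of the #text entries in the output hold nothing but a line break from the source file. As a sketch only, the hypothetical outline() function below prints a shorter view of the tree by skipping text nodes that are empty after trim(). The function name and the sample markup are assumptions made for this illustration.

```php
<?php
// Hypothetical variant of a tree printer: lists elements and
// non-blank text, ignoring whitespace-only text nodes.
function outline(DOMNode $node, string $prefix = ""): string {
    $out = "";
    if ($node->nodeType === XML_TEXT_NODE) {
        $text = trim($node->nodeValue);
        if ($text === "") return "";   // layout-only text, skip it
        return $prefix . "#text: " . $text . "\n";
    }
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $out .= $prefix . $node->nodeName . "\n";
    }
    // hasChildNodes() guards nodes that carry no child list.
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $out .= outline($child, $prefix . " ");
        }
    }
    return $out;
}

$doc = new DOMDocument();
$doc->loadHTML(
    "<html><head><title>Days</title></head>\n"
    . "<body>\n<p>Monday</p>\n<p>Tuesday</p>\n</body></html>"
);
$out = outline($doc);
print($out);
```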