Using XML Part 6 – Validation

This series has so far focused on XML technologies and how they can be utilised using PHP 5. A subject we have not yet touched upon is XML validation. This article explores the application-independent XML validation standards: DTDs, the XML Schema language and the XSLT-based Schematron language. I will demonstrate how to validate XML in PHP and how PHP 5's XSL extension can be used to validate XML using the Schematron language.
Throughout this article I will be using the Library XML seen in part 1 of this series. An up-to-date, namespaced example including a doctype is contained within the ZIP file accompanying this article.
The XML standard states that an XML document must be well formed: among other rules, attribute values must be enclosed in double or single quotes and character data must be escaped accordingly. A document which does not conform to these constraints is not an XML document, and this simple structural check is carried out by the majority of XML parsers. This is, however, by no means the be-all and end-all of XML data validation. To draw an analogy with the real world, a well-formed XML document is like a well-built building. It is structurally sound, but if the only requirement is that it is a building, how does one know that it contains the right rooms, the required number of floors and the correct décor? If the building is not to the correct specification, it may as well be a pile of rubble.
Back to XML: being well formed is simply not enough to make an XML document valid. You must ensure that the elements and data you need are all present, appear in the correct order and contain the values you expect. This validation can be carried out at the application level. For example, in PHP the DOM API can be used to check the document structure.

PHP 5:

$categories = array();
$XMLCategories = $xml->getElementsByTagName('categories')->item(0);

if ($XMLCategories) {
    foreach ($XMLCategories->getElementsByTagName('category') as $categoryNode) {
        /* notice how we get attributes */
        $cid = $categoryNode->getAttribute('id');

        if (!$cid) continue; // if the id attribute is not present, ignore the element

        $categories[$cid] = $categoryNode->firstChild->nodeValue;
    }
} else {
    die('No Categories Found');
}
The above example uses the sample code from the first article in the series and checks that the categories element contains category elements and that each one contains an id attribute. While this validation works, it destroys the portable nature of XML and forces applications to agree on a validation standard before using the XML data.

Validation using Document Type Definitions (DTDs)
DTDs were included as part of the original XML 1.0 specification. The DTD has been around since the days of SGML (the standard in which XML has its roots) and, as the name suggests, a DTD defines the structure of a document: the elements that can appear, the order in which they can appear, the attributes they can have and the data they can contain.
A DTD is referenced in the document type declaration at the top of an XML document, or can be included inside the document type declaration as an in-line DTD:
<?xml version="1.0"?>
<!DOCTYPE library SYSTEM "library.dtd" [
<!ENTITY % nsp "lib:">

<!ENTITY % nss ":lib">
] >
The above document type declaration references an external DTD, library.dtd, and also includes its own declarations (the internal subset). Entity and attribute declarations made in the internal subset take precedence over declarations of the same name in the external DTD. In the above example, the parameter entity nsp is declared in the internal subset, which thus overrides any declaration of that entity in the external DTD.
The external DTD which defines the library XML is as follows:

<?xml version="1.0" encoding="iso-8859-1" ?>
<!-- An external DTD should begin with a text declaration such as the one above -->
 
<!--
DTDs have no built-in support for namespaces. However, a namespace prefix and suffix can be
declared as entities in the DTD, which can then be overridden in the DOCTYPE declaration or by
changing the DTD.
-->

<!-- The namespace prefix should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nsp "" >
 
<!-- The namespace suffix should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nss "" >
 
<!-- xmlns entity declaration -->
<!ENTITY % nsdec "xmlns%nss;" >
 
<!-- each POSSIBLE namespaced element is now declared as an entity -->
<!ENTITY % library "%nsp;library" >
<!ENTITY % categories "%nsp;categories" >
<!ENTITY % authors "%nsp;authors" >
<!ENTITY % books "%nsp;books" >
<!ENTITY % book "%nsp;book" >
<!ENTITY % category "%nsp;category" >
<!ENTITY % author "%nsp;author" >
<!ENTITY % title "%nsp;title" >
<!ENTITY % publisher "%nsp;publisher" >
<!ENTITY % cover "%nsp;cover" >
<!ENTITY % synopsis "%nsp;synopsis" >
 
<!-- element and attribute list declarations now use the element
entities declared above to define the XML document -->

<!ELEMENT %library; (%categories;, %authors;, %books;) >
 
<!-- allow for the possible declaration of an XML schema, which must use the XSI namespace -->
<!ATTLIST %library;
%nsdec; CDATA #FIXED "http://www.phpbuilder.com/adam_delves/library_xml"
xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation CDATA #IMPLIED >

 
<!-- the categories element must contain at least one category element, indicated by + -->
<!ELEMENT %categories; ((%category;)+) >
 
<!ELEMENT %category; (#PCDATA) >
<!-- the ID attribute is declared as optional as the category element also appears as a child
element of the book element, where it should reference a valid ID from a category element
in the categories section
-->

<!ATTLIST %category; id CDATA #IMPLIED >
 
<!-- the authors element must contain at least 1 author element, indicated by + -->
<!ELEMENT %authors; ((%author;)+) >
 
<!-- the ID attribute is declared as optional as the author element also appears as a child
element of the book element, where it should reference a valid ID from an author element
in the authors section -->

<!ELEMENT %author; (#PCDATA) >
<!ATTLIST %author; id CDATA #IMPLIED >
 
<!-- the books element may contain 0 or more book elements, indicated by the * -->
<!ELEMENT %books; ((%book;)*) >
 
<!-- a sequence of elements is declared using commas; the elements must appear in the same order as
specified in the sequence. The cover and synopsis elements (indicated by a ?) may occur only
once, but their occurrence is optional -->

<!ELEMENT %book; ((%title;),(%publisher;),(%category;)+,(%author;)+,(%cover;)?,(%synopsis;)?) >
 
<!-- the isbn attribute is required. The hascover attribute is optional, but when specified may contain
only the values yes and no; no is the default -->

<!ATTLIST %book;
isbn CDATA #REQUIRED
hascover (yes | no) "no" >

 
<!-- these elements may only contain character data, comments and processing instructions -->
<!ELEMENT %title; (#PCDATA) >
<!ELEMENT %publisher; (#PCDATA) >
<!ELEMENT %cover; (#PCDATA) >
<!ELEMENT %synopsis; (#PCDATA) >
DTDs are not without their shortcomings. The library XML DTD demonstrates a few of these:

Validating against a DTD in PHP
There are two ways in which a document can be validated against a DTD using PHP 5's DOM extension. The first and preferred way is to validate it as it is parsed. This involves setting a flag before the XML is loaded:

PHP 5:

$library = new DOMDocument("1.0");
$library->validateOnParse = true;

libxml_clear_errors();

// $file holds the path to the XML document
if (!$library->load($file)) {
    die('Error Loading Document');
}

if (libxml_get_last_error()) {
    die('Error Parsing Document');
}


The validateOnParse property causes the DOMDocument object to do exactly that. Notice how the libxml functions are used to check for validation errors: although it arguably should, the load() function does not return false if DTD validation fails. In order to have entities replaced and default attribute values set, pass the appropriate libxml constants (for example LIBXML_NOENT and LIBXML_DTDATTR) as the second, optional argument of the load() function.
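As a rough sketch of that last point, the example below parses a small, made-up document with an in-line DTD and passes LIBXML_DTDATTR and LIBXML_NOENT as the options argument. The book markup, the pub entity and its value are assumptions for illustration only. loadXML() is used so that the snippet is self-contained; it accepts the same options argument as load().

```php
<?php
// Hypothetical document: the in-line DTD declares a default attribute
// value and a custom entity (both invented for this sketch).
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE book [
<!ELEMENT book (#PCDATA)>
<!ATTLIST book hascover (yes | no) "no">
<!ENTITY pub "Example Press">
]>
<book>&pub;</book>
XML;

$doc = new DOMDocument();

// LIBXML_DTDATTR adds default attribute values declared in the DTD,
// LIBXML_NOENT substitutes entity references with their replacement text.
$doc->loadXML($xml, LIBXML_DTDATTR | LIBXML_NOENT);

$book = $doc->documentElement;
$hasCover = $book->getAttribute('hascover');
$publisher = $book->textContent;

echo "hascover: $hasCover\n";
echo "publisher: $publisher\n";
```

Because LIBXML_NOENT also permits substitution of external entities, it is best reserved for documents from a trusted source.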
The second way is to use the validate() method of the DOMDocument object. Because the XML document has already been loaded, entity replacements cannot be carried out when using the validate() method. In particular, any entity declarations contained within the XML document type declaration are ignored and do not override the external declarations.
 

PHP 5:

$library = new DOMDocument("1.0");

$library->load('library.xml');

// validate() returns false when the document does not conform to its DTD
if (!$library->validate()) {
    die('DTD Validation failure.');
}
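
The validate() method only returns a boolean. One possible way to find out why a document failed, sketched here on the assumption that collecting libxml's errors internally suits your application, is to enable libxml_use_internal_errors() and read the messages afterwards. The miniature DTD and document are invented for the example and are not the library XML.

```php
<?php
// Hypothetical document: the book element lacks its required isbn attribute.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE books [
<!ELEMENT books (book)*>
<!ELEMENT book (title)>
<!ATTLIST book isbn CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
]>
<books><book><title>Learning XML</title></book></books>
XML;

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
libxml_clear_errors();

$doc = new DOMDocument();
$doc->loadXML($xml);

// Validate the already-loaded document against its DTD.
$isValid = $doc->validate();
$errors = libxml_get_errors();
libxml_clear_errors();

if (!$isValid) {
    foreach ($errors as $error) {
        echo 'Validation error: ', trim($error->message), "\n";
    }
}
```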

A document can also be validated against its DTD when loading XML into a SimpleXML object, using the LIBXML constants:

PHP 5:

libxml_clear_errors();
$library = simplexml_load_file('library.xml', 'SimpleXMLElement', LIBXML_DTDVALID);

if (libxml_get_last_error()) {
    die('Error validating / loading XML');
}
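
As with the DOM extension, a validity failure does not stop SimpleXML from returning an object, which is why the libxml error functions are checked. The following self-contained sketch illustrates this with simplexml_load_string() and an invented in-line DTD; the variable names are arbitrary.

```php
<?php
// Hypothetical document: the DTD requires a title child, but book is empty.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE book [
<!ELEMENT book (title)>
<!ELEMENT title (#PCDATA)>
]>
<book></book>
XML;

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
libxml_clear_errors();

$book = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_DTDVALID);
$errors = libxml_get_errors();
libxml_clear_errors();

// The document is well formed, so it loads; the DTD errors are reported separately.
$loaded = ($book !== false);
$dtdValid = (count($errors) === 0);

echo 'Loaded: ', $loaded ? 'yes' : 'no', "\n";
echo 'Valid against DTD: ', $dtdValid ? 'yes' : 'no', "\n";
```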


While DTDs have their limitations, they are still the de facto standard in XML validation: they are supported by the vast majority of XML parsers and, being part of the XML specification itself, do not rely on any additional standard. DTDs therefore still play an important part in XML validation. One of their main strengths is that they are applied to the XML as it is parsed. This allows for the creation of custom entities, such as the &copy; entity in HTML, which are replaced as the document is loaded. Further information on the capabilities of DTDs can be found in the specification.

The XML Schema Language
The XML Schema language was designed to supersede DTDs and address their limitations, allowing further control over validation. XML Schemas are written in XML, making them extensible and easy to understand. They also include full support for XML namespaces. Because XML Schema is a W3C standard, all but a few XML parsers include support for schema validation.

The importance of namespaces in XML Schema
Namespaces play an important role in XML schema validation. All schemas should (but are not required to) be declared with the target namespace of the XML they are validating. Failure to define a namespace in the schema and the XML may result in naming conflicts. The library XML is declared using the following namespace:
<library xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.phpbuilder.com/adam_delves/library_xml library-xml.xsd"
xmlns="http://www.phpbuilder.com/adam_delves/library_xml">
The location of the XML schema is referenced in the root element using the schemaLocation attribute, and the default namespace is used for the library XML. The full XML schema is included in the ZIP file which accompanies this article. Below are some of the key features of the XML Schema language.
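
As a minimal sketch of how the target namespace affects validation, the example below checks two tiny documents against a heavily reduced schema. The schema is an assumption made up for illustration and is not the library schema from the ZIP file; validatesAgainst() is a hypothetical helper, and schemaValidateSource() is used so that no external files are needed.

```php
<?php
// The namespace used by the library XML in this article.
$ns = 'http://www.phpbuilder.com/adam_delves/library_xml';

// Hypothetical, heavily reduced schema declaring that target namespace.
$xsd = <<<XSD
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="$ns"
           elementFormDefault="qualified">
  <xs:element name="library">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: load an XML string and validate it against a schema string.
function validatesAgainst($xml, $xsd)
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $result = $doc->schemaValidateSource($xsd);
    libxml_clear_errors();

    return $result;
}

$namespaced = validatesAgainst("<library xmlns=\"$ns\"><book>PHP</book></library>", $xsd);
$plain = validatesAgainst('<library><book>PHP</book></library>', $xsd);

echo 'Namespaced document valid: ', $namespaced ? 'yes' : 'no', "\n";
echo 'Document without namespace valid: ', $plain ? 'yes' : 'no', "\n";
```

The second document fails because its library element is in no namespace, so the schema has no matching declaration for it.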

Validating XML Schemas in PHP
Schema validation is carried out after the XML document has been loaded. In PHP 5, the DOMDocument object provides the schemaValidate() method. To validate the current document against an XML schema, simply supply it with the path of the XML schema file. For XML documents with a schema declared in the root element, it is possible to write a small function to carry out schema validation automatically.

PHP 5:

$library = new SchemaDOMDocument("1.0");
$library->validateOnParse = true;

$library->load('library.xml');
$library->validateXMLSchemas();

class SchemaDOMDocument extends DOMDocument
{
    public function validateXMLSchemas()
    {
        $schemaLocation = $this->documentElement->getAttributeNS('http://www.w3.org/2001/XMLSchema-instance', 'schemaLocation');

        if (!$schemaLocation) {
            throw new DOMException('No schemas found');
        }

        /* the schemaLocation contains pairs of values separated by spaces. The first value in each pair
           is the namespace to be validated. The second is a URI defining the location of the schema.

           validate each namespace using the provided URI
         */
        $pairs = preg_split('/\s+/', trim($schemaLocation));
        $pairCount = count($pairs);

        if ($pairCount <= 1) {
            throw new DOMException('Invalid schema location value.');
        }

        // the schema URIs sit at the odd indexes (1, 3, 5, ...)
        $valid = true;
        for ($x = 1; $x < $pairCount; $x += 2) {
            $valid = $this->schemaValidate($pairs[$x]) && $valid;
        }

        if (!$valid) {
            throw new DOMException('XML Schema Validation Failure');
        }

        return true;
    }