PHPBuilder - Using XML - Part 6: Validation

Using XML - Part 6: Validation

by: PHP Builder Staff
August 28, 2008

This series has so far focused on XML technologies and how they can be used from PHP 5. One subject we have not yet touched upon is XML validation. This article explores the application-independent XML validation standards: DTD's, the XML Schema language and the XSLT-based Schematron language. I will demonstrate how to validate XML in PHP, and how PHP 5's XSL extension can be used to validate XML using Schematron.
Throughout this article I will be using the Library XML seen in part 1 of this series. An up-to-date, namespaced example including a doctype is contained within the ZIP file accompanying this article.
The XML standard states that an XML document must be well formed: attributes must be enclosed in double or single quotes and character data must be escaped accordingly. A document which does not conform to these constraints is not an XML document, and this simple structural validation is carried out by the majority of XML parsers. It is, however, by no means the be-all and end-all of XML data validation. To draw an analogy with the real world, a well-formed XML document is like a well-built building. It is structurally sound, but if the only requirement is that it is a building, how does one know that it contains the right rooms, the required number of floors and the correct décor? If the building is not to the correct specification, it may as well be a pile of rubble.
Back to XML: being well formed is simply not enough to constitute a valid XML document. You must ensure that the data and elements you need are all present, in the correct order, and contain the data you expect. This validation can be carried out at the application level. For example, in PHP the DOM API can be used to validate the document structure.

PHP 5:

$categories = array();
$XMLCategories = $xml->getElementsByTagName('categories')->item(0);

if ($XMLCategories) {
    foreach ($XMLCategories->getElementsByTagName('category') as $categoryNode) {
        /* notice how we get attributes */
        $cid = $categoryNode->getAttribute('id');
        if (!$cid) continue; // if the id attribute is not present, ignore this category
        $categories[$cid] = $categoryNode->firstChild->nodeValue;
    }
} else {
    die('No Categories Found');
}
The above example uses the sample code from the first article in the series and checks that the categories element contains category elements and that each one contains an id attribute. While this validation works, it destroys the portable nature of XML and forces applications to agree on a validation standard before using the XML data.

Validation using Document Type Definitions (DTD's)
DTD's were included as part of the original XML 1.0 specification. They have been around since the days of SGML (the standard in which XML has its roots) and, as the name suggests, a DTD defines the structure of a document: the elements that can appear, the order in which they can appear, the attributes they can have and the data they can contain.
A DTD is referenced in the document type declaration at the top of an XML document, or can be included inside the document type declaration as an in-line DTD:
<?xml version="1.0"?>
<!DOCTYPE library SYSTEM "library.dtd" [
<!ENTITY % nsp "lib:">

<!ENTITY % nss ":lib">
] >
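The above document type declaration references an external DTD, library.dtd, and also includes its own declarations in an internal subset (the part between the square brackets). The internal subset is read first and, where an entity or attribute is declared more than once, the first declaration is the one that counts, so entity and attribute declarations made here take precedence over those in the external DTD. Element types, by contrast, may only be declared once. In the above example, the entity nsp is declared in the internal subset and thus overrides the declaration of that entity in the external DTD.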
The external DTD which defines the library XML is as follows:

<?xml version="1.0" encoding="iso-8859-1" ?>
<!-- An external DTD may begin with a text declaration; it is optional, but if present it must come first -->
<!-- By default, DTD's do not support namespaces. However, a namespace suffix and prefix can be
included as an entity in the DTD which can be overridden in the DOCTYPE declaration or by changing
the DTD. -->

<!-- Namespace prefix should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nsp "" >
<!-- Namespace suffix should be overridden in the local instance where it is not the default namespace -->
<!ENTITY % nss "" >
<!-- xmlns entity declaration -->
<!ENTITY % nsdec "xmlns%nss;" >
<!-- each POSSIBLE namespaced element is now declared as an entity -->
<!ENTITY % library "%nsp;library" >
<!ENTITY % categories "%nsp;categories" >
<!ENTITY % authors "%nsp;authors" >
<!ENTITY % books "%nsp;books" >
<!ENTITY % book "%nsp;book" >
<!ENTITY % category "%nsp;category" >
<!ENTITY % author "%nsp;author" >
<!ENTITY % title "%nsp;title" >
<!ENTITY % publisher "%nsp;publisher" >
<!ENTITY % cover "%nsp;cover" >
<!ENTITY % synopsis "%nsp;synopsis" >
<!-- element and attribute list declarations now use the element
entities declared above to define the XML document -->

<!ELEMENT %library; (%categories;,%authors;, %books;) >
<!-- allow for possible declaration of an XML schema. Must use the XSI namespace -->
<!ATTLIST %library;
%nsdec; CDATA #FIXED ""
xmlns:xsi CDATA #FIXED ""
xsi:schemaLocation CDATA #IMPLIED >

<!-- the categories element must contain at least one category element, indicated by + -->
<!ELEMENT %categories; ((%category;)+) >
<!ELEMENT %category; (#PCDATA) >
<!-- the ID attribute is declared as optional as the category element also appears as a child
element of the book element, where it should reference a valid ID from a category element
in the categories section -->

<!ATTLIST %category; id CDATA #IMPLIED >
<!-- the authors element must contain at least 1 author element, indicated by + -->
<!ELEMENT %authors; ((%author;)+) >
<!-- the ID attribute is declared as optional as the author element also appears as a child
element of the book element, where it should reference a valid ID from an author element
in the authors section -->

<!ELEMENT %author; (#PCDATA) >
<!ATTLIST %author; id CDATA #IMPLIED >
<!-- the books element may contain 0 or more book elements, indicated by the * -->
<!ELEMENT %books; ((%book;)*) >
<!-- a sequence of elements is declared using commas; the elements must appear in the same order as specified in
the sequence. The cover and synopsis elements are optional (indicated by a ?): they may occur only
once but their occurrence is optional -->

<!ELEMENT %book; ((%title;),(%publisher;),(%category;)+,(%author;)+,(%cover;)?,(%synopsis;)?) >
<!-- the isbn attribute is required. the hascover attribute is optional, but when specified may contain
only the values yes and no. no is the default -->

<!ATTLIST %book;
isbn CDATA #REQUIRED
hascover (yes | no) "no" >

<!-- these elements may only contain character data, comments and processing instructions -->
<!ELEMENT %title; (#PCDATA) >
<!ELEMENT %publisher; (#PCDATA) >
<!ELEMENT %cover; (#PCDATA) >
<!ELEMENT %synopsis; (#PCDATA) >
DTD's are not without their shortcomings. The library XML DTD demonstrates a few of these:
  • DTD's do not support XML namespaces. Namespaces were an addition to the XML 1.0 standard created to prevent element name conflicts in more complex documents. Each element in the DTD must be referenced using its fully qualified name. Therefore the above DTD uses two entities, nsp (namespace prefix) and nss (namespace suffix), which should be overridden in the XML document if namespaces are used. Each element is then declared with the prefix as an entity.
  • All element and attribute declarations are global to the XML document. As demonstrated in the XML, the author and category elements are used in two contexts. The element declarations for these elements must therefore be flexible enough to encompass them both. This means the id attribute, which ought to be required on the category and author elements, cannot be enforced when they are part of the category list or author list.
  • Unique attributes can be defined in a DTD through the use of the ID type, but only at a global level. Each author and category must have a unique id attribute, but the id need only be unique within the context of the categories list or the authors list. Using the DTD ID type would require every id to be unique across authors and categories combined. This is not a limitation of DTD's but a feature, as using ID types in attributes provides a way of uniquely identifying an element. This then allows use of the DOM function getElementById().
  • DTD's allow you to define optional elements (with ?), elements that may occur zero or more times (with *), elements that must occur at least once (with +) and elements that must occur exactly once. However, they do not allow you to enforce an upper limit on the number of elements that may occur.
  • DTD's support several types of data including enumerations, as demonstrated by the hascover attribute declaration for the book element. However, they do not allow you to declare more specific data types such as numbers and booleans, and do not support custom data types.
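
The point about the ID type and getElementById() can be illustrated with a minimal sketch. It assumes a small, made-up authors document with an in-line DTD (not the library XML from the ZIP file), and the author names and ids are invented for the example:

```php
<?php
// A made-up document whose in-line DTD declares the id attribute with the ID type.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE authors [
  <!ELEMENT authors (author+)>
  <!ELEMENT author (#PCDATA)>
  <!ATTLIST author id ID #REQUIRED>
]>
<authors>
  <author id="a1">Jane Example</author>
  <author id="a2">John Example</author>
</authors>
XML;

$doc = new DOMDocument();
// Validate while parsing so the DTD-declared ID attributes are registered.
$doc->validateOnParse = true;
$doc->loadXML($xml);

// getElementById() only finds attributes the DTD has declared as type ID.
$author = $doc->getElementById('a2');
echo $author->nodeValue, "\n";
```

Because ID values are global to the document, a category element could not reuse the value a2 in a document validated this way, which is the trade-off described in the list above.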

Validating against a DTD in PHP
There are two ways in which a document can be validated against a DTD using PHP 5's DOM extension. The first and preferred way is to validate it as it is parsed. This involves setting a flag before the XML is loaded:

PHP 5:

$library = new DOMDocument("1.0");
$library->validateOnParse = true;
if (!$library->load($file)) {
    die('Error Loading Document');
}

if (libxml_get_last_error()) {
    die('Error Parsing Document');
}

The validateOnParse property causes the DOMDocument object to do exactly that. Notice how the libxml functions are used to check for validation errors: although arguably it should, the load() function does not return false if DTD validation fails. In order to have entities replaced and default attribute values set, pass the appropriate libxml constants (LIBXML_NOENT and LIBXML_DTDATTR respectively) as the second, optional argument of the load() function.
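
As a rough sketch of how the validation messages themselves might be collected, the following uses libxml's internal error buffer together with the LIBXML_DTDVALID and LIBXML_DTDATTR constants. The helper name validateAgainstDtd() and the small books document are assumptions made for this example, not part of the library XML:

```php
<?php
// Buffer libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Made-up document: the second book uses a hascover value the DTD does not allow.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE books [
  <!ELEMENT books (book*)>
  <!ELEMENT book (title)>
  <!ELEMENT title (#PCDATA)>
  <!ATTLIST book hascover (yes|no) "no">
]>
<books>
  <book><title>First</title></book>
  <book hascover="maybe"><title>Second</title></book>
</books>
XML;

// Hypothetical helper: returns the parsed document (or null) and any messages.
function validateAgainstDtd($xml)
{
    $doc = new DOMDocument();
    // LIBXML_DTDVALID validates during parsing, LIBXML_DTDATTR adds default attributes.
    $loaded = $doc->loadXML($xml, LIBXML_DTDVALID | LIBXML_DTDATTR);

    $messages = array();
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();

    return array($loaded ? $doc : null, $messages);
}

list($doc, $messages) = validateAgainstDtd($xml);
foreach ($messages as $message) {
    echo $message, "\n";
}
```

The document still loads because it is well formed; the invalid attribute value only shows up in the collected messages, which is why the error buffer has to be checked separately.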
The second way is to use the validate() method of the DOMDocument object. Because the XML document has already been loaded, entity replacements cannot be carried out when using the validate() method. In particular, any entity declarations contained within the XML document type declaration are ignored and do not override the external declarations.

PHP 5:

$library = new DOMDocument("1.0");
$library->load($file);

if (!$library->validate()) {
    die('DTD Validation failure.');
}

A document can also be validated against its DTD when loading XML into a SimpleXML object, using the LIBXML constants:

PHP 5:

$library = simplexml_load_file('library.xml', 'SimpleXMLElement', LIBXML_DTDVALID);

if (libxml_get_last_error()) {
    die('Error validating / loading XML');
}

While DTD's have their limitations, they are still the de facto standard in XML validation: they are supported by the vast majority of XML parsers and, being part of the XML specification itself, do not rely on any additional standard. One of their main strengths is that they are applied to the XML as it is parsed. This allows for the creation of custom entities, such as the &copy; entity in HTML, which are replaced as the document is loaded. Further information on the capabilities of DTD's can be found in the specification.
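
As a small illustration of entity replacement at load time, the sketch below declares a custom entity in an in-line DTD and loads the document with LIBXML_NOENT. The note document and the entity name publisher are made up for this example:

```php
<?php
// Made-up document declaring a custom general entity in its in-line DTD.
$xml = <<<'XML'
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ENTITY publisher "Example Press">
]>
<note>Published by &publisher;</note>
XML;

$doc = new DOMDocument();
// LIBXML_NOENT asks libxml to substitute entity references while parsing.
$doc->loadXML($xml, LIBXML_NOENT);

echo $doc->documentElement->textContent, "\n";
```

Entity substitution should only be enabled for documents from a trusted source, since entities can also refer to external resources.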

The XML Schema Language
The XML Schema language was designed to supersede DTD's and address their limitations, allowing further control over validation. XML Schemas are written in XML, making them extensible and easy to understand. They also include full support for XML namespaces. As XML Schema is a W3C standard, all but a few XML parsers include support for schema validation.

The importance of namespaces in XML Schema
Namespaces play an important role in XML schema validation. All schemas should (but are not required to) be declared with the target namespace of the XML they are validating. Failure to define a namespace in the schema and the XML may result in naming conflicts. The library XML is declared using the following namespace:
<library xmlns:xsi=""
xsi:schemaLocation=" library-xml.xsd"
The location of the XML schema is referenced in the root element using the schemaLocation attribute, alongside the default namespace used for the library XML. The full XML schema is included in the ZIP file which accompanies this article. Below are some of the key features of the XML Schema language.
  • Elements can be of a simple type or complex type.

    A simple type element may contain only simple character data, similar to the #PCDATA type in DTD's: 

    <xs:element  name="author"  type="xs:int"  maxOccurs="unbounded"  />
  • A complex type element may contain attributes, other elements, a mixture of other elements and character data, or a custom type:
    <xs:complexType>
    <!-- elements in an all group may appear 0 or 1 times in any order -->
    <xs:all>
    <!-- the minOccurs attribute effectively makes these elements mandatory -->
    <xs:element minOccurs="1" type="lib:authorsDef" name="authors" />
    <xs:element minOccurs="1" type="lib:categoriesDef" name="categories" />
    <xs:element minOccurs="1" type="lib:booksDef" name="books" />
    </xs:all>
    <!-- declaration of an optional name attribute -->
    <xs:attribute name="name" type="xs:string" />
    </xs:complexType>

  • XPath expressions are used to define unique key constraints. They are not limited to attribute values; they can also be applied to element content and any other data derived from an XPath expression. Unique keys are also applied to the ids in the author and category lists.
    <xs:key name="isbnUnique">
    <xs:selector xpath="lib:books/lib:book" />
    <xs:field xpath="@isbn" />
    </xs:key>
  • Reference constraints can also be applied to XML data. They are defined in a similar manner to the unique key constraints and use an XPath expression to select the data that the reference applies to. Notice how the qualified name (the name including the namespace prefix) is used to refer to the name assigned to a key constraint.
    <xs:keyref name="validCategory" refer="lib:cidUnique">
    <xs:selector xpath="lib:books/lib:book/lib:category" />
    <xs:field xpath="." />
    </xs:keyref>
  • The Schema itself is not defined in terms of its root element. In fact it can be included as part of another schema or the schema itself can include definitions and declarations and custom types that are defined in other schemas. This is one of XML schema's biggest strengths.
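
To make the unique key constraint from the list above more concrete, here is a minimal sketch that validates two small documents against an in-line schema using PHP's DOMDocument::schemaValidateSource() method. The schema, the urn:example:library namespace and the helper isValidLibrary() are all assumptions made for this example; they are much simpler than the schema shipped in the ZIP file:

```php
<?php
// Buffer libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// A cut-down, made-up schema: books contain book elements with a unique isbn.
$schema = <<<'XSD'
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:lib="urn:example:library"
           targetNamespace="urn:example:library"
           elementFormDefault="qualified">
  <xs:element name="books">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="book" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:simpleContent>
              <xs:extension base="xs:string">
                <xs:attribute name="isbn" type="xs:string" use="required"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="isbnUnique">
      <xs:selector xpath="lib:book"/>
      <xs:field xpath="@isbn"/>
    </xs:key>
  </xs:element>
</xs:schema>
XSD;

// Hypothetical helper: true when the XML string validates against the schema.
function isValidLibrary($xml, $schema)
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $ok = $doc->schemaValidateSource($schema);
    libxml_clear_errors();
    return $ok;
}

$good = '<books xmlns="urn:example:library"><book isbn="1">A</book><book isbn="2">B</book></books>';
$bad  = '<books xmlns="urn:example:library"><book isbn="1">A</book><book isbn="1">B</book></books>';

var_dump(isValidLibrary($good, $schema));
var_dump(isValidLibrary($bad, $schema));
```

The second document is well formed and structurally correct; it fails only because two book elements share an isbn, which is exactly the kind of rule a DTD cannot express within a limited scope.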

Validating against XML Schemas in PHP
Schema validation is carried out after the XML document has been loaded. In PHP 5, the DOMDocument object provides the schemaValidate() method. To validate the current document against an XML schema, simply supply it with the path of the XML schema file. For XML documents with a schema declared in the root element, it is possible to write a small function to carry out schema validation automatically.

PHP 5:

$library = new SchemaDOMDocument("1.0");
$library->validateOnParse = true;
$library->load($file);
$library->validateXMLSchemas();

class SchemaDOMDocument extends DOMDocument
{
    /* validateXMLSchemas is an illustrative method name */
    public function validateXMLSchemas()
    {
        /* schemaLocation belongs to the standard XML Schema instance (xsi) namespace */
        $schemaLocation = $this->documentElement->getAttributeNS('http://www.w3.org/2001/XMLSchema-instance', 'schemaLocation');

        if (!$schemaLocation) {
            throw new DOMException('No schemas found');
        }

        /* the schemaLocation contains pairs of values separated by spaces. The first value in each pair
           is the namespace to be validated. The second is a URI defining the location of the schema.
           Validate each namespace using the provided URI */

        $pairs = preg_split('/\s+/', trim($schemaLocation));
        $pairCount = count($pairs);
        if ($pairCount <= 1) {
            throw new DOMException('Invalid schema location value.');
        }

        $valid = true;
        for ($x = 1; $x < $pairCount; $x += 2) {
            $valid = $this->schemaValidate($pairs[$x]) && $valid;
        }
        if (!$valid) {
            throw new DOMException('XML Schema Validation Failure');
        }
    }
}

