simplexml

Simple XML Parser

Known limitations

Please note that the main purpose of the simple XML parser is to provide a simple non-validating XML parser that parses a subset of XML. It cannot be stressed enough that this is not a full-fledged "do-everything" parser.

The following limitations of the simple XML parser are known. They were discovered while testing the parser against James Clark's Canonical XML Test Cases.

Invalid element names
The simple XML parser doesn't check element names, so invalid names such as <.foo> or <123> are accepted.
Invalid processing instructions, entities and doctypes
The simple XML parser doesn't care about processing instructions, entities and doctypes. They are ignored, and therefore invalid processing instructions such as <? ?>, invalid entities such as <!ENTITY foo PUBLIC "some public id"> and illegal doctypes such as <!DOCTYPE doc -- a comment -- []> are ignored as well.
Invalid comments
The contents of a comment are not checked, i.e. the parser doesn't care about an illegal '--' (double-hyphen) inside the comment.
Illegal attribute names and values
Illegal attribute names and values such as <foo bar="<quux>"> and <foo 12="34"> are accepted.
Illegal content
Illegal content such as <foo>]]></foo> or <foo>a>b</foo> as well as illegal characters such as escape sequences, form-feeds, etc. are accepted by the simple XML parser.
Illegal data at end of document
Once the simple XML parser has seen the closing tag of the document element it stops parsing; therefore illegal data at the end of the document is ignored.
Malformed XML declarations
The simple XML parser doesn't care about XML declarations; therefore malformed ones such as <?xml VERSION="1.0"?> are accepted (i.e. ignored). The parser also accepts whitespace or comments in front of the XML declaration.
Illegal character encodings
The simple XML parser doesn't care about encodings; it simply accepts all 8-bit character values it encounters. It only emits errors for unsupported Unicode character references such as &#8482;.
No support for CDATA or entities
The parser doesn't support CDATA sections or entities; therefore content such as <doc><![CDATA[<&]]></doc> or &unknown-entity; causes an error.
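
Because the parser skips the name and comment checks listed above, a caller that needs them has to run them separately. The following is a minimal Python sketch of two such checks, not part of the simple XML parser. The function names are hypothetical, and the name check is an ASCII-only approximation of the XML 1.0 Name production (the real production also admits many non-ASCII characters).

```python
import re

# ASCII-only approximation of the XML 1.0 Name production (an assumption
# made for brevity; the full production allows many non-ASCII characters).
_NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_plausible_xml_name(name: str) -> bool:
    """Return True if name matches the ASCII-only approximation of an XML Name."""
    return _NAME_RE.fullmatch(name) is not None


def comment_body_is_legal(body: str) -> bool:
    """Check the text between '<!--' and '-->'.

    In XML 1.0 a comment body must not contain '--' and must not end in '-'.
    """
    return "--" not in body and not body.endswith("-")


if __name__ == "__main__":
    # Element and attribute names taken from the limitations listed above.
    for candidate in ("foo", ".foo", "123", "12"):
        print(candidate, is_plausible_xml_name(candidate))
    print(comment_body_is_legal(" a -- b "))
```

Running such checks in a separate pass keeps the parser itself small, at the cost of scanning the names and comments a second time.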

Extending the parser

Well, as this parser should remain small and simple, there isn't much I would want to add. If you need a callback for comments, take a look at the parser; it's already prepared for that. Adding support for CDATA, if you need it, shouldn't be too much of a problem either.
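
To give an idea of how little CDATA support involves, here is a minimal Python sketch of the scanning step. It is an illustration only and does not reflect the parser's actual code; `read_cdata` and its signature are hypothetical.

```python
from typing import Tuple

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def read_cdata(text: str, pos: int) -> Tuple[str, int]:
    """Read a CDATA section that starts at text[pos].

    Returns the raw content and the index just past the closing ']]>'.
    Raises ValueError if no section starts at pos or if it is unterminated.
    """
    if not text.startswith(CDATA_OPEN, pos):
        raise ValueError("no CDATA section at this position")
    start = pos + len(CDATA_OPEN)
    end = text.find(CDATA_CLOSE, start)
    if end == -1:
        raise ValueError("unterminated CDATA section")
    # Everything between the markers is passed through literally, so
    # characters such as '<' and '&' need no further interpretation.
    return text[start:end], end + len(CDATA_CLOSE)


if __name__ == "__main__":
    doc = "<doc><![CDATA[<&]]></doc>"
    content, next_pos = read_cdata(doc, doc.index(CDATA_OPEN))
    print(repr(content), repr(doc[next_pos:]))
```

The content handler would then receive the returned text exactly as it receives ordinary character data.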

One thing that is definitely going to cause trouble is extending the parser to work on data buffers that are repeatedly refilled. The only easy way to do that is via a callback (i.e. a callback that gets called whenever the parser needs more data). If you really need this, you probably want to have a look at one of the full-fledged XML parsers; some of them support read buffers.
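
The callback idea can be sketched as follows, in Python and independent of the parser's real interface. `RefillingReader`, `next_byte` and `chunk_source` are hypothetical names, and the convention that the callback returns an empty buffer at end of input is an assumption of this sketch.

```python
from typing import Callable, Iterator, Optional


class RefillingReader:
    """Hands out one byte at a time and asks a callback for more data when empty."""

    def __init__(self, refill: Callable[[], bytes]) -> None:
        self._refill = refill  # hypothetical callback; returns b"" at end of input
        self._buf = b""
        self._pos = 0

    def next_byte(self) -> Optional[int]:
        """Return the next byte value, or None once the input is exhausted."""
        if self._pos >= len(self._buf):
            # Buffer used up: call back into the application for more data.
            self._buf = self._refill()
            self._pos = 0
            if not self._buf:
                return None
        value = self._buf[self._pos]
        self._pos += 1
        return value


def chunk_source(data: bytes, size: int) -> Callable[[], bytes]:
    """Build a refill callback that serves data in chunks of at most size bytes."""
    chunks: Iterator[bytes] = (data[i:i + size] for i in range(0, len(data), size))
    return lambda: next(chunks, b"")


if __name__ == "__main__":
    reader = RefillingReader(chunk_source(b"<a>hi</a>", 4))
    print(bytes(iter(reader.next_byte, None)))
```

The trouble mentioned above shows up as soon as the parser keeps positions into the buffer (for example the start of a tag name): a refill invalidates them, so every such place in the parser would have to copy its data out before asking for more.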

I certainly don't plan to add entity support or validation.