Please note that the main purpose of the simple XML parser is
to provide a simple non-validating XML parser that parses a
subset of XML. It cannot be stressed enough that this is not a
full-fledged "do-everything" parser.

The following limitations of the simple XML parser are known.
They were discovered while testing the parser with
James Clark's Canonical XML Test Cases.

- Invalid element names
- The simple XML parser doesn't
care about invalid element names such as <.foo> or
<123>.
- Invalid processing instructions, entities and doctypes
- The simple XML parser ignores processing instructions,
entity declarations and doctypes. Invalid processing
instructions such as <? ?>, invalid entity declarations such as
<!ENTITY foo PUBLIC "some public id"> and
illegal doctypes such as <!DOCTYPE doc -- a comment -- []>
are therefore ignored as well.
- Invalid comments
- The contents of a comment are not
checked, i.e. the parser doesn't care about an illegal '--' (double-hyphen)
inside the comment.
- Illegal attribute names and values
- Illegal attribute
names and values such as <foo bar="<quux>">
and <foo 12="34"> are accepted.
- Illegal content
- Illegal content such as
<foo>]]></foo> or <foo>a>b</foo>
as well as illegal characters such as escape sequences, form-feeds,
etc. are accepted by the simple XML parser.
- Illegal data at end of document
- Once the simple XML
parser has seen the closing document tag it stops parsing; illegal
data at the end of the document is therefore ignored.
- Malformed XML declarations
- The simple XML parser doesn't care about
XML declarations, so a malformed declaration such as <?xml VERSION="1.0"?>
is accepted (i.e. ignored). The parser also accepts whitespace or comments
in front of the XML declaration.
- Illegal character encodings
- The simple XML parser doesn't care about
encodings; it simply accepts all 8-bit character values encountered. It
only emits errors for unsupported Unicode entities such as the one for ™.
- No support for CDATA or entities
- The parser doesn't support CDATA,
so input such as <doc><![CDATA[<&]]></doc>
or &unknown-entity; causes an error.
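
To see how far the limitations above depart from a conforming parser, the listed inputs can be fed to a strict one. The following is a minimal sketch in Python, chosen only because its standard library ships an expat-based parser (xml.etree.ElementTree). It is not part of the simple XML parser, and the names NOT_WELL_FORMED and is_well_formed are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Inputs modelled on the limitations listed above. A conforming parser
# is expected to reject each of them as not well-formed.
NOT_WELL_FORMED = [
    "<.foo/>",                       # invalid element name
    "<123/>",                        # invalid element name
    "<doc><? ?></doc>",              # invalid processing instruction
    "<doc><!-- a -- b --></doc>",    # '--' inside a comment
    '<foo bar="<quux>"/>',           # '<' inside an attribute value
    '<foo 12="34"/>',                # invalid attribute name
    "<foo>]]></foo>",                # ']]>' in character data
    "<doc/>trailing junk",           # data after the document element
    '<?xml VERSION="1.0"?><doc/>',   # malformed XML declaration
    ' <?xml version="1.0"?><doc/>',  # whitespace before the declaration
    "<doc>&unknown-entity;</doc>",   # reference to an undeclared entity
]


def is_well_formed(text: str) -> bool:
    """Return True if the strict standard-library parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Collect the inputs that the strict parser refuses.
rejected = [doc for doc in NOT_WELL_FORMED if not is_well_formed(doc)]

# A CDATA section is legal XML, so the strict parser accepts this document
# even though the simple XML parser reports an error for it.
cdata_text = ET.fromstring("<doc><![CDATA[<&]]></doc>").text
```

With a strict parser, every entry in NOT_WELL_FORMED should end up in rejected, whereas the simple XML parser accepts or ignores all of them except the undeclared entity. The CDATA document goes the other way: it is well-formed, but the simple XML parser refuses it.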
Well, as this parser should remain small and simple there isn't much
I would want to add. If you need a callback for comments, then take
a look at the parser, it's already prepared. Also, adding support for
CDATA if you need it shouldn't be too much of a problem.

One thing that is definitely going to cause trouble is if you want to
extend it to make it work on data buffers that are repeatedly filled.
The only easy way to do that is via a callback (i.e. a callback
that gets called whenever the parser needs more data). If you
really need this then you probably want to have a look at one of the
full-fledged XML parsers; some of them support read buffers.

I certainly don't plan to add entity support or validation.
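
The "callback that gets called whenever the parser needs more data" idea can be sketched as follows. This is only an illustration in Python under stated assumptions, not code from the simple XML parser: RefillingReader, chunk_source and scan_tags are hypothetical names, and the toy scanner merely collects the text between '<' and '>' to show that a token may span two buffer fills.

```python
from typing import Callable, Iterable, List


class RefillingReader:
    """Hands out one character at a time and asks a callback for more
    data whenever the current buffer is exhausted (hypothetical helper)."""

    def __init__(self, need_more: Callable[[], str]) -> None:
        self._need_more = need_more  # returns "" when no data is left
        self._buf = ""
        self._pos = 0

    def next_char(self) -> str:
        """Return the next character, or "" at end of input."""
        if self._pos >= len(self._buf):
            self._buf = self._need_more()  # the refill callback
            self._pos = 0
            if not self._buf:
                return ""
        ch = self._buf[self._pos]
        self._pos += 1
        return ch


def chunk_source(chunks: Iterable[str]) -> Callable[[], str]:
    """Build a refill callback that serves the given chunks in order."""
    it = iter(chunks)
    return lambda: next(it, "")


def scan_tags(reader: RefillingReader) -> List[str]:
    """Toy scanner: collect the raw text between each '<' and '>'."""
    tags: List[str] = []
    current = None  # None while outside a tag
    while True:
        ch = reader.next_char()
        if ch == "":
            return tags
        if ch == "<":
            current = ""
        elif ch == ">" and current is not None:
            tags.append(current)
            current = None
        elif current is not None:
            current += ch


# The tags are split across chunk boundaries on purpose.
tags = scan_tags(RefillingReader(chunk_source(["<do", "c><it", "em/></doc>"])))
```

The point of the design is that the scanner never sees the buffer boundaries: all state lives in the reader, so the parsing loop stays as simple as one that works on a single in-memory string.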