Hi, thanks for the email.
My answers to your questions:
1. It is a tradeoff: VTD-XML consumes more memory, but
is easier to use and more powerful. Any XML processing API capable of random access *needs* to at least load the entire hierarchical structure into memory. My take is that compared with SAX, StAX, DOM,
and JDOM, VTD-XML is the one least likely to choke, and the best
at handling peak loads...
2. Agreed, benchmarking against a dummy SAX parser is unfair to VTD-XML;
in a real-life scenario VTD-XML would look even better.
3. Looking at all the vertical-industry XML vocabularies, SOAP,
REST, XML Schema, and the Infoset data model, DTDs seem a bit
deprecated, and VTD-XML doesn't support external entities. Other than that,
VTD-XML is equally capable.
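To put a rough number on the tradeoff in point 1: VTD-XML keeps the raw document in memory plus one 64-bit VTD record per token, so a back-of-envelope sketch looks like this (the token count below is an illustrative assumption, not a measurement):

```java
public class VtdMemoryEstimate {
    // Rough model of VTD-XML's footprint: the raw document stays in
    // memory, plus one 64-bit VTD record per token (per the VTD-XML
    // docs). Actual token counts vary by document shape.
    static long estimateBytes(long docBytes, long tokenCount) {
        return docBytes + 8L * tokenCount;
    }

    public static void main(String[] args) {
        // e.g. a 1 MB document with an assumed ~50k tokens:
        long est = estimateBytes(1_000_000L, 50_000L);
        System.out.println(est + " bytes, roughly 1.4x the document size");
    }
}
```

That multiplier stays well below what a fully extracting tree model pays per node.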
Cheers,
jz



----- Original Message ----- From: "Stefano Mazzocchi" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Sunday, February 19, 2006 8:57 PM
Subject: Re: [ANN] VTD-XML Version 1.5 Released

Hmmmm, I have to admit that I've toyed with this idea myself lately, especially since I'm diving deep into processing large quantities of XML files these days (and when I say 'large', I mean it: so large that 32 bits of address space are not enough).

The idea of non-extracting parsing is nice, but there are a few issues:

1) the memory requirements: still much less than DOM's, but still *way* more than an event-driven model like SAX. Cocoon, for example, would die if we were to move to a parser like this one, especially under load spikes.

2) benchmarking against a dummy SAX content handler is completely meaningless. In order for the API to be of any use, you have to create strings; you can't simply pass pointers to char arrays around. I bet that if the SAX parser could get away without creating strings, it would be just as fast (Xerces, in fact, uses a similar mechanism for the characters() SAX event: the entire document is kept in memory, and start/finish offsets are passed instead of a new array).
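For concreteness, that mechanism is visible in the SAX contract itself: characters() hands the handler a (buffer, start, length) triple, and a String is only allocated if the handler asks for one. A minimal JAXP sketch (class names are made up) contrasting the two styles:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class CharsDemo {
    // Offset-based handler: copies characters straight out of the
    // parser's buffer; no String object is allocated per event.
    static class NoAllocHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // The handler a real application effectively runs: one String
    // allocation per characters() event -- this is the hidden cost
    // a dummy benchmark never pays.
    static class AllocHandler extends DefaultHandler {
        int totalChars = 0;
        @Override
        public void characters(char[] ch, int start, int length) {
            totalChars += new String(ch, start, length).length();
        }
    }

    static String collectText(String xml) throws Exception {
        NoAllocHandler h = new NoAllocHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), h);
        return h.text.toString();
    }
}
```

Swap one handler for the other in a benchmark and you're measuring two very different workloads.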

3) 90% of the slowness comes from 10% of the details in the XML spec, which means that in order to stay fast you need to sacrifice compliance... which is not an option these days, given how cheap silicon is.

But don't get me wrong, I think there is something interesting in what you are doing: I think it would be cool if you could serialize the 'tree index' alongside the document on disk and provide some sort of b-tree indexing for it. It would help me in my multi-GB-of-XML day-to-day struggle.

You claim XPath random access, but what is the algorithmic complexity of that? O(1), O(log n), O(n), O(n log n)? If one were to store the parsed tree index on disk, how many pages would one need to page in before reaching the required XPath target?

--
Stefano.