stevedlawrence commented on code in PR #1514:
URL: https://github.com/apache/daffodil/pull/1514#discussion_r2246581922
##########
daffodil-schematron/src/main/scala/org/apache/daffodil/validation/schematron/SchematronValidator.scala:
##########
@@ -20,25 +20,52 @@ package org.apache.daffodil.validation.schematron
import java.io.File
import java.io.FileOutputStream
import java.io.InputStream
+import java.io.StringWriter
import java.net.URI
+import javax.xml.transform.Templates
+import javax.xml.transform.Transformer
+import javax.xml.transform.sax.SAXSource
+import javax.xml.transform.stream.StreamResult
import scala.util.Try
import scala.xml.Elem
import scala.xml.XML
import org.apache.daffodil.api
+import org.apache.daffodil.lib.xml.DaffodilSAXParserFactory
+
+import org.xml.sax.InputSource
+import org.xml.sax.XMLReader
/**
* Daffodil Validator implementation for ISO schematron
*/
final class SchematronValidator(
- engine: Schematron,
+ templates: Templates,
svrlPath: Option[URI]
) extends api.validation.Validator {
+
+ // XMLReader and Transformer are not thread safe so are ThreadLocals.
Alternatively, they could be
+ // created every time validateXML, but ThreadLocal allows reuse of objects
+ // that are potentially expensive to create
+ val readerTransformerTL = new ThreadLocal[(XMLReader, Transformer)] {
+ override def initialValue(): (XMLReader, Transformer) = {
+ val factory = DaffodilSAXParserFactory()
+ factory.setValidating(false)
+ val reader = factory.newSAXParser.getXMLReader
+ val transformer = templates.newTransformer
+ (reader, transformer)
+ }
+ }
+
def validateXML(
document: InputStream,
handler: api.validation.ValidationHandler
): Unit = {
- val svrl = XML.loadString(engine.validate(document))
+ val (reader, transformer) = readerTransformerTL.get
+ val writer = new StringWriter
+ transformer
+ .transform(new SAXSource(reader, new InputSource(document)), new
StreamResult(writer))
+ val svrl = XML.loadString(writer.toString)
svrl.child.collect { case f @ Elem("svrl", "failed-assert", _, _, msg @
_*) =>
handler.validationError(msg.text.trim, (f \ "@location").text)
}
Review Comment:
Combining the Schematron logic here made me notice that we do some
unnecessary work. Below this we do `val svrlString = svrl.mkString`, but
`svrl.mkString` is the same as `writer.toString` that already do above. So
we're essentially creating the string twice. Instead we should just do
something like this:
```scala
transformer.transform(...)
val svrlString = writer.toString
val svrl = XML.loadString(svrlString)
svrl.child.collect { ... handler.valdiationError(...) }
svrlPath.foreach { uri => ...
val writer = FileWriter(new (File(uri)) }
writer.write(svrlString)
...
}
```
So we avoid the duplicate string creation. We can also avoid the .getBytes
to prevent converting the string to a giant byte array by using a FileWriter
instead of a FileOutputStream.
I think those changes would be an easy worthwhile performance improvement
while we're already fixing the validator.
I can imagine even more performance, e.g. create a custom SAXResult instance
that directly calls handler.validationError, completely avoiding the need to
create a string at all, but we probably shouldn't worry about that unless
people run into performance limitations with the schematron validator.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]