On 1 July 2013 12:14, Lewis John Mcgibbney <[email protected]> wrote:
> Hi All,
> A while ago I lodged ANY23-134 [0] with the intention of extending the
> Any23 paradigm to other document formats other than subsets of XML.
> Say for example, I would like to read in PDF documents such as this one [1]
> or this one [2].
> The idea would be to use Any23 (within a pipeline) to extract out the
> specification data as triples. I can then build a triples representation of
> this document for really domain specific inferences.
> Is ANY23-134 the correct way to go about this? Should I be looking at some
> other existing tool we have within Any23... XPath immediately springs to
> mind but I am not sure and would really appreciate a comment or two from
> anyone out there!
>
> Thank you very much.
> Lewis
>
> [0] https://issues.apache.org/jira/browse/ANY23-134
> [1] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%20100iC%20Series_7.pdf
> [2] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%200iA_170.pdf
>

I don't think there is much value in creating a pipeline structure inside of Any23 that doesn't use triples for interchange between pipeline stages. You may be able to come up with some reusable abstract classes to work with Tika more smoothly, but when it gets to emitting results from an extractor in Any23 I would recommend that you form the results into triples.

Also, Sesame-2.7 deprecated "stopAtFirstError" and "verifyDataType" in favour of ParserConfig.addNonFatalError(... <setting that should not fail parsing> ...) and BasicParserSettings.VERIFY_DATATYPE_VALUES (and other similar settings), respectively. Not sure how tightly they are linked into Any23 as it has been a while since I went in and looked, but I noticed them in the patch so I thought I should mention that.

Cheers,

Peter
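
To make the recommendation above about forming extractor results into triples a little more concrete, here is a minimal sketch in plain Java. It is not Any23 or Tika code: the class SpecTriples, the vocabulary namespace, the subject IRI and the field values are all hypothetical, and it simply assumes an earlier stage has already pulled label/value pairs out of a datasheet. Each pair becomes one N-Triples line, so triples are the only thing handed to the next pipeline stage.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SpecTriples {
    // Hypothetical vocabulary namespace, used only for this illustration.
    static final String VOCAB = "http://example.org/spec#";

    // Escape the characters that must be escaped inside an N-Triples literal.
    static String escapeLiteral(String value) {
        return value.replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\r", "\\r");
    }

    // Turn a field label into an IRI-safe local name by replacing each
    // run of non-alphanumeric characters with a single underscore.
    static String localName(String label) {
        return label.trim().replaceAll("[^A-Za-z0-9]+", "_");
    }

    // Emit one N-Triples statement per extracted label/value pair.
    static List<String> toTriples(String subjectIri, Map<String, String> fields) {
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            triples.add("<" + subjectIri + "> <" + VOCAB + localName(field.getKey())
                    + "> \"" + escapeLiteral(field.getValue()) + "\" .");
        }
        return triples;
    }

    public static void main(String[] args) {
        // Made-up values for illustration, not taken from the linked datasheets.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Payload", "10 kg");
        fields.put("Max. reach", "1420 mm");
        for (String triple : toTriples("http://example.org/robot/example", fields)) {
            System.out.println(triple);
        }
    }
}
```

In a real extractor the statements would presumably be handed to whatever triple handler the framework provides rather than built as strings; the string form is used here only to keep the sketch free of dependencies.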
