Re: Extraction of structure from non-XML based formats

Peter Ansell Sun, 30 Jun 2013 19:33:19 -0700

On 1 July 2013 12:14, Lewis John Mcgibbney <[email protected]> wrote:


> Hi All,
> A while ago I lodged ANY23-134 [0] with the intention of extending the
> Any23 paradigm to document formats other than subsets of XML.
> Say for example, I would like to read in PDF documents such as this one [1]
> or this one [2].
> The idea would be to use Any23 (within a pipeline) to extract out the
> specification data as triples. I can then build a triples representation of
> this document for really domain specific inferences.
> Is ANY23-134 the correct way to go about this? Should I be looking at some
> other existing tool we have within Any23... XPath immediately springs to
> mind but I am not sure and would really appreciate a comment or two from
> anyone out there!
>
> Thank you very much.
> Lewis
>
> [0] https://issues.apache.org/jira/browse/ANY23-134
> [1] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%20100iC%20Series_7.pdf
> [2] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%200iA_170.pdf
>
>
I don't think there is much value in creating a pipeline structure inside
of Any23 that doesn't use triples for interchange between pipeline stages.
You may be able to come up with some reusable abstract classes to work with
Tika more smoothly, but when it gets to emitting results from an extractor
in Any23 I would recommend that you form the results into triples.
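
A rough sketch of what "forming the results into triples" could look like,
using only plain Java and no Any23, Tika or Sesame classes. Everything named
here is hypothetical and for illustration only: the SpecTriples class, the
http://example.org/ namespace, and the field names and values, which are
placeholders rather than data taken from the linked datasheets. It assumes an
earlier stage (for example a Tika-based one) has already pulled key/value
pairs out of the PDF, and it simply serialises them as N-Triples lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SpecTriples {

    // Hypothetical vocabulary namespace, used only for this illustration.
    static final String NS = "http://example.org/spec#";

    /** Escape backslash, double quote, newline, carriage return and tab for an N-Triples literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r")
                .replace("\t", "\\t");
    }

    /** Turn extracted key/value pairs into one N-Triples line per pair. */
    static List<String> toNTriples(String subjectIri, Map<String, String> fields) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            // Build a predicate local name by replacing runs of non-alphanumerics with "_".
            String local = e.getKey().replaceAll("[^A-Za-z0-9]+", "_");
            lines.add("<" + subjectIri + "> <" + NS + local + "> \""
                    + escape(e.getValue()) + "\" .");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Placeholder values, not real specification data.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("axes", "6");
        fields.put("mounting", "floor");
        for (String line : toNTriples("http://example.org/robot/1", fields)) {
            System.out.println(line);
        }
    }
}
```

In a real extractor the same statements would go out through whatever triple
handler Any23 hands to the extractor rather than being built as strings; the
point is only that each pipeline stage hands on triples, not an ad hoc
structure.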

Also, Sesame-2.7 deprecated "stopAtFirstError" and "verifyDataType" in
favour of ParserConfig.addNonFatalError(... <setting that should not fail
parsing> ...) and BasicParserSettings.VERIFY_DATATYPE_VALUES (and other
similar settings), respectively. I am not sure how tightly they are linked
into Any23, as it has been a while since I went in and looked, but I noticed
them in the patch, so I thought I should mention it.

Cheers,

Peter
