Extraction of structure from non-XML based formats

Lewis John Mcgibbney Sun, 30 Jun 2013 19:15:18 -0700

Hi All,
A while ago I lodged ANY23-134 [0] with the intention of extending the
Any23 paradigm to document formats other than subsets of XML.
For example, I would like to read in PDF documents such as this one [1]
or this one [2].
The idea would be to use Any23 (within a pipeline) to extract the
specification data as triples. I could then build a triple-based
representation of the document for highly domain-specific inferences.
Is ANY23-134 the correct way to go about this? Or should I be looking at
some other existing tool we have within Any23? XPath immediately springs to
mind, but I am not sure, and I would really appreciate a comment or two from
anyone out there!
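
To make the pipeline idea a little more concrete, here is a rough sketch of
the last step only: turning "Label: value" lines, as they might come out of
some PDF text extraction step, into N-Triples. This is not an existing Any23
extractor. The example.org namespace, the function names and the sample
input are all hypothetical, and real datasheet text would almost certainly
need more careful parsing than a split on the first colon.

```python
import re

# Hypothetical namespace, used for illustration only.
BASE = "http://example.org/robot/"


def slug(text):
    """Collapse runs of non-alphanumeric characters into '_' and lowercase."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text.strip()).strip("_").lower()


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def lines_to_ntriples(subject_id, lines):
    """Map 'Label: value' lines to N-Triples statements about one subject.

    Lines without a colon, or with an empty label or value, are skipped.
    """
    subject = f"<{BASE}{slug(subject_id)}>"
    triples = []
    for line in lines:
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        if not key or not value:
            continue
        predicate = f"<{BASE}prop/{slug(key)}>"
        triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


if __name__ == "__main__":
    # Placeholder input, not taken from the datasheets linked in [1] or [2].
    sample = ["Axes: 6", "a line with no label", "Mounting: floor"]
    for triple in lines_to_ntriples("Example Model", sample):
        print(triple)
```

Every value is emitted as a plain string literal here. Typed literals or a
proper vocabulary for the predicates would be an obvious next step, but that
depends on how the extraction itself ends up being done.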


Thank you very much.
Lewis

[0] https://issues.apache.org/jira/browse/ANY23-134
[1] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%20100iC%20Series_7.pdf
[2] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%200iA_170.pdf

-- 
*Lewis*
