Re: Extraction of structure from non-XML based formats

Mattmann, Chris A (398J) Sun, 30 Jun 2013 23:21:28 -0700

Hey Lewis,

What about integrating Any23 into Tika -- which has a PDF parser,
etc.? I'd be happy to try and help out wherever I can.


Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, June 30, 2013 7:14 PM
To: "[email protected]" <[email protected]>
Subject: Extraction of structure from non-XML based formats

>Hi All,
>A while ago I lodged ANY23-134 [0] with the intention of extending the
>Any23 paradigm to document formats other than subsets of XML.
>For example, I would like to read in PDF documents such as this one [1]
>or this one [2].
>The idea would be to use Any23 (within a pipeline) to extract the
>specification data as triples. I can then build a triples representation
>of this document for domain-specific inferences.
>Is ANY23-134 the correct way to go about this? Should I be looking at some
>other existing tool we have within Any23... XPath immediately springs to
>mind but I am not sure and would really appreciate a comment or two from
>anyone out there!
>
>Thank you very much.
>Lewis
>
>[0] https://issues.apache.org/jira/browse/ANY23-134
>[1] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%20100iC%20Series_7.pdf
>[2] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%200iA_170.pdf
>
>-- 
>*Lewis*
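
For illustration only, here is a rough sketch of the second half of the pipeline discussed above: turning already-extracted plain text into triples. It assumes a PDF parser (Tika or similar) has already produced text with simple "label: value" lines, which real datasheets may well not have. The example.org namespace, the function names, and the sample text are all made up for the sketch; none of this is Any23 or Tika API.

```python
import re

# Hypothetical namespace for illustration only; not an Any23 or Tika vocabulary.
BASE = "http://example.org/spec/"


def slug(label):
    """Turn a free-text label into a URI-safe local name."""
    return re.sub(r"[^A-Za-z0-9]+", "_", label.strip()).strip("_").lower()


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def text_to_ntriples(subject_name, text):
    """Emit one N-Triples statement per 'label: value' line in the text."""
    subject = f"<{BASE}{slug(subject_name)}>"
    triples = []
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        key = slug(label)
        if not sep or not key or not value.strip():
            continue  # skip lines that are not simple key/value pairs
        predicate = f"<{BASE}{key}>"
        literal = escape_literal(value.strip())
        triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


# Made-up sample standing in for parser output; not taken from the linked PDFs.
sample = """Example Robot Datasheet
Axes: 6
Payload: 10 kg
"""

for triple in text_to_ntriples("Example Robot", sample):
    print(triple)
```

Lines without a "label: value" shape are skipped rather than guessed at, so tables or multi-column layouts in a real PDF would need a proper extractor in front of this step.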
