Hey Lewis,

What about integrating Any23 into Tika -- which has a PDF parser, etc.? I'd be happy to try and help out wherever I can.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, June 30, 2013 7:14 PM
To: "[email protected]" <[email protected]>
Subject: Extraction of structure from non-XML based formats

>Hi All,
>
>A while ago I lodged ANY23-134 [0] with the intention of extending the
>Any23 paradigm to other document formats other than subsets of XML.
>Say for example, I would like to read in PDF documents such as this one [1]
>or this one [2].
>The idea would be to use Any23 (within a pipeline) to extract out the
>specification data as triples. I can then build a triples representation of
>this document for really domain specific inferences.
>Is ANY23-134 the correct way to go about this? Should I be looking at some
>other existing tool we have within Any23... XPath immediately springs to
>mind but I am not sure and would really appreciate a comment or two from
>anyone out there!
>
>Thank you very much.
>Lewis
>
>[0] https://issues.apache.org/jira/browse/ANY23-134
>[1] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%20100iC%20Series_7.pdf
>[2] http://www.fanucrobotics.com/cmsmedia/datasheets/ARC%20Mate%200iA_170.pdf
>
>--
>*Lewis*
