Re: Extraction of structure from non-XML based formats

Lewis John Mcgibbney Wed, 03 Jul 2013 12:51:32 -0700

Hi Chris,

BTW I should have said in my mail to Peter, sorry for taking ages to get
back. These emails come through as batches, so sometimes it can be days if
the lists are quiet... which they are.
Anyway,


On Wed, Jul 3, 2013 at 12:34 AM, <[email protected]> wrote:

>
>
> What about integrating Any23 into Tika -- which has a PDF parser,
> etc.? I'd be happy to try and help out wherever I can.
>

Yeah, I suppose this is the next logical step, Chris. The problem I see here
though is that, with regards to trivial structured content such as schemas,
namespaces, etc., which I may add are completely useless for my purpose, I
have a feeling that I am kinda beating my head against a wall here.
Any23 extracts structured markup such as DC, LKIFCore, hListings, etc. None
of this structure is/will be available within my PDFs. This creates a
problem for me. It means that I cannot use most of the built-in extraction
implementations from Any23, which leaves me to code the stuff myself...
Thanks for chiming in on this one.
