Indexing several parts of PDF file

Jorge Luis Betancourt Gonzalez Tue, 05 Feb 2013 06:01:12 -0800

Hi:

I'm working on a search engine for several PDF documents. One of the 
requirements is that we return not only the documents matching the search 
criteria but also the pages that match. Normally Tika only extracts the 
text content and does not make this distinction. Using some custom library 
this could be achieved, but my question is how to structure the schema. From 
what I've seen, one approach could be to use dynamic fields:


<dynamicField name="page_*" type="text" indexed="true" stored="true"/>

So at query time I could extract the page number from the field name. Is this 
the best approach? Is there a way to store the page number in an attribute 
instead of using dynamic fields?
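
To make the query-time step concrete, here is a minimal sketch in Python. It 
assumes each result document arrives as a plain dict mapping field names to 
stored text, with one page_<n> field per page. The helper names 
(pages_in_document, matching_pages) are hypothetical and not part of Solr or 
any client library, and the substring check is only a stand-in for whatever 
matching the search engine really does.

```python
import re

# Assumption: page fields are named page_<n>, following the page_* dynamicField pattern.
PAGE_FIELD = re.compile(r"^page_(\d+)$")


def pages_in_document(doc):
    """Return {page_number: text} for every page_<n> field in a result document."""
    pages = {}
    for name, value in doc.items():
        match = PAGE_FIELD.match(name)
        if match:
            # The page number is recovered from the field name itself.
            pages[int(match.group(1))] = value
    return pages


def matching_pages(doc, term):
    """Return sorted page numbers whose stored text contains term (case-insensitive)."""
    term = term.lower()
    return sorted(
        number
        for number, text in pages_in_document(doc).items()
        if term in text.lower()
    )
```

For example, calling matching_pages(doc, "schema") on a document with fields 
page_1, page_2 and page_10 returns the numbers of the pages whose text mentions 
"schema", while non-page fields such as id are ignored.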

Thanks in advance!

Greetings
--
"It is only in the mysterious equation of love that any 
logical reasons can be found."
"Good programmers often confuse halloween (31 OCT) with 
christmas (25 DEC)"
