Re: Indexing several parts of PDF file

Upayavira Tue, 05 Feb 2013 06:06:29 -0800

With dynamic fields you would have to query a separate field for every
page in your document. That is too many fields, and it will break quickly.

The best way to do it is to index pages as documents. You can use field
collapsing to group pages from the same document together.
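
As a rough sketch of what that could look like (not a tested setup), the
Python below builds one document per page and a grouped query string. The
field names id, doc_id, page and content are my own assumptions, not
something from your schema, and the page texts are assumed to come from
whatever library you use to split the PDF. The group, group.field and
group.limit parameters are Solr's result grouping (field collapsing)
parameters; doc_id would need to be a single-valued, indexed string field
for grouping to work.

```python
from urllib.parse import urlencode


def pages_to_solr_docs(doc_id, pages):
    """Turn a list of page texts into one Solr document per page.

    Field names here are hypothetical; adapt them to your schema.
    """
    docs = []
    for number, text in enumerate(pages, start=1):
        docs.append({
            "id": f"{doc_id}_p{number}",  # unique key per page
            "doc_id": doc_id,             # shared by all pages of one PDF
            "page": number,               # page number stored as a plain field
            "content": text,
        })
    return docs


def grouped_query(terms, pages_per_doc=3):
    """Build the query string for a select request grouped by source PDF."""
    params = {
        "q": f"content:({terms})",
        "group": "true",              # turn on result grouping
        "group.field": "doc_id",      # one group per original PDF
        "group.limit": pages_per_doc, # matching pages returned per PDF
        "fl": "id,doc_id,page",
        "wt": "json",
    }
    return urlencode(params)
```

Each group in the response then corresponds to one PDF, and the page field
of the documents inside the group gives you the matching page numbers, so
there is no need to parse them out of field names.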

Upayavira

On Tue, Feb 5, 2013, at 02:00 PM, Jorge Luis Betancourt Gonzalez wrote:
> Hi:
> 
> I'm working on a search engine for several PDF documents. Right now one
> of the requirements is that we provide not only the documents matching
> the search criteria but also the pages that match. Normally Tika only
> extracts the text content and does not make this distinction. Using some
> custom library this could be achieved, but my question is how to
> structure the schema. From what I've seen, one approach could be to use
> dynamic fields:
> 
> <dynamicField name="page_*" type="text" indexed="true"  stored="true"/>
> 
> So at query time I could extract the page number from the field name. Is
> this the best approach? Is there a way to store the page number in an
> attribute instead of using dynamic fields?
> 
> Thanks in advance!
> 
> Greetings
> --
> "It is only in the mysterious equation of love that any 
> logical reasons can be found."
> "Good programmers often confuse halloween (31 OCT) with 
> christmas (25 DEC)"
