Thanks for the advice. The thing with this approach is that we are using Nutch 
as our crawler for the intranet, and right now indexing one crawled document as 
several Solr documents is not possible without changing the way Nutch works. Is 
there any other workaround for this?

Thanks for the replies!

----- Original Message -----
From: "Upayavira" <u...@odoko.co.uk>
To: solr-user@lucene.apache.org
Sent: Tuesday, February 5, 2013 9:05:58
Subject: Re: Indexing several parts of PDF file

This would involve querying against every page in your document, which
means too many fields and will quickly break down.

The best way to do it is to index pages as documents. You can use field
collapsing to group pages from the same document together.
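
For example, with one Solr document per page and fields along these lines
(the field names doc_id, page_num and content are only an example, not
anything prescribed):

    <field name="doc_id"   type="string" indexed="true" stored="true"/>
    <field name="page_num" type="int"    indexed="true" stored="true"/>
    <field name="content"  type="text"   indexed="true" stored="true"/>

a search grouped by the source file would look something like:

    q=content:foo&group=true&group.field=doc_id&fl=doc_id,page_num,score

Each group then corresponds to one PDF, and the documents inside the group
tell you which pages matched. The grouping field just needs to be a
single-valued, indexed field.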

Upayavira

On Tue, Feb 5, 2013, at 02:00 PM, Jorge Luis Betancourt Gonzalez wrote:
> Hi:
> 
> I'm working on a search engine for several PDF documents. One of the
> requirements is that we provide not only the documents matching the search
> criteria but also the pages that match. Normally Tika only extracts the text
> content and does not make this distinction, but with some custom library this
> could be achieved; my question is how to structure the schema. From what I've
> seen, one approach could be the use of dynamic fields:
> 
> <dynamicField name="page_*" type="text" indexed="true"  stored="true"/>
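> 
> With this, each crawled PDF would stay one Solr document and carry one field
> per page; just as a rough sketch (the file name and page text here are made
> up), an indexed document would look something like:
> 
> <doc>
>   <field name="id">some-report.pdf</field>
>   <field name="page_1">text extracted from the first page ...</field>
>   <field name="page_2">text extracted from the second page ...</field>
> </doc>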
> 
> So at query time I could extract the page number from the field's name. Is
> this the best approach? Is there any way of storing the page number in an
> attribute instead of using dynamic fields?
> 
> Thanks in advance!
> 
> Greetings
> --
> "It is only in the mysterious equation of love that any 
> logical reasons can be found."
> "Good programmers often confuse halloween (31 OCT) with 
> christmas (25 DEC)"
