On 1/20/2015 10:43 PM, Yusniel Hidalgo Delgado wrote:
> I have recently started working with Solr and I need help with the following
> usage scenario. I am working on a project to extract and search bibliographic
> metadata from PDF files. First, my PDF files are processed to extract
> bibliographic metadata such as title, authors, affiliations, keywords and
> abstract. This metadata is stored in a relational database and then
> indexed in Solr via DIH. However, I also need to index the full text of each
> PDF and keep the same ID for the indexed metadata and the indexed full text
> in the Solr index. How can I do that? How should I configure solrconfig.xml
> and schema.xml to do it?

How are you doing the indexing?  If it's in a program you wrote
yourself, simply extend that program to obtain the information you need
and add it to the document that you index.  The Apache Tika project is
one way to parse rich text documents.
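
A rough sketch of that idea, in case it helps. It assumes your indexing
program is written in Python, which may not be the case. The field names
(id, title, authors, abstract, fulltext) and the pdf_path column are
hypothetical and would have to match your own schema.xml and database.
The extract_text argument is a placeholder for whatever you use to get
text out of the PDF (Tika, for example). The point is only that the
metadata and the full text end up in one document under one id.

```python
import json
from typing import Any, Callable, Dict, List


def build_document(row: Dict[str, Any],
                   extract_text: Callable[[str], str]) -> Dict[str, Any]:
    """Merge one database row and the text of its PDF into one document.

    All field names here are hypothetical; adjust them to your schema.
    """
    doc = {
        # The database primary key becomes the Solr unique key, so the
        # metadata and the full text share the same ID by construction.
        "id": str(row["id"]),
        "title": row["title"],
        "authors": row["authors"],
        "abstract": row["abstract"],
    }
    # extract_text is supplied by the caller (a Tika wrapper, for instance).
    doc["fulltext"] = extract_text(row["pdf_path"])
    return doc


def to_update_payload(docs: List[Dict[str, Any]]) -> bytes:
    """Serialize documents as a JSON array, ready to send to Solr."""
    return json.dumps(docs).encode("utf-8")
```

The resulting payload could then be sent to Solr's update handler as
JSON, or you could hand the same fields to whatever client library your
program already uses.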

If you are using the dataimport handler, you are likely to need a nested
entity to gather the additional information and include it in the
document that is being indexed in the parent entity. The reply from
Alvaro shows one way to integrate Tika into DIH.  It looks like those
instructions are geared to an extremely old Solr version (3.6.2) and
probably won't work as-is on a newer version.  Solr 4.x was already
available when that blog post was written two years ago, so I don't know
why they went with 3.6.2.

Thanks,
Shawn
