On 1/20/2015 10:43 PM, Yusniel Hidalgo Delgado wrote:
> I have started working with Solr recently and I need help with the following
> usage scenario. I am working on a project to extract and search bibliographic
> metadata from PDF files. First, my PDF files are processed to extract
> bibliographic metadata such as title, authors, affiliations, keywords and
> abstract. This metadata is stored in a relational database and then
> indexed in Solr via DIH. However, I also need to index the full text of each
> PDF and keep the same ID for the indexed metadata and the indexed full text
> in the Solr index. How can I do that? How should I configure solrconfig.xml
> and schema.xml to do it?

How are you doing the indexing? If it's in a program you wrote yourself, simply extend that program to obtain the information you need and add it to the document that you index. The Apache Tika project is one way to parse rich text documents.

If you are using the dataimport handler, you are likely to need a nested entity to gather the additional information and include it in the document that is being indexed in the parent entity.

The reply from Alvaro shows one way to integrate Tika into DIH. It looks like those instructions are geared to an extremely old Solr version (3.6.2) and probably won't work as-is on a newer version. Solr 4.x was already available when that blog post was written two years ago, so I don't know why they went with 3.6.2.

Thanks,
Shawn
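
To illustrate the "program you wrote yourself" option, here is a minimal Python sketch of merging the database metadata and the extracted PDF text into a single Solr document that shares one ID. Everything specific in it is an assumption for illustration: the field names `title` and `fulltext` (your schema.xml would need matching fields), the helper names `build_solr_doc` and `post_to_solr`, and the update URL. Text extraction itself (for example with Tika) is left out; the sketch assumes you already have the text as a string.

```python
import json
import urllib.request


def build_solr_doc(doc_id, metadata, fulltext):
    """Combine metadata and full text into one document keyed by doc_id.

    `metadata` is a dict of field name -> value taken from the database.
    The "fulltext" field name is hypothetical and must exist in the schema.
    """
    doc = dict(metadata)        # copy so the caller's dict is not modified
    doc["fulltext"] = fulltext  # text already extracted from the PDF
    doc["id"] = str(doc_id)     # same ID for metadata and full text
    return doc


def post_to_solr(docs, update_url):
    """Send a list of documents to Solr's JSON update handler.

    `update_url` is an assumption, e.g.
    "http://localhost:8983/solr/collection1/update?commit=true".
    """
    body = json.dumps(docs).encode("utf-8")
    request = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Example: one record from the database plus its extracted text.
doc = build_solr_doc(42, {"title": "Some paper"}, "Full text of the PDF")
print(json.dumps(doc, sort_keys=True))
```

Because the metadata and the full text travel in the same document, there is nothing to keep in sync afterwards: the ID from the relational database is the only key Solr ever sees for that PDF.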