Re: Data Import Handler and Extract Handler

Gora Mohanty Thu, 27 Jun 2013 07:00:36 -0700

On 27 June 2013 13:42, Venter, Scott <scott.ven...@rmb.co.za> wrote:
> Hi all,
>
> I am new to SOLR. I have been working through the SOLR 4 Cookbook and my 
> experiences so far have been great.
>
> I have worked through the extraction of PDF data recipe, and the Data import 
> recipe. I would now like to join these two things, i.e. I would like to do a 
> data import from a Database table of users, and then somehow associate 
> indexed PDF data with rows that were imported.
>
> I have a conceptual link between rows in the database and pdf documents, but 
> I don't know how to make a physical link between the two in SOLR. For 
> example, I know that user x has pdf documents a, b and c.
>
> If I have imported my users into SOLR using Data Import Handler, how would I
>
> 1) import and associate the pdf documents using the extract mechanism, in 
> such a way that there is a link between user x and the 3 pdf documents as 
> described above?
[...]


Where are your PDF documents? Presumably on the filesystem
or available from a web service. What you can do is to have
two datasources in your DIH configuration file:
* The first one is a JdbcDataSource that extracts data from the
  database. Presumably, you already have this working.
* The second is a BinFileDataSource, assuming that your
  PDF files are on the filesystem.

Then set up the entities as follows:
* In the top-level entity, select the user and the names of the
  associated PDF files.
* Use a nested inner entity with the "dataSource" attribute set
  to the name of the BinFileDataSource, and use the
  TikaEntityProcessor to index the PDF files.

The documentation on this is a little scattered, but see:
  http://wiki.apache.org/solr/TikaEntityProcessor
  http://lucene.472066.n3.nabble.com/problem-to-indexing-pdf-directory-td3749554.html
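
A minimal sketch of what such a DIH configuration might look like.
Everything specific here is an assumption made for illustration: the
JDBC driver, URL and credentials, the "users" and "user_pdfs" tables
and their columns, and the Solr field names (which would also have to
exist in schema.xml). It assumes one row, and hence one Solr document,
per user/PDF pair, with the user's ID stored on each PDF document as
the link back to the user.

```xml
<dataConfig>
  <!-- Assumed JDBC settings: driver, url, user and password are placeholders -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/exampledb"
              user="solr" password="secret"/>
  <!-- Hands the raw bytes of local files to Tika -->
  <dataSource name="pdfs" type="BinFileDataSource"/>

  <document>
    <!-- Top-level entity: one row per user/PDF pair (hypothetical tables and columns) -->
    <entity name="userpdf" dataSource="db"
            query="SELECT p.pdf_id, u.user_id, u.user_name, p.file_path
                   FROM users u JOIN user_pdfs p ON p.user_id = u.user_id">
      <field column="pdf_id" name="id"/>
      <field column="user_id" name="user_id"/>
      <field column="user_name" name="user_name"/>

      <!-- Nested entity: Tika extracts the text of the file named in file_path -->
      <entity name="pdf" dataSource="pdfs" processor="TikaEntityProcessor"
              url="${userpdf.file_path}" format="text">
        <field column="text" name="pdf_text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With a layout like this, a query on user_id should return the PDF
documents belonging to that user. If the parent query returns column
names in a different case (some databases uppercase them), the
${userpdf.file_path} reference would need to match.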

Regards,
Gora
