On 27 June 2013 13:42, Venter, Scott <scott.ven...@rmb.co.za> wrote:
> Hi all,
>
> I am new to SOLR. I have been working through the SOLR 4 Cookbook and my
> experiences so far have been great.
>
> I have worked through the extraction of PDF data recipe, and the Data import
> recipe. I would now like to join these two things, i.e. I would like to do a
> data import from a Database table of users, and then somehow associate
> indexed PDF data with rows that were imported.
>
> I have a conceptual link between rows in the database and pdf documents, but
> I don't know how to make a physical link between the two in SOLR. For
> example, I know that user x has pdf documents a, b and c.
>
> If I have imported my users into SOLR using Data Import Handler, how would I
>
> 1) import and associate the pdf documents using the extract mechanism, in
> such a way that there is a link between user x and the 3 pdf documents as
> described above?
[...]

Where are your PDF documents? Presumably on the filesystem or available
from a web service. What you can do is to have two datasources in your
DIH configuration file:

* The first one is a JdbcDataSource that extracts data from a database.
  Presumably, you already have this working.
* The second is a BinFileDataSource, assuming that your PDF files are on
  the filesystem.
* In the top-level entity, select the user and the names of the
  associated PDF files.
* Use a nested inner entity with the "dataSource" attribute set to the
  BinFileDataSource, and use the TikaEntityProcessor to index the PDF
  files.

The documentation on this is a little scattered, but see:
http://wiki.apache.org/solr/TikaEntityProcessor
http://lucene.472066.n3.nabble.com/problem-to-indexing-pdf-directory-td3749554.html

Regards,
Gora
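
For illustration, the two-datasource setup described above might look roughly like the following DIH configuration. This is only a sketch under assumptions: the table and column names (users, user_pdfs, path), the JDBC driver, URL and credentials, and the Solr field names (id, name, pdf_text) are all hypothetical and need to be replaced with your own. It also varies the suggestion slightly by using an intermediate entity to look up the PDF paths for each user, rather than selecting them in the top-level query.

```xml
<dataConfig>
  <!-- Hypothetical connection details: substitute your own driver, URL and credentials -->
  <dataSource name="db" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/mydb"
              user="solr" password="secret"/>

  <!-- Reads the raw bytes of files on the local filesystem -->
  <dataSource name="pdfFiles" type="BinFileDataSource"/>

  <document>
    <!-- Top-level entity: one Solr document per user row -->
    <entity name="user" dataSource="db"
            query="SELECT id, name FROM users">
      <field column="id" name="id"/>
      <field column="name" name="name"/>

      <!-- Assumed link table mapping each user to the paths of their PDF files -->
      <entity name="userPdf" dataSource="db"
              query="SELECT path FROM user_pdfs WHERE user_id = '${user.id}'">

        <!-- Inner entity: Tika extracts the text of each PDF for this user -->
        <entity name="pdf" dataSource="pdfFiles"
                processor="TikaEntityProcessor"
                url="${userPdf.path}" format="text">
          <field column="text" name="pdf_text"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

With a layout like this, the extracted text of every PDF belonging to a user ends up on that user's document, so the target field (pdf_text here) would need to be declared multiValued in schema.xml when a user can have more than one PDF. TikaEntityProcessor ships in the DIH "extras" jar, which, together with the Tika libraries, has to be on Solr's classpath.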