SOLR-1499 is a plug-in for the DIH that uses Solr as a DataSource. This means that you can read the database and the PDFs separately. You could index all of the PDF content in one DIH script. Then, when there's a database update, a separate DIH script reads the old document from Solr, pulls the already-stripped PDF text from it, and re-indexes the whole thing with the new database values. This would cut out the need to reparse the PDF.
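To make the second step concrete, here is a rough Python sketch of the merge that the separate script would do. It is only an illustration of the idea, not what SOLR-1499 or the DIH actually runs. The function name `merge_for_reindex` and the field names `id` and `pdf_text` are made up for the example, and it assumes the extracted text field is stored in Solr so that it can be read back.

```python
def merge_for_reindex(stored_doc, db_row, text_field="pdf_text", id_field="id"):
    """Build the document to re-index without touching the PDF.

    stored_doc: the old document as read back from Solr (hypothetical shape).
    db_row:     the fresh row from the database, including permissions.
    """
    if stored_doc[id_field] != db_row[id_field]:
        raise ValueError("stored document and database row have different ids")
    # Start from the fresh database values so changed permissions win.
    merged = dict(db_row)
    # Reuse the text that was extracted the first time; no PDF parsing here.
    merged[text_field] = stored_doc[text_field]
    return merged


# Example with made-up data: only the permissions changed in the database.
stored = {"id": "doc-1", "pdf_text": "extracted text", "permissions": ["alice"]}
row = {"id": "doc-1", "title": "Report", "permissions": ["alice", "bob"]}
merged = merge_for_reindex(stored, row)
```

The merged dictionary is what would be sent back to Solr as the replacement document; the expensive decrypt-and-parse step only runs when the PDF itself changes.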
Lance

On Mon, Apr 11, 2011 at 8:48 AM, Shaun Campbell <campbell.sh...@gmail.com> wrote:
> If it's of any help I've split the processing of PDF files from the
> indexing. I put the PDF content into a text file (but I guess you could load
> it into a database) and use that as part of the indexing. My processing of
> the PDF files also compares timestamps on the document and the text file so
> that I'm only processing documents that have changed.
>
> I am a newbie so perhaps there's more sophisticated approaches.
>
> Hope that helps.
> Shaun
>
> On 11 April 2011 07:20, Darx Oman <darxo...@gmail.com> wrote:
>
>> Hi guys
>>
>> I'm wondering how to best configure solr to fulfills my requirements.
>>
>> I'm indexing data from 2 data sources:
>> 1- Database
>> 2- PDF files (password encrypted)
>>
>> Every file has related information stored in the database. Both the file
>> content and the related database fields must be indexed as one document in
>> solr. Among the DB data is *per-user* permissions for every document.
>>
>> The file contents nearly never change, on the other hand, the DB data and
>> especially the permissions change very frequently which require me to
>> re-index everything for every modified document.
>>
>> My problem is in process of decrypting the PDF files before re-indexing
>> them which takes too much time for a large number of documents, it could
>> span to days in full re-indexing.
>>
>> What I'm trying to accomplish is eliminating the need to re-index the PDF
>> content if not changed even if the DB data changed. I know this is not
>> possible in solr, because solr doesn't update documents.
>>
>> So how to best accomplish this:
>>
>> Can I use 2 indexes one for PDF contents and the other for DB data and have
>> a common id field for both as a link between them, *and results are treated
>> as one Document*?
>>
>

-- 
Lance Norskog
goks...@gmail.com