If it's of any help, I've split the processing of the PDF files from the indexing. I write the PDF content to a text file (though I guess you could load it into a database instead) and use that text file as the input to indexing. My PDF processing step also compares the timestamps of the document and its text file, so that I only process documents that have changed.
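In case it makes the idea clearer, here is a rough Python sketch of the timestamp check. It is not my actual script: the directory layout, the `.txt` naming, and the `extract_text` callable (whatever you use to decrypt and read a PDF) are all assumptions for illustration.

```python
from pathlib import Path
from typing import Callable, List


def needs_extraction(pdf_path: Path, text_path: Path) -> bool:
    """Return True if the cached text is missing or older than the PDF."""
    if not text_path.exists():
        return True
    return pdf_path.stat().st_mtime > text_path.stat().st_mtime


def refresh_text_cache(
    pdf_dir: Path,
    text_dir: Path,
    extract_text: Callable[[Path], str],  # hypothetical: decrypts + extracts
) -> List[Path]:
    """Re-extract only the PDFs that changed since their text file was written.

    Returns the list of PDFs that were actually processed.
    """
    text_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        # Assumed naming: report.pdf -> report.txt in the cache directory.
        text_path = text_dir / (pdf_path.stem + ".txt")
        if needs_extraction(pdf_path, text_path):
            text_path.write_text(extract_text(pdf_path), encoding="utf-8")
            processed.append(pdf_path)
    return processed
```

The indexing step then reads the cached text files together with the database fields, so a change in the DB data never triggers the slow decryption again.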
I am a newbie, so perhaps there are more sophisticated approaches. Hope that helps.

Shaun

On 11 April 2011 07:20, Darx Oman <darxo...@gmail.com> wrote:

> Hi guys
>
> I'm wondering how to best configure solr to fulfills my requirements.
>
> I'm indexing data from 2 data sources:
> 1- Database
> 2- PDF files (password encrypted)
>
> Every file has related information stored in the database. Both the file
> content and the related database fields must be indexed as one document in
> solr. Among the DB data is *per-user* permissions for every document.
>
> The file contents nearly never change, on the other hand, the DB data and
> especially the permissions change very frequently which require me to
> re-index everything for every modified document.
>
> My problem is in process of decrypting the PDF files before re-indexing
> them
> which takes too much time for a large number of documents, it could span to
> days in full re-indexing.
>
> What I'm trying to accomplish is eliminating the need to re-index the PDF
> content if not changed even if the DB data changed. I know this is not
> possible in solr, because solr doesn't update documents.
>
> So how to best accomplish this:
>
> Can I use 2 indexes one for PDF contents and the other for DB data and have
> a common id field for both as a link between them, *and results are treated
> as one Document*?
>