Question: Do any of you have your crawlers write to a database rather than directly to Solr and then use a connector to index to Solr from the database? If so, have you encountered any issues with this approach? If not, why not?
I have searched forums and the Solr/Lucene email archives (including browsing http://www.apache.org/foundation/public-archives.html) but have not found any discussion of this idea. I am certain I am not the first person to think of it; I suspect I simply have not figured out the right queries to turn up the relevant threads. Please forgive me if this has been discussed before and I just could not find it.

Background: I am new to Solr and have been asked to make improvements to our Solr configurations and crawlers. I have read that the Solr index should not be considered a source of record data. It is, in essence, a highly optimized index for generating search results rather than a retainer for record copies of data. The better approach is to rely on corporate data sources for the record data and to retain the ability to completely blow away the Solr index and repopulate it as search requirements change.

This made me think it might be a good idea to create a database of crawled data for our Solr index. The crawlers would write their findings to a corporate-supported database of our own design, for our own purposes, and we would then populate the Solr index from that database using a connector that writes from the database to Solr.

The only disadvantage I can think of is that we would need to write a simple interface to the database that lets our admin personnel "delete" a record from the Solr index. The record would not actually be deleted from the database, just flagged as not to be indexed; the interface would then send a delete command to Solr for each successfully "deleted" record (a rough sketch of this flow appears after the list below). I suspect this admin interface will grow over time, but for now deleting records is all we need. The rest of our admin work is query-related and can still be done through the Solr Console.

I can think of the following advantages:

* We would have a corporate-sponsored, backed-up repository for our crawled data, which would buffer us against any inadvertent loss of the Solr index.
* We would divorce the time it takes to crawl web pages from the time it takes to populate the Solr index. My Solr connector takes minutes to populate new Solr instances with the entire index from current production, compared to the hours or even days it takes to actually crawl the pages.
* We use URLs as the unique IDs in our Solr index. We could resolve the problem of retaining the shortest URL when duplicate content is detected simply by sorting the query that populates Solr by ID length, descending: the last URL encountered for any duplicate is then always the shortest (see the connector sketch after this list).
* We could easily ensure that certain classes of crawled content are always indexed last (or first, if you prefer) rather than relying on the timing of the crawlers.
* We could quickly and easily rebuild the Solr index from scratch at any time. This would be very valuable whenever changes to our Solr configurations require re-indexing our data.
* We could assign a unique boost value to an individual document by storing the boost in the database and applying it at index time (see the sketch after this list).
* We could continuously run a batch program against this database that removes broken links, with no impact on Solr, and then refresh Solr far more frequently than we do now, since the connector takes minutes rather than hours or days.
* We could store additional information, when the crawler can capture it, for the connector to pass along to Solr, such as:
  * the document's actual last-updated date
  * a boost value for that document
* The database could be used for other purposes as well, such as:
  * identifying a subset of representative data to use for evaluating configuration changes
  * giving easy access to the "indexed" data for analysis work done by those not familiar with Solr
Thanks in advance for your feedback.

Sincerely,
Clay Pryor
R&D SE Computer Science
9537 - Knowledge Systems
Sandia National Laboratories