Question:
Do any of you have your crawlers write to a database rather than directly to 
Solr, and then use a connector to index from the database into Solr?  If so, 
have you encountered any issues with this approach?  If not, why not?

I have searched forums and the Solr/Lucene email archives (including browsing 
of http://www.apache.org/foundation/public-archives.html) but have not found 
any discussions of this idea.  I am certain that I am not the first person to 
think of it.  I suspect that I have just not figured out the proper queries to 
find what I am looking for.  Please forgive me if this idea has been discussed 
before and I just couldn't find the discussions.

Background:
I am new to Solr and have been asked to make improvements to our Solr 
configurations and crawlers.  I have read that the Solr index should not be 
treated as the system of record for data.  It is in essence a highly optimized 
index for generating search results rather than a repository for record copies 
of data.  The better approach is to rely on corporate data sources for record 
data and retain the ability to completely blow away a Solr index and 
repopulate it as needed when search requirements change.
This made me think that it might be a good idea for us to create a database of 
crawled data to feed our Solr index.  The idea is that the crawlers would 
write their findings to a corporate-supported database of our own design, for 
our own purposes, and we would then populate our Solr index from this database 
using a connector that writes from the database to the Solr index.
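For concreteness, here is a minimal sketch of what such a crawl table might 
look like.  SQLite and all of the table and column names are just illustrative 
assumptions on my part, not a recommendation:

    import sqlite3

    conn = sqlite3.connect("crawl.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS crawled_pages (
            url          TEXT PRIMARY KEY, -- doubles as the Solr unique id
            content      TEXT,             -- extracted page text
            last_updated TEXT,             -- actual last-updated date, when known
            boost        REAL DEFAULT 1.0, -- per-document boost for index time
            do_not_index INTEGER DEFAULT 0 -- soft-delete flag set by admins
        )
    """)
    conn.commit()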
The only disadvantage I can think of for this approach is that we will need to 
write a simple interface to the database that allows our admin personnel to 
"Delete" a record from the Solr index.  Of course, the record won't be deleted 
from the database, just flagged as not to be indexed to Solr.  The interface 
will then send a delete command to Solr for each successfully "deleted" 
record.  I suspect this admin interface will grow over time, but for now we 
really only need to be able to delete records from the database.  All of the 
rest of our admin work is query related, which can still be done through the 
Solr Console.
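A rough sketch of that delete flow, assuming the illustrative crawl table 
above and Solr's standard JSON update endpoint (the collection name is 
hypothetical):

    import sqlite3
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update"  # hypothetical

    def admin_delete(url_id: str) -> None:
        # Flag the record in the database rather than removing it outright.
        conn = sqlite3.connect("crawl.db")
        conn.execute(
            "UPDATE crawled_pages SET do_not_index = 1 WHERE url = ?", (url_id,)
        )
        conn.commit()
        # Then tell Solr to drop the document by its unique id (the URL).
        resp = requests.post(
            SOLR_UPDATE,
            json={"delete": {"id": url_id}},
            params={"commit": "true"},
        )
        resp.raise_for_status()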
I can think of the following advantages:

  *   We have a corporate-sponsored and backed-up repository for our crawled 
data, which would insulate us from any inadvertent loss of our Solr index.
  *   We would decouple the time it takes to crawl web pages from the time it 
takes to populate our Solr index with data from the crawlers.  I have found 
that my Solr connector takes minutes to populate an entire new Solr instance 
from the current production index, compared to the hours or even days it takes 
to actually crawl the web pages.
  *   We use URLs as the unique IDs in our Solr index.  We can resolve the 
problem of retaining the shortest URL when duplicate content is detected in 
Solr simply by sorting the query used to populate Solr from the database by id 
length descending.  Because de-duplication overwrites earlier documents with 
later ones, the last (and therefore shortest) URL encountered for any 
duplicate always wins (see the connector sketch after this list).
  *   We can easily ensure that certain classes of crawled content are always 
added last (or first if you prefer) whenever the data is indexed to Solr - 
rather than having to rely on the timing of crawlers.
  *   We could quickly and easily rebuild our Solr index from scratch at any 
time.  This would be very valuable when changes to our Solr configurations 
require re-indexing our data.
  *   We can assign a unique boost value to an individual "document" by 
storing a boost value with that document's record in the database and applying 
it at index time.
  *   We can continuously run a batch program against this database that 
removes broken links, with no impact on Solr, and then refresh Solr more 
frequently than we do now, because the connector will take minutes rather than 
hours or days to refresh the content.
  *   We can store additional information for the connector to populate to 
Solr when available, such as:
     *   actual document last-updated dates
     *   the boost value for that document in the database
  *   This database could be used for other purposes such as:
     *   Identifying a subset of representative data to use for evaluation of 
configuration changes.
     *   Easy access to "indexed" data for analysis work done by those not 
familiar with Solr.
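To make the shortest-URL trick concrete, here is a minimal connector sketch. 
It again assumes the illustrative table above and Solr's standard JSON update 
endpoint, that duplicate detection is configured so the last document indexed 
wins, and dynamic field names (content_t, last_updated_dt) that are my own 
assumptions:

    import sqlite3
    import requests

    SOLR_UPDATE = "http://localhost:8983/solr/mycollection/update"  # hypothetical

    def refresh_solr() -> None:
        conn = sqlite3.connect("crawl.db")
        conn.row_factory = sqlite3.Row
        # Longest URLs first, shortest last: since the last duplicate indexed
        # wins, the shortest URL is the one retained for duplicate content.
        rows = conn.execute(
            """
            SELECT url, content, last_updated
              FROM crawled_pages
             WHERE do_not_index = 0
             ORDER BY LENGTH(url) DESC
            """
        )
        docs = [
            {
                "id": row["url"],
                "content_t": row["content"],
                "last_updated_dt": row["last_updated"],
            }
            for row in rows
        ]
        # One batch for brevity; a real connector would chunk and stream this.
        resp = requests.post(SOLR_UPDATE, json=docs, params={"commit": "true"})
        resp.raise_for_status()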
Thanks in advance for your feedback.
Sincerely,
Clay Pryor
R&D SE Computer Science
9537 - Knowledge Systems
Sandia National Laboratories
