Clayton:

I think you've done a pretty thorough investigation, and I think you're
spot-on. The only thing I would add is that you _will_ reindex your
entire corpus... multiple times. Count on it. Sometime, somewhere,
somebody will say "gee, wouldn't it be nice if we could <insert new
use-case here>?" And to support it you'll have to change your Solr
schema, which will almost certainly require you to re-index.

The other thing people have done for deleting documents is to create
triggers in your DB that insert the deleted doc IDs into, say, a
"deleted" table along with a timestamp. Whenever necessary/desirable,
run a cleanup task that finds all the IDs flagged since the last time
your deleting program ran and removes those docs from Solr. Obviously
you also have to keep a record of the timestamp of the last successful
run of this program.
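
If it helps, here's a rough sketch of what that cleanup task could look
like with plain JDBC plus SolrJ. Everything here is made up for
illustration - the connection URLs, the "deleted" and "cleanup_state"
tables, the column names - so adjust for your environment:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.sql.Timestamp;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class DeletedDocsCleanup {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder(
             "http://localhost:8983/solr/mycollection").build();
         Connection db = DriverManager.getConnection(
             "jdbc:postgresql://dbhost/crawl", "user", "password")) {

      // Find every doc ID flagged since the last successful run.
      Timestamp lastRun = readLastSuccessfulRun(db);
      List<String> ids = new ArrayList<>();
      try (PreparedStatement ps = db.prepareStatement(
               "SELECT doc_id FROM deleted WHERE deleted_at > ?")) {
        ps.setTimestamp(1, lastRun);
        try (ResultSet rs = ps.executeQuery()) {
          while (rs.next()) {
            ids.add(rs.getString("doc_id"));
          }
        }
      }

      // Remove the flagged docs from the index, then record the run.
      if (!ids.isEmpty()) {
        solr.deleteById(ids);
        solr.commit();
      }
      writeLastSuccessfulRun(db, Timestamp.from(Instant.now()));
    }
  }

  // One-row bookkeeping table holding the timestamp of the last good run.
  static Timestamp readLastSuccessfulRun(Connection db) throws SQLException {
    try (Statement st = db.createStatement();
         ResultSet rs = st.executeQuery("SELECT last_run FROM cleanup_state")) {
      return rs.next() ? rs.getTimestamp("last_run") : new Timestamp(0L);
    }
  }

  static void writeLastSuccessfulRun(Connection db, Timestamp t)
      throws SQLException {
    try (PreparedStatement ps = db.prepareStatement(
             "UPDATE cleanup_state SET last_run = ?")) {
      ps.setTimestamp(1, t);
      ps.executeUpdate();
    }
  }
}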

Or, frankly, since it takes so little time to rebuild from scratch,
people have forgone any of that complexity and simply rebuild the
entire index periodically. You can use "collection aliasing" to do
this in the background and then switch searches atomically; it depends
somewhat on how long you can wait until you need to see (well, _not_
see) the deleted docs.
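
If you go the aliasing route, the switch itself is just a couple of
Collections API calls. A minimal SolrJ sketch (SolrCloud assumed; the
ZooKeeper string and collection/alias names are made up, and the client
construction varies a bit by SolrJ version):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class SwapSearchAlias {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
             .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {

      // Queries always hit the alias "docs". Rebuild into a brand-new
      // collection (say "docs_20160513") in the background, then...

      // ...repoint the alias. CREATEALIAS replaces an existing alias,
      // so searches cut over to the new collection in one step.
      CollectionAdminRequest.createAlias("docs", "docs_20160513")
          .process(client);

      // Once you're happy with the new index, drop the old collection.
      CollectionAdminRequest.deleteCollection("docs_20160506")
          .process(client);
    }
  }
}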

But these are all refinements; I think you're going down the right path.

And when you say "connector", are you talking about DIH (the
DataImportHandler) or an external (say, SolrJ) program?

Best,
Erick

On Fri, May 13, 2016 at 2:04 PM, John Bickerstaff
<j...@johnbickerstaff.com> wrote:
> I've been working on a less complex thing along the same lines - taking
> all the data from our corporate database and pumping it into Kafka for
> long-term storage - and keeping the ability to "play back" all the Kafka
> messages any time we need to re-index.
>
> That simpler scenario has worked like a charm.  I don't need to massage
> the data much once it's at rest in Kafka, so that was a straightforward
> solution, although I could have gone with a DB and just stored the Solr
> documents with their IDs, one per row, in an RDBMS...
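>
> For what it's worth, the replay side is not much code. Here's a rough
> sketch with the plain Kafka consumer API plus SolrJ - the topic name,
> the field mapping, and the crude "caught up" check are all placeholders:
>
> import org.apache.kafka.clients.consumer.ConsumerRecord;
> import org.apache.kafka.clients.consumer.ConsumerRecords;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> import org.apache.solr.common.SolrInputDocument;
>
> import java.util.Collections;
> import java.util.Properties;
>
> public class ReplayKafkaToSolr {
>   public static void main(String[] args) throws Exception {
>     Properties props = new Properties();
>     props.put("bootstrap.servers", "kafka1:9092");
>     // A fresh group id plus "earliest" makes the consumer start at the
>     // beginning of the topic, i.e. a full replay.
>     props.put("group.id", "solr-replay-" + System.currentTimeMillis());
>     props.put("auto.offset.reset", "earliest");
>     props.put("key.deserializer",
>         "org.apache.kafka.common.serialization.StringDeserializer");
>     props.put("value.deserializer",
>         "org.apache.kafka.common.serialization.StringDeserializer");
>
>     try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
>          SolrClient solr = new HttpSolrClient.Builder(
>              "http://localhost:8983/solr/mycollection").build()) {
>
>       consumer.subscribe(Collections.singletonList("crawled-docs"));
>
>       int emptyPolls = 0;
>       while (emptyPolls < 5) {  // crude end-of-replay check; a real job
>                                 // would compare against the end offsets
>         ConsumerRecords<String, String> records = consumer.poll(1000);
>         if (records.isEmpty()) { emptyPolls++; continue; }
>         emptyPolls = 0;
>         for (ConsumerRecord<String, String> r : records) {
>           SolrInputDocument doc = new SolrInputDocument();
>           doc.addField("id", r.key());          // message key = doc id
>           doc.addField("body_txt", r.value());  // placeholder field mapping
>           solr.add(doc);
>         }
>       }
>       solr.commit();
>     }
>   }
> }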
>
> The rest sounds like a good set of ideas for your situation, since Solr
> isn't the best candidate for the kind of data manipulation you're
> proposing and a database excels at that.  It's more work, but you get a
> lot more flexibility and, as you say, you de-couple Solr from the data
> crawling.
>
> It all sounds pretty good to me, but I've only been on the list here a
> short time - so I'll leave it to others to add their comments.
>
> On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J <cjpr...@sandia.gov>
> wrote:
>
>> Question:
>> Do any of you have your crawlers write to a database rather than directly
>> to Solr and then use a connector to index to Solr from the database?  If
>> so, have you encountered any issues with this approach?  If not, why not?
>>
>> I have searched forums and the Solr/Lucene email archives (including
>> browsing of http://www.apache.org/foundation/public-archives.html) but
>> have not found any discussions of this idea.  I am certain that I am not
>> the first person to think of it.  I suspect that I have just not figured
>> out the proper queries to find what I am looking for.  Please forgive me if
>> this idea has been discussed before and I just couldn't find the
>> discussions.
>>
>> Background:
>> I am new to Solr and have been asked to make improvements to our Solr
>> configurations and crawlers.  I have read that the Solr index should not be
>> considered a source of record data.  It is in essence a highly optimized
>> index to be used for generating search results rather than a retainer for
>> record copies of data.  The better approach is to rely on corporate data
>> sources for record data and retain the ability to completely blow away a
>> Solr index and repopulate it as needed for changing search requirements.
>> This made me think that perhaps it would be a good idea for us to create a
>> database of crawled data for our Solr index.  The idea is that the crawlers
>> would write their findings to a corporate supported database of our own
>> design for our own purposes and then we would populate our Solr index from
>> this database using a connector that writes from the database to the Solr
>> index.
>>
>> The only disadvantage I can think of for this approach is that we will
>> need to write a simple interface to the database that allows our admin
>> personnel to "delete" a record from the Solr index.  Of course, the record
>> won't be deleted from the database but simply flagged as not to be indexed
>> to Solr; the interface will then send a delete command to Solr for any
>> successfully "deleted" records.  I suspect this admin interface will grow
>> over time, but for now we really only need to be able to delete records
>> from the database.  All of the rest of our admin work is query related,
>> which can still be done through the Solr Console.
>>
>> I can think of the following advantages:
>>
>>   *   We have a corporate sponsored and backed up repository for our
>> crawled data which would buffer us from any inadvertent losses of our Solr
>> index.
>>   *   We would divorce the time it takes to crawl web pages from the time
>> it takes to populate our Solr index with data from the crawlers.  I have
>> found that my Solr Connector takes minutes to populate the entire Solr
>> index from the current Solr prod to the new Solr instances.  Compare that
>> to hours and even days to actually crawl the web pages.
>>   *   We use URLs for our unique IDs in our Solr index.  We can ensure
>> that the shortest URL is retained when duplicate content is detected in
>> Solr simply by sorting the query used to populate Solr from the database
>> by id length descending - this guarantees that the last URL indexed for
>> any duplicate is always the shortest (see the sketch after this list).
>>   *   We can easily ensure that certain classes of crawled content are
>> always added last (or first if you prefer) whenever the data is indexed to
>> Solr - rather than having to rely on the timing of crawlers.
>>   *   We could quickly and easily rebuild our Solr index from scratch at
>> any time.  This would be very valuable when changes to our Solr
>> configurations require re-indexing our data.
>>   *   We can assign unique boost values to individual "documents" by
>> storing a boost value for each document in the database and then applying
>> that boost at index time.
>>   *   We can continuously run a batch program against this database that
>> removes broken links, with no impact on Solr, and then refresh Solr more
>> frequently than we do now because the connector takes minutes rather than
>> hours/days to refresh the content.
>>   *   We can store additional information for the crawler to pass along
>> to Solr when available - such as:
>>      *   actual document last-updated dates
>>      *   a boost value for that document in the database
>>   *   This database could be used for other purposes such as:
>>      *   Identifying a subset of representative data to use for evaluation
>> of configuration changes.
>>      *   Easy access to "indexed" data for analysis work done by those not
>> familiar with Solr.
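>>
>> To illustrate the duplicate-handling point above: the connector could be
>> little more than a query ordered by URL length feeding SolrJ.  Assuming
>> the duplicate detection itself happens on the Solr side (e.g. a
>> signature-based de-duplication update processor with overwriteDupes
>> enabled), the shortest URL, indexed last, is the one that survives.  All
>> table, column, and field names below are invented:
>>
>> import org.apache.solr.client.solrj.SolrClient;
>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>> import org.apache.solr.common.SolrInputDocument;
>>
>> import java.sql.Connection;
>> import java.sql.DriverManager;
>> import java.sql.ResultSet;
>> import java.sql.Statement;
>>
>> public class DbToSolrConnector {
>>   public static void main(String[] args) throws Exception {
>>     try (Connection db = DriverManager.getConnection(
>>              "jdbc:postgresql://dbhost/crawl", "user", "password");
>>          SolrClient solr = new HttpSolrClient.Builder(
>>              "http://localhost:8983/solr/mycollection").build()) {
>>
>>       // Longest URLs first, shortest last, so the shortest URL for any
>>       // duplicated content is the last one Solr sees.
>>       String sql = "SELECT url, title, body, boost FROM crawled_docs "
>>                  + "WHERE do_not_index = false "
>>                  + "ORDER BY LENGTH(url) DESC";
>>
>>       try (Statement st = db.createStatement();
>>            ResultSet rs = st.executeQuery(sql)) {
>>         while (rs.next()) {
>>           SolrInputDocument doc = new SolrInputDocument();
>>           doc.addField("id", rs.getString("url"));
>>           doc.addField("title", rs.getString("title"));
>>           doc.addField("body_txt", rs.getString("body"));
>>           doc.addField("boost_f", rs.getFloat("boost")); // per-doc boost carried as a field
>>           solr.add(doc);
>>         }
>>       }
>>       solr.commit();
>>     }
>>   }
>> }
>>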
>> Thanks in advance for your feedback.
>> Sincerely,
>> Clay Pryor
>> R&D SE Computer Science
>> 9537 - Knowledge Systems
>> Sandia National Laboratories
>>
