Thank you for your feedback.  I really appreciate you taking the time to write 
it up for me (and hopefully for others who might be considering the same 
approach).  My first thought for dealing with deleted docs was to delete the 
contents and rebuild the index from scratch, but my primary customer for the 
deleted-docs functionality wants deletions reflected immediately.  I wrote a 
connector for transferring the contents of one Solr index to another (I call it 
a Solr connector), and that takes about half an hour.  As a side note, the 
reason I have multiple indexes is that we currently have physical servers for 
development and production but, as part of my effort, I am transitioning us to 
new VMs for development, quality, and production.  For quality-control purposes 
I wanted to be able to reset each environment with the same set of data - hence 
the Solr connector.
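
For anyone curious, the general shape of a Solr-to-Solr copy like that is 
roughly the following (SolrJ, with placeholder host and core names, carrying 
over stored fields only - a sketch, not my exact code):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class SolrToSolrCopy {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient source = new HttpSolrClient.Builder(
                     "http://source-host:8983/solr/mycore").build();
                 HttpSolrClient target = new HttpSolrClient.Builder(
                     "http://target-host:8983/solr/mycore").build()) {

                SolrQuery q = new SolrQuery("*:*");
                q.setRows(500);
                // cursorMark paging requires a sort that includes the unique key
                q.setSort(SolrQuery.SortClause.asc("id"));
                String cursor = CursorMarkParams.CURSOR_MARK_START;
                boolean done = false;

                while (!done) {
                    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                    QueryResponse rsp = source.query(q);
                    for (SolrDocument d : rsp.getResults()) {
                        SolrInputDocument in = new SolrInputDocument();
                        for (String f : d.getFieldNames()) {
                            // skip _version_ so optimistic concurrency doesn't reject the add
                            if (!"_version_".equals(f)) {
                                in.addField(f, d.getFieldValue(f));
                            }
                        }
                        target.add(in);
                    }
                    String next = rsp.getNextCursorMark();
                    done = cursor.equals(next);
                    cursor = next;
                }
                target.commit();
            }
        }
    }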

Yes, by connector I mean a Java program (using SolrJ) that reads from the 
database and populates the Solr index.  For now I have had our enterprise DBAs 
create a single table that holds the current index schema fields plus a few 
others I can think of that we might use outside of the index.  So far it is a 
completely flat structure, so it will be easy to index to Solr.  I can see, 
though, that as requirements change we may need a more sophisticated database 
(with multiple tables and greater normalization), in which case the connector 
will have to flatten the data for the Solr index.
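
The connector's main loop is essentially "read a row, build a 
SolrInputDocument, add, commit" - something like this sketch (the JDBC URL, 
table, and column names here are made up for illustration, not our actual 
schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class DbToSolrConnector {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/CRAWL", "user", "password");
                 HttpSolrClient solr = new HttpSolrClient.Builder(
                     "http://solrhost:8983/solr/mycore").build();
                 Statement stmt = db.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT url, title, content, last_updated"
                     + " FROM crawl_docs WHERE deleted_flag = 'N'")) {

                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getString("url"));   // URL is our unique key
                    doc.addField("title", rs.getString("title"));
                    doc.addField("content", rs.getString("content"));
                    doc.addField("last_updated", rs.getTimestamp("last_updated"));
                    solr.add(doc);
                }
                solr.commit();
            }
        }
    }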

Thanks again, your response has been very reassuring!

:)

Clay

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, May 13, 2016 5:57 PM
To: solr-user
Subject: [EXTERNAL] Re: Does anybody crawl to a database and then index from 
the database to Solr?

Clayton:

I think you've done a pretty thorough investigation, and I think you're 
spot-on.  The only thing I would add is that you _will_ reindex your entire 
corpus... multiple times.  Count on it.  Sometime, somewhere, somebody will say 
"gee, wouldn't it be nice if we could <insert new use-case here>".  And to 
support it you'll have to change your Solr schema... which will almost 
certainly require you to re-index.

The other thing people have done for deleting documents is to create triggers 
in the DB that insert deleted doc IDs into, say, a "deleted" table along with a 
timestamp.  Whenever necessary/desirable, run a cleanup task that finds all the 
IDs flagged since the last run of your deleting program and removes those docs 
from Solr.  Obviously you also have to keep a record of the timestamp of the 
last successful run of this program.
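
A rough sketch of such a cleanup task in SolrJ (the "deleted" table and its 
doc_id/deleted_at columns are hypothetical names, and you'd persist the 
last-run timestamp however you like):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class DeletedDocsCleanup {
        public static void main(String[] args) throws Exception {
            // timestamp of the last successful run, loaded from wherever you keep it
            Timestamp lastRun = Timestamp.valueOf(args[0]);  // e.g. "2016-05-13 00:00:00"

            try (Connection db = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/CRAWL", "user", "password");
                 HttpSolrClient solr = new HttpSolrClient.Builder(
                     "http://solrhost:8983/solr/mycore").build();
                 PreparedStatement ps = db.prepareStatement(
                     "SELECT doc_id FROM deleted WHERE deleted_at > ?")) {

                ps.setTimestamp(1, lastRun);
                List<String> ids = new ArrayList<>();
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        ids.add(rs.getString("doc_id"));
                    }
                }
                if (!ids.isEmpty()) {
                    solr.deleteById(ids);   // remove the flagged docs from the index
                    solr.commit();
                }
                // then record the new last-successful-run timestamp (file, DB row, etc.)
            }
        }
    }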

Or, frankly, since it takes so little time to rebuild from scratch, people have 
forgone any of that complexity and simply rebuild the entire index 
periodically.  You can use "collection aliasing" to do this in the background 
and then switch searches atomically; it depends somewhat on how long you can 
wait until you need to see (well, _not_ see) the deleted docs.
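
If you go the aliasing route (SolrCloud only), repointing the alias is a single 
Collections API call.  In SolrJ it looks roughly like this, with "search" as 
the alias your application queries and "index_v2" as the freshly rebuilt 
collection (both names hypothetical):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class SwapAlias {
        public static void main(String[] args) throws Exception {
            // any node's base URL will do for Collections API requests
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                     "http://solrhost:8983/solr").build()) {
                // CREATEALIAS on an existing alias atomically repoints it
                CollectionAdminRequest.createAlias("search", "index_v2").process(client);
            }
        }
    }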

But these are all refinements; I think you're going down the right path.

And when you say "connector", are you talking about DIH or an external (say, 
SolrJ) program?

Best,
Erick

On Fri, May 13, 2016 at 2:04 PM, John Bickerstaff <j...@johnbickerstaff.com> 
wrote:
> I've been working on a less complex thing along the same lines - 
> taking all the data from our corporate database and pumping it into 
> Kafka for long-term storage -- and the ability to "play back" all the 
> Kafka messages any time we need to re-index.
>
> That simpler scenario has worked like a charm.  I don't need to 
> massage the data much once it's at rest in Kafka, so that was a 
> straightforward solution, although I could have gone with a DB and 
> just stored the Solr documents with their IDs, one per row, in an RDBMS...
>
> The rest sounds like a good approach for your situation, as Solr isn't the 
> best candidate for the kind of data manipulation you're proposing 
> and a database excels at that.  It's more work, but you get a lot more 
> flexibility, and you decouple Solr from the data crawling, as you say.
>
> It all sounds pretty good to me, but I've only been on the list here a 
> short time - so I'll leave it to others to add their comments.
>
> On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J <cjpr...@sandia.gov>
> wrote:
>
>> Question:
>> Do any of you have your crawlers write to a database rather than 
>> directly to Solr and then use a connector to index to Solr from the 
>> database?  If so, have you encountered any issues with this approach?  If 
>> not, why not?
>>
>> I have searched forums and the Solr/Lucene email archives (including 
>> browsing of http://www.apache.org/foundation/public-archives.html) 
>> but have not found any discussions of this idea.  I am certain that I 
>> am not the first person to think of it.  I suspect that I have just 
>> not figured out the proper queries to find what I am looking for.  
>> Please forgive me if this idea has been discussed before and I just 
>> couldn't find the discussions.
>>
>> Background:
>> I am new to Solr and have been asked to make improvements to our Solr 
>> configurations and crawlers.  I have read that the Solr index should 
>> not be considered a source of record data.  It is in essence a highly 
>> optimized index to be used for generating search results rather than 
>> a retainer for record copies of data.  The better approach is to rely 
>> on corporate data sources for record data and retain the ability to 
>> completely blow away a Solr index and repopulate it as needed for changing 
>> search requirements.
>> This made me think that perhaps it would be a good idea for us to 
>> create a database of crawled data for our Solr index.  The idea is 
>> that the crawlers would write their findings to a corporate supported 
>> database of our own design for our own purposes and then we would 
>> populate our Solr index from this database using a connector that 
>> writes from the database to the Solr index.
>> The only disadvantage that I can think of for this approach is that 
>> we will need to write a simple interface to the database that allows 
>> our admin personnel to "Delete" a record from the Solr index.  Of 
>> course, the record won't be deleted from the database, just flagged so it is 
>> not indexed to Solr.  The interface will then send a delete command to Solr 
>> for any successfully "deleted" records.  I suspect this admin interface will grow 
>> over time but we really only need to be able to delete records from 
>> the database for now.  All of the rest of our admin work is query 
>> related which can still be done through the Solr Console.
>> I can think of the following advantages:
>>
>>   *   We have a corporate sponsored and backed up repository for our
>> crawled data which would buffer us from any inadvertent losses of our 
>> Solr index.
>>   *   We would divorce the time it takes to crawl web pages from the time
>> it takes to populate our Solr index with data from the crawlers.  I 
>> have found that my Solr Connector takes minutes to populate the 
>> entire Solr index from the current Solr prod to the new Solr 
>> instances.  Compare that to hours and even days to actually crawl the web 
>> pages.
>>   *   We use URLs for our unique IDs in our Solr index.  We can resolve
>> the problem of retaining the shortest URL when duplicate content is 
>> detected in Solr simply by sorting the query used to populate Solr 
>> from the database by id length descending - this will ensure the last 
>> URL encountered for any duplicate is always the shortest.
>>   *   We can easily ensure that certain classes of crawled content are
>> always added last (or first if you prefer) whenever the data is 
>> indexed to Solr - rather than having to rely on the timing of crawlers.
>>   *   We could quickly and easily rebuild our Solr index from scratch at
>> any time.  This would be very valuable when changes to our Solr 
>> configurations require re-indexing our data.
>>   *   We can assign unique boost values to individual "documents" by storing
>> a boost value for each document in the database and applying it at index time.
>>   *   We can continuously run a batch program that removes broken links
>> against this database with no impact to Solr and then refresh Solr on 
>> a more frequent basis than we do now because the connector will take 
>> minutes rather than hours/days to refresh the content.
>>   *   We can store additional information for the crawler to populate to
>> Solr when available - such as:
>>      *   actual document last updated dates
>>      *   boost value for that document in the database
>>   *   This database could be used for other purposes such as:
>>      *   Identifying a subset of representative data to use for evaluation
>> of configuration changes.
>>      *   Easy access to "indexed" data for analysis work done by those not
>> familiar with Solr.
>> Thanks in advance for your feedback.
>> Sincerely,
>> Clay Pryor
>> R&D SE Computer Science
>> 9537 - Knowledge Systems
>> Sandia National Laboratories
>>
