bq: ...my primary customer for the deleted docs functionality wants to
see it immediately...

I don't quite know how docs get deleted, but presumably you have a
uniqueKey. The fastest approach would be a delete trigger on your database
table that puts that uniqueKey and a timestamp into a "deleted_docs" table.

Now your program wakes up periodically, issues a SQL query against the
deleted_docs table like "select uniqueKey, timestamp from deleted_docs
where timestamp > last_time_I_ran", and uses Solr's delete-by-id to remove
those documents from the index.

Once you're sure the docs are gone, clean up the deleted_docs table, or
record the last timestamp the query returned somewhere to use the next
time around, or.....
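
Purely as a sketch (SolrJ plus plain JDBC; the table/column names, the
Solr URL, and how you persist the last-run timestamp are all placeholders,
not your actual schema), that periodic delta-delete could look something
like:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeltaDelete {
    public static void main(String[] args) throws Exception {
        // however you persist it, e.g. a one-row table or a properties file
        Timestamp lastRun = Timestamp.valueOf(args[0]);
        Timestamp newLastRun = lastRun;
        List<String> ids = new ArrayList<>();

        try (Connection con = DriverManager.getConnection("jdbc:your_db_url");
             PreparedStatement ps = con.prepareStatement(
                 "select uniqueKey, deleted_at from deleted_docs where deleted_at > ?")) {
            ps.setTimestamp(1, lastRun);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString(1));
                    Timestamp ts = rs.getTimestamp(2);
                    if (ts.after(newLastRun)) newLastRun = ts;
                }
            }
        }

        if (!ids.isEmpty()) {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/yourcollection").build()) {
                solr.deleteById(ids);   // Solr delete-by-id on the uniqueKey values
                solr.commit();
            }
        }
        // store newLastRun somewhere for the next run
        System.out.println("Deleted " + ids.size() + " docs, up to " + newLastRun);
    }
}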

There's no reason you can't have some pattern like "index from scratch
every night and run deltas every minute during the day" or some such.

Best,
Erick

On Sun, May 15, 2016 at 11:55 AM, abhi Abhishek <abhi26...@gmail.com> wrote:
> Clayton
>
>         you could also try running an optimize on the Solr index as a
> weekly/bi-weekly maintenance task to keep the segment count in check and
> the maxDoc and numDocs counts as close as possible (in DB terms,
> de-fragmenting the Solr indexes)
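>
> As a rough sketch (the URL and collection name are placeholders), that
> scheduled optimize can be a tiny SolrJ program run from cron:
>
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>
> public class WeeklyOptimize {
>     public static void main(String[] args) throws Exception {
>         try (SolrClient solr = new HttpSolrClient.Builder(
>                 "http://localhost:8983/solr/yourcollection").build()) {
>             // merges segments and expunges deleted docs; I/O heavy, so run off-peak
>             solr.optimize();
>         }
>     }
> }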
>
> Best Regards,
> Abhishek
>
>
> On Sun, May 15, 2016 at 7:18 PM, Pryor, Clayton J <cjpr...@sandia.gov>
> wrote:
>
>> Thank you for your feedback.  I really appreciate you taking the time to
>> write it up for me (and hopefully others who might be considering the
>> same).  My first thought for dealing with deleted docs was to delete the
>> contents and rebuild the index from scratch but my primary customer for the
>> deleted docs functionality wants to see it immediately.  I wrote a
>> connector for transferring the contents of one Solr Index to another (I
>> call it a Solr connector) and that takes a half hour.  As a side note, the
>> reason I have multiple indexes is because we currently have physical
>> servers for development and production but, as part of my effort, I am
>> transitioning us to new VMs for development, quality, and production.  For
>> quality control purposes I wanted to be able to reset each with the same
>> set of data - thus the Solr connector.
>>
>> Yes, by connector I am talking about a Java program (using SolrJ) that
>> reads from the database and populates the Solr Index.  For now I have had
>> our enterprise DBAs create a single table to hold the current index schema
>> fields plus some that I can think of that we might use outside of the
>> index.  So far it is a completely flat structure, so it will be easy to
>> index to Solr, but I can see that, as requirements change, we may need a
>> more sophisticated database (with multiple tables and greater
>> normalization), in which case the connector will have to flatten the data
>> for the Solr index.
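>>
>> For illustration only, the heart of that connector boils down to a loop
>> like the one below (the table, column, and field names here are made up,
>> not our real schema):
>>
>> import java.sql.Connection;
>> import java.sql.DriverManager;
>> import java.sql.ResultSet;
>> import java.sql.Statement;
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.solr.client.solrj.SolrClient;
>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>> import org.apache.solr.common.SolrInputDocument;
>>
>> public class DbToSolrConnector {
>>     public static void main(String[] args) throws Exception {
>>         try (Connection con = DriverManager.getConnection("jdbc:your_db_url");
>>              Statement st = con.createStatement();
>>              ResultSet rs = st.executeQuery(
>>                  "select id, title, body from crawl_docs where index_to_solr = 1");
>>              SolrClient solr = new HttpSolrClient.Builder(
>>                  "http://localhost:8983/solr/yourcollection").build()) {
>>             List<SolrInputDocument> batch = new ArrayList<>();
>>             while (rs.next()) {
>>                 SolrInputDocument doc = new SolrInputDocument();
>>                 doc.addField("id", rs.getString("id"));
>>                 doc.addField("title", rs.getString("title"));
>>                 doc.addField("body", rs.getString("body"));
>>                 batch.add(doc);
>>                 // send in batches rather than one doc at a time
>>                 if (batch.size() == 1000) { solr.add(batch); batch.clear(); }
>>             }
>>             if (!batch.isEmpty()) solr.add(batch);
>>             solr.commit();
>>         }
>>     }
>> }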
>>
>> Thanks again, your response has been very reassuring!
>>
>> :)
>>
>> Clay
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Friday, May 13, 2016 5:57 PM
>> To: solr-user
>> Subject: [EXTERNAL] Re: Does anybody crawl to a database and then index
>> from the database to Solr?
>>
>> Clayton:
>>
>> I think you've done a pretty thorough investigation, and you're
>> spot-on. The only thing I would add is that you _will_ reindex your entire
>> corpus.... multiple times. Count on it. Sometime, somewhere, somebody will
>> say "gee, wouldn't it be nice if we could <insert new use-case here>". And
>> to support it you'll have to change your Solr schema... which will almost
>> certainly require you to re-index.....
>>
>> The other thing people have done for deleting documents is to create
>> triggers in your DB that insert the deleted doc IDs into, say, a "deleted"
>> table along with a timestamp. Whenever necessary/desirable, run a cleanup
>> task that finds all the IDs recorded since the last run and removes the
>> docs that have been flagged since then. Obviously you also have to keep a
>> record around of the timestamp of the last successful run of this
>> program......
>>
>> Or, frankly, since it takes so little time to rebuild from scratch, people
>> have forgone any of that complexity and simply rebuild the entire index
>> periodically. You can use "collection aliasing" to do this in the
>> background and then switch searches atomically; it depends somewhat on how
>> long you can wait until you need to see (well, _not_
>> see) the deleted docs.
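>>
>> A sketch of that alias flip with SolrJ's Collections API (SolrCloud only;
>> the node URL, collection, and alias names below are just examples):
>>
>> import org.apache.solr.client.solrj.SolrClient;
>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>> import org.apache.solr.client.solrj.request.CollectionAdminRequest;
>>
>> public class SwitchAlias {
>>     public static void main(String[] args) throws Exception {
>>         try (SolrClient solr = new HttpSolrClient.Builder(
>>                 "http://localhost:8983/solr").build()) {
>>             // after the rebuild has finished indexing into "docs_20160515",
>>             // repoint the "search" alias; queries switch atomically
>>             CollectionAdminRequest.createAlias("search", "docs_20160515").process(solr);
>>         }
>>     }
>> }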
>>
>> But this is all refinements, I think you're going down the right path.
>>
>> And when you say "connector", are you talking DIH or an external (say
>> SolrJ) program?
>>
>> Best,
>> Erick
>>
>> On Fri, May 13, 2016 at 2:04 PM, John Bickerstaff <
>> j...@johnbickerstaff.com> wrote:
>> > I've been working on a less-complex thing along the same lines -
>> > taking all the data from our corporate database and pumping it into
>> > Kafka for long-term storage -- and the ability to "play back" all the
>> > Kafka messages any time we need to re-index.
>> >
>> > That simpler scenario has worked like a charm.  I don't need to
>> > massage the data much once it's at rest in Kafka, so that was a
>> > straightforward solution, although I could have gone with a DB and
>> > just stored the Solr documents with their IDs, one per row, in an RDBMS...
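>> >
>> > If it helps, the replay side is roughly the sketch below (broker, topic,
>> > group, and field names are placeholders, it assumes a recent
>> > kafka-clients, and it assumes the messages are keyed by doc id with the
>> > body as the value):
>> >
>> > import java.time.Duration;
>> > import java.util.Collections;
>> > import java.util.Properties;
>> > import org.apache.kafka.clients.consumer.ConsumerRecord;
>> > import org.apache.kafka.clients.consumer.ConsumerRecords;
>> > import org.apache.kafka.clients.consumer.KafkaConsumer;
>> > import org.apache.solr.client.solrj.SolrClient;
>> > import org.apache.solr.client.solrj.impl.HttpSolrClient;
>> > import org.apache.solr.common.SolrInputDocument;
>> >
>> > public class ReplayToSolr {
>> >     public static void main(String[] args) throws Exception {
>> >         Properties props = new Properties();
>> >         props.put("bootstrap.servers", "kafka:9092");
>> >         // a fresh group id plus "earliest" replays the topic from the start
>> >         props.put("group.id", "solr-reindex-" + System.currentTimeMillis());
>> >         props.put("auto.offset.reset", "earliest");
>> >         props.put("key.deserializer",
>> >             "org.apache.kafka.common.serialization.StringDeserializer");
>> >         props.put("value.deserializer",
>> >             "org.apache.kafka.common.serialization.StringDeserializer");
>> >
>> >         try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
>> >              SolrClient solr = new HttpSolrClient.Builder(
>> >                  "http://localhost:8983/solr/yourcollection").build()) {
>> >             consumer.subscribe(Collections.singletonList("crawled-docs"));
>> >             int emptyPolls = 0;
>> >             while (emptyPolls < 3) {   // crude "caught up" check for a one-shot replay
>> >                 ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
>> >                 if (records.isEmpty()) { emptyPolls++; continue; }
>> >                 emptyPolls = 0;
>> >                 for (ConsumerRecord<String, String> r : records) {
>> >                     SolrInputDocument doc = new SolrInputDocument();
>> >                     doc.addField("id", r.key());
>> >                     doc.addField("body", r.value());
>> >                     solr.add(doc);
>> >                 }
>> >             }
>> >             solr.commit();
>> >         }
>> >     }
>> > }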
>> >
>> > The rest sounds like good ideas for your situation, as Solr isn't the
>> > best candidate for the kind of manipulation of data you're proposing
>> > and a database excels at that.  It's more work, but you get a lot more
>> > flexibility, and you de-couple Solr from the data crawling, as you say.
>> >
>> > It all sounds pretty good to me, but I've only been on the list here a
>> > short time - so I'll leave it to others to add their comments.
>> >
>> > On Fri, May 13, 2016 at 2:46 PM, Pryor, Clayton J <cjpr...@sandia.gov>
>> > wrote:
>> >
>> >> Question:
>> >> Do any of you have your crawlers write to a database rather than
>> >> directly to Solr and then use a connector to index to Solr from the
>> >> database?  If so, have you encountered any issues with this approach?
>> >> If not, why not?
>> >>
>> >> I have searched forums and the Solr/Lucene email archives (including
>> >> browsing of http://www.apache.org/foundation/public-archives.html)
>> >> but have not found any discussions of this idea.  I am certain that I
>> >> am not the first person to think of it.  I suspect that I have just
>> >> not figured out the proper queries to find what I am looking for.
>> >> Please forgive me if this idea has been discussed before and I just
>> >> couldn't find the discussions.
>> >>
>> >> Background:
>> >> I am new to Solr and have been asked to make improvements to our Solr
>> >> configurations and crawlers.  I have read that the Solr index should
>> >> not be considered a source of record data.  It is in essence a highly
>> >> optimized index to be used for generating search results rather than
>> >> a retainer for record copies of data.  The better approach is to rely
>> >> on corporate data sources for record data and retain the ability to
>> >> completely blow away a Solr index and repopulate it as needed for
>> >> changing search requirements.
>> >> This made me think that perhaps it would be a good idea for us to
>> >> create a database of crawled data for our Solr index.  The idea is
>> >> that the crawlers would write their findings to a corporate supported
>> >> database of our own design for our own purposes and then we would
>> >> populate our Solr index from this database using a connector that
>> >> writes from the database to the Solr index.
>> >> The only disadvantage that I can think of for this approach is that
>> >> we will need to write a simple interface to the database that allows
>> >> our admin personnel to "Delete" a record from the Solr index.  Of
>> >> course, it won't be deleted from the database but simply flagged as not
>> to be indexed to Solr.
>> >> It will then send a delete command to Solr for any successfully
>> >> "deleted"
>> >> records from the database.  I suspect this admin interface will grow
>> >> over time but we really only need to be able to delete records from
>> >> the database for now.  All of the rest of our admin work is query
>> >> related which can still be done through the Solr Console.
>> >> I can think of the following advantages:
>> >>
>> >>   *   We have a corporate sponsored and backed up repository for our
>> >> crawled data which would buffer us from any inadvertent losses of our
>> >> Solr index.
>> >>   *   We would divorce the time it takes to crawl web pages from the
>> >> time
>> >> it takes to populate our Solr index with data from the crawlers.  I
>> >> have found that my Solr Connector takes minutes to populate the
>> >> entire Solr index from the current Solr prod to the new Solr
>> >> instances.  Compare that to hours and even days to actually crawl the
>> web pages.
>> >>   *   We use URLs for our unique IDs in our Solr index.  We can resolve
>> >> the problem of retaining the shortest URL when duplicate content is
>> >> detected in Solr simply by sorting the query used to populate Solr
>> >> from the database by id length descending - this will ensure the last
>> >> URL encountered for any duplicate is always the shortest.
>> >>   *   We can easily ensure that certain classes of crawled content are
>> >> always added last (or first if you prefer) whenever the data is
>> >> indexed to Solr - rather than having to rely on the timing of crawlers.
>> >>   *   We could quickly and easily rebuild our Solr index from scratch at
>> >> any time.  This would be very valuable when changes to our Solr
>> >> configurations require re-indexing our data.
>> >>   *   We can assign unique boost values to individual "documents" by
>> >> storing a boost value for each document in the database and applying
>> >> that boost at index time.
>> >>   *   We can continuously run a batch program against this database that
>> >> removes broken links, with no impact to Solr, and then refresh Solr on
>> >> a more frequent basis than we do now because the connector will take
>> >> minutes rather than hours/days to refresh the content.
>> >>   *   We can store additional information for the crawler to populate to
>> >> Solr when available - such as:
>> >>      *   actual document last updated dates
>> >>      *   boost value for that document in the database
>> >>   *   This database could be used for other purposes such as:
>> >>      *   Identifying a subset of representative data to use for
>> >> evaluation
>> >> of configuration changes.
>> >>      *   Easy access to "indexed" data for analysis work done by those
>> >> not
>> >> familiar with Solr.
>> >> Thanks in advance for your feedback.
>> >> Sincerely,
>> >> Clay Pryor
>> >> R&D SE Computer Science
>> >> 9537 - Knowledge Systems
>> >> Sandia National Laboratories
>> >>
>>
