This copying is a bit overstated here because of the way that small
segments are merged into larger segments.  Those larger segments are then
copied much less often than the smaller ones.
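
To put a rough number on it: with a geometric merge policy a document gets
rewritten roughly once per merge level, so the copy count grows with the log
of the index size rather than linearly.  Here is a tiny back-of-the-envelope
sketch (the merge factor, flush size, and index size are made-up numbers, not
anything read out of Lucene):

    // Hypothetical numbers; copies ~= log_mergeFactor(totalDocs / flushedDocs)
    public class CopyEstimate {
        public static void main(String[] args) {
            double mergeFactor = 10;              // assumed merge factor
            double flushedSegmentDocs = 10_000;   // assumed docs per flushed segment
            double totalDocs = 100_000_000;       // assumed final index size
            double copies =
                Math.log(totalDocs / flushedSegmentDocs) / Math.log(mergeFactor);
            System.out.printf("approx. copies per document: %.1f%n", copies);  // ~4
        }
    }

So even a 100M-document index built from 10K-document flushes rewrites a
typical document about four times, not fifty.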

While you can wind up with lots of copying in certain extreme cases, it is
quite rare.  In particular, if you have one of the following cases, you
won't see very many copies for any particular document:

- you don't delete documents one at a time (i.e. indexing only, without
updates or deletions)

or

- most documents that are going to be deleted are deleted as young documents

or

- the probability that any particular document will be deleted in a fixed
period of time decreases exponentially with the age of the documents

Any of these characteristics (and many others) will keep a document from
being copied very many times.  As a document ages, it keeps company with
similarly aged documents, which are accordingly unlikely to have enough of
their compatriots deleted for their segment to be left with only a small
number of live documents.  Put another way, the intervals between the merges
that a particular document undergoes grow longer and longer as it ages, so
the total number of copies it can undergo cannot grow very fast.
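
If it helps to see that concretely, here is a small simulation of a purely
logarithmic merge policy (a sketch with assumed parameters, not Lucene's
actual merge policy code, and it ignores deletions entirely).  It flushes
fixed-size segments, merges whenever mergeFactor segments pile up at a level,
and reports how often the very first flushed document gets rewritten and how
far apart those rewrites are:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a logarithmic merge policy with made-up parameters.
    public class MergeIntervalSketch {
        static final int MERGE_FACTOR = 10;      // assumed merge factor
        static final int FLUSH_DOCS   = 10_000;  // assumed docs per flushed segment
        static final long TOTAL_DOCS  = 100_000_000L;

        public static void main(String[] args) {
            // levels.get(i) = number of segments currently at level i
            List<Integer> levels = new ArrayList<>();
            int copies = 0;          // times the first flushed document has been rewritten
            long lastCopyAtDocs = 0;

            for (long docs = FLUSH_DOCS; docs <= TOTAL_DOCS; docs += FLUSH_DOCS) {
                addSegment(levels, 0);                       // flush one level-0 segment
                int level = 0;
                while (levels.get(level) == MERGE_FACTOR) {  // cascade merges upward
                    levels.set(level, 0);
                    addSegment(levels, level + 1);
                    // The first document's segment sits at level == copies, so it is
                    // only rewritten when a merge happens at that level.
                    if (level == copies) {
                        copies++;
                        System.out.printf("copy %d after %,d docs (interval %,d docs)%n",
                                copies, docs, docs - lastCopyAtDocs);
                        lastCopyAtDocs = docs;
                    }
                    level++;
                }
            }
        }

        static void addSegment(List<Integer> levels, int level) {
            while (levels.size() <= level) levels.add(0);
            levels.set(level, levels.get(level) + 1);
        }
    }

With those numbers the first document is copied only four times, after 100K,
1M, 10M, and 100M documents have been indexed, so each interval between
copies is ten times the previous one.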

On Wed, Dec 28, 2011 at 7:53 PM, Lance Norskog <goks...@gmail.com> wrote:

> ...
> One problem with indexing is that Solr continually copies data into
> "segments" (index parts) while you index. So, each 5MB PDF might get
> copied 50 times during a full index job. If you can strip the index
> down to what you really want to search on, terabytes become gigabytes.
> Solr seems to handle 100g-200g fine on modern hardware.
>
>
