This copying is a bit overstated here because of the way that small segments are merged into larger segments. Those larger segments are then copied much less often than the smaller ones.
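To see why the copy count stays small, here is a toy model of a tiered merge policy (this is just an illustrative sketch with a hypothetical merge factor of 10, not Lucene's actual merge policy): each flush makes a tier-0 segment, and whenever 10 segments share a tier they merge into one segment of the next tier, copying every document they hold. The average number of copies per document comes out roughly log10(N).

```python
def total_copies(num_flushes, merge_factor=10):
    """Count document-copy operations under a toy tiered merge policy.

    Each flush adds one segment holding one document at tier 0.  When
    merge_factor segments pile up at a tier, they merge into a single
    segment one tier up, and every document in them is copied once.
    """
    tiers = {}   # tier -> number of segments currently at that tier
    copies = 0   # total document-copy operations so far
    for _ in range(num_flushes):
        t = 0
        tiers[t] = tiers.get(t, 0) + 1
        # Cascade: merging may fill the next tier, triggering another merge.
        while tiers.get(t, 0) == merge_factor:
            tiers[t] = 0
            copies += merge_factor ** (t + 1)  # docs in the merged segment
            t += 1
            tiers[t] = tiers.get(t, 0) + 1
    return copies

# 10,000 single-doc flushes -> 40,000 copies, i.e. 4 copies per document,
# which is log10(10_000): each document is copied once per tier promotion.
```

Doubling the index size adds only a constant number of extra copies per document, which is why the "copied 50 times" figure overstates the cost in practice.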
While you can wind up with lots of copying in certain extreme cases, that is quite rare. In particular, you won't see very many copies of any particular document if any of the following holds:

- you don't delete documents one at a time (i.e. indexing only, without updates or deletions), or
- most documents that are going to be deleted are deleted while still young, or
- the probability that any particular document is deleted in a fixed period of time decreases exponentially with the document's age.

Any of these characteristics (and many others) will keep a document from being copied very many times, because as a document ages it keeps company with similarly aged documents, which are accordingly unlikely to have enough compatriots deleted to leave their segment with only a small number of live documents. Put another way, the intervals between the merges a particular document undergoes grow longer and longer as it ages, so the total number of copies it can undergo cannot grow very fast.

On Wed, Dec 28, 2011 at 7:53 PM, Lance Norskog <goks...@gmail.com> wrote:
> ...
> One problem with indexing is that Solr continually copies data into
> "segments" (index parts) while you index. So, each 5MB PDF might get
> copied 50 times during a full index job. If you can strip the index
> down to what you really want to search on, terabytes become gigabytes.
> Solr seems to handle 100g-200g fine on modern hardware.