Matheo Software Info <i...@matheo-software.com> wrote:
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc…) ?

Simple answer: Yes, assuming 30To means 30 terabytes.

> What is the best way to index this huge data ? several servers ?
> several shards ? other ?

As other participants have mentioned, it is hard to give numbers. What we can 
do is share experience.

We are doing webarchive indexing and I guess there would be quite an overlap 
with your content as we also use Tika. One difference is that the images in a 
webarchive are quite cheap to index, so you'll probably need (relatively) more 
hardware than we use. Very roughly we used 40 CPU-years to index 600 (700? I 
forget) TB of data in one of our runs. Scaling linearly to your 30TB, this 
suggests something like 2 CPU-years, or a couple of months on a 16-core machine.
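
The arithmetic behind that estimate can be written out as a small sketch. It assumes cost scales linearly with corpus size, which is a simplification, and it takes the 16-core machine as a given. The function and constant names are only illustrative.

```python
# Back-of-envelope scaling of the figures above; all inputs are rough.
REFERENCE_CPU_YEARS = 40   # CPU-years spent on the reference webarchive run
TARGET_TB = 30             # size of the corpus to be indexed
CORES = 16                 # assumed size of the indexing machine


def estimate(reference_tb):
    """Return (cpu_years, wall_clock_months), assuming linear scaling."""
    cpu_years = REFERENCE_CPU_YEARS * TARGET_TB / reference_tb
    months = cpu_years / CORES * 12
    return cpu_years, months


# The reference run was 600 or 700 TB, so try both.
for reference_tb in (600, 700):
    cpu_years, months = estimate(reference_tb)
    print(f"{reference_tb} TB reference: {cpu_years:.1f} CPU-years, "
          f"{months:.1f} months on {CORES} cores")
```

Both variants land between one and two months of wall-clock time, provided all 16 cores are kept busy for the whole run.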

This is just to get a ballpark: You will do yourself a huge favor by building a 
test setup and processing 1 TB or so of your data to get _your_ numbers, before 
you design your indexing setup. It is our experience that the analyzing part 
(Tika) takes much more power than the Solr indexing part: At our last run we 
had 30-40 CPU-cores doing Tika (and related analysis) feeding into a Solr 
running on a 4-core machine on spinning drives.
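
The shape of such a pipeline (many analysis workers, one modest indexer) can be sketched roughly as below. This is not our actual code: `extract` and `index_batch` are placeholders standing in for the Tika call and the Solr update request, the worker count and batch size are made-up values, and threads are used on the assumption that extraction runs in an external Tika process, so the Python side mostly waits.

```python
from concurrent.futures import ThreadPoolExecutor

ANALYSIS_WORKERS = 8   # hypothetical; most of the hardware goes to this stage
BATCH_SIZE = 100       # hypothetical number of documents per index request


def extract(path):
    """Placeholder for the expensive analysis step (Tika in our case)."""
    return {"id": path, "text": f"extracted text of {path}"}


def index_batch(batch, sink):
    """Placeholder for sending one batch of documents to Solr."""
    sink.extend(batch)


def run(paths, sink, workers=ANALYSIS_WORKERS):
    """Analyze paths in parallel and index the results in batches."""
    batch = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order while workers run concurrently.
        for doc in pool.map(extract, paths):
            batch.append(doc)
            if len(batch) >= BATCH_SIZE:
                index_batch(batch, sink)
                batch = []
    if batch:
        index_batch(batch, sink)  # flush the final partial batch
    return len(sink)


if __name__ == "__main__":
    indexed = []
    print(run([f"doc{i}.pdf" for i in range(250)], indexed))
```

The point of the split is that the two stages can be sized independently: add analysis workers until the indexer becomes the bottleneck, which in our experience takes a while.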


As for the Solr setup for search, you need to describe your requirements in 
detail before we can give you suggestions. Is the index updated all 
the time, in batches or one-off? How many concurrent users? Are the searches 
interactive or batch-jobs? What kind of aggregations do you need?

In our setup we build separate collections that are merged to single segments 
and never updated. Our use varies between very few interactive users and a lot 
of batch jobs. Scaling this specialized setup to your corpus size would require 
about 3TB of SSD, 64GB RAM and 4 CPU-cores, divided among 4 shards. You are 
likely to need quite a lot more than that, so this is just to say that at this 
scale the use of the index matters _a lot_.

- Toke Eskildsen
