Matheo Software Info <i...@matheo-software.com> wrote:
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc…) ?

Simple answer: Yes. Assuming 30To means 30 terabytes.

> What is the best way to index this huge data ? several servers ?
> several shards ? other ?

As other participants have mentioned, it is hard to give numbers. What we can do is share experience.

We are doing webarchive indexing and I guess there would be quite an overlap with your content, as we also use Tika. One difference is that the images in a webarchive are quite cheap to index, so you'll probably need (relatively) more hardware than we use.

Very roughly, we used 40 CPU-years to index 600 (700? I forget) TB of data in one of our runs. Scaling to your 30TB, this suggests something like 2 CPU-years, or a couple of months for a 16-core machine.

This is just to get a ballpark: You will do yourself a huge favor by building a test setup and processing 1 TB or so of your data to get _your_ numbers before you design your indexing setup. It is our experience that the analyzing part (Tika) takes much more power than the Solr indexing part: At our last run we had 30-40 CPU-cores doing Tika (and related analysis) feeding into a Solr running on a 4-core machine on spinning drives.

As for the Solr setup for search, you need to describe your requirements in detail before we can give you suggestions. Is the index updated all the time, in batches or one-off? How many concurrent users? Are the searches interactive or batch jobs? What kind of aggregations do you need?

In our setup we build separate collections that are merged to single segments and never updated. Our use varies between very few interactive users and a lot of batch jobs. Scaling this specialized setup to your corpus size would require about 3TB of SSD, 64MB RAM and 4 CPU-cores, divided among 4 shards. You are likely to need quite a lot more than that, so this is just to say that at this scale the use of the index matters _a lot_.

- Toke Eskildsen
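
For reference, the ballpark figures in the post (40 CPU-years for 600 to 700 TB, scaled to 30 TB on a 16-core machine) can be reproduced with a small sketch. The function name and parameters are hypothetical, and the linear scaling with corpus size and core count is an assumption taken from the rough estimate, not a measured property of Tika or Solr.

```python
def estimate_indexing(cpu_years_ref, corpus_tb_ref, corpus_tb_target, cores):
    """Linearly scale a reference indexing run to a new corpus size.

    Assumes cost grows in proportion to corpus size and that all cores
    are kept fully busy; both are simplifications for a ballpark only.
    """
    cpu_years = cpu_years_ref * corpus_tb_target / corpus_tb_ref
    wall_months = cpu_years / cores * 12
    return cpu_years, wall_months


# The reference run was 600 TB or possibly 700 TB, so try both.
for ref_tb in (600, 700):
    cpu_years, months = estimate_indexing(40, ref_tb, 30, 16)
    print(f"{ref_tb} TB reference: {cpu_years:.1f} CPU-years, "
          f"{months:.1f} months on 16 cores")
```

With these inputs the sketch gives roughly 1.7 to 2.0 CPU-years, or about 1.3 to 1.5 months of wall-clock time on 16 fully utilized cores, which fits the "couple of months" ballpark once real-world overhead is allowed for. Replacing the reference figures with numbers measured on a 1 TB test run of your own data would give a far more reliable estimate.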