Hi Toke,

Thank you for your response.
Here are some clarifications.

> > - The same terms will occur several times for a given field (from 10
> > to 100.000)
>
> Do you mean that any term is only present in a limited number (up to
> about 100K) of documents, or do you mean that some documents have
> fields with content like "foo bar foo foo zoo foo..."?
>
> If any term is only present in a maximum of 100K documents, or
> 100K/15 billion ~= 0.0007% of the full document count, then you have a
> lot of unique terms.
>
> I ask because a low number of unique terms probably means that searches
> will result in a lot of hits, which can be heavy when we're talking
> billions. Or to ask more directly: how many hits do you expect a
> typical search to match, and how many will you return?

I mean that string fields will generally contain a single term (color,
brand, type), but depending on the field, the same value can occur in
anywhere from 10 to 100,000 documents. So a filter query on one field can
return from 10 to 100,000 documents (of course, we will paginate; see the
second sketch below).

The date scope of our queries will often:

* be the previous day,
* be the last few days,
* or repeat the same filter query with a changing date frame (last day,
  last week, the week before, ...); see the first sketch below.
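To make the query shape concrete, here is a minimal SolrJ sketch of such a
request. The collection name and the field names (color, created_at) are
made up for illustration, and it assumes SolrJ 5.x. Rounding the date range
with NOW/DAY keeps the fq text identical for the whole day, so the filter
cache should be able to re-use it:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DailyFilterQuery {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint and collection name.
            HttpSolrClient solr =
                new HttpSolrClient("http://localhost:8983/solr/mycollection");

            SolrQuery q = new SolrQuery("*:*");
            // String-field filter, cached on its own in the filterCache.
            q.addFilterQuery("color:red");
            // "The previous day", expressed with date math. NOW/DAY only
            // changes once per day, so the cached filter stays re-usable
            // across requests for the rest of the day.
            q.addFilterQuery("created_at:[NOW/DAY-1DAY TO NOW/DAY]");
            q.setRows(100); // page size

            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
            solr.close();
        }
    }

Keeping the field filter and the date filter in separate fq clauses means
that changing only the date frame (last day, last week, the week before)
re-uses the cached set for color:red.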
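And since a single filter can match up to 100,000 documents, paging deep
into the result set with start/rows gets increasingly expensive,
especially with several shards. Here is a sketch of cursor-based deep
paging instead (available since Solr 4.7), assuming the uniqueKey field
is called id:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class DeepPager {
        // Walks the full result set of q page by page, using a cursor
        // instead of an increasing start offset.
        static void pageThrough(SolrClient solr, SolrQuery q) throws Exception {
            // A cursor requires a sort ending on the uniqueKey field.
            q.setSort(SolrQuery.SortClause.asc("id"));
            q.setRows(100); // page size
            String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse page = solr.query(q);
                // ... process page.getResults() here ...
                String next = page.getNextCursorMark();
                if (cursor.equals(next)) {
                    break; // cursor did not advance: all pages consumed
                }
                cursor = next;
            }
        }
    }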
Regards,
Dominique

2015-02-18 10:35 GMT+01:00 Toke Eskildsen <t...@statsbiblioteket.dk>:

> On Wed, 2015-02-18 at 01:40 +0100, Dominique Bejean wrote:
>
> (I reordered the requirements)
>
> > - Collection size: 15 billion documents
> > - Document size is nearly 300 bytes
> > - 1 billion documents indexed = 5GB index size
> > - Collection update: 8 million new documents / day + 8 million
> >   deleted documents / day
> > - Updates occur during the night, without queries
> > - Document fields are mainly strings, including one date field
>
> That does not sound too scary. Back of the envelope: if you can handle
> 500 updates/second (which I would guess would be easy with such small
> documents), the update phase would be done in 4 hours.
>
> > - The same terms will occur several times for a given field (from 10
> > to 100.000)
>
> Do you mean that any term is only present in a limited number (up to
> about 100K) of documents, or do you mean that some documents have
> fields with content like "foo bar foo foo zoo foo..."?
>
> If any term is only present in a maximum of 100K documents, or
> 100K/15 billion ~= 0.0007% of the full document count, then you have a
> lot of unique terms.
>
> I ask because a low number of unique terms probably means that searches
> will result in a lot of hits, which can be heavy when we're talking
> billions. Or to ask more directly: how many hits do you expect a
> typical search to match, and how many will you return?
>
> > - Queries occur during the day, without updates
> > - Queries will use a date period and a filter query on one or more
> >   fields
>
> The initial call with date range filtering can be a bit heavy. Will you
> have a lot of requests for unique date ranges, or will they typically
> be re-used (and thus nearly free)?
>
> > - 10.000 queries / minute
> > - expected response time < 500ms
>
> As Erick says, do prototype. With 5GB per 1 billion documents, you
> could run a fairly telling prototype off a desktop, or a beefy desktop
> for that matter. Just remember to test with 2 or more shards, as there
> is a performance penalty for the switch from single- to multi-shard.
>
> As a yardstick, our setup has 7 billion documents (22TB index) with a
> fair bit of free text: https://sbdevel.wordpress.com/net-archive-search/
>
> At one point we tried hammering it with simple queries (personal
> security numbers) for simple result sets (no faceting) and got a
> throughput of about 50 documents/sec.
>
> > - no SSD drives
>
> No need with such tiny (measured in bytes) indexes. Just go with the
> oft-given advice of ensuring enough RAM for a full disk cache.
>
> > So, what is your advice about:
> >
> > # of shards: 15 billion documents -> 16 shards?
>
> Standard advice is to make smaller shards for lower latency, but as you
> will likely be CPU bound with a small index (in bytes) and a high query
> rate, that probably won't help your throughput.
>
> - Toke Eskildsen, State and University Library, Denmark