On Wed, 2015-02-18 at 01:40 +0100, Dominique Bejean wrote:

(I reordered the requirements)
> - Collection size: 15 billion documents
> - Document size is nearly 300 bytes
> - 1 billion documents indexed = 5GB index size
> - Collection update: 8 million new documents / day + 8 million deleted documents / day
> - Updates occur during the night without queries
> - Document fields are mainly strings, including one date field

That does not sound too scary. Back of the envelope: if you can handle 500 updates/second (which I would guess would be easy with such small documents), the update phase would be done in about 4.5 hours (quick sketch in the PS below).

> - The same terms will occur several times for a given field (from 10 to 100,000)

Do you mean that any term is only present in a limited number (up to about 100K) of documents, or do you mean that some documents have fields with content like "foo bar foo foo zoo foo..."?

If any term is only present in a maximum of 100K documents, or 100K/15B ~= 0.0007% of the full document count, then you have a lot of unique terms. I ask because a low number of unique terms probably means that searches will result in a lot of hits, which can be heavy when we're talking billions.

Or to ask more directly: how many hits do you expect a typical search to match, and how many will you return?

> - Queries occur during the day without updates
> - Queries will use a date period and a filter query on one or more fields

The initial call with date range filtering can be a bit heavy. Will you have a lot of requests for unique date ranges, or will they typically be re-used (and thus nearly free)? See the second PS for the kind of re-used filter I mean.

> - 10,000 queries / minute
> - expected response time < 500ms

As Erick says, do prototype. With 5GB per 1 billion documents, you could run a fairly telling prototype off a desktop, or a beefy laptop for that matter. Just remember to test with 2 or more shards, as there is a performance penalty for the switch from single- to multi-shard (see the third PS).

As a yardstick, our setup has 7 billion documents (22TB index) with a fair bit of freetext: https://sbdevel.wordpress.com/net-archive-search/
At one point we tried hammering it with simple queries (personal security numbers) for simple result sets (no faceting) and got a throughput of about 50 queries/sec.

> - no SSD drives

No need with such tiny (measured in bytes) indexes. Just go with the oft-given advice of ensuring enough RAM for a full disk cache.

> So, what is your advice about:
>
> # of shards: 15 billion documents -> 16 shards?

Standard advice is to make smaller shards for lower latency, but as you will likely be CPU bound with a small index (in bytes) and a high query rate, that probably won't help your throughput.

- Toke Eskildsen, State and University Library, Denmark
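PS: A quick sketch (plain Python arithmetic) of the back-of-envelope numbers above. The 500 updates/second and the assumption that every query uses the full 500 ms are guesses to be replaced by prototype measurements.

    # Back-of-envelope figures; the rates are assumptions, not measurements.
    updates_per_day = 8_000_000        # new documents per night
    assumed_update_rate = 500          # updates/second (guess)
    update_window_hours = updates_per_day / assumed_update_rate / 3600
    print(f"update window: {update_window_hours:.1f} hours")  # ~4.4 hours

    queries_per_second = 10_000 / 60   # from 10,000 queries/minute
    target_latency_s = 0.5             # worst-case response time
    in_flight = queries_per_second * target_latency_s
    print(f"~{queries_per_second:.0f} queries/s, ~{in_flight:.0f} concurrent "
          f"if every query takes the full 500 ms")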
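PS 2: The kind of re-used date-range filter I mean, sketched with Python's requests library. The Solr URL and the field names (date_field, field_a) are placeholders for whatever your schema uses; the point is that an identical fq string is answered from Solr's filterCache after the first evaluation, which is why repeated ranges are nearly free while unique ranges each pay the full cost.

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"  # placeholder URL

    params = {
        "q": "field_a:foo",
        # Re-using the exact same fq string lets Solr serve the date restriction
        # from the filterCache instead of evaluating the range query again.
        "fq": "date_field:[2015-01-01T00:00:00Z TO 2015-02-01T00:00:00Z]",
        "wt": "json",
        "rows": 10,
    }
    hits = requests.get(SOLR_SELECT, params=params).json()["response"]["numFound"]
    print(f"{hits} hits")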
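PS 3: For the multi-shard part of the prototype, a small SolrCloud test collection with more than one shard can be created through the Collections API. Collection name, config name and counts below are placeholders.

    import requests

    # Create a 2-shard test collection so the single- vs. multi-shard
    # performance difference shows up in the prototype.
    requests.get("http://localhost:8983/solr/admin/collections", params={
        "action": "CREATE",
        "name": "prototype",                        # placeholder collection name
        "numShards": 2,
        "replicationFactor": 1,
        "collection.configName": "basic_configs",   # placeholder configset
    })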