On Wed, 2015-02-18 at 01:40 +0100, Dominique Bejean wrote:

(I reordered the requirements)

> - Collection size : 15 billion documents
> - Document size is nearly 300 bytes
> - 1 billion documents indexed = 5GB index size
> - Collection update : 8 million new documents / day + 8 million
>   deleted documents / day
> - Updates occur during the night without queries
> - Document fields are mainly strings, including one date field

That does not sound too scary. Back of the envelope: if you can handle
500 updates/second (which I would guess is easy with such small
documents), the 8 million nightly adds would be done in about 4.5 hours.
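
A minimal sketch of that estimate, assuming the 500 updates/second guess
holds and counting the deletes separately (they are usually cheaper than
adds):

  # back-of-the-envelope update window, using the figures above
  new_docs = 8000000     # new documents per night
  deletes = 8000000      # deleted documents per night
  rate = 500.0           # assumed sustainable updates/second
  print(new_docs / rate / 3600)              # ~4.4 hours for the adds alone
  print((new_docs + deletes) / rate / 3600)  # ~8.9 hours if deletes cost the same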

> - The same term will occur several times for a given field (from 10
>   to 100,000)

Do you mean that any given term is only present in a limited number (up
to about 100K) of documents, or do you mean that some documents have
fields with content like "foo bar foo foo zoo foo..."?

If any term is only present in a maximum of 100K documents, or 100K out
of 15 billion ~= 0.0007% of the full document count, then you have a lot
of unique terms.

I ask because a low number of unique terms probably means that searches
will result in a lot of hits, which can be heavy when we're talking
billions. Or to ask more directly: how many hits do you expect a typical
search to match, and how many will you return?

> - Queries occur during the day without updates
> - Queries will use a date period and a filter query on one or more fields

The initial call with date range filtering can be a bit heavy. Will you
have a lot of requests for unique date ranges or will they typically be
re-used (and thus nearly free)?
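
To illustrate the re-use point with a sketch (host, collection and field
names are made up, not taken from your setup): sending the exact same fq
string on every request means Solr can answer it from the filterCache
after the first hit, while ranges that differ on every request cannot be
cached that way.

  # hypothetical example - URL, collection and field names are assumptions
  import requests

  params = {
      "q": "*:*",
      # identical fq strings are served from the filterCache after the first use
      "fq": "pub_date:[2015-01-01T00:00:00Z TO 2015-02-01T00:00:00Z]",
      "rows": 10,
      "wt": "json",
  }
  resp = requests.get("http://localhost:8983/solr/mycollection/select", params=params)
  print(resp.json()["response"]["numFound"])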

> - 10,000 queries / minute
> - expected response time < 500ms

As Erick says, do prototype. With 5GB per billion documents, you could
run a fairly telling prototype off a laptop or a beefy desktop for that
matter. Just remember to test with 2 or more shards, as there is a
performance penalty for the switch from single- to multi-shard. For
reference, 10,000 queries/minute is roughly 170 queries/second.


As a yardstick, our setup has 7 billion documents (22TB index) with a
fair bit of freetext: https://sbdevel.wordpress.com/net-archive-search/

At one point we tried hammering it with simple queries (personal
security numbers) for simple result sets (no faceting) and got a
throughput of about 50 queries/second.

> - no SSD drives

No need with such tiny (measured in bytes) indexes. Just go with the
oft-given advice of ensuring enough RAM to hold the full index in the
disk cache.
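
If your 5GB per billion documents ratio holds for the full collection,
the total index is quite small:

  # rough total index size from the figures above
  docs_total = 15 * 1000 * 1000 * 1000      # 15 billion documents
  gb_per_billion = 5.0                      # 1 billion documents ~= 5GB index
  print(docs_total / 1e9 * gb_per_billion)  # ~75 GB for the whole collection

Around 75GB in total, so having enough RAM to keep the whole index in the
disk cache is very realistic on commodity hardware.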

> So, what is your advice about:
> 
> # of shards : 15 billion documents -> 16 shards?

Standard advice is to make smaller shards for lower latency, but as you
will likely be CPU bound with a small index (in bytes) and a high query
rate, that probably won't help your throughput.
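
If you want to compare shard counts in the prototype, here is a sketch
using the Collections API (collection name, config set and replica count
below are placeholders, not recommendations):

  # hypothetical example - create a 16-shard test collection in SolrCloud
  import requests

  params = {
      "action": "CREATE",
      "name": "docs15b",                  # placeholder collection name
      "numShards": 16,
      "replicationFactor": 1,
      "collection.configName": "myconf",  # placeholder config set
      "maxShardsPerNode": 16,             # lets all shards live on one test box
  }
  requests.get("http://localhost:8983/solr/admin/collections", params=params)

Running the same queries against e.g. 1, 2 and 16 shards should show
whether the multi-shard overhead matters at your query rate.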


- Toke Eskildsen, State and University Library, Denmark

