Hi Toke,

Thank you for your response.

Here are some clarifications.

>
>> - The same terms will occurs several time for a given field (from 10
>>   to 100.000)
>
>Do you mean that any term is only present in a limited number (up to
>about 100K) of documents or do you mean that some documents has fields
>with content like  "foo bar foo foo zoo foo..."?
>
>If any term is only present in a maximum of 100K documents, or 100K/15B
>~= 0.0006% of the full document count, then you have a lot of unique
>terms.
>
>I ask because a low number of unique terms probably means that searches
>will result in a lot of hits, which can be heavy when we're talking
>billions. Or to ask more directly: How many hits do you expect a typical
>search match and how many will you return?
>

I mean that string fields will generally contain a single term (color, brand,
type), but depending on the field, the same value can occur in 10 to 100,000
documents.

So a filter query on one field can return from 10 to 100,000 documents
(of course, we will paginate).

The date scope of queries will often:
* be on the previous day
* or be on the last few days
* or repeat successively with the same filter query but a changing date
frame (last day, last week, the week before, ...)
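A sketch of how those rolling date frames could be expressed so that the fq
string is identical across requests, and therefore reusable by Solr's
filterCache; the field name `date` is an assumption, and the ranges use
Solr's day-rounded date math:

```python
def date_frame_fq(frame):
    """Return a Solr date-range filter for a named rolling frame.

    Rounding with NOW/DAY keeps the fq string constant for a whole day,
    so repeated queries with the same frame can hit the filter cache
    instead of re-evaluating the range against billions of documents.
    """
    frames = {
        "last_day": "[NOW/DAY-1DAY TO NOW/DAY]",
        "last_week": "[NOW/DAY-7DAYS TO NOW/DAY]",
        "week_before": "[NOW/DAY-14DAYS TO NOW/DAY-7DAYS]",
    }
    return "date:" + frames[frame]
```

Two requests issued on the same day with the same frame then send
byte-identical fq strings, which is what the filter cache keys on.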

Regards

Dominique


2015-02-18 10:35 GMT+01:00 Toke Eskildsen <t...@statsbiblioteket.dk>:

> On Wed, 2015-02-18 at 01:40 +0100, Dominique Bejean wrote:
>
> (I reordered the requirements)
>
> >  - Collection size : 15 billions document
> >  - Document size is nearly 300 bytes
> > - 1 billion documents indexed = 5Gb index size
> > - Collection update : 8 millions new documents / days + 8 millions
> >    deleted documents / days
> >  - Updates occur during the night without queries
> > - Document fields are mainly string including one date field
>
> That does not sound too scary. Back of the envelope: If you can handle
> 500 updates/second (which I would guess would be easy with such small
> documents), the update phase would be done in 4 hours.
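
[Interjecting a quick check of the arithmetic above; 500 updates/second is
the assumed rate from the paragraph, 8M documents/night is from the thread:]

```python
docs_per_night = 8_000_000   # new documents per night (from the thread)
updates_per_sec = 500        # assumed sustained indexing rate
hours = docs_per_night / updates_per_sec / 3600
# 16,000 seconds, i.e. roughly 4.4 hours, matching the 4-hour estimate
```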
>
> > - The same terms will occurs several time for a given field (from 10
> >   to 100.000)
>
> Do you mean that any term is only present in a limited number (up to
> about 100K) of documents or do you mean that some documents has fields
> with content like  "foo bar foo foo zoo foo..."?
>
> If any term is only present in a maximum of 100K documents, or 100K/15B
> ~= 0.0006% of the full document count, then you have a lot of unique
> terms.
>
> I ask because a low number of unique terms probably means that searches
> will result in a lot of hits, which can be heavy when we're talking
> billions. Or to ask more directly: How many hits do you expect a typical
> search match and how many will you return?
>
> >    - Queries occur during the day without updates
> >    - Query will use a date period and a filter query on one or more
> fields
>
> The initial call with date range filtering can be a bit heavy. Will you
> have a lot of requests for unique date ranges or will they typically be
> re-used (and thus nearly free)?
>
> >    - 10.000 queries / minutes
> >    - expected response time < 500ms
>
> As Erick says, do prototype. With 5GB/1b documents, you could run a
> fairly telling prototype off a desktop or a beefy desktop for that
> matter. Just remember to test with 2 or more shards as there is a
> performance penalty for the switch from single- to multi-shard.
>
>
> As a yard stick, our setup has 7 billion documents (22TB index) with a
> fair bit of freetext: https://sbdevel.wordpress.com/net-archive-search/
>
> At one point we tried hammering it with simple queries (personal
> security numbers) for simple result sets (no faceting) and got a
> throughput of about 50 documents/sec.
>
> >    - no ssd drives
>
> No need with such tiny (measured in bytes) indexes. Just go with the
> oft-given advice of ensuring enough RAM for full disk cache.
>
> > So, what is you advice about :
> >
> > # of shards : 15 billions documents -> 16 shards ?
>
> Standard advice is to make smaller shards for lower latency, but as you
> will likely be CPU bound with a small index (in bytes) and a high query
> rate, that probably won't help your throughput.
>
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>
