Hi,
We have different working hours, so sorry for the reply delay. Your assumed
numbers are right: about 25-30KB per doc, giving a total of 15GB per shard,
with two shards per server (plus two slaves that should normally do no work).
An average query has about 30 conditions (mixed OR/AND), most of them
textual and a small part on dateTime fields. These are simple queries only
(no facets, filters, etc.), since the set is taken from my enterprise's
actual query log against an old search engine.
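
To make that concrete, a typical request looks roughly like the sketch below
(the field names and values are invented for illustration, not our real
schema, and the trailing "..." stands for the remaining clauses):

  q=(title:foo OR body:"some phrase" OR body:(bar AND baz) OR ...)
    AND created:[2012-01-01T00:00:00Z TO 2013-04-10T00:00:00Z]
  &rows=10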

As we said, if the shards in collection1 and collection2 each have the same
number of docs (and the same RAM and CPU per shard), it is apparently not a
slow-I/O issue, right? So the fact that not all of my index is cached doesn't
seem to be the bottleneck. Moreover, I do store the fields, but my query set
requests only the IDs and rarely snippets, so I'd assume that giving the OS
plenty of extra RAM wouldn't make any difference, since those *.fdt files
don't need to get cached.
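
In other words, a typical request only asks for the unique key, roughly like
this (assuming the uniqueKey field is called "id", which may differ in the
real schema):

  /solr/collection1/select?q=...&fl=id&rows=10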

The conclusion I'm reaching is that the merging step is the problem, and the
only way to outsmart it is to distribute across far fewer shards, which
means going back to a few million docs per shard, where query time grows
roughly linearly with the number of docs per shard. The latter should
improve, though, if I give each server much more RAM.

I'll try tweaking my schema a bit and making better use of Solr's caches
(filter queries, for example), but something tells me the problem might be
elsewhere. My main clue is that merging seems like a simple CPU task, yet
tests show that even with a small number of responses it takes a long time
(and merging a handful of docs should clearly be very quick).
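
For example, the dateTime condition looks like a natural candidate to move
out of q and into an fq, so it can be served from the filterCache on repeat
queries; something like this (the field name is again just an assumption):

  q=(title:foo OR body:"some phrase" OR ...)
  &fq=created:[2012-01-01T00:00:00Z TO 2013-04-10T00:00:00Z]
  &fl=id&rows=10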


On Wed, Apr 10, 2013 at 2:50 AM, Shawn Heisey <s...@elyograg.org> wrote:

> On 4/9/2013 3:50 PM, Furkan KAMACI wrote:
>
>> Hi Shawn;
>>
>> You say that:
>>
>> *... your documents are about 50KB each.  That would translate to an index
>> that's at least 25GB*
>>
>> I know we cannot say an exact size, but what is the approximate ratio of
>> document size / index size in your experience?
>>
>
> If you store the fields, that is actual size plus a small amount of
> overhead.  Starting with Solr 4.1, stored fields are compressed.  I believe
> that it uses LZ4 compression.  Some people store all fields, some people
> store only a few or one - an ID field.  The size of stored fields does have
> an impact on how much OS disk cache you need, but not as much as the other
> parts of an index.
>
> It's been my experience that termvectors take up almost as much space as
> stored data for the same fields, and sometimes more.  Starting with Solr
> 4.2, termvectors are also compressed.
>
> Adding docValues (new in 4.2) to the schema will also make the index
> larger.  The requirements here are similar to stored fields.  I do not know
> whether this data gets compressed, but I don't think it does.
>
> As for the indexed data, this is where I am less clear about the storage
> ratios, but I think you can count on it needing almost as much space as the
> original data.  If the schema uses types or filters that produce a lot of
> information, the indexed data might be larger than the original input.
>  Examples of data explosions in a schema: trie fields with a non-zero
> precisionStep, the edgengram filter, the shingle filter.
>
> Thanks,
> Shawn
>
>
