Re: faceting over ngrams

Dmitry Kan Wed, 16 Mar 2011 11:26:27 -0700

Hi Jonathan,

Thanks for sharing useful bits. Each shard has 16G of heap. Unless I'm doing
something fundamentally wrong in the Solr configuration, I have to admit
that counting ngrams up to trigrams across a shard's whole document set is a
pretty intensive task, as each ngram can occur anywhere in the index and
Solr most probably doesn't precompute its cumulative count. I'll try
querying with facet.method=fc, thanks for that.
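
For reference, a minimal sketch of how such a facet request could be built
against a single shard. The host, the field name "trigrams", rows=0 and the
facet.limit value are assumptions for illustration, not taken from the actual
setup; only facet.method=fc comes from the discussion above.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, field, limit=100):
    """Build a Solr select URL that facets on `field` with facet.method=fc."""
    params = [
        ("q", "*:*"),             # match all documents
        ("rows", 0),              # assumption: only facet counts are wanted
        ("facet", "true"),
        ("facet.field", field),   # hypothetical field name
        ("facet.method", "fc"),   # field-cache based faceting
        ("facet.limit", limit),   # assumption: cap on returned facet entries
    ]
    return base_url + "/select?" + urlencode(params)


# Hypothetical shard address; query one shard directly to time it in isolation.
url = facet_query_url("http://localhost:8983/solr", "trigrams")
print(url)
```

Sending the same request to one shard and then to the distributed endpoint
would show how much of the time goes into merging.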


By the way, the trigrams are defined like this:

<fieldType name="shingle_text_trigram" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>
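
To give a feel for how many terms this analyzer chain produces, here is a
rough Python sketch of its effect. It is only an approximation, not Solr's
implementation: the tokenizer is imitated with an ASCII-only regex, and the
function name "shingles" is made up for this example.

```python
import re


def shingles(text, max_size=3):
    """Approximate the analyzer: lowercase letter tokens, then 1..max_size-grams."""
    # Rough stand-in for LowerCaseTokenizer (ASCII letters only, an assumption).
    tokens = re.findall(r"[a-z]+", text.lower())
    out = []
    for start in range(len(tokens)):
        # Emit the unigram, bigram and trigram starting at this position.
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


terms = shingles("Faceting over ngrams")
print(terms)
```

With three input tokens this yields six terms; in general n tokens (n >= 2)
give 3n - 3 terms, which is why the number of distinct facet values grows so
quickly on this field.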

As for the sharding: I decided to go with it when the index size approached
half a terabyte and the doc count went over 100M, as I thought it would help
us scale better. I also maintain a good level of caching, and so far faceting
over normal string fields (no ngrams) has performed really well (around 1 sec).


On Wed, Mar 16, 2011 at 6:23 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:

> Ah, wait, you're doing sharding?  Yeah, I am NOT doing sharding, so that
> could explain our different experiences.  It seems like sharding definitely
> has trade-offs, makes some things faster and other things slower. So far
> I've managed to avoid it, in the interest of keeping things simpler and
> easier to understand (for me, the developer/Solr manager), thinking that
> sharding is also a somewhat less mature feature.
>
> With only 1M documents.... are you sure you need sharding at all?  You
> could still use replication to "scale out" for volume, sharding seems more
> about scaling for number of documents (or total bytes) in your index.  1M
> documents is not very large, for Solr, in general.
>
> Jonathan
>
>
> On 3/16/2011 11:51 AM, Toke Eskildsen wrote:
>
>> On Wed, 2011-03-16 at 13:05 +0100, Dmitry Kan wrote:
>>
>>> Hello guys. We are using shard'ed solr 1.4 for heavy faceted search over
>>> the trigrams field with about 1 million of entries in the result set and
>>> more than 100 million of entries to facet on in the index. Currently the
>>> faceted search is very slow, taking about 5 minutes per query.
>>>
>> I tried creating an index with 1M documents, each with 100 unique terms
>> in a field. A search for "*:*" with a facet request for the first 1M
>> entries in the field took about 20 seconds for the first call and about
>> 1-1½ second for each subsequent call. This was with Solr trunk. The
>> complexity of my setup is no doubt a lot simpler and lighter than yours,
>> but 5 minutes sounds excessive.
>>
>> My guess is that your performance problem is due to the merging process.
>> Could you try measuring the performance of a direct request to a single
>> shard? If that is satisfactory, going to the cloud would not solve your
>> problem. If you really need 1M entries in your result set, you would be
>> better off investigating whether your index can be in a single instance.
>>
>>


-- 
Regards,

Dmitry Kan
