Re: Problem of facet on 170M documents

Fudong-gmail Tue, 05 Nov 2013 09:39:49 -0800

One way to address this may be to add another field that groups the values 
into ranges, so there are fewer distinct facet values to count.
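
A rough sketch of what I mean, under one possible reading of "grouping": derive a coarser key from each url at index time (here, just the host name) and facet on that instead. The field name "url_host" and the sample documents are made up for illustration and are not from the original index.

```python
from urllib.parse import urlsplit


def url_group(url: str) -> str:
    """Return a coarser grouping key for a URL: its lowercased host name."""
    host = urlsplit(url).hostname  # None when the string has no network location
    return host if host else "unknown"


def add_group_field(doc: dict) -> dict:
    """Return a copy of doc with a hypothetical 'url_host' field added."""
    enriched = dict(doc)
    enriched["url_host"] = url_group(doc["url"])
    return enriched


# Made-up sample documents, shaped like the ones described in the thread.
docs = [
    {"source": "Video", "url": "http://example.com/watch?v=1"},
    {"source": "Video", "url": "http://example.com/watch?v=2"},
    {"source": "Video", "url": "http://Example.org/clip/9"},
]
enriched = [add_group_field(d) for d in docs]
```

Faceting on the grouped field would then count far fewer distinct values, at the cost of no longer ranking individual urls.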



On Nov 5, 2013, at 4:31 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> You're just going to have to accept it being slow. Think of it this way:
> you have 4M (say) buckets that have to be counted into. Then the top 500
> have to be collected to return. That's just going to take some time unless
> you have very beefy machines.
> 
> I'd _really_ back up and consider whether this is a good thing or whether
> this is one of those ideas that doesn't have much use to the user. If your
> results rarely if ever show counts for a URL more than, say, 5, is it
> really giving your users useful info?
> 
> Best,
> Erick
> 
> 
> On Mon, Nov 4, 2013 at 6:54 PM, Mingfeng Yang <mfy...@wisewindow.com> wrote:
> 
>> Erick,
>> 
>> It could have more than 4M distinct values.  The purpose of this facet is
>> to display the most frequent, say top 500, urls to users.
>> 
>> Sascha,
>> 
>> Thanks for the info. I will look into this thread thing.
>> 
>> Mingfeng
>> 
>> 
>> On Mon, Nov 4, 2013 at 4:47 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>>> How many unique URLs do you have in your 9M
>>> docs? If your 9M hits have 4M distinct URLs, then
>>> this is not very valuable to the user.
>>> 
>>> Sascha:
>>> Was that speedup on a single field or were you faceting over
>>> multiple fields? Because as I remember that code spins off
>>> threads on a per-field basis, and if I'm mis-remembering I need
>>> to look again!
>>> 
>>> Best,
>>> Erick
>>> 
>>> 
>>> On Sat, Nov 2, 2013 at 5:07 AM, Sascha SZOTT <sz...@gmx.de> wrote:
>>> 
>>>> Hi Ming,
>>>> 
>>>> which Solr version are you using? In case you use one of the latest
>>>> versions (4.5 or above) try the new parameter facet.threads with a
>>>> reasonable value (4 to 8 gave me a massive performance speedup when
>>>> working with large facets, i.e. nTerms >> 10^7).
>>>> 
>>>> -Sascha
>>>> 
>>>> 
>>>> Mingfeng Yang wrote:
>>>>> I have an index with 170M documents, and two of the fields for each
>>>>> doc are "source" and "url".  I want to know the top 500 most
>>>>> frequent urls from the Video source.
>>>>> 
>>>>> So I did a facet with
>>>>> "fq=source:Video&facet=true&facet.field=url&facet.limit=500", and
>>>>> the matching documents number about 9 million.
>>>>> 
>>>>> The solr cluster is hosted on two ec2 instances, each with 4 cpus and
>>>>> 32G memory. 16G is allocated for the java heap.  4 master shards are on
>>>>> one machine, and 4 replicas on another machine, connected together via
>>>>> zookeeper.
>>>>> 
>>>>> Whenever I run the query above, the response takes too long
>>>>> and the client times out. Sometimes an impatient end user
>>>>> waits a few seconds for the results, kills the connection,
>>>>> and then issues the same query again and
>>>>> again.  The server then has to deal with multiple such heavy
>>>>> queries simultaneously and gets so busy that we get a "no server
>>>>> hosting shard" error, probably due to lost communication between the
>>>>> solr node and zookeeper.
>>>>> 
>>>>> Is there any way to deal with such a problem?
>>>>> 
>>>>> Thanks, Ming
>>
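
For reference, the original facet request with the facet.threads parameter suggested in the thread might be assembled as in the sketch below. The host, port, and collection name are placeholders, and q=*:* and rows=0 are my own additions (match everything, return facet counts only); neither appears in the original query. facet.threads needs Solr 4.5 or later, as noted in the thread.

```python
from urllib.parse import urlencode, parse_qs


def build_facet_query(threads: int = 4, limit: int = 500) -> str:
    """Build the query string for the url facet from the thread."""
    params = [
        ("q", "*:*"),                     # assumption: match all documents
        ("fq", "source:Video"),           # filter from the original query
        ("rows", "0"),                    # assumption: facet counts only
        ("facet", "true"),
        ("facet.field", "url"),
        ("facet.limit", str(limit)),      # top 500 in the original query
        ("facet.threads", str(threads)),  # 4 to 8 was suggested in the thread
    ]
    return urlencode(params)


# Placeholder endpoint; substitute the real host and collection.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"
query_url = SOLR_SELECT + "?" + build_facet_query()
```

Whether the extra threads help with a single facet field is the open question raised in the thread, so it would be worth timing the request with and without the parameter.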
