One way to solve the issue may be to create another field that groups the values into ranges, so you have fewer distinct facet values to count.
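As a rough sketch of that idea, and only one possible reading of it: at index time you could derive a coarser grouping key from each URL and store it in an extra field. The field name `url_group` and the choice of the host as the grouping key are assumptions made for illustration; neither comes from this thread.

```python
from urllib.parse import urlsplit


def url_group(url: str) -> str:
    """Map a full URL to a coarser grouping key (here: its lowercased host)."""
    host = urlsplit(url).hostname  # None when the URL has no host part
    return host if host else "unknown"


def add_group_field(doc: dict) -> dict:
    """Return a copy of the document with the hypothetical `url_group` field set."""
    enriched = dict(doc)
    enriched["url_group"] = url_group(doc.get("url", ""))
    return enriched


# Example documents shaped like the ones described in the thread.
docs = [
    {"source": "Video", "url": "http://example.com/watch?v=1"},
    {"source": "Video", "url": "http://example.com/watch?v=2"},
    {"source": "Video", "url": "https://Videos.example.org/clip/9"},
]
enriched_docs = [add_group_field(d) for d in docs]
```

The facet would then be requested with facet.field=url_group instead of facet.field=url, so there are far fewer buckets to count. Whether grouping by host is useful depends on what the users actually need to see.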

Sent from my iPhone

On Nov 5, 2013, at 4:31 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> You're just going to have to accept it being slow. Think of it this way:
> you have
> 4M (say) buckets that have to be counted into. Then the top 500 have to be
> collected to return. That's just going to take some time unless you have
> very beefy machines.
>
> I'd _really_ back up and consider whether this is a good thing or whether
> this is one of those ideas that doesn't have much use to the user. If your
> results rarely if ever show counts for a URL more than, say, 5, is it
> really giving your users useful info?
>
> Best,
> Erick
>
>
> On Mon, Nov 4, 2013 at 6:54 PM, Mingfeng Yang <mfy...@wisewindow.com> wrote:
>
>> Erick,
>>
>> It could have more than 4M distinct values. The purpose of this facet is
>> to display the most frequent, say top 500, urls to users.
>>
>> Sascha,
>>
>> Thanks for the info. I will look into this thread thing.
>>
>> Mingfeng
>>
>>
>> On Mon, Nov 4, 2013 at 4:47 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> How many unique URLs do you have in your 9M
>>> docs? If your 9M hits have 4M distinct URLs, then
>>> this is not very valuable to the user.
>>>
>>> Sascha:
>>> Was that speedup on a single field or were you faceting over
>>> multiple fields? Because as I remember that code spins off
>>> threads on a per-field basis, and if I'm mis-remembering I need
>>> to look again!
>>>
>>> Best,
>>> Erick
>>>
>>>
>>> On Sat, Nov 2, 2013 at 5:07 AM, Sascha SZOTT <sz...@gmx.de> wrote:
>>>
>>>> Hi Ming,
>>>>
>>>> which Solr version are you using? In case you use one of the latest
>>>> versions (4.5 or above) try the new parameter facet.threads with a
>>>> reasonable value (4 to 8 gave me a massive performance speedup when
>>>> working with large facets, i.e. nTerms >> 10^7).
>>>>
>>>> -Sascha
>>>>
>>>>
>>>> Mingfeng Yang wrote:
>>>>> I have an index with 170M documents, and two of the fields for each
>>>>> doc is "source" and "url". And I want to know the top 500 most
>>>>> frequent urls from Video source.
>>>>>
>>>>> So I did a facet with
>>>>> "fq=source:Video&facet=true&facet.field=url&facet.limit=500", and
>>>>> the matching documents are about 9 millions.
>>>>>
>>>>> The solr cluster is hosted on two ec2 instances each with 4 cpu, and
>>>>> 32G memory. 16G is allocated for java heap. 4 master shards on one
>>>>> machine, and 4 replica on another machine. Connected together via
>>>>> zookeeper.
>>>>>
>>>>> Whenever I did the query above, the response is just taking too long
>>>>> and the client will get timed out. Sometimes, when the end user is
>>>>> impatient, so he/she may wait for a few second for the results, and
>>>>> then kill the connection, and then issue the same query again and
>>>>> again. Then the server will have to deal with multiple such heavy
>>>>> queries simultaneously and being so busy that we got "no server
>>>>> hosting shard" error, probably due to lost communication between solr
>>>>> node and zookeeper.
>>>>>
>>>>> Is there any way to deal with such problem?
>>>>>
>>>>> Thanks, Ming
>>