Thanks for your inputs.

Regards,
Modassar
On Tue, Dec 30, 2014 at 8:41 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
> I actually did that once as a test years ago, as well as support for "paging" through the wildcard terms with a starting offset, and it worked great.
>
> One way to think of the feature is as the ability to "sample" the values of the wildcard. I mean, not all queries require absolute precision. Sometimes you just want to know whether "something" exists matching the pattern, or "generally" what the values look like.
>
> I think it would be worth a Jira.
>
> -- Jack Krupansky
>
> On Tue, Dec 30, 2014 at 6:16 AM, Modassar Ather <modather1...@gmail.com> wrote:
> >
> > Hi,
> >
> > In a query having lots of wildcards, can we put a limit on the number of term expansions done for a wildcard token, something like maxBooleanClauses?
> >
> > Thanks,
> > Modassar
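A side note on the limit idea: Lucene can already cap how many terms a single wildcard expands to through the rewrite method on MultiTermQuery, so a per-query limit of the kind discussed above does not need anything new at the index level. A rough sketch of the idea (the field name, pattern, and cap of 50 are made-up examples, not from this thread):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    public class CappedWildcardSketch {
        // Build a wildcard query whose rewrite keeps at most maxExpansions
        // matching terms instead of expanding to every term in the dictionary.
        static Query cappedWildcard(String field, String pattern, int maxExpansions) {
            WildcardQuery q = new WildcardQuery(new Term(field, pattern));
            q.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(maxExpansions));
            return q;
        }

        public static void main(String[] args) {
            System.out.println(cappedWildcard("title", "micro*", 50));
        }
    }

Unlike maxBooleanClauses (BooleanQuery.setMaxClauseCount), which makes the query fail with TooManyClauses once the limit is crossed, this rewrite simply stops at the requested number of terms, which is much closer to the "sampling" behaviour described above.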
> > On Mon, Dec 29, 2014 at 11:15 AM, Modassar Ather <modather1...@gmail.com> wrote:
> > >
> > > Thanks Jack for your suggestions.
> > >
> > > Regards,
> > > Modassar
> > >
> > > On Fri, Dec 26, 2014 at 6:04 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> > > >
> > > > Either you have too little RAM on each node or too much data on each node.
> > > >
> > > > You may need to shard the data much more heavily so that the total work on a single query is distributed in parallel to more nodes, each node having a much smaller amount of data to work on.
> > > >
> > > > First, always make sure that the entire Lucene index for each node fits entirely in the system memory available for file system caching. Otherwise the queries will be I/O bound. Check your current queries to see if that is the case - are the nodes compute bound or I/O bound? If I/O bound, add more system memory until the queries are no longer I/O bound. If compute bound, shard more heavily until the query latency becomes acceptable.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Fri, Dec 26, 2014 at 1:02 AM, Modassar Ather <modather1...@gmail.com> wrote:
> > > > >
> > > > > Thanks for your suggestions, Erick.
> > > > >
> > > > > "This may be one of those situations where you really have to push back at the users and understand why they insist on these kinds of queries. They must be very patient since it won't be very performant. That said, I've seen this pattern; there are certainly valid conditions under which response times can be many seconds if there are few users and they are doing very complex/expert-level things."
> > > > >
> > > > > We have tried educating the users but it did not work because they are used to the old way. They feel that wildcards give more control over the results and may not fully understand stemming.
> > > > >
> > > > > Regards,
> > > > > Modassar
> > > > >
> > > > > On Thu, Dec 25, 2014 at 3:17 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > > > > >
> > > > > > There's no magic bullet here that I know of. If your requirements are to support these huge, many-wildcard queries then you only have a few choices:
> > > > > >
> > > > > > 1> Redo the index. I was surprised at how little adding ngrams bloated the index as far as memory required is concerned. The key here is that there really aren't very many unique terms. If you use bigrams, then there are only maybe 36^2 distinct combinations (assuming English and including numbers).
> > > > > >
> > > > > > 2> Increase the number of shards, putting many fewer docs on each shard.
> > > > > >
> > > > > > 3> Give each shard a lot more memory. This isn't actually one of my preferred solutions since GC issues may raise their ugly heads here.
> > > > > >
> > > > > > 4> Insert creative solution here.
> > > > > >
> > > > > > This may be one of those situations where you really have to push back at the users and understand why they insist on these kinds of queries. They must be very patient since it won't be very performant. That said, I've seen this pattern; there are certainly valid conditions under which response times can be many seconds if there are few users and they are doing very complex/expert-level things.
> > > > > >
> > > > > > Now, all that said, wildcards are often examples of poor habits or habits learned in DB systems where the only hammer was %whatever%. I've seen situations where users didn't understand that Solr broke the input stream up into words. And stemmed. And WordDelimiterFilterFactory did all the magic for finding, say, D.C. and DC. So it's worth looking at the actual queries that are sent, perhaps talking to users and understanding what they _want_ out of the system, then perhaps educating them as to better ways to get what they want.
> > > > > >
> > > > > > Literally I've seen people insist on entering queries that wildcarded _everything_, both pre and post wildcards, because they didn't realize that Solr tokenizes...
> > > > > >
> > > > > > Once you hit an OOM, all bets are off as Shawn outlined.
> > > > > >
> > > > > > Best,
> > > > > > Erick
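To put a number on the 36^2 remark: if terms are lower-cased letters and digits, the complete bigram vocabulary is 36 * 36 = 1,296 entries no matter how many documents are indexed, which is why the term dictionary barely grows. A trivial way to check (plain Java, nothing Solr-specific):

    import java.util.HashSet;
    import java.util.Set;

    public class BigramVocabulary {
        public static void main(String[] args) {
            String alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
            Set<String> bigrams = new HashSet<>();
            for (char first : alphabet.toCharArray()) {
                for (char second : alphabet.toCharArray()) {
                    bigrams.add("" + first + second); // e.g. "so", "ol", "lr"
                }
            }
            System.out.println("distinct bigrams over [a-z0-9]: " + bigrams.size()); // 1296
        }
    }

The posting lists (and therefore disk usage) do still grow, which is the index-size concern raised further down the thread, but the set of unique terms that a wildcard has to be expanded against stays tiny.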
> > > > > > On Wed, Dec 24, 2014 at 1:57 AM, Modassar Ather <modather1...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks for your response.
> > > > > > >
> > > > > > > How many items in the collection?
> > > > > > > There are about 100 million documents.
> > > > > > >
> > > > > > > How are the caches configured in solrconfig.xml?
> > > > > > > Each cache has the size attribute set to 128.
> > > > > > >
> > > > > > > Can you provide a sample of the query? Does it fail immediately after SolrCloud startup or after several hours?
> > > > > > > It is a query with many terms (more than a thousand) and phrases, where the phrases have many wildcards in them. Once such a query is executed there are many ZooKeeper-related exceptions, and after a couple of such queries it goes OutOfMemory.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Modassar
> > > > > > >
> > > > > > > On Wed, Dec 24, 2014 at 1:37 PM, Dominique Bejean <dominique.bej...@eolya.fr> wrote:
> > > > > > > >
> > > > > > > > And you didn't say how much RAM is on each server?
> > > > > > > >
> > > > > > > > 2014-12-24 8:17 GMT+01:00 Dominique Bejean <dominique.bej...@eolya.fr>:
> > > > > > > > >
> > > > > > > > > Modassar,
> > > > > > > > >
> > > > > > > > > How many items in the collection? I mean how many documents per collection? 1 million, 10 million, ...?
> > > > > > > > >
> > > > > > > > > How are the caches configured in solrconfig.xml? What is the size attribute value for each cache?
> > > > > > > > >
> > > > > > > > > Can you provide a sample of the query? Does it fail immediately after SolrCloud startup or after several hours?
> > > > > > > > >
> > > > > > > > > Dominique
> > > > > > > > >
> > > > > > > > > 2014-12-24 6:20 GMT+01:00 Modassar Ather <modather1...@gmail.com>:
> > > > > > > > > >
> > > > > > > > > > Thanks for your suggestions.
> > > > > > > > > >
> > > > > > > > > > I will look into the link provided: http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
> > > > > > > > > >
> > > > > > > > > > "This is usually an anti-pattern. The very first thing I'd be doing is trying to not do this. See ngrams for infix queries, or shingles or ReverseWildcardFilterFactory or....."
> > > > > > > > > >
> > > > > > > > > > We cannot avoid multiple wildcards since that is our users' requirement. We try to discourage it but the users insist on firing such queries. Also, ngrams etc. can be tried, but our index is already huge and ngrams may add a lot to it. We are OK with such queries failing as long as other queries are not affected.
> > > > > > > > > >
> > > > > > > > > > Please find the details below.
> > > > > > > > > >
> > > > > > > > > > So, how many nodes in the cluster?
> > > > > > > > > > There are 4 nodes in total in the cluster.
> > > > > > > > > >
> > > > > > > > > > How many shards and replicas for the collection?
> > > > > > > > > > There are 4 shards and no replica for any of them.
> > > > > > > > > >
> > > > > > > > > > How many items in the collection?
> > > > > > > > > > What is the size of the index?
> > > > > > > > > > If I understand the question correctly, there are two collections on each node and their sizes on each node are approximately 190GB and 130GB.
> > > > > > > > > >
> > > > > > > > > > How is the collection updated (frequency, how many items per day, what is your hard commit strategy)?
> > > > > > > > > > It is an optimized, read-only index. There are no intermediate updates.
> > > > > > > > > >
> > > > > > > > > > How are the caches configured in solrconfig.xml?
> > > > > > > > > > Filter cache, query result cache and document cache are enabled. Auto-warming is also done.
> > > > > > > > > >
> > > > > > > > > > Can you provide all other JVM parameters?
> > > > > > > > > > -Xms20g -Xmx24g -XX:+UseConcMarkSweepGC
> > > > > > > > > >
> > > > > > > > > > Thanks again,
> > > > > > > > > > Modassar
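Back-of-the-envelope, using the numbers above: roughly 190 GB + 130 GB, so about 320 GB of index per node, against a 24 GB heap plus whatever RAM is left over for the OS page cache. Unless each server has several hundred GB of RAM, only a small fraction of the index can be cached, so by Jack's guideline above these nodes would be heavily I/O bound even before the heap becomes the problem.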
> > > > > > > > > > On Wed, Dec 24, 2014 at 2:29 AM, Dominique Bejean <dominique.bej...@eolya.fr> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I agree with Erick, it could be a good thing to have more details about your configuration and collection.
> > > > > > > > > > >
> > > > > > > > > > > Your heap size is 32Gb. How much RAM is on each server?
> > > > > > > > > > >
> > > > > > > > > > > By « 4 shard Solr cluster », do you mean 4 Solr server nodes or a collection with 4 shards?
> > > > > > > > > > >
> > > > > > > > > > > So, how many nodes in the cluster?
> > > > > > > > > > > How many shards and replicas for the collection?
> > > > > > > > > > > How many items in the collection?
> > > > > > > > > > > What is the size of the index?
> > > > > > > > > > > How is the collection updated (frequency, how many items per day, what is your hard commit strategy)?
> > > > > > > > > > > How are the caches configured in solrconfig.xml?
> > > > > > > > > > > Can you provide all other JVM parameters?
> > > > > > > > > > >
> > > > > > > > > > > Regards
> > > > > > > > > > >
> > > > > > > > > > > Dominique
> > > > > > > > > > >
> > > > > > > > > > > 2014-12-23 17:50 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
> > > > > > > > > > > >
> > > > > > > > > > > > Second most important part of your message: "When executing a huge query with many wildcards inside it the server"
> > > > > > > > > > > >
> > > > > > > > > > > > This is usually an anti-pattern. The very first thing I'd be doing is trying to not do this. See ngrams for infix queries, or shingles or ReverseWildcardFilterFactory or.....
> > > > > > > > > > > >
> > > > > > > > > > > > And if your corpus is very large with many unique terms it's even worse, but you haven't really told us about that yet.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Erick
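The reversed-wildcard trick mentioned here is easy to picture: index each token a second time reversed, and a leading wildcard then becomes an ordinary prefix lookup on the reversed field, which Lucene handles cheaply. A toy sketch of the rewrite step only (field wiring is omitted; the names and patterns are made up):

    public class ReversedWildcardIdea {
        // "hardware" would also be indexed as "erawdrah" in a parallel field.
        static String reverse(String s) {
            return new StringBuilder(s).reverse().toString();
        }

        // Rewrite a leading-wildcard pattern such as "*ware" into the prefix
        // pattern "eraw*" that can be run against the reversed field.
        static String rewriteLeadingWildcard(String pattern) {
            if (pattern.startsWith("*") && !pattern.endsWith("*")) {
                return reverse(pattern.substring(1)) + "*";
            }
            return pattern; // this sketch leaves other patterns untouched
        }

        public static void main(String[] args) {
            System.out.println(rewriteLeadingWildcard("*ware")); // prints eraw*
        }
    }

In Solr the reversed-wildcard filter does this for you at index and query time; by itself it does not help the many-wildcards-per-phrase case, but it takes leading wildcards off the table.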
> > > > > > > > > > > > On Tue, Dec 23, 2014 at 8:30 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 12/23/2014 4:34 AM, Modassar Ather wrote:
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have a setup of a 4 shard Solr cluster with embedded ZooKeeper on one of them. The zkClient timeout is set to 30 seconds, -Xms is 20g and -Xmx is 24g. When executing a huge query with many wildcards inside it, the server crashes and becomes non-responsive. Even the dashboard does not respond and shows a connection lost error. This requires me to restart the servers.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Here's the important part of your message:
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Caused by: java.lang.OutOfMemoryError: Java heap space*
> > > > > > > > > > > > >
> > > > > > > > > > > > > Your heap is not big enough for what Solr has been asked to do. You need to either increase your heap size or change your configuration so that it uses less memory.
> > > > > > > > > > > > >
> > > > > > > > > > > > > http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
> > > > > > > > > > > > >
> > > > > > > > > > > > > Most programs have pretty much undefined behavior when an OOME occurs. Lucene's IndexWriter has been hardened so that it tries extremely hard to avoid index corruption when OOME strikes, and I believe that works well enough that we can call it nearly bulletproof ... but the rest of Lucene and Solr will make no guarantees.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's very difficult to have definable program behavior when an OOME happens, because you simply cannot know the precise point during program execution where it will happen, or what isn't going to work because Java did not have memory space to create an object. Going unresponsive is not surprising.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If you can solve your heap problem, note that you may run into other performance issues discussed on the wiki page that I linked.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Shawn
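One practical follow-on from the point above: a JVM that has already thrown OutOfMemoryError cannot be trusted to keep serving requests, so it is usually better to let it die fast and be restarted than to hope it recovers. Standard HotSpot options such as -XX:+HeapDumpOnOutOfMemoryError (to capture what filled the heap) and -XX:OnOutOfMemoryError=<kill-and-restart script> exist for exactly this purpose.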