Search with very large boolean filter
Hi,

I am using Solr 4.7.0 to search text with an id filter, like this:

    id:(100 OR 2 OR 5 OR 81 OR 10 ...)

The number of IDs in the boolean filter is usually less than 100, but can sometimes be very large (around 30k IDs).

We currently set maxBooleanClauses to 1024, partition the IDs into groups of 1000, and batch the Solr queries. This works, but becomes slow when the total number of IDs is larger than 10k.

I am wondering what the best strategy would be to handle this kind of problem. Can we increase maxBooleanClauses to reduce the number of batches? If possible, we would prefer not to create additional large indexes.

Thanks!
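For illustration, the partition-and-batch approach described here might look roughly like this in SolrJ (a hypothetical sketch; the client setup and the result-merging step are elided and assumed):

    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class BatchedIdFilter {
        // Stays under the configured maxBooleanClauses=1024.
        private static final int BATCH_SIZE = 1000;

        // Hypothetical sketch: one query per batch of ids; the caller
        // merges the per-batch results into a combined result set.
        public static void queryInBatches(SolrClient client, List<String> ids) throws Exception {
            for (int i = 0; i < ids.size(); i += BATCH_SIZE) {
                List<String> batch = ids.subList(i, Math.min(i + BATCH_SIZE, ids.size()));
                // Builds the boolean filter described above: id:(100 OR 2 OR 5 ...)
                SolrQuery q = new SolrQuery("*:*");
                q.addFilterQuery("id:(" + String.join(" OR ", batch) + ")");
                QueryResponse rsp = client.query(q);
                // ... merge rsp.getResults() into the combined result set ...
            }
        }
    }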
Re: Search with very large boolean filter
Thanks for the quick replies, Alex and Jack!

> definitely can improve on the ORing the ids with

Going to try that! But I guess it would still hit the maxBooleanClauses=1024 threshold.

> 1. Are you trying to retrieve a large number of documents, or simply perform queries against a subset of the index?

We would like to perform queries against a subset of the index.

> 2. How many unique queries are you expecting to perform against each specific filter set of IDs?

There are usually only a few (around 10) unique queries for the same set of IDs within a short period of time (around 1 min).

> 3. How often does the set of IDs change?

The IDs are different for almost every query. By the way, 99% of the time the total number of IDs is less than 1k; in rare cases (about 1%) it can exceed 10k.

> 4. Is there more than one filter set of IDs in use during a particular interval of time?

No. The ID set will be the only filter applied to "id".

Thanks!

-- jichi
Re: Search with very large boolean filter
Hi Shawn,

We have already switched the request method to POST. I am going to try the terms query parser soon, and will post the performance difference between it and the boolean OR syntax here later.

Thanks!

2015-11-20 15:23 GMT-08:00 Shawn Heisey:
> On 11/20/2015 4:09 PM, jichi wrote:
> > Going to try that! But I guess it would still hit the maxBooleanClauses=1024 threshold.
>
> The terms query parser does not have a limit like boolean queries do. This query parser was added in version 4.10, so be aware of that. Querying for a large number of terms with the terms query parser will scale a lot better than a boolean query -- better performance.
>
> The number of terms you query will affect the size of the query text. The query size is constrained by either the max HTTP header size if the request is a GET, or the max form size if it's a POST. The max HTTP header size is configurable in the servlet container (Jetty, Tomcat, etc.) and I would not recommend going over about 32K with it. The max form size is configurable in solrconfig.xml with the formdataUploadLimitInKB attribute on the requestParsers element. That attribute defaults to 2048, which yields a default size of 2MB. Switching your queries to POST requests is advisable.
>
> Thanks,
> Shawn
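A minimal sketch of what the terms-query-parser-over-POST combination might look like from SolrJ (assuming Solr/SolrJ 4.10+; the field name "id" comes from the thread, everything else is a placeholder):

    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TermsFilterExample {
        // Sketch: a single request for all ids, sent as POST so the query
        // size is bounded by the form-data limit rather than the HTTP header size.
        public static QueryResponse queryByIds(SolrClient client, List<String> ids) throws Exception {
            SolrQuery q = new SolrQuery("*:*");
            // Expands to fq={!terms f=id}100,2,5,81,10 -- no maxBooleanClauses limit applies.
            q.addFilterQuery("{!terms f=id}" + String.join(",", ids));
            return client.query(q, SolrRequest.METHOD.POST);
        }
    }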
How to speed up field collapsing on large number of groups
Hi everyone,

I am using Solr 4.10 to index 20 million documents without sharding. Each document has a groupId field, and there are about 2 million groups. I found that search with collapsing on groupId is significantly slower compared to search without collapsing, especially when combined with facet queries.

I am wondering what the general approach would be to speed up field collapsing by 2-4x. Would sharding the index help? Is it possible to optimize collapsing without sharding?

The filter parameter for collapsing looks like this:

    q=*:*&fq={!collapse field=groupId max=sum(...a long formula...)}

I also put this fq into the warmup queries XML to warm the caches. But still, when q changes and more fq are added, the collapsing search takes about 3-5 seconds; without collapsing, the search finishes within 2 seconds.

I am thinking of manually optimizing CollapsingQParserPlugin through parallelization or extra caching. For example, is it possible to parallelize the collapsing collector across different Lucene index segments?

Thanks!

-- jichi
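For reference, a warmup query like the one described is registered in solrconfig.xml with a newSearcher listener; a minimal sketch, with the collapse formula elided as in the original:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <!-- same collapse fq as the live queries, so its caches are warmed -->
          <str name="fq">{!collapse field=groupId max=sum(...)}</str>
        </lst>
      </arr>
    </listener>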
Re: How to speed up field collapsing on large number of groups
Thanks for the quick response, Joel! I am hoping to delay sharding if possible, since it would involve more things to consider :)

1) What is the size of the result set before the collapse?

When searching with q=*:*, for example, numFound before collapse is around 5 million, and after collapse it is 2 million. I only return about the top 30 documents in the result.

2) Have you tested without the long formula, just using a field for the min/max?

Performance seems to be affected by the number of fields appearing in the max formula. For example, that expensive 5-million-document query takes 4.4 sec. With either {!collapse field=productGroupId} or {!collapse field=productGroupId max=only_one_field}, the query time drops to around 2.4 sec. If I remove the collapse fq entirely, the query takes only 1.3 sec.

3) How much memory do you have on the server and for the heap?

I am setting Xmx to 24G. The total index size on disk is 50G. In solrconfig.xml, I use solr.FastLRUCache for filterCache with size 2048, solr.LRUCache for documentCache with size 32768, and solr.LRUCache for queryResultCache with size 4096 (config sketch below). I am using the default fieldValueCache.

I found that CollapsingQParserPlugin explicitly uses Lucene's field cache. Would increasing the field cache help? I am not sure how to do that in Solr.

On Jun 28 2016, at 4:48 pm, Joel Bernstein <joels...@gmail.com> wrote:
> Sharding will help, but you'll need to co-locate documents by group ID. A few questions / suggestions:
>
> 1) What is the size of the result set before the collapse?
>
> 2) Have you tested without the long formula, just using a field for the min/max? It would be good to understand the impact of the formula on performance.
>
> 3) How much memory do you have on the server and for the heap? Memory use rises with the cardinality of the collapse field, so you'll want to be sure there is enough memory to comfortably perform the collapse.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
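The cache settings quoted above correspond to solrconfig.xml entries along these lines (a sketch using the stated classes and sizes; the initialSize and autowarmCount values are assumptions):

    <filterCache class="solr.FastLRUCache" size="2048" initialSize="2048" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="32768" initialSize="32768" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="4096" initialSize="4096" autowarmCount="0"/>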
Re: How to speed up field collapsing on large number of groups
Hi everyone,

Is it possible to optimize collapsing on a large index through parallelization, without sharding? Or can we conclude that sharding is currently the only way to get a multiplicative speedup for slow collapsing queries?

I tried manually parallelizing CollapsingQParserPlugin across Lucene segments. In particular, I added a thread pool to IndexSearcher and then parallelized CollapsingQParserPlugin.CollapsingFieldValueCollector, which I rewrote to use the LeafCollector introduced in Lucene 5.

But I was surprised to find that parallelization made the overall performance worse. Without parallelization, the first couple of Lucene segments took the majority of the collapsing time, and the rest took almost zero time. With parallelization, collapsing on every segment took some time, and the overall time became about 20% longer.

Thanks!

-- jichi
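For context, the thread-pool-on-IndexSearcher wiring described above corresponds to Lucene's stock executor-based IndexSearcher constructor; a minimal sketch (the pool size is an assumption):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;

    public class ParallelSegmentSearch {
        public static IndexSearcher open(Directory dir) throws Exception {
            DirectoryReader reader = DirectoryReader.open(dir);
            // Lucene runs segment slices on the pool and merges per-slice
            // results afterwards; skewed segment sizes can limit the gain.
            ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is an assumption
            return new IndexSearcher(reader, pool);
        }
    }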
Restarting SolrCloud that is taking realtime updates
Hi,

I am seeking the best practice for restarting a sharded SolrCloud that is taking search traffic as well as realtime updates, without downtime. When I deploy new customized Solr plugins, for example, it requires restarting the whole SolrCloud cluster.

I am testing Solr 6.2.1 with 4 shards, and I find that while SolrCloud is taking updates, when I restart any Solr node (whether it is a leader, the overseer, or an ordinary replica), the restarted node resyncs its whole data from its leader, i.e., it redownloads the whole index and then drops its old data.

The only way I have found to avoid this full resync is to temporarily disable updates, e.g., by invoking disableReplication on the leader node before restarting.

Additionally, I have not found a way to temporarily pause Solr replication to a single replica. Before sharding, we could use disablePoll to disable replication on a slave. After sharding, disabling replication on the leader is the only way I have found, and it pauses not just replication to the one node I want to restart but replication to all nodes in that shard.

The procedure becomes even more complex if I want to restart a leader node: I first have to manually trigger a leader failover through rebalancing, then disable replication on the new leader, then restart the old leader, and finally re-enable replication on the new leader.

As you can see, restarting SolrCloud node by node this way takes many steps. Is this really the best procedure for restarting a whole SolrCloud cluster that is taking realtime updates?

Thanks!
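For reference, the replication-handler commands mentioned above are plain HTTP calls against each core; a sketch with placeholder hosts and core names:

    # Disable replication on the leader before restarting a follower:
    curl "http://leader-host:8983/solr/core1/replication?command=disablereplication"

    # Re-enable it afterwards:
    curl "http://leader-host:8983/solr/core1/replication?command=enablereplication"

    # Pre-SolrCloud master/slave setup: stop a single slave from polling:
    curl "http://slave-host:8983/solr/core1/replication?command=disablepoll"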
Re: Restarting SolrCloud that is taking realtime updates
Thanks so much for the very quick and detailed explanation, Erick!

According to the following page, it seems numRecordsToKeep cannot be set too high, since the records to sync must fit in a single POST. So your approaches 1> or 3> seem to be the most practical when the number of updated documents is high.

https://support.lucidworks.com/hc/en-us/articles/203842143-Recovery-times-while-restarting-a-SolrCloud-node

Thanks again, and happy Thanksgiving!

On Nov 25 2016, at 2:33 pm, Erick Erickson wrote:
> First, get out of thinking about the replication API, things like DISABLEPOLL and the like, when in SolrCloud mode. The "old style" replication is used under the control of the syncing strategy. Unless you've configured master/slave sections of your solrconfig.xml files and somehow dealt with the leader changing (who should be polled?), I'm pretty sure this is a total red herring.
>
> As for the rest, that's just the way it works. In SolrCloud, the raw documents are forwarded from the leader to the followers. Outside of a node going into recovery, replication isn't used at all.
>
> However, when a node goes into recovery (which by definition it will when the core is reloaded or the Solr instance is restarted), the replica checks with the leader to see if it's "too far" out of date. The default "too far" is 100 docs, although this can be changed by setting the update log's numRecordsToKeep to a higher number in solrconfig.xml. If the replica is too far out of date, a full index replication is done, which is what you're observing.
>
> If the number of updates the leader has received is < 100 (or numRecordsToKeep), the leader sends the raw documents to the follower from its update log and there is no "old style" replication at all.
>
> So, the net-net here is that your choices are limited:
>
> 1> stop indexing while doing the restart.
>
> 2> bump numRecordsToKeep to some larger number that you expect not to be exceeded for the time it takes to restart each node.
>
> 3> live with the full index replication in this situation.
>
> I'll add parenthetically that having to redeploy plugins and the like _should_ be a relatively rare operation, and it seems (at least from the outside) to be a perfectly reasonable thing to do in a maintenance window when index updates are disabled.
>
> You can also consider using collection aliasing to switch back and forth between two collections, so you can manipulate the current cold one and, when you're satisfied, switch the alias.
>
> Best, Erick
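The numRecordsToKeep setting Erick describes lives in the updateLog section of solrconfig.xml; a minimal sketch (the value 10000 is only an example):

    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <!-- default is 100; raise it so a restarting replica can peer-sync
           from the leader's update log instead of doing a full index copy -->
      <int name="numRecordsToKeep">10000</int>
    </updateLog>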
Best practices to debug Solr search in production without all fields stored
Hi everyone,

I find it convenient to debug Solr search results when all fields are marked "stored=true" in the schema: given a document, I can check why it is not returned by a query with debug=true. But in production, most of the fields have "stored=false" for performance reasons.

I am wondering what the recommended approaches would be for reasoning about Solr search results when some of the fields used are not stored?

Thanks!
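For illustration, the debug workflow described above is driven by query parameters; a sketch with placeholder collection and document ids (explainOther is Solr's standard debug parameter for explaining a document that is not in the returned results):

    # Explain scoring for the documents that were returned:
    curl "http://localhost:8983/solr/collection1/select?q=title:foo&debug=true"

    # Explain a specific document that was NOT returned:
    curl "http://localhost:8983/solr/collection1/select?q=title:foo&debug=true&explainOther=id:123"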