Search with very large boolean filter

2015-11-20 Thread jichi
Hi,

I am using Solr 4.7.0 to search text with an id filter, like this:

  id:(100 OR 2 OR 5 OR 81 OR 10 ...)

The number of IDs in the boolean filter is usually less than 100, but
it can sometimes be very large (around 30k IDs).

We currently set maxBooleanClauses to 1024, partition the IDs into batches of
1,000, and issue one Solr query per batch, which works but becomes slow when
the total number of IDs is larger than 10k.
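
Roughly, the batching looks like this (a simplified SolrJ sketch; the
endpoint, class name, and field name are made up for illustration):

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrDocument;

  public class BatchedIdFilterSketch {
    // Run the same user query once per batch of ids, merging results client-side.
    public static List<SolrDocument> search(String userQuery, List<String> ids)
        throws SolrServerException {
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
      List<SolrDocument> merged = new ArrayList<SolrDocument>();
      int batchSize = 1000; // stays under maxBooleanClauses=1024
      for (int start = 0; start < ids.size(); start += batchSize) {
        List<String> batch = ids.subList(start, Math.min(start + batchSize, ids.size()));
        StringBuilder fq = new StringBuilder("id:(");
        for (int i = 0; i < batch.size(); i++) {
          if (i > 0) fq.append(" OR ");
          fq.append(batch.get(i));
        }
        fq.append(")");
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery(fq.toString());
        q.setRows(batchSize);
        merged.addAll(solr.query(q).getResults());
      }
      return merged;
    }
  }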

I am wondering what would be the best strategy to handle this kind of
problem?
Can we increase the maxBooleanClauses to reduce the number of batches?
And if possible, we prefer not to create additional large indexes.

Thanks!


Re: Search with very large boolean filter

2015-11-20 Thread jichi
Thanks for the quick replies, Alex and Jack!

> definitely can improve on the ORing the ids with
Going to try that! But I guess it would still hit the maxBooleanClauses=1024
threshold.

> 1. Are you trying to retrieve a large number of documents, or simply
perform queries against a subset of the index?
We would like to perform queries against a subset of the index.

> 2. How many unique queries are you expecting to perform against each
specific filter set of IDs?
There are usually only a handful (around 10) of unique queries against the
same set of IDs within a short period of time (around 1 min).

> 3. How often does the set of IDs change?
The IDs are almost always different for each query.
Btw, in 99% of cases the total number is less than 1k,
but in the rare 1% it can be more than 10k.

> 4. Is there more than one filter set of IDs in use during a particular
interval of time?
No. The ID set will be the only filter applied to "id".


Thanks!


2015-11-20 14:26 GMT-08:00 Jack Krupansky :

> 1. Are you trying to retrieve a large number of documents, or simply
> perform queries against a subset of the index?
>
> 2. How many unique queries are you expecting to perform against each
> specific filter set of IDs?
>
> 3. How often does the set of IDs change?
>
> 4. Is there more than one filter set of IDs in use during a particular
> interval of time?
>
>
>
> -- Jack Krupansky
>
> On Fri, Nov 20, 2015 at 4:50 PM, jichi  wrote:
>
>> Hi,
>>
>> I am using Solr 4.7.0 to search text with an id filter, like this:
>>
>>   id:(100 OR 2 OR 5 OR 81 OR 10 ...)
>>
>> The number of IDs in the boolean filter is usually less than 100, but
>> it can sometimes be very large (around 30k IDs).
>>
>> We currently set maxBooleanClauses to 1024, partition the IDs into batches
>> of 1,000, and issue one Solr query per batch, which works but becomes slow
>> when the total number of IDs is larger than 10k.
>>
>> I am wondering what would be the best strategy to handle this kind of
>> problem?
>> Can we increase the maxBooleanClauses to reduce the number of batches?
>> And if possible, we prefer not to create additional large indexes.
>>
>> Thanks!
>>
>
>


-- 
jichi


Re: Search with very large boolean filter

2015-11-20 Thread jichi
Hi Shawn,

We have already switched the request method to POST.
I am going to try the terms query parser soon. I will post the performance
difference against the IN syntax here later.
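
If I read the terms query parser syntax right, the filter would then become a
single comma-separated list, something like this (field name ours):

  fq={!terms f=id}100,2,5,81,10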

Thanks!

2015-11-20 15:23 GMT-08:00 Shawn Heisey :

> On 11/20/2015 4:09 PM, jichi wrote:
> > Thanks for the quick replies, Alex and Jack!
> >
> >> definitely can improve on the ORing the ids with
> > Going to try that! But I guess it would still hit the
> maxBooleanClauses=1024
> > threshold.
>
> The terms query parser does not have a limit like boolean queries do.
> This query parser was added in version 4.10, so be aware of that.
> Querying for a large number of terms with the terms query parser will
> scale a lot better than a boolean query -- better performance.
>
> The number of terms you query will affect the size of the query text.
> The query size is constrained by either the max HTTP header size if the
> request is a GET, or the max form size if it's a POST.  The max HTTP
> header size is configurable in the servlet container (jetty, tomcat,
> etc) and I would not recommend going over about 32K with it.  The max
> form size is configurable in solrconfig.xml with the
> formdataUploadLimitInKB attribute on the requestParsers element.  That
> attribute defaults to 2048, which yields a default size of 2MB.
> Switching your queries to POST requests is advisable.
>
> Thanks,
> Shawn
>
>
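
(For reference, I believe the knob Shawn mentions is the requestParsers
element inside requestDispatcher in solrconfig.xml, e.g. with the 2 MB
default he describes:)

  <requestDispatcher>
    <requestParsers formdataUploadLimitInKB="2048" />
  </requestDispatcher>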


-- 
jichi


How to speed up field collapsing on large number of groups

2016-06-28 Thread jichi
Hi everyone,

I am using Solr 4.10 to index 20 million documents without sharding.
Each document has a groupId field, and there are about 2 million groups.
I found that searching with collapsing on groupId is significantly slower
than searching without collapsing, especially when combined with facet
queries.

I am wondering what would be the general approach to speed up field
collapsing by 2~4 times?
Would sharding the index help?
Is it possible to optimize collapsing without sharding?

The filter parameter for collapsing is like this:

q=*:*&fq={!collapse field=groupId max=sum(...a long formula...)}

I also put this fq into the warmup queries XML to warm the caches. But still,
when q changes and more fq are added, the collapsing search takes about
3~5 seconds. Without collapsing, the search can finish within 2 seconds.
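
(Concretely, I mean an entry roughly like this under the newSearcher listener
in solrconfig.xml, with the formula elided:)

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="fq">{!collapse field=groupId max=sum(...a long formula...)}</str>
      </lst>
    </arr>
  </listener>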

I am thinking of manually optimizing CollapsingQParserPlugin through
parallelization or extra caching.
For example, is it possible to parallelize the collapsing collector across
different Lucene index segments?

Thanks!

-- 
jichi


Re: How to speed up field collapsing on large number of groups

2016-06-28 Thread Jichi Guo
Thanks for the quick response, Joel!

I am hoping to delay sharding if possible, which might involve more things to
consider :)  

  

> 1) What is the size of the result set before the collapse?

When searching with q=*:*, for example, numFound is around 5 million before
the collapse and around 2 million after it.
I only return about the top 30 documents in the result.

> 2) Have you tested without the long formula, just using a field for the
> min/max. It would be good to understand the impact of the formula on
> performance.

The performance seems to be affected by the number of fields appearing in the
max formula.

For example, that expensive query over the ~5 million hits takes 4.4 sec.
With either {!collapse field=productGroupId} or
{!collapse field=productGroupId max=only_one_field}, the query time drops to
around 2.4 sec.
If I remove the collapse fq entirely, the query takes only 1.3 sec.

> 3) How much memory do you have on the server and for the heap. Memory use
> rises with the cardinality of the collapse field. So you'll want to be sure
> there is enough memory to comfortably perform the collapse.

I am setting Xmx to 24G. The total index size on disk is 50G.

In solrconfig.xml, I use solr.FastLRUCache for filterCache with cache size
2048, solr.LRUCache for documentCache with cache size 32768, and
solr.LRUCache for queryResultCache with cache size 4096. I am using the
default fieldValueCache.
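
(In solrconfig.xml terms, roughly the following; initialSize and
autowarmCount are omitted here:)

  <filterCache class="solr.FastLRUCache" size="2048"/>
  <documentCache class="solr.LRUCache" size="32768"/>
  <queryResultCache class="solr.LRUCache" size="4096"/>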

  

I found that CollapsingQParserPlugin explicitly uses Lucene's field cache.
Maybe increasing the fieldCache would help? But I am not sure how to increase
it in Solr.

  

On Jun 28 2016, at 4:48 pm, Joel Bernstein <joels...@gmail.com> wrote:

> Sharding will help, but you'll need to co-locate documents by group ID. A
> few questions / suggestions:
>
> 1) What is the size of the result set before the collapse?
>
> 2) Have you tested without the long formula, just using a field for the
> min/max. It would be good to understand the impact of the formula on
> performance.
>
> 3) How much memory do you have on the server and for the heap. Memory use
> rises with the cardinality of the collapse field. So you'll want to be sure
> there is enough memory to comfortably perform the collapse.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Jun 28, 2016 at 4:08 PM, jichi <jichi...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I am using Solr 4.10 to index 20 million documents without sharding.
>> Each document has a groupId field, and there are about 2 million groups.
>> I found that searching with collapsing on groupId is significantly slower
>> than searching without collapsing, especially when combined with facet
>> queries.
>>
>> I am wondering what would be the general approach to speed up field
>> collapsing by 2~4 times?
>> Would sharding the index help?
>> Is it possible to optimize collapsing without sharding?
>>
>> The filter parameter for collapsing is like this:
>>
>> q=*:*&fq={!collapse field=groupId max=sum(...a long formula...)}
>>
>> I also put this fq into the warmup queries XML to warm the caches. But
>> still, when q changes and more fq are added, the collapsing search takes
>> about 3~5 seconds. Without collapsing, the search can finish within 2
>> seconds.
>>
>> I am thinking of manually optimizing CollapsingQParserPlugin through
>> parallelization or extra caching.
>> For example, is it possible to parallelize the collapsing collector across
>> different Lucene index segments?
>>
>> Thanks!
>>
>> --
>> jichi


Re: How to speed up field collapsing on large number of groups

2016-07-13 Thread Jichi Guo
Hi everyone,

  

Is it possible to optimize collapsing on a large index through
parallelization, without sharding?

Or can we conclude that sharding is currently the only approach to speed up
slow collapsing queries by a significant factor?

  

I tried manually parallelizing CollapsingQParserPlugin across Lucene
segments. In particular, I added a thread pool to IndexSearcher and then
parallelized CollapsingQParserPlugin.CollapsingFieldValueCollector, which I
rewrote to use the LeafCollector introduced in Lucene 5.

But I was surprised that parallelization made the overall performance worse.

Without parallelization, the first couple of Lucene segments took the
majority of the collapsing time, and the rest took almost no time.

After parallelization, collapsing on every segment took some time, and the
overall time became longer by about 20%.
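
Roughly, the shape of what I tried looks like the sketch below (heavily
simplified, Lucene 5-style APIs, not the actual patch; it ignores scoring
entirely and assumes the shared collector is thread-safe, which the
collapsing collectors are not out of the box):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  import org.apache.lucene.index.LeafReaderContext;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.LeafCollector;
  import org.apache.lucene.util.Bits;

  public class PerSegmentCollectSketch {
    // Feed every live doc of every segment into the collector, one task per segment.
    public static void collectAllDocsInParallel(final IndexSearcher searcher,
                                                final Collector collector,
                                                int threads) throws Exception {
      ExecutorService pool = Executors.newFixedThreadPool(threads);
      List<Future<Void>> futures = new ArrayList<Future<Void>>();
      for (final LeafReaderContext leaf : searcher.getIndexReader().leaves()) {
        futures.add(pool.submit(new Callable<Void>() {
          public Void call() throws IOException {
            // One LeafCollector per segment; doc ids below are segment-local.
            LeafCollector leafCollector = collector.getLeafCollector(leaf);
            Bits liveDocs = leaf.reader().getLiveDocs();
            for (int doc = 0; doc < leaf.reader().maxDoc(); doc++) {
              if (liveDocs == null || liveDocs.get(doc)) {
                leafCollector.collect(doc); // no setScorer(): scores are ignored here
              }
            }
            return null;
          }
        }));
      }
      for (Future<Void> f : futures) {
        f.get(); // propagate any failure from the worker threads
      }
      pool.shutdown();
    }
  }

The tricky part is merging the per-segment winners into a single global
collapse afterwards, which this sketch glosses over.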

  

Thanks!

  

On Jun 28 2016, at 1:08 pm, jichi <jichi...@gmail.com> wrote:

> Hi everyone,
>
> I am using Solr 4.10 to index 20 million documents without sharding.
> Each document has a groupId field, and there are about 2 million groups.
> I found that searching with collapsing on groupId is significantly slower
> than searching without collapsing, especially when combined with facet
> queries.
>
> I am wondering what would be the general approach to speed up field
> collapsing by 2~4 times?
> Would sharding the index help?
> Is it possible to optimize collapsing without sharding?
>
> The filter parameter for collapsing is like this:
>
> q=*:*&fq={!collapse field=groupId max=sum(...a long formula...)}
>
> I also put this fq into the warmup queries XML to warm the caches. But
> still, when q changes and more fq are added, the collapsing search takes
> about 3~5 seconds. Without collapsing, the search can finish within 2
> seconds.
>
> I am thinking of manually optimizing CollapsingQParserPlugin through
> parallelization or extra caching.
> For example, is it possible to parallelize the collapsing collector across
> different Lucene index segments?
>
> Thanks!
>
> --
> jichi



Restarting SolrCloud that is taking realtime updates

2016-11-25 Thread Jichi Guo
Hi,

  

I am seeking the best practice for restarting a sharded SolrCloud that is
taking search traffic as well as realtime updates, without downtime.

When I deploy new customized Solr plugins, for example, it requires
restarting the whole SolrCloud cluster.

I am testing Solr 6.2.1 with 4 shards.

I find that when SolrCloud is taking updates and I restart any Solr node
(no matter whether it is a leader, the overseer, or a normal replica),
the restarted node re-indexes its whole data from its leader, i.e., it
redownloads the whole index and then drops its old data.

The only way I have found to avoid this reindexing is to temporarily disable
updates, for example by invoking disableReplication on the leader node before
restarting.
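
(That is, something along the lines of the following against the leader core;
host and core names here are made up:)

  http://leader-host:8983/solr/mycollection_shard1_replica1/replication?command=disablereplication
  ... restart the other node ...
  http://leader-host:8983/solr/mycollection_shard1_replica1/replication?command=enablereplication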

  

Additionally, I didn't find a way to temporarily pause Solr replication to a
single replica. Before sharding, we could use disablePoll to disable
replication on a slave. But after sharding, disabling replication from the
leader node is the only way I found, which pauses not only replication to the
one node I want to restart but replication to all nodes in the same shard.

The procedure becomes more complex if I want to restart a leader node: I need
to first manually trigger a leader failover through rebalancing, then disable
replication on the new leader, then restart the old leader node, and finally
re-enable replication on the new leader.
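
(For concreteness, the rebalancing step could look roughly like this through
the Collections API; collection, shard, and replica names are made up:)

  /admin/collections?action=ADDREPLICAPROP&collection=mycollection&shard=shard1&replica=core_node2&property=preferredLeader&property.value=true
  /admin/collections?action=REBALANCELEADERS&collection=mycollection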

  

As you can see, it seems to take many steps to restart SolrCloud node by node
this way.

I am not sure whether this is the best procedure for restarting a whole
SolrCloud cluster that is taking realtime updates?

  

Thanks!

  



Re: Restarting SolrCloud that is taking realtime updates

2016-11-25 Thread Jichi Guo
Thanks so much for the very quick and detailed explanation, Erick!

  

According to the following page, it seems numRecordsToKeep cannot be set too
high, since the kept records must fit in a single POST.

It seems your 1> or 3> approaches would be the best in practice when the
number of updated documents is high.

  

https://support.lucidworks.com/hc/en-us/articles/203842143-Recovery-times-while-restarting-a-SolrCloud-node
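
(For reference, the setting lives on the updateLog element in solrconfig.xml;
the value below is only an example of what option 2> would mean:)

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">10000</int>
  </updateLog>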

  

Thanks again, and happy Thanksgiving!

  

  
On Nov 25 2016, at 2:33 pm, Erick Erickson  wrote:

> First, get out of thinking about the replication API, things like
> DISABLEPOLL and the like when in SolrCloud mode. The
> "old style" replication is used under the control of the synching
> strategy. Unless you've configured master/slave sections of
> your solrconfig.xml files and somehow dealt with the leader
> changing (who should be polled?), I'm pretty sure this is a total red herring.
>
> As for the rest, that's just the way it works. In SolrCloud, the
> raw documents are forwarded from the leader to the followers.
> Outside of a node going into recovery, replication isn't used
> at all.
>
> However, when a node goes into recovery (which by definition it will
> when the core is reloaded or the Solr instance is restarted) then
> the replica checks with the leader to see if it's "too far" out of date. The
> default "too far" is 100 docs, although this can be changed by setting
> the updateLog numRecordsToKeep to a higher number in solrconfig.xml.
> If the replica is too far out of date, a full index replication is done which
> is what you're observing.
>
> If the number of updates the leader has received is < 100
> (or numRecordsToKeep) the leader sends the raw documents to the
> follower from its update log and there is no "old style" replication there
> at all.
>
> So, the net-net here is that your choices are limited:
>
> 1> stop indexing while doing the restart.
>
> 2> bump numRecordsToKeep to some larger number that
>    you expect not to be exceeded for the time it takes to
>    restart each node.
>
> 3> live with the full index replication in this situation.
>
> I'll add parenthetically that having to redeploy plugins and the like
> _should_ be a relatively rare operation, and it seems (at least from
> the outside) to be a perfectly reasonable thing to do in a maintenance
> window when index updates are disabled.
>
> You can also consider using collection aliasing to switch back and
> forth between two collections so you can manipulate the current
> cold one and, when you're satisfied, switch the alias.
>
> Best,
> Erick
>
> On Fri, Nov 25, 2016 at 1:40 PM, Jichi Guo  wrote:
>> Hi,
>>
>> I am seeking the best practice for restarting a sharded SolrCloud that is
>> taking search traffic as well as realtime updates, without downtime.
>>
>> When I deploy new customized Solr plugins, for example, it requires
>> restarting the whole SolrCloud cluster.
>>
>> I am testing Solr 6.2.1 with 4 shards.
>>
>> I find that when SolrCloud is taking updates and I restart any Solr node
>> (no matter whether it is a leader, the overseer, or a normal replica),
>> the restarted node re-indexes its whole data from its leader, i.e., it
>> redownloads the whole index and then drops its old data.
>>
>> The only way I have found to avoid this reindexing is to temporarily
>> disable updates, for example by invoking disableReplication on the leader
>> node before restarting.
>>
>> Additionally, I didn't find a way to temporarily pause Solr replication to
>> a single replica. Before sharding, we could use disablePoll to disable
>> replication on a slave. But after sharding, disabling replication from the
>> leader node is the only way I found, which pauses not only replication to
>> the one node I want to restart but replication to all nodes in the same
>> shard.
>>
>> The procedure becomes more complex if I want to restart a leader node: I
>> need to first manually trigger a leader failover through rebalancing, then
>> disable replication on the new leader, then restart the old leader node,
>> and finally re-enable replication on the new leader.
>>
>> As you can see, it seems to take many steps to restart SolrCloud node by
>> node this way.

Best practices to debug Solr search in production without all fields stored

2016-12-30 Thread Jichi Guo
Hi everyone,

  

I found it convenient to debug Solr search results when all fields are marked
"stored=true" in the schema.

For example, given a document, I could check why it is not returned by a
query with debug=true.
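
(That is, checks along these lines, with the id value made up; explainOther
in particular can explain a specific document against the main query:)

  q=title:foo&fq=id:12345&debugQuery=true
  q=title:foo&debugQuery=true&explainOther=id:12345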

  

But in production, most of the fields have "stored=false" for performance
reasons.

I am wondering what the recommended approaches are to easily reason about
Solr search results when some of the fields used are not stored?

  

Thanks!

  