MatchAllDocsQuery is much slower in solr5.3.1 compared to solr4.7

2015-11-05 Thread wei
We are running our search on solr4.7 and I am evaluating whether to upgrade
to solr5.3.1. I found that MatchAllDocsQuery is much slower in solr5.3.1.
Does anyone know why?

We have a lot of queries without any query keyword, but we apply filters on
the query. Load testing shows those queries are much slower in solr5.3.1
compared to 4.7. If we load test with queries that have search keywords, the
queries are much faster in solr5.3.1 compared to solr4.7.
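For reference, the kind of request in question, reconstructed from the params
echoed in the debug output below (host and collection name are placeholders),
looks like:

    http://localhost:8983/solr/collection1/select?q=*:*&fq=%2BcategoryIdsPath:1001&rows=2&fl=id&start=0&debugQuery=true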
Here is the sample debug info:
(in solr 4.7)

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">86</int>
    <lst name="params">
      <str name="fl">id</str>
      <str name="start">0</str>
      <str name="q">*:*</str>
      <str name="debugQuery">true</str>
      <str name="fq">+categoryIdsPath:1001</str>
      <str name="rows">2</str>
    </lst>
  </lst>
  <result name="response" numFound="..." start="0">
    <doc><str name="id">36652255</str></doc>
    <doc><str name="id">36651884</str></doc>
  </result>
  <lst name="debug">
    <str name="rawquerystring">*:*</str>
    <str name="querystring">*:*</str>
    <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
    <str name="parsedquery_toString">*:*</str>
    <lst name="explain">
      <str name="36652255">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm</str>
      <str name="36651884">
1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm</str>
    </lst>
    <str name="QParser">LuceneQParser</str>
    <arr name="filter_queries">
      <str>+categoryIdsPath:1001</str>
    </arr>
    <arr name="parsed_filter_queries">
      <str>+categoryIdsPath:1001</str>
    </arr>
    <lst name="timing">
      <double name="time">86.0</double>
      <lst name="prepare">
        <double name="time">0.0</double>
        <!-- six search-component entries, 0.0 each -->
      </lst>
      <lst name="process">
        <double name="time">86.0</double>
        <lst name="query"><double name="time">85.0</double></lst>
        <!-- four more component entries, 0.0 each -->
        <lst name="debug"><double name="time">1.0</double></lst>
      </lst>
    </lst>
  </lst>
</response>

(in solr 5.3.1)

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">313</int>
    <lst name="params">
      <str name="fl">id</str>
      <str name="start">0</str>
      <str name="q">*:*</str>
      <str name="debugQuery">true</str>
      <str name="fq">+categoryIdsPath:1001</str>
      <str name="rows">2</str>
    </lst>
  </lst>
  <result name="response" numFound="..." start="0">
    <doc><str name="id">36652255</str></doc>
    <doc><str name="id">36651884</str></doc>
  </result>
  <lst name="debug">
    <str name="rawquerystring">*:*</str>
    <str name="querystring">*:*</str>
    <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
    <str name="parsedquery_toString">*:*</str>
    <lst name="explain">
      <str name="36652255">
1.0 = *:*, product of:
  1.0 = boost
  1.0 = queryNorm</str>
      <str name="36651884">
1.0 = *:*, product of:
  1.0 = boost
  1.0 = queryNorm</str>
    </lst>
    <str name="QParser">LuceneQParser</str>
    <arr name="filter_queries">
      <str>+categoryIdsPath:1001</str>
    </arr>
    <arr name="parsed_filter_queries">
      <str>+categoryIdsPath:1001</str>
    </arr>
    <lst name="timing">
      <double name="time">313.0</double>
      <lst name="prepare">
        <double name="time">0.0</double>
        <!-- eight search-component entries, 0.0 each -->
      </lst>
      <lst name="process">
        <double name="time">311.0</double>
        <lst name="query"><double name="time">311.0</double></lst>
        <!-- seven more component entries, 0.0 each -->
      </lst>
    </lst>
  </lst>
</response>

Thanks,
Wei


Re: MatchAllDocsQuery is much slower in solr5.3.1 compared to solr4.7

2015-11-06 Thread wei
Thanks Jack and Shawn. I checked those Jira tickets, but I am not sure
whether the slowness of MatchAllDocsQuery is also caused by the removal of
the FieldCache. Can someone please explain a little bit?

Thanks,
Wei

On Fri, Nov 6, 2015 at 7:15 AM, Shawn Heisey  wrote:

> On 11/5/2015 10:25 PM, Jack Krupansky wrote:
> > I vaguely recall some discussion concerning removal of the field cache in
> > Lucene.
>
> The FieldCache wasn't exactly *removed* ... it's more like it was
> renamed, improved, and sort of hidden in a miscellaneous package.  Some
> things still require this functionality, so they use the hidden class
> instead, which was changed to use the DocValues API.
>
> https://issues.apache.org/jira/browse/LUCENE-5666
>
> I am not qualified to discuss LUCENE-5666 beyond what I wrote in the
> paragraph above, and it's possible that some of what I said is wrong
> because I do not really understand the APIs involved.
>
> The change has caused problems for Solr.  End result from Solr's
> perspective: Certain things which used to work perfectly fine (mostly
> facets and grouping) in Solr 4.x have one of two problems in 5.x:
> Either they don't work at all, or performance has gone way down.  Some
> of these problems are documented in Jira.  These are the issues I know
> about:
>
> https://issues.apache.org/jira/browse/SOLR-8088
> https://issues.apache.org/jira/browse/SOLR-7495
> https://issues.apache.org/jira/browse/SOLR-8096
>
> For fields where adding docValues is a viable option (most field types
> other than solr.TextField), adding docValues and reindexing is very
> likely to solve those problems.
>
> Sometimes adding docValues won't work, either because the field type
> doesn't allow it, or because it's the indexed terms that are needed, not
> the original field value.  For those situations, there is currently no
> solution.
>
> Thanks,
> Shawn
>
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compared to solr4.7

2015-11-06 Thread wei
and see if that is a lot faster, both with old and new Solr.
>
> -- Jack Krupansky
>
> On Fri, Nov 6, 2015 at 3:01 PM, wei  wrote:
>
> > Thanks Jack and Shawn. I checked these Jira tickets, but I am not sure if
> > the slowness of MatchAllDocsQuery is also caused by the removal of
> > fieldcache. Can someone please explain a little bit?
> >
> > Thanks,
> > Wei
> >
> > On Fri, Nov 6, 2015 at 7:15 AM, Shawn Heisey 
> wrote:
> >
> > > On 11/5/2015 10:25 PM, Jack Krupansky wrote:
> > > > I vaguely recall some discussion concerning removal of the field
> cache
> > in
> > > > Lucene.
> > >
> > > The FieldCache wasn't exactly *removed* ... it's more like it was
> > > renamed, improved, and sort of hidden in a miscellaneous package.  Some
> > > things still require this functionality, so they use the hidden class
> > > instead, which was changed to use the DocValues API.
> > >
> > > https://issues.apache.org/jira/browse/LUCENE-5666
> > >
> > > I am not qualified to discuss LUCENE-5666 beyond what I wrote in the
> > > paragraph above, and it's possible that some of what I said is wrong
> > > because I do not really understand the APIs involved.
> > >
> > > The change has caused problems for Solr.  End result from Solr's
> > > perspective: Certain things which used to work perfectly fine (mostly
> > > facets and grouping) in Solr 4.x have one of two problems in 5.x:
> > > Either they don't work at all, or performance has gone way down.  Some
> > > of these problems are documented in Jira.  These are the issues I know
> > > about:
> > >
> > > https://issues.apache.org/jira/browse/SOLR-8088
> > > https://issues.apache.org/jira/browse/SOLR-7495
> > > https://issues.apache.org/jira/browse/SOLR-8096
> > >
> > > For fields where adding docValues is a viable option (most field types
> > > other than solr.TextField), adding docValues and reindexing is very
> > > likely to solve those problems.
> > >
> > > Sometimes adding docValues won't work, either because the field type
> > > doesn't allow it, or because it's the indexed terms that are needed,
> not
> > > the original field value.  For those situations, there is currently no
> > > solution.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compared to solr4.7

2015-11-06 Thread wei
The explain sections are different in solr4.7 and solr 5.3.1. In solr 4.7
there is only the queryNorm:

 1.0 = (MATCH) MatchAllDocsQuery, product of:
  1.0 = queryNorm

In solr 5.3.1 there is actually a boost, and the score is the product of
boost & queryNorm:

 1.0 = *:*, product of:
  1.0 = boost
  1.0 = queryNorm

Can that cause the problem, if solr5 needs to calculate the product for all
the hits? I am not sure where the boost comes from, and why it is different
from solr4.7.


Re: MatchAllDocsQuery is much slower in solr5.3.1 compared to solr4.7

2015-11-06 Thread wei
Hi Shawn,

I took care of the warm-up problem during the test. I set up a jmeter
project, got a query log from our production (>10 queries), and ran the same
query log through jmeter to hit the solr instances at the same qps (about
40). I removed warmup queries in both solr setups, and also set cache
autowarming to 0 in solrconfig. I ran the test for 1 hour. These two
instances are not serving other query traffic, but they both get update
traffic. I disabled softcommit in solr5 and set the hardcommit to 2 minutes.
The solr4 instance is a slave node replicating from a solr4 master instance;
the master also has a 2-minute commit cycle, and the testing solr4 instance
replicates the index every 2 minutes.

The solr5 is slower than solr4. After some investigation I realized that the
queries containing q=*:* seem to be causing the problem. I split the query
log into two log files, one with q=*:* and another without (almost all our
queries have filter queries). When I run the test, solr5 is faster when
running queries with a query keyword, but is much slower when running the
"q=*:*" query log.

There is no other query traffic to either of the two instances (there is
index traffic). When I captured the query debug output in my first email, I
made sure there was no filter cache (verified through the solr console;
after a hard commit, the filterCache is cleared).

Hope my email addresses your concern about how I ran the test. What is
obvious to me is that solr5 is faster in one test (with query keyword) and
slower in the other test (without query keyword).

Thanks,
Wei

On Fri, Nov 6, 2015 at 1:41 PM, Shawn Heisey  wrote:

> On 11/6/2015 1:01 PM, wei wrote:
> > Thanks Jack and Shawn. I checked these Jira tickets, but I am not sure if
> > the slowness of MatchAllDocsQuery is also caused by the removal of
> > fieldcache. Can someone please explain a little bit?
>
> I only glanced at your full output in the message at the start of this
> thread.  I thought I saw facet output in it, but it turns out that the
> only mention of facets was the timing information from the debug, so
> that very likely rules out the FieldCache change as a culprit.
>
> I am suspecting that the 4.7 index is warmed better, and may have the
> specific filter query (categoryIdsPath:1001) already sitting in the
> filterCache.
>
> Try running that query a few times on both versions, then restart
> Solr on both versions so they both start clean, and run the query *once*
> on each system, and see whether there's still a large discrepancy.
>
> If one of the systems is receiving queries from active clients and the
> other is not, then the comparison will be unfair, and biased towards the
> one that is getting additional queries.  Query activity, even if it
> seems unrelated to the query you are testing, has a tendency to reduce
> overall qtime values.
>
> Thanks,
> Shawn
>
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compared to solr4.7

2015-11-06 Thread wei
Good point! I tried that: on solr5 the query time is around 100-110ms, and
on solr4 it is around 60-63ms (very consistent). Solr5 is slower.

Thanks,
Wei

On Fri, Nov 6, 2015 at 6:46 PM, Yonik Seeley  wrote:

> On Fri, Nov 6, 2015 at 9:30 PM, wei  wrote:
> > in solr 5.3.1, there is actually a boost, and the score is product of
> boost
> > & queryNorm.
>
> Hmmm, well, it's worth putting on the list of stuff to investigate.
> Boosting was also changed in lucene.
>
> What happens if you try this multiple times in a row?
>
> &rows=2&fl=id&q={!cache=false}*:*&fq=categoryIdsPath:1001
>
> (basically just add {!cache=false} as a prefix to the main query.)
>
> This would allow hotspot time to compile methods, and ensure that the
> filter query was cached, and do a better job of isolating the
> "filtered match-all-docs" part of the execution.
>
> -Yonik
>


Re: MatchAllDocsQuery is much slower in solr5.3.1 compared to solr4.7

2015-11-06 Thread wei
Thanks Yonik.

A JIRA bug has been opened:
https://issues.apache.org/jira/browse/SOLR-8251

Wei

On Fri, Nov 6, 2015 at 7:10 PM, Yonik Seeley  wrote:

> On Fri, Nov 6, 2015 at 9:56 PM, wei  wrote:
> > Good point! I tried that, on solr5 the query time is around 100-110ms,
> and
> > on solr4 it is around 60-63ms(very consistent). Solr5 is slower.
>
> When it's something easy, there comes a point when it makes sense to
> stop asking more questions and just try it yourself...
> I just did this, and can confirm what you're seeing.   For me, 5.3.1
> is about 5x slower than 4.10 for this particular query.
> Thanks for your persistence / patience in reporting this.  Could you
> open a JIRA issue for it?
>
> -Yonik
>


solr query latency spike when replicating index

2015-04-02 Thread wei
I noticed a solr query latency spike on the slave node when replicating the
index from the master. Especially when the master has just finished
optimization, the slave node will copy the whole index, and the latency is
really bad.

Is there some way to fix it?

Thanks,
Wei


Re: solr query latency spike when replicating index

2015-04-04 Thread wei
seems "sar" is not installed. This is product machine, so I can't install
it. We use ssd, and the gc throughput is about 95.8.
We already throttle the replication to below 20M.

We also have enough memory to hold both the jvm and index in memory. I am
not sure when replicating the index, if both indexes(old and new) need to
be in the memory. The memory is not big enough to hold both(old index+new
index+jvm).

Thanks,
Wei

On Fri, Apr 3, 2015 at 3:35 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> In Solr 5.0 you can throttle the replication and limit the bandwidth it
> uses. The Sematext guys wrote a nice blog post about it. See
> http://blog.sematext.com/2015/01/26/solr-5-replication-throttling/
>
> On Thu, Apr 2, 2015 at 1:53 PM, wei  wrote:
>
> > I noticed the solr query latency spike on slave node when replicating
> index
> > from master. Especially when master just finished optimization, the slave
> > node will copy the whole index, and the latency is really bad.
> >
> > Is there some way to fix it?
> >
> > Thanks,
> > Wei
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
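For reference, the throttling Shalin mentions is configured on the master's
replication handler; a minimal sketch, assuming the maxWriteMBPerSec
parameter described in that blog post (the 20 MB/sec value mirrors the cap
mentioned above):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="maxWriteMBPerSec">20</str>
      </lst>
    </requestHandler>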


Correct approach to copy index between solr clouds?

2017-08-25 Thread Wei
Hi,

In our set up there are two solr clouds:

Cloud A:  production cloud serves both writes and reads

Cloud B:  back up cloud serves only writes

Cloud A and B have the same shard configuration.

Write requests are sent to both cloud A and B. In certain circumstances
when Cloud A's update lags behind,  we want to bulk copy the binary index
from B to A.

We have tried two approaches:

Approach 1.
  For cloud A:
  a. delete collection to wipe out everything
  b. create new collection (data is empty now)
  c. shut down solr server
  d. copy binary index from cloud B to corresponding shard replicas in
cloud A
  e. start solr server

Approach 2.
  For cloud A:
  a.  shut down solr server
  b.  remove the whole 'data' folder under index/  in each replica
  c.  copy binary index from cloud B to corresponding shard replicas in
cloud A
  d.  start solr server

Is approach 2 sufficient?  I am wondering if delete/recreate collection
each time is necessary to get cloud into a "clean" state for copy binary
index between solr clouds.

Thanks for your advice!


Re: Correct approach to copy index between solr clouds?

2017-08-26 Thread Wei
Thanks Erick. Can you explain a bit more about the write.lock file? So far I
have been copying it over from B to A and haven't seen issues starting the
replica.

On Sat, Aug 26, 2017 at 9:25 AM, Erick Erickson 
wrote:

> Approach 2 is sufficient. You do have to insure that you don't copy
> over the write.lock file however as you may not be able to start
> replicas if that's there.
>
> There's a relatively little-known third option. You can (ab)use the
> replication API "fetchindex" command, see:
> https://cwiki.apache.org/confluence/display/solr/Index+Replication to
> pull the index from Cloud B to replicas on Cloud A. That has the
> advantage of working even if you are actively indexing to Cloud B.
> NOTE: currently you cannot _query_ CloudA (the target) while the
> fetchindex is going on, but I doubt you really care since you were
> talking about having Cloud A offline anyway. So for each replica you
> fetch to you'll send the fetchindex command directly to the replica on
> Cloud A and the "masterURL" will be the corresponding replica on Cloud
> B.
>
> Finally, what I'd really do is _only_ have one replica for each shard
> on Cloud A active and fetch to _that_ replica. I'd also delete the
> data dir on all the other replicas for the shard on Cloud A. Then as
> you bring the additional replicas up they'll do a full synch from the
> leader.
>
> FWIW,
> Erick
>
> On Fri, Aug 25, 2017 at 6:53 PM, Wei  wrote:
> > Hi,
> >
> > In our set up there are two solr clouds:
> >
> > Cloud A:  production cloud serves both writes and reads
> >
> > Cloud B:  back up cloud serves only writes
> >
> > Cloud A and B have the same shard configuration.
> >
> > Write requests are sent to both cloud A and B. In certain circumstances
> > when Cloud A's update lags behind,  we want to bulk copy the binary index
> > from B to A.
> >
> > We have tried two approaches:
> >
> > Approach 1.
> >   For cloud A:
> >   a. delete collection to wipe out everything
> >   b. create new collection (data is empty now)
> >   c. shut down solr server
> >   d. copy binary index from cloud B to corresponding shard replicas
> in
> > cloud A
> >   e. start solr server
> >
> > Approach 2.
> >   For cloud A:
> >   a.  shut down solr server
> >   b.  remove the whole 'data' folder under index/  in each replica
> >   c.  copy binary index from cloud B to corresponding shard replicas
> in
> > cloud A
> >   d.  start solr server
> >
> > Is approach 2 sufficient?  I am wondering if delete/recreate collection
> > each time is necessary to get cloud into a "clean" state for copy binary
> > index between solr clouds.
> >
> > Thanks for your advice!
>
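For reference, the fetchindex call Erick describes is sent to each target
replica individually; a sketch with placeholder hosts and core names:

    http://cloudA-host:8983/solr/mycoll_shard1_replica1/replication?command=fetchindex&masterUrl=http://cloudB-host:8983/solr/mycoll_shard1_replica1/replication

The masterUrl points at the corresponding replica on Cloud B; repeat once
per shard.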


commit time in solr cloud

2017-09-08 Thread Wei
Hi,

In solr cloud we want to track the last commit time on each node. The
information source is from the luke handler:
 admin/luke?numTerms=0&wt=json, e.g.


   "userData": {
     "commitTimeMSec": "1504895505447"
   },
   "lastModified": "2017-09-08T18:31:45.447Z"



I'm assuming the lastModified time is when the latest hard commit happened.
Is that correct?

On all nodes we have autoCommit set to a 15-minute interval. One observation
I don't understand is that quite often the last commit time on shard leaders
lags behind the last commit time on replicas; sometimes the lag is over 10
minutes. My understanding is that since update requests go to the leader
first, the timer on the leaders would start earlier than on the replicas. Am
I missing something here?

Thanks,
Wei


solr cloud without hard commit?

2017-09-28 Thread Wei
Hello All,

What are the impacts if solr cloud is configured to have only soft commits
but no hard commits? In this way, if a non-leader node crashes, will it
still be able to recover from the leader? Basically we are wondering, in a
read-heavy & write-heavy scenario, whether taking hard commits out could
help to improve query performance, and what the consequences are.

Thanks,
Wei


Re: solr cloud without hard commit?

2017-09-29 Thread Wei
Thanks Emir and Erick! This helps me a lot to understand the commit process.
A few more questions:

1. https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
mentions that for soft commit, "new segments are created that will be
merged". Does that mean that without hard commits, soft commits will create
many small segments in memory, and that could also slow down queries? As I
understand it, the merge policy only kicks in on hard commit.

2. Without hard commits configured, will segments still be fsynced to disk
when accumulated updates exceed ramBufferSizeMB? Is there any concern with
increasing ramBufferSizeMB to a large value?

3. Can transaction logs be disabled in solr cloud? Will functionality
(replication, peer sync) break without transaction logs?

Thanks,
Wei


On Fri, Sep 29, 2017 at 8:33 AM, Erick Erickson 
wrote:

> More than you want to know about hard and soft commits here:
> https://lucidworks.com/2013/08/23/understanding-
> transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> You don't need to read it though, Emir did an admirable job of telling
> you why turning off hard commits is a terrible idea.
>
> Best,
> Erick
>
> On Fri, Sep 29, 2017 at 1:07 AM, Emir Arnautović
>  wrote:
> > Hi Wei,
> > Hard commits are about data durability. They will roll over transaction
> > logs and create a new index segment. If configured with
> > openSearcher=false, they do not affect query performance much (other
> > than taking some resources) since they do not invalidate caches. If you
> > have transaction logs enabled, without hard commits they would grow
> > infinitely and could result in a full disk. In case of heavy indexing,
> > even rare hard commits can result in large transaction logs, causing a
> > Solr restart after a crash to take a while because the transaction logs
> > are replayed.
> >
> > Soft commits are the ones that affect query performance and should be
> > as rare as your requirements allow. They invalidate caches, causing cold
> > searches or, if you have warming set up, take resources to do the
> > warming.
> >
> > I would recommend keeping hard commits, set to every 20-60 seconds
> > (depending on indexing volume), and making sure openSearcher is set to
> > false.
> >
> > HTH,
> > Emir
> >
> >> On 29 Sep 2017, at 06:55, Wei  wrote:
> >>
> >> Hello All,
> >>
> >> What are the impacts if solr cloud is configured to have only soft
> commits
> >> but no hard commits? In this way if a non-leader node crashes, will it
> >> still be able to recover from the leader? Basically we are wondering
> in a
> >> read heavy & write heavy scenario, whether taking hard commit out could
> >> help to improve query performance and what are the consequences.
> >>
> >> Thanks,
> >> Wei
> >
>
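For reference, Emir's recommendation corresponds to something like this in
solrconfig.xml (the 60-second value is illustrative):

    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>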


Leader initiated recovery authentication failure

2017-11-04 Thread Wei
Hi All,

After enabling basic authentication for solr cloud, I noticed that the
internal leader-initiated recovery failed with a 401 response.

The recovery request from leader:

GET //replica1.mycloud.com:9090/solr/admin/cores?action=REQUESTRECOVERY&core=replica1&wt=javabin&version=2
HTTP/1.1" 401 310 "-"
"Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0" 5

My authorization config is:

"authorization": {
  "class": "solr.RuleBasedAuthorizationPlugin",
  "permissions": [
    { "name": "security-edit",         "role": "admin", "index": 1 },
    { "name": "schema-edit",           "role": "admin", "index": 2 },
    { "name": "config-edit",           "role": "admin", "index": 3 },
    { "name": "core-admin-edit",       "role": "admin", "index": 4 },
    { "name": "collection-admin-edit", "role": "admin", "index": 5 }
  ]
}


It looks like the unauthorized error occurs because core-admin-edit requires
admin access. How can I configure authentication credentials for solr
cloud's internal requests? Appreciate your help!

Thanks,
Wei


solr cloud updatehandler stats mismatch

2017-11-05 Thread Wei
Hi,

I use the following api to track the number of update requests:

/solr/collection1/admin/mbeans?cat=UPDATE&stats=true&wt=json


Result:


"class": "org.apache.solr.handler.UpdateRequestHandler",
"version": "6.4.2.1",
"description": "Add documents using XML (with XSLT), CSV, JSON, or javabin",
"src": null,
"stats": {
  "handlerStart": 1509824945436,
  "requests": 106062,
  ...
}


I am quite confused that the number of requests reported above is quite
different from the count in the solr access logs. A few times the handler
stats were much higher: the handler reports ~100k requests but in the access
log there are only 5k update requests. What could be the possible cause?

Thanks,
Wei


Re: solr cloud updatehandler stats mismatch

2017-11-13 Thread Wei
Thanks Amrit. Can you explain a bit more about what kind of requests won't
be logged? Is that something configurable in solr?

Best,
Wei

On Thu, Nov 9, 2017 at 3:12 AM, Amrit Sarkar  wrote:

> Wei,
>
> Are the requests going to a collection that has multiple shards and
> replicas? Please mind that an update request is received by a node,
> redirected to the particular shard the doc belongs to, and then
> distributed to the replicas of the collection. On each replica, each core,
> the update request is played.
>
> That can be a probable reason for the mismatch between the MBeans stats
> and manual counting in logs, as not everything gets logged. Need to check
> that once.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Thu, Nov 9, 2017 at 4:34 PM, Furkan KAMACI 
> wrote:
>
> > Hi Wei,
> >
> > Do you compare it with files which are under /var/solr/logs by default?
> >
> > Kind Regards,
> > Furkan KAMACI
> >
> > On Sun, Nov 5, 2017 at 6:59 PM, Wei  wrote:
> >
> > > Hi,
> > >
> > > I use the following api to track the number of update requests:
> > >
> > > /solr/collection1/admin/mbeans?cat=UPDATE&stats=true&wt=json
> > >
> > >
> > > Result:
> > >
> > >
> > > "class": "org.apache.solr.handler.UpdateRequestHandler",
> > > "version": "6.4.2.1",
> > > "description": "Add documents using XML (with XSLT), CSV, JSON, or javabin",
> > > "src": null,
> > > "stats": {
> > >   "handlerStart": 1509824945436,
> > >   "requests": 106062,
> > >   ...
> > > }
> > >
> > >
> > > I am quite confused that the number of requests reported above is quite
> > > different from the count from solr access logs. A few times the handler
> > > stats is much higher: handler reports ~100k requests but in the access
> > log
> > > there are only 5k update requests. What could be the possible cause?
> > >
> > > Thanks,
> > > Wei
> > >
> >
>


Lucene two-phase iteration question

2017-12-22 Thread Wei
Hi,

I noticed that lucene introduced a new two-phase iteration API in version 5,
but I could not get a good understanding of how it works. Is there any
detailed documentation, or are there examples? Does two-phase iteration
result in better query performance? Appreciate your help.

Thanks,
Wei
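For background, here is a minimal sketch of how a caller consumes the
two-phase API, using the Lucene 6+ method names; collect() is a hypothetical
stand-in for whatever per-match work the caller does:

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.search.TwoPhaseIterator;

    final class TwoPhaseSketch {
      // Drive a scorer, using its two-phase view when one is available.
      static void drive(Scorer scorer) throws IOException {
        TwoPhaseIterator twoPhase = scorer.twoPhaseIterator();
        if (twoPhase == null) {
          // No approximation: every doc the iterator returns is a real match.
          DocIdSetIterator it = scorer.iterator();
          for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
            collect(doc);
          }
        } else {
          // Phase 1: a cheap approximation that may contain false positives,
          // e.g. docs holding all terms of a phrase with positions unchecked.
          DocIdSetIterator approximation = twoPhase.approximation();
          for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
            // Phase 2: the expensive per-doc verification, run only on candidates.
            if (twoPhase.matches()) {
              collect(doc);
            }
          }
        }
      }

      static void collect(int doc) { /* hypothetical per-match callback */ }
    }

The performance benefit shows up mainly in conjunctions: the cheap
approximations are intersected first, and the expensive matches() checks run
only on documents that survive the intersection.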


Re: Lucene two-phase iteration question

2018-01-01 Thread Wei
Hello Mikhail,

Thank you so much for the info. Trying to digest it first.Can you
elaborate more on what has changed? Any pointer is greatly appreciated.

Regards,
Wei

On Mon, Jan 1, 2018 at 10:04 AM, Mikhail Khludnev  wrote:

> Hello, Wei.
> Some first details have been discussed here
> https://www.youtube.com/watch?v=BM4-Mv0kWr8
> Unfortunately, things have changed from those times.
>
> On Sat, Dec 23, 2017 at 1:43 AM, Wei  wrote:
>
> > Hi,
> >
> > I noticed that lucene has introduced a new two-phase iteration API since
> 5,
> > but could not get a good understanding of how it works. Are there any
> > detail documentation or examples?  Does the two-phase iteration result in
> > better query performance?  Appreciate your help.
> >
> > Thanks,
> > Wei
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Multiple solr instances per host vs Multiple cores in same solr instance

2018-08-25 Thread Wei
Hi,

I have a question about the deployment configuration in solr cloud.  When
we need to increase the number of shards in solr cloud, there are two
options:

1.  Run multiple solr instances per host, each with a different port and
hosting a single core for one shard.

2.  Run one solr instance per host, and have multiple cores(shards) in the
same solr instance.

Which would be better performance-wise? For the first option I think the JVM
size for each solr instance can be smaller, but deployment is more
complicated? Are there any differences in cpu utilization?

Thanks,
Wei


Re: Multiple solr instances per host vs Multiple cores in same solr instance

2018-08-26 Thread Wei
Thanks Shawn. When using multiple Solr instances per host, is there any way
to prevent solrcloud from putting multiple replicas of the same shard on the
same host?
I see it makes sense if we can split into multiple instances with smaller
heap sizes. Besides that, do you think multiple instances will be able to
get better CPU utilization on a multi-core server?

Thanks,
Wei

On Sun, Aug 26, 2018 at 4:37 AM Shawn Heisey  wrote:

> On 8/26/2018 12:00 AM, Wei wrote:
> > I have a question about the deployment configuration in solr cloud.  When
> > we need to increase the number of shards in solr cloud, there are two
> > options:
> >
> > 1.  Run multiple solr instances per host, each with a different port and
> > hosting a single core for one shard.
> >
> > 2.  Run one solr instance per host, and have multiple cores(shards) in
> the
> > same solr instance.
> >
> > Which would be better performance wise? For the first option I think JVM
> > size for each solr instance can be smaller, but deployment is more
> > complicated? Are there any differences for cpu utilization?
>
> My general advice is to only have one Solr instance per machine.  One
> Solr instance can handle many indexes, and usually will do so with less
> overhead than two or more instances.
>
> I can think of *ONE* exception to this -- when a single Solr instance
> would require a heap that's extremely large. Splitting that into two or
> more instances MIGHT greatly reduce garbage collection pauses.  But
> there's a caveat to the caveat -- in my strong opinion, if your Solr
> instance is so big that it requires a huge heap and you're considering
> splitting into multiple Solr instances on one machine, you very likely
> need to run each of those instances on *separate* machines, so that each
> one can have access to all the resources of the machine it's running on.
>
> For SolrCloud, when you're running multiple instances per machine, Solr
> will consider those to be completely separate instances, and you may end
> up with all of the replicas for a shard on a single machine, which is a
> problem for high availability.
>
> Thanks,
> Shawn
>
>


Re: Multiple solr instances per host vs Multiple cores in same solr instance

2018-08-27 Thread Wei
Thanks Bernd. Do you have preferLocalShards=true in both cases? Did you
notice a CPU/memory utilization difference between the two deployments? How
many servers did you use in total? I am curious what the bottleneck is for
the one-instance, 3-core configuration.

Thanks,
Wei

On Mon, Aug 27, 2018 at 1:45 AM Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> My tests with many combinations (instance, node, core) on a 3-server
> cluster with SolrCloud pointed out that performance is highest with
> multiple solr instances, with shards and replicas placed by rules so that
> you get the advantage of preferLocalShards=true.
>
> The disadvantage is the handling of the system, which means setup,
> starting and stopping, and setting up the shards and replicas with rules
> and so on.
>
> I tested with a 3x3 SolrCloud (3 shards, 3 replicas).
> A 3x3 system with one instance and 3 cores per host could handle up to
> 30 QPS.
> A 3x3 system with multiple instances (different ports, a single core and
> shard per instance) could handle 60 QPS on the same hardware with the
> same data.
>
> Also, the single-instance-per-server setup has spikes in the response-time
> graph which are not seen with a multi-instance setup.
>
> Tested about 2 months ago with SolrCloud 6.4.2.
>
> Regards,
> Bernd
>
>
> Am 26.08.2018 um 08:00 schrieb Wei:
> > Hi,
> >
> > I have a question about the deployment configuration in solr cloud.  When
> > we need to increase the number of shards in solr cloud, there are two
> > options:
> >
> > 1.  Run multiple solr instances per host, each with a different port and
> > hosting a single core for one shard.
> >
> > 2.  Run one solr instance per host, and have multiple cores(shards) in
> the
> > same solr instance.
> >
> > Which would be better performance wise? For the first option I think JVM
> > size for each solr instance can be smaller, but deployment is more
> > complicated? Are there any differences for cpu utilization?
> >
> > Thanks,
> > Wei
> >
>


Re: Multiple solr instances per host vs Multiple cores in same solr instance

2018-08-31 Thread Wei
Hi Erick,

I am looking into the rule-based replica placement documentation and am
confused. How do I ensure there is no more than one replica of any shard on
the same host? There is an example rule shard:*,replica:<2,node:* that seems
to serve the purpose, but I am not sure whether 'node' refers to a solr
instance or an actual physical host. Is there an example of defining a node?

Thanks



On Sun, Aug 26, 2018 at 8:37 PM Erick Erickson 
wrote:

> Yes, you can use the "node placement rules", see:
> https://lucene.apache.org/solr/guide/6_6/rule-based-replica-placement.html
>
> This is a variant of "rack awareness".
>
> Of course the simplest way if you're not doing very many collections is to
> create the collection with the special "EMPTY" createNodeSet then just
> build out your collection with ADDREPLICA, placing each replica on a
> particular node. The idea of that capability was exactly to explicitly
> control
> where each and every replica landed.
>
> As a third alternative, just create the collection and let Solr put
> the replicas where
> it will, then use MOVEREPLICA to position replicas as you want.
>
> The node placement rules are primarily intended for automated or very large
> setups. Manually placing replicas is simpler for limited numbers.
>
> Best,
> Erick
> On Sun, Aug 26, 2018 at 8:10 PM Wei  wrote:
> >
> > Thanks Shawn. When using multiple Solr instances per host, is there any
> way
> > to prevent solrcloud from putting multiple replicas of the same shard on
> > same host?
> > I see it makes sense if we can splitting into multiple instances with
> > smaller heap size. Besides that, do you think multiple instances will be
> > able to get better CPU utilization on multi-core server?
> >
> > Thanks,
> > Wei
> >
> > On Sun, Aug 26, 2018 at 4:37 AM Shawn Heisey 
> wrote:
> >
> > > On 8/26/2018 12:00 AM, Wei wrote:
> > > > I have a question about the deployment configuration in solr cloud.
> When
> > > > we need to increase the number of shards in solr cloud, there are two
> > > > options:
> > > >
> > > > 1.  Run multiple solr instances per host, each with a different port
> and
> > > > hosting a single core for one shard.
> > > >
> > > > 2.  Run one solr instance per host, and have multiple cores(shards)
> in
> > > the
> > > > same solr instance.
> > > >
> > > > Which would be better performance wise? For the first option I think
> JVM
> > > > size for each solr instance can be smaller, but deployment is more
> > > > complicated? Are there any differences for cpu utilization?
> > >
> > > My general advice is to only have one Solr instance per machine.  One
> > > Solr instance can handle many indexes, and usually will do so with less
> > > overhead than two or more instances.
> > >
> > > I can think of *ONE* exception to this -- when a single Solr instance
> > > would require a heap that's extremely large. Splitting that into two or
> > > more instances MIGHT greatly reduce garbage collection pauses.  But
> > > there's a caveat to the caveat -- in my strong opinion, if your Solr
> > > instance is so big that it requires a huge heap and you're considering
> > > splitting into multiple Solr instances on one machine, you very likely
> > > need to run each of those instances on *separate* machines, so that
> each
> > > one can have access to all the resources of the machine it's running
> on.
> > >
> > > For SolrCloud, when you're running multiple instances per machine, Solr
> > > will consider those to be completely separate instances, and you may
> end
> > > up with all of the replicas for a shard on a single machine, which is a
> > > problem for high availability.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
>


question for rule based replica placement

2018-09-02 Thread Wei
Hi,

In rule-based replica placement, how do I ensure there is no more than one
replica of any shard on the same host? In the documentation there is an
example rule

shard:*,replica:<2,node:*

Does 'node' refer to a solr instance or an actual physical host? Is there an
example of defining the physical host?

Thanks,
Wei


Re: question for rule based replica placement

2018-09-02 Thread Wei
Thanks Erick. Suppose I have 5 hosts h1,h2,h3,h4,h5 and want to create a
5x2 solr cloud of 5 shards, 2 replicas per shard. On each host I will run
two solr JVMs, each hosting a single solr core. Solr's default 'snitch'
provides a 'host' tag, so I wonder if I can use it to prevent any host from
having two replicas of the same shard when creating the collection:

/solr/admin/collections?action=CREATE&name=mycollection&numShards=5&replicationFactor=2&maxShardsPerNode=1&rule=shard:*,replica:<2,host:*

Is this the correct way to use a 'snitch'? I cannot find more relevant
documentation on how to configure and customize a 'snitch'.

Thanks,
Wei

On Sun, Sep 2, 2018 at 9:30 PM Erick Erickson 
wrote:

> You need to provide a "snitch" and define a rule appropriately. This
> is a variant of "rack awareness".
>
> Solr considers two JVMs running on the same physical host as
> completely separate Solr instances, so to get replicas on different
> hosts you need a snitch etc.
>
> Best,
> Erick
> On Sun, Sep 2, 2018 at 4:39 PM Wei  wrote:
> >
> > Hi,
> >
> > In rule based replica placement,  how to ensure there are no more than
> one
> > replica for any shard on the same host?   In the documentation there is
> an
> > example rule
> >
> > shard:*,replica:<2,node:*
> >
> > Does 'node' refer to solr instance or actual physical host?  Is there an
> > example for defining the physical host?
> >
> > Thanks,
> > Wei
>


preferLocalShards setting

2018-09-06 Thread Wei
Hi,

I am setting up a solr cloud with an external load balancer. I noticed the
'preferLocalShards' configuration and I am wondering how it would impact
performance. If one host could have replicas of all shards it would surely
be beneficial; but in my 5-shard / 2-replica cloud on 5 servers, each server
will only host 2 of the 5 shards (2 JVMs per server, each JVM has one
replica of a different shard). Is it useful to set preferLocalShards=true
in this case?

Thanks,
Wei


Index optimization takes too long

2018-11-02 Thread Wei
Hello,

After a recent schema change, it takes almost 40 minutes to optimize the
index. The schema change was to enable docValues for all sort/facet fields,
which increased the index size from 12G to 14G. Before the change it only
took 5 minutes to do the optimization.

I have tried increasing maxMergeAtOnceExplicit because the default 30 could
be too low:

<int name="maxMergeAtOnceExplicit">100</int>

But it doesn't seem to help. Any suggestions?

Thanks,
Wei


Re: Index optimization takes too long

2018-11-03 Thread Wei
Thanks everyone! I checked the system metrics during the optimization
process. CPU usage is quite low, there is no I/O wait, and memory usage is
not much different from before the docValues change. So I wonder what the
bottleneck could be.

Thanks,
Wei

On Sat, Nov 3, 2018 at 1:38 PM Erick Erickson 
wrote:

> Going from my phone so it'll be terse.  See uninvertingmergeupdateprocessor
> (or something like that). Also, there's an idea in SOLR-12259 IIRC, but
> that'll be in 7.6 at the earliest.
>
> On Sat, Nov 3, 2018, 07:13 Shawn Heisey 
> > On 11/3/2018 5:32 AM, Dave wrote:
> > > On a side note, does adding docvalues to an already indexed field, and
> > then optimizing, prevent the need to reindex to take advantage of
> > docvalues? I was under the impression you had to reindex the content.
> >
> > You must reindex when changing the schema to add docValues.  An optimize
> > will not build the new data structures. It will only rebuild the data
> > structures that are already there.
> >
> > Thanks,
> > Shawn
> >
> >
>


Retrieve field from docValues

2018-11-05 Thread Wei
Hi,

I have a few questions about using the useDocValuesAsStored option to
retrieve fields from docValues:

1. For schema version 1.6, useDocValuesAsStored=true is the default, so
there is no need to explicitly set it in schema.xml?

2. With useDocValuesAsStored=true and the following definitions, will Solr
retrieve id from docValues instead of the stored field? If fl = id, title,
score, both id and title are single-value fields:

  <field name="id" ... docValues="true" required="true"/>

  <field name="title" ... docValues="true" required="true"/>

Do I need to have all fields stored="false" docValues="true" to make solr
retrieve from docValues only? I am using Solr 6.6.

Thanks,
Wei
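For context, both attributes in question are per-field settings in
schema.xml; a hypothetical example of a field meant to be returned from
docValues only:

    <field name="title" type="string" indexed="true" stored="false" docValues="true" useDocValuesAsStored="true"/>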


Re: Retrieve field from docValues

2018-11-06 Thread Wei
Thanks Yasufumi and Erick.

> 2. "it depends". Solr will try to do the most efficient thing possible.
> If _all_ the fields are docValues, it will return the stored values from
> the docValues structure.

I found this jira: https://issues.apache.org/jira/browse/SOLR-8344
Does this mean "Solr will try to do the most efficient thing possible" only
works in 7.x? Is the behavior available in 6.6?

> -- This prevents a disk seek and decompress cycle.

Does this still hold if the whole index is loaded into memory? Also, for the
benefit of the performance improvement, does the uniqueKey field need to
always be docValues, since it is used in the first phase of distributed
search?

Thanks,
Wei



On Tue, Nov 6, 2018 at 8:30 AM Erick Erickson 
wrote:

> 2. "it depends". Solr  will try to do the most efficient thing
> possible. If _all_ the fields are docValues, it will return the stored
> values from the docValues  structure. This prevents a disk seek and
> decompress cycle.
>
> However, if even one field is docValues=false Solr will by default
> return the stored values. For the multiValued case, you can explicitly
> tell Solr to return the docValues field.
>
> Best,
> Erick
> On Tue, Nov 6, 2018 at 1:46 AM Yasufumi Mizoguchi
>  wrote:
> >
> > Hi,
> >
> > > 1. For schema version 1.6, useDocValuesAsStored=true is default, so
> there
> > > is no need to explicitly set it in schema.xml?
> >
> > Yes.
> >
> > > 2.  With useDocValuesAsStored=true and the following definition, will
> Solr
> > > retrieve id from docValues instead of stored field?
> >
> > No.
> > AFAIK, if you define both docValues="true" and stored="true" in your
> > schema,
> > Solr tries to retrieve stored value.
> > (Except using streaming expressions or /export handler etc...
> > See:
> >
> https://lucene.apache.org/solr/guide/6_6/docvalues.html#DocValues-EnablingDocValues
> > )
> >
> > Thanks,
> > Yasufumi
> >
> >
> > 2018年11月6日(火) 9:54 Wei :
> >
> > > Hi,
> > >
> > > I have a few questions about using the useDocValuesAsStored option to
> > > retrieve field from docValues:
> > >
> > > 1. For schema version 1.6, useDocValuesAsStored=true is default, so
> there
> > > is no need to explicitly set it in schema.xml?
> > >
> > > 2.  With useDocValuesAsStored=true and the following definition, will
> Solr
> > > retrieve id from docValues instead of stored field? if fl= id, title,
> > > score,   both id and title are single value field:
> > >
> > > <field name="id" ... docValues="true" required="true"/>
> > >
> > > <field name="title" ... docValues="true" required="true"/>
> > >
> > >   Do I need to have all fields stored="false" docValues="true" to make
> solr
> > > retrieve from docValues only? I am using Solr 6.6.
> > >
> > > Thanks,
> > > Wei
> > >
>


Re: Retrieve field from docValues

2018-11-06 Thread Wei
I see there is also a docValuesFormat option; what's the default for this
setting? Performance-wise, is it good to set docValuesFormat="Memory"?

Best,
Wei


On Tue, Nov 6, 2018 at 11:55 AM Erick Erickson 
wrote:

> Yes, "the most efficient possible" is associated with that JIRA, so only
> in 7x.
>
> "Does this still hold if whole index is loaded into memory?"
> The decompression part yes, the disk seek part no. And it's also
> sensitive to whether the documentCache already has the document.
>
> I'd also make the uniqueKey and _version_ fields docValues.
>
> Best,
> Erick
> On Tue, Nov 6, 2018 at 10:44 AM Wei  wrote:
> >
> > Thanks Yasufumi and Erick.
> >
> > ---. 2. "it depends". Solr  will try to do the most efficient thing
> > possible. If _all_ the fields are docValues, it will return the stored
> > values from the docValues  structure.
> >
> > I find this jira:   https://issues.apache.org/jira/browse/SOLR-8344
> Does
> > this mean "Solr  will try to do the most efficient thing possible" only
> > working for 7.x?  Is the behavior available for 6.6?
> >
> > -- This prevents a disk seek and  decompress cycle.
> >
> > Does this still hold if whole index is loaded into memory?  Also for the
> > benefit of performance improvement,  does the uniqueKey field need to be
> > always docValues? Since it is used in the first phase of distributed
> > search.
> >
> > Thanks,
> > Wei
> >
> >
> >
> > On Tue, Nov 6, 2018 at 8:30 AM Erick Erickson 
> > wrote:
> >
> > > 2. "it depends". Solr  will try to do the most efficient thing
> > > possible. If _all_ the fields are docValues, it will return the stored
> > > values from the docValues  structure. This prevents a disk seek and
> > > decompress cycle.
> > >
> > > However, if even one field is docValues=false Solr will by default
> > > return the stored values. For the multiValued case, you can explicitly
> > > tell Solr to return the docValues field.
> > >
> > > Best,
> > > Erick
> > > On Tue, Nov 6, 2018 at 1:46 AM Yasufumi Mizoguchi
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > > 1. For schema version 1.6, useDocValuesAsStored=true is default, so
> > > there
> > > > > is no need to explicitly set it in schema.xml?
> > > >
> > > > Yes.
> > > >
> > > > > 2.  With useDocValuesAsStored=true and the following definition,
> will
> > > Solr
> > > > > retrieve id from docValues instead of stored field?
> > > >
> > > > No.
> > > > AFAIK, if you define both docValues="true" and stored="true" in your
> > > > schema,
> > > > Solr tries to retrieve stored value.
> > > > (Except using streaming expressions or /export handler etc...
> > > > See:
> > > >
> > >
> https://lucene.apache.org/solr/guide/6_6/docvalues.html#DocValues-EnablingDocValues
> > > > )
> > > >
> > > > Thanks,
> > > > Yasufumi
> > > >
> > > >
> > > > 2018年11月6日(火) 9:54 Wei :
> > > >
> > > > > Hi,
> > > > >
> > > > > I have a few questions about using the useDocValuesAsStored option
> to
> > > > > retrieve field from docValues:
> > > > >
> > > > > 1. For schema version 1.6, useDocValuesAsStored=true is default, so
> > > there
> > > > > is no need to explicitly set it in schema.xml?
> > > > >
> > > > > 2.  With useDocValuesAsStored=true and the following definition,
> will
> > > Solr
> > > > > retrieve id from docValues instead of stored field? if fl= id,
> title,
> > > > > score,   both id and title are single value field:
> > > > >
> > > > > <field name="id" ... docValues="true" required="true"/>
> > > > >
> > > > > <field name="title" ... docValues="true" required="true"/>
> > > > >
> > > > >   Do I need to have all fields stored="false" docValues="true" to
> make
> > > solr
> > > > > retrieve from docValues only? I am using Solr 6.6.
> > > > >
> > > > > Thanks,
> > > > > Wei
> > > > >
> > >
>


Re: Retrieve field from docValues

2018-11-06 Thread Wei
I also notice this issue is still open:
https://issues.apache.org/jira/browse/SOLR-10816
Does that mean we still need to have stored=true for the uniqueKey?

On Tue, Nov 6, 2018 at 2:14 PM Wei  wrote:

> I see there is also a docValuesFormat option, what's the default for this
> setting? Performance wise is it good to set docValuesFormat="Memory" ?
>
> Best,
> Wei
>
>
> On Tue, Nov 6, 2018 at 11:55 AM Erick Erickson 
> wrote:
>
>> Yes, "the most efficient possible" is associated with that JIRA, so only
>> in 7x.
>>
>> "Does this still hold if whole index is loaded into memory?"
>> The decompression part yes, the disk seek part no. And it's also
>> sensitive to whether the documentCache already has the document.
>>
>> I'd also make uniqueKey ant the _version_ fields docValues.
>>
>> Best,
>> Erick
>> On Tue, Nov 6, 2018 at 10:44 AM Wei  wrote:
>> >
>> > Thanks Yasufumi and Erick.
>> >
>> > ---. 2. "it depends". Solr  will try to do the most efficient thing
>> > possible. If _all_ the fields are docValues, it will return the stored
>> > values from the docValues  structure.
>> >
>> > I find this jira:   https://issues.apache.org/jira/browse/SOLR-8344
>> Does
>> > this mean "Solr  will try to do the most efficient thing possible" only
>> > working for 7.x?  Is the behavior available for 6.6?
>> >
>> > -- This prevents a disk seek and  decompress cycle.
>> >
>> > Does this still hold if whole index is loaded into memory?  Also for the
>> > benefit of performance improvement,  does the uniqueKey field need to be
>> > always docValues? Since it is used in the first phase of distributed
>> > search.
>> >
>> > Thanks,
>> > Wei
>> >
>> >
>> >
>> > On Tue, Nov 6, 2018 at 8:30 AM Erick Erickson 
>> > wrote:
>> >
>> > > 2. "it depends". Solr  will try to do the most efficient thing
>> > > possible. If _all_ the fields are docValues, it will return the stored
>> > > values from the docValues  structure. This prevents a disk seek and
>> > > decompress cycle.
>> > >
>> > > However, if even one field is docValues=false Solr will by default
>> > > return the stored values. For the multiValued case, you can explicitly
>> > > tell Solr to return the docValues field.
>> > >
>> > > Best,
>> > > Erick
>> > > On Tue, Nov 6, 2018 at 1:46 AM Yasufumi Mizoguchi
>> > >  wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > > 1. For schema version 1.6, useDocValuesAsStored=true is default,
>> so
>> > > there
>> > > > > is no need to explicitly set it in schema.xml?
>> > > >
>> > > > Yes.
>> > > >
>> > > > > 2.  With useDocValuesAsStored=true and the following definition,
>> will
>> > > Solr
>> > > > > retrieve id from docValues instead of stored field?
>> > > >
>> > > > No.
>> > > > AFAIK, if you define both docValues="true" and stored="true" in your
>> > > > schema,
>> > > > Solr tries to retrieve stored value.
>> > > > (Except using streaming expressions or /export handler etc...
>> > > > See:
>> > > >
>> > >
>> https://lucene.apache.org/solr/guide/6_6/docvalues.html#DocValues-EnablingDocValues
>> > > > )
>> > > >
>> > > > Thanks,
>> > > > Yasufumi
>> > > >
>> > > >
>> > > > 2018年11月6日(火) 9:54 Wei :
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > I have a few questions about using the useDocValuesAsStored
>> option to
>> > > > > retrieve field from docValues:
>> > > > >
>> > > > > 1. For schema version 1.6, useDocValuesAsStored=true is default,
>> so
>> > > there
>> > > > > is no need to explicitly set it in schema.xml?
>> > > > >
>> > > > > 2.  With useDocValuesAsStored=true and the following definition,
>> will
>> > > Solr
>> > > > > retrieve id from docValues instead of stored field? if fl= id,
>> title,
>> > > > > score,   both id and title are single value field:
>> > > > >
>> > > > >   > > > > > docValues="true" required="true"/>
>> > > > >
>> > > > >  > > > > > docValues="true" required="true"/>
>> > > > >
>> > > > >   Do I need to have all fields stored="false" docValues="true" to
>> make
>> > > solr
>> > > > > retrieve from docValues only? I am using Solr 6.6.
>> > > > >
>> > > > > Thanks,
>> > > > > Wei
>> > > > >
>> > >
>>
>


solr optimize command

2018-11-28 Thread Wei
Hi,

I use the following http request to start solr index optimization:

http://localhost:8983/solr/<collection>/update?skipError=true -F stream.body='<optimize/>'

The request returns status code 200 shortly afterwards, but when looking at
the solr instance I noticed that the actual optimization has not completed,
as there is more than 1 segment. Is the optimize command async? What is the
best approach to validate that the optimize has truly completed?


Thanks,

Wei
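For reference, one way to check, assuming the CoreAdmin STATUS API (whose
per-core index section includes a segmentCount), is to poll

    http://localhost:8983/solr/admin/cores?action=STATUS&core=<core>&wt=json

after issuing the optimize and wait for segmentCount to reach the expected
value.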


Questions for SynonymGraphFilter and WordDelimiterGraphFilter

2019-01-04 Thread Wei
Hello,

We are upgrading to Solr 7.6.0 and noticed that SynonymFilter and
WordDelimiterFilter have been deprecated. The Solr doc recommends using
SynonymGraphFilter and WordDelimiterGraphFilter instead. In the current
schema, we have a text field type defined as:

<fieldType name="text" class="solr.TextField" ...>

  <analyzer type="index">
    ...
  </analyzer>

  <analyzer type="query">
    ...
  </analyzer>

</fieldType>



In the index phase we have both SynonymFilter and WordDelimiterFilter
configured:

  <filter class="solr.SynonymFilterFactory" .../>
  <filter class="solr.WordDelimiterFilterFactory" .../>
The Solr documentation states that graph filters produce correct token
graphs but cannot consume an input token graph correctly, and that when
using these two graph filters during indexing, you must follow them with a
FlattenGraphFilter. I am confused as to how to replace our filters with the
new SynonymGraphFilter and WordDelimiterGraphFilter. A few questions:

1. Regarding the FlattenGraphFilter, is it to be used only once, or multiple
times, after each graph filter? Can we have a configuration like this?

   <filter class="solr.SynonymGraphFilterFactory" .../>
   <filter class="solr.FlattenGraphFilterFactory"/>

   <filter class="solr.WordDelimiterGraphFilterFactory" .../>
   <filter class="solr.FlattenGraphFilterFactory"/>

2. Is it possible to have two graph filters, i.e. both SynonymGraphFilter
and WordDelimiterGraphFilter, in the same analysis chain? If not, what's the
best option to replace our current config?

3. With the StopFilterFactory in between SynonymGraphFilter and
WordDelimiterGraphFilter, I get a few index errors:

Exception writing document id XX to the index; possible analysis error

Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

But if I move the StopFilter before the SynonymGraphFilter the errors are
gone.

I guess the StopFilter messes up the SynonymGraphFilter output? I am not
sure if it's a solr defect or whether there is a guideline that StopFilter
should not be put after graph filters.

Thanks in advance for your input.


Thanks,

Wei
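For reference, the pattern in the Solr 7 filter documentation uses a single
FlattenGraphFilter that closes the index-time chain, while the query-time
chain keeps the graph intact; a minimal sketch (tokenizer choice and
synonyms file are placeholders):

    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    </analyzer>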


Re: Questions for SynonymGraphFilter and WordDelimiterGraphFilter

2019-01-07 Thread Wei
Thanks Thomas. You mentioned "Also there is no need for the
FlattenGraphFilter", which is quite interesting because the Solr
documentation says it's mandatory for indexing:
https://lucene.apache.org/solr/guide/7_6/filter-descriptions.html. Is there
any more explanation for this?

Best regards,
Wei


On Mon, Jan 7, 2019 at 7:56 AM Thomas Aglassinger <
t.aglassin...@netconomy.net> wrote:

> Hi Wei,
>
> here's a fairly simple field type we currently use in a project that seems
> to do the job with graph synonyms. Maybe this helps as a starting point for
> you:
>
> <fieldType ... positionIncrementGap="100">
>     <analyzer>
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="..." managed="de" />
>         <filter class="..." managed="de" />
>         <filter class="..." preserveOriginal="1"
>                 generateWordParts="1" generateNumberParts="1"
>                 catenateWords="1" catenateNumbers="1" catenateAll="0"
>                 splitOnCaseChange="1" />
>     </analyzer>
> </fieldType>
>
>
> As you can see we use the same filters for both indexing and query, so
> this might have some impact on positional queries but so far it seems
> negligible for the short synonyms we use in practice. Also there is no need
> for the FlattenGraphFilter.
>
> The WhitespaceTokenizerFactory ensures that you can define synonyms with
> hyphens like mac-book -> macbook.
>
> Best regards, Thomas.
>
>
> On 05.01.19, 02:11, "Wei"  wrote:
>
> Hello,
>
> We are upgrading to Solr 7.6.0 and noticed that SynonymFilter and
> WordDelimiterFilter have been deprecated. Solr doc recommends to use
> SynonymGraphFilter and WordDelimiterGraphFilter instead
> I guess the StopFilter mess up the SynonymGraphFilter output? Not sure
> if  it's a solr defect or there is a guideline that StopFilter should
> not be put after graph filters.
>
> Thanks in advance for you input.
>
>
> Thanks,
>
> Wei
>
>
>


Re: Questions for SynonymGraphFilter and WordDelimiterGraphFilter

2019-01-08 Thread Wei
bump..

On Mon, Jan 7, 2019 at 11:53 AM Wei  wrote:

> Thanks Thomas. You mentioned "Also there is no need for the
> FlattenGraphFilter", that's quite interesting because the Solr
> documentation says it's mandatory for indexing:
> https://lucene.apache.org/solr/guide/7_6/filter-descriptions.html. Is
> there any more explanation for this?
>
> Best regards,
> Wei
>
>
> On Mon, Jan 7, 2019 at 7:56 AM Thomas Aglassinger <
> t.aglassin...@netconomy.net> wrote:
>
>> [...]


solr 7 optimize with Tlog/Pull replicas

2019-03-08 Thread Wei
Hi,

Recently I encountered a strange issue with optimize in Solr 7.6. The cloud
is created with 4 shards and 2 Tlog replicas per shard. After a batch index
update I issue an optimize command to a randomly picked replica in the
cloud. After a while when I check, all the non-leader Tlog replicas have
finished optimizing to a single segment; however, all the leader replicas
still have multiple segments. Previously, in the all-NRT-replica cloud, I
saw optimization triggered on all nodes. Is the optimization process
different with Tlog/Pull replicas?

Best,
Wei


Re: solr 7 optimize with Tlog/Pull replicas

2019-03-10 Thread Wei
Thanks Erick.

1> TLOG replicas shouldn’t optimize on the follower. They should optimize
on the leader then replicate the entire index to the follower.

Does that mean the follower will ignore the optimize request? Or shall I
send the optimize request only to one of the leaders?

2> As of Solr 7.5, optimize should not optimize to a single segment
_unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
numSegments on the optimize command.

-- Is the 5G limit controlled by the maxMergedSegmentMB setting? In
solrconfig.xml I used these settings:


<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnceExplicit">100</int>
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <int name="maxMergedSegmentMB">20480</int>
</mergePolicyFactory>

But in the end I see multiple segments much smaller than the 20GB limit.
In 7.6 is it required to explicitly set the number of segments to 1? E.g.,
shall I use

/update?optimize=true&waitSearcher=false&maxSegments=1

Best,
Wei


On Fri, Mar 8, 2019 at 12:29 PM Erick Erickson 
wrote:

> This is very odd for at least two reasons:
>
> 1> TLOG replicas shouldn’t optimize on the follower. They should optimize
> on the leader then replicate the entire index to the follower.
>
> 2> As of Solr 7.5, optimize should not optimize to a single segment
> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
> numSegments on the optimize command.
>
> So if you can reliably reproduce this, it’s probably worth a JIRA…...
>
> > On Mar 8, 2019, at 11:21 AM, Wei  wrote:
> >
> > Hi,
> >
> > Recently I encountered a strange issue with optimize in Solr 7.6. The
> cloud
> > is created with 4 shards with 2 Tlog replicas per shard. After batch
> index
> > update I issue an optimize command to a randomly picked replica in the
> > cloud.  After a while when I check,  all the non-leader Tlog replicas
> > finished optimization to a single segment, however all the leader
> replicas
> > still have multiple segments.  Previously in the all NRT replica cloud,
> I
> > see optimization is triggered on all nodes.  Is the optimization process
> > different with Tlog/Pull replicas?
> >
> > Best,
> > Wei
>
>


Re: solr 7 optimize with Tlog/Pull replicas

2019-03-10 Thread Wei
A side question: for heavy bulk indexing, what's the recommended setting
for auto commit? As there is no query needed during the bulk indexing
process, I have auto soft commit disabled. Is there any side effect if I
also disable auto commit?

On Sun, Mar 10, 2019 at 10:22 PM Wei  wrote:

> Thanks Erick.
>
> 1> TLOG replicas shouldn’t optimize on the follower. They should optimize
> on the leader then replicate the entire index to the follower.
>
> Does that mean the follower will ignore the optimize request? Or shall I
> send the optimize request only to one of the leaders?
>
> 2> As of Solr 7.5, optimize should not optimize to a single segment
> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
> numSegments on the optimize command.
>
> -- Is the 5G limit controlled by the maxMergedSegmentMB setting? In
> solrconfig.xml I used these settings:
>
> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>   <int name="maxMergeAtOnceExplicit">100</int>
>   <int name="maxMergeAtOnce">10</int>
>   <int name="segmentsPerTier">10</int>
>   <int name="maxMergedSegmentMB">20480</int>
> </mergePolicyFactory>
>
> But in the end I see multiple segments much smaller than the 20GB limit.
> In 7.6 is it required to explicitly set the number of segments to 1? e.g
> shall I use
>
> /update?optimize=true&waitSearcher=false&maxSegments=1
>
> Best,
> Wei
>
>
> On Fri, Mar 8, 2019 at 12:29 PM Erick Erickson 
> wrote:
>
>> This is very odd for at least two reasons:
>>
>> 1> TLOG replicas shouldn’t optimize on the follower. They should optimize
>> on the leader then replicate the entire index to the follower.
>>
>> 2> As of Solr 7.5, optimize should not optimize to a single segment
>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
>> numSegments on the optimize command.
>>
>> So if you can reliably reproduce this, it’s probably worth a JIRA…...
>>
>> > On Mar 8, 2019, at 11:21 AM, Wei  wrote:
>> >
>> > Hi,
>> >
>> > RecentIy I encountered a strange issue with optimize in Solr 7.6. The
>> cloud
>> > is created with 4 shards with 2 Tlog replicas per shard. After batch
>> index
>> > update I issue an optimize command to a randomly picked replica in the
>> > cloud.  After a while when I check,  all the non-leader Tlog replicas
>> > finished optimization to a single segment, however all the leader
>> replicas
>> > still have multiple segments.  Previously inn the all NRT replica
>> cloud, I
>> > see optimization is triggered on all nodes.  Is the optimization process
>> > different with Tlog/Pull replicas?
>> >
>> > Best,
>> > Wei
>>
>>


Re: solr 7 optimize with Tlog/Pull replicas

2019-03-12 Thread Wei
Thanks Erick, it's very helpful.  So for bulk indexing in a Tlog or
Tlog/Pull cloud,  when we optimize at the end of updates, segments on the
leader replica will change rapidly and the follower replicas will be
continuously pulling from the leader, effectively downloading the whole
index.  Is there a more efficient way?

On Mon, Mar 11, 2019 at 9:59 AM Erick Erickson 
wrote:

> do _not_ turn off hard commits, even when bulk indexing. Set
> openSearcher to false in your config (a sketch follows below). This is for two reasons:
> 1> the only time the transaction log is rolled over is when a hard commit
> happens. If you turn off commits it’ll grow to a very large size.
> 2> If, for any reason, the node restarts, it’ll replay the transaction log
> from the last hard commit point, potentially taking hours if you haven’t
> committed.
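>
> (In solrconfig.xml terms that advice corresponds to something like the
> sketch below -- the 60-second maxTime is just an illustration:)
>
>   <!-- hard commits stay enabled, but no new searcher is opened -->
>   <autoCommit>
>     <maxTime>60000</maxTime>
>     <openSearcher>false</openSearcher>
>   </autoCommit>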
>
> And you should probably open  a new searcher occasionally, even while bulk
> indexing. For Real Time Get there are some internal structures that grow in
> proportion to the docs indexed since the last searcher was opened.
>
> And for your other questions:
> <1> I believe so, try it and look at your solr log.
>
> <2> Yes. Have you looked at Mike’s video (the third one down) here:
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html?
> TieredMergePolicy is the third video. The merge policy combines like-sized
> segments. It’s wasteful to rewrite, say, a 19G segment just to add a 1G so
> having multiple segments < 20G is perfectly normal.
>
> Best,
> Erick
>
> > On Mar 10, 2019, at 10:36 PM, Wei  wrote:
> >
> > A side question, for heavy bulk indexing, what's the recommended setting
> > for auto commit? As there is no query needed during the bulk indexing
> > process, I have auto soft commit disabled. Is there any side effect if I
> > also disable auto commit?
> >
> > On Sun, Mar 10, 2019 at 10:22 PM Wei  wrote:
> >
> >> Thanks Erick.
> >>
> >> 1> TLOG replicas shouldn’t optimize on the follower. They should
> optimize
> >> on the leader then replicate the entire index to the follower.
> >>
> >> Does that mean the follower will ignore the optimize request? Or shall I
> >> send the optimize request only to one of the leaders?
> >>
> >> 2> As of Solr 7.5, optimize should not optimize to a single segment
> >> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
> >> numSegments on the optimize command.
> >>
> >> -- Is the 5G limit controlled by the maxMergedSegmentMB setting? In
> >> solrconfig.xml I used these settings:
> >>
> >> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
> >>   <int name="maxMergeAtOnceExplicit">100</int>
> >>   <int name="maxMergeAtOnce">10</int>
> >>   <int name="segmentsPerTier">10</int>
> >>   <int name="maxMergedSegmentMB">20480</int>
> >> </mergePolicyFactory>
> >>
> >> But in the end I see multiple segments much smaller than the 20GB limit.
> >> In 7.6 is it required to explicitly set the number of segments to 1? e.g
> >> shall I use
> >>
> >> /update?optimize=true&waitSearcher=false&maxSegments=1
> >>
> >> Best,
> >> Wei
> >>
> >>
> >> On Fri, Mar 8, 2019 at 12:29 PM Erick Erickson  >
> >> wrote:
> >>
> >>> This is very odd for at least two reasons:
> >>>
> >>> 1> TLOG replicas shouldn’t optimize on the follower. They should
> optimize
> >>> on the leader then replicate the entire index to the follower.
> >>>
> >>> 2> As of Solr 7.5, optimize should not optimize to a single segment
> >>> _unless_ that segment is < 5G. See LUCENE-7976. Or you explicitly set
> >>> numSegments on the optimize command.
> >>>
> >>> So if you can reliably reproduce this, it’s probably worth a JIRA…...
> >>>
> >>>> On Mar 8, 2019, at 11:21 AM, Wei  wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Recently I encountered a strange issue with optimize in Solr 7.6. The
> >>> cloud
> >>>> is created with 4 shards with 2 Tlog replicas per shard. After batch
> >>> index
> >>>> update I issue an optimize command to a randomly picked replica in the
> >>>> cloud.  After a while when I check,  all the non-leader Tlog replicas
> >>>> finished optimization to a single segment, however all the leader
> >>> replicas
> >>>> still have multiple segments.  Previously in the all NRT replica
> >>> cloud, I
> >>>> see optimization is triggered on all nodes.  Is the optimization
> process
> >>>> different with Tlog/Pull replicas?
> >>>>
> >>>> Best,
> >>>> Wei
> >>>
> >>>
>
>


Question for separate query and updates with TLOG and PULL replicas

2019-04-10 Thread Wei
Hi,

I have a question about how to completely separate queries and updates in a
cluster of mixed TLOG and PULL replicas.

solr cloud setup:  Solr-7.6.0,  10 shards,  each shard has 2 TLOG + 4 PULL
replicas.
In solrconfig.xml we set preferred replica type for queries to PULL:

  replica.type:PULL 


A load-balancer is set up in front of the solr cloud, including both TLOG
and PULL replicas.  Also we use a http client for queries.  Some
observations:

1.  In the TLOG replicas, I see about the same number of external queries
in jetty access log. It is expected as our load balancer does not
differentiate TLOG and PULL replicas.  My question is,  when the TLOG
replica receives an external query, will it forward to one of the PULL
replicas? Or will it send the shard request to PULL replicas but still
serve as the aggregate node for the query?

2.  In the TLOG replicas,  I am still seeing some internal shard requests,
but in much lower volume compared to PULL replicas.  I checked one leader
TLOG replica, the number of shard requests is 1% of that on PULL replicas
in the same shard.  With shards.preference=replica.type:PULL,  why would
the TLOG receive any internal shard request?

To completely separate queries and updates, I think that I might need to have
the load-balancer set up to include only the PULL replicas.  Is there any
other option?

Thanks,
Wei


BinaryResponseWriter fetches unnecessary fields?

2018-01-19 Thread Wei
Hi all,


We observe that solr query time increases significantly with the number of
rows requested,  even when all we retrieve for each document is just
fl=id,score.  Debugged a bit and saw that most of the increased time was
spent in BinaryResponseWriter,  converting lucene document into
SolrDocument.


Inside convertLuceneDocToSolrDoc():


https://github.com/apache/lucene-solr/blob/df874432b9a17b547acb24a01d3491839e6a6b69/solr/core/src/java/org/apache/solr/response/DocsStreamer.java#L182


   for (IndexableField f : doc.getFields())


I am a bit puzzled why we need to iterate through all the fields in the
document. Why can’t we just iterate through the requested fields in fl?
Specifically:



https://github.com/apache/lucene-solr/blob/df874432b9a17b547acb24a01d3491839e6a6b69/solr/core/src/java/org/apache/solr/response/DocsStreamer.java#L156


if we change  sdoc = convertLuceneDocToSolrDoc(doc,
rctx.getSearcher().getSchema())  to


sdoc = convertLuceneDocToSolrDoc(doc, rctx.getSearcher().getSchema(),
fnames)


and just iterate through fnames in convertLuceneDocToSolrDoc(),  there is a
significant performance boost in our case; the query time increase from
rows=128 to rows=500 is much smaller.  Am I missing something here?
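
In code terms, the idea is roughly this (a hypothetical sketch, not the
actual DocsStreamer internals; fnames here is the set of requested field
names from the fl param):

    // iterate only over the requested field names
    // instead of every stored field in the document
    SolrDocument sdoc = new SolrDocument();
    for (String fname : fnames) {   // was: for (IndexableField f : doc.getFields())
      for (IndexableField f : doc.getFields(fname)) {
        sdoc.addField(f.name(), schema.getFieldType(f.name()).toObject(f));
      }
    }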


Thanks,

Wei


Re: BinaryResponseWriter fetches unnecessary fields?

2018-01-23 Thread Wei
Thanks Chris! Is RetrieveFieldsOptimizer new functionality introduced in
7.x?  Our observation is with both 5.4 & 6.4.  I have created a jira for
the issue:

https://issues.apache.org/jira/browse/SOLR-11891

I am also wondering how enableLazyFieldLoading affects the case, but haven't
tested yet. Please let us know if you catch anything.


Thanks,
Wei


On Mon, Jan 22, 2018 at 3:15 PM, Chris Hostetter 
wrote:

>
> : Inside convertLuceneDocToSolrDoc():
> :
> :
> : https://github.com/apache/lucene-solr/blob/df874432b9a17b547acb24a01d3491839e6a6b69/solr/core/src/java/org/apache/solr/response/DocsStreamer.java#L182
> :
> :
> :for (IndexableField f : doc.getFields())
> :
> :
> : I am a bit puzzled why we need to iterate through all the fields in the
> : document. Why can’t we just iterate through the requested fields in fl?
> : Specifically:
>
> I have a hunch here -- but i haven't verified it.
>
> First of all: the specific code in question that you mention assumes it
> doesn't *need* to filter out the result of "doc.getFields()" based on the
> 'fl' because at the point in the processing where the DocsStreamer is
> looping over the result of "doc.getFields()" the "Document" object it's
> dealing with *should* only contain the specific (subset of stored) fields
> requested by the fl param -- this is handled by RetrieveFieldsOptimizer &
> SolrDocumentFetcher that the DocsStreamer builds up according to the
> results of ResultContext.getReturnFields() when asking the
> SolrIndexSearcher to fetch the doc()
>
> But i think what's happening here is that because of the documentCache,
> there are cases where the SolrIndexSearcher is not actually using
> a SolrDocumentStoredFieldVisitor to limit what's requested from the
> IndexReader, and the resulting Document contains all fields -- which is
> then compounded by code that loops over every field.
>
> At a quick glance, I'm a little fuzzy on how exactly
> enableLazyFieldLoading may/may-not be affecting things here, but either
> way I think you are correct -- we can/should make this overall stack of
> code smarter about looping over fields we know we want, vs looping over
> all fields in the doc.
>
> Can you please file a jira for this?
>
>
> -Hoss
> http://www.lucidworks.com/


facet.method=uif not working in solr cloud?

2018-01-30 Thread Wei
Hi,

I am using the following parameters for faceting, requesting solr to use the
UIF method:

&facet=on&facet.field=color&q=*:*&facet.method=uif&facet.mincount=1&debugQuery=true

It works as expected in my local standalone solr:


"facet-debug": {
  "elapse": 2,
  "sub-facet": [
    {
      "processor": "SimpleFacets",
      "elapse": 2,
      "action": "field facet",
      "maxThreads": 0,
      "sub-facet": [
        {
          "elapse": 2,
          "requestedMethod": "UIF",
          "appliedMethod": "UIF",
          "inputDocSetSize": 8191,
          "field": "color"
        }
      ]
    }
  ]
},


However when I apply the same query to solr cloud with multiple shards, the
appliedMethod is always FC instead of UIF:

{
  "processor": "SimpleFacets",
  "elapse": 18,
  "action": "field facet",
  "maxThreads": 0,
  "sub-facet": [
    {
      "elapse": 58,
      "requestedMethod": "UIF",
      "appliedMethod": "FC",
      "inputDocSetSize": 33487,
      "field": "color",
      "numBuckets": 238
    }
  ]
}

I also see that in standalone mode fieldValueCache is used with UIF
applied, but in cloud mode fieldValueCache is always empty.  Are there any
other parameters I need to set to apply UIF faceting in solr cloud?

Thanks,
Wei


Re: facet.method=uif not working in solr cloud?

2018-01-31 Thread Wei
Thanks Alessandro. Totally agree that from the logic I can't see why the
requested facet.method=uif is not accepted. I don't see anything in
solr.log either.  However I find that the uif method somehow works with the
json facet api in cloud mode,  e.g:

curl http://mysolrcloud:8983/solr/mycollection/select -d
'q=*:*&wt=json&rows=0&json.facet={color: {type: terms, field : color,
method : uif, limit:1000, mincount:1}}&debugQuery=true'

Then in the debug response I see:

"facet-trace":{

   - "processor":"FacetQueryProcessor",
   - "elapse":453,
   - "query":null,
   - "domainSize":70215,
   - "sub-facet":[
  1. {
 - "processor":"FacetFieldProcessorByArrayUIF",
 - "elapse":1,
 - "field":"color",
 - "limit":1000,
 - "numBuckets":20,
 - "domainSize":7166
  },
  2. {
 - "processor":"FacetFieldProcessorByArrayUIF",
 - "elapse":1,
 - "field":"color",
 - "limit":1000
 - "numBuckets":19,
 - "domainSize":7004
  },
  3. {
 - "processor":"FacetFieldProcessorByArrayUIF",
 - "elapse":2,
 - "field":"color",
 - "limit":1000,
 - "numBuckets":20,
 - "domainSize":7030
  },
  4. {
 - "processor":"FacetFieldProcessorByArrayUIF",
 - "elapse":80,
 - "field":"color",
 - "limit":1000,
 - "numBuckets":20,
 - "domainSize":6969
  },
  5. {
 - "processor":"FacetFieldProcessorByArrayUIF",
 - "elapse":85,
 - "field":"color",
 - "limit":1000,
 - "numBuckets":20,
 - "domainSize":6953
  },
  6. {
 - "processor":"FacetFieldProcessorByArrayUIF",
 - "elapse":85,
 - "field":"color",
 - "limit":1000,
 - "numBuckets":20,
 - "domainSize":6901
  },
  7. {
 - "processor":"FacetFieldProcessorByArrayUIF",
 - "elapse":93,
 - "field":"color",
 - "limit":1000,
 - "numBuckets":20,
 - "domainSize":6951
  },
  8. {
 - "processor":"FacetFieldProcessorByArrayUIF",
 - "elapse":104,
 - "field":"color",
 - "limit":1000,
 - "numBuckets":19,
 - "domainSize":7127
  }
   ]

A few things puzzled me here.  Looks like when using the json facet api,
SimpleFacets is not used, replaced by the FacetFieldProcessorByArrayUIF
processor. Is that the expected behavior? Also with the uif method applied,
facet latency is greatly increased.  Some shards report much bigger elapse
times (104 vs 1); I wonder what could cause the discrepancy, as my index is
evenly distributed across the shards.

Thanks,
Wei


On Wed, Jan 31, 2018 at 2:24 AM, Alessandro Benedetti 
wrote:

> I worked personally on the SimpleFacets class which does the facet method
> selection :
>
> FacetMethod appliedFacetMethod = selectFacetMethod(field,
> sf, requestedMethod, mincount,
> exists);
>
> RTimer timer = null;
> if (fdebug != null) {
>fdebug.putInfoItem("requestedMethod", requestedMethod==null?"not
> specified":requestedMethod.name());
>fdebug.putInfoItem("appliedMethod", appliedFacetMethod.name());
>fdebug.putInfoItem("inputDocSetSize", docs.size());
>fdebug.putInfoItem("field", field);
>timer = new RTimer();
> }
>
> Within the select facet method , the only code block related UIF is (
> another block can apply when facet method arrives null to the Solr Node,
> but
> that should not apply as we see the facet method in the debug):
>
> /* UIF without DocValues can't deal with mincount=0, the reason is because
>  we create the buckets based on the values present in the result
> set.
>  So we are not going to see facet values which are not in the
> result
> set */
>  if (method == FacetMethod.UIF
>  && !field.hasDocValues() && mincount == 0) {
>method = field.multiValued() ? FacetMethod.FC : FacetMethod.FCS;
>  }
>
> So is there anything in the logs?
> Because that seems to me the only point where you can change from UIF to FC
> and you clearly have mincount=1.
>
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: facet.method=uif not working in solr cloud?

2018-02-02 Thread Wei
I tried to debug a bit and see that when executing on a cloud solr server,
although I put facet.field=color&q=*:*&facet.method=uif&facet.mincount=1 in
the request url, by the time it reaches SimpleFacets the req.params have
somehow been rewritten to f.color.facet.mincount=0, no wonder the method
chosen becomes FC. So one mystery solved; but the new mystery is why
facet.mincount is overridden to 0 in the Solr request?

Cheers,
Wei

On Thu, Feb 1, 2018 at 2:01 AM, Alessandro Benedetti 
wrote:

> " Looks like when using the json facet api,
> SimpleFacets is not used, replaced by FacetFieldProcessorByArrayUIF "
>
> That is expected; I remember Yonik stressing the fact that it is a
> completely different approach to faceting ( and different components and
> classes are involved).
>
> But your first case, it may be worth an investigation.
> If you have the tools and you are used to it I would encourage you to
> reproduce the issue and remote debug it from a Solr server.
> Putting a breakpoint in the Simple Facets method you should be able to
> solve
> the mystery ( a bug maybe ? I am very curious about it. )
>
> Cheers
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: facet.method=uif not working in solr cloud?

2018-02-12 Thread Wei
Adding facet.distrib.mco=true did the trick.  Thanks Toke and Alessandro!
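
For anyone who finds this thread later, the working parameter combination
looked like this (field name as in the earlier examples):

  q=*:*&facet=on&facet.field=color&facet.method=uif&facet.mincount=1&facet.distrib.mco=true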

Cheers,
Wei

On Thu, Feb 8, 2018 at 1:23 AM, Toke Eskildsen  wrote:

> On Fri, 2018-02-02 at 17:40 -0800, Wei wrote:
> > I tried to debug a bit and see that when executing on a cloud solr
> > server, although I put
> > facet.field=color&q=*:*&facet.method=uif&facet.mincount=1 in
> > the request url, at the point it reaches SimpleFacet inside
> > req.params it somehow has been rewritten
> > to  f.color.facet.mincount=0, no wonder the
> > method chosen become FC. So one myth solved; but the new myth is why
> > the facet.mincount is override to 0 in solr req?
>
> AFAIK, it is due to an attempt of optimisation for distributed
> faceting. The relevant JIRA seems to be https://issues.apache.org/jira/
> browse/SOLR-8988
>
> Try setting facet.distrib.mco=true
>
> - Toke Eskildsen, Royal Danish Library
>
>


Re: facet.method=uif not working in solr cloud?

2018-02-14 Thread Wei
Thanks all!   It's really great learning.  A bit off topic: after I
enabled facet.method=uif in solr cloud,  the faceting performance is
actually much worse than the original fc (~1000 ms with uif vs ~200 ms
with fc). My cloud has 8 shards with 6 replicas in each shard.  I do see
that fieldValueCache is getting utilized.  Any reason uif could be so
slow?

On Tue, Feb 13, 2018 at 7:41 AM, Yonik Seeley  wrote:

> Great, thanks for tracking that down!
> It's interesting that a mincount of 0 disables uif processing in the
> first place.  IIRC, it's only the hash-based method (as opposed to
> array-based) that can't return zero counts.
>
> -Yonik
>
>
> On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
>  wrote:
> > *Update* : This has been actually already solved by Hoss.
> >
> > https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
> > Request : https://github.com/apache/lucene-solr/pull/279/files
> >
> > This should go live with 7.3
> >
> > Cheers
> >
> >
> >
> > -
> > ---
> > Alessandro Benedetti
> > Search Consultant, R&D Software Engineer, Director
> > Sease Ltd. - www.sease.io
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: facet.method=uif not working in solr cloud?

2018-02-14 Thread Wei
Thanks Yonik. If uif has a big upfront cost when it first hits solr,
then in solr cloud the same faceting request could hit different replicas in
the same shard, so that cost will be paid at least once per replica?
And if we are doing frequent auto commits, fieldValueCache will be
invalidated and uif will have to pay the upfront cost again after each commit?



On Wed, Feb 14, 2018 at 11:51 AM, Yonik Seeley  wrote:

> On Wed, Feb 14, 2018 at 2:28 PM, Wei  wrote:
> > Thanks all!   It's really great learning.  A bit off the topic, after I
> > enabled facet.method = uif in solr cloud,  the faceting performance is
> > actually much worse than the original fc( ~1000 ms with uif  vs ~200 ms
> > with fc). My cloud has 8 shards with 6 replicas in each shard.  I do see
> > that fieldValueCache is getting utilized.  Any reason uif could be so
> > slow?
>
> I haven't seen that before.  Are you sure it's not the first time
> faceting on a field?  uif has big upfront cost, but is usually faster
> once that cost has been paid.
>
>
> -Yonik
>
> > On Tue, Feb 13, 2018 at 7:41 AM, Yonik Seeley  wrote:
> >
> >> Great, thanks for tracking that down!
> >> It's interesting that a mincount of 0 disables uif processing in the
> >> first place.  IIRC, it's only the hash-based method (as opposed to
> >> array-based) that can't return zero counts.
> >>
> >> -Yonik
> >>
> >>
> >> On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
> >>  wrote:
> >> > *Update* : This has been actually already solved by Hoss.
> >> >
> >> > https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
> >> > Request : https://github.com/apache/lucene-solr/pull/279/files
> >> >
> >> > This should go live with 7.3
> >> >
> >> > Cheers
> >> >
> >> >
> >> >
> >> > -
> >> > ---
> >> > Alessandro Benedetti
> >> > Search Consultant, R&D Software Engineer, Director
> >> > Sease Ltd. - www.sease.io
> >> > --
> >> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >>
>


Different solr score between stand alone vs cloud mode solr

2018-06-07 Thread Wei
Hi,

Recently we have an observation that really puzzled us.  We have two
instances of Solr,  one in stand alone mode and one is a single-shard solr
cloud with a couple of replicas.  Both are indexed with the same documents
and have the same solr version 6.6.2.  When issuing the same query, the solr
scores from stand alone and cloud are different.  How could this happen?
With the same data, software version and query,  shouldn't the solr score be
exactly the same regardless of cloud mode or not?

Thanks,
Wei


Re: Different solr score between stand alone vs cloud mode solr

2018-06-07 Thread Wei
Thanks Erick. However our indexes on stand alone and cloud are both static
-- we indexed them from the same source xmls, optimized, and have had no
updates since. Also in cloud there is only one single shard (with
multiple replicas). I assume distributed stats don't have an effect in this
case?
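
For reference, the distributed stats option Erick mentions is enabled with
a statsCache entry in solrconfig.xml -- a minimal sketch:

  <!-- exchange exact global term stats across shards for scoring -->
  <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>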

Thanks,
Wei

On Thu, Jun 7, 2018 at 12:18 PM, Erick Erickson 
wrote:

> Short form:
>
> As docs are updated, they're marked as deleted until the segment is
> merged. This affects things like term frequency and doc frequency
> which in turn influences the score.
>
> Due to how commits happen, i.e. autocommit will hit at slightly skewed
> wall-clock time, different segments are merged on different replicas
> of the same shard. Thus the scores can be slightly different
>
> You can turn on distributed stats which will help with this:
> https://issues.apache.org/jira/browse/SOLR-1632
>
> Best,
> Erick
>
> On Thu, Jun 7, 2018 at 12:07 PM, Wei  wrote:
> > Hi,
> >
> > Recently we have an observation that really puzzled us.  We have two
> > instances of Solr,  one in stand alone mode and one is a single-shard
> solr
> > cloud with a couple of replicas.  Both are indexed with the same
> documents
> > and have same solr version 6.6.2.  When issue the same query, the solr
> > score from stand alone and cloud are different.  How could this happen?
> > With the same data, software version and query,  should solr score be
> > exactly same regardless of cloud mode or not?
> >
> > Thanks,
> > Wei
>


How to exclude certain values in multi-value field filter query

2018-06-18 Thread Wei
Hi,

I have a multi-value field,  and there is a limited set of values for the
field: A, B, C, D.
Is there a way to filter out documents that have only A or B values in the
multi-value field?

Basically I want to exclude documents that have:

A

B

A B

and get documents that have:


C

D

C D

A C

B C

A D

B D

A B C

A B D

A C D

B C D

A B C D


Thanks,

Wei


Re: How to exclude certain values in multi-value field filter query

2018-06-19 Thread Wei
Thanks Mikhail and Alessandro.
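
For later readers: the numVals field Mikhail describes can be populated at
index time by cloning the multi-value field and counting the clone -- a
sketch, assuming the multi-value field is named V:

  <updateRequestProcessorChain name="count-vals">
    <!-- copy V into numVals, then replace numVals with its value count -->
    <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">V</str>
      <str name="dest">numVals</str>
    </processor>
    <processor class="solr.CountFieldValuesUpdateProcessorFactory">
      <str name="fieldName">numVals</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>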

On Tue, Jun 19, 2018 at 2:37 AM, Mikhail Khludnev  wrote:

> you need to index num vals
> <https://lucene.apache.org/solr/7_1_0//solr-core/org/apache/solr/update/processor/CountFieldValuesUpdateProcessorFactory.html>
> in a separate field, and then *:* -(V:(A AND B) AND numVals:2) -(V:(A OR
> B) AND numVals:1)
>
>
> On Tue, Jun 19, 2018 at 9:20 AM Wei  wrote:
>
> > Hi,
> >
> > I have a multi-value field,  and there is a limited set of values for the
> > field: A, B, C, D.
> > Is there a way to filter out documents that has only A or B values in the
> > multi-value field?
> >
> > Basically I want to  exclude document that has:
> >
> > A
> >
> > B
> >
> > A B
> >
> > and get documents that has:
> >
> >
> > C
> >
> > D
> >
> > C D
> >
> > A C
> >
> > B C
> >
> > A D
> >
> > B D
> >
> > A B C
> >
> > A B D
> >
> > A C D
> >
> > B C D
> >
> > A B C D
> >
> >
> > Thanks,
> >
> > Wei
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


solr filter query on text field

2018-07-11 Thread Wei
Hi,

I am running a filter query on a field of text_general type and see
completely different results for the following queries:

   fq= my_text_field:"Jurassic park the movie"   returns 0
result

   fq= my_text_field:(Jurassic park the movie)   returns 20
result

   fq= my_text_field:Jurassic park the movie  returns
thousands of results


Which one is the correct syntax? I am confused why the first query doesn't
have any match at all.  I also thought 2 and 3 are the same, but turns out
quite different.


Thanks,
Wei


Re: solr filter query on text field

2018-07-11 Thread Wei
Thanks Erick and Andrea!  If my default operator is OR,  fq=
my_text_field:(Jurassic park the movie)  is equivalent to
my_text_field:(Jurassic OR park OR the OR movie)? That makes sense.

On Wed, Jul 11, 2018 at 9:06 AM, Andrea Gazzarini 
wrote:

> The syntax is valid in all those three examples, the right one depends on
> what you need.
>
> The first query executes a proximity search (you can think to a phrase
> search, for simplicity) so it returns no result because probably you don't
> have any matching docs with that whole literal.
>
> The second is querying the my_text_field for all terms which compose the
> value between parentheses. You can think of a query where each term is an
> optional clause, something like mytextfield:jurassic OR mytextfield:park...
> (it's not exactly an OR but this could give you the idea)
>
> The third example is not doing what you think. My_text_field is used only
> with the first term (Jurassic) while the others are using the default
> field. Something like mytextfield:jurassic OR defaultfield:park OR
> defaultfield:the... That's the reason you have so many results (I guess
> the default field is a catch-all field)
>
> Sorry for typos I'm using my mobile
>
> Andrea
>
> Il mer 11 lug 2018, 17:54 Wei  ha scritto:
>
> > Hi,
> >
> > I am running filter query on a field of text_general type and see
> > completely different results for the following queries:
> >
> >fq= my_text_field:"Jurassic park the movie"   returns 0
> > result
> >
> >fq= my_text_field:(Jurassic park the movie)   returns 20
> > result
> >
> >fq= my_text_field:Jurassic park the movie  returns
> > thousands of results
> >
> >
> > Which one is the correct syntax? I am confused why the first query
> doesn't
> > have any match at all.  I also thought 2 and 3 are the same, but turns
> out
> > quite different.
> >
> >
> > Thanks,
> > Wei
> >
>


Re: solr filter query on text field

2018-07-11 Thread Wei
btw, is there any difference if the fq field is a string field vs a text
field?

On Wed, Jul 11, 2018 at 11:59 AM, Wei  wrote:

> Thanks Erick and Andrea!  If my default operator is OR,  fq=
> my_text_field:(Jurassic park the movie)  is equivalent to 
> my_text_field:(Jurassic
> OR park OR the OR movie)? That make sense.
>
> On Wed, Jul 11, 2018 at 9:06 AM, Andrea Gazzarini 
> wrote:
>
>> The syntax is valid in all those three examples, the right one depends on
>> what you need.
>>
>> The first query executes a proximity search (you can think to a phrase
>> search, for simplicity) so it returns no result because probably you don't
>> have any matching docs with that whole literal.
>>
>> The second is querying the my_text_field for all terms which compose the
>> value between parenthesis. You can think to a query where each term is an
>> optional clause, something like mytextfield:jurassic OR
>> mytextfield:park...
>> (it's not exactly an OR but this could give you the idea)
>>
>> The third example is not doing what you think. My_text_field is used only
>> with the first term (Jurassic) while the others are using the default
>> field. Something like mytextfield:jurassic OR defaultfield:park OR
>> defaultfield:the That's the reason  you have so many results (I guess
>> the default field is a catch-all field)
>>
>> Sorry for typos I'm using my mobile
>>
>> Andrea
>>
>> Il mer 11 lug 2018, 17:54 Wei  ha scritto:
>>
>> > Hi,
>> >
>> > I am running filter query on a field of text_general type and see
>> > completely different results for the following queries:
>> >
>> >fq= my_text_field:"Jurassic park the movie"   returns 0
>> > result
>> >
>> >fq= my_text_field:(Jurassic park the movie)   returns 20
>> > result
>> >
>> >fq= my_text_field:Jurassic park the movie  returns
>> > thousands of results
>> >
>> >
>> > Which one is the correct syntax? I am confused why the first query
>> doesn't
>> > have any match at all.  I also thought 2 and 3 are the same, but turns
>> out
>> > quite different.
>> >
>> >
>> > Thanks,
>> > Wei
>> >
>>
>
>


Solr timeAllowed metric

2018-08-03 Thread Wei
Hi,

We tried to use solr's timeAllowed parameter to restrict the time spent on
expensive queries.  But as described at

https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-ThetimeAllowedParameter

"This value is only checked at the time of Query Expansion and Document
collection".  Does that mean Solr will not abort the request if
timeAllowed is exceeded during the scoring process? For which components
(query, facet,  stats, debug etc.) is this parameter effective?

Thanks,
Wei


Re: Solr timeAllowed metric

2018-08-06 Thread Wei
Thanks Mikhail! Is traditional faceting subject to timeAllowed?

On Mon, Aug 6, 2018 at 3:46 AM, Mikhail Khludnev  wrote:

> One note: enum facets might be stopped by timeAllowed.
>
> On Mon, Aug 6, 2018 at 1:45 PM Mikhail Khludnev  wrote:
>
> > Hello, Wei.
> >
> > "Document collection" is done along side with "scoring process". So,
> Solr
> > will abort the request if
> > timeAllowed is exceeded during the scoring process.
> > Query, MLT, grouping are subject of timeAllowed constrains, but facet,
> > json.facet https://issues.apache.org/jira/browse/SOLR-12478, stats,
> debug
> > are not.
> >
> > On Fri, Aug 3, 2018 at 11:34 PM Wei  wrote:
> >
> >> Hi,
> >>
> >> We tried to use solr's timeAllowed parameter to restrict the time spend
> on
> >> expensive queries.  But as described at
> >>
> >>
> >> https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html#CommonQueryParameters-ThetimeAllowedParameter
> >>
> >> " This value is only checked at the time of Query Expansion and Document
> >> collection" .  Does that mean Solr will not abort the request if
> >> timeAllowed is exceeded during the scoring process? What are the
> >> components
> >> (query, facet,  stats, debug etc) this metric is effectively used?
> >>
> >> Thanks,
> >> Wei
> >>
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Unbalanced shard requests

2020-04-27 Thread Wei
Hi everyone,

I have a strange issue after upgrading from 7.6.0 to 8.4.1. My cloud has 6
shards with 10 TLOG replicas per shard.  After the upgrade I noticed that one
of the replicas in each shard is handling most of the distributed shard
requests, so 6 nodes are heavily loaded while other nodes are idle. There
is no change in the shard handler configuration:



<shardHandlerFactory class="HttpShardHandlerFactory">
  <int name="socketTimeout">30000</int>
  <int name="connTimeout">30000</int>
  <int name="maxConnectionsPerHost">500</int>
</shardHandlerFactory>


What could cause the unbalanced internal distributed requests?


Thanks in advance.



Wei


Re: Unbalanced shard requests

2020-04-27 Thread Wei
Hi Eric,

I am measuring the number of shard requests, and it's for queries only, no
indexing requests.  I have an external load balancer and see each node
receive about an equal number of external queries. However for the
internal shard queries,  the distribution is uneven: 6 nodes (one in
each shard,  some of them leaders and some non-leaders) get about
80% of the shard requests, while the other 54 nodes get about 20% of the
shard requests.   I checked a few other parameters set:

-Dsolr.disable.shardsWhitelist=true
shards.preference=replica.location:local,replica.type:TLOG

Nothing seems to explain the strange behavior.  Any suggestions on how to
debug this?

-Wei


On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson 
wrote:

> Wei:
>
> How are you measuring utilization here? The number of incoming requests or
> CPU?
>
> The leader for each shard are certainly handling all of the indexing
> requests since they’re TLOG replicas, so that’s one thing that might
> skewing your measurements.
>
> Best,
> Erick
>
> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
> >
> > Hi everyone,
> >
> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
> one
> > of the replicas in each shard is handling most of the distributed shard
> > requests, so 6 nodes are heavily loaded while other nodes are idle. There
> > is no change in shard handler configuration:
> >
> > <shardHandlerFactory class="HttpShardHandlerFactory">
> >   <int name="socketTimeout">30000</int>
> >   <int name="connTimeout">30000</int>
> >   <int name="maxConnectionsPerHost">500</int>
> > </shardHandlerFactory>
> >
> >
> > What could cause the unbalanced internal distributed request?
> >
> >
> > Thanks in advance.
> >
> >
> >
> > Wei
>
>


solr payloads performance

2020-05-08 Thread Wei
Hi everyone,

Have a question regarding a typical e-commerce scenario: each item may have
a different price in each store. Suppose there are 10 million items and
1000 stores.

Option 1:  use solr payloads; each document has
 store_prices_payload: store1|price1 store2|price2 ...
store1000|price1000

Option 2: use dynamic fields and have 1000 fields in each document, i.e.
   field1:  store1_price:  price1
   field2:  store2_price:  price2
   ...
   field1000:  store1000_price: price1000

Option 2 doesn't look elegant,  but is there any performance benchmark on
solr payloads? In terms of filtering, sorting or faceting, how would query
performance compare between the two?
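
For context, option 1 would build on the stock delimited-payloads machinery
-- a sketch of the field type, plus how a per-store price could be read at
query time (the store42 term is just an illustration):

  <!-- tokens like "store42|9.99" carry a float payload per store -->
  <fieldType name="delimited_payloads_float" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
    </analyzer>
  </fieldType>

  e.g.  sort=payload(store_prices_payload,store42) desc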

Thanks,
Wei


Re: Unbalanced shard requests

2020-05-08 Thread Wei
Update:  after I removed the shards.preference parameter from
solrconfig.xml,  the issue is gone and internal shard requests are now
balanced. The same parameter works fine with solr 7.6.  Still not sure of
the root cause, but I observed a strange coincidence: the nodes that are
most frequently picked for shard requests are the first node in each shard
returned from the CLUSTERSTATUS api.  Seems something is wrong with the
shuffling of equally preferred nodes when shards.preference is set.  Will
report back if I find more.

On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:

> Hi Eric,
>
> I am measuring the number of shard requests, and it's for query only, no
> indexing requests.  I have an external load balancer and see each node
> received about the equal number of external queries. However for the
> internal shard queries,  the distribution is uneven:6 nodes (one in
> each shard,  some of them are leaders and some are non-leaders ) gets about
> 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> requests.   I checked a few other parameters set:
>
> -Dsolr.disable.shardsWhitelist=true
> shards.preference=replica.location:local,replica.type:TLOG
>
> Nothing seems to cause the strange behavior.  Any suggestions how to
> debug this?
>
> -Wei
>
>
> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson 
> wrote:
>
>> Wei:
>>
>> How are you measuring utilization here? The number of incoming requests
>> or CPU?
>>
>> The leader for each shard are certainly handling all of the indexing
>> requests since they’re TLOG replicas, so that’s one thing that might
>> skewing your measurements.
>>
>> Best,
>> Erick
>>
>> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
>> >
>> > Hi everyone,
>> >
>> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
>> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
>> one
>> > of the replicas in each shard is handling most of the distributed shard
>> > requests, so 6 nodes are heavily loaded while other nodes are idle.
>> There
>> > is no change in shard handler configuration:
>> >
>> > <shardHandlerFactory class="HttpShardHandlerFactory">
>> >   <int name="socketTimeout">30000</int>
>> >   <int name="connTimeout">30000</int>
>> >   <int name="maxConnectionsPerHost">500</int>
>> > </shardHandlerFactory>
>> >
>> >
>> > What could cause the unbalanced internal distributed request?
>> >
>> >
>> > Thanks in advance.
>> >
>> >
>> >
>> > Wei
>>
>>


Re: Unbalanced shard requests

2020-05-11 Thread Wei
Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other type
of replicas, and each Tlog replica is an individual solr instance on its
own physical machine.  In the jira you mentioned 'when "last place matches"
== "first place matches" – e.g. when shards.preference specified matches
*all* available replicas'.   My setting is
shards.preference=replica.location:local,replica.type:TLOG,
I also tried just shards.preference=replica.location:local and it still has
the issue. Can you explain a bit more?

On Mon, May 11, 2020 at 12:26 PM Michael Gibney 
wrote:

> FYI: https://issues.apache.org/jira/browse/SOLR-14471
> Wei, assuming you have only TLOG replicas, your "last place" matches
> (to which the random fallback ordering would not be applied -- see
> above issue) would be the same as the "first place" matches selected
> for executing distributed requests.
>
>
> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
>  wrote:
> >
> > Wei, probably no need to answer my earlier questions; I think I see
> > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > Will file an issue and submit a patch shortly.
> > Michael
> >
> > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> >  wrote:
> > >
> > > Hi Wei,
> > >
> > > In considering this problem, I'm stumbling a bit on terminology
> > > (particularly, where you mention "nodes", I think you're referring to
> > > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > > server instances) do you have, and what is the replica placement like
> > > across those nodes? What, if any, non-TLOG replicas do you have per
> > > shard (not that it's necessarily relevant, but just to get a complete
> > > picture of the situation)?
> > >
> > > If you're able without too much trouble, can you determine what the
> > > behavior is like on Solr 8.3? (there were different changes introduced
> > > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > > behavior you're observing manifests on 8.3 would help narrow down
> > > where to look for an explanation).
> > >
> > > Michael
> > >
> > > On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
> > > >
> > > > Update:  after I remove the shards.preference parameter from
> > > > solrconfig.xml,  issue is gone and internal shard requests are now
> > > > balanced. The same parameter works fine with solr 7.6.  Still not
> sure of
> > > > the root cause, but I observed a strange coincidence: the nodes that
> are
> > > > most frequently picked for shard requests are the first node in each
> shard
> > > > returned from the CLUSTERSTATUS api.  Seems something wrong with
> shuffling
> > > > equally compared nodes when shards.preference is set.  Will report
> back if
> > > > I find more.
> > > >
> > > > On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
> > > >
> > > > > Hi Eric,
> > > > >
> > > > > I am measuring the number of shard requests, and it's for query
> only, no
> > > > > indexing requests.  I have an external load balancer and see each
> node
> > > > > received about the equal number of external queries. However for
> the
> > > > > internal shard queries,  the distribution is uneven:6 nodes
> (one in
> > > > > each shard,  some of them are leaders and some are non-leaders )
> gets about
> > > > > 80% of the shard requests, the other 54 nodes gets about 20% of
> the shard
> > > > > requests.   I checked a few other parameters set:
> > > > >
> > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > >
> > > > > Nothing seems to cause the strange behavior.  Any suggestions how
> to
> > > > > debug this?
> > > > >
> > > > > -Wei
> > > > >
> > > > >
> > > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
> erickerick...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Wei:
> > > > >>
> > > > >> How are you measuring utilization here? The number of incoming
> requests
> > > > >> or CPU?
> > > >

Re: Unbalanced shard requests

2020-05-19 Thread Wei
Hi Phill,

What is the RAM config you are referring to, JVM size? How is that related
to the load balancing, if each node has the same configuration?

Thanks,
Wei

On Mon, May 18, 2020 at 3:07 PM Phill Campbell
 wrote:

> In my previous report I was configured to use as much RAM as possible.
> With that configuration it seemed it was not load balancing.
> So, I reconfigured and redeployed to use 1/4 the RAM. What a difference
> for the better!
>
> 10.156.112.50   load average: 13.52, 10.56, 6.46
> 10.156.116.34   load average: 11.23, 12.35, 9.63
> 10.156.122.13   load average: 10.29, 12.40, 9.69
>
> Very nice.
> My tool that tests records RPS. In the “bad” configuration it was less
> than 1 RPS.
> NOW it is showing 21 RPS.
>
>
> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> {
>   "responseHeader":{
> "status":0,
> "QTime":161},
>   "metrics":{
> "solr.core.BTS.shard1.replica_n2":{
>   "QUERY./select.requestTimes":{
> "count":5723,
> "meanRate":6.8163888639859085,
> "1minRate":11.557013215119536,
> "5minRate":8.760356217628159,
> "15minRate":4.707624230995833,
> "min_ms":0.131545,
> "max_ms":388.710848,
> "mean_ms":30.300492048215947,
> "median_ms":6.336654,
> "stddev_ms":51.527164088667035,
> "p75_ms":35.427943,
> "p95_ms":140.025957,
> "p99_ms":230.533099,
> "p999_ms":388.710848
>
>
>
> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> {
>   "responseHeader":{
> "status":0,
> "QTime":11},
>   "metrics":{
> "solr.core.BTS.shard2.replica_n8":{
>   "QUERY./select.requestTimes":{
> "count":6469,
> "meanRate":7.502581801189549,
> "1minRate":12.211423085368564,
> "5minRate":9.445681397767322,
> "15minRate":5.216209798637846,
> "min_ms":0.154691,
> "max_ms":701.657394,
> "mean_ms":34.2734699171445,
> "median_ms":5.640378,
> "stddev_ms":62.27649205954566,
> "p75_ms":39.016371,
> "p95_ms":156.997982,
> "p99_ms":288.883028,
> "p999_ms":538.368031
>
>
> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> {
>   "responseHeader":{
> "status":0,
> "QTime":67},
>   "metrics":{
> "solr.core.BTS.shard3.replica_n16":{
>   "QUERY./select.requestTimes":{
> "count":7109,
> "meanRate":7.787524673806184,
> "1minRate":11.88519763582083,
> "5minRate":9.893315557386755,
> "15minRate":5.620178363676527,
> "min_ms":0.150887,
> "max_ms":472.826462,
> "mean_ms":32.184282366621204,
> "median_ms":6.977733,
> "stddev_ms":55.729908615189196,
> "p75_ms":36.655011,
> "p95_ms":151.12627,
> "p99_ms":251.440162,
> "p999_ms":472.826462
>
>
> Compare that to the previous report and you can see the improvement.
> So, note to self: figure out the sweet spot for RAM usage. Use too much
> and strange behavior appears. While using too much, all the load focused
> on one box and query times slowed.
> I did not see any OOM errors during any of this.
>
> Regards
>
>
>
> > On May 18, 2020, at 3:23 PM, Phill Campbell
>  wrote:
> >
> > I have been testing 8.5.2 and it looks like the load has moved but is
> still on one machine.
> >
> > Setup:
> > 3 physical machines.
> > Each machine hosts 8 instances of Solr.
> > Each instance of Solr hosts one replica.
> >
> > Another way to say it:
> > Number of shards

Re: Unbalanced shard requests

2020-05-22 Thread Wei
Hi Michael,

I also verified the patch in SOLR-14471 with 8.4.1 and it fixed the issue
with shards.preference=replica.location:local,replica.type:TLOG in my
setting.  Thanks!

Wei

On Thu, May 21, 2020 at 12:09 PM Phill Campbell
 wrote:

> Yes, JVM heap settings.
>
> > On May 19, 2020, at 10:59 AM, Wei  wrote:
> >
> > Hi Phill,
> >
> > What is the RAM config you are referring to, JVM size? How is that
> related
> > to the load balancing, if each node has the same configuration?
> >
> > Thanks,
> > Wei
> >
> > On Mon, May 18, 2020 at 3:07 PM Phill Campbell
> >  wrote:
> >
> >> In my previous report I was configured to use as much RAM as possible.
> >> With that configuration it seemed it was not load balancing.
> >> So, I reconfigured and redeployed to use 1/4 the RAM. What a difference
> >> for the better!
> >>
> >> 10.156.112.50   load average: 13.52, 10.56, 6.46
> >> 10.156.116.34   load average: 11.23, 12.35, 9.63
> >> 10.156.122.13   load average: 10.29, 12.40, 9.69
> >>
> >> Very nice.
> >> My tool that tests records RPS. In the “bad” configuration it was less
> >> than 1 RPS.
> >> NOW it is showing 21 RPS.
> >>
> >>
> >>
> >> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> {
> >>  "responseHeader":{
> >>"status":0,
> >>"QTime":161},
> >>  "metrics":{
> >>"solr.core.BTS.shard1.replica_n2":{
> >>  "QUERY./select.requestTimes":{
> >>"count":5723,
> >>"meanRate":6.8163888639859085,
> >>"1minRate":11.557013215119536,
> >>"5minRate":8.760356217628159,
> >>"15minRate":4.707624230995833,
> >>"min_ms":0.131545,
> >>"max_ms":388.710848,
> >>"mean_ms":30.300492048215947,
> >>"median_ms":6.336654,
> >>"stddev_ms":51.527164088667035,
> >>"p75_ms":35.427943,
> >>"p95_ms":140.025957,
> >>"p99_ms":230.533099,
> >>"p999_ms":388.710848
> >>
> >>
> >>
> >>
> >> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> {
> >>  "responseHeader":{
> >>"status":0,
> >>"QTime":11},
> >>  "metrics":{
> >>"solr.core.BTS.shard2.replica_n8":{
> >>  "QUERY./select.requestTimes":{
> >>"count":6469,
> >>"meanRate":7.502581801189549,
> >>"1minRate":12.211423085368564,
> >>"5minRate":9.445681397767322,
> >>"15minRate":5.216209798637846,
> >>"min_ms":0.154691,
> >>"max_ms":701.657394,
> >>"mean_ms":34.2734699171445,
> >>"median_ms":5.640378,
> >>"stddev_ms":62.27649205954566,
> >>"p75_ms":39.016371,
> >>"p95_ms":156.997982,
> >>"p99_ms":288.883028,
> >>"p999_ms":538.368031
> >>
> >>
> >>
> >> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
> >> {
> >>  "responseHeader":{
> >>"status":0,
> >>"QTime":67},
> >>  "metrics":{
> >>"solr.core.BTS.shard3.replica_n16":{
> >>  "QUERY./select.requestTimes":{
> >>"count":7109,
> >>"meanRate":7.787524673806184,
> >>"1minRate":11.88519763582083,
> >>"5minRate":9.893315557386755,
> >>"15minRate":5.620178363676527,
> >

How to disable cache for facet.query?

2020-08-08 Thread Wei
Hi,

I am trying to disable the filter cache for some filter queries, as they
contain unique ids and cause cache evictions. By adding {!cache=false} the
fq is no longer stored in the filter cache; however, I have similar
conditions in facet.query, and using facet.query={!cache=false}(color:red
AND id:XXX) does not work.  Is it possible to stop solr from putting
facet.query into the filter cache?

Thanks,
Wei


solr performance with >1 NUMAs

2020-09-23 Thread Wei
Hi,

Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
noticed that query latency almost doubled compared to deployment on single
NUMA machines. Not sure what's causing the huge difference. Is there any
tuning to boost the performance on multiple NUMA machines? Any pointer is
appreciated.

Best,
Wei


Re: solr performance with >1 NUMAs

2020-09-25 Thread Wei
Thanks Dominique. I'll start with the -XX:+UseNUMA option.
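
For the record, in our deployment that means appending the flag to the GC
tuning options in solr.in.sh -- a sketch:

  # enable NUMA-aware memory allocation in the JVM
  GC_TUNE="$GC_TUNE -XX:+UseNUMA"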

Best,
Wei

On Fri, Sep 25, 2020 at 7:04 AM Dominique Bejean 
wrote:

> Hi,
>
> This would be a Java VM option, not something Solr itself can know about.
> Take a look at this article and its comments. Maybe it will help.
>
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?showComment=1347033706559#c229885263664926125
>
> Regards
>
> Dominique
>
>
>
> Le jeu. 24 sept. 2020 à 03:42, Wei  a écrit :
>
> > Hi,
> >
> > Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
> > noticed that query latency almost doubled compared to deployment on
> single
> > NUMA machines. Not sure what's causing the huge difference. Is there any
> > tuning to boost the performance on multiple NUMA machines? Any pointer is
> > appreciated.
> >
> > Best,
> > Wei
> >
>


Re: solr performance with >1 NUMAs

2020-09-26 Thread Wei
Thanks Shawn! Currently we are still using the CMS collector for solr with
Java 8. When last evaluated with Solr 7, CMS performed better than G1 for
our case. When using G1, is it better to upgrade from Java 8 to Java 11?
From https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
it seems Java 14 is not officially supported for Solr 8.

Best,
Wei


On Fri, Sep 25, 2020 at 5:50 PM Shawn Heisey  wrote:

> On 9/23/2020 7:42 PM, Wei wrote:
> > Recently we deployed solr 8.4.1 on a batch of new servers with 2 NUMAs. I
> > noticed that query latency almost doubled compared to deployment on
> single
> > NUMA machines. Not sure what's causing the huge difference. Is there any
> > tuning to boost the performance on multiple NUMA machines? Any pointer is
> > appreciated.
>
> If you're running with standard options, Solr 8.4.1 will start using the
> G1 garbage collector.
>
> As of Java 14, G1 has gained the ability to use the -XX:+UseNUMA option,
> which makes better decisions about memory allocations and multiple
> NUMAs.  If you're running a new enough Java, it would probably be
> beneficial to add this to the garbage collector options.  Solr itself is
> unaware of things like NUMA -- Java must handle that.
>
> https://openjdk.java.net/jeps/345
>
> Thanks,
> Shawn
>


Re: What does current mean?

2020-09-26 Thread Wei
My understanding is that current means whether there is data pending to be
committed.

Best,
Wei

On Sat, Sep 26, 2020 at 5:09 PM Kayak28  wrote:

> Hello, Solr community:
>
>
>
> I would like to ask a question about the current icon on the core-overview
> under statistics.
>
> I thought previously that the current tag tells users whether it is
> searchable or not (committed or not) because if I send a commit request,
> it changes from an NG-ish icon to an OK-ish icon.
>
> If anyone knows the meaning of the icon, I would like to hear about it.
>
> --
> Sincerely,
> Kaya
> github: https://github.com/28kayak


Re: solr performance with >1 NUMAs

2020-09-28 Thread Wei
Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do you
see any backward compatibility issues for Solr 8 with Java 11? Can we run
Solr 8 built with JDK 8 on a Java 11 JRE, or do we need to rebuild solr with
the Java 11 JDK?

Best,
Wei

On Sat, Sep 26, 2020 at 6:44 PM Shawn Heisey  wrote:

> On 9/26/2020 1:39 PM, Wei wrote:
> > Thanks Shawn! Currently we are still using the CMS collector for solr
> with
> > Java 8. When last evaluated with Solr 7, CMS performs better than G1 for
> > our case. When using G1, is it better to upgrade from Java 8 to Java 11?
> >  From
> https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html,
> > seems Java 14 is not officially supported for Solr 8.
>
> It has been a while since I was working with Solr every day, and when I
> was, Java 11 did not yet exist.  I have no idea whether Java 11 improves
> things beyond Java 8.  That said ... all software evolves and usually
> improves as time goes by.  It is likely that the newer version has SOME
> benefit.
>
> Regarding whether or not Java 14 is supported:  There are automated
> tests where all the important code branches are run with all major
> versions of Java, including pre-release versions, and those tests do
> include various garbage collectors.  Somebody notices when a combination
> doesn't work, and big problems with newer Java versions are something
> that gets discussed on our mailing lists.
>
> Java 14 has been out for a while, with no big problems being discussed
> so far.  So it is likely that it works with Solr.  Can I say for sure?
> No.  I haven't tried it myself.
>
> I don't have any hardware available where there is more than one NUMA,
> or I would look deeper into this myself.  It would be interesting to
> find out whether the -XX:+UseNUMA option makes a big difference in
> performance.
>
> Thanks,
> Shawn
>


Re: solr performance with >1 NUMAs

2020-10-22 Thread Wei
Hi Shawn,

I'm circling back with some new findings on our 2 NUMA issue.  After a
few iterations, we do see improvement with the UseNUMA flag and other JVM
setting changes. Here are the current settings, with Java 11:

-XX:+UseNUMA

-XX:+UseG1GC

-XX:+AlwaysPreTouch

-XX:+UseTLAB

-XX:G1MaxNewSizePercent=20

-XX:MaxGCPauseMillis=150

-XX:+DisableExplicitGC

-XX:+DoEscapeAnalysis

-XX:+ParallelRefProcEnabled

-XX:+UnlockDiagnosticVMOptions

-XX:+UnlockExperimentalVMOptions


Compared to the previous Java 8 + CMS setup on our 2-NUMA servers, P99
latency has improved by over 20%.
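
(For reference, a sketch of how these flags can be fed to the stock bin/solr
start script through the GC_TUNE variable in solr.in.sh; an untested example,
with the Unlock* flags listed first because experimental options must be
unlocked before they are set:)

    # solr.in.sh
    GC_TUNE="-XX:+UnlockDiagnosticVMOptions \
      -XX:+UnlockExperimentalVMOptions \
      -XX:+UseNUMA \
      -XX:+UseG1GC \
      -XX:+AlwaysPreTouch \
      -XX:+UseTLAB \
      -XX:G1MaxNewSizePercent=20 \
      -XX:MaxGCPauseMillis=150 \
      -XX:+DisableExplicitGC \
      -XX:+DoEscapeAnalysis \
      -XX:+ParallelRefProcEnabled"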


Thanks,

Wei




On Mon, Sep 28, 2020 at 4:02 PM Shawn Heisey  wrote:

> On 9/28/2020 12:17 PM, Wei wrote:
> > Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do
> you
> > see any backward compatibility issue for Solr 8 with Java 11? Can we run
> > Solr 8 built with JDK 8 in Java 11 JRE, or need to rebuild solr with Java
> > 11 JDK?
>
> I do not know of any problems running the binary release of Solr 8
> (which is most likely built with the Java 8 JDK) with a newer release
> like Java 11 or higher.
>
> I think Sun was really burned by such problems cropping up in the days
> of Java 5 and 6, and their developers have worked really hard to make
> sure that never happens again.
>
> If you're running Java 11, you will need to pick a different garbage
> collector if you expect the NUMA flag to function.  The most recent
> releases of Solr are defaulting to G1GC, which as previously mentioned,
> did not gain NUMA optimizations until Java 14.
>
> It is not clear to me whether the NUMA optimizations will work with any
> collector other than Parallel until Java 14.  You would need to check
> Java documentation carefully or ask someone involved with development of
> Java.
>
> If you do see an improvement using the NUMA flag with Java 11, please
> let us know exactly what options Solr was started with.
>
> Thanks,
> Shawn
>


docValues usage

2020-11-03 Thread Wei
Hi,

I have a couple of primitive single-value numeric type fields; their
values are used in boosting functions, but not in sort/facet or in the
returned response. Should I use docValues for them in the schema? I can
think of the following options:

 1)   indexed=true,  stored=true, docValues=false
 2)   indexed=true, stored=false, docValues=true
 3)   indexed=false,  stored=false,  docValues=true

What would be the performance implications for these options?

Best,
Wei
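
(For concreteness, a hedged sketch of the three options as schema.xml field
definitions, using a hypothetical field name and the pfloat type; not from
the original post:)

    <field name="rankScore" type="pfloat" indexed="true"  stored="true"  docValues="false"/>
    <field name="rankScore" type="pfloat" indexed="true"  stored="false" docValues="true"/>
    <field name="rankScore" type="pfloat" indexed="false" stored="false" docValues="true"/>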


Re: docValues usage

2020-11-04 Thread Wei
Thanks Erick. Since indexed is not necessary, and docValues is more efficient
than stored fields for function queries, we shall go with the following:

  3) indexed=false,  stored=false,  docValues=true.

Is my understanding correct?

Best,
Wei

On Wed, Nov 4, 2020 at 5:24 AM Erick Erickson 
wrote:

> You don’t need to index the field for function queries, see:
> https://lucene.apache.org/solr/guide/8_6/docvalues.html.
>
> Function queries, as opposed to sorting, faceting and grouping are
> evaluated at search time where the
> search process is already parked on the document anyway, so answering the
> question “for doc X, what
> is the value of field Y” to compute the score. DocValues are still more
> efficient I think, although I
> haven’t measured explicitly...
>
> For sorting, faceting and grouping, it’s a much different story. Take
> sorting. You have to ask
> “for field Y, what’s the value in docX and docZ?”. Say you’re parked on
> docX. Doc Z is long gone
> and getting the value for field Y much more expensive.
>
> Also, docValues will not increase memory requirements _unless used_.
> Otherwise they’ll
> just sit there on disk. They will certainly increase disk space whether
> used or not.
>
> And _not_ using docValues when you facet, group or sort will also
> _certainly_ increase
> your heap requirements since the docValues structure must be built on the
> heap rather
> than be in MMapDirectory space.
>
> Best,
> Erick
>
>
> > On Nov 4, 2020, at 5:32 AM, uyilmaz  wrote:
> >
> > Hi,
> >
> > I'm by no means expert on this so if anyone sees a mistake please
> correct me.
> >
> > I think you need to index this field, since boost functions are added to
> the query as optional clauses (
> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter).
> It's like boosting a regular field by putting ^2 next to it in a query.
> Storing or enabling docValues will unnecesarily consume space/memory.
> >
> > On Tue, 3 Nov 2020 16:10:50 -0800
> > Wei  wrote:
> >
> >> Hi,
> >>
> >> I have a couple of primitive single value numeric type fields,  their
> >> values are used in boosting functions, but not used in sort/facet. or in
> >> returned response.   Should I use docValues for them in the schema?  I
> can
> >> think of the following options:
> >>
> >> 1)   indexed=true,  stored=true, docValues=false
> >> 2)   indexed=true, stored=false, docValues=true
> >> 3)   indexed=false,  stored=false,  docValues=true
> >>
> >> What would be the performance implications for these options?
> >>
> >> Best,
> >> Wei
> >
> >
> > --
> > uyilmaz 
>
>


Re: docValues usage

2020-11-04 Thread Wei
And in the case of both stored=true and docValues=true, will Solr 8.x
choose the optimal approach by itself?

On Wed, Nov 4, 2020 at 9:15 AM Wei  wrote:

> Thanks Erick. As indexed is not necessary,  and docValues is more
> efficient than stored fields for function queries, so  we shall go with the
> following:
>
>   3) indexed=false,  stored=false,  docValues=true.
>
> Is my understanding correct?
>
> Best,
> Wei
>
> On Wed, Nov 4, 2020 at 5:24 AM Erick Erickson 
> wrote:
>
>> You don’t need to index the field for function queries, see:
>> https://lucene.apache.org/solr/guide/8_6/docvalues.html.
>>
>> Function queries, as opposed to sorting, faceting and grouping are
>> evaluated at search time where the
>> search process is already parked on the document anyway, so answering the
>> question “for doc X, what
>> is the value of field Y” to compute the score. DocValues are still more
>> efficient I think, although I
>> haven’t measured explicitly...
>>
>> For sorting, faceting and grouping, it’s a much different story. Take
>> sorting. You have to ask
>> “for field Y, what’s the value in docX and docZ?”. Say you’re parked on
>> docX. Doc Z is long gone
>> and getting the value for field Y much more expensive.
>>
>> Also, docValues will not increase memory requirements _unless used_.
>> Otherwise they’ll
>> just sit there on disk. They will certainly increase disk space whether
>> used or not.
>>
>> And _not_ using docValues when you facet, group or sort will also
>> _certainly_ increase
>> your heap requirements since the docValues structure must be built on the
>> heap rather
>> than be in MMapDirectory space.
>>
>> Best,
>> Erick
>>
>>
>> > On Nov 4, 2020, at 5:32 AM, uyilmaz 
>> wrote:
>> >
>> > Hi,
>> >
>> > I'm by no means expert on this so if anyone sees a mistake please
>> correct me.
>> >
>> > I think you need to index this field, since boost functions are added
>> to the query as optional clauses (
>> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter).
>> It's like boosting a regular field by putting ^2 next to it in a query.
>> Storing or enabling docValues will unnecesarily consume space/memory.
>> >
>> > On Tue, 3 Nov 2020 16:10:50 -0800
>> > Wei  wrote:
>> >
>> >> Hi,
>> >>
>> >> I have a couple of primitive single value numeric type fields,  their
>> >> values are used in boosting functions, but not used in sort/facet. or
>> in
>> >> returned response.   Should I use docValues for them in the schema?  I
>> can
>> >> think of the following options:
>> >>
>> >> 1)   indexed=true,  stored=true, docValues=false
>> >> 2)   indexed=true, stored=false, docValues=true
>> >> 3)   indexed=false,  stored=false,  docValues=true
>> >>
>> >> What would be the performance implications for these options?
>> >>
>> >> Best,
>> >> Wei
>> >
>> >
>> > --
>> > uyilmaz 
>>
>>


Solr filter query on text fields

2019-06-24 Thread Wei
Hi,

I have always been using solr fq on string fields. Recently I need to apply
fq on one text field defined as follows (the field type and analyzer XML
were stripped by the mailing list archive):

For query q=*:*&fq=description:"ice cream", the filter query returns
matches for "ice cream bar" and "vanilla ice cream", but does not match
for "ice cold cream".

The results seem neither exact match nor phrase match. What's the expected
behavior for fq on text fields?  I have tried to look into the solr docs
but there is no clear explanation.

Thanks,
Wei


Re: Solr filter query on text fields

2019-06-24 Thread Wei
Thanks Shawn! I didn't notice the asterisks were created during copy/paste;
one lesson learned :)
Does that mean that when fq is applied to a text field, it does a text match
within the field just as q does on a query field, while for string fields it
is an exact match?
If it is a phrase query, what are the values for related parameters such as
ps?

Thanks,
Wei

On Mon, Jun 24, 2019 at 4:51 PM Shawn Heisey  wrote:

> On 6/24/2019 5:37 PM, Wei wrote:
> >  stored="true"/>
>
> I'm assuming that the asterisks here are for emphasis, that they are not
> actually present.  This can be very confusing.  It is far better to
> relay the precise information and not try to emphasize anything.
>
> > For query q=*:*&fq=description:”ice cream”,  the filter query returns
> > matches for “ice cream bar”  and “vanilla ice cream” , but does not match
> > for “ice cold cream”.
> >
> > The results seem neither exact match nor phrase match. What's the
> expected
> > behavior for fq on text fields?  I have tried to look into the solr docs
> > but there is no clear explanation.
>
> If the quotes are present in what you actually sent to Solr, then that
> IS a phrase query.  And that is why it did not match your third example.
>
> Try one of these instead:
>
> q=*:*&fq=description:(ice cream)
>
q=*:*&fq=description:ice description:cream
>
> Thanks,
> Shawn
>


Re: Solr filter query on text fields

2019-06-25 Thread Wei
Thanks Erick for the clarification. How does ps work for fq? I configured
ps=4 for q, but it doesn't apply to fq. For phrase queries in fq it seems
ps=0 is used. Is there a way to configure it for fq as well?

Best,
Wei
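
(A hedged aside: with the default lucene parser that fq uses, slop can be
attached directly to the phrase with the ~ operator, which avoids depending
on ps at all; e.g. a slop of 4:)

    fq=description:"ice cream"~4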

On Tue, Jun 25, 2019 at 9:51 AM Erick Erickson 
wrote:

> q and fq do _exactly_ the same thing in terms of query parsing, subject to
> all the same conditions.
>
> There are two things that apply to fq clauses that have nothing to do with
> the query _parsing_.
> 1> there is no scoring, so it’s cheaper from that perspective
> 2> the results are cached in a bitmap and can be re-used later
>
> Best,
> Erick
>
> > On Jun 24, 2019, at 7:06 PM, Wei  wrote:
> >
> > Thanks Shawn! I didn't notice the asterisks are created during
> copy/paste,
> > one lesson learned :)
> > Does that mean when fq is applied to text fields,  it is doing text match
> > in the field just like q in a query field?  While for string fields, it
> is
> > exact match.
> > If it is a phrase query,  what are the values for relate parameters such
> as
> > ps?
> >
> > Thanks,
> > Wei
> >
> > On Mon, Jun 24, 2019 at 4:51 PM Shawn Heisey 
> wrote:
> >
> >> On 6/24/2019 5:37 PM, Wei wrote:
> >>>  >> stored="true"/>
> >>
> >> I'm assuming that the asterisks here are for emphasis, that they are not
> >> actually present.  This can be very confusing.  It is far better to
> >> relay the precise information and not try to emphasize anything.
> >>
> >>> For query q=*:*&fq=description:”ice cream”,  the filter query returns
> >>> matches for “ice cream bar”  and “vanilla ice cream” , but does not
> match
> >>> for “ice cold cream”.
> >>>
> >>> The results seem neither exact match nor phrase match. What's the
> >> expected
> >>> behavior for fq on text fields?  I have tried to look into the solr
> docs
> >>> but there is no clear explanation.
> >>
> >> If the quotes are present in what you actually sent to Solr, then that
> >> IS a phrase query.  And that is why it did not match your third example.
> >>
> >> Try one of these instead:
> >>
> >> q=*:*&fq=description:(ice cream)
> >>
> >> q=*:*&fq=description:ice description:cream
> >>
> >> Thanks,
> >> Shawn
> >>
>
>


Function Query with multi-value field

2019-07-11 Thread Wei
Hi,

I have a question regarding function query that operates on multi-value
fields. For the following field (the field definition XML was stripped by
the mailing list archive):

Each value is a hex string representation of an RGB value. For example,
there are 3 values indexed:

#FF00FF   - C1
#EE82EE   - C2
#DA70D6   - C3

How would I write a function query that operates on all values of the
field? Given a color S in the query, how do I calculate the similarity
between S and C1/C2/C3 and find which one is the closest?
I checked https://lucene.apache.org/solr/guide/6_6/function-queries.html but
didn't see an example.

Thanks,
Wei
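
(Solr's stock function queries are of limited use on multivalued string
fields, so a custom plugin is typically needed; the distance computation
itself is simple. A minimal self-contained Java sketch of the math, as a
hypothetical helper class rather than a Solr API:)

    import java.util.Arrays;
    import java.util.List;

    public class ColorDistance {
        // Squared Euclidean distance between two #RRGGBB colors in RGB space.
        static int dist(String a, String b) {
            int x = Integer.parseInt(a.substring(1), 16);
            int y = Integer.parseInt(b.substring(1), 16);
            int dr = ((x >> 16) & 0xFF) - ((y >> 16) & 0xFF);
            int dg = ((x >> 8) & 0xFF) - ((y >> 8) & 0xFF);
            int db = (x & 0xFF) - (y & 0xFF);
            return dr * dr + dg * dg + db * db;
        }

        // Return the indexed value closest to the query color s.
        static String closest(String s, List<String> indexed) {
            String best = null;
            int bestDist = Integer.MAX_VALUE;
            for (String c : indexed) {
                int d = dist(s, c);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            return best;
        }

        public static void main(String[] args) {
            // Prints #FF00FF, the closest of the three indexed values.
            System.out.println(closest("#FF00EE",
                Arrays.asList("#FF00FF", "#EE82EE", "#DA70D6")));
        }
    }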


Re: Function Query with multi-value field

2019-07-13 Thread Wei
Any suggestion?

On Thu, Jul 11, 2019 at 3:03 PM Wei  wrote:

> Hi,
>
> I have a question regarding function query that operates on multi-value
> fields.  For the following field:
>
>  multivalued="true"/>
>
>  Each value is a hex string representation of RGB value.  for example
> there are 3 values indexed
>
> #FF00FF- C1
> #EE82EE   - C2
> #DA70D6   - C3
>
> How would I write a function query that operates on all values of the
> field?  Given color S in query, how to calculate the similarities between
> S and C1/C2/C3 and find which one is the closest?
> I checked https://lucene.apache.org/solr/guide/6_6/function-queries.html but
> didn't see an example.
>
> Thanks,
> Wei
>


How to block expensive solr queries

2019-10-07 Thread Wei
Hi,

Recently we encountered a problem where Solr Cloud query latency suddenly
increased: many simple queries with small recall were timing out. After
digging a bit I found that the root cause is some stats queries happening at
the same time, such as

/solr/mycollection/select?stats=true&stats.field=unique_ids&stats.calcdistinct=true



I see unique_ids is a high-cardinality field, so this query is quite
expensive. But why does a small volume of such queries block other queries
and make simple queries time out? I checked the Solr thread pool and see
there are plenty of idle threads available. We are using Solr 7.6.2 with a
10-shard cloud setup.

Is there a way to block certain Solr queries based on a URL pattern? I.e.,
ignore the stats.calcdistinct request in this case.


Thanks,

Wei


Re: How to block expensive solr queries

2019-10-07 Thread Wei
Hi Mikhail,

Yes, I have the timeAllowed parameter configured; still, in this case it
doesn't seem to prevent the stats request from blocking other normal
queries. Is it possible to drop the request before Solr executes it? Maybe
with a Jetty request filter?

Thanks,
Wei

On Mon, Oct 7, 2019 at 1:39 PM Mikhail Khludnev  wrote:

> Hello, Wei.
>
> Have you tried to abandon heavy queries with
>
> https://lucene.apache.org/solr/guide/8_1/common-query-parameters.html#CommonQueryParameters-ThetimeAllowedParameter
>  ?
> It may or may not be able to stop stats.
>
> https://github.com/apache/lucene-solr/blob/25eda17c66f0091dbf6570121e38012749c07d72/solr/core/src/test/org/apache/solr/cloud/CloudExitableDirectoryReaderTest.java#L223
> can clarify it.
>
> On Mon, Oct 7, 2019 at 8:19 PM Wei  wrote:
>
> > Hi,
> >
> > Recently we encountered a problem when solr cloud query latency suddenly
> > increase, many simple queries that has small recall gets time out. After
> > digging a bit I found that the root cause is some stats queries happen at
> > the same time, such as
> >
> >
> >
> /solr/mycollection/select?stats=true&stats.field=unique_ids&stats.calcdistinct=true
> >
> >
> >
> > I see unique_ids is a high cardinality field so this query is quite
> > expensive. But why a small volume of such query blocks other queries and
> > make simple queries time out?  I checked the solr thread pool and see
> there
> > are plenty of idle threads available.  We are using solr 7.6.2 with a 10
> > shard cloud set up.
> >
> > Is there a way to block certain solr queries based on url pattern? i.e.
> > ignore the stats.calcdistinct request in this case.
> >
> >
> > Thanks,
> >
> > Wei
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
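
(For reference, timeAllowed is a plain request parameter in milliseconds; a
hedged example against the collection above:)

    /solr/mycollection/select?q=*:*&timeAllowed=500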


Re: How to block expensive solr queries

2019-10-10 Thread Wei
On Wed, Oct 9, 2019 at 9:59 AM Wei  wrote:

> Thanks all. I debugged a bit and see that timeAllowed does not limit the
> stats call. Also I think it would be useful for Solr to support a whitelist
> or blacklist of operations, as Toke suggested. Will create a jira for it.
> Currently it seems the only option to explore is adding a filter to Solr's
> embedded Jetty. Does anyone have experience doing that? Do I also need to
> change SolrDispatchFilter?
>
> On Tue, Oct 8, 2019 at 3:50 AM Toke Eskildsen  wrote:
>
>> On Mon, 2019-10-07 at 10:18 -0700, Wei wrote:
>> > /solr/mycollection/select?stats=true&stats.field=unique_ids&stats.calcdistinct=true
>> ...
>> > Is there a way to block certain solr queries based on url pattern?
>> > i.e. ignore the stats.calcdistinct request in this case.
>>
>> It sounds like it is possible for users to issue arbitrary queries
>> against your Solr installation. As you have noticed, it makes it easy
>> to perform a Denial Of Service (intentional or not). Filtering out
>> stats.calcdistinct won't help with the next request for
>> group.ngroups=true, facet.field=unique_id&facet.limit=1,
>> rows=1 or something fifth.
>>
>> I recommend you flip your logic and only allow specific types of
>> requests and put limits on those. To my knowledge that is not a build-
>> in feature of Solr.
>>
>> - Toke Eskildsem, Royal Danish Library
>>
>>
>>
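
(A minimal sketch of the embedded-Jetty filter idea discussed above,
assuming it is registered in the webapp's web.xml ahead of
SolrDispatchFilter; the class name and error message are hypothetical and
this is untested:)

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Rejects any request whose query string carries stats.calcdistinct.
    public class BlockStatsFilter implements Filter {
        @Override public void init(FilterConfig cfg) {}
        @Override public void destroy() {}

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            String q = ((HttpServletRequest) req).getQueryString();
            if (q != null && q.contains("stats.calcdistinct")) {
                ((HttpServletResponse) res).sendError(403, "stats.calcdistinct is disabled");
                return;
            }
            chain.doFilter(req, res);
        }
    }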


Updates blocked in Tlog solr cloud?

2019-11-18 Thread Wei
Hi,

I am puzzled by a problem in Solr Cloud with Tlog replicas and would
appreciate your insights. Our Solr cloud has two shards, and each shard
has 5 tlog replicas. When one of the non-leader replicas had a hardware
issue and became unreachable, updates to the whole cloud stopped. We are on
Solr 7.6 and use the SolrJ client to send updates only to leaders. To my
understanding, with the Tlog replica type the leader only forwards update
requests to replicas for transaction log updates, and each replica
periodically pulls segments from the leader. When one replica fails to
respond, why are update requests to the cloud blocked? Does the leader need
to wait for a response from each replica before informing the client that an
update is successful?

Best,
Wei


Re: Updates blocked in Tlog solr cloud?

2019-11-19 Thread Wei
Hi Erick,

I observed that the update request rate dropped from 20 per sec to 3 per
sec for about 8 minutes. After that there was a huge burst of updates. This
looks like a good match for the queue-up behavior you mentioned. But I don't
think the timeout took that long. Is there a configurable setting for the
timeout?
Also, the bad tlog replica was not reachable at the time, so we issued a
DELETEREPLICA command with the Collections API to remove it from the cloud.

Thanks,
Wei
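
(For reference, the call has this shape, with hypothetical collection, shard
and replica names:)

    /admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node5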


On Tue, Nov 19, 2019 at 5:52 AM Erick Erickson 
wrote:

> How long are updates blocked and how did the tlog replica on the bad
> hardware go down?
>
> Solr has to wait for an ack back from the tlog follower to be certain that
> the follower has all the documents in case it has to switch to that replica
> to become the leader. If the update to the follower times out, the leader
> will put it into a recovering state.
>
> So I’d expect the collection to queue up indexing until the request to the
> follower on the bad hardware timed out, did you wait at least that long?
>
> Best,
> Erick
>
> > On Nov 18, 2019, at 7:11 PM, Wei  wrote:
> >
> > Hi,
> >
> > I am puzzled by a problem in solr cloud with Tlog replicas and would
> > appreciate your insights.  Our solr cloud has two shards and each shard
> > have 5 tlog replicas. When one of the non-leader replica has hardware
> issue
> > and become unreachable,  updates to the whole cloud stopped.  We are on
> > solr 7.6 and use solrj client to send updates only to leaders.  To my
> > understanding,  with Tlog replica type, the leader only forward update
> > requests to replicas for transaction log update and each replica
> > periodically pulls the segment from leader.  When one replica fails to
> > respond,  why update requests to the cloud are blocked?  Does leader need
> > to wait for response from each replica to inform client that update is
> > successful?
> >
> > Best,
> > Wei
>
>


Lucene optimization to disable hit count

2019-11-20 Thread Wei
Hi,

I see this lucene optimization to disable hit counts for better query
performance:

https://issues.apache.org/jira/browse/LUCENE-8060

Is the feature available in Solr 8.3?

Thanks,
Wei


Re: Lucene optimization to disable hit count

2019-11-20 Thread Wei
Thanks! Looking forward to have this feature in Solr.

On Wed, Nov 20, 2019 at 5:30 PM Tomás Fernández Löbbe 
wrote:

> Not yet:
> https://issues.apache.org/jira/browse/SOLR-13289
>
> On Wed, Nov 20, 2019 at 4:57 PM Wei  wrote:
>
> > Hi,
> >
> > I see this lucene optimization to disable hit counts for better query
> > performance:
> >
> > https://issues.apache.org/jira/browse/LUCENE-8060
> >
> > Is the feature available in Solr 8.3?
> >
> > Thanks,
> > Wei
> >
>


Re: Updates blocked in Tlog solr cloud?

2019-11-25 Thread Wei
Update with another observation: after the follower replica became
unresponsive, I noticed multiple commits happening on the leader within two
minutes, and then saw the following OOM error on the leader:

o.a.s.s.HttpSolrCall null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Direct buffer memory
  at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:662)
  at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:530)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
  at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
  at org.eclipse.jetty.server.Server.handle(Server.java:531)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
  at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
  at ...




The commits do not line up with our autocommit interval. I am wondering if
the commits could be caused by the leader-initiated recovery process. Will
the Tlog leader do extra commits for the replica to sync up during the
recovery process?


Best,

Wei



On Tue, Nov 19, 2019 at 1:22 PM Wei  wrote:

> Hi Erick,
>
> I observed that the update request rate dropped from 20 per sec to 3 per
> sec for about 8 minutes. After that there is a huge burst of updates. This
> looks quite match the queue up behavior you mentioned. But I don't think
> the time out took that long. Is there a configurable setting for the time
> out?
> Also the bad tlog replica is not reachable at the time, so we did a
> DELETEREPLICA command with collections API to remove it from the cloud.
>
> Thanks,
> Wei
>
>
> On Tue, Nov 19, 2019 at 5:52 AM Erick Erickson 
> wrote:
>
>> How long are updates blocked and how did the tlog replica on the bad
>> hardware go down?
>>
>> Solr has to wait for an ack back from the tlog follower to be certain
>> that the follower has all the documents in case it has to switch to that
>> replica to become the leader. If the update to the follower times out, the
>> leader will put it into a recovering state.
>>
>> So I’d expect the collection to queue up indexing until the request to
>> the follower on the bad hardware timed out, did you wait at least that long?
>>
>> Best,
>> Erick
>>
>> > On Nov 18, 2019, at 7:11 PM, Wei  wrote:
>> >
>> > Hi,
>> >
>> > I am puzzled by a problem in solr cloud with Tlog replicas and would
>> > appreciate your insights.  Our solr cloud has two shards and each shard
>> > have 5 tlog replicas. When one of the non-leader replica has hardware
>> issue
>> > and become unreachable,  updates to the whole cloud stopped.  We are on
>> > solr 7.6 and use solrj client to send updates only to leaders.  

Convert javabin to json

2019-11-27 Thread Wei
Hi,

Is there a reliable way to convert Solr's javabin response to JSON format?
We use the SolrJ client with wt=javabin, but want to convert the received
javabin response to JSON before passing it on to the client. We don't want
to use wt=json, as javabin is more efficient. We tried the noggit JSONUtil

https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/noggit/JSONUtil.java

but it seems unable to convert parts of the query response, such as
facets. Are there any other options available?

Thanks,
Wei
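
(One avenue worth testing, hedged since serialization of nested sections
such as facets should be verified against your version: SolrJ ships
org.apache.solr.common.util.Utils, whose toJSON can render the NamedList of
a javabin response. A minimal sketch:)

    import java.nio.charset.StandardCharsets;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.util.Utils;

    public class JavabinToJson {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                // SolrJ transfers the response as javabin by default.
                QueryResponse rsp = client.query(new SolrQuery("*:*"));
                // getResponse() is the complete NamedList, facets included.
                byte[] json = Utils.toJSON(rsp.getResponse());
                System.out.println(new String(json, StandardCharsets.UTF_8));
            }
        }
    }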


Early termination in Lucene 8

2020-01-22 Thread Wei
Hi,

I am excited to see that Lucene 8 introduced BlockMax WAND as a major speed
improvement: https://issues.apache.org/jira/browse/LUCENE-8135. My question
is, how does it integrate with facet requests, when numFound won't be
exact? I did some searching but haven't found any documentation on this. Any
pointer is greatly appreciated.

Best,
Wei


Re: Early termination in Lucene 8

2020-01-23 Thread Wei
Thanks Mikhail.  Do you know of any example on query parser with WAND?

On Thu, Jan 23, 2020 at 1:02 AM Mikhail Khludnev  wrote:

> If one creates query parser wrapping queries with WAND it just produce
> incomplete docset (I guess), which will be passed to facet component and
> produce fewer counts.
>
> On Thu, Jan 23, 2020 at 2:11 AM Wei  wrote:
>
> > Hi,
> >
> > I am excited to see Lucene 8 introduced BlockMax WAND as a major speed
> > improvement https://issues.apache.org/jira/browse/LUCENE-8135.  My
> > question
> > is, how does it integrate with facet request,  when the numFound won't be
> > exact? I did some search but haven't found any documentation on this. Any
> > pointer is greatly appreciated.
> >
> > Best,
> > Wei
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
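
(At the Lucene level the trigger is the totalHitsThreshold argument to the
collector; a minimal Lucene 8.x sketch, with a hypothetical index path and
field:)

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.TopScoreDocCollector;
    import org.apache.lucene.store.FSDirectory;

    public class WandDemo {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Collect the top 10 hits; stop counting once 1000 hits have
                // been seen, letting BlockMax WAND skip non-competitive blocks.
                TopScoreDocCollector collector = TopScoreDocCollector.create(10, 1000);
                searcher.search(new TermQuery(new Term("title", "solr")), collector);
                TopDocs top = collector.topDocs();
                // totalHits.relation is GREATER_THAN_OR_EQUAL_TO when the
                // count was cut short.
                System.out.println(top.totalHits.value + " / " + top.totalHits.relation);
            }
        }
    }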


Re: solr 5 leaving tomcat, will I be the only one fearing about this?

2016-10-07 Thread Wei Zhang
I think it just means they won't officially support deploying the war to
Tomcat or another container. Makes sense to me; if I were in charge of Solr,
I would just support Jetty: predictable, with a single configuration. I
wouldn't want to spend countless hours supporting various configurations;
instead, use those hours for further Solr development. I'm sure anyone with
enough familiarity with Tomcat, Java, and Solr shouldn't have any issues.
After all, Solr is free, but you need to pay for support.

On Fri, Oct 7, 2016, 7:13 PM Renee Sun  wrote:

> I just read through the following link Shawn shared in his reply:
> https://wiki.apache.org/solr/WhyNoWar
>
> While the following statement is true:
>
> "Supporting a single set of binary bits is FAR easier than worrying
> about what kind of customized environment the user has chosen for their
> deployment. "
>
> But it also probably will reduce the flexibility... for example, we tune
> for scalability at the tomcat level, such as its thread pool etc.  I assume
> the standalone Solr (which still uses Jetty underneath) would expose
> sufficient configurable 'knobs' that allow me to tune Solr to meet our
> data workload.
>
> If we want to minimize the migration work, our existing business logic
> component will remain in tomcat; then the fact that we will have jetty and
> tomcat co-existing in the production system is a bit strange... or is
> it?
>
> Even if I could port our webapps to use Jetty, I assume that given the way
> Solr embeds Jetty I would not be able to integrate at that level, so I
> would probably end up with 2 Jetty container instances running on the same
> server, correct? It is still too early for me to be sure how this will
> impact our system but I am a little worried.
>
> Renee
>
>
>
>



2 question about solr and lucene

2013-09-16 Thread Robin Wei
Hi, guys:
I have two questions about Solr and Lucene, and I hope people can help out.

1. Payload queries work, but NOT when combined with a numerical field type
(the example field definition was stripped by the mailing list archive).

I implemented my own request handler, following
http://hnagtech.wordpress.com/2013/04/19/using-payloads-with-solr-4-x/
When I query sinaTag:operate, the Solr response is:
  "numFound": 2,
"start": 0,
"maxScore": 99,
"docs": [
  {
"id": "1628209010",
"followersCount": 752,
"sinaTag": "operate|99 C2C|98 B2C|97 OnlineShopping|96 E-commercial|94",
"score": 99
   },
  {
"id": "1900546410",
"followersCount": 1002,
"sinaTag": "Startup|99 Benz|98 PublicRelation|97 operate|96 Activity|95 
Media|94 AD|93 Vehicle|92 ",
  "score":   96
   }
This works well.
But when the query is combined with another numerical condition, such as:
sinaTag:operate and followersCount:[752 TO 752]
{
   "responseHeader": {
"status": 0,
"QTime": 40
  },
  "response": {
"numFound": 0,
"start": 0,
"maxScore": 0,
"docs": []
  }
}
According to this dataset, the first record should be returned rather than
NOT FOUND.
I don't know why.
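
(One thing worth checking, hedged since the parser in use isn't shown: the
default lucene query parser only treats uppercase AND/OR/NOT as boolean
operators, so the lowercase "and" above is parsed as an ordinary search term.
The usual form is:)

    sinaTag:operate AND followersCount:[752 TO 752]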


2. For string-field fuzzy-match filtering, how do I get the score? What is
the formula?
When I use two or several string fuzzy matches, possibly combined with AND
or OR, how do I get the score? What is the formula?
If I want to implement my own scoring formula class, which interface or
abstract class should I extend?
 



Thanks in advance.







Solr 4 memory usage increase

2013-05-16 Thread Wei Zhao
We are migrating from Solr 3.5 to Solr 4.2.

After some performance testing, we found 4.2's memory usage is a lot higher
than 3.5's. Our 12GB max-heap process used to handle the test pretty well
with 3.5, while with 4.2 the same test runs into serious GC halfway (20
minutes) into the test.

Does anyone know what is significantly different from Solr 3.5 in terms of
memory usage?

We also notice that on a slave, the IndexWriter class is actually taking a
significant portion (around 3GB) of the heap. Why does Solr open an
IndexWriter on a slave? Is there a conf I can use to turn it off? I don't
remember seeing such heap usage by a similar class in Solr 3.5.





Re: Solr 4 memory usage increase

2013-05-16 Thread Wei Zhao
No, exactly the same JVM, Java 6.





Re: Solr 4 memory usage increase

2013-05-17 Thread Wei Zhao
Here is the JVM info:

$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)





Re: Solr 4 memory usage increase

2013-05-17 Thread Wei Zhao
We have a master/slave setup. We disabled autocommits/autosoftcommits, so
the slave only replicates from the master and serves queries. The master
does all the indexing and commits every 5 minutes. The slave polls the
master every 2.5 minutes and does replication.

Both tests, with Solr 3.5 and 4.2, were run with the same setup, and both
with master/slave replication running.





solr data config questions

2010-06-28 Thread Peng, Wei
Hi All,

 

I am a new user of Solr.

We are now trying to enable searching on a Digg dataset.

It has story_id as the primary key, and each comment_id identifies a comment
on that story, so story_id to comment_id is a one-to-many relationship.

Each comment can be replied to by several repliers, so comment_id to
repliers is also a one-to-many relationship.

 

The problem is that within a single returned document the search result
shows an array of comment_ids and an array of repliers, without knowing
which repliers replied to which comment.

For example, now we get comment_id:[c1,c2,...,cn] and
repliers:[r1,r2,r3,...,rm]. Can we get something like
comment_id:[c1,c2,...,cn], repliers:[{r1,r2},{},{r3},...,{rm-1,rm}], so that
{r1,r2} corresponds to c1?

 

Our current data-config is attached (the data-config.xml contents were
stripped by the mailing list archive):

Please help me on this.
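
(Since the attachment was stripped, a hedged sketch of the usual shape of a
nested one-to-many DIH data-config, with hypothetical table and column
names; illustrative only, not the poster's original:)

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/digg"/>
      <document>
        <entity name="story" query="SELECT story_id, title FROM stories">
          <field column="story_id" name="story_id"/>
          <field column="title" name="title"/>
          <entity name="comment"
                  query="SELECT comment_id FROM comments WHERE story_id='${story.story_id}'">
            <field column="comment_id" name="comment_id"/>
            <entity name="replier"
                    query="SELECT replier FROM replies WHERE comment_id='${comment.comment_id}'">
              <field column="replier" name="repliers"/>
            </entity>
          </entity>
        </entity>
      </document>
    </dataConfig>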

Many thanks

 

Vivian

 

 

 


