Re: Solr 7.6.0: PingRequestHandler - Changing the default query (*:*)

2019-08-05 Thread Nicolas Franck
If the ping request handler is taking too long,
and the server is not recovering automatically,
there is not much you can do automatically on that server.
You have to intervene manually, and restart Solr on that node.

First of all: the ping is just an internal check. If it takes too long
to respond, the requester (i.e. the script calling it) should abort
the request and mark that node as problematic. If there are,
for example, memory problems, every subsequent request will only worsen
the problem, and Solr cannot recover from that.
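
For example, a minimal external check with SolrJ could look roughly like this
(untested sketch; the node URL is a placeholder and markUnhealthy() stands in
for whatever your monitoring does):

// Untested sketch: ping a node with a hard 1 second client-side timeout and
// flag it instead of retrying when the ping does not come back in time.
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class ExternalPingCheck {

  public static void main(String[] args) {
    String nodeUrl = "http://hostname:8983/solr/parts";   // placeholder
    try (HttpSolrClient client = new HttpSolrClient.Builder(nodeUrl)
        .withConnectionTimeout(1000)   // give up connecting after 1s
        .withSocketTimeout(1000)       // give up waiting for the response after 1s
        .build()) {
      SolrPingResponse rsp = client.ping();
      System.out.println("ping ok, status=" + rsp.getStatus()
          + ", qtime=" + rsp.getQTime());
    } catch (Exception e) {
      // Timeout or error: stop hammering the node, mark it and let an operator
      // (or an orchestration tool) restart Solr there.
      markUnhealthy(nodeUrl);
    }
  }

  private static void markUnhealthy(String nodeUrl) {
    // placeholder for the monitoring/alerting hook
    System.err.println("Marking node as problematic: " + nodeUrl);
  }
}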

> On 5 Aug 2019, at 06:15, dinesh naik  wrote:
> 
> Thanks Jörn, Erick and Furkan.
> 
> I have already defined the ping request handler in solrconfig.xml as below:
>   <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
>     <lst name="invariants">
>       <str name="qt">/select</str>
>       <str name="q">_root_:abc</str>
>     </lst>
>   </requestHandler>
> 
> My question is regarding the custom query being used. Here I am querying
> the field _root_, which is available in all of my clusters and defined as a
> string field. The query _root_:abc might not get any match at all
> (I am ok with not finding any matches; the query just should not be taking
> 10-15 seconds to respond).
> 
> If the response comes within 1 second, then the core recovery issue is
> solved, hence I need your suggestion on whether using the _root_ field in
> the custom query is fine.
> 
> 
> On Mon, Aug 5, 2019 at 2:49 AM Furkan KAMACI  wrote:
> 
>> Hi,
>> 
>> You can change invariants i.e. *qt* and *q* of a *PingRequestHandler*:
>> 
>> <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
>>   <lst name="invariants">
>>     <str name="qt">/search</str>
>>     <str name="q">some test query</str>
>>   </lst>
>> </requestHandler>
>> 
>> Check the documentation for more info:
>> 
>> https://lucene.apache.org/solr/7_6_0//solr-core/org/apache/solr/handler/PingRequestHandler.html
>> 
>> Kind Regards,
>> Furkan KAMACI
>> 
>> On Sat, Aug 3, 2019 at 4:17 PM Erick Erickson 
>> wrote:
>> 
>>> You can also (I think) explicitly define the ping request handler in
>>> solrconfig.xml to do something else.
>>> 
 On Aug 2, 2019, at 9:50 AM, Jörn Franke  wrote:
 
 Not sure if this is possible, but why not create a query handler in Solr
 with any custom query and use that as a ping replacement?
 
> On 02.08.2019 at 15:48, dinesh naik wrote:
> 
> Hi all,
> I have a few clusters with huge data sets, and whenever a node goes down it is
> not able to recover due to the reasons below:
> 
> 1. The ping request handler is taking more than 10-15 seconds to respond.
> The ping request handler, however, expects a response in less than 1 second,
> and fails the recovery request if it is not responded to in this time.
> Therefore recoveries never start.
> 
> 2. The soft commit interval is very low, i.e. 5 sec. This is a business
> requirement, so not much can be done here.
> 
> As the standard/default admin/ping request handler uses *:* queries,
> the response time is much higher, and I am looking for an option to change
> it so that the ping handler returns results within a few milliseconds.
> 
> here is an example for standard query time:
> 
> snip---
> curl "
> http://hostname:8983/solr/parts/select?indent=on&q=*:*&rows=0&wt=json&distrib=false&debug=timing
> "
> {
> "responseHeader":{
>  "zkConnected":true,
>  "status":0,
>  "QTime":16620,
>  "params":{
>"q":"*:*",
>"distrib":"false",
>"debug":"timing",
>"indent":"on",
>"rows":"0",
>"wt":"json"}},
> "response":{"numFound":1329638799,"start":0,"docs":[]
> },
> "debug":{
>  "timing":{
>"time":16620.0,
>"prepare":{
>  "time":0.0,
>  "query":{
>"time":0.0},
>  "facet":{
>"time":0.0},
>  "facet_module":{
>"time":0.0},
>  "mlt":{
>"time":0.0},
>  "highlight":{
>"time":0.0},
>  "stats":{
>"time":0.0},
>  "expand":{
>"time":0.0},
>  "terms":{
>"time":0.0},
>  "block-expensive-queries":{
>"time":0.0},
>  "slow-query-logger":{
>"time":0.0},
>  "debug":{
>"time":0.0}},
>"process":{
>  "time":16619.0,
>  "query":{
>"time":16619.0},
>  "facet":{
>"time":0.0},
>  "facet_module":{
>"time":0.0},
>  "mlt":{
>"time":0.0},
>  "highlight":{
>"time":0.0},
>  "stats":{
>"time":0.0},
>  "expand":{
>"time":0.0},
>  "terms":{
>"time":0.0},
>  "block-expensive-queries":{
>"time":0.0},
>  "slow-query-logger":{
>"time":0.0},
>  "debug":{
>"time":0.0}
> 
> 
> snap
> 
> Can we use the query _root_:abc in the ping request handler? I tried this
> query and it is returning results within a few milliseconds, and also the
> nodes are able to recover with

Re: Solr 7.6.0: PingRequestHandler - Changing the default query (*:*)

2019-08-05 Thread Shawn Heisey

On 8/4/2019 10:15 PM, dinesh naik wrote:

My question is regarding the custom query being used. Here i am querying
for field _root_ which is available in all of my cluster and defined as a
string field. The result for _root_:abc might not get me any match as
well(i am ok with not finding any matches, the query should not be taking
10-15 seconds for getting the response).


Typically the *:* query is the fastest option.  It is special syntax 
that means "all documents" and it usually executes very quickly.  It 
will be faster than querying for a value in a specific field, which is 
what you have defined currently.


I will typically add a "rows" parameter to the ping handler with a value 
of 1, so Solr will not be retrieving a large amount of data.  If you are 
running Solr in cloud mode, you should experiment with setting the 
distrib parameter to false, which will hopefully limit the query to the 
receiving node only.
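
For example, a rough SolrJ sketch of that experiment (untested; the host and
collection name are placeholders) would be:

// Untested sketch: time a zero-row, non-distributed match-all query against a
// single node, the same thing the ping handler would run with those invariants.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PingTimingCheck {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://hostname:8983/solr/parts").build()) {    // placeholder URL
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);                  // do not fetch any documents
      q.set("distrib", "false");     // keep the query on the receiving node
      QueryResponse rsp = client.query(q);
      System.out.println("numFound=" + rsp.getResults().getNumFound()
          + " QTime(ms)=" + rsp.getQTime());
    }
  }
}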


Erick has already mentioned GC pauses as a potential problem.  With a 
10-15 second response time, I think that has high potential to be the 
underlying cause.


The response you included at the beginning of the thread indicates there 
are 1.3 billion documents, which is going to require a fair amount of 
heap memory.  If seeing such long ping times with a *:* query is 
something that happens frequently, your heap may be too small, which 
will cause frequent full garbage collections.


The very low autoSoftCommit time can contribute to system load.  I think 
it's very likely, especially with such a large index, that in many cases 
those automatic commits are taking far longer than 5 seconds to 
complete.  If that's the case, you're not achieving a 5 second 
visibility interval and you are putting a lot of load on Solr, so I 
would consider increasing it.


Thanks,
Shawn


Difference between search results from Solr 5 and 8

2019-08-05 Thread Alexander Sherbakov
Hi all,

We upgraded our Solr cluster from 5 to 8 and I've found a difference in search 
results.

Previously we had this in schema.xml:
<solrQueryParser defaultOperator="AND"/>

Which stopped working in Solr 8, so we moved this to solrconfig.xml as:
<str name="q.op">AND</str>

Now, this search gives 0 results while previously it worked fine and returned 2 
records:
[ path=select parameters={fq: ["type:Member"], sort: "score desc", q: 
"u...@gmail.com ad...@yahoo.com", fl: "* score", qf: "email_words_ngram", 
defType: "edismax", mm: 1, start: 0, rows: 20} ]

At the same time the docs say that terms without an explicit "+" or "-" are 
considered optional and results matching either term should be returned.

This search works:
[ path=select parameters={fq: ["type:Member"], sort: "score desc", q: 
"u...@gmail.com OR ad...@yahoo.com", fl: "* score", qf: "email_words_ngram", 
defType: "edismax", mm: 1, start: 0, rows: 20} ]

I need help figuring out what's wrong with our configuration and how to handle 
this properly.

Thank you,
Alexander

Re: Difference between search results from Solr 5 and 8

2019-08-05 Thread Shawn Heisey

On 8/5/2019 7:34 AM, Alexander Sherbakov wrote:

Which stopped working in Solr 8, so we moved this to solrconfig.xml as:
<str name="q.op">AND</str>

Now, this search gives 0 results while previously it worked fine and returned 2 
records:
[ path=select parameters={fq: ["type:Member"], sort: "score desc", q: "u...@gmail.com ad...@yahoo.com", fl: 
"* score", qf: "email_words_ngram", defType: "edismax", mm: 1, start: 0, rows: 20} ]

At the same time the docs say that terms without an explicit "+" or "-" are 
considered optional and results matching either term should be returned.


Untagged clauses are indeed optional -- if you leave the default 
operator at "OR".  You've set it to "AND", which means that effectively 
any query clause without a +/- or a boolean operator has an implicit + 
-- it will be required.
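
If those clauses need to stay optional while AND remains the configured default,
q.op can also be overridden per request; a rough SolrJ sketch (untested, the
collection name is assumed) would be:

// Untested sketch: same edismax query with q.op forced back to OR for this
// request, so untagged clauses become optional again. Collection name assumed.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DefaultOperatorCheck {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/members").build()) {   // collection assumed
      SolrQuery q = new SolrQuery("u...@gmail.com ad...@yahoo.com");
      q.set("defType", "edismax");
      q.set("qf", "email_words_ngram");
      q.set("mm", "1");
      q.set("q.op", "OR");              // untagged clauses are optional again
      q.addFilterQuery("type:Member");
      QueryResponse rsp = client.query(q);
      System.out.println("numFound=" + rsp.getResults().getNumFound());
    }
  }
}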


The behavior in Solr 5 should be the same with a default operator of 
AND, unless you were perhaps running into a bug there.  Or maybe 
everything was not entirely the same before.


Thanks,
Shawn


Re: Solr 7.6.0: PingRequestHandler - Changing the default query (*:*)

2019-08-05 Thread dinesh naik
Hi Nicolas,
Restarting the node is not helping; the node keeps trying to recover and
always fails.

Here is the log:
2019-07-31 06:10:08.049 INFO
 (coreZkRegister-1-thread-1-processing-n:replica_host:8983_solr
x:parts_shard30_replica_n2697 c:parts s:shard30 r:core_node2698)
x:parts_shard30_replica_n2697 o.a.s.c.ZkController Core needs to
recover:parts_shard30_replica_n2697

2019-07-31 06:10:08.050 INFO
 (updateExecutor-3-thread-1-processing-n:replica_host:8983_solr
x:parts_shard30_replica_n2697 c:parts s:shard30 r:core_node2698)
x:parts_shard30_replica_n2697 o.a.s.u.DefaultSolrCoreState Running recovery

2019-07-31 06:10:08.056 INFO
 (recoveryExecutor-4-thread-1-processing-n:replica_host:8983_solr
x:parts_shard30_replica_n2697 c:parts s:shard30 r:core_node2698)
x:parts_shard30_replica_n2697 o.a.s.c.RecoveryStrategy Starting recovery
process. recoveringAfterStartup=true

2019-07-31 06:10:08.261 INFO
 (recoveryExecutor-4-thread-1-processing-n:replica_host:8983_solr
x:parts_shard30_replica_n2697 c:parts s:shard30 r:core_node2698)
x:parts_shard30_replica_n2697 o.a.s.c.RecoveryStrategy startupVersions
size=49956 range=[1640550593276674048 to 1640542396328443904]

2019-07-31 06:10:08.328 INFO  (qtp689401025-58)  o.a.s.s.HttpSolrCall
[admin] webapp=null path=/admin/info/key params={omitHeader=true&wt=json}
status=0 QTime=0

2019-07-31 06:10:09.276 INFO
 (recoveryExecutor-4-thread-1-processing-n:replica_host:8983_solr
x:parts_shard30_replica_n2697 c:parts s:shard30 r:core_node2698)
x:parts_shard30_replica_n2697 o.a.s.c.RecoveryStrategy Failed to connect
leader http://hostname:8983/solr on recovery, try again

The ping request query is being called from Solr itself and not via some
script, so there is no way to stop it.

Code where the timeout is hardcoded to 1 second:

try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
    .withSocketTimeout(1000)
    .withConnectionTimeout(1000)
    .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
    .build()) {
  SolrPingResponse resp = httpSolrClient.ping();
  return leaderReplica;
} catch (IOException e) {
  log.info("Failed to connect leader {} on recovery, try again",
      leaderReplica.getBaseUrl());
  Thread.sleep(500);
} catch (Exception e) {
  if (e.getCause() instanceof IOException) {
    log.info("Failed to connect leader {} on recovery, try again",
        leaderReplica.getBaseUrl());
    Thread.sleep(500);
  } else {
    return leaderReplica;
  }
}



On Mon, Aug 5, 2019 at 1:19 PM Nicolas Franck 
wrote:

> If the ping request handler is taking too long,
> and the server is not recovering automatically,
> there is not much you can do automatically on that server.
> You have to intervene manually, and restart Solr on that node.
>
> First of all: the ping is just an internal check. If it takes too long
> to respond, the requester (i.e. the script calling it), should stop
> the request, and mark that node as problematic. If there are
> for example memory problems every subsequent request will only enhance
> the problem, and Solr cannot recover from that.
>
> > On 5 Aug 2019, at 06:15, dinesh naik  wrote:
> >
> > Thanks Jörn, Erick and Furkan.
> >
> > I have already defined the ping request handler in solrconfig.xml as below:
> >   <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
> >     <lst name="invariants">
> >       <str name="qt">/select</str>
> >       <str name="q">_root_:abc</str>
> >     </lst>
> >   </requestHandler>
> >
> > My question is regarding the custom query being used. Here i am querying
> > for field _root_ which is available in all of my cluster and defined as a
> > string field. The result for _root_:abc might not get me any match as
> > well(i am ok with not finding any matches, the query should not be taking
> > 10-15 seconds for getting the response).
> >
> > If the response comes within 1 second , then the core recovery issue is
> > solved, hence need your suggestion if using _root_ field in custom query
> is
> > fine?
> >
> >
> > On Mon, Aug 5, 2019 at 2:49 AM Furkan KAMACI 
> wrote:
> >
> >> Hi,
> >>
> >> You can change invariants i.e. *qt* and *q* of a *PingRequestHandler*:
> >>
> >> <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
> >>   <lst name="invariants">
> >>     <str name="qt">/search</str>
> >>     <str name="q">some test query</str>
> >>   </lst>
> >> </requestHandler>
> >>
> >> Check the documentation for more info:
> >>
> >>
> https://lucene.apache.org/solr/7_6_0//solr-core/org/apache/solr/handler/PingRequestHandler.html
> >>
> >> Kind Regards,
> >> Furkan KAMACI
> >>
> >> On Sat, Aug 3, 2019 at 4:17 PM Erick Erickson 
> >> wrote:
> >>
> >>> You can also (I think) explicitly define the ping request handler in
> >>> solrconfig.xml to do something else.
> >>>
>  On Aug 2, 2019, at 9:50 AM, Jörn Franke  wrote:
> 
>  Not sure if this is possible, but why not create a query handler in
> >> Solr
> >>> with any custom query and you use that as ping replacement ?
> 
> > On 02.08.2019 at 15:48, dinesh naik <dineshkumarn...@gmail.com> wrote:
> >
> > Hi all,
> > I have few clusters with huge data set and whenever a node goes down
> >> its
> > n

Re: Solr 7.6.0: PingRequestHandler - Changing the default query (*:*)

2019-08-05 Thread dinesh naik
Hi Shawn,
Yes, I am running Solr in cloud mode, and even after adding the params rows=0
and distrib=false, the query response is more than 15 sec due to the
more-than-a-billion document set.
Also, the soft commit setting cannot be changed to a higher value due to a
requirement from the business team.

http://hostname:8983/solr/parts/select?indent=on&q=*:*&rows=0&wt=json&distrib=false
always takes more than 10 sec.

Here are the Java heap and G1GC settings I have:

/usr/java/default/bin/java -server -Xmx31g -Xms31g -XX:+UseG1GC
-XX:MaxGCPauseMillis=250 -XX:ConcGCThreads=5
-XX:ParallelGCThreads=10 -XX:+UseLargePages -XX:+AggressiveOpts
-XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
-XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=18
-XX:MaxNewSize=6G -XX:PrintFLSStatistics=1
-XX:+PrintPromotionFailure -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/solr7/logs/heapdump
-verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime

The JVM heap has never crossed 20GB in my setup, and young G1GC pause times are
well within milliseconds (in the range of 25-200 ms).

On Mon, Aug 5, 2019 at 6:37 PM Shawn Heisey  wrote:

> On 8/4/2019 10:15 PM, dinesh naik wrote:
> > My question is regarding the custom query being used. Here i am querying
> > for field _root_ which is available in all of my cluster and defined as a
> > string field. The result for _root_:abc might not get me any match as
> > well(i am ok with not finding any matches, the query should not be taking
> > 10-15 seconds for getting the response).
>
> Typically the *:* query is the fastest option.  It is special syntax
> that means "all documents" and it usually executes very quickly.  It
> will be faster than querying for a value in a specific field, which is
> what you have defined currently.
>
> I will typically add a "rows" parameter to the ping handler with a value
> of 1, so Solr will not be retrieving a large amount of data.  If you are
> running Solr in cloud mode, you should experiment with setting the
> distrib parameter to false, which will hopefully limit the query to the
> receiving node only.
>
> Erick has already mentioned GC pauses as a potential problem.  With a
> 10-15 second response time, I think that has high potential to be the
> underlying cause.
>
> The response you included at the beginning of the thread indicates there
> are 1.3 billion documents, which is going to require a fair amount of
> heap memory.  If seeing such long ping times with a *:* query is
> something that happens frequently, your heap may be too small, which
> will cause frequent full garbage collections.
>
> The very low autoSoftCommit time can contribute to system load.  I think
> it's very likely, especially with such a large index, that in many cases
> those automatic commits are taking far longer than 5 seconds to
> complete.  If that's the case, you're not achieving a 5 second
> visibility interval and you are putting a lot of load on Solr, so I
> would consider increasing it.
>
> Thanks,
> Shawn
>


-- 
Best Regards,
Dinesh Naik


Re: Solr 7.6.0: PingRequestHandler - Changing the default query (*:*)

2019-08-05 Thread Erick Erickson
How much total physical memory on your machine? Lucene holds a lot of the
index in MMapDirectory space. My starting point is to allocate no more than
50% of my physical memory to the Java heap. You’re allocating 31G; if you don’t
have at _least_ 64G on these machines you’re probably swapping.

See: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best,
Erick


> On Aug 5, 2019, at 10:58 AM, dinesh naik  wrote:
> 
> Hi Shawn,
> yes i am running solr in cloud mode and Even after adding the params row=0
> and distrib=false, the query response is more than 15 sec due to more than
> a billion doc set.
> Also the soft commit setting can not be changed to a higher no. due to
> requirement from business team.
> 
> http://hostname:8983/solr/parts/select?indent=on&q=*:*&rows=0&wt=json&distrib=false
> takes more than 10 sec always.
> 
> Here are the java heap and G1GC setting i have ,
> 
> /usr/java/default/bin/java -server -Xmx31g -Xms31g -XX:+UseG1GC
> -XX:MaxGCPauseMillis=250 -XX:ConcGCThreads=5
> -XX:ParallelGCThreads=10 -XX:+UseLargePages -XX:+AggressiveOpts
> -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
> -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=18
> -XX:MaxNewSize=6G -XX:PrintFLSStatistics=1
> -XX:+PrintPromotionFailure -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/solr7/logs/heapdump
> -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> -XX:+PrintGCTimeStamps
> -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> 
> JVM  heap has never crossed 20GB in my setup , also Young G1GC timing is
> well within milli seconds (in range of 25-200 ms).
> 
> On Mon, Aug 5, 2019 at 6:37 PM Shawn Heisey  wrote:
> 
>> On 8/4/2019 10:15 PM, dinesh naik wrote:
>>> My question is regarding the custom query being used. Here i am querying
>>> for field _root_ which is available in all of my cluster and defined as a
>>> string field. The result for _root_:abc might not get me any match as
>>> well(i am ok with not finding any matches, the query should not be taking
>>> 10-15 seconds for getting the response).
>> 
>> Typically the *:* query is the fastest option.  It is special syntax
>> that means "all documents" and it usually executes very quickly.  It
>> will be faster than querying for a value in a specific field, which is
>> what you have defined currently.
>> 
>> I will typically add a "rows" parameter to the ping handler with a value
>> of 1, so Solr will not be retrieving a large amount of data.  If you are
>> running Solr in cloud mode, you should experiment with setting the
>> distrib parameter to false, which will hopefully limit the query to the
>> receiving node only.
>> 
>> Erick has already mentioned GC pauses as a potential problem.  With a
>> 10-15 second response time, I think that has high potential to be the
>> underlying cause.
>> 
>> The response you included at the beginning of the thread indicates there
>> are 1.3 billion documents, which is going to require a fair amount of
>> heap memory.  If seeing such long ping times with a *:* query is
>> something that happens frequently, your heap may be too small, which
>> will cause frequent full garbage collections.
>> 
>> The very low autoSoftCommit time can contribute to system load.  I think
>> it's very likely, especially with such a large index, that in many cases
>> those automatic commits are taking far longer than 5 seconds to
>> complete.  If that's the case, you're not achieving a 5 second
>> visibility interval and you are putting a lot of load on Solr, so I
>> would consider increasing it.
>> 
>> Thanks,
>> Shawn
>> 
> 
> 
> -- 
> Best Regards,
> Dinesh Naik



Re: Solr 7.6.0: PingRequestHandler - Changing the default query (*:*)

2019-08-05 Thread dinesh naik
Hi Erick,
Each VM has 128GB of physical memory.


On Mon, Aug 5, 2019, 8:38 PM Erick Erickson  wrote:

> How much total physical memory on your machine? Lucene holds a lot of the
> index in MMapDirectory space. My starting point is to allocate no more than
> 50% of my physical memory to the Java heap. You’re allocating 31G, if you
> don’t
> have at _least_ 64G on these machines you’re probably swapping.
>
> See:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best,
> Erick
>
>
> > On Aug 5, 2019, at 10:58 AM, dinesh naik 
> wrote:
> >
> > Hi Shawn,
> > yes i am running solr in cloud mode and Even after adding the params
> row=0
> > and distrib=false, the query response is more than 15 sec due to more
> than
> > a billion doc set.
> > Also the soft commit setting can not be changed to a higher no. due to
> > requirement from business team.
> >
> >
> http://hostname:8983/solr/parts/select?indent=on&q=*:*&rows=0&wt=json&distrib=false
> > takes more than 10 sec always.
> >
> > Here are the java heap and G1GC setting i have ,
> >
> > /usr/java/default/bin/java -server -Xmx31g -Xms31g -XX:+UseG1GC
> > -XX:MaxGCPauseMillis=250 -XX:ConcGCThreads=5
> > -XX:ParallelGCThreads=10 -XX:+UseLargePages -XX:+AggressiveOpts
> > -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled
> > -XX:InitiatingHeapOccupancyPercent=50 -XX:G1ReservePercent=18
> > -XX:MaxNewSize=6G -XX:PrintFLSStatistics=1
> > -XX:+PrintPromotionFailure -XX:+HeapDumpOnOutOfMemoryError
> > -XX:HeapDumpPath=/solr7/logs/heapdump
> > -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> > -XX:+PrintGCTimeStamps
> > -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
> >
> > JVM  heap has never crossed 20GB in my setup , also Young G1GC timing is
> > well within milli seconds (in range of 25-200 ms).
> >
> > On Mon, Aug 5, 2019 at 6:37 PM Shawn Heisey  wrote:
> >
> >> On 8/4/2019 10:15 PM, dinesh naik wrote:
> >>> My question is regarding the custom query being used. Here i am
> querying
> >>> for field _root_ which is available in all of my cluster and defined
> as a
> >>> string field. The result for _root_:abc might not get me any match as
> >>> well(i am ok with not finding any matches, the query should not be
> taking
> >>> 10-15 seconds for getting the response).
> >>
> >> Typically the *:* query is the fastest option.  It is special syntax
> >> that means "all documents" and it usually executes very quickly.  It
> >> will be faster than querying for a value in a specific field, which is
> >> what you have defined currently.
> >>
> >> I will typically add a "rows" parameter to the ping handler with a value
> >> of 1, so Solr will not be retrieving a large amount of data.  If you are
> >> running Solr in cloud mode, you should experiment with setting the
> >> distrib parameter to false, which will hopefully limit the query to the
> >> receiving node only.
> >>
> >> Erick has already mentioned GC pauses as a potential problem.  With a
> >> 10-15 second response time, I think that has high potential to be the
> >> underlying cause.
> >>
> >> The response you included at the beginning of the thread indicates there
> >> are 1.3 billion documents, which is going to require a fair amount of
> >> heap memory.  If seeing such long ping times with a *:* query is
> >> something that happens frequently, your heap may be too small, which
> >> will cause frequent full garbage collections.
> >>
> >> The very low autoSoftCommit time can contribute to system load.  I think
> >> it's very likely, especially with such a large index, that in many cases
> >> those automatic commits are taking far longer than 5 seconds to
> >> complete.  If that's the case, you're not achieving a 5 second
> >> visibility interval and you are putting a lot of load on Solr, so I
> >> would consider increasing it.
> >>
> >> Thanks,
> >> Shawn
> >>
> >
> >
> > --
> > Best Regards,
> > Dinesh Naik
>
>


SOLR 8.1.1 index on pdate field included in search results

2019-08-05 Thread Hodder, Rick
I am migrating from SOLR 4.10.2 to 8.1.1. For some reason, in the 8.1.1 core, a 
pdate index named IDX_ExpirationDate is appearing as a field in the search 
results documents.
I have several other indexes that are defined and (correctly) do not appear in 
the results. But the index I am having trouble with is the only one based on a 
pdate.
Here is a sample 8.1.1 response that demonstrates the issue:
"response":{"numFound":58871,"start":0,"docs":[
  {
"id":"1",
"ExpirationDate":"2018-01-26T00:00:00Z",
"_version_":1641033044033798170,
"IDX_ExpirationDate":["2018-01-26T00:00:00Z"]},
  {
"id":"2",
"ExpirationDate":"2018-02-20T00:00:00Z",
"_version_":1641032965380112384,
"IDX_ExpirationDate":["2018-02-20T00:00:00Z"]},

ExpirationDate is supposed to be there, but IDX_ExpirationDate should not. I 
know that I can probably keep using date, but it is deprecated, and part of the 
reason for upgrading to 8.1.1 is to use the latest non-deprecated stuff ;-)
I have an index named IDX_ExpirationDate based on a field called ExpirationDate 
that was a date field in 4.10.2:





In the 8.1.1 core, I have this configured as a pdate:







Re: SOLR 8.1.1 index on pdate field included in search results

2019-08-05 Thread Shawn Heisey

On 8/5/2019 10:37 AM, Hodder, Rick wrote:

ExpirationDate is supposed to be there, but IDX_ExpirationDate should not. I 
know that I can probably keep using date, but it is deprecated, and part of the 
reason for upgrading to 8.1.1 is to use the latest non-deprecated stuff ;-)


The DatePointField class defaults to docValues="true" and 
useDocValuesAsStored="true".  Unless those parameters are changed, if 
the field is defined for a document, it will typically be in search results.


https://lucene.apache.org/solr/guide/6_6/docvalues.html#DocValues-RetrievingDocValuesDuringSearch
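
If that copy of the field should stay out of results, useDocValuesAsStored can
be switched off for it, either in the schema file or (with a managed schema) via
the Schema API; a rough SolrJ sketch, with the core name and other attributes
assumed from your description:

// Untested sketch: turn off useDocValuesAsStored on the field that should not
// be returned. Core name and the other attributes are assumptions.
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class HideDocValuesField {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycore").build()) {   // core name assumed
      Map<String, Object> attrs = new LinkedHashMap<>();
      attrs.put("name", "IDX_ExpirationDate");
      attrs.put("type", "pdate");
      attrs.put("indexed", true);
      attrs.put("stored", false);
      attrs.put("useDocValuesAsStored", false);  // keep it out of search results
      new SchemaRequest.ReplaceField(attrs).process(client);
    }
  }
}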

Thanks,
Shawn


RE: SOLR 8.1.1 index on pdate field included in search results

2019-08-05 Thread Hodder, Rick
Hi Shawn,

>The DatePointField class defaults to docValues="true" and 
>useDocValuesAsStored="true".  Unless those parameters are changed, 
>if the field is defined for a document, it will typically be in search results.

Just checking, I'm fine with ExpirationDate appearing in the results, it's the 
index IDX_ExpirationDate that I don't want in the results. 

So you are saying that I should add  docValues="false" or 
docValuesAsStored="false" to the indexed but not stored field?:



I have other IDX_ fields defined that are not pdate and they don't appear in 
results, that's what's confusing me, for example:

   

Thanks,
Rick



RE: SOLR 8.1.1 index on pdate field included in search results

2019-08-05 Thread Hodder, Rick
You are right of course, Shawn.

I added useDocValuesAsStored="false" to the IDX_ExpirationDate field 
definition, and it no longer shows up

Thanks,
Rick

-Original Message-
From: Hodder, Rick 
Sent: Monday, August 05, 2019 2:02 PM
To: solr-user@lucene.apache.org
Subject: RE: SOLR 8.1.1 index on pdate field included in search results

Hi Shawn,

>The DatePointField class defaults to docValues="true" and 
>useDocValuesAsStored="true".  Unless those parameters are changed, if the 
>field is defined for a document, it will typically be in search results.

Just checking, I'm fine with ExpirationDate appearing in the results, it's the 
index IDX_ExpirationDate that I don't want in the results. 

So you are saying that I should add  docValues="false" or 
docValuesAsStored="false" to the indexed but not stored field?:



I have other IDX_ fields defined that are not pdate and they don't appear in 
results, that's what's confusing me, for example:

   

Thanks,
Rick



Re: NRT for new items in index

2019-08-05 Thread Updates Profimedia



On 2019/08/03 18:00:28, Furkan KAMACI  wrote: 
> Hi,
> 
> First of all, could you check here:
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> to
> better understand hard commits, soft commits and transaction logs to
> achieve NRT search.
> 
> Kind Regards,
> Furkan KAMACI
> 
> On Wed, Jul 31, 2019 at 3:47 PM profiuser  wrote:
> 
> > Hi,
> >
> > we have something about 400 000 000 items in a solr collection.
> > We have set the auto commit property for this collection to 15 minutes.
> > It is a big collection and we use some caches etc., therefore we have a big
> > autocommit value.
> >
> > This has the disadvantage that we don't have NRT searches.
> >
> > We would like to have NRT at least for searching the newly added items.
> >
> > We read about the new functionality "Category routed aliases" in Solr
> > version 8.1.
> >
> > And we got an idea: we could add a routing field to our collection schema.
> > At indexing time we check whether the item is new and set the routing field
> > to "new", or, if the item is older than some time period, we set the value
> > to "old".
> > And we will have one category routed alias, routedCollection, and there will
> > be 2 collections, old and new.
> >
> > If we index a new item, the router chooses the new collection and the item
> > is inserted into it. After some period we reindex the item and decide that
> > it is old, so we set the routing field to "old". The router decides to
> > update (insert) the item into the old collection. But we expected that Solr
> > would automatically check uniqueness across all routed collections, and
> > that if Solr found the item in another collection it would be automatically
> > deleted. But it is not!
> >
> > Is this expected behaviour?
> >
> > Could this functionality be used for our issue? Or could someone suggest
> > another solution which ensures that we have all new items ready for NRT
> > searches?
> >
> > Thanks for your help
> >
> >
> >
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >
> 

Hi,

we know this page, and we understand how commits and transaction logs work,
but as I said we have a very big index ;-) Therefore we cannot commit too often.
We must cache data for fast search, and if we commit too often, all the caches
get thrown away.

Now we have only one server, and we are preparing a new solution with SolrCloud,
where we would have several servers. We have limited resources and cannot afford
to have, for example, 20 Solr servers, which I believe is a standard solution
for big indexes.

Therefore we are looking for some compromise between price and performance, and
we are thinking about having more collections: one collection would be a daily
feed (small index) that we can commit every few seconds, and these collections
would be merged under the main collection alias.
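
A rough SolrJ sketch of that idea (untested; collection, alias and field names
are made up) -- since Solr will not deduplicate across collections, the cleanup
of the small collection has to be done explicitly:

// Untested sketch: a combined alias over a small, frequently committed "fresh"
// collection and a large "main" collection. Names are made up for illustration.
import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.common.SolrInputDocument;

public class FreshPlusMainAlias {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("zkhost:2181"), Optional.empty()).build()) {

      // One alias covering both collections, so searches see old and new items.
      CollectionAdminRequest.createAlias("items", "items_main,items_fresh")
          .process(client);

      // New items go to the small collection that can be committed every few seconds.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "12345");
      doc.addField("title_s", "new item");
      client.add("items_fresh", doc);
      client.commit("items_fresh");

      // Later, when the item is moved into the main collection, the copy in the
      // small collection must be deleted by hand -- Solr will not do this for us.
      client.add("items_main", doc);
      client.deleteById("items_fresh", "12345");
      client.commit("items_fresh");

      // Searches go through the alias and hit both collections.
      System.out.println(client.query("items", new SolrQuery("*:*"))
          .getResults().getNumFound());
    }
  }
}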

Do you have another idea?

Best





Re: NRT for new items in index

2019-08-05 Thread Jörn Franke
Do you have some more information on index and size? 

Do you have to store everything in the index? Can you store some data (blobs
etc.) outside?

I think you are generally right with your solution, but also be aware that it
is sometimes cheaper to have several servers instead of keeping an engineer
busy for some months to find a solution. I don’t say this is the case for your
solution, and I am also not a fan of throwing hardware at a problem, but an
engineer (even if it affects him/herself) should always make that decision.
That does not necessarily mean the engineer loses their job - they can
implement other valuable features for a customer.

> On 06.08.2019 at 08:21, Updates Profimedia wrote:
> 
> 
> 
>> On 2019/08/03 18:00:28, Furkan KAMACI  wrote: 
>> Hi,
>> 
>> First of all, could you check here:
>> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>> to
>> better understand hard commits, soft commits and transaction logs to
>> achieve NRT search.
>> 
>> Kind Regards,
>> Furkan KAMACI
>> 
>>> On Wed, Jul 31, 2019 at 3:47 PM profiuser  wrote:
>>> 
>>> Hi,
>>> 
>>> we have something about 400 000 000 items in a solr collection.
>>> We have set up auto commit property for this collection to 15 minutes.
>>> Is a big collection and we using some caches etc. Therefore we have big
>>> autocommit value.
>>> 
>>> This have disadvantage that we haven't NRT searches.
>>> 
>>> We would like to have NRT at least for searching for the newly added items.
>>> 
>>> We read about new functionality "Category routed alilases" in a solr
>>> version
>>> 8.1.
>>> 
>>> And we got an idea, that we could add to our collection schema field for
>>> routing.
>>> And at the time of indexing we check if item is new and to routing field we
>>> set up value "new", or the item is older than some time period we set up
>>> value to "old".
>>> And we will have one category routed alias routedCollection, and there will
>>> be 2 collections old and new.
>>> 
>>> If we index new item, router choose new collection and this item is
>>> inserted
>>> to it. After some period we reindex item and we decide that this item is
>>> old
>>> and to routing field we set up value "old". Router decide to update
>>> (insert)
>>> item to collection old. But we expect that solr automatically check
>>> uniqueness in all routed collections. And if solr found item in other
>>> collection, than will be automatically deleted. But not !!!
>>> 
>>> Is this expected behaviour?
>>> 
>>> Could be used this functionality for issue we have? Or could someone
>>> suggest
>>> another solution, which ensure that we have all new items ready for NRT
>>> searches?
>>> 
>>> Thanks for your help
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>> 
>> 
> 
> Hi,
> 
> we know this page, and we understand how commits and transaction logs works, 
> but as I said we have a very big index size ;-) Therefore we cannot create 
> commits to often.
> We must cache data for fast search, and if we will commit to often, then we 
> can any cache throw out.
> 
> Now we have only one server, and we prepare new solution with Solr Cloud. 
> Where we would have several servers. We have limited resources and we cannot 
> afford to have for example 20 Solr servers, which I believe is a standard 
> solution for big indexes.
> 
> Therefore we search for some compromise between price/performance. Therefore 
> we think about have more collections. And one collection would be a daily 
> feed (small index) and then we can commit every several seconds. And these 
> collections would be merge to main collection alias.
> 
> Do you have another idea?
> 
> Best
> 
> 
>