Re: Authentication between solr-exporter and solrcloud

2018-08-15 Thread Dwane Hall
Hi Sushant,

I had the same issue and unfortunately the exporter does not appear to support 
a secure cluster.  I raised a JIRA feature request so please upvote it as this 
will increase the chances of it being included in a future release.

https://issues.apache.org/jira/browse/SOLR-12584

Thanks

From: Sushant Vengurlekar 
Sent: Wednesday, 15 August 2018 10:39 PM
To: solr-user@lucene.apache.org
Subject: Authentication between solr-exporter and solrcloud

I have followed this guide for monitoring the solrcloud
https://lucene.apache.org/solr/guide/7_3/monitoring-solr-with-prometheus-and-grafana.html

I have basic authentication enabled for the solrcloud. How do I configure
the solr-exporter to authenticate with the configured username and password?

Thank you


Solr range faceting

2018-09-06 Thread Dwane Hall
Good morning Solr community.  I'm having a few facet range issues for which I'd 
appreciate some advice when somebody gets a spare couple of minutes.

Environment
Solr Cloud (7.3.1)
Single Shard Index, No replicas

Facet Configuration (I'm using the request params API and useParams at runtime)
"facet":"true",
"facet.mincount":1,
"facet.missing":"false",
"facet.range":"Value"
"f.Value.facet.range.start":0.0,
"f.Value.facet.range.end":2000.0,
"f.Value.facet.range.gap":100,
"f.Value.facet.range.include":"edge",
"f.Value.facet.range.other":"all",

My problem
With my range facet configuration I'm expecting to see a facet range entry for 
every 'step' (100 in my case) between my facet.range.start and facet.range.end 
settings, something like 0.0, 100.0, 200.0, ... 2000.0, each with a count of the 
values that fall within that step.  This does not appear to be the case, and in 
some instances I don't get counts for certain range steps (800.0 and 1000.0, for 
example, are present in my result set below but no range facet buckets are 
returned for them).

Am I completely misunderstanding how range facets are supposed to work or is my 
configuration a little askew?

Any advice would be greatly appreciated.

The Solr Response
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":121},

  "response":{"numFound":869,"start":0,"docs":[
  {
"Value":9475.08},
  {
"Value":780.0},
  {
"Value":1000.0},
  {
"Value":50.0},
  {
"Value":50.0},
  {
"Value":0.0},
  {
"Value":800.0},
  {
"Value":0.0},
  {
"Value":1000.0},
  {
"Value":1000.0},
  {
"Value":5000.0},
  {
"Value":2000.0},
  {
"Value":4000.0},
  {
"Value":1500.0},
  {
"Value":0.0},
  {
"Value":1.0},
  {
"Value":1000.0}]
  },
  "facet_counts":{
"facet_ranges":{
  "Value":{
"counts":[
  "0.0",9,
  "400.0",80,
  "700.0",69,
  "1900.0",9],
"gap":100.0,
"before":0,
"after":103,
"between":766,
"start":0.0,
"end":2000.0}}

Cheers,

Dwane


Re: Solr range faceting

2018-09-06 Thread Dwane Hall
Thanks Jan, that has fixed the bucket issue, but I'm a little confused as to why 
zero counts exist for some buckets when there appear to be values in them?

"response":{"numFound":869,"start":0,"docs":[
  {
"Value":9475.08},
  {
"Value":780.0},
  {
"Value":9475.08},
  {
"Value":1000.0},
  {
"Value":50.0},
  {
"Value":50.0},
  {
"Value":0.0},
  {
"Value":800.0},
  {
"Value":0.0},
  {
"Value":1000.0},
  {
"Value":1000.0},
  {
"Value":5000.0},
  {
"Value":2000.0},
  {
   "Value":4000.0},
  {
"Value":1500.0},
  {
"Value":0.0},
  {
"Value":1.0},
  {
"Value":5000.0},
  {
"Value":1000.0},
  {
"Value":0.0},
  {
"Value":1200.0},
  {
"Value":9000.0},
  {
"Value":1500.0},
  {
"Value":1.0},
  {
"Value":5000.0},
  {
"Value":4000.0},
  {
"Value":5000.0},
  {
"Value":5000.0},
  {
"Value":1.0},
  {
"Value":1000.0}]
  },

  "facet_counts":{
"facet_queries":{},
"facet_ranges":{
  "Value":{
"counts":[
  "0.0",9,
  "100.0",0,
  "200.0",0,
  "300.0",0,
  "400.0",80,
  "500.0",0,
  "600.0",0,
  "700.0",69,
  "800.0",0,
  "900.0",0,
  "1000.0",0,
  "1100.0",0,
  "1200.0",0,
  "1300.0",0,
  "1400.0",0,
  "1500.0",0,
  "1600.0",0,
  "1700.0",0,
  "1800.0",0,
  "1900.0",9],
"gap":100.0,
"before":0,
"after":103,
"between":766,
"start":0.0,
"end":2000.0}

Cheers,

Dwane

From: Jan Høydahl 
Sent: Friday, 7 September 2018 9:23:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr range faceting

Try facet.minCount=0

Jan

> 7. sep. 2018 kl. 01:07 skrev Dwane Hall :
>
> Good morning Solr community.  I'm having a few facet range issues for which 
> I'd appreciate some advice when somebody gets a spare couple of minutes.
>
> Environment
> Solr Cloud (7.3.1)
> Single Shard Index, No replicas
>
> Facet Configuration (I'm using the request params API and useParams at 
> runtime)
> "facet":"true",
> "facet.mincount":1,
> "facet.missing":"false",
> "facet.range":"Value"
> "f.Value.facet.range.start":0.0,
> "f.Value.facet.range.end":2000.0,
> "f.Value.facet.range.gap":100,
> "f.Value.facet.range.include":"edge",
> "f.Value.facet.range.other":"all",
>
> My problem
> With my range facet configuration I'm expecting to see a facet range entry 
> for every 'step' (100 in my case) between my facet.range.start and 
> facet.range.end settings. Something like the following 0.0,100.0,200.0, 
> ...2000.0 with a sum of the number of values that occur between each range 
> step.  This does not appear to be the case and in some instances I don't 
> appear to get counts for some range steps (800.0 and 1000.0 for example are 
> present in my result set range below but I don't get a range value facets for 
> these values?)
>
> Am I completely misunderstanding how range facets are supposed to work or is 
> my configuration a little askew?
>
> Any advice would be greatly appreciated.
>
> The Solr Response
> "responseHeader":{
>"zkConnected":true,
>"status":0,
>"QTime":121},
>
>  "response":{"numFound":869,"start":0,"docs":[
>  {
>"Value":9475.08},
>  {
>"Value":780.0},
>  {
>"Value":1000.0},
>  {
>"Value":50.0},
>  {
>"Value":50.0},
>  {
>"Value":0.0},
>  {
>"Value":800.0},
>  {
>"Value":0.0},
>  {
>"Value":1000.0},
>  {
>"Value":1000.0},
>  {
>"Value":5000.0},
>  {
>"Value":2000.0},
>  {
>"Value":4000.0},
>  {
>"Value":1500.0},
>  {
>"Value":0.0},
>  {
>"Value":1.0},
>  {
>"Value":1000.0}]
>  },
>  "facet_counts":{
>"facet_ranges":{
>  "Value":{
>"counts":[
>  "0.0",9,
>  "400.0",80,
>  "700.0",69,
>  "1900.0",9],
>"gap":100.0,
>"before":0,
>"after":103,
>"between":766,
>"start":0.0,
>"end":2000.0}}
>
> Cheers,
>
> Dwane


Re: Solr range faceting

2018-09-07 Thread Dwane Hall
Thanks Erick,

The field is defined as a pfloat.



I took your advice and tried a smaller result range and the counts look good.  I 
might try an index rebuild; I’m wondering if the data has somehow been corrupted 
by a combination of old and new index mappings.

Thanks again for your assistance.


"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":7},
  "response":{"numFound":3,"start":0,"docs":[
  {
"Value":34.0},
  {
"Value":34.0},
  {
"Value":34.0}]
  },
  "facet_counts":
"facet_queries":{},
"facet_ranges":{
  "Value:{
"counts":[
  "0.0",3],
"gap":100.0,
"before":0,
"after":0,
"between":3,
"start":0.0,
"end":2000.0}}

From: Erick Erickson 
Sent: Friday, 7 September 2018 12:54:48 PM
To: solr-user
Subject: Re: Solr range faceting

Indeed this doesn't look right. By my count, you're missing 599 counts
you'd expect in that range, although the after and between numbers
total the numFound.

What kind of a field is Value? Given the number of docs missing, I'd
guess you could get the number of docs down really small and post
them. Something like
values 1, 2, 3, 4, 5, 
and your range query so we could try it.

What is the fieldType definition and field for Value?

And finally, do you get different results if you use json faceting?

Best,
Erick
On Thu, Sep 6, 2018 at 5:51 PM Dwane Hall  wrote:
>
> Thanks Jan that has fixed the bucket issue but I'm a little confused at why 
> zero counts exist for some buckets when they appear to be values in them?
>
> "response":{"numFound":869,"start":0,"docs":[
>   {
> "Value":9475.08},
>   {
> "Value":780.0},
>   {
> "Value":9475.08},
>   {
> "Value":1000.0},
>   {
> "Value":50.0},
>   {
> "Value":50.0},
>   {
> "Value":0.0},
>   {
> "Value":800.0},
>   {
> "Value":0.0},
>   {
> "Value":1000.0},
>   {
> "Value":1000.0},
>   {
> "Value":5000.0},
>   {
> "Value":2000.0},
>   {
>"Value":4000.0},
>   {
> "Value":1500.0},
>   {
> "Value":0.0},
>   {
> "Value":1.0},
>   {
> "Value":5000.0},
>   {
> "Value":1000.0},
>   {
> "Value":0.0},
>   {
> "Value":1200.0},
>   {
> "Value":9000.0},
>   {
> "Value":1500.0},
>   {
> "Value":1.0},
>   {
> "Value":5000.0},
>   {
> "Value":4000.0},
>   {
> "Value":5000.0},
>   {
> "Value":5000.0},
>   {
> "Value":1.0},
>   {
> "Value":1000.0}]
>   },
>
>   "facet_counts":{
> "facet_queries":{},
> "facet_ranges":{
>   "Value":{
> "counts":[
>   "0.0",9,
>   "100.0",0,
>   "200.0",0,
>   "300.0",0,
>   "400.0",80,
>   "500.0",0,
>   "600.0",0,
>   "700.0",69,
>   "800.0",0,
>   "900.0",0,
>   "1000.0",0,
>   "1100.0",0,
>   "1200.0",0,
>   "1300.0",0,
>   "1400.0",0,
>   "1500.0",0,
>   "1600.0",0,
>   "1700.0",0,
>   "1800.0",0,
>   "1900.0",9],
> "gap":100.0,
> "before":0,
> "after":103,
> "between":766,
> "start":0.0,
> "end":2000.0}
>
> Cheers,
>
> Dwane
> 
> From: Jan Høydahl 
> Sent: Friday, 7 September 2018 9:23:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr range faceting
>
> Try facet.minCount=0
>
> Jan
>
> > 7. sep. 20

switch query parser and solr cloud

2018-09-12 Thread Dwane Hall
Good afternoon Solr brains trust. I'm seeking some community advice if somebody 
can spare a minute from their busy schedules.

I'm attempting to use the switch query parser to influence client search 
behaviour based on a client specified request parameter.

Essentially I want the following to occur:

-A user has the option to pass through an optional request parameter 
"allResults" to solr
-If "allResults" is true then return all matching query records by appending a 
filter query for all records (fq=*:*)
-If "allResults" is empty then apply a filter using the collapse query parser 
({!collapse field=SUMMARY_FIELD})

Environment
Solr 7.3.1 (1 solr node DEV, 4 solr nodes PTST)
4 shard collection

My Implementation
I'm using the switch query parser to choose client behaviour by appending a 
filter query to the user request very similar to what is documented in the solr 
reference guide here 
(https://lucene.apache.org/solr/guide/7_4/other-parsers.html#switch-query-parser)

The request uses the params API (the pertinent line below is the _appends_ filter 
query) and is invoked with useParams=firstParams,secondParams.

  "set":{
"firstParams":{
"op":"AND",
"wt":"json",
"start":0,
"allResults":"false",
"fl":"FIELD_1,FIELD_2,SUMMARY_FIELD",
  "_appends_":{
"fq":"{!switch default=\"{!collapse field=SUMMARY_FIELD}\" 
case.true=*:* v=${allResults}}",
  },
  "_invariants_":{
"deftype":"edismax",
"timeAllowed":2,
"rows":"30",
"echoParams":"none",
}
  }
   }

   "set":{
"secondParams":{
"df":"FIELD_1",
"q":"{!edismax v=${searchString} df=FIELD_1 q.op=${op}}",
  "_invariants_":{
"qf":"FIELD_1,FIELD_2,SUMMARY_FIELD",
}
  }
   }}
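
At query time the request that exercises this looks roughly like the following 
(host and collection name are placeholders):

curl "http://localhost:8983/solr/my_collection/select" \
  --data-urlencode "useParams=firstParams,secondParams" \
  --data-urlencode "searchString=some search terms" \
  --data-urlencode "allResults=true"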

Everything works nicely until I move from a single node Solr instance (DEV) to a 
clustered Solr instance (PTST), at which point I receive a null pointer exception 
from Solr which I'm having trouble picking apart.  I've co-located the Solr 
documents using document routing, which appears to be the only requirement for 
the collapse query parser's use.

Does anyone know if the switch query parser has any limitations in a sharded 
solr cloud environment or can provide any possible troubleshooting advice?

Any community recommendations would be greatly appreciated

Solr stack trace
2018-09-12 12:16:12,918 4064160860 ERROR : [c:my_collection s:shard1 
r:core_node3 x:my_collection_ptst_shard1_replica_n1] 
org.apache.solr.common.SolrException : 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at https://myserver:1234/solr/my_collection_ptst_shard2_replica_n2: 
java.lang.NullPointerException
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
at 
org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Thanks for taking the time to assist,

Dwane


Re: switch query parser and solr cloud

2018-09-12 Thread Dwane Hall
Thanks for the suggestions and responses Erick and Shawn.  Erick, I only return 
30 records irrespective of the query (not the entire payload); I removed some of 
my configuration settings for readability.  The parameter "allResults" was a 
little misleading, I apologise for that, but I appreciate your input.

Shawn thanks for your comments. Regarding the switch query parser the Hossman 
has a great description of its use and application here 
(https://lucidworks.com/2013/02/20/custom-solr-request-params/).  PTST is just 
our performance testing environment and is not important in the context of the 
question other than it being a multi node solr environment.  The server side 
error was the null pointer which is why I was having a few difficulties 
debugging it as there was not a lot of info to troubleshoot.  I'll keep playing 
and explore the client filter option for addressing this issue.

Thanks again for both of your input

Cheers,

Dwane

From: Erick Erickson 
Sent: Thursday, 13 September 2018 12:20 AM
To: solr-user
Subject: Re: switch query parser and solr cloud

You will run into significant problems if, when returning "all
results", you return large result sets. For regular queries I like to
limit the return to 100, although 1,000 is sometimes OK.

Millions will blow you out of the water, use CursorMark or Streaming
for very large result sets. CursorMark gets you a page at a time, but
efficiently and Streaming doesn't consume huge amounts of memory.

And assuming you could possibly return 1M rows, say, what would the
user do with it? Displaying in a browser is problematic for instance.

Best,
Erick
On Wed, Sep 12, 2018 at 5:54 AM Shawn Heisey  wrote:
>
> On 9/12/2018 5:47 AM, Dwane Hall wrote:
> > Good afternoon Solr brains trust I'm seeking some community advice if 
> > somebody can spare a minute from their busy schedules.
> >
> > I'm attempting to use the switch query parser to influence client search 
> > behaviour based on a client specified request parameter.
> >
> > Essentially I want the following to occur:
> >
> > -A user has the option to pass through an optional request parameter 
> > "allResults" to solr
> > -If "allResults" is true then return all matching query records by 
> > appending a filter query for all records (fq=*:*)
> > -If "allResults" is empty then apply a filter using the collapse query 
> > parser ({!collapse field=SUMMARY_FIELD})
>
> I'm looking at the documentation for the switch parser and I'm having
> difficulty figuring out what it actually does.
>
> This is the kind of thing that is better to handle in your client
> instead of asking Solr to do it for you.  You'd have to have your code
> construct the complex localparam for the switch parser ... it would be
> much easier to write code to insert your special collapse filter when it
> is required.
>
> > Everything works nicely until I move from a single node solr instance (DEV) 
> > to a clustered solr instance (PTST) in which I receive a null pointer 
> > exception from Solr which I'm having trouble picking apart.  I've 
> > co-located the solr documents using document routing which appear to be the 
> > only requirement for the collapse query parser's use.
>
> Some features break down when working with sharded indexes.  This is one
> of the reasons that sharding should only be done when it is absolutely
> required.  A single-shard index tends to perform better anyway, unless
> it's really really huge.
>
> The error is a remote exception, from
> https://myserver:1234/solr/my_collection_ptst_shard2_replica_n2. Which
> suggests that maybe not all your documents are co-located on the same
> shard the way you think they are.  Is this a remote server/shard?  I am
> completely guessing here.  It's always possible that you've encountered
> a bug.  Does this one (not fixed) look like it might apply?
>
> https://issues.apache.org/jira/browse/SOLR-9104
>
> There should be a server-side error logged by the Solr instance running
> on myserver:1234 as well.  Have you looked at that?
>
> I do not know what PTST means.  Is that important for me to understand?
>
> Thanks,
> Shawn
>


Re: switch query parser and solr cloud

2018-09-13 Thread Dwane Hall
Afternoon all,

Just to add some closure to this topic in case anybody else stumbles across a 
similar problem: I've managed to resolve my issue by removing the switch query 
parser from the _appends_ component of the parameter set.

so the parameter set changes from this

 "set":{
"firstParams":{
"op":"AND",
"wt":"json",
"start":0,
"allResults":"false",
"fl":"FIELD_1,FIELD_2,SUMMARY_FIELD",
  "_appends_":{
"fq":"{!switch default=\"{!collapse field=SUMMARY_FIELD}\" 
case.true=*:* v=${allResults}}",
  },

to just a regular old filter query

 "set":{
"firstParams":{
"op":"AND",
"wt":"json",
"start":0,
"allResults":"false",
"fl":"FIELD_1,FIELD_2,SUMMARY_FIELD",
"fq":"{!switch default=\"{!collapse field=SUMMARY_FIELD}\" 
case.true=*:* v=${allResults}}",

Somewhat odd.
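
For anyone wanting to apply the same change, the updated parameter set is pushed 
with the Request Parameters API along these lines (collection name is a 
placeholder):

curl -X POST -H 'Content-type: application/json' \
  "http://localhost:8983/solr/my_collection/config/params" -d '{
  "set": {
    "firstParams": {
      "op": "AND",
      "wt": "json",
      "start": 0,
      "allResults": "false",
      "fl": "FIELD_1,FIELD_2,SUMMARY_FIELD",
      "fq": "{!switch default=\"{!collapse field=SUMMARY_FIELD}\" case.true=*:* v=${allResults}}"
    }
  }
}'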

Thanks again to Erick and Shawn for taking the time to assist and talk this 
through.

Dwane

From: Dwane Hall 
Sent: Thursday, 13 September 2018 6:42 AM
To: Erick Erickson; solr-user@lucene.apache.org
Subject: Re: switch query parser and solr cloud

Thanks for the suggestions and responses Erick and Shawn.  Erick, I only return 
30 records irrespective of the query (not the entire payload); I removed some of 
my configuration settings for readability.  The parameter "allResults" was a 
little misleading, I apologise for that, but I appreciate your input.

Shawn thanks for your comments. Regarding the switch query parser the Hossman 
has a great description of its use and application here 
(https://lucidworks.com/2013/02/20/custom-solr-request-params/).  PTST is just 
our performance testing environment and is not important in the context of the 
question other than it being a multi node solr environment.  The server side 
error was the null pointer which is why I was having a few difficulties 
debugging it as there was not a lot of info to troubleshoot.  I'll keep playing 
and explore the client filter option for addressing this issue.

Thanks again for both of your input

Cheers,

Dwane

From: Erick Erickson 
Sent: Thursday, 13 September 2018 12:20 AM
To: solr-user
Subject: Re: switch query parser and solr cloud

You will run into significant problems if, when returning "all
results", you return large result sets. For regular queries I like to
limit the return to 100, although 1,000 is sometimes OK.

Millions will blow you out of the water, use CursorMark or Streaming
for very large result sets. CursorMark gets you a page at a time, but
efficiently and Streaming doesn't consume huge amounts of memory.

And assuming you could possibly return 1M rows, say, what would the
user do with it? Displaying in a browser is problematic for instance.

Best,
Erick
On Wed, Sep 12, 2018 at 5:54 AM Shawn Heisey  wrote:
>
> On 9/12/2018 5:47 AM, Dwane Hall wrote:
> > Good afternoon Solr brains trust I'm seeking some community advice if 
> > somebody can spare a minute from their busy schedules.
> >
> > I'm attempting to use the switch query parser to influence client search 
> > behaviour based on a client specified request parameter.
> >
> > Essentially I want the following to occur:
> >
> > -A user has the option to pass through an optional request parameter 
> > "allResults" to solr
> > -If "allResults" is true then return all matching query records by 
> > appending a filter query for all records (fq=*:*)
> > -If "allResults" is empty then apply a filter using the collapse query 
> > parser ({!collapse field=SUMMARY_FIELD})
>
> I'm looking at the documentation for the switch parser and I'm having
> difficulty figuring out what it actually does.
>
> This is the kind of thing that is better to handle in your client
> instead of asking Solr to do it for you.  You'd have to have your code
> construct the complex localparam for the switch parser ... it would be
> much easier to write code to insert your special collapse filter when it
> is required.
>
> > Everything works nicely until I move from a single node solr instance (DEV) 
> > to a clustered solr instance (PTST) in which I receive a null pointer 
> > exception from Solr which I'm having trouble picking apart.  I've 
> > co-located the solr documents using document routing which appear to be the 
> > only requirement for the collapse query parser's use.
>
> Some features break down when working

Exporting results and schema design

2018-11-14 Thread Dwane Hall
Good afternoon Solr community,

I have a situation where I require the following solr features.

1.   Highlighting must be available for the matched search results

2.   After a user performs a regular solr search (/select, rows=10) I 
require a drill down which has the potential to export a large number of 
results (1000s +).


Here is how I’ve approached each of the two problems above


1. Highlighting must be available for the matched search results

My implementation is in line with the recommended approach: a stored=false, 
indexed=true copy field, with the individual fields to highlight analysed, 
stored=true, and indexed=false.

[schema field definitions stripped by the mailing list archive]

2. After a user performs a regular solr search (/select, rows=10) I require a 
drill down which has the potential to export a large number of results (1000s+ 
with no searching required over the fields)


From all the documentation, the recommended approach for returning large result 
sets is the /export request handler.  As none of my fields currently qualify for 
use with the /export handler (i.e. docValues=true), is my only option to have 
additional duplicated fields mapped as strings so they can be used in the export 
process?

i.e. using my example above, the managed-schema now becomes

[expanded schema definitions stripped by the mailing list archive]

If I did not require highlighting I could change the initial mapped fields 
(First_Names, Last_Names) from type=text_general to type=string and save the 
additional storage in the index, but in my current situation I can’t see a way 
around having to duplicate all the fields required for the /export handler as 
strings.  Is this how people typically handle this problem, or am I completely 
off the mark with my design?
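
To make the question concrete, the duplicated export fields I'm considering 
would be added along these lines (a sketch via the Schema API; the *_str names 
are just placeholders):

curl -X POST -H 'Content-type: application/json' \
  "http://localhost:8983/solr/my_collection/schema" -d '{
  "add-field": [
    {"name": "First_Names_str", "type": "string", "docValues": true,
     "indexed": false, "stored": false, "multiValued": true},
    {"name": "Last_Names_str",  "type": "string", "docValues": true,
     "indexed": false, "stored": false, "multiValued": true}
  ],
  "add-copy-field": [
    {"source": "First_Names", "dest": "First_Names_str"},
    {"source": "Last_Names",  "dest": "Last_Names_str"}
  ]
}'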


Any advice would be greatly appreciated,


Thanks


Dwane


Re: Exporting results and schema design

2018-11-15 Thread Dwane Hall

Thanks Erick, that's great advice as always and it's very much appreciated.  I've 
never seen an example of that pattern 
(stored=false, indexed=false, useDocValuesAsStored=true) used on any of the 
fantastic Solr blogs I've read, and I've read a lot of them many times (all of 
your excellent Lucidworks posts, Yonik's personal blog, Rafal's Sematext posts, 
Toke from the RDL, and most of the Lucene/Solr Revolution YouTube clips, etc.).  I 
found the relevant section in the Solr doco and it's really a little gem of a 
pattern; I did not know docValues could provide that feature, and I've always 
believed I needed stored=true if I required the fields returned.
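
For anyone else reading, the pattern ends up looking something like this on a 
field definition (a sketch via the Schema API; the field name is illustrative):

curl -X POST -H 'Content-type: application/json' \
  "http://localhost:8983/solr/my_collection/schema" -d '{
  "add-field": {
    "name": "Last_Names_export",
    "type": "string",
    "indexed": false,
    "stored": false,
    "docValues": true,
    "useDocValuesAsStored": true,
    "multiValued": true
  }
}'

With useDocValuesAsStored=true the values still come back in fl (and are usable 
by /export) without paying the stored-field decompression cost Erick describes.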

Many thanks once again for taking the time to respond,

Dwane


From: Erick Erickson 
Sent: Thursday, 15 November 2018 1:55 PM
To: solr-user
Subject: Re: Exporting results and schema design

Well, docValues doesn't necessarily waste much index space if you
don't store the field and useDocValuesAsStored. It also won't beat up
your machine as badly if you fetch all your fields from DV fields. To
fetch a stored field, you need to

> seek to the stored data on disk
> decompress a 16K block minimum
> fetch the stored fields.

So using docvalues rather than stored for "1000s" of rows will avoid that cycle.

You can use the cursorMark to page efficiently, your middleware would
have to be in charge of that.
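Something like the following for the cursorMark loop (untested sketch, assuming
your uniqueKey is "id"):

# first page
curl "http://localhost:8983/solr/my_collection/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "rows=1000" \
  --data-urlencode "sort=id asc" \
  --data-urlencode "cursorMark=*"
# then re-issue the same request with cursorMark set to the nextCursorMark
# value from the previous response, until nextCursorMark stops changing.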

Best,
Erick
On Wed, Nov 14, 2018 at 6:35 PM Dwane Hall  wrote:
>
> Good afternoon Solr community,
>
> I have a situation where I require the following solr features.
>
> 1.   Highlighting must be available for the matched search results
>
> 2.   After a user performs a regular solr search (/select, rows=10) I 
> require a drill down which has the potential to export a large number of 
> results (1000s +).
>
>
> Here is how I’ve approached each of the two problems above
>
>
> 1.Highlighting must be available for the matched search results
>
> My implementation is in pattern with the recommend approach.  A stored=false 
> indexed=true copy field with the individual fields to highlight analysed, 
> stored=true, indexed=false.
>
>  multiValued="true"/>
>
>  multiValued="true"/>
>
> 
>
> 
>
>  multiValued="true"/>
>
>
> 2.After a user performs a regular solr search (/select, rows=10) I require a 
> drill down which has the potential to export a large number of results 
> (1000s+ with no searching required over the fields)
>
>
> From all the documentation the recommended approach for returning large 
> result sets is using the /export request handler.  As none of my fields 
> qualify for using then /export handler (i.e. docValues=true) is my only 
> option to have additional duplicated fields mapped as strings so they can be 
> used in the export process?
>
> i.e. using my example above now managed-schema now becomes
>
>  multiValued="true"/>
>
>  multiValued="true"/>
>
> 
>
> 
>
>  multiValued="true"/>
>
>
> 
>
> 
>
>  multiValued="true"/>
>
>  multiValued="true"/>
>
>
> If I did not require highlighting I could change the initial mapped fields 
> (First_Names, Last_Names) from type=text_general to type=string and save the 
> additional storage in the index but in my current situation I can’t see a way 
> around having to duplicate all the fields required for the /export handler as 
> strings. Is this how people typically handle this problem or am I completely 
> off the mark with my design?
>
>
> Any advice would be greatly appreciated,
>
>
> Thanks
>
>
> Dwane


Solr admin UI new features

2019-01-23 Thread Dwane Hall
Hi user community,


I recently upgraded a single node solr cloud environment from 7.3.1 to 7.6. 
While traversing through the release notes for solr 7.5 to identify any 
important changes to consider for the upgrade I noticed two excellent additions 
to the Admin UI that took effect in solr 7.5 (SOLR-8207 – Add Nodes view to 
Admin UI “Cloud” tab and SOLR-7767 ZK Status under “Cloud” tab).  After 
completing my upgrade all collections are online and functioning as expected and 
Solr is working without issue; however, these two new menu items do not appear to 
work (the urls are hit, https://server:port/solr/#/~cloud?view=nodes and 
https://server:port/solr/#/~cloud?view=zkstatus, but the pages are blank).  The 
original menu items all function without issue (Tree, Graph, Graph (Radial)).

I’ve cleared my browser cache and checked the logs which are all clean (with 
the log level set to DEBUG on org.apache.jetty.server.*).  Are there any 
additional configuration changes I’m overlooking that I need to take advantage 
of these new features?


Environment

Chrome 70, and Firefox 45
Solr 7.6 (Cloud, single node)
Https, basic auth plugin enabled
Zookeeper 3.4.6

As always any advice is appreciated,

Thanks

Dwane


Re: Solr admin UI new features

2019-01-23 Thread Dwane Hall
Thanks Erick, very helpful as always ...we're up and going now.  Before the 
install I spun up a standalone instance to check compatibility and, from the 
looks of things, the process did not shut down cleanly.  I'm guessing Solr was a 
little confused about which instance of ZooKeeper to use (the bundled version or 
our production instance).  Thanks again for your assistance, it's very much 
appreciated.

Dwane

From: Erick Erickson 
Sent: Thursday, 24 January 2019 1:23:15 PM
To: solr-user
Subject: Re: Solr admin UI new features

Hmmm, is there any chance you somehow have your Solr instances
pulling some things, particularly browser-related from your old
install? Or from some intermediate cache between your browser and the
Solr instances? Or perhaps "something got copied somewhere" and is
being picked up from the old install? I'm really grasping at straws
here admittedly

Here's what I'd do. Install a fresh Solr 7.6 somewhere, your laptop
would be fine, some node in your system that doesn't have anything on
it you can use temporarily, I don't really care as long as you can
guarantee your new install is the only Solr install being referenced
and you can point it at your production ZooKeeper ensemble. Do you
still have the same problem? If not, I'd guess that your production
system has somehow mixed-and-matched...

Best,
Erick

On Wed, Jan 23, 2019 at 4:36 PM Dwane Hall  wrote:
>
> Hi user community,
>
>
> I recently upgraded a single node solr cloud environment from 7.3.1 to 7.6. 
> While traversing through the release notes for solr 7.5 to identify any 
> important changes to consider for the upgrade I noticed two excellent 
> additions to the Admin UI that took effect in solr 7.5 (SOLR-8207 – Add Nodes 
> view to Admin UI “Cloud” tab and SOLR-7767 ZK Status under “Cloud” tab).
> After completing my upgrade all collections are online and functioning as 
> expected and solr is working without issue however these two new menu items 
> do not appear to work (the urls are hit 
> https://server:port/solr/#/~cloud?view=nodes, 
> https://server:port/solr/#/~cloud?view=zkstatus) but the pages are blank.  
> The original menu items all function without issue (Tree, Graph, Graph 
> (Radial)).
>
> I’ve cleared my browser cache and checked the logs which are all clean (with 
> the log level set to DEBUG on org.apache.jetty.server.*).  Are there any 
> additional configuration changes I’m overlooking that I need to take 
> advantage of these new features?
>
>
> Environment
>
> Chrome 70, and Firefox 45
> Solr 7.6 (Cloud, single node)
> Https, basic auth plugin enabled
> Zookeeper 3.4.6
>
> As always any advice is appreciated,
>
> Thanks
>
> Dwane


Delete by id

2019-02-11 Thread Dwane Hall
Hey Solr community,

I’m having an issue deleting documents from my Solr index and am seeking some 
community advice when somebody gets a spare minute.  It seems like a really 
simple problem: a requirement to delete a document by its id.

Here’s how my documents are mapped in solr

DOC_ID


My json format to delete the document (all looks correct according to 
https://lucene.apache.org/solr/guide/7_6/uploading-data-with-index-handlers.html
 “The JSON update format allows for a simple delete-by-id. The value of a 
delete can be an array which contains a list of zero or more specific document 
id’s (not a range) to be deleted. For example, a single document”)

Attempt 1 – “shorthand”
{“delete”:”123!12345”}

Attempt 2 – “longhand”
{“delete”:“DOC_ID”:”123!12345”}
{“delete”:{“DOC_ID”:”123!12345”}}

..the error is the same in all instances “org.apache.solr.common.SolrException: 
Document is missing mandatory uniqueKey field: DOC_ID”

Can anyone see any obvious details I’m overlooking?

I’ve tried all the update handlers below (both curl and through admin ui)

/update/
/update/json
/update/json/docs

My environment
Solr cloud 7.6
Single node

As always any advice would be greatly appreciated,

Thanks,

Dwane


Re: Delete by id

2019-02-15 Thread Dwane Hall
Thanks Matt,

I was thinking the same, that Solr is treating it as an update, not a delete. 
Sorry about the second "longhand" example, yes that was a copy-paste issue and 
the format is incorrect; I was playing around with a few options for the JSON 
format.  I'll keep testing.  The only difference I could see between our examples 
was that you keep your unique id field called "id" rather than a custom value 
("DOC_ID" in my instance).  It seems minor, but I've run out of other ideas and 
am fishing at the moment.
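
For reference, the exact command I'm retesting with, modelled on your example 
(straight quotes this time; collection name is a placeholder):

curl -X POST -H 'Content-type: application/json' \
  "http://localhost:8983/solr/my_collection/update?commit=true" \
  -d '{"delete":"123!12345"}'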

Thanks again,

Dwane



From: Matt Pearce 
Sent: Wednesday, 13 February 2019 10:40 PM
To: solr-user@lucene.apache.org
Subject: Re: Delete by id

Hi Dwane,

The error suggests that Solr is trying to add a document, rather than
delete one, and is complaining that the DOC_ID is missing.

I tried each of your examples (without the smart quotes), and they all
worked as expected, both from curl and the admin UI. There's an error in
your longhand example, which should read
{ "delete": { "id": "123!12345" }}
However, even using your example, I didn't get a complaint about the
field being missing.

Using curl, my command was:
curl -XPOST -H 'Content-type: application/json'
http://localhost:8983/solr/testCollection/update -d '{ "delete":
"123!12345" }'

Are you doing anything differently from that?

Thanks,
Matt


On 11/02/2019 23:24, Dwane Hall wrote:
> Hey Solr community,
>
> I’m having an issue deleting documents from my Solr index and am seeking some 
> community advice when somebody gets a spare minute. It seems really like a 
> really simple problem …a requirement to delete a document by its id.
>
> Here’s how my documents are mapped in solr
>
> DOC_ID
>  required="true" multiValued="false" />
>
> My json format to delete the document (all looks correct according to 
> https://lucene.apache.org/solr/guide/7_6/uploading-data-with-index-handlers.html
>  “The JSON update format allows for a simple delete-by-id. The value of a 
> delete can be an array which contains a list of zero or more specific 
> document id’s (not a range) to be deleted. For example, a single document”)
>
> Attempt 1 – “shorthand”
> {“delete”:”123!12345”}
>
> Attempt 2 – “longhand”
> {“delete”:“DOC_ID”:”123!12345”}
> {“delete”:{“DOC_ID”:”123!12345”}}
>
> ..the error is the same in all instances 
> “org.apache.solr.common.SolrException: Document is missing mandatory 
> uniqueKey field: DOC_ID”
>
> Can anyone see any obvious details I’m overlooking?
>
> I’ve tried all the update handlers below (both curl and through admin ui)
>
> /update/
> /update/json
> /update/json/docs
>
> My environment
> Solr cloud 7.6
> Single node
>
> As always any advice would be greatly appreciated,
>
> Thanks,
>
> Dwane
>

--
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk<http://www.flax.co.uk>


edismax behaviour?

2019-03-04 Thread Dwane Hall
Good afternoon solr community. I'm having an issue debugging an edismax query 
which appears to behave differently across two separate collections with what I 
believe to be the same default query parameters. Each collection contains 
different data but is configured using similar default query parameters.


Environment
Solr cloud 7.6
Params API used to set query defaults (params.json) (solrconfig.xml request 
handler defaults are empty)

Example query string
searchString (q) = John Joan Winthorp

Collection 1 default query params

"set":{
"params1":{
"op":"AND",
"df":"NAME",
"wt":"json",
"fl":"NAME ADDRESS EMAIL PHONE",
"q":"{!edismax v=${searchString} df=NAME q.op=${op}}",
"cursorMark":"*",
"sort":"id desc",
"echoParams":"all",
"debug":"query",
  "_invariants_":{
"deftype":"edismax",
"qf":"NAME ADDRESS EMAIL PHONE",
"uf":"NAME ADDRESS EMAIL PHONE",
"rows":"40",
"lowercaseOperators":"true",
"facet":"false",
"facet.mincount":1,
"facet.missing":"false",
"hl":"on",
"hl.method":"unified",
"hl.bs.type":"SEPARATOR",
"hl.tag.pre":"${",
"hl.tag.post":"}",
"hl.bs.separator":",",
"hl.fl":"NAME ADDRESS EMAIL PHONE",
"cache":"true",
"explainOther":"false",
"start":0}
  }
   }

Query results (debug=query) for above params1 - select request handler sets 
useParams=params1

"q":"{!edismax v=\"john joan winthorp\" df=NAME q.op=AND}"
"rawquerystring":"{!edismax v=\"john joan winthorp\" df=NAME q.op=AND}",
"parsedquery_toString":"+(+((+NAME:john +NAME:joan +NAME:winthorp) | 
(+ADDRESS:john +ADDRESS:joan +ADDRESS:winthorp) | () | (+EMAIL:john +EMAIL:joan 
+EMAIL:winthorp) | (+PHONE:john +PHONE:joan +PHONE:winthorp)))"


Collection 2 default query params

   "set":{
"params2":{
   "op":"AND",
   "df":"CUSTOMER",
   "wt":"json",
   "fl":"*",
   "q":"{!edismax v=${searchString} df=CUSTOMER q.op=${op}}",
   "cursorMark":"*",
   "sort":"id desc",
   "debug":"query",
  "_invariants_":{
"deftype":"edismax",
"qf":"CUSTOMER EMAIL PRODUCT",
"uf":"*",
"rows":"20",
"facet":"true",
"facet.mincount":1,
"facet.limit":20,
"facet.missing":"false",
"hl":"on",
"hl.method":"unified",
"hl.bs.type":"SEPARATOR",
"hl.tag.pre":"${",
"hl.tag.post":"}",
"hl.bs.separator":",",
"hl.fl":"*",
"cache":"true",
"explainOther":"false",
"echoParams":"none",
   "start":0}
  }
   }

Query results (debug=true) for params2 above - select request handler sets 
useParams=params2

"q":"{!edismax v=\"john joan winthorp\" df=NAME q.op=AND}",
"rawquerystring":"{!edismax v=\"john joan winthorp\" df=Goods_Descr q.op=AND}",
"parsedquery_toString":"+(+(CUSTOMER:john | EMAIL:john | PRODUCT:john) 
+(CUSTOMER:joan | EMAIL:joan | PRODUCT:joan) +(CUSTOMER:winthorp | 
EMAIL:winthorp | PRODUCT:winthorp))"

In summary I seem to get different search behaviours for the same edismax query 
executed against each collection.  The first search is translated into a query 
which requires all terms be present in each individual qf field (i.e. 
(+NAME:john +NAME:joan +NAME:winthorp)... ).  The second (my desired behaviour) 
is translated into a query where each term is OR'd over each qf field (i.e. 
+(CUSTOMER:john | EMAIL:john | PRODUCT:john) ...).

q=john joan winthorp

Query 1 - params1
"parsedquery_toString":"+(+((+NAME:john +NAME:joan +NAME:winthorp) | 
(+ADDRESS:john +ADDRESS:joan +ADDRESS:winthorp) | () | (+EMAIL:john +EMAIL:joan 
+EMAIL:winthorp) | (+PHONE:john +PHONE:joan +PHONE:winthorp)))"

Query 2 - params2
"parsedquery_toString":"+(+(CUSTOMER:john | EMAIL:john | PRODUCT:john) 
+(CUSTOMER:joan | EMAIL:joan | PRODUCT:joan) +(CUSTOMER:winthorp | 
EMAIL:winthorp | PRODUCT:winthorp))"


I copied the configset of the first collection when creating the second so most 
of the settings are identical across collections. Is anyone able to suggest or 
see any glaringly obvious configuration settings I may be overlooking to cause 
the queries to behave differently across collections?
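
For reference, both collections are queried the same way, roughly equivalent to 
the following (collection names are placeholders):

curl "http://localhost:8983/solr/collection1/select" \
  --data-urlencode "useParams=params1" \
  --data-urlencode "searchString=John Joan Winthorp"

curl "http://localhost:8983/solr/collection2/select" \
  --data-urlencode "useParams=params2" \
  --data-urlencode "searchString=John Joan Winthorp"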

As always any advice would be greatly appreciated

Thanks

Dwane


Basic auth and index replication

2019-04-03 Thread Dwane Hall
Hey Solr community.



I’ve been following a couple of open JIRA tickets relating to use of the basic 
auth plugin in a Solr cluster (https://issues.apache.org/jira/browse/SOLR-12584 
, https://issues.apache.org/jira/browse/SOLR-12860) and recently I’ve noticed 
similar behaviour when adding tlog replicas to an existing Solr collection.  
The problem appears to occur when Solr attempts to replicate the leader's index 
to a follower on another Solr node and it fails authentication in the process.



My environment

Solr cloud 7.6

Basic auth plugin enabled

SSL
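
For context, the tlog replicas are being added with a Collections API call along 
these lines (collection, shard, node and credentials are placeholders, and -k is 
only needed for self-signed certificates):

curl -u solr_admin:password -k \
  "https://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=my_collection&shard=shard1&type=tlog&node=otherhost:8983_solr"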



Has anyone else noticed similar behaviour when using tlog replicas?



Thanks,



Dwane



2019-04-03T13:27:22,774 5000851636 WARN  : [   ] 
org.apache.solr.handler.IndexFetcher : Master at: 
https://myserver:myport/solr/mycollection_shard1_replica_n17/ is not available. 
Index fetch failed by exception: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at https://myserver:myport/solr/mycollection_shard1_replica_n17: 
Expected mime type application/octet-stream but got text/html. 





Error 401 require authentication



HTTP ERROR 401

Problem accessing /solr/mycollection_shard1_replica_n17/replication. Reason:

require authentication







edismax pt 2

2019-04-09 Thread Dwane Hall
Hi guys,



I’m just following up from an earlier question I raised on the forum regarding 
inconsistencies in edismax query behaviour and I think I may have discovered 
the cause of the problem.  From testing I've noticed that edismax query 
behaviour seems to change depending on the field types specified in the qf 
parameter.



Here’s an example first using only solr.Text fields.



(all fields are “text_general” – standard tokenizer, lower case filter only)

NAME solr.TextField

ADDRESS solr.TextField

EMAIL solr.TextField

PHONE_NUM solr.TextField



“qf":"NAME ADDRESS EMAIL PHONE_NUM”



"querystring":"peter john spain",

"parsedquery":"+(+DisjunctionMaxQuery((PHONE_NUM:peter | ADDRESS:peter | 
EMAIL:peter | NAME:peter)) +DisjunctionMaxQuery((PHONE_NUM:john | ADDRESS:john 
| EMAIL:john | NAME:john)) +DisjunctionMaxQuery((PHONE_NUM:spain | 
ADDRESS:spain | EMAIL:spain | NAME:spain)))",





Now with no other configuration changes when I introduce a range date field 
(solr.DateRangeField) called “DOB” into the qf parameter the behaviour of the 
edismax parser changes dramatically.



DOB solr.DateRangeField



“qf":"NAME ADDRESS EMAIL PHONE_NUM DOB”



"querystring":"peter john spain",

"parsedquery":"+(+DisjunctionMaxQuery(((+PHONE_NUM:peter +PHONE_NUM:john 
+PHONE_NUM:spain) | (+EMAIL:peter +EMAIL:john +EMAIL:spain) | () | (+NAME:peter 
+NAME:john +NAME:spain) | (+ADDRESS:peter +ADDRESS:john +ADDRESS:spain",




Notice the difference between the “|” (OR) and “+” (AND) joining the terms, and 
also that every term is now required to exist in every field.  Is this the 
expected behaviour for the edismax query parser, or am I overlooking something 
that may be causing this behaviour inconsistency?
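
A minimal reproduction of the two cases, for anyone wanting to try it (collection 
name is a placeholder):

curl "http://localhost:8983/solr/my_collection/select" \
  --data-urlencode "q=peter john spain" \
  --data-urlencode "defType=edismax" \
  --data-urlencode "qf=NAME ADDRESS EMAIL PHONE_NUM" \
  --data-urlencode "rows=0" \
  --data-urlencode "debug=query"

# same request, with the DateRangeField added to qf
curl "http://localhost:8983/solr/my_collection/select" \
  --data-urlencode "q=peter john spain" \
  --data-urlencode "defType=edismax" \
  --data-urlencode "qf=NAME ADDRESS EMAIL PHONE_NUM DOB" \
  --data-urlencode "rows=0" \
  --data-urlencode "debug=query"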



As always any comments or feedback is greatly appreciated,


Thanks



Dwane





Hit highlighting

2018-07-06 Thread Dwane Hall
Good evening solr community.  I have not had a lot of luck on another community 
source seeking advice on using the unified highlighter so I thought I'd try my 
luck with the solr experts.  Any recommendations would be appreciated when you 
get time.


Apache Solr 6.4 saw the release of the unified highlighter which according to 
the Solr documentation is the "most flexible and performant of the options. We 
recommend that you try this highlighter even though it isn’t the default 
(yet)". 
(ref)

With this in mind I've attempted to follow this recommendation and utilize it 
in a project I'm designing. The functionality works as expected but I am unable 
to find a hl.requireFieldMatch equivalent for this highlighter. This means the 
entire hl.fl list is returned as empty arrays for all fields that do not have a 
hit highlight associated with them (along with the successful highlight 
fields). These fields can be ignored by a client but ideally they would not be 
passed to a calling client as the list can be quite long especially if a 
wildcard (*) is used in the hl.fl parameter. With this in mind would I be 
better off continuing with the unified highlighter and ignoring the additional 
non-highlighted field list or defaulting back to the fastVector highlighter? 
How significant performance improvement does the unified highlighter offer and 
am I better off wearing the additional network data overhead to leverage this 
performance gain? For reference the index is a large one (400,000,000+ 
documents on Solr 7.3.1) so my initial instinct is to keep using the unified 
highlighter and remove as much stress on my Solr cluster as possible. Any 
advice or recommendations would be greatly appreciated.
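
For reference, the highlighting portion of the request currently looks something 
like this (the query and hl.fl list are placeholders):

curl "http://localhost:8983/solr/my_collection/select" \
  --data-urlencode "q=smith" \
  --data-urlencode "hl=on" \
  --data-urlencode "hl.method=unified" \
  --data-urlencode "hl.fl=*"
# every field matched by hl.fl comes back in the highlighting section,
# including empty arrays for fields with no highlight in that document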


Thanks.


DH



Solr exporter

2018-07-08 Thread Dwane Hall
Has anyone had any luck using the Solr 7.3+ exporter for metrics collection on 
a Solr instance with the basic auth plugin enabled? The exporter starts without 
issue but I have had no luck specifying the credentials when the exporter tries 
to call the metrics API. The documentation does not appear to address this use 
case. Thanks.


Solr [subquery] document transformer

2018-07-22 Thread Dwane Hall
Good afternoon knowledgeable Solr community.  I’m experiencing problems using a 
document transformer across a multi-shard collection and am wondering if anyone 
would be able to assist or provide some guidance?



The document transformer query below works nicely until I split the collection 
into multiple shards and then I receive what appears to be an authentication 
issue on the subquery.





My query configuration (the original query returns a document with a field link 
to another ‘parent’ document)

"parent.q":"{!edismax qf=CHILD_ID v=$row.PARENT_ID _route_=PARENT_ID!}",

"parent.fl":"*",

"parent.rows":1,

"fl":"…other fields to display, parent:[subquery]",



Environment:

SolrCloud (7.3.1)

Https

Rules based authentication provider



Any advice would be appreciated.



DH



2018-07-23 11:43:06,445 5471250 ERROR : [c:my_collection s:shard1 r:core_node3 
x: my_collection_shard1_replica_n1] org.apache.solr.common.SolrException : 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at https://serverName:9021/solr/my_collection_shard3_replica_n4: 
Expected mime type application/octet-stream but got text/html. 





Error 401 require authentication



HTTP ERROR 401

Problem accessing /solr/my_collection_shard3_replica_n4/select. Reason:

require authentication







at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:607)

at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)

at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)

at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)

at 
org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)

at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)



https://serverName:9021/solr/my_collection_shard1_replica_n1: parsing error

at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:616)

at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)

at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)

at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)

at 
org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)

at java.util.concurrent.FutureTask.run(FutureTask.java:266)

at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)

at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)

at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

Caused by: org.apache.solr.common.SolrException: parsing error

at 
org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:52)

at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:614)

... 12 more

Caused by: java.io.EOFException

at 
org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:207)

at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:255)

at 
org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:747)

at 
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:272)

at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)

at 
org.apache.solr.common.util.JavaBinCodec.readSolrDocumentList(JavaBinCodec.java:555)

at 
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBinCodec.java:307)

at 
org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:256)

at 
org.apache.solr.common.util.JavaBinCodec.readOrderedMap(JavaBinCodec.java:200)

at 
org.apache.solr.common.util.JavaBinCodec.readObject(JavaBi

Re: Solr Cloud on Docker?

2020-02-05 Thread Dwane Hall
Hey Dominique,

From a memory management perspective I don't do any container resource limiting 
specifically in Docker (although as you mention you certainly can).  In our 
circumstances these hosts are used specifically for Solr so I planned and tested 
my capacity beforehand.  We have ~768G of RAM on each of these 5 hosts, so with 
20x16G heaps we had ~320G of heap being used by Solr, some overhead for Docker 
and the other OS services, leaving ~400G for the OS cache and whatever wants to 
grab it on each host.  Not everyone will have servers this large, which is why we 
really had to take advantage of multiple Solr instances per host, and Docker 
became important for our cluster operation management.  Our disks are not SSDs 
either, and all instances write to the same RAID 5 spinner which is bind mounted 
to the containers.  With this configuration we've been able to achieve consistent 
median response times of under 500ms across the largest collection, but obviously 
query type varies this (no terms, leading wildcards etc.).  Our QPS is not huge, 
ranging from 2-20/sec, but if we need to scale further or speed up response times 
there are certainly wins that can be made at the disk level.  For our current 
circumstances we're very content with the deployment.
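
(For anyone who does want to cap the containers rather than rely on host capacity 
planning, the docker run flags Dominique mentions map to something like the 
following; the values are illustrative only:)

docker run -d --name solr-node-01 \
  --memory=24g --memory-reservation=20g --memory-swappiness=1 \
  -e SOLR_HEAP=16g \
  solr:7.7.2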

I'm not sure if you've read Toke's blog on his experiences at the Royal Danish 
Library, but I found it really useful when capacity planning and recommend 
reading it 
(https://sbdevel.wordpress.com/2016/11/30/70tb-16b-docs-4-machines-1-solrcloud/).

As always it's recommended to test for your own conditions, and best of luck with 
your deployment!

Dwane


From: Scott Stults 
Sent: Thursday, 30 January 2020 1:45 AM
To: solr-user@lucene.apache.org 
Subject: Re: Solr Cloud on Docker?

One of our clients has been running a big Solr Cloud (100-ish nodes, TB
index, billions of docs) in kubernetes for over a year and it's been
wonderful. I think during that time the biggest scrapes we got were when we
ran out of disk space. Performance and reliability has been solid
otherwise. Like Dwane alluded to, a lot of operations pitfalls can be
avoided if you do your Docker orchestration through kubernetes.


k/r,
Scott

On Tue, Jan 28, 2020 at 3:34 AM Dominique Bejean 
wrote:

> Hi  Dwane,
>
> Thank you for sharing this great solr/docker user story.
>
> According to your Solr/JVM memory requirements (Heap size + MetaSpace +
> OffHeap size) are you specifying specific settings in docker-compose files
> (mem_limit, mem_reservation, mem_swappiness, ...) ?
> I suppose you are limiting total memory used by all dockerised Solr in
> order to keep free memory on host for MMAPDirectory ?
>
> In short can you explain the memory management ?
>
> Regards
>
> Dominique
>
>
>
>
> Le lun. 23 déc. 2019 à 00:17, Dwane Hall  a écrit :
>
> > Hey Walter,
> >
> > I recently migrated our Solr cluster to Docker and am very pleased I did
> > so. We run relativity large servers and run multiple Solr instances per
> > physical host and having managed Solr upgrades on bare metal installs
> since
> > Solr 5, containerisation has been a blessing (currently Solr 7.7.2). In
> our
> > case we run 20 Solr nodes per host over 5 hosts totalling 100 Solr
> > instances. Here I host 3 collections of varying size. The first contains
> > 60m docs (8 shards), the second 360m (12 shards) , and the third 1.3b (30
> > shards) all with 2 NRT replicas. The docs are primarily database sourced
> > but are not tiny by any means.
> >
> > Here are some of my comments from our migration journey:
> > - Running Solr on Docker should be no different to bare metal. You still
> > need to test for your environment and conditions and follow the guides
> and
> > best practices outlined in the excellent Lucidworks blog post
> >
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > .
> > - The recent Solr Docker images are built with Java 11 so if you store
> > your indexes in hdfs you'll have to build your own Docker image as Hadoop
> > is not yet certified with Java 11 (or use an older Solr version image
> built
> > with Java 8)
> > - As Docker will be responsible for quite a few Solr nodes it becomes
> > important to make sure the Docker daemon is configured in systemctl to
> > restart after failure or reboot of the host. Additionally the Docker
> > restart=always setting is useful for restarting failed containers
> > automatically if a single container dies (i.e. JVM explosions). I've
> > deliberately blown up the JVM in test conditions and found the
> > containers/Solr recover rea

Re: Rule of thumb for determining maxTime of AutoCommit

2020-02-27 Thread Dwane Hall
Hey Kaya,

How are you adding documents to your index?  Do you control this yourself, or do 
you have multiple clients (curl, SolrJ, calls directly to /update*) updating data 
in your index?  Based on your hard and soft commit settings I suspect a client 
may be causing your soft commits by making updates with the commitWithin 
parameter (whose default behaviour is a soft commit [openSearcher=true]).  If 
this is the case and you don't want to allow client commits, you can prevent this 
behaviour by adding a processor to the updateRequestProcessorChain of your 
solrconfig.xml and then manage your commits by configuring the settings you've 
described below 
(https://lucene.apache.org/solr/guide/8_4/shards-and-indexing-data-in-solrcloud.html#ignoring-commits-from-client-applications-in-solrcloud).
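
If it helps, the hard/soft commit values themselves can also be adjusted without 
hand-editing solrconfig.xml by using the Config API, e.g. (a sketch only, the 
values here are illustrative and not a recommendation):

curl -X POST -H 'Content-type: application/json' \
  "http://localhost:8983/solr/my_collection/config" -d '{
  "set-property": {
    "updateHandler.autoCommit.maxTime": 60000,
    "updateHandler.autoCommit.openSearcher": false,
    "updateHandler.autoSoftCommit.maxTime": 120000
  }
}'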


I always go back to Erick Erickson's great Lucidworks blog post for all things 
indexing related, so if you haven't seen it I highly recommend checking it out.  
There's a ton of great info in the post and it may take a few reads to completely 
digest it all; towards the end he provides some index and query scenarios and 
some sensible hard and soft commit settings to address common use cases.  To this 
day his words in this post still ring true in my ear: "hard commits are about 
durability, soft commits are about visibility" 
(https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/)

Good luck,

Dwane


From: Emir Arnautović 
Sent: Thursday, 27 February 2020 9:23 PM
To: solr-user@lucene.apache.org 
Subject: Re: Rule of thumb for determining maxTime of AutoCommit

Hi Kaya,
Since you do not have soft commits, you must have explicit commits somewhere 
since your hard commits are configured not to open searcher.

Re warming up: yes, you are right.  You need to check your queries and warmup 
numbers in the cache configs.  What you need to check is how long the warmup 
takes, and if it takes too long, reduce the number of warmup queries/items.  I 
think there is a cumulative warming time in the admin console, or if you prefer a 
proper Solr monitoring tool, you can check out our Solr integration: 
https://apps.sematext.com/demo 

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Feb 2020, at 03:00, Kayak28  wrote:
>
> Hello, Emir:
>
> Thank you for your reply.
> I do understand that the frequency of creating searcher depends on how much
> realitime-search is required.
>
> As you advise me, I have checked a soft-commit configuration.
> It is configured as:
> ${solr.autoSoftCommit.maxTime:-1}
>
> If I am correct, I have not set autoSoftCommit, which means autoSoftCommit
> does not create a new searcher.
> Next, I will take a look at my explicit commit.
>
> I am curious about what you say "warming strategy."
> Is there any good resource to learn about tuning warming up?
>
> As far as I know about warming up, there is two warming-up functions in
> Solr.
> One is static warming up, which you can configure queries in solrconfig.xml
> The other is dynamic warming up, which uses queries from old cache.
>
> How should I tune them?
> What is the first step to look at?
> (I am kinda guessing the answer can vary depends on the system, the
> service, etc... )
>
>
>
> Sincerely,
> Kaya Ota
>
>
>
> 2020年2月26日(水) 17:36 Emir Arnautović :
>
>> Hi Kaya,
>> The answer is simple: as much as your requirements allow delay between
>> data being indexed and changes being visible. It is sometimes seconds and
>> sometimes hours or even a day is tolerable. On each commit your caches are
>> invalidated and warmed (if it is configured like that) so in order to get
>> better use of caches, you should commit as rare as possible.
>>
>> The setting that you provided is about hard commits and those are
>> configured not to open new searcher so such commit does not cause “exceeded
>> limit” error. You either have soft auto commits configured or you do
>> explicit commits when updating documents. Check and tune those and if you
>> do explicit commits, remove those if possible. If you cannot afford less
>> frequent commits, you have to tune your warming strategy to make sure it
>> does not take as much time as period between two commits.
>>
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 26 Feb 2020, at 06:16, Kayak28  wrote:
>>>
>>> Hello, Solr Community:
>>>
>>> Another day, I had an error "exceeded limit of maxWarmingSearchers=2."
>>> I know this error causes when multiple commits(which opens a new
>> searcher)
>>> are requested too frequently.
>>>
>>> As far as I read Solr wiki, it recommends for me to have more interval
>>> between each commit, and make commit frequency less.
>>> Using autoCommit,  I would like to decreas

Re: SolrCloud location for solr.xml

2020-03-02 Thread Dwane Hall
Hey Jan,



Thanks for the info re swap, there are some interesting observations you’ve 
mentioned below, particularly containers using swap by default.   There was a 
note on the Docker forum describing a similar situation to the one you mention; 
did you attempt these settings with the same result? 
(https://success.docker.com/article/node-using-swap-memory-instead-of-host-memory)
 It mentions Docker EE specifically but it might be worth a try.



In our environment we also run a vm.swappiness setting of 1 (our OS default) 
which is inherited by our containers as we don’t enforce any memory or resource 
limits directly on them.   I did not attempt to turn off container swap during 
my testing so I don't have any benchmarks to relay back but if I get some clear 
air I’ll try to spin up some tests and see if I can replicate your observations.



There’s a great resource/tutorial from the Alibaba guys on testing and tweaking 
container resource limits, and their observations are far more comprehensive 
than my experience (refs below).  If you haven't seen it, it may prove a useful 
reference for testing as I’ve noticed some of the Docker settings are not quite 
what you expect them to be.  For example the Docker documentation defines the 
--cpus value as follows:



“Specify how much of the available CPU resources a container can use. For 
instance, if the host machine has two CPUs and you set --cpus="1.5", the 
container is guaranteed at most one and a half of the CPUs. This is the 
equivalent of setting --cpu-period="100000" and --cpu-quota="150000". Available 
in Docker 1.13 and higher.”



From the reference below

CPUs = threads per core x cores per socket x sockets

CPUs are not physical CPUs



We found this out the hard way when attempting to CPU resource limit a Fusion 
instance and wondered why it was struggling to start all its services on a 
single thread instead of a physical CPU 🙂
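
To make that concrete (the image and container names here are only 
illustrative):

# limits the container to 4 CPU-threads' worth of scheduler time, not 4 physical cores
docker run -d --name solr-node1 --cpus="4" solr:8.4.1

# alternatively, pin the container to specific logical CPUs (threads)
docker run -d --name solr-node1 --cpuset-cpus="0-3" solr:8.4.1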


It's a three part series

https://www.alibabacloud.com/blog/594573?spm=a2c5t.11065265.1996646101.searchclickresult.612d62b1TGCY58

https://www.alibabacloud.com/blog/docker-container-resource-management-cpu-ram-and-io-part-2_594575

https://www.alibabacloud.com/blog/594579?spm=a2c5t.11065265.1996646101.searchclickresult.612d62b1TGCY58




I'll keep a keen eye on any developments and relay back if I get some tests 
together.


Cheers,



Dwane



PS: I’m assuming you're testing Solr 8.4.1 on Linux hosts?


From: Jan Høydahl 
Sent: Monday, 2 March 2020 12:01 AM
To: solr-user@lucene.apache.org 
Subject: Re: SolrCloud location for solr.xml

As long as solr.xml is a mix of settings that need to be separate per node and 
cluster-wide settings, it makes no sense to enforce it in zk.

Perhaps we instead should stop requiring solr.xml and allow nodes to start 
without it. Solr can then use a hard coded version as fallback.

Most users just copy the example solr.xml into SOLR_HOME, so if we can 
simplify the 80% case I don’t mind if more advanced users put it in zk or in
local file system.

Jan Høydahl

> 1. mar. 2020 kl. 01:26 skrev Erick Erickson :
>
> Actually, I do this all the time. However, it’s because I’m always blowing 
> everything away and installing a different version of Solr or some such, 
> mostly laziness.
>
> We should move away from allowing solr.xml to be in SOLR_HOME when running in 
> cloud mode IMO, but that’ll need to be done in phases.
>
> Best,
> Erick
>
>> On Feb 28, 2020, at 5:17 PM, Mike Drob  wrote:
>>
>> Hi Searchers!
>>
>> I was recently looking at some of the start-up logic for Solr and was
>> interested in cleaning it up a little bit. However, I'm not sure how common
>> certain deployment scenarios are. Specifically is anybody doing the
>> following combination:
>>
>> * Using SolrCloud (i.e. state stored in zookeeper)
>> * Loading solr.xml from a local solr home rather than zookeeper
>>
>> Much appreciated! Thanks,
>> Mike


Re: SolrCloud location for solr.xml

2020-03-02 Thread Dwane Hall
Apologies all, I just realised I replied to the wrong thread. This is in 
response to "Solr cloud on Docker?" not "SolrCloud location for solr.xml". 
Apologies for the confusion.

Thanks

Dwane


Zookeeper migration

2020-03-16 Thread Dwane Hall
Hey Solr community,


I’m wondering if anyone has ever managed a zookeeper migration while running 
SolrCloud or if they have any advice on the process (not a zookeeper upgrade 
but a new physical instance migration)? I could not seem to find any endpoints 
in the collections or coreadmin APIs that catered for this scenario.


Initially I was hoping I could do all of the required zookeeper preparation 
(znode creation, clusterprops …) and start my existing Solr instance pointing 
at the new zookeeper but of course the new instance is unaware of the cluster 
state (state.json).  In fact when trialling this in a development environment it 
was quite a destructive operation as data from my SOLR_HOME (/var/solr/data) was 
physically deleted after I connected Solr to the new zookeeper instance! I’m 
uncertain if this is the expected behaviour or not but it’s certainly something 
for people to be aware of!


After some investigation and testing my thoughts are I’ll need to complete the 
following:


- Stop the existing Solr instance so no updates are occurring

- Backup the data

- Create the znode on the new zookeeper instance

- Update/upload the appropriate zookeeper managed files to the new 
zookeeper instance (security.json, clusterprops.json, solr.xml etc.) - a rough 
sketch of these commands is shown below this list

- Start Solr using ZK_HOST equal to the new zookeeper instance and 
znode (possibly use new Solr nodes here and not the existing ones)

- Replicate the collection creation process on the new zookeeper 
instance

- Physically copy the data from the old Solr nodes to the new Solr 
nodes and carefully map each replica and shard to the new location (which will 
have new replica names)

- Start the new Solr instance

- Clean up the old instance
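
For the znode creation and file upload steps above, this is roughly what I'm 
planning to run with the bin/solr zk CLI (hostnames, the /solr/NEWCHROOT chroot 
and local paths are placeholders):

bin/solr zk mkroot /solr/NEWCHROOT -z newzk1:2181,newzk2:2181,newzk3:2181
bin/solr zk cp file:/path/to/security.json zk:/security.json -z newzk1:2181,newzk2:2181,newzk3:2181/solr/NEWCHROOT
bin/solr zk cp file:/path/to/solr.xml zk:/solr.xml -z newzk1:2181,newzk2:2181,newzk3:2181/solr/NEWCHROOT
bin/solr zk upconfig -n my_configset -d /path/to/my_configset/conf -z newzk1:2181,newzk2:2181,newzk3:2181/solr/NEWCHROOT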


So in summary has anybody completed a similar migration, can offer any advice, 
or are they aware of an easy way to transfer state between zookeeper instances 
to avoid the migration process I’ve outlined above?


Many thanks,


Dwane


Environment (SolrCloud)

Existing Zk: 3.4.2 (Bare metal)

New Zk: 3.4.2 (Docker)

Existing Solr: 7.7.2 (Docker)


Re: solr core metrics & prometheus exporter - indexreader is closed

2020-05-06 Thread Dwane Hall
Hey Richard,

I noticed this issue with the exporter in the 7.x branch. If you look through 
the release notes for Solr since then there have been quite a few improvements 
to the exporter particularly around thread safety and concurrency (and the 
number of nodes it can monitor).  The version of the exporter can run 
independently of your Solr version, so my advice would be to download the most 
recent Solr version, check and modify the exporter start script for its library 
dependencies, extract these files to a separate location, and run this version 
against your 7.x instance. If you have the capacity to upgrade your Solr 
version this will save you having to maintain the exporter separately. Since 
making this change the exporter has not missed a beat and we monitor around 100 
Solr nodes.
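
For reference, we launch the newer exporter against the older cluster with 
something along these lines (the port, zookeeper connection string and thread 
count are just examples):

./contrib/prometheus-exporter/bin/solr-exporter -p 9854 \
  -z zk1:2181,zk2:2181,zk3:2181/solr \
  -f ./contrib/prometheus-exporter/conf/solr-exporter-config.xml \
  -n 16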

Good luck,

Dwane

From: Richard Goodman 
Sent: Tuesday, 5 May 2020 10:22 PM
To: solr-user@lucene.apache.org 
Subject: solr core metrics & prometheus exporter - indexreader is closed

Hi there,

I've been playing with the prometheus exporter for solr, and have created
my config and have deployed it, so far, all groups were running fine (node,
jetty, jvm), however, I'm repeatedly getting an issue with the core group;

WARN  - 2020-05-05 12:01:24.812; org.apache.solr.prometheus.scraper.Async;
Error occurred during metrics collection
java.util.concurrent.ExecutionException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:8083/solr: Server Error

request:
http://127.0.0.1:8083/solr/admin/metrics?group=core&wt=json&version=2.2
at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
~[?:1.8.0_141]
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
~[?:1.8.0_141]
at
org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
at
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
~[?:1.8.0_141]
at
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
~[?:1.8.0_141]
at
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
~[?:1.8.0_141]
at
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
~[?:1.8.0_141]
at
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
~[?:1.8.0_141]
at
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
~[?:1.8.0_141]
at
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
~[?:1.8.0_141]
at
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
~[?:1.8.0_141]
at
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
~[?:1.8.0_141]
at
org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43)
~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
at
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
~[?:1.8.0_141]
at
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
~[?:1.8.0_141]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
~[?:1.8.0_141]
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
~[?:1.8.0_141]
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_141]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_141]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:8083/solr: Server Error

request:
http://127.0.0.1:8083/solr/admin/metrics?group=core&wt=json&version=2.2
at
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:11]
at
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
~[solr-solrj-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jen

Solr Deletes

2020-05-25 Thread Dwane Hall
Hey Solr users,



I'd really appreciate some community advice if somebody can spare some time to 
assist me.  My question relates to initially deleting a large amount of 
unwanted data from a Solr Cloud collection, and then advice on best patterns 
for managing delete operations on a regular basis.   We have a situation where 
data in our index can be 're-mastered' and as a result orphan records are left 
dormant and unneeded in the index (think of a scenario similar to client 
resolution where an entity can switch between golden records depending on the 
information available at the time).  I'm considering removing these dormant 
records with a large initial bulk delete, and then running a delete process on 
a regular maintenance basis.  The initial record backlog is ~50million records 
in a ~1.2billion document index (~4%) and the maintenance deletes are small in 
comparison ~20,000/week.



So with this scenario in mind I'm wondering what my best approach is for the 
initial bulk delete:

  1.  Do nothing with the initial backlog and remove the unwanted documents 
during the next large reindexing process?
  2.  Delete by query (DBQ) with a specific delete query using the document 
id's?
  3.  Delete by id (DBID)?

Are there any significant performance advantages to using DBID over a 
specific DBQ? Should I break the delete operations up into batches of say 1000, 
10000, 100000, ... N DOC_IDs at a time if I take this approach?



The Solr Reference guide mentions DBQ ignores the commitWithin parameter but 
you can specify multiple documents to remove with an OR (||) clause in a DBQ 
i.e.


Option 1 – Delete by id

{"delete":["",""]}



Option 2 – Delete by query (commitWithin ignored)

{"delete":{"query":"DOC_ID:( || )"}}



Shawn also provides a great explanation in this user group post from 2015 of 
the DBQ process 
(https://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html)



I follow the Solr release notes fairly closely and also noticed this excellent 
addition and discussion from Hossman and committers in the Solr 8.5 release and 
it looks ideal for this scenario 
(https://issues.apache.org/jira/browse/SOLR-14241).  Unfortunately we're still 
on the 7.7.2 branch and are unable to take advantage of the streaming deletes 
feature.



If I do implement a weekly delete maintenance regime is there any advice the 
community can offer from experience?  I'll definitely want to avoid times of 
heavy indexing but how do deletes affect query performance?  Will users notice 
decreased performance during delete operations, meaning they should be avoided 
during peak query windows as well?



As always any advice greatly is appreciated,



Thanks,



Dwane



Environment

SolrCloud 7.7.2, 30 shards, 2 replicas

~3 qps during peak times


Re: Solr Deletes

2020-05-26 Thread Dwane Hall
Thank you very much Erick, Emir, and Bram this is extremly useful advice I 
sincerely appreciate everyone’s input!


Before I received your responses I ran a controlled DBQ test in our DR 
environment and exactly what you said occurred.  It was like reading a step by 
step playbook of events with heavy blocking occurring on the Solr nodes and 
lots of threads going into a TIMED_WAITING state. Several shards were pushed 
into recovery mode and things were starting to get ugly, fast!


I'd read snippets in blog posts and JIRA tickets on DBQ being a blocking 
operation, but I did not expect that such a specific DBQ (i.e. by IDs) would 
operate very differently from DBID (which I expected to block as well). Boy 
was I wrong! They're used interchangeably in the Solr ref guide examples so 
it’s very useful to understand the performance implications of each.  
Additionally all of the information I found on delete operations never 
mentioned query performance so I was unsure of its impact in this dimension.


Erick, thanks again for your comprehensive response; your blogs and user group 
responses are always a pleasure to read and I'm constantly picking up useful 
pieces of information that I use on a daily basis in managing our Solr/Fusion 
clusters. Additionally, I've been looking for an excuse to use streaming expressions and 
I did not think to use them the way you suggested.  I've watched quite a few of 
Joel's presentations on youtube and his blog is brilliant.  Streaming 
expressions are expanding with every Solr release they really are a very 
exciting part of Solr's evolution.  Your final point on searcher state while 
streaming expressions are running and its relationship with new searchers is a 
very interesting additional piece of information I’ll add to the toolbox. Thank 
you.



At the moment we're fortunate to have all the ID's of the documents to remove 
in a DB so I'll be able to construct batches of DBID requests relatively easily 
and store them in a backlog table for processing without needing to traverse 
Solr with cursors, streaming (or other means) to identify them.  We follow a 
similar approach for updates in batches of around ~1000 docs/batch.  
Inspiration for that sweet spot once again came from reading one of 
Erick's Lucidworks blog posts and testing 
(https://lucidworks.com/post/really-batch-updates-solr-2/).
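
For what it's worth the delete side will look much the same; a minimal SolrJ 
sketch of what I have in mind (the collection name, batch size and single 
trailing commit are assumptions for illustration only):

import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class BatchDeleter {
  // Delete a backlog of known IDs in fixed-size delete-by-id batches.
  public static void deleteInBatches(CloudSolrClient client, List<String> ids) throws Exception {
    final int batchSize = 1000;
    for (int i = 0; i < ids.size(); i += batchSize) {
      List<String> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
      client.deleteById("my_collection", batch);   // delete-by-id, not delete-by-query
    }
    client.commit("my_collection");                // one explicit commit at the end
  }
}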



Again, thanks to the community and its users for everyone’s contribution on this 
issue; it is very much appreciated.


Successful Solr-ing to all,


Dwane


From: Bram Van Dam 
Sent: Wednesday, 27 May 2020 5:34 AM
To: solr-user@lucene.apache.org 
Subject: Re: Solr Deletes

On 26/05/2020 14:07, Erick Erickson wrote:
> So best practice is to go ahead and use delete-by-id.


I've noticed that this can cause issues when using implicit routing, at
least on 7.x. Though I can't quite remember whether the issue was a
performance issue, or whether documents would sometimes not get deleted.

In either case, I worked it around it by doing something like this:

UpdateRequest req = new UpdateRequest();
req.deleteById(id);
req.setCommitWithin(-1);
req.setParam(ShardParams._ROUTE_, shard);

Maybe that'll help if you run into either of those issues.

 - Bram


Re: BasicAuth help

2020-09-08 Thread Dwane Hall
Just adding some assistance to the Solr-LDAP integration options. A colleague 
of mine wrote a plugin that adopts a similar approach to the one Jan suggested 
of "plugging-in" an LDAP provider.

He provides the following notes on its design and use



1.   It authenticates with LDAP on every request, which can be expensive. In the 
same repo he's written an optimisation for a gremlin-ldap-plugin that can 
probably be ported here (once LDAP successfully authenticates, it caches the 
credentials locally by BCrypt hashing them and uses the cached hash to validate 
subsequent requests until the cache times out, at which point it goes back to 
LDAP again. So, any password changes in LDAP are reflected correctly. This caching 
can be turned on and off with a param based on how expensive the LDAP auth is).

2.  He had to copy large swaths of code from 
org.apache.solr.security.RuleBasedAuthorizationPlugin into the ldap 
authorisation plugin because the Solr class is not extensible. A refactor of the 
class to make extension easier would prevent this.

3.  Finally, the inter-node authentication. Need to look into it to see if 
there is a mechanism to extend the inter-node auth to include roles in the 
payload so that the LDAP role lookup isn’t happening on every node the request 
ends up hitting.



But if someone really wants LDAP integration they can use it as is. It's a good 
starting point anyway.  (https://github.com/vjgorla/solr-ldap-plugin)
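
(And for anyone who only needs the stock BasicAuth plugin without LDAP, the 
shape of security.json is roughly the sketch below - the credentials value is 
a placeholder for the "base64(sha256 hash) base64(salt)" pair the ref guide 
explains how to generate.)

{
  "authentication": {
    "blockUnknown": true,
    "class": "solr.BasicAuthPlugin",
    "credentials": { "solr": "<base64 sha256 hash> <base64 salt>" }
  },
  "authorization": {
    "class": "solr.RuleBasedAuthorizationPlugin",
    "permissions": [ { "name": "security-edit", "role": "admin" } ],
    "user-role": { "solr": "admin" }
  }
}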

Thanks,

Dwane

From: Jan Høydahl 
Sent: Monday, 7 September 2020 5:21 PM
To: solr-user@lucene.apache.org 
Subject: Re: BasicAuth help

That github patch is interesting.
My initial proposal for how to plug LDAP into Solr was to make the 
AuthenticationProvider pluggable in BasicAuthPlugin, so you could plug in an 
LDAPAuthProvider. See https://issues.apache.org/jira/browse/SOLR-8951 
. No need to replace the whole 
BasicAuth class I think. Anyone who wants to give it a shot, borrowing some 
code from the ldap_solr repo, feel free :)

Jan

> 4. sep. 2020 kl. 09:43 skrev Aroop Ganguly :
>
> Try looking at a simple ldap authentication suggested here: 
> https://github.com/itzmestar/ldap_solr 
> 
> You can combine this for authentication and couple it with rule based 
> authorization.
>
>
>
>> On Aug 28, 2020, at 12:26 PM, Vanalli, Ali A - DOT > > wrote:
>>
>> Hello,
>>
>> Solr is running on windows machine and wondering if it possible to setup 
>> BasicAuth with the LDAP?
>>
>> Also, tried the example of Basic-Authentication that is published here
>> but this did not work either.
>>
>> Thanks...Ali
>>
>>
>



Re: Solr queries slow down over time

2020-09-25 Thread Dwane Hall
Goutham I suggest you read Hossman's excellent article on deep paging and why 
returning rows=(some large number) is a bad idea. It provides a thorough 
overview of the concept and will explain it better than I ever could 
(https://lucidworks.com/post/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/#update_2013_12_18).
 In short if you want to extract that many documents out of your corpus use 
cursor mark, streaming expressions, or Solr's parallel SQL interface (that uses 
streaming expressions under the hood)
https://lucene.apache.org/solr/guide/8_6/streaming-expressions.html.
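
If it helps, a minimal SolrJ sketch of the cursorMark approach looks like this 
(the collection name and page size are assumptions; the key points are sorting 
on the uniqueKey and looping until the cursor stops moving):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorWalk {
  public static void walk(CloudSolrClient client) throws Exception {
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(10000);
    q.setSort(SolrQuery.SortClause.asc("id"));   // the sort must include the uniqueKey
    String cursor = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = client.query("my_collection", q);
      // ... process rsp.getResults() here ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) break;            // cursor stopped moving: done
      cursor = next;
    }
  }
}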

Thanks,

Dwane

From: Goutham Tholpadi 
Sent: Friday, 25 September 2020 4:19 PM
To: solr-user@lucene.apache.org 
Subject: Solr queries slow down over time

Hi,

I have around 30M documents in Solr, and I am doing repeated *:* queries
with rows=10000, and changing start to 0, 10000, 20000, and so on, in a
loop in my script (using pysolr).

At the start of the iteration, the calls to Solr were taking less than 1
sec each. After running for a few hours (with start at around 27M) I found
that each call was taking around 30-60 secs.

Any pointers on why the same fetch of 10000 records takes much longer now?
Does Solr need to load all the 27M before getting the last 10000 records?
Is there a better way to do this operation using Solr?

Thanks!
Goutham


Re: Avoiding duplicate entry for a multivalued field

2020-10-29 Thread Dwane Hall
Srinivas this is possible by adding a unique field update processor to the 
update processor chain you are using to perform your updates (/update, 
/update/json, /update/json/docs, .../a_custom_one)

The Java Documents explain its use nicely
(https://lucene.apache.org/solr/8_6_0//solr-core/org/apache/solr/update/processor/UniqFieldsUpdateProcessorFactory.html)
 or there are articles on stack overflow addressing this exact problem 
(https://stackoverflow.com/questions/37005747/how-to-remove-duplicates-from-multivalued-fields-in-solr#37006655)
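
A minimal sketch of what that looks like in solrconfig.xml (the chain and field 
names are placeholders; the processor sits after the distributed processor so 
it also applies to values merged in by atomic updates):

<updateRequestProcessorChain name="uniq-multivalued">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.UniqFieldsUpdateProcessorFactory">
    <str name="fieldName">my_multivalued_field</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>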

Thanks,

Dwane

From: Srinivas Kashyap 
Sent: Thursday, 29 October 2020 3:49 PM
To: solr-user@lucene.apache.org 
Subject: Avoiding duplicate entry for a multivalued field

Hello,

Say, I have a schema field which is multivalued. Is there a way to maintain 
distinct values for that field even though I continue to add duplicate values 
through atomic update via solrj?

Is there some property setting to have only unique values in a multivalued 
field?

Thanks,
Srinivas



Re: Basic auth and index replication

2019-04-23 Thread Dwane Hall
Hi guys,

Did anyone get an opportunity to confirm this behaviour.  If not is the 
community happy for me to raise a JIRA ticket for this issue?

Thanks,

Dwane

From: Dwane Hall 
Sent: Wednesday, 3 April 2019 7:15 PM
To: solr-user@lucene.apache.org
Subject: Basic auth and index replication

Hey Solr community.



I’ve been following a couple of open JIRA tickets relating to use of the basic 
auth plugin in a Solr cluster (https://issues.apache.org/jira/browse/SOLR-12584 
, https://issues.apache.org/jira/browse/SOLR-12860) and recently I’ve noticed 
similar behaviour when adding tlog replicas to an existing Solr collection.  
The problem appears to occur when Solr attempts to replicate the leader's index 
to a follower on another Solr node and it fails authentication in the process.



My environment

Solr cloud 7.6

Basic auth plugin enabled

SSL



Has anyone else noticed similar behaviour when using tlog replicas?



Thanks,



Dwane



2019-04-03T13:27:22,774 5000851636 WARN  : [   ] 
org.apache.solr.handler.IndexFetcher : Master at: 
https://myserver:myport/solr/mycollection_shard1_replica_n17/ is not available. 
Index fetch failed by exception: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at https://myserver:myport/solr/mycollection_shard1_replica_n17: 
Expected mime type application/octet-stream but got text/html. 





Error 401 require authentication
HTTP ERROR 401
Problem accessing /solr/mycollection_shard1_replica_n17/replication. Reason: require authentication







Spark-Solr connector

2019-07-11 Thread Dwane Hall
Hey guys,



I’ve just started looking at the excellent spark-solr project (thanks Tim 
Potter, Kiran Chitturi, Kevin Risden and Jason Gerlowski for their efforts with 
this project it looks really neat!!).



I’m only at the initial stages of my exploration but I’m running into a class 
not found exception when connecting to a secure solr cloud instance (basic 
auth, ssl).  Everything is working as expected on a non-secure solr cloud 
instance.



The process looks pretty straightforward according to the doco so I’m wondering 
if I’m missing anything obvious or if I need to bring any extra classes to the 
classpath when using this project?



Any advice would be greatly appreciated.



Thanks,



Dwane



Environments tried

7.6 and 8.1.1 solr cloud

SSL, Basic Auth Plugin, Rules Based Authorisation Plugin enabled

Spark v 2.4.3

Spark-Solr build spark-solr-3.7.0-20190619.153847-16-shaded.jar





./spark-2.4.3-bin-hadoop2.7/bin/spark-shell --master 
local[*] --jars spark-solr-3.7.0-20190619.153847-16-shaded.jar --conf 
'spark.driver.extraJavaOptions=-Dbasicauth=solr:SolrRocks'





val options = Map(

"collection" -> "My_Collection",

"zkhost" -> "zkn1:2181,zkn2:2181,zkn3:2181/solr/SPARKTEST"

  )



val df = spark.read.format("solr").options(options).load



com.google.common.util.concurrent.ExecutionError: 
java.lang.NoClassDefFoundError: org/eclipse/jetty/client/api/Authentication

  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)

  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)

  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)

  at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)

  at 
com.lucidworks.spark.util.SolrSupport$.getCachedCloudClient(SolrSupport.scala:244)

  at 
com.lucidworks.spark.util.SolrSupport$.getSolrBaseUrl(SolrSupport.scala:248)

  at 
com.lucidworks.spark.SolrRelation.dynamicSuffixes$lzycompute(SolrRelation.scala:100)

  at com.lucidworks.spark.SolrRelation.dynamicSuffixes(SolrRelation.scala:98)

  at 
com.lucidworks.spark.SolrRelation.getBaseSchemaFromConfig(SolrRelation.scala:299)

  at 
com.lucidworks.spark.SolrRelation.querySchema$lzycompute(SolrRelation.scala:239)

  at com.lucidworks.spark.SolrRelation.querySchema(SolrRelation.scala:108)

  at com.lucidworks.spark.SolrRelation.schema(SolrRelation.scala:428)

  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:403)

  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)

  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)

  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)

  ... 49 elided

Caused by: java.lang.NoClassDefFoundError: 
org/eclipse/jetty/client/api/Authentication

  at 
com.lucidworks.spark.util.SolrSupport$.getSolrCloudClient(SolrSupport.scala:214)

  at 
com.lucidworks.spark.util.SolrSupport$.getNewSolrCloudClient(SolrSupport.scala:240)

  at 
com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:38)

  at 
com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:36)

  at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)

  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)

  at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)

  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)

  ... 64 more

Caused by: java.lang.ClassNotFoundException: 
org.eclipse.jetty.client.api.Authentication

  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

  ... 72 more


Re: Spark-Solr connector

2019-07-12 Thread Dwane Hall
Thanks Shawn I'll raise a question on the GitHub page. Cheers,
Dwane

From: Shawn Heisey 
Sent: Friday, 12 July 2019 10:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Spark-Solr connector

On 7/11/2019 8:50 PM, Dwane Hall wrote:
> I’ve just started looking at the excellent spark-solr project (thanks Tim 
> Potter, Kiran Chitturi, Kevin Risden and Jason Gerlowski for their efforts 
> with this project it looks really neat!!).
>
> I’m only at the initial stages of my exploration but I’m running into a class 
> not found exception when connecting to a secure solr cloud instance (basic 
> auth, ssl).  Everything is working as expected on a non-secure solr cloud 
> instance.
>
> The process looks pretty straightforward according to the doco so I’m 
> wondering if I’m missing anything obvious or if I need to bring any extra 
> classes to the classpath when using this project?
>
> Any advice would be greatly appreciated.

The exception here (which I did not quote) is in code from Google,
Spark, and Lucidworks.  There are no Solr classes mentioned at all in
the stacktrace.

Which means that we won't be able to help you on this list.  Looking
closer at the stacktrace, it looks to me like you're going to need to
talk to Lucidworks about this problem.

Thanks,
Shawn


Re: Solr Ref Guide Changes - now HTML only

2019-10-29 Thread Dwane Hall
Although I don't use the pdf version I highly recommend watching Cassandra's 
talk from Activate last year ( https://m.youtube.com/watch?v=DixlnxAk08s). In 
this talk she addresses the challenges of the Solr ref guide including the 
'title search' mentioned below and presents a number of options for the guide's 
future.  It certainly gave me an appreciation for some of the complexities in 
this part of the Solr project that I'd never fully appreciated before.

The guide has come a long way since the confluence days and continues to evolve 
with every release. Although the title search is not ideal, I've yet to fail to find 
anything I'm looking for with a little creative googling.

That's my two cents on the matter.

From: Alexandre Rafalovitch 
Sent: Tuesday, 29 October 2019 9:11 AM
To: solr-user 
Subject: Re: Solr Ref Guide Changes - now HTML only

I've done some experiments about indexing RefGuide (from source) into
Solr at: https://github.com/arafalov/solr-refguide-indexing . But the
problem was creating UI, hosting, etc.

There was also a thought (mine) of either shipping RefGuide in Solr
with pre-built index as an example or even just shipping an index with
links to the live version. Both of these were complicated because PDF
was throwing the publication schedule of. And also because we are
trying to make Solr distribution smaller, not bigger. A bit of a
catch-22 there. But maybe now it could be revisited.

Regards,
   Alex.
P.s. A personal offline copy of Solr RefGuide could certainly be built
from source. And it will become even easier to do that soon. But yes,
perhaps a compressed download of HTML version would be a nice
replacement of PDF.

On Tue, 29 Oct 2019 at 09:04, Shawn Heisey  wrote:
>
> On 10/28/2019 3:51 PM, Nicolas Paris wrote:
> > I am not very happy with the search engine embedded within the html
> > documentation I admit. Hope this is not solr under the hood :S
>
> It's not Solr under the hood.  It is done by a javascript library that
> runs in the browser.  It only searches page titles, not the whole document.
>
> The fact that a search engine has terrible search in its documentation
> is not lost on us.  We talked about what it would take to use Solr ...
> the infrastructure that would have to be set up and maintaned is
> prohibitive.
>
> We are looking into improving things in this area.  It's going a lot
> slower than we'd like.
>
> Thanks,
> Shawn


Cursor mark page duplicates

2019-11-07 Thread Dwane Hall
Hey Solr community,

I'm using Solr's cursor mark feature and noticing duplicates when paging 
through results.   The duplicate records happen intermittently and appear at 
the end of one page, and the beginning of the next (but not on all pages 
through the results). So if rows=20 the duplicate records would be document 20 
on page 1, and document 21 on page 2.   The document ids come from a database 
and that field is a unique primary key so I'm confident that there are no 
duplicate document id's in my corpus.   Additionally no index updates are 
occurring in the index (it's completely static).  My result sort order is id (a 
string representation of a timestamp (YYYY-MM-DD HH:MM.SS)), score. In this 
Solr community post 
(https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
 Shawn Heisey suggests:


"There are two ways this can happen.  One is that the index has changed
between different queries, pushing or pulling results between the end of
one page and the beginning of the next page.  The other is having the
same uniqueKey value in more than one shard."

In the Solr query below for one of the example duplicates in question I can see 
a search by the id returns only a single document. The replication factor for 
the collection is 2 so the id will also appear in this shard's replica.  Taking 
into consideration Shawn's advice above, my question is will having a shard 
replica still count as the document having a duplicate id in another shard and 
potentially introduce duplicates into my paged results?  If not could anyone 
suggest another possible scenario where duplicates could potentially be 
introduced?

As always any advice would be greatly appreciated,

Thanks,

Dwane

Environment
Solr cloud (7.7.2)
8 shard collection, replication factor 2

{

  "responseHeader":{

"zkConnected":true,

"status":0,

"QTime":2072,

"params":{

  "q":"id:myUUID(-MM-DD HH:MM.SS)",

  "fl":"id,[shard]"}},

  "response":{"numFound":1,"start":0,"maxScore":17.601822,"docs":[

  {

"id":"myUUID(-MM-DD HH:MM.SS)",


"[shard]":"https://solr1:9014/solr/MyCollection_shard4_replica_n12/|https://solr2:9011/solr/MyCollection_shard4_replica_n35/"}]

  }}




Re: Cursor mark page duplicates

2019-11-11 Thread Dwane Hall
Thanks Erick/Hossman,

I appreciate your input it's always an interesting read seeing Solr legends 
like yourselves work through a problem!  I certainly learn a lot from following 
your responses in this user group.

As you recommended I ran the distrib=false query on each shard and the results 
were the identical in both instances.  Below is a snapshot from the admin ui 
showing the details of each shard which all looks in order to me (other than 
our large number of deletes in the corpus ...we have quite a dynamic 
environment when the index is live)


Last Modified:23 days ago

Num Docs:47247895

Max Doc:68108804

Heap Memory Usage:-1

Deleted Docs:20860909

Version:528038

Segment Count:41



Master (Searching) Version:1571148411550 Gen:25528 Size:42.56 GB

Master (Replicable) Version:1571153302013 Gen:25529



Last Modified:23 days ago

Num Docs:47247895

Max Doc:68223647

Heap Memory Usage:-1

Deleted Docs:20975752

Version:526613

Segment Count:43



Master (Searching) Version:1571148411615 Gen:25527 Size:42.63 GB

Master (Replicable) Version:1571153302076 Gen:25528

I was however able to replicate the issue but under unusual circumstances with 
some crude in browser testing.  If I use a cursorMark other than "*" and 
constantly re-run the query (just resubmitting the url in a browser with the 
same cursor and query) the first result on the page toggles between the 
expected value, and the last item from the previous page.  So if rows=50, page 
2 toggles between result 51 (expected) and result 50 (the last item from the 
previous page).  It doesn't happen all the time but every one in five or so 
refreshes I'm able to replicate it consistently (and on every subsequent 
cursor).

I failed to mention in my original email that we use the HdfsDirectoryFactory 
to store our indexes in HDFS.  This configuration uses an off heap block cache 
to cache HDFS blocks in memory as it is unable to take advantage of the OS disk 
cache.  I mention this as we're currently in the process of switching to local 
disk and I've been unable to replicate the issue when using the local storage 
configuration of the same index.  This maybe completely unrelated, and 
additionally the local storage index is freshly loaded so it has not 
experienced the same number of deletes or updates that our HDFS indexes have.

I think my best bet is to monitor our new index configuration and if I notice 
any similar behaviour I'll make the community aware of my findings.

Once again,

Thanks for your input

Dwane


From: Chris Hostetter 
Sent: Friday, 8 November 2019 9:58 AM
To: solr-user@lucene.apache.org 
Subject: Re: Cursor mark page duplicates


: I'm using Solr's cursor mark feature and noticing duplicates when paging
: through results.  The duplicate records happen intermittently and appear
: at the end of one page, and the beginning of the next (but not on all
: pages through the results). So if rows=20 the duplicate records would be
: document 20 on page1, and document 21 on page 2.  The document's id come

Can you try to reproduce and show us the specifics of this including:

1) The sort param you're using
2) An 'fl' list that includes every field in the sort param
3) The returned values of every 'fl' field for the "duplicate" document
you are seeing as it appears in *BOTH* pages of results -- allong with the
cursorMark value in use on both of those pages.


: (-MM-DD HH:MM.SS)), score. In this Solr community post
: 
(https://lucene.472066.n3.nabble.com/Solr-document-duplicated-during-pagination-td4269176.html)
: Shawn Heisey suggests:

...that post was *NOT* about using cursorMark -- it was plain old regular
pagination, where even on a single core/replica you can see a document
X get "pushed" from page#1 to page#2 by updates/additions of some other
doxument Z that causes Z to sort "before" X.

With cursors this kind of "pushing other docs back" or "pushing other docs
forward" doesn't exist because of the cursorMark.  The only way a doc
*should* move is if it's OWN sort values are updated, causing it to
reposition itself.

But, if you have a static index, then it's *possible* that the last time
your document X was updated, there was a "glitch" somewhere in the
distributed update process, and the update didn't succeed in osme
replicas -- so the same document may have different sort values
on diff replicas.

: In the Solr query below for one of the example duplicates in question I
: can see a search by the id returns only a single document. The
: replication factor for the collection is 2 so the id will also appear in
: this shards replica.  Taking into consideration Shawn's advice above, my

If you've already identified a particular document where this has
happened, then you can also verify/disprove my hypothosis by hitting each
of the replicas that hosts this document with a request that looks like...

/solr/MyCollection_shard4_replica_n12/select?q=id:FOO&distrib=false
/solr/MyCollection_sha

Re: Cursor mark page duplicates

2019-11-28 Thread Dwane Hall
Hey guys,

I asked a question on the forum a couple of weeks ago regarding cursorMark 
duplicates.  I initially thought it may be due to HDFSCaching because I was 
unable to replicate the issue on local indexes but unfortunately the dreaded 
duplicates have returned!! For a refresher I was seeing what I thought was 
duplicate documents appearing randomly on the last page of one cursor, and the 
first page of the next.  So if rows=50 the duplicates are document 50 on page 1 
and document 1 on page 2.

After further investigation I don't actually believe these documents are 
duplicates but the same document being returned from a different replica on 
each page.  After running a diff on the two documents the only difference is 
the field "Solr_Update_Date" which I insert on each document as it is inserted 
into the corpus.

This is how the managed-schema mapping for this field looks

<field name="Solr_Update_Date" ... default="NOW" />

The only sort parameter is the id field

"sort":"id desc"

rows=50


Here are the results




Document 50 on page 1 is



{

  "responseHeader":{

"zkConnected":true,

"status":0,

"QTime":8,

"params":{

  "q":"id:\"2019-10-29 15:15:36.748052\"",

  "fl":"id,_version_,[shard],Solr_Update_Date",

  "_":"1574900506126"}},

  "response":{"numFound":1,"start":0,"maxScore":7.312953,"docs":[

  {

"id":"2019-10-29 15:15:36.748052",

"Solr_Update_Date":"2019-11-01T00:15:07.811Z",

"_version_":1648956337338449920,


"[shard]":"https://solrHost:9021/solr/my_collection_shard4_replica_n14/|https://solrHost:9022/solr/my_collection_shard4_replica_n12/"}]

  }}



Document 1 on page 2 is


{

  "responseHeader":{

"zkConnected":true,

"status":0,

"QTime":7,

"params":{

  "q":"id:\"2019-10-29 15:15:36.748052\"",

  "fl":"id,_version_,[shard],Solr_Update_Date",

  "_":"1574900506126"}},

  "response":{"numFound":1,"start":0,"maxScore":7.822712,"docs":[

  {

"id":"2019-10-29 15:15:36.748052",

"Solr_Update_Date":"2019-11-01T00:15:07.794Z",

"_version_":1648956337338449920,


"[shard]":"https://solrHost:9022/solr/my_collection_shard4_replica_n12/|https://solrHost:9021/solr/my_collection_shard4_replica_n14/"}]

  }}


As you can see both documents have the same version number but different 
maxScores and Solr_Update_Date values.  My understanding is the cursorMark should 
only be generated off the id field so I can't see why I would get a different 
document from a different shard at the end of one page, and the beginning of 
the next? Would anyone have any insight into this behaviour as this happens 
randomly on page boundaries when paging through results.

Thanks for your time

Dwane




Re: Cursor mark page duplicates

2019-11-28 Thread Dwane Hall
Thanks Shawn, you are indeed correct these are NRT replicas! Thanks very much 
for the advice and possible resolutions. I went down the NRT path as in the 
past I've read advice from some of the Solr gurus recommending to use these 
replica types unless you have a very good reason not to. I do have basic auth 
enabled on my Solr cloud configuration and believe I can't use PULL replicas 
until the following JIRA is resolved 
(https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-11904) as 
Solr uses the index replicator for this process. With this being the case I'll 
attempt your second suggestion and see how I go. Thanks again for taking the 
time to look at this it really was a confusing one to debug. Have a great 
weekend fellow Solr users and happy Solr-ing.

Dwane

From: Shawn Heisey 
Sent: Friday, 29 November 2019 4:51 AM
To: solr-user@lucene.apache.org 
Subject: Re: Cursor mark page duplicates

On 11/28/2019 1:30 AM, Dwane Hall wrote:
> I asked a question on the forum a couple of weeks ago regarding cursorMark 
> duplicates.  I initially thought it may be due to HDFSCaching because I was 
> unable to replicate the issue on local indexes but unfortunately the dreaded 
> duplicates have returned!! For a refresher I was seeing what I thought was 
> duplicate documents appearing randomly on the last page of one cursor, and 
> the first page of the next.  So if rows=50 the duplicates are document 50 on 
> page 1 and document 1 on page 2.
>
> After further investigation I don't actually believe these documents are 
> duplicates but the same document being returned from a different replica on 
> each page.  After running a diff on the two documents the only difference is 
> the field "Solr_Update_Date" which I insert on each document as it is 
> inserted into the corpus.
>
> This is how the managed-schema mapping for this field looks
>
> <field name="Solr_Update_Date" ... default="NOW" />
This can happen with SolrCloud using NRT replicas.  The default replica
type is NRT.  Based on the core names returned by the [shard] field in
your responses, it looks like you do have NRT replicas.

There are two solutions.  The better solution is to use
TimestampUpdateProcessorFactory for setting your timestamp field instead
of a default of NOW in the schema.  An alternate solution is to use
TLOG/PULL replica types instead of NRT -- that way replicas are
populated by copying exact index contents instead of independently indexing.
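
A minimal sketch of the first option (the chain name is a placeholder; the 
important detail is that the timestamp is set before the distributed processor 
so every replica indexes the same value):

<updateRequestProcessorChain name="set-update-timestamp" default="true">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">Solr_Update_Date</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>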

Thanks,
Shawn


Re: Solr Cloud on Docker?

2019-12-22 Thread Dwane Hall
Hey Walter,

I recently migrated our Solr cluster to Docker and am very pleased I did so. We 
run relatively large servers and run multiple Solr instances per physical host, 
and having managed Solr upgrades on bare metal installs since Solr 5, 
containerisation has been a blessing (currently Solr 7.7.2). In our case we run 
20 Solr nodes per host over 5 hosts, totalling 100 Solr instances. Here I host 3 
collections of varying size. The first contains 60m docs (8 shards), the second 
360m (12 shards), and the third 1.3b (30 shards), all with 2 NRT replicas. The 
docs are primarily database sourced but are not tiny by any means.

Here are some of my comments from our migration journey:
- Running Solr on Docker should be no different to bare metal. You still need 
to test for your environment and conditions and follow the guides and best 
practices outlined in the excellent Lucidworks blog post 
https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/.
- The recent Solr Docker images are built with Java 11 so if you store your 
indexes in hdfs you'll have to build your own Docker image as Hadoop is not yet 
certified with Java 11 (or use an older Solr version image built with Java 8)
- As Docker will be responsible for quite a few Solr nodes it becomes important 
to make sure the Docker daemon is configured in systemctl to restart after 
failure or reboot of the host. Additionally the Docker restart=always setting 
is useful for restarting failed containers automatically if a single container 
dies (i.e. JVM explosions). I've deliberately blown up the JVM in test 
conditions and found the containers/Solr recover really well under Docker.
- I use Docker Compose to spin up our environment and it has been excellent for 
maintaining consistent settings across Solr nodes and hosts. Additionally, using 
a .env file makes most of the Solr environment variables per node configurable 
in an external file (a trimmed sketch of one node's service definition is shown 
after this list).
- I'd recommend Docker Swarm if you plan on running Solr over multiple physical 
hosts. Unfortunately we had an incompatible OS so I was unable to utilise this 
approach. The same incompatibility existed for K8s but Lucidworks has another 
great article on this approach if you're more fortunate with your environment 
than us https://lucidworks.com/post/running-solr-on-kubernetes-part-1/.
- Our Solr instances are TLS secured and use the basic auth plugin and rules 
based authentication provider. There's nothing I have not been able to 
configure with the default Docker images using environment variables passed 
into the container. This makes upgrades to Solr versions really easy as you 
just need to grab the image and pass in your environment details to the 
container for any new Solr version.
- If possible I'd start with the Solr 8 Docker image. The project underwent a 
large refactor to align it with the install script based on community feedback. 
If you start with an earlier version you'll need to refactor when you 
eventually move to Solr version 8. The Solr Docker page has more details on 
this.
- Martijn Koster (the project lead) is excellent and very responsive to 
questions on the project page. Read through the Q&A page before reaching out; I 
found a lot of my questions already answered there.  Additionally, he provides 
a number of example Docker configurations, from command line parameters to 
docker-compose files running multiple instances and zookeeper quorums.
- The Docker extra hosts parameter is useful for adding extra hosts to your 
container's hosts file, particularly if you have multiple NICs with internal 
and external interfaces and you want to force communication over a specific one.
- We use the Solr Prometheus exporter to collect node metrics. I've found I've 
needed to reduce the metrics to collect as having this many nodes overwhelmed 
it occasionally. From memory it had something to do with concurrent 
modification of Future objects the collector uses and it sometimes misses 
collection cycles. This is not Docker related but Solr size related and the 
exporter's ability to handle it.
- We use the zkCli script a lot for updating configsets. As I did not want to 
have to copy them into a container to update them I just download a copy of the 
Solr binaries and use it entirely for this zookeeper script. It's not elegant 
but a number of our devs are not familiar with Docker and this was a nice 
compromise. Another alternative is to just use the REST API to do any configset 
manipulation.
- We load balance all of these nodes to external clients using a haproxy Docker 
image. This combined with the Docker restart policy and Solr replication and 
autoscaling capabilities provides a very stable environment for us.
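
To give a feel for the compose setup mentioned above (the image tag, ports, 
heap, paths and zookeeper hosts are illustrative placeholders rather than our 
real values), one node's service definition looks roughly like:

version: "3"
services:
  solr1:
    image: solr:8.4.1
    restart: always
    ports:
      - "9011:8983"
    environment:
      - ZK_HOST=zk1:2181,zk2:2181,zk3:2181/solr
      - SOLR_HEAP=8g
      - SOLR_HOST=solr-host1.example.com
    volumes:
      - /var/solr/data/solr1:/var/solr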

All in all migrating and running Solr on Docker has been brilliant. It was 
primarily driven by a need to scale our environment vertically on large 
hardware instances as running 100 nodes on bare metal was too big a maintenance 
and administrat