solr-injection

2020-02-11 Thread Martin Frank Hansen (MHQ)
Hi,

I was wondering how others are handling Solr injection in their solutions?

After reading this post: 
https://www.waratek.com/apache-solr-injection-vulnerability-customer-alert/ I 
can see how important it is to update to Solr 8.2 or higher.

Has anyone been successful in injecting unintended queries into Solr? I have 
tried to delete the database from the front end, using basic search strings and 
Solr commands, but have not yet been successful (which is good). I think there 
are many who know much more about this than I do, so it would be nice to hear from 
someone with more experience.

Which considerations do I need to look at in order to secure my Solr core? 
Currently we have a security layer on top of Solr, but at the same time we do 
not want to restrict the flexibility of the searches too much.

Best regards

Martin




Re: solr-injection

2020-02-11 Thread Jörn Franke
Do not have users accessing Solr directly.

Have your own secure web frontend / your own APIs in front of it. In this way you can 
control access securely.

Secure Solr with HTTPS and Kerberos. Give your web frontend only the access 
rights it needs, and your admins only the access rights they need. Automate 
deployment of configurations through the APIs. Secure ZooKeeper (if in cloud 
mode) with SSL and authentication (e.g. Kerberos).

Make sure that connections to those two are only allowed from the web frontend 
and admins (for the latter, have a dedicated jumphost from which connections 
are allowed).
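
On the application side, a common complement to the measures above is to escape raw 
user input before it is embedded in a query string. A minimal SolrJ sketch, assuming 
a hypothetical core URL and field name:

    // Escape raw user input so query operators and local-params syntax in it
    // are treated as literal text (core URL and field name are placeholders).
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapedSearch {
        public static void main(String[] args) throws Exception {
            String userInput = args.length > 0 ? args[0] : "foo AND {!xmlparser ...}";
            try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                // escapeQueryChars neutralises characters such as +, -, !, (, ), {, }, : etc.
                String safeTerm = ClientUtils.escapeQueryChars(userInput.trim());
                SolrQuery q = new SolrQuery("title:" + safeTerm);
                q.setRows(10);
                QueryResponse rsp = solr.query(q);
                System.out.println("hits: " + rsp.getResults().getNumFound());
            }
        }
    }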



> Am 11.02.2020 um 10:55 schrieb Martin Frank Hansen (MHQ) :
> 
> Hi,
> 
> I was wondering how others are handling Solr injection in their solutions?
> 
> After reading this post: 
> https://www.waratek.com/apache-solr-injection-vulnerability-customer-alert/ I 
> can see how important it is to update to Solr 8.2 or higher.
> 
> Has anyone been successful in injecting unintended queries into Solr? I have 
> tried to delete the database from the front end, using basic search strings 
> and Solr commands, but have not yet been successful (which is good). I think 
> there are many who know much more about this than I do, so it would be nice to 
> hear from someone with more experience.
> 
> Which considerations do I need to look at in order to secure my Solr core? 
> Currently we have a security layer on top of Solr, but at the same time we do 
> not want to restrict the flexibility of the searches too much.
> 
> Best regards
> 
> Martin
> 
> 


Possible performance issue in my environment setup

2020-02-11 Thread Rudenko, Artur
I am currently investigating a performance issue in our environment (20M 
large PARENT documents and 800M small nested CHILD documents). The system 
inserts about 400K PARENT documents and 16M CHILD documents per day.
This is a Solr Cloud 8.3 environment with 7 servers (64 vCPUs and 128 GB RAM each, 
24 GB allocated to Solr) with a single collection (32 shards and replication 
factor 2).

Solr config related info:

  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:360}</maxTime>
    <maxDocs>${solr.autoCommit.maxDocs:5}</maxDocs>
    <openSearcher>true</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:30}</maxTime>
  </autoSoftCommit>

I found in the solr log the following log line:

[2020-02-10T00:01:00.522] INFO [qtp1686100174-100525] 
org.apache.solr.search.SolrIndexSearcher Opening 
[Searcher@37c9205b[0_shard29_replica_n112] realtime]

From a log with 100K records, the above log record appears 65K times.

We are experiencing extremely slow query times, while indexing is fast 
and sufficient.

Is this a possible direction to keep investigating? If so, any advice?


Thanks,
Artur Rudenko




SolrJ 8.2: Too many Connection evictor threads

2020-02-11 Thread Andreas Kahl
Hello everyone, 

we just updated our Solr from 5.4 to 8.2. The server runs fine,
but in our client applications we are seeing issues with thousands of
threads created with the name "Connection evictor".
Can you give a hint on how to limit those threads?
Should we rather use HttpSolrClient or Http2SolrClient?
Is another version of SolrJ advisable?

Thanks & Best Regards
Andreas



Issue to upgrade solr cloud

2020-02-11 Thread Yogesh Chaudhari
Hi All,

Currently we are using Solr 5.2.1 on our production server and want to upgrade to Solr 
7.7.2. We have been using Solr 5.2.1 for the last 5 years, and we have millions of 
documents on the production server. We have a Solr cloud with 2 shards and 3 replicas 
on the production server.

I have upgraded Solr 5.2.1 to Solr 6.6.6; it upgraded successfully on my 
local machine.

Now I am trying to upgrade Solr 6.6.6 to Solr 7.7.2. I have upgraded all 6 
Solr instances, one at a time, to Solr 7.7.2, and I am getting the error below. One shard 
(with 3 replicas) upgraded successfully, but the other shard is giving an 
error (please refer to the image below). Though one shard is upgraded, I cannot do 
anything with it. I think the issue is due to old indexes or documents.


o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: no servers 
hosting shard: shard2

at 
org.apache.solr.handler.component.HttpShardHandler.prepDistributed(HttpShardHandler.java:463)
at 
org.apache.solr.handler.component.SearchHandler.getAndPrepShardHandler(SearchHandler.java:226)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:267)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2551)
at 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:711)
at 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1588)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1557)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:502)
I have been struggling for the last few days to migrate from Solr 5.2.1 to Solr 7.7.2.

Can you please share your inputs or assist me?

Thanks,

Yogesh Chaudhari


Re: Issue to upgrade solr cloud

2020-02-11 Thread Erick Erickson
You really have to re-index your content in this case. This is enforced in 
Lucene/Solr 8. Upgrading from one version to another isn’t sufficient.

The mail server pretty aggressively strips attachments, so your picture (?) 
didn’t come through.

The log you posted isn’t very helpful, we’d need the logs from a node that 
hosts a replica from shard2.

That said, I wouldn't personally pursue the upgrade path; please re-index to a fresh 
collection.

Best,
Erick
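
For reference, a minimal sketch of the re-index-to-a-fresh-collection approach using SolrJ 
and cursorMark. The collection names, ZooKeeper address and the uniqueKey field are 
assumptions, and stored fields must cover everything needed to rebuild the documents:

    // Stream documents out of the old collection with a cursor and add them to a
    // freshly created collection, rather than upgrading index files in place.
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    import java.util.Collections;
    import java.util.Optional;

    public class Reindex {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient src = new CloudSolrClient.Builder(
                     Collections.singletonList("zk1:2181"), Optional.empty()).build();
                 CloudSolrClient dst = new CloudSolrClient.Builder(
                     Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
                src.setDefaultCollection("oldcollection");
                dst.setDefaultCollection("newcollection");

                SolrQuery q = new SolrQuery("*:*").setRows(1000).setSort("id", SolrQuery.ORDER.asc);
                String cursor = CursorMarkParams.CURSOR_MARK_START;
                while (true) {
                    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                    QueryResponse rsp = src.query(q);
                    for (SolrDocument d : rsp.getResults()) {
                        SolrInputDocument in = new SolrInputDocument();
                        // copy everything except internal fields such as _version_
                        d.forEach((k, v) -> { if (!"_version_".equals(k)) in.addField(k, v); });
                        dst.add(in);
                    }
                    String next = rsp.getNextCursorMark();
                    if (next.equals(cursor)) break;   // no more documents
                    cursor = next;
                }
                dst.commit();
            }
        }
    }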

> On Feb 11, 2020, at 6:45 AM, Yogesh Chaudhari 
>  wrote:
> 
> Hi All,
>  
> Currently we are using Solr 5.2.1 on our production server and want to upgrade to 
> Solr 7.7.2. We have been using Solr 5.2.1 for the last 5 years, and we have millions 
> of documents on the production server. We have a Solr cloud with 2 shards and 3 
> replicas on the production server. 
>  
> I have upgraded Solr 5.2.1 to Solr 6.6.6; it upgraded successfully on my 
> local machine.
>  
> Now I am trying to upgrade Solr 6.6.6 to Solr 7.7.2. I have upgraded all 6 
> Solr instances, one at a time, to Solr 7.7.2, and I am getting the error below. One 
> shard (with 3 replicas) upgraded successfully, but the other shard is 
> giving an error (please refer to the image below). Though one shard is upgraded, I 
> cannot do anything with it. I think the issue is due to old indexes or 
> documents.
>  
> 
>  
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: no servers 
> hosting shard: shard2
>  
> I am struggling from last few days to migrate solr 5.2.1 to Solr 7.7.1.
>  
> Can you please share your inputs or please assist me?
>  
> Thanks,
>  
> Yogesh Chaudhari



Re: SolrJ 8.2: Too many Connection evictor threads

2020-02-11 Thread Erick Erickson
Are you running a 5x SolrJ client against an 8x server? There’s no
guarantee at all that that would work (or vice-versa for that matter).

Most generally, SolrJ clients should be able to work with version X-1, but X-3
is unsupported.

Best,
Erick

> On Feb 11, 2020, at 6:36 AM, Andreas Kahl  wrote:
> 
> Hello everyone, 
> 
> we just updated our Solr from former 5.4 to 8.2. The server runs fine,
> but in our client applications we are seeing issues with thousands of
> threads created with the name "Connection evictor". 
> Can you give a hint how to limit those threads? 
> Should we better use HttpSolrClient or Http2SolrClient?
> Is another version of SolrJ advisable?
> 
> Thanks & Best Regards
> Andreas
> 



Antw: Re: SolrJ 8.2: Too many Connection evictor threads

2020-02-11 Thread Andreas Kahl
Erick, 


Thanks, that's why we want to upgrade our clients to the same Solr(J) version 
as the server. But I am still fighting the uncontrolled creation of those 
Connection evictor threads in my Tomcat.


Best Regards

Andreas
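
If it helps as a direction: "Connection evictor" threads are typically started by the 
Apache HttpClient connection management that SolrJ uses under the hood, roughly one per 
client instance, so thousands of them usually means a new SolrClient (or a new underlying 
HttpClient) is being created per request instead of being reused. A minimal sketch of the 
reuse pattern (the URL is a placeholder):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    // Build the client once and share it; HttpSolrClient is thread-safe and holds the
    // connection pool (and its evictor) that should exist only once per webapp.
    public final class SolrClientHolder {
        private static final SolrClient CLIENT =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycore")
                .withConnectionTimeout(5000)
                .withSocketTimeout(30000)
                .build();

        private SolrClientHolder() {}

        public static SolrClient get() {
            return CLIENT;
        }

        // Call from a shutdown hook / ServletContextListener so the pool and its
        // background threads are released exactly once.
        public static void shutdown() throws java.io.IOException {
            CLIENT.close();
        }
    }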


>>> Erick Erickson  11.02.20 15.06 Uhr >>>
Are you running a 5x SolrJ client against an 8x server? There’s no
guarantee at all that that would work (or vice-versa for that matter).

Most generally, SolrJ clients should be able to work with version X-1, but X-3
is unsupported.

Best,
Erick

> On Feb 11, 2020, at 6:36 AM, Andreas Kahl  wrote:
> 
> Hello everyone, 
> 
> we just updated our Solr from former 5.4 to 8.2. The server runs fine,
> but in our client applications we are seeing issues with thousands of
> threads created with the name "Connection evictor". 
> Can you give a hint how to limit those threads? 
> Should we better use HttpSolrClient or Http2SolrClient?
> Is another version of SolrJ advisable?
> 
> Thanks & Best Regards
> Andreas
> 




Re: Possible performance issue in my environment setup

2020-02-11 Thread Erick Erickson
My first bit of advice would be to fix your autocommit intervals. There's not 
much point in having openSearcher set to true _and_ having your soft commit 
times also set; all a soft commit does is open a searcher, and your autoCommit 
already does that.

I'd also reduce the time for autoCommit. You're _probably_ being saved by the 
maxDocs entry.

The fix here is to set openSearcher=false in autoCommit and reduce the time, and let 
soft commit handle opening searchers. Here's 
more than you want to know about how all this works:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Given your observation that you see a new searcher being opened
65K times, my bet is that you’re somehow committing far, far too
often. What’s the rate of opening new searchers? Do those 65K
entries span an hour? 10 days? Either you’re sending 50K docs very
frequently or your client is sending commits.

So here’s what I’d do as a quick-n-dirty triage of where to look first:

- first turn off indexing. Does your query performance improve? If so, consider 
autowarming and tuning your commit interval.

- next, add &debug=timing to some of your queries. That’ll tell you if a 
particular _component_ is taking a long time, something like faceting say.

- If nothing jumps out, throw a profiler at Solr to see where it's spending 
its time.

Best,
Erick
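
A sketch of the commit configuration described above; the interval values are placeholders 
to adapt, not recommendations from this thread:

  <autoCommit>
    <!-- hard commit: flush to disk and truncate the transaction log, but do not open a searcher -->
    <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <autoSoftCommit>
    <!-- soft commit: controls how often newly indexed documents become visible to searches -->
    <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
  </autoSoftCommit>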

> On Feb 11, 2020, at 6:17 AM, Rudenko, Artur  wrote:
> 
> I am currently investigating a performance issue in our environment (20M 
> large PARENT documents and 800M small nested CHILD documents). The system 
> inserts about 400K PARENT documents and 16M CHILD documents per day.
> This is a solr cloud 8.3 environment with 7 servers (64 VCPU 128 GB RAM each, 
> 24GB allocated to Solr) with single collection (32 shards and replication 
> factor 2).
> 
> Solr config related info:
> 
>   <autoCommit>
>     <maxTime>${solr.autoCommit.maxTime:360}</maxTime>
>     <maxDocs>${solr.autoCommit.maxDocs:5}</maxDocs>
>     <openSearcher>true</openSearcher>
>   </autoCommit>
> 
>   <autoSoftCommit>
>     <maxTime>${solr.autoSoftCommit.maxTime:30}</maxTime>
>   </autoSoftCommit>
> 
> I found in the solr log the following log line:
> 
> [2020-02-10T00:01:00.522] INFO [qtp1686100174-100525] 
> org.apache.solr.search.SolrIndexSearcher Opening 
> [Searcher@37c9205b[0_shard29_replica_n112] realtime]
> 
> From a log with 100K records, the above log record appears 65K times.
> 
> We are experiencing extremely slow query time while the indexing time is fast 
> and sufficient.
> 
> Is this a possible direction to keep investigating? If so, any advices?
> 
> 
> Thanks,
> Artur Rudenko
> 
> 



Re: cursorMark and shards? (6.6.2)

2020-02-11 Thread Erick Erickson
Wow, that’s pretty horrible performance. 

Yeah, I was conflating a couple of things here. Now it’s clear.

If you specify rows=1, what do you get in response time? I’m wondering if
your time is spent just assembling the response rather than searching. You’d
have to have massive docs for that to be the case, kind of a shot in the dark.
The assembly step requires the docs be read off disk, decompressed and then
transmitted, but 10 seconds is ridiculous for that. I'm starting to wonder about
being I/O bound, either disk or network, but I'm pretty sure you've already
thought about that.

You are transmitting things around your servers given your statement that you
are seeing the searches distributed, which is also a waste, but again I wouldn’t
expect it to be that bad.

Hmmm, quick thing to check: What are the QTime’s reported? Those are
exclusive of assembling the return packet. If they were a few milliseconds and
your response back at the client was 10s, that’d be a clue.

Best,
Erick
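
For comparison purposes, a minimal SolrJ sketch of one cursorMark page fetch that logs 
both the server-reported QTime and the wall-clock time seen by the client (the collection 
URL and sort field are assumptions); a large gap between the two points at response 
assembly or transport rather than the search itself:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorTiming {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
                SolrQuery q = new SolrQuery("id:0*")
                    .setRows(1000)
                    .setSort("id", SolrQuery.ORDER.asc);
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, CursorMarkParams.CURSOR_MARK_START);

                long start = System.currentTimeMillis();
                QueryResponse rsp = solr.query(q);
                long wallClock = System.currentTimeMillis() - start;

                // QTime is measured inside Solr, before the response is serialised and sent.
                System.out.println("QTime=" + rsp.getQTime() + "ms, client-side=" + wallClock
                    + "ms, nextCursorMark=" + rsp.getNextCursorMark());
            }
        }
    }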

> On Feb 11, 2020, at 2:13 AM, Walter Underwood  wrote:
> 
> sort=“id asc”
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 10, 2020, at 9:50 PM, Tim Casey  wrote:
>> 
>> Walter,
>> 
>> When you do the query, what is the sort of the results?
>> 
>> tim
>> 
>> On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood 
>> wrote:
>> 
>>> I’ll back up a bit, since it is sort of an X/Y problem.
>>> 
>>> I have an index with four shards and 17 million documents. I want to dump
>>> all the docs in JSON, label each one with a classifier, then load them back
>>> in with the labels. This is a one-time (or rare) bootstrap of the
>>> classified data. This will unblock testing and relevance work while we get
>>> the classifier hooked into the indexing pipeline.
>>> 
>>> Because I’m dumping all the fields, we can’t rely on docValues.
>>> 
>>> It is OK if it takes a few hours.
>>> 
>>> Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
>>> 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
>>> using cursorMark and distributed search. Median response time is 10 s. CPU
>>> usage is about 1%.
>>> 
>>> It is all pretty grubby and it seems like there could be a better way.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
 On Feb 10, 2020, at 3:39 PM, Erick Erickson 
>>> wrote:
 
 Any field that’s unique per doc would do, but yeah, that’s usually an ID.
 
 Hmmm, I don’t see why separate queries for 0-f are necessary if you’re
>>> firing
 at individual replicas. Each replica should have multiple UUIDs that
>>> start with 0-f.
 
 Unless I misunderstand and you’re just firing off, say, 16 threads at
>>> the entire
 collection rather than individual shards which would work too. But for
>>> individual
 shards I think you need to look for all possible IDs...
 
 Erick
 
> On Feb 10, 2020, at 5:37 PM, Walter Underwood 
>>> wrote:
> 
> 
>> On Feb 10, 2020, at 2:24 PM, Walter Underwood 
>>> wrote:
>> 
>> Not sure if range queries work on a UUID field, ...
> 
> A search for id:0* took 260 ms, so it looks like they work just fine.
>>> I’ll try separate queries for 0-f.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
 
>>> 
>>> 
> 



RE: Possible performance issue in my environment setup

2020-02-11 Thread Rudenko, Artur
Thanks for helping, I will keep investigating.

Just a note: we did stop indexing and we did not see any significant changes.

Artur Rudenko
Analytics Developer
Customer Engagement Solutions, VERINT
T +972.74.747.2536 | M +972.52.425.4686

-Original Message-
From: Erick Erickson 
Sent: Tuesday, February 11, 2020 4:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Possible performance issue in my environment setup

My first bit of advice would be to fix your autocommit intervals. There’s not 
much point in having openSearcher set to true _and_ having your soft commit 
times also set, all soft commit does is open a searcher and your autocommit 
does that.

I’d also reduce the time for autoCommit. You’re _probably_ being saved by the 
maxDoc entry,

Fix here is set openSearcher=false in autoCommit, and reduce the time. And let 
soft commit handle opening searchers. Here’s more than you want to know about 
how all this works:

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Given your observation that you see a new searcher being opened 65K times, my 
bet is that you’re somehow committing far, far too often. What’s the rate of 
opening new searchers? Do those 65K entries span an hour? 10 days? Either 
you’re sending 50K docs very frequently or your client is sending commits.

So here’s what I’d do as a quick-n-dirty triage of where to look first:

- first turn off indexing. Does your query performance improve? If so, consider 
autowarming and tuning your commit interval.

- next, add &debug=timing to some of your queries. That’ll tell you if a 
particular _component_ is taking a long time, something like faceting say.

- If nothing jumps out, throw a profiler at Solr to see where it’s spending 
it’s time.

Best,
Erick

> On Feb 11, 2020, at 6:17 AM, Rudenko, Artur  wrote:
>
> I am currently investigating a performance issue in our environment (20M 
> large PARENT documents and 800M small nested CHILD documents). The system 
> inserts about 400K PARENT documents and 16M CHILD documents per day.
> This is a solr cloud 8.3 environment with 7 servers (64 VCPU 128 GB RAM each, 
> 24GB allocated to Solr) with single collection (32 shards and replication 
> factor 2).
>
> Solr config related info:
>
>   <autoCommit>
>     <maxTime>${solr.autoCommit.maxTime:360}</maxTime>
>     <maxDocs>${solr.autoCommit.maxDocs:5}</maxDocs>
>     <openSearcher>true</openSearcher>
>   </autoCommit>
>
>   <autoSoftCommit>
>     <maxTime>${solr.autoSoftCommit.maxTime:30}</maxTime>
>   </autoSoftCommit>
>
> I found in the solr log the following log line:
>
> [2020-02-10T00:01:00.522] INFO [qtp1686100174-100525]
> org.apache.solr.search.SolrIndexSearcher Opening
> [Searcher@37c9205b[0_shard29_replica_n112] realtime]
>
> From a log with 100K records, the above log record appears 65K times.
>
> We are experiencing extremely slow query time while the indexing time is fast 
> and sufficient.
>
> Is this a possible direction to keep investigating? If so, any advices?
>
>
> Thanks,
> Artur Rudenko
>
>





Re: Dependency log4j-slf4j-impl for solr-core:7.5.0 causing a number of build problems

2020-02-11 Thread Wolf, Chris (ELS-CON)
(I found this stuck in my outbox, sorry for the delayed response)

Hi,

Thank you, I finally was able to configure Maven to exclude that logging 
implementation. But now I'm having an issue building a Spring Boot executable 
WAR with embedded Tomcat: for some reason, when I "spring-boot:run" it, it 
seems to use embedded Jetty rather than embedded Tomcat. I *think* it's 
because "solr-core" has a transitive dependency on Jetty jars. I will file a 
Jira when I get to the bottom of this as well.

Thanks for getting back to me.

-Chris

On 1/16/20, 10:49 PM, "David Smiley"  wrote:




Ultimately if you deduce the problem, file a JIRA issue and share it with
me; I will look into it.  I care about this matter too; I hate having to
exclude logging dependencies on the consuming end.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Jan 15, 2020 at 9:03 PM Wolf, Chris (ELS-CON) 
wrote:

> I am having several issues due to the slf4j implementation dependency
> “log4j-slf4j-impl” being declared as a dependency of solr-core:7.5.0.   
The
> first issue observed when starting the app is this:
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> 
[jar:file:/Users/ma-wolf2/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.7/log4j-slf4j-impl-2.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> 
[jar:file:/Users/ma-wolf2/.m2/repository/ch/qos/logback/logback-classic/1.1.3/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type
> [org.apache.logging.slf4j.Log4jLoggerFactory]
>
> I first got wind that this might not be just myself from this thread:
>
> 
https://lucene.472066.n3.nabble.com/log4j-slf4j-impl-dependency-in-solr-core-td4449635.html#a4449891
>
>
>   *   If there are any users that integrate solr-core into their own code,
> it's currently a bit of a land-mine situation to change logging
> implementations.  If there's a way we can include log4j jars at build
> time, but remove the log4j dependency on the published solr-core
> artifact, that might work well.  We should do our best to make it so
> people can use EmbeddedSolrServer without log4j jars.
>
> There are two dimensions to this dependency problem:
>
>   *   Building a war file (this runs with a warning)
>   *   Building a spring-boot executable JAR with embedded servlet
> container (doesn’t run)
>
> When building a WAR and deploying, I get the “multiple SLF4J bindings”
> warning, but the app works. However, I want the convenience of a
> spring-boot executable JAR with embedded servlet container, but in that
> case, I get that warning followed by a fatal NoClassDefFoundError/
> ClassNotFoundException – which is a show-stopper.  If I hack the built
> spring-boot FAT jar and remove “log4j-slf4j-impl.jar” then the app works.
>
> For the WAR build, the proper version of log4j-slf4j-impl.jar was included
> – 2.11.0, but, for some reason, when building the spring-boot fat (uber) jar,
> it was building with log4j-slf4j-impl:2.7, so of course it will croak.
>
> There are several issues:
>
>   1.  I don’t want log4j-slf4j-impl at all
>   2.  Somehow the version of “log4j-slf4j-impl” being used for the build
> is 2.7 rather then the expected 2.11.0
>   3.  Due to the version issue, the app croaks with
> ClassNotFoundException: org.apache.logging.log4j.util.ReflectionUtil
>
> For issue #1, I tried:
> 
>   <dependency>
>     <groupId>org.apache.solr</groupId>
>     <artifactId>solr-core</artifactId>
>     <version>7.5.0</version>
>     <exclusions>
>       <exclusion>
>         <groupId>org.apache.logging.log4j</groupId>
>         <artifactId>log4j-slf4j-impl</artifactId>
>       </exclusion>
>     </exclusions>
>   </dependency>
>
> All to no avail, as that dependency ends up in the packaged build - for
> WAR, it’s version 2.11.0, so even though it’s a bad build, the app runs,
> but for building a spring-boot executable JAR with embedded webserver, for
> some reason, it switches log4j-slf4j-impl from version 2.11.0 to 2.7
> (2.11.0 works, but should not even be there).
>
> I also tried this:
>
> 
https://docs.spring.io/spring-boot/docs/current/maven-plugin/examples/exclude-dependency.html
>
> …that didn’t work either.
>
> I’m thinking that solr-core should have added a classifier of “provided”
> for “log4j-slf4j-impl”, but that’s conjecture of a possible solution going
> forward, but does anyone know how I can exclude  “log4j-slf4j-impl”  from 
a
> spring-boot build?
>
>
>
>
>
>




Re: Dependency log4j-slf4j-impl for solr-core:7.5.0 causing a number of build problems

2020-02-11 Thread Wolf, Chris (ELS-CON)
(Sorry for the bad formatting; Outlook for Mac doesn't support internet quoting.)

Thanks Mark, I did that until I finally was able to exclude it altogether.

-Chris

On 1/17/20, 10:20 AM, "Mark H. Wood"  wrote:

For the version problem, I would try adding something like:

  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
        <version>2.11.0</version>
      </dependency>
    </dependencies>
  </dependencyManagement>

to pin down the version no matter what is pulling it in.  Not ideal,
since you want to be rid of this dependency altogether, but at least
it may allow the spring-boot artifact to run, until the other problem
is sorted.



GC_TUNE setting from solr.in.sh is not applied

2020-02-11 Thread Steffen Moldenhauer
Hi all,

I installed Solr 8.4.1 (for the first time, on a Linux subsystem, for testing purposes 
only) and for whatever reason the default GC settings prevented the server from 
running.
So I tried to change the settings with GC_TUNE in solr.in.sh.

But it did not get applied at startup. So I looked at the /solr-8.4.1/bin/solr 
script and found:

  if [ -z ${GC_TUNE+x} ]; then
  GC_TUNE=('-XX:+UseG1GC' \

${GC_TUNE+x} looks strange to me, but I do not really know that much about 
shell programming.

I changed it to

  if [ -z "$GC_TUNE" ]; then

and my setting from solr.in.sh got applied.

The same syntax is also used for SOLR_JAVA_STACK_SIZE+x and GC_LOG_OPTS+x. Is that kind 
of syntax really working, or is it wrong?

Regards
Steffen
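
For what it's worth, a quick illustration of the bash parameter expansion involved in that 
check (run by hand, output noted in the comments):

  # ${VAR+x} expands to "x" when VAR is set (even to an empty value) and to nothing when
  # it is unset, so [ -z ${GC_TUNE+x} ] only falls back to the defaults when GC_TUNE was
  # never set at all, while [ -z "$GC_TUNE" ] also treats an empty GC_TUNE as "use defaults".
  unset GC_TUNE;          echo "[${GC_TUNE+x}]"   # prints []
  GC_TUNE="";             echo "[${GC_TUNE+x}]"   # prints [x]
  GC_TUNE="-XX:+UseG1GC"; echo "[${GC_TUNE+x}]"   # prints [x]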


Re: Storage/Volume type for Kubernetes Solr POD?

2020-02-11 Thread Susheel Kumar
Thanks, Karl, for sharing. With local SSDs you would still be able to auto-scale, is
that correct?

On Fri, Feb 7, 2020 at 5:22 AM Nicolas PARIS 
wrote:

> hi all
>
> what about cephfs or lustre distrubuted filesystem for such purpose ?
>
>
> Karl Stoney  writes:
>
> > we personally run solr on google cloud kubernetes engine and each node
> has a 512Gb persistent ssd (network attached) storage which gives roughly
> this performance (read/write):
> >
> > Sustained random IOPS limit (read / write): 15,360.00 / 15,360.00
> > Sustained throughput limit in MB/s (read / write): 245.76 / 245.76
> >
> > and we get very good performance.
> >
> > ultimately though it's going to depend on your workload
> > 
> > From: Susheel Kumar 
> > Sent: 06 February 2020 13:43
> > To: solr-user@lucene.apache.org 
> > Subject: Storage/Volume type for Kubernetes Solr POD?
> >
> > Hello,
> >
> > Whats type of storage/volume is recommended to run Solr on Kubernetes
> POD?
> > I know in the past Solr has issues with NFS storing its indexes and was
> not
> > recommended.
> >
> >
> https://kubernetes.io/docs/concepts/storage/volumes/
> >
> > Thanks,
> > Susheel
>
>
> --
> nicolas paris
>


Re: cursorMark and shards? (6.6.2)

2020-02-11 Thread Walter Underwood
Good questions. Here is the QTime for rows=1000. Looks pretty reasonable. I’d 
blame the slowness on the VPN connection, but the median response time of 
10,000 msec is measured at the server.

The client is in Python, using wt=json. Average document size in JSON is 5132 
bytes. The system should not be IO bound, but I’ll check. The instances have 31 
GB of memory, shards are 40 GB on SSD. I don’t think I set up JVM monitoring on 
this cluster, so I can’t see if the GC is thrashing.

2020-02-11T08:51:33 INFO QTime=401
2020-02-11T08:51:34 INFO QTime=612
2020-02-11T08:51:34 INFO QTime=492
2020-02-11T08:51:35 INFO QTime=513
2020-02-11T08:51:36 INFO QTime=458
2020-02-11T08:51:36 INFO QTime=419
2020-02-11T08:51:46 INFO QTime=477
2020-02-11T08:51:47 INFO QTime=479
2020-02-11T08:51:47 INFO QTime=457
2020-02-11T08:51:50 INFO QTime=553
2020-02-11T08:51:50 INFO QTime=658
2020-02-11T08:51:52 INFO QTime=379

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 11, 2020, at 6:28 AM, Erick Erickson  wrote:
> 
> Wow, that’s pretty horrible performance. 
> 
> Yeah, I was conflating a couple of things here. Now it’s clear.
> 
> If you specify rows=1, what do you get in response time? I’m wondering if
> your time is spent just assembling the response rather than searching. You’d
> have to have massive docs for that to be the case, kind of a shot in the dark.
> The assembly step requires the docs be read off disk, decompressed and then
> transmitted, but 10 seconds is ridiculous for that. I’m starting to wonder 
> about
> being I/O bound either disk wise or network, but I’m pretty sure you’ve 
> already
> thought about that.
> 
> You are transmitting things around your servers given your statement that you
> are seeing the searches distributed, which is also a waste, but again I 
> wouldn’t
> expect it to be that bad.
> 
> Hmmm, quick thing to check: What are the QTime’s reported? Those are
> exclusive of assembling the return packet. If they were a few milliseconds and
> your response back at the client was 10s, that’d be a clue.
> 
> Best,
> Erick
> 
>> On Feb 11, 2020, at 2:13 AM, Walter Underwood  wrote:
>> 
>> sort=“id asc”
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 10, 2020, at 9:50 PM, Tim Casey  wrote:
>>> 
>>> Walter,
>>> 
>>> When you do the query, what is the sort of the results?
>>> 
>>> tim
>>> 
>>> On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood 
>>> wrote:
>>> 
 I’ll back up a bit, since it is sort of an X/Y problem.
 
 I have an index with four shards and 17 million documents. I want to dump
 all the docs in JSON, label each one with a classifier, then load them back
 in with the labels. This is a one-time (or rare) bootstrap of the
 classified data. This will unblock testing and relevance work while we get
 the classifier hooked into the indexing pipeline.
 
 Because I’m dumping all the fields, we can’t rely on docValues.
 
 It is OK if it takes a few hours.
 
 Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
 using cursorMark and distributed search. Median response time is 10 s. CPU
 usage is about 1%.
 
 It is all pretty grubby and it seems like there could be a better way.
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)
 
> On Feb 10, 2020, at 3:39 PM, Erick Erickson 
 wrote:
> 
> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
> 
> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re
 firing
> at individual replicas. Each replica should have multiple UUIDs that
 start with 0-f.
> 
> Unless I misunderstand and you’re just firing off, say, 16 threads at
 the entire
> collection rather than individual shards which would work too. But for
 individual
> shards I think you need to look for all possible IDs...
> 
> Erick
> 
>> On Feb 10, 2020, at 5:37 PM, Walter Underwood 
 wrote:
>> 
>> 
>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood 
 wrote:
>>> 
>>> Not sure if range queries work on a UUID field, ...
>> 
>> A search for id:0* took 260 ms, so it looks like they work just fine.
 I’ll try separate queries for 0-f.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 
 
 
>> 
> 



Support Tesseract in Apache Solr

2020-02-11 Thread Karan Jain
Hi All,

Solr 7.6.0 is running on my local machine. I have installed Tesseract through 
the following steps:

  yum install tesseract
  echo export PATH=$PATH:/usr/share/tesseract >>~/.bash_profile
  echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile

Now the deployed Solr supports Tesseract. I searched for TESSDATA_PREFIX
in https://github.com/apache/lucene-solr and found no reference there, so I
could not understand how Solr came to know about the installed Tesseract.
Please point me to the specific Java class in Solr if possible.

Thanks for your time,
Best,
Karan


Re: Support Tesseract in Apache Solr

2020-02-11 Thread Jörn Franke
Honestly, I would not run Tesseract on the same server as Solr. It takes a lot 
of resources and may negatively impact Solr. Just write a small program using 
Tika + Tesseract that runs on a different server / container and posts the 
results to Solr.

About your question: probably Tika (a dependency of Solr) figured it out, or, 
depending on your format, PDFBox (used by Tika).
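
A minimal sketch of that approach, assuming Tika (with Tesseract installed on the 
extraction host) and SolrJ are on the classpath; the file path, collection URL and 
field names are placeholders:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    import java.io.File;

    public class ExtractAndPost {
        public static void main(String[] args) throws Exception {
            File file = new File("/data/scan-0001.pdf");   // placeholder input
            // Tika does the extraction (and the OCR work) on this machine,
            // not inside the Solr JVM.
            Tika tika = new Tika();
            String text = tika.parseToString(file);

            try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://solr-host:8983/solr/docs").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.getName());
                doc.addField("content_txt", text);
                solr.add(doc);
                solr.commit();
            }
        }
    }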

> Am 11.02.2020 um 19:15 schrieb Karan Jain :
> 
> Hi All,
> 
> The Solr version 7.6.0 is running on my local machine. I have installed
> Tesseract through following steps:-
> yum install tesseract echo export PATH=$PATH:/usr/share/tesseract
>>> ~/.bash_profile
> echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile
> 
> Now the deployed Solr is supporting tesseract. I searched TESSDATA_PREFIX
> in https://github.com/apache/lucene-solr and found no reference there. I
> could not understand How Solr came to know about the deployed tesseract.
> Please tell the specific java class in Solr if possible.
> 
> Thanks for your time,
> Best,
> Karan


Re: Solr 8.2 replicas use only 1 CPU at 100% every solr.autoCommit.maxTime minutes

2020-02-11 Thread Vangelis Katsikaros
Hi

On Mon, Feb 10, 2020 at 5:05 PM Vangelis Katsikaros 
wrote:

> Hi all
>
> We run Solr 8.2.0
> * with Amazon Corretto 11.0.5.10.1 SDK (java arguments shown in [1]),
> * on Ubuntu 18.04
> * on AWS EC2 m5.2xlarge with 8 CPUs and 32GB of RAM
> * with -Xmx16g [1].
>
> We have migrated from Solr 3.5 and in big core (16GB) replicas we have
> started to suffer degraded service. The replica’s ReplicationHandler is in
> [8] and the master’s updateHandler in [9].
>
> We notice every 5 mins (the value for solr.autoCommit.maxTime) the
> following:
> * Solr uses all 8 CPUs. Suddenly for ~30 sec, it uses only 1 CPU at 100%
> and the rest of the CPUs are idle (mpstat [6]). In our previous setup with
> Solr 3 we used up to 80% of all CPUs.
> * During that time the solr queries suddenly take more than 1 second, up
> to 30 sec (or more). The same queries otherwise need less than 1 sec to
> complete.
> * The disk does not seem to be a bottleneck (iostat [4]).
> * Memory does not seem to be a bottleneck (vmstat [5]).
> * CPU (apart from the single CPU issue) does not seem to be a bottleneck
> (mpstat [6] & pidstat [3]).
> * We are no java/GC experts but It does not seem to be GC related [7].
>
> We have tried reducing the heap to 8 and 2GB with no positive effect. We
> have tested different autoCommit.maxTime values. Reducing it to 60 seconds
> makes things unbearable. 5 minutes is not significantly different than 10.
> Do you have any pointers to proceed debugging the issue?
>
> Detailed example problem that repeats every solr.autoCommit.maxTime
> minutes on the replicas:
> * From 12:36 to 12:39:04 queries are fast to serve [2]. Solr consumes CPU
> from all 8 CPUs (mpstat [6]). The metric solr.jvm.threads.blocked.count is
> 0 [2].
> * From 12:39:04 to 12:39:25 queries are slow to respond [2]. Solr consumes
> only 1 out of 8 CPUs, the other 7 CPUs are idle (mpstat [6]). The metric
> solr.jvm.threads.blocked.count grows from 0 to a big 2 digit number [2].
> * After 12:39:25 and until the next poll of a commit things are normal.
>
> Regards
> Vangelis
>
> [1]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-solr-info
> [2]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-slow-queries-and-solr-jvm-threads-blocked-count
> [3]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-pidstat
> [4]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-iostat
> [5]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-vmstat
> [6]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-mpstat
> [7]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-gc-logs
> [8]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-replica-replicationhandler
> [9]
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-master-updatehandler
>

Some additional information. We noticed (through the admin UI's "Thread Dump"
page, /solr/#/~threads) that whenever we see this behavior, all the blocked
threads show the same stack trace [10] and block at

org.apache.solr.search.function.FileFloatSource$Cache.get(FileFloatSource.java:198)
org.apache.solr.search.function.FileFloatSource.getCachedFloats(FileFloatSource.java:152)
org.apache.solr.search.function.FileFloatSource.getValues(FileFloatSource.java:95)
org.apache.lucene.queries.function.valuesource.MultiFloatFunction.getValues(MultiFloatFunction.java:76)
org.apache.lucene.queries.function.ValueSource$WrappedDoubleValuesSource.getValues(ValueSource.java:203)
org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource.getValues(FunctionScoreQuery.java:255)
org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight.scorer(FunctionScoreQuery.java:218)
...

The boost files (external_boostvalue) are ~30M in size, and the corresponding
fields are configured in the schema as shown in [11].

Regards
Vangelis

[10]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-stacktrace
[11]
https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-schema-boostfile
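
The stack trace points at FileFloatSource, which is what backs external file fields, so 
for orientation here is a rough sketch of how such a field and the searcher-event reloader 
are typically wired; the names and values below are placeholders, not the actual schema 
from [11]:

  <!-- schema: an external file field (keyField/defVal are placeholder values) -->
  <fieldType name="externalBoost" class="solr.ExternalFileField" keyField="id" defVal="0"/>
  <field name="external_boostvalue" type="externalBoost" indexed="false" stored="false"/>

  <!-- solrconfig: reload the external file values when a searcher is opened, so the
       first queries after a commit do not block on FileFloatSource reloading them -->
  <listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
  <listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>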


Re: cursorMark and shards? (6.6.2)

2020-02-11 Thread Erick Erickson
Curiouser and curiouser. So two possibilities are just the time it takes to 
assemble the packet and the time it takes to send it back. Three more 
experiments then.

1> change the returned doc to return a single docValues=true field. My claim: 
The response will be very close to the 400-600 ms range back at the client. 
This is kind of a sanity check, it should collect the field to return without 
visiting the stored data/decompressing it etc.

2> change the returned doc to return a single docValues=false stored=true 
field. That’ll exercise the whole fetch-from-disk-and-decompress cycle because 
all the stored values for a doc need to be decompressed if you access even 
one. If that comes back in, say, < 1 second then the speed issues are either GC 
thrashing or your network would be my guess. If it’s in the same 10s range, 
then I’d be looking at GC and the like.

3> change the returned rows to, say, 100 while still returning all fields. If you see a 
pretty linear relationship between the number of docs and the response time 
then at least we know where to dig.

But the 1% CPU utilization makes me suspect transmission. If it were 
read-from-disk-and-decompress I’d expect more CPU due to the decompression 
phase unless you have a very slow disk.

So apparently CursorMark was a red herring?

it is _vaguely_ possible that bumping the documentCache higher might help, but 
frankly I doubt it. That would help if you had a situation where you were 
having to re-access the stored data from disk for several different components 
in a single query, but I don’t think that pertains. But then I’m surprised 
you’re seeing this at all so what I think, as an author I read had a character 
say, “ain’t worth a goldfish fart” in the face of evidence.

Best,
Erick
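
As concrete request shapes for those three experiments (the field names are placeholders):

  # 1> single docValues=true field
  .../select?q=id:0*&rows=1000&fl=some_docvalues_field&sort=id+asc&cursorMark=*

  # 2> single stored=true, docValues=false field (forces the fetch-and-decompress path)
  .../select?q=id:0*&rows=1000&fl=some_stored_field&sort=id+asc&cursorMark=*

  # 3> vary rows to see whether response time scales roughly linearly with it
  .../select?q=id:0*&rows=100&fl=*&sort=id+asc&cursorMark=*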

> On Feb 11, 2020, at 12:33 PM, Walter Underwood  wrote:
> 
> Good questions. Here is the QTime for rows=1000. Looks pretty reasonable. I’d 
> blame the slowness on the VPN connection, but the median response time of 
> 10,000 msec is measured at the server.
> 
> The client is in Python, using wt=json. Average document size in JSON is 5132 
> bytes. The system should not be IO bound, but I’ll check. The instances have 
> 31 GB of memory, shards are 40 GB on SSD. I don’t think I set up JVM 
> monitoring on this cluster, so I can’t see if the GC is thrashing.
> 
> 2020-02-11T08:51:33 INFO QTime=401
> 2020-02-11T08:51:34 INFO QTime=612
> 2020-02-11T08:51:34 INFO QTime=492
> 2020-02-11T08:51:35 INFO QTime=513
> 2020-02-11T08:51:36 INFO QTime=458
> 2020-02-11T08:51:36 INFO QTime=419
> 2020-02-11T08:51:46 INFO QTime=477
> 2020-02-11T08:51:47 INFO QTime=479
> 2020-02-11T08:51:47 INFO QTime=457
> 2020-02-11T08:51:50 INFO QTime=553
> 2020-02-11T08:51:50 INFO QTime=658
> 2020-02-11T08:51:52 INFO QTime=379
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Feb 11, 2020, at 6:28 AM, Erick Erickson  wrote:
>> 
>> Wow, that’s pretty horrible performance. 
>> 
>> Yeah, I was conflating a couple of things here. Now it’s clear.
>> 
>> If you specify rows=1, what do you get in response time? I’m wondering if
>> your time is spent just assembling the response rather than searching. You’d
>> have to have massive docs for that to be the case, kind of a shot in the 
>> dark.
>> The assembly step requires the docs be read off disk, decompressed and then
>> transmitted, but 10 seconds is ridiculous for that. I’m starting to wonder 
>> about
>> being I/O bound either disk wise or network, but I’m pretty sure you’ve 
>> already
>> thought about that.
>> 
>> You are transmitting things around your servers given your statement that you
>> are seeing the searches distributed, which is also a waste, but again I 
>> wouldn’t
>> expect it to be that bad.
>> 
>> Hmmm, quick thing to check: What are the QTime’s reported? Those are
>> exclusive of assembling the return packet. If they were a few milliseconds 
>> and
>> your response back at the client was 10s, that’d be a clue.
>> 
>> Best,
>> Erick
>> 
>>> On Feb 11, 2020, at 2:13 AM, Walter Underwood  wrote:
>>> 
>>> sort=“id asc”
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
 On Feb 10, 2020, at 9:50 PM, Tim Casey  wrote:
 
 Walter,
 
 When you do the query, what is the sort of the results?
 
 tim
 
 On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood 
 wrote:
 
> I’ll back up a bit, since it is sort of an X/Y problem.
> 
> I have an index with four shards and 17 million documents. I want to dump
> all the docs in JSON, label each one with a classifier, then load them 
> back
> in with the labels. This is a one-time (or rare) bootstrap of the
> classified data. This will unblock testing and relevance work while we get
> the classifier hooked into the indexing pipeline.
> 
> Because I’m dumping all the fields, we can’t rely on docValues.
> 
>

Re: Storage/Volume type for Kubernetes Solr POD?

2020-02-11 Thread Karl Stoney
Yes, we scale with pd-ssd or local-ssd just fine.

From: Susheel Kumar 
Sent: 11 February 2020 17:15
To: solr-user@lucene.apache.org 
Subject: Re: Storage/Volume type for Kubernetes Solr POD?

Thanks, Karl for sharing.  With local SSD's you be able to auto scale. Is
that correct?

On Fri, Feb 7, 2020 at 5:22 AM Nicolas PARIS 
wrote:

> hi all
>
> what about cephfs or lustre distrubuted filesystem for such purpose ?
>
>
> Karl Stoney  writes:
>
> > we personally run solr on google cloud kubernetes engine and each node
> has a 512Gb persistent ssd (network attached) storage which gives roughly
> this performance (read/write):
> >
> > Sustained random IOPS limit (read / write): 15,360.00 / 15,360.00
> > Sustained throughput limit in MB/s (read / write): 245.76 / 245.76
> >
> > and we get very good performance.
> >
> > ultimately though it's going to depend on your workload
> > 
> > From: Susheel Kumar 
> > Sent: 06 February 2020 13:43
> > To: solr-user@lucene.apache.org 
> > Subject: Storage/Volume type for Kubernetes Solr POD?
> >
> > Hello,
> >
> > Whats type of storage/volume is recommended to run Solr on Kubernetes
> POD?
> > I know in the past Solr has issues with NFS storing its indexes and was
> not
> > recommended.
> >
> >
> https://kubernetes.io/docs/concepts/storage/volumes/
> >
> > Thanks,
> > Susheel
>
>
> --
> nicolas paris
>


Re: Support Tesseract in Apache Solr

2020-02-11 Thread Edward Ribeiro
I second Jorn: don't deploy Tesseract + Tika on the same server as Solr.
Tesseract, specially with OCR enabled, will drain your machine resources
that could be used to indexing/searching. In addition to that, any
malformed PDF could potentially shutdown the Solr server. Best bet would be
to use tika-server + tesseract on a dedicated server/container and then use
it to extract the text/ocr from the documents and then send it to Solr.

But answering your question: Solr embeds Tika that can be configured to use
Tesseract. It's Tika that knows about Tesseract. See here:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR for more
information.

Best regards,
Edward

On Tue, Feb 11, 2020 at 3:26 PM Jörn Franke  wrote:

> Honestly i would not run tesseract on the same server as Solr. It takes a
> lot of resources and may negatively impact Solr. Just write a small program
> using Tika+Tesseract that runs on a different server / container and posts
> the results to Solr.
>
> About your question: Probably Tika (a dependency of Solr) figured it out
> or depending on your format Pdfbox (used by Tika).
>
> > Am 11.02.2020 um 19:15 schrieb Karan Jain :
> >
> > Hi All,
> >
> > The Solr version 7.6.0 is running on my local machine. I have installed
> > Tesseract through following steps:-
> > yum install tesseract echo export PATH=$PATH:/usr/share/tesseract
> >>> ~/.bash_profile
> > echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile
> >
> > Now the deployed Solr is supporting tesseract. I searched TESSDATA_PREFIX
> > in https://github.com/apache/lucene-solr and found no reference there. I
> > could not understand How Solr came to know about the deployed tesseract.
> > Please tell the specific java class in Solr if possible.
> >
> > Thanks for your time,
> > Best,
> > Karan
>


per-field count of documents matched?

2020-02-11 Thread Fischer, Stephen
Hi wise Solr experts,

For our scientific use case we want to show users a per-field count of the 
documents that match in that field.

We would like to do this efficiently, because we might return up to a million 
documents.

For example, if we had documents describing People, and a query of, say, 
"Stone" we might want to show

Fields matched:
  Last name:  145
  Street: 431
  Favorite rock band:  13
  Home exterior: 2340

Is there an efficient way to do this?

So far, we're trying to leverage highlighting, but it seems very slow.

Thanks


Re: Solr 8.2 replicas use only 1 CPU at 100% every solr.autoCommit.maxTime minutes

2020-02-11 Thread Edward Ribeiro
Is your autoCommit configured to open new searchers? Did you try to set
openSearcher to false?
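
For reference, a minimal solrconfig.xml sketch of that change (the maxTime
value here is just the 5-minute setting from your description):

    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

With openSearcher set to false the hard commit only flushes segments to disk,
and searcher visibility is then typically handled by autoSoftCommit instead.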

Edward

On Tue, Feb 11, 2020 at 3:40 PM Vangelis Katsikaros 
wrote:

> Hi
>
> On Mon, Feb 10, 2020 at 5:05 PM Vangelis Katsikaros  >
> wrote:
>
> > Hi all
> >
> > We run Solr 8.2.0
> > * with Amazon Corretto 11.0.5.10.1 SDK (java arguments shown in [1]),
> > * on Ubuntu 18.04
> > * on AWS EC2 m5.2xlarge with 8 CPUs and 32GB of RAM
> > * with -Xmx16g [1].
> >
> > We have migrated from Solr 3.5, and on replicas of a big core (16GB) we have
> > started to suffer degraded service. The replica’s ReplicationHandler is in
> > [8] and the master’s updateHandler in [9].
> >
> > We notice the following every 5 minutes (the value of solr.autoCommit.maxTime):
> > * Solr uses all 8 CPUs. Then, suddenly, for ~30 sec it uses only 1 CPU at 100%
> > and the rest of the CPUs are idle (mpstat [6]). In our previous setup with
> > Solr 3 we used up to 80% of all CPUs.
> > * During that time the Solr queries suddenly take more than 1 second, up
> > to 30 sec (or more). The same queries otherwise need less than 1 sec to
> > complete.
> > * The disk does not seem to be a bottleneck (iostat [4]).
> > * Memory does not seem to be a bottleneck (vmstat [5]).
> > * CPU (apart from the single-CPU issue) does not seem to be a bottleneck
> > (mpstat [6] & pidstat [3]).
> > * We are no Java/GC experts, but it does not seem to be GC-related [7].
> >
> > We have tried reducing the heap to 8 and to 2GB with no positive effect. We
> > have tested different autoCommit.maxTime values. Reducing it to 60 seconds
> > makes things unbearable. 5 minutes is not significantly different from 10.
> > Do you have any pointers for debugging the issue further?
> >
> > Detailed example problem that repeats every solr.autoCommit.maxTime
> > minutes on the replicas:
> > * From 12:36 to 12:39:04 queries are fast to serve [2]. Solr consumes CPU
> > from all 8 CPUs (mpstat [6]). The metric solr.jvm.threads.blocked.count is
> > 0 [2].
> > * From 12:39:04 to 12:39:25 queries are slow to respond [2]. Solr consumes
> > only 1 out of 8 CPUs, the other 7 CPUs are idle (mpstat [6]). The metric
> > solr.jvm.threads.blocked.count grows from 0 to a big two-digit number [2].
> > * After 12:39:25, and until the next poll of a commit, things are normal.
> >
> > Regards
> > Vangelis
> >
> > [1]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-solr-info
> > [2]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-slow-queries-and-solr-jvm-threads-blocked-count
> > [3]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-pidstat
> > [4]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-iostat
> > [5]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-vmstat
> > [6]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-mpstat
> > [7]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-gc-logs
> > [8]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-replica-replicationhandler
> > [9]
> >
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-master-updatehandler
> >
>
> Some additional information: we noticed (through the admin UI's "Thread Dump"
> page, /solr/#/~threads) that whenever we see this behavior, all the blocked
> threads show the same stack trace [10] and block at
>
>
> org.apache.solr.search.function.FileFloatSource$Cache.get(FileFloatSource.java:198)
>
> org.apache.solr.search.function.FileFloatSource.getCachedFloats(FileFloatSource.java:152)
>
> org.apache.solr.search.function.FileFloatSource.getValues(FileFloatSource.java:95)
>
> org.apache.lucene.queries.function.valuesource.MultiFloatFunction.getValues(MultiFloatFunction.java:76)
>
> org.apache.lucene.queries.function.ValueSource$WrappedDoubleValuesSource.getValues(ValueSource.java:203)
>
> org.apache.lucene.queries.function.FunctionScoreQuery$MultiplicativeBoostValuesSource.getValues(FunctionScoreQuery.java:255)
>
> org.apache.lucene.queries.function.FunctionScoreQuery$FunctionScoreWeight.scorer(FunctionScoreQuery.java:218)
> ...
>
> The boost files (external_boostvalue) are ~30M in size, and the fields are
> configured in the schema [11] with:
>   
>
> Regards
> Vangelis
>
> [10]
>
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-stacktrace
> [11]
>
> https://gist.github.com/vkatsikaros/5102e8088a98ad1ee49516aafa6bc5c4#file-schema-boostfile
>


Re: per-field count of documents matched?

2020-02-11 Thread Erick Erickson
Hmmm, you could do a facet query (or a series of them):
facet.query=LastName:stone&facet.query=Street:stone, etc. That’d automatically
only tally for the docs that match.
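
A quick sketch of that request, assuming a core named "people" and the field
names from your example (all of these are placeholders):

    import requests

    params = {
        "q": "stone",
        "rows": 0,                   # we only want the counts, not the documents
        "facet": "true",
        "facet.query": [             # repeated parameter, one entry per field of interest
            "LastName:stone",
            "Street:stone",
            "FavoriteRockBand:stone",
            "HomeExterior:stone",
        ],
    }
    resp = requests.get("http://localhost:8983/solr/people/select", params=params)
    print(resp.json()["facet_counts"]["facet_queries"])
    # e.g. {"LastName:stone": 145, "Street:stone": 431, ...}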

You could also consider a custom search component. For the exact case you 
describe, it’s actually fairly simple. The postings list has, for each term, 
the list of docs that contain it (internal Lucene doc ID). So I might have 
for field LastName:
stone -> 1,73,100…

for field Street:
stone-> 264,933…

So for each field’s term, it’s simply a matter of going down its list of docs
and adding up the ones that the overall query also matches.

However… I’m not sure you’d get what you want in either case. Consider a query 
(A AND B) OR (C AND D). And let’s say doc1 contains A in LastName, and C,D in 
Street. Should A be counted in the LastName tally for this doc?

I suppose you could put the full query in the facet.query above. I’m still not 
sure it’s what you need, since I’m not sure what "per-field count of documents 
that match” means in your application…

Best,
Erick

> On Feb 11, 2020, at 6:15 PM, Fischer, Stephen 
>  wrote:
> 
> Hi wise Solr experts,
> 
> For our scientific use-case we want to show users a per-field count of 
> documents that match that field.
> 
> We like to do this efficiently because we might return up to a million 
> documents.
> 
> For example, if we had documents describing People, and a query of, say, 
> "Stone" we might want to show
> 
> Fields matched:
>  Last name:  145
>  Street: 431
>  Favorite rock band:  13
>  Home exterior: 2340
> 
> Is there an efficient way to do this?
> 
> So far, we're trying to leverage highlighting.   But it seems very slow.
> 
> Thanks



RE: [External] Re: per-field count of documents matched?

2020-02-11 Thread Fischer, Stephen
Thanks very much! By the way, we are using eDisMax, and the queries our UI
supports don't include fancy Booleans, so your ideas just might work.

Thanks again,
Steve

-Original Message-
From: Erick Erickson  
Sent: Tuesday, February 11, 2020 7:16 PM
To: solr-user@lucene.apache.org
Subject: [External] Re: per-field count of documents matched?

Hmmm, you could do a facet query (or a series of them):
facet.query=LastName:stone&facet.query=Street:stone, etc. That’d automatically
only tally for the docs that match.

You could also consider a custom search component. For the exact case you 
describe, it’s actually fairly simple. The postings list has, for each term, 
the list of docs that contain it (internal Lucene doc ID). So I might have for 
field LastName:
stone -> 1,73,100…

for field Street:
stone-> 264,933…

So for each field’s term, it’s simply a matter of going down its list of docs
and adding up the ones that the overall query also matches.

However… I’m not sure you’d get what you want in either case. Consider a query 
(A AND B) OR (C AND D). And let’s say doc1 contains A in LastName, and C,D in 
Street. Should A be counted in the LastName tally for this doc?

I suppose you could put the full query in the facet.query above. I’m still not 
sure it’s what you need, since I’m not sure what "per-field count of documents 
that match” means in your application…

Best,
Erick

> On Feb 11, 2020, at 6:15 PM, Fischer, Stephen 
>  wrote:
> 
> Hi wise Solr experts,
> 
> For our scientific use-case we want to show users a per-field count of 
> documents that match that field.
> 
> We like to do this efficiently because we might return up to a million 
> documents.
> 
> For example, if we had documents describing People, and a query of, 
> say, "Stone" we might want to show
> 
> Fields matched:
>  Last name:  145
>  Street: 431
>  Favorite rock band:  13
>  Home exterior: 2340
> 
> Is there an efficient way to do this?
> 
> So far, we're trying to leverage highlighting.   But it seems very slow.
> 
> Thanks



wildcards match end-of-word?

2020-02-11 Thread Fischer, Stephen
Hi,

I am a Solr newbie. I was surprised to discover that a search for kinase*
returned fewer results than kinase.

Then I read the wildcard documentation and saw why: kinase* will not match the
word "kinase".

Our end-users won't expect this behavior.  Presumably the solution would be for 
them (actually us, on their behalf), to use kinase* OR kinase.

But that is kind of a hack.

Is there a way we can configure solr to have wildcards match on end-of-word?

Thanks,
Steve


Re: wildcards match end-of-word?

2020-02-11 Thread Walter Underwood
“kinase*” does match “kinase”. On the page you linked to, it defines “*” as 
matching "Multiple characters (matches zero or more sequential characters)”.

If it is not matching, you may be using a stemmer on that field or doing some 
other processing that changes the tokens.
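
For example, a Porter-style stemmer would most likely index "kinase" as
"kinas"; wildcard terms such as kinase* are not stemmed, so the query looks
for terms starting with the literal "kinase" and finds none. A quick way to
check what actually gets indexed is the field analysis handler (the core name
"mycore" and field type "text_en" below are placeholders for your own setup):

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/mycore/analysis/field",   # placeholder core name
        params={
            "analysis.fieldtype": "text_en",                  # placeholder field type
            "analysis.fieldvalue": "kinase",
        },
    )
    print(resp.json())  # the last index-time analysis stage shows the token(s) actually stored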

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 11, 2020, at 6:24 PM, Fischer, Stephen 
>  wrote:
> 
> Hi,
> 
> I am a Solr newbie. I was surprised to discover that a search for kinase*
> returned fewer results than kinase.
> 
> Then I read the wildcard documentation and saw why: kinase* will not match
> the word "kinase".
> 
> Our end-users won't expect this behavior.  Presumably the solution would be 
> for them (actually us, on their behalf), to use kinase* OR kinase.
> 
> But that is kind of a hack.
> 
> Is there a way we can configure solr to have wildcards match on end-of-word?
> 
> Thanks,
> Steve