filterCache ramBytesUsed monitoring statistics go negative

2020-11-02 Thread Dawn


Hi,

The filterCache ramBytesUsed monitoring statistic goes negative.

Does the negative value have a special meaning, or is it a statistics bug?

Also, when presenting the metric list, could it be sorted by key? Solr 7
displays it that way, which makes it easy to view.


For example:

CACHE.searcher.filterCache.hits: 63265
CACHE.searcher.filterCache.cumulative_evictions: 1981
CACHE.searcher.filterCache.size: 6765
CACHE.searcher.filterCache.maxRamMB: 10240
CACHE.searcher.filterCache.hitratio: 0.8329712577846243
CACHE.searcher.filterCache.warmupTime: 49227
CACHE.searcher.filterCache.evictions: 1981
CACHE.searcher.filterCache.cumulative_hitratio: 0.737519464195261
CACHE.searcher.filterCache.lookups: 75951
CACHE.searcher.filterCache.cumulative_hits: 78624
CACHE.searcher.filterCache.cumulative_inserts: 15927
CACHE.searcher.filterCache.ramBytesUsed: -1418740612
CACHE.searcher.filterCache.inserts: 10510
CACHE.searcher.filterCache.cumulative_lookups: 106606





RE: SOLR uses too much CPU and GC is also weird on Windows server

2020-11-02 Thread Jaan Arjasepp
Thanks to everyone for helping to think about this. We eventually found that 
our code was deleting/adding records one at a time. After the updates were 
batched up, everything went back to normal. The funny thing is that 6.0.0 
handled these requests somehow, but the newer version did not.
Anyway, we will observe this and try to improve our code as well.
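The fix boils down to buffering records and flushing them to the server in groups, instead of issuing one update request per record. A minimal, generic sketch of that batching pattern (illustrative code only, not our actual indexing code; the class and names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Generic batcher: buffers items and hands them to a sink in groups,
// replacing the one-request-per-record pattern that caused the load.
public class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    public Batcher(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) flush();
    }

    // Call once more at the end to send any remaining items.
    public void flush() {
        if (buffer.isEmpty()) return;
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

With the sink wrapping a single bulk add call to the search server, 10,000 single-document requests collapse into 10 requests of 1,000 documents each.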

Best regards,
Jaan

-Original Message-
From: Erick Erickson  
Sent: 28 October 2020 17:18
To: solr-user@lucene.apache.org
Subject: Re: SOLR uses too much CPU and GC is also weird on Windows server

docValues=true is usually used only for “primitive” types (strings, numerics, 
booleans and the like), specifically _not_ text-based fields.

I say “usually” because there’s a special “SortableTextField” where it does 
make some sense to have a text-based field have docValues, but that’s intended 
for relatively short fields. For example you want to sort on a title field. And 
probably not something you’re working with.

There’s not much we can say from this distance I’m afraid. I think I’d focus on 
the memory requirements, maybe take a heap dump and see what’s using memory.

Did you restart Solr _after_ turning off indexing? I ask because that would 
help determine which side the problem is on, indexing or querying. It does 
sound like querying though.

As for docValues in general, if you want to be really brave, you can set 
uninvertible=false for all your fields where docValues=false. When you facet on 
such a field, you won’t get anything back. If you sort on such a field, you’ll 
get an error message back. That should test if somehow not having docValues is 
the root of your problem. Do this on a test system of course ;) I think this is 
a low-probability issue, but it’s a mystery anyway so...
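For example, the combination might look like this in the schema (an illustrative field definition; the field name is made up):

```xml
<!-- Neither docValues nor uninversion: faceting on this field returns
     nothing and sorting on it errors out, which exposes any hidden
     reliance on docValues. -->
<field name="my_string_field" type="string" indexed="true" stored="true"
       docValues="false" uninvertible="false"/>
```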

Updating shouldn’t be that much of a problem either, and if you still see high 
CPU with indexing turned off, that eliminates indexing as a candidate.

Is there any chance you changed your schema at all and didn’t delete your 
entire index and add all your documents back? There are a lot of ways things 
can go wrong if that’s the case. You had to reindex from scratch when you went 
to 8x from 6x, I’m wondering if during that process the schema changed without 
starting over. I’m grasping at straws here…

I’d also seriously consider going to 8.6.3. We only make point releases when 
there’s something serious. Looking through lucene/CHANGES.txt, there is one 
memory leak fix in 8.6.2. I’d expect a gradual buildup of heap if that were 
what you’re seeing, but you never know.

As for having docValues=false, that would cut down on the size of the index on 
disk and speed up indexing some, but in terms of memory usage or CPU usage when 
querying, unless the docValues structures are _needed_, they’re never read into 
OS RAM by MMapDirectory… The question really is whether you ever, intentionally 
or not, do “something” that would be more efficient with docValues. That’s 
where setting uninvertible=false whenever you set docValues=false makes sense, 
things will show up if your assumption that you don’t need docValues is false.

Best,
Erick


> On Oct 28, 2020, at 9:29 AM, Jaan Arjasepp  wrote:
> 
> Hi all,
> 
> It's me again. I did a little research, we tried different things, and I have 
> some questions to ask and some findings to share.
> 
> After monitoring my system with VisualVM, I found that heap usage jumps 
> between 0.5GB and 2.5GB under GC, and the JVM now has 4GB of memory, so it 
> should not be an issue anymore, or should it? I will keep observing it, as I 
> guess it might still rise a bit.
> 
> The next thing we found, or are thinking about, is that writing to disk might 
> be an issue. We turned off indexing and some other things, but I would say it 
> still did not help much.
> I also went through all the schema fields; there are not that many, really. 
> They all have docValues=true. They are all automatically generated, so no 
> manual work there, except one field, which also has docValues=true. Just 
> curious: if a field is not a string/text field, can it have docValues=false, 
> or is it still better to keep it true? As for uninversion, we are not using 
> facets much, nor other specific query features, just simple queries.
> 
> We do update documents quite a lot, but I am not sure that explains CPU usage 
> being so high. The older version did not seem to use the CPU as much...
> 
> I am running a bit low on ideas and hoping this will continue to work, but I 
> don't like the CPU usage, even overnight when nobody uses it. We will try to 
> figure out the issue, and I hope I can ask more questions when in doubt or 
> out of ideas. I must also admit that Solr is really new to me personally.
> 
> Jaan
> 
> -Original Message-
> From: Walter Underwood 
> Sent: 27 October 2020 18:44
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR uses too much CPU and GC is also weird on Windows 
> server
> 
> That first graph shows a JVM that do

Re: SOLR uses too much CPU and GC is also weird on Windows server

2020-11-02 Thread Erick Erickson
What this sounds like is that somehow you were committing after every update in 
8x but not in your 6x code. How that came to change is anybody’s guess ;).

It’s vaguely possible that your client is committing and you had 
IgnoreCommitOptimizeUpdateProcessorFactory defined in your update chain in 6x 
but not 8x.

The other thing would be if your commit interval was much shorter in 8x than 6x 
or if your autowarm parameters were significantly different.

That said, this is still a mystery, glad you found an answer.

Thanks for getting back to us on this, this is useful information to have.

Best,
Erick


[Free Online Meetups] London Information Retrieval Meetup

2020-11-02 Thread Alessandro Benedetti
Hi all,
The London Information Retrieval Meetup has moved online:

https://www.meetup.com/London-Information-Retrieval-Meetup-Group

It is a free evening meetup aimed at Information Retrieval enthusiasts and
professionals who are curious to explore and discuss the latest trends in
the field.

It is technology agnostic, but you'll find many talks on Apache Solr and
related technologies.

Tomorrow (03.11 at 6:10 pm UK time) we will host the sixth London
Information Retrieval Meetup (fully remote).
We will have two talks:
*Talk 1*
"Feature Extraction for Large-Scale Text Collections"
from Luke Gallagher, PhD candidate, RMIT University
*Talk 2*
"A Learning to Rank Project on a Daily Song Ranking Problem"
from Ilaria Petreti (IR/ML Engineer, Sease) and Anna Ruggero (R&D Software
Engineer, Sease)

If you fancy some Search Stories, feel free to register here:
https://www.meetup.com/London-Information-Retrieval-Meetup-Group/events/273905485/

Cheers, and have a nice evening!
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


Re: Java Streaming API - nested Hashjoins with zk and accesstoken

2020-11-02 Thread sambasivarao giddaluri
Hi All,
Any advice on this?

Thanks
sam

On Sun, Nov 1, 2020 at 11:05 PM Anamika Solr 
wrote:

> Hi All,
>
> I need to combine 3 different document sets using hashJoin. I am using the
> query below (ignore the placeholder queries):
>
>
> hashJoin(
>   hashJoin(
>     search(collectionName, q="*:*", fl="id", qt="/export", sort="id desc"),
>     hashed=select(search(collectionName, q="*:*", fl="id", qt="/export", sort="id asc")),
>     on="id"),
>   hashed=select(search(collectionName, q="*:*", fl="id", qt="/export", sort="id asc")),
>   on="id")
>
> This works as a simple TupleStream in Java, but I also need to pass an auth
> token to ZooKeeper, so I have to use the code below:
>  ZkClientClusterStateProvider zkCluster =
>      new ZkClientClusterStateProvider(zkHosts, null);
>  SolrZkClient zkServer = zkCluster.getZkStateReader().getZkClient();
>  StreamFactory streamFactory = new StreamFactory()
>      .withCollectionZkHost("collectionName", zkServer.getZkServerAddress())
>      .withFunctionName("search", CloudSolrStream.class)
>      .withFunctionName("hashJoin", HashJoinStream.class)
>      .withFunctionName("select", SelectStream.class);
> 
>  try (HashJoinStream hashJoinStream =
>      (HashJoinStream) streamFactory.constructStream(expr)) {
>    // read tuples here
>  }
>
> Issue is one hashjoin with nested select and search works fine with this
> api. But the multiple hashjoin is not completing the task. I can see
> expression is correctly parsed, but its waiting indefinitely to complete
> the thread.
>
> Any help is appreciated.
>
> Thanks,
> Anamika
>


Re: filterCache ramBytesUsed monitoring statistics go negative

2020-11-02 Thread Shawn Heisey

On 11/2/2020 4:27 AM, Dawn wrote:

filterCache ramBytesUsed monitoring statistics go negative.
Is there a special meaning, or is it a statistics bug?
When presenting the metric list, could it be sorted by key? Solr 7 displays
it that way, which makes it easy to view.


When problems like this surface, it's usually because the code uses an 
"int" variable somewhere instead of a "long".  All numeric variables in 
Java are signed, and an "int" can only go up to a little over 2 billion 
before the numbers start going negative.
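A self-contained illustration of that wrap-around (not Solr code):

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // A running byte total kept in an int wraps negative once it
        // passes Integer.MAX_VALUE (2,147,483,647).
        int intTotal = Integer.MAX_VALUE;
        intTotal += 1024;
        System.out.println(intTotal);    // prints -2147482625

        // The same running total kept in a long stays positive.
        long longTotal = Integer.MAX_VALUE;
        longTotal += 1024;
        System.out.println(longTotal);   // prints 2147484671
    }
}
```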


The master code branch looks like it's fine.  What is the exact version 
of Solr you're using?  With that information, I can check the relevant code.


Maybe simply upgrading to a much newer version would take care of this 
for you.


Thanks,
Shawn


Understand on intermittent solr replica going to GONE state

2020-11-02 Thread yaswanth kumar
Solr version: 8.2; ZooKeeper: 3.4

I am progressively adding collections, each with 3 replicas, and all of a
sudden the load averages on the Solr nodes spiked and the Java process's
memory usage went to 65%. Some replicas went into the "GONE" state (as shown
in the Solr Cloud UI) and stayed that way until I restarted the Solr service.

I need some guidance on where to start in finding the root cause of this
little outage.

Data points:

At the time of the outage, there were 3 instances of the copy tool running,
pulling data from the old Solr (5) and indexing it into the new Solr (8.2).
We stopped them as soon as we saw the outage, since we were not sure whether
they were causing the issue.

We have around 12 Solr nodes mapped to different collections; each node has
8 CPU cores and 64GB of RAM (40GB allocated to the JVM heap).

Based on the alerts, we observed very high load averages on a few Solr
nodes, such as 36.5, 26.7, and 20. I am not sure how concerning this is; I
used to see much lower numbers, like 3 or 4, but now it is in the double
digits.

We also observed the errors below in the Solr logs during that time:

o.a.s.s.HttpSolrCall Unable to write response, client closed connection or
we are shutting down
org.eclipse.jetty.io.EofException: Closed
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:620)
at
org.apache.commons.io.output.ProxyOutputStream.write(ProxyOutputStream.java:55)
at
org.apache.solr.response.QueryResponseWriterUtil$1.write(QueryResponseWriterUtil.java:54)
at java.io.OutputStream.write(OutputStream.java:116)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
at org.apache.solr.util.FastWriter.flush(FastWriter.java:140)
at org.apache.solr.util.FastWriter.flushBuffer(FastWriter.java:154)
at
org.apache.solr.response.TextResponseWriter.close(TextResponseWriter.java:93)
at
org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:73)
at
org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:65)
at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:809)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:538)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)

-- 
Thanks & Regards,
Yaswanth Kumar Konathala.
yaswanth...@gmail.com


SolrIndexSearcher RankQuery Score calculation

2020-11-02 Thread Dawn
Hi,

SolrIndexSearcher.getDocListNC and getDocListAndSetNC code snippet:

if (cmd.getSort() != null && query instanceof RankQuery == false
    && (cmd.getFlags() & GET_SCORES) != 0) {
  TopFieldCollector.populateScores(topDocs.scoreDocs, this, query);
}


When the query includes a filterQuery, `QueryUtils.combineQueryAndFilter` builds 
a new BooleanQuery and assigns it to the `query` variable, so `query instanceof 
RankQuery` is false. This causes the score to be lost in the RankQuery phase.

Could this check be changed to test whether the original query is a RankQuery, 
i.e. `cmd.getQuery() instanceof RankQuery`?


Affected versions: 8.6.*, 9.*

Search issue in the SOLR for few words

2020-11-02 Thread Viresh Sasalawad
Hi Sir/Madam,

I am facing an issue with a few keyword searches (such as "gazing" and "one")
in Solr. Can you please help me understand why these words are not listed in
the Solr results?

The indexing was done properly.


-- 
Thanks and Regards
Veeresh Sasalawad