Erik,
Looks like we're also running into this issue.
https://www.mail-archive.com/solr-user@lucene.apache.org/msg153798.html
Is there anything we can do to remedy this besides a node restart, which
triggers leader re-election on the healthy shards and causes them to also
become non-operational?
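One thing we are considering as a narrower remedy than a full node restart is forcing a leader for just the stuck shard via the Collections API. This is only a sketch; the host, collection and shard names below are placeholders for our real ones.

curl 'http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=TestCollection&shard=shard1'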
> Are yours growing always, on all nodes, forever? Or is it one or two who
ends up in a bad state?
Randomly on some of the shards and some of the followers in the collection.
Then whichever tlog was open on a follower while it was the leader, that one
doesn't stop growing. And that shard had active
Looks like the problem is related to tlog rotation on the follower shard.
We did the following for a specific shard.
0. start solr cloud
1. solr-0 (leader), solr-1, solr-2
2. rebalance to make solr-1 the preferred leader (see the commands sketched below)
3. solr-0, solr-1 (leader), solr-2
The tlog file on solr-0 kept on growing i
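For step 2, the rebalance was done with the standard preferredLeader property plus REBALANCELEADERS, roughly along these lines (host, collection and replica names are placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&collection=TestCollection&shard=shard1&replica=core_node2&property=preferredLeader&property.value=true'
curl 'http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=TestCollection'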
We found that for the shard that does not get a leader, the tlog replay did
not complete for hours (we don't see the "log replay finished", "creating leader
registration node", "I am the new leader", etc. log messages).
Also not sure why the TLOGs are tens of GBs (anywhere from 30 to 40GB).
Collectio
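To watch the growth, we simply check the tlog directories on disk; the data path below is an assumption about the install layout, adjust for yours:

du -sh /var/solr/data/*/data/tlog | sort -h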
By tracing the output in the log files we see the following sequence.
Overseer role list has POD-1, POD-2, POD-3 in that order
POD-3 has 2 shard leaders.
POD-3 restarts.
A) Logs for the shard whose leader moves successfully from POD-3 to POD-1
On POD-1: o.a.s.c.ShardLeaderElectionContext Replay
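(The sequences above were traced by grepping the election/replay messages out of solr.log on each pod; the log path below is an assumption.)

grep -E 'ShardLeaderElectionContext|log replay finished|I am the new leader' /var/solr/logs/solr.log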
Hello,
On reboot of one of the solr nodes in the cluster, we often see a
collection's shards with
1. LEADER replica in DOWN state, and/or
2. shard with no LEADER
Output from /solr/admin/collections?action=CLUSTERSTATUS is below.
Even after 5 to 10 minutes, the collection often does not recover.
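The state was pulled roughly like this (host and collection name are placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=TestCollection&wt=json'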
> Does this happen on a warm searcher (are subsequent requests with no
intervening updates _ever_ fast?)?
Subsequent response times are very fast if the searcher remains open. As a control
test, I faceted on the same field that I used in the q param.
1. Start solr
2. Execute q=resultId:x&rows=0
=>
Ok. I'll try that. Meanwhile the query on resultId gives a subsecond response, but
the immediately following faceting query takes 40+ secs. The core has 185 million
docs and a 63GB index.
curl
'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=resultId:x&rows=0'
{
"responseHea
Hello,
I am seeing very slow responses from JSON faceting against a single core
(though the core is a shard leader in a collection).
Fields processId and resultId are non-multivalued, indexed, docValues
strings (not text).
Soft Commit = 5sec (openSearcher=true) and Hard Commit = 10sec because new
do
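For reference, the faceting request is roughly of this shape (the facet field, label and limit below are illustrative, not our exact values):

curl -H 'Content-Type: application/json' 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query' -d '
{
  "query": "resultId:x",
  "limit": 0,
  "facet": {
    "byProcess": { "type": "terms", "field": "processId", "limit": 10 }
  }
}'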
I am running into an exception where creating child docs fails unless the
field already exists in the schema (the stack trace is at the bottom of this
post). My Solr is v8.5.1 running in standard/non-cloud mode.
$> curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/mycore/updat
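The update is roughly of this shape (field names here are placeholders; the failure shows up when the child's field is not already in the schema):

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/mycore/update?commit=true' -d '
[
  {
    "id": "parent-1",
    "title": "parent doc",
    "_childDocuments_": [
      { "id": "child-1", "someNewField": "child value" }
    ]
  }
]'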
Is there a way to combine the cursor paging feature with the graph query
parser?
Background:
I have a hierarchical data structure that is split into N different flat
JSON docs and updated (inserted) into Solr with from/to fields. Using the
from/to join syntax, a graph query is needed since differen
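What I have tried so far looks roughly like this (collection and from/to field names are placeholders, and it assumes id is the uniqueKey, since cursor paging needs a sort ending on the uniqueKey):

curl 'http://localhost:8983/solr/mycollection/select' \
  --data-urlencode 'q={!graph from=parentId to=id}id:root-1' \
  --data-urlencode 'sort=id asc' \
  --data-urlencode 'rows=100' \
  --data-urlencode 'cursorMark=*'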
Thank you for https://issues.apache.org/jira/browse/SOLR-12691.
I see it's marked as minor. Can we bump up the priority please?
The example of 2 cores ingesting + transientCacheSize==1 was provided for
reproduction reference only, and is not running in production.
Production setup on AWS use
> Having 100+ cores on a Solr node and a transient cache size of 1
The original post clarified the current state: "we have about 75 cores with
"transientCacheSize" set to 32". If transientCacheSize is increased to match
the current core count, we'll only defer the issue. It's going to hit 100s of
cores per sol
> The problem here is that you may have M requests queued up for the _same_
core, each with a new update request.
With transientCacheSize == 1, as soon as the update request for Core B is
received, Core B encounters data corruption, not Core A. Both Core A and Core
B are receiving update requests.
In the git commit mentioned below, I see SolrCloudClient has been changed to
generate Solr core URLs differently than before.
In the previous version, Solr URLs were computed using "url =
coreNodeProps.getCoreUrl()".
This concatenated "base_url" + "core" name from the clusterstate for a
tenant's s
FYI. This issue went away after solrconfig.xml was tuned.
"Hard commits blocked | non-solrcloud v6.6.2" thread has the details.
http://lucene.472066.n3.nabble.com/Hard-commits-blocked-non-solrcloud-v6-6-2-td4374386.html
The below solrconfig.xml settings resolved the TIMED_WAIT in
ConcurrentMergeScheduler.doStall(). Thanks to Shawn and Erik for their
pointers.
...
30
100
30.0
18
6
300
...
${solr.autoCommit.maxTime:3}
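To confirm the merge stalls were gone, thread dumps can be checked for the doStall frames, e.g. (pid is a placeholder):

jcmd <pid> Thread.print | grep -A8 'ConcurrentMergeScheduler.doStall'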
> https://github.com/mohsinbeg/datadump/tree/master/solr58f449cec94a2c75_core_256
I had uploaded the output at the above link.
The OS has no swap configured. There are other processes on the host, but they
use <1GB of memory or <5% CPU cumulatively, and none run inside the Docker
container, as `top` shows. The Solr JVM heap is at 30GB
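For the record, the swap/memory picture was checked on the host with the usual commands:

free -m
top -bn1 | head -n 20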
Hi Shawn, Erik
> updates should slow down but not deadlock.
The net effect is the same. As the CLOSE_WAITs increase, the JVM ultimately
stops accepting new socket requests, at which point `kill ` is the
only option.
This means if the replication handler is invoked, which sets the deletion policy,
the th
Ran /solr/58f449cec94a2c75-core-248/admin/luke at 7:05pm PST
It showed "lastModified: 2018-02-10T02:25:08.231Z" indicating commit blocked
for about 41 mins.
Hard commit is set as 10secs in solrconfig.xml
Other cores are also now blocked.
https://jstack.review analysis of the thread dump says "Po
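The check itself is just the Luke handler plus a grep for lastModified (host is an assumption, core name as above):

curl 'http://localhost:8983/solr/58f449cec94a2c75-core-248/admin/luke?show=index&numTerms=0&wt=json' | grep lastModified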
Shawn, Eric,
Were you able to look at the thread dump?
https://github.com/mohsinbeg/datadump/blob/master/threadDump-7pjql_1.zip
Or is there additional data I can provide?
> Setting openSearcher to false on autoSoftCommit makes no sense.
That was my mistake in my solrconfig.xml. Thank you for identifying it. I
have corrected it.
I then removed my custom element from my solrconfig.xml, and
both the hard commit and /solr/admin/core hang issues seemed to go away for a
cou
> If you issue a manual commit
> (http://blah/solr/core/update?commit=true) what happens?
That call never returned to the client browser.
So I also tried a core reload and captured it in the thread dump. That too
never returned.
"qtp310656974-1022" #1022 prio=5 os_prio=0 tid=0x7ef25401000
I am seeing that after some time hard commits in all my solr cores stop, and
each one's searcher has an "opened at" date from hours ago, even though they
are continuing to ingest data successfully (index size increasing
continuously).
http://localhost:8983/solr/#/x-core/plugins?type=core&en
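Outside the admin UI, one way to spot-check the searcher age is the mbeans endpoint, whose searcher entry carries an openedAt timestamp (host and core name below follow the example above):

curl 'http://localhost:8983/solr/x-core/admin/mbeans?stats=true&cat=CORE&wt=json' | grep -i openedAt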
> Maybe this is the issue:
https://github.com/eclipse/jetty.project/issues/2169
Looks like it is the issue. (I've redacted IP addresses below for security
reasons.)
solr [ /opt/solr ]$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
8425 CLOSE_WAIT -
92 ESTABLISHED -
1 FIN
Maybe this is the issue: https://github.com/eclipse/jetty.project/issues/2169
I have noticed that when the number of HTTP requests/sec increases, CLOSE_WAITs
increase linearly until Solr stops accepting socket connections. Netstat
output is
$ netstat -ptan | awk '{print $6 " " $7 }' | sort | uniq -c
> You said that you're running Solr 6.2.2, but there is no 6.2.2 version.
> but the JVM argument list includes "-Xmx512m" which is a 512MB heap
My typos. They're 6.6.2 and -Xmx30g respectively.
> many open connections causes is a large number of open file handles,
solr [ /opt/solr/server/logs ]$
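To compare the open file handle count against the process limit (pid is a placeholder):

cat /proc/<pid>/limits | grep 'Max open files'
ls /proc/<pid>/fd | wc -l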
Hello,
In our Solr non-cloud env., we are seeing lots of CLOSE_WAITs, causing the JVM to
stop "working" within 3 mins of Solr start.
solr [ /opt/solr ]$ netstat -anp | grep 8983 | grep CLOSE_WAIT | grep
10.xxx.xxx.xxx | wc -l
9453
The only option then is `kill -9` because even `jcmd Thread.print` is
unable