Re: Integration test failures on Jenkins

2024-06-25 Thread Jason Gerlowski
The failures are ongoing, so I suspect no one has (yet) reached out to
Infra about this.

I pinged them this morning in #asfinfra, so hopefully we can get this
resolved shortly.

-
To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
For additional commands, e-mail: dev-h...@solr.apache.org



Solr replication threads blocking each other causing large delays

2024-06-25 Thread Marcus Bergner
Hi,
Before I proceed to create a Solr Jira issue I'll ask for some feedback here 
first. I did start a thread on the users mailing list a few days ago about this 
topic without any relevant feedback 
(https://lists.apache.org/thread/yv5d869tzckdgp6py83lqrgtpqz0xqvv) and I have 
done some more investigation since then. That thread does contain some log file 
snippets if you are interested in a bit more context.

In a Solr 8.11 master/replica setup where documents in multiple Solr cores are 
updated at the same time on the master there are very large delays in the 
replica between replicating data for the first core and subsequent cores. This 
seems to stem from some mutex locking that blocks threads that operate on 
DIFFERENT cores, rather than only blocking threads that operate on the SAME 
core. A "replication" can be divided into a couple of milestones for the 
IndexFetcher thread(s) on the Solr replica:


  1. Notices the leader has a new version and logs "Starting replication process"
  2. Creates a new IndexWriter (createMainIndexWriter)
  3. Downloads the index updates from the master
  4. Creates a new IndexWriter again!? (createMainIndexWriter)
  5. Calls openNewSearcher
  6. Gets new commit point and is done

The above activities should (as far as I know) run independently for each core 
so that each core can perform replication concurrently with other cores but 
they clearly do not. Currently updates on multiple cores on the master will 
cause IndexFetcher threads on the replica to block each other and the problem 
seems to quickly get worse the more cores you update at the same time. This can 
make it extremely difficult to reach a sensible indexing latency service level 
agreement on the replicas since the blocking can add several minutes of delay. 
The biggest culprits, code-wise, seem to be:


  * createMainIndexWriter is prevented from returning while another replication
    thread is active. Any mutex locks used in such code should be core specific,
    not global.
  * openNewSearcher can take a fairly long time to return (perhaps due to cache
    warming?) but it also seems to block other replication threads from
    progressing. Opening a new searcher should not hold any locks that are related
    to replication, especially not any locks related to other Solr cores.
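
To make the "core specific, not global" idea concrete, here is a minimal sketch of per-core locking. It is not taken from Solr: the class PerCoreLocks and its methods lockFor and withCoreLock are hypothetical names used only for illustration, and I have not confirmed which lock(s) the replication code actually contends on. The only point is that a thread holding the lock for one core does not block a thread working on another.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Solr): one lock per core name instead of one global lock. */
final class PerCoreLocks {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Returns the lock for the given core, creating it atomically on first use. */
    ReentrantLock lockFor(String coreName) {
        return locks.computeIfAbsent(coreName, name -> new ReentrantLock());
    }

    /** Runs the action while holding only the lock that belongs to coreName. */
    <T> T withCoreLock(String coreName, Supplier<T> action) {
        ReentrantLock lock = lockFor(coreName);
        lock.lock();
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PerCoreLocks locks = new PerCoreLocks();
        CountDownLatch core1Holding = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // Simulate a slow step (e.g. opening a searcher) that holds core1's lock.
        Thread slow = new Thread(() -> locks.withCoreLock("core1", () -> {
            core1Holding.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return null;
        }));
        slow.start();
        core1Holding.await();

        // core2 can proceed even though core1's lock is still held.
        String result = locks.withCoreLock("core2", () -> "core2 done while core1 is busy");
        System.out.println(result);

        release.countDown();
        slow.join();
    }
}
```

With a single shared lock in place of the map, the core2 call above would wait until the core1 thread released it, which is the kind of cross-core stall described in this mail.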

As an example, when I updated documents in 3 cores, the first IndexFetcher
thread spent 12s between logging "Starting replication process" and finishing
replication (at which point it resumed its normal behavior of checking Leader
vs Follower versions), the second spent 1m38s, and the third spent 2m49s.

Does this ring any bells in terms of existing known issues or known pitfalls 
regarding why Solr's replication threads must be synchronized in the way they 
are?

Depending on how feasible I think it is, I may try to provide a merge request
here, but it would be good to at least have confirmation that the intended
behavior for replication is that each core should replicate independently. Any
pointers regarding specific mutexes that might need to be replaced by more
core-specific ones (and suggestions for a suitable existing mutex to use
instead) would be helpful.

Kind regards,

Marcus


Re: Integration test failures on Jenkins

2024-06-25 Thread Houston Putman
Sorry, got busy with something and this fell off my radar. Will work with
Jason.

- Houston

On Tue, Jun 25, 2024 at 7:43 AM Jason Gerlowski wrote:

> The failures are ongoing, so I suspect no one has (yet) reached out to
> Infra about this.
>
> I pinged them this morning in #asfinfra, so hopefully we can get this
> resolved shortly.


lucene-solr-1 Jenkins Agent Management

2024-06-25 Thread Jason Gerlowski
Hey all,

Sending this email to discuss the ASF Jenkins' 'lucene-solr-1' worker
node, which both Lucene and Solr use to run builds.

While investigating a recent issue with some Solr builds, I asked
INFRA to restart 'lucene-solr-1'. They were happy to oblige (and I'm
now unblocked), but in the process they mentioned that 'lucene-solr-1'
was a "project run VM" and that someone in the project should have the
requisite access and permissions. [1] [2]

This was all a surprise to me, so I wanted to follow up here with a
few questions.

1. Is it true that our projects run or manage this VM in some way?
2. If so, does anyone remember the context around how or why this came
about? (As opposed to using one of the "standard" Jenkins build
machines that INFRA provides...)
3. Who all can access the box?  How are credentials managed?  (My
suspicion is that no one currently active on the Solr lists is able to
access the box, which seems worth remedying...)

Best,

Jason

[1] https://lists.apache.org/thread/g0y0ohczdctv0v5fn8sqv4t0j4y6hp68
[2] https://the-asf.slack.com/archives/CBX4TSBQ8/p1719318685682309
