[ 
https://issues.apache.org/jira/browse/SOLR-13867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959258#comment-16959258
 ] 

Mark Miller commented on SOLR-13867:
------------------------------------

The first step is to make the current code good - it is just riddled with bugs
and inefficiency. I've made the system fast and more reasonable a couple of
times, and each time all those bugs pop.

 

There are dozens and dozens of concurrency bugs at least, and we are a web
server - the epitome of a multi-threaded application.

Going forward, the focus has to be defense, and I do think separation of tests
will improve things, but the core is in no better shape than any contrib.

I would like to move towards having two test modes though - right now we
practically ignore Nightly. When I'm done making all the non-Nightly tests take
5 seconds or so at most on good hardware, non-Nightly should become more of a
smoke test and Nightly will be where the real action is (it also doesn't have
to be run nightly, that's just the current tag).

If we had something solid and a bunch of cruft, I'd say sure, let's cordon
off the cruft, but we have absolutely nothing solid.

I can fix that for almost all the modules for the most part, and add some guard
rails on the way, but beyond that I won't be able to babysit, so we will need
tactics to keep a good system.

It will be new for this community. The way we grafted SolrCloud onto classic 
Solr in a dev boost way (we only had a couple devs) and the way Lucene pushed 
Solr into randomization and parallelization before it was even near ready, we 
have been struggling to stay afloat from the start and always without any kind 
of clear view of the actual system we have here.

I can change that and give us something closer to a clean slate; it will be up
to others whether we can maintain that or not.

I'd like to see Solr 9 be the first fast and solid release of SolrCloud. 
Obviously I will need some help, but hang on and let me prove to you that we 
are not at all where we should be first.

> Make Solrcloud stable and performant and capable of having passing tests.
> -------------------------------------------------------------------------
>
>                 Key: SOLR-13867
>                 URL: https://issues.apache.org/jira/browse/SOLR-13867
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Major
>             Fix For: master (9.0)
>
>
> After spending a bit of time away from SolrCloud after being deeply involved 
> in trying to stabilize it and its tests, I came back in 2018 and went deep
> into the system with the Starburst upgrade.
> What I found surprised me, though I guess it should not have. The system is 
> slow, often silly, super buggy, not good at connection reuse or thread safety 
> or efficient Zookeeper communication or efficient startup and shutdown.
> Often, the things we do to make tests pass make things worse because you 
> can't do things reasonably without some major code work and so we fight for 
> test passes, not correctness.
> Twice now, I've seen the system in the shape it was supposed to take. FAST. 
> Not bug free, but 100X more solid at least and much, much, much, much faster.
> The current system is sick and actually getting worse under its weight as
> more is shoveled on top. Even since 1.5 years ago, the problems are worse,
> not better. Tests will never pass. Yes, our tests were in pretty bad shape.
> But you can put them in the best shape possible and it won't matter. The 
> system will still fail tests.
> Sadly, I'm smart enough to know what has to be done, but not smart enough to 
> keep my work around after addressing most of the problems twice.
> Nonetheless, it's time to fix SolrCloud. It's not supposed to be this way.
> I've twice spent a week or two in a state with super fast SolrCloud. Super
> fast build system. Development is actually fun. You actually have a chance.
> I'm talking tests you have never seen take under 45-60 seconds taking 5.  
> Consistently. A different world.
> I spent a lot of time after Starburst making tests pass for me. Then a lot of
> time on a better build system that can help us improve development and good 
> practices around the project. And then a lot of time making tests faster. 
> These are important steps, but little itty bitty baby steps without 
> addressing the core rot that is growing. We don't find a problem and fully 
> understand what is up and craft a careful solution. We find something that we 
> can toss into the Grand Canyon, listen to it bounce around for a while, and
> if nobody screams, we move on to the next thing. That's not necessarily
> anyone's choice; there is little else you can do until the system is fixed.
> When that happens we can start making smart changes instead of just shoving 
> around the mess.
> Twice I have made the current system fast. What happens first? Nothing works. 
> The system doesn't know how to be fast. It doesn't have the thread safety or 
> proper logic to be fast. And that is not a place I want to be.



