Mark Miller created SOLR-13867:
----------------------------------

             Summary: Make Solrcloud stable and performant and capable of 
having passing tests.
                 Key: SOLR-13867
                 URL: https://issues.apache.org/jira/browse/SOLR-13867
             Project: Solr
          Issue Type: Task
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Mark Miller
            Assignee: Mark Miller
             Fix For: master (9.0)


After spending a bit of time away from SolrCloud and being deeply involved in 
trying to stabilize it and its tests, I came back in 2018 and went deep into 
the system with the Starburst upgrade.

What I found surprised me, though I guess it should not have. The system is 
slow, often silly, super buggy, and not good at connection reuse, thread 
safety, efficient ZooKeeper communication, or efficient startup and shutdown.

Often, the things we do to make tests pass make things worse because you can't 
do things reasonably without some major code work.

Twice now, I've seen the system in the shape it was supposed to take. FAST. Not 
bug free, but 100X more solid at least and much, much, much faster.

The current system is sick and actually getting worse under its own weight as 
more is shoveled on top. Compared with even 1.5 years ago, the problems are 
worse, not better.

Sadly, I'm smart enough to know what has to be done, but not smart enough to do 
most of it twice and then lose most of it twice.

Nonetheless, it's time to fix SolrCloud. It's not supposed to be this way.

I spent a lot of time after Starburst making tests pass for me. Then a lot of 
time on a better build system that can help us improve development and good 
practices around the project. And then a lot of time making tests faster. These 
are important steps, but little itty bitty baby steps without addressing the 
core rot that is growing. We don't find a problem, fully understand what is 
up, and craft a careful solution. We find something that we can toss into the 
Grand Canyon, listen to it bounce around for a while, and if nobody screams, 
we move on to the next thing. That's not necessarily anyone's choice; there is 
little else you can do until the system is fixed. When that happens we can 
start making smart changes instead of just shoving around the mess.

Twice I have made the current system fast. What happens first? Nothing works. 
The system doesn't know how to be fast. It doesn't have the thread safety or 
proper logic to be fast. And that is not a place I want to be.

