Best practices for Solr (how to update jar files safely)

2016-02-17 Thread Brian Wright

Hello,

We have a new project to use Solr. Our Solr instance will use Jetty 
rather than Tomcat. We plan to extend the Solr core system by adding 
additional classes (jar files) to the 
/opt/solr/server/solr-webapp/webapp/WEB-INF/lib directory to extend 
features. We also plan to run two instances of Solr on each physical 
server preferably from a single installed Solr instance. I've read the 
best practices doc on running two Solr instances, and while it's 
detailed about how to set up two instances, it doesn't cover our 
specific use case.


For ease of our custom jar deployments, we would prefer to have only one 
install of Solr. However, I have read that the JVM's default class 
loader provides lazy class loading. This means that should we replace a 
jar file (during new release deployments) and the JVM goes looking for 
the old jar after replacement, the JVM could crash and dump. Is lazy 
class loading a concern with Solr and specifically when custom built 
jars are used?
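
To picture the deployment risk described above, here is a minimal Python sketch, assuming a Linux/POSIX filesystem. It installs a replacement jar by writing it to a temporary file in the same directory and renaming it over the old one, rather than overwriting the file in place. With a rename, a process that already has the old jar open keeps reading the old bytes, while anything that opens the path afterwards gets the new file. The names install_jar_atomically, demo_replace, and custom-plugin.jar are hypothetical, and nothing here is Solr-specific or confirmed by the Solr documentation. A running JVM would still need a restart to pick up the new classes.

```python
import os
import shutil
import tempfile


def install_jar_atomically(new_jar, target):
    """Copy new_jar next to target, then rename it over target.

    The rename swaps the directory entry in one step, so a process that
    already has the old file open keeps reading the old bytes.
    """
    target_dir = os.path.dirname(os.path.abspath(target))
    # The temp file must live on the same filesystem as the target,
    # otherwise the rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copy(new_jar, tmp_path)  # copies bytes and permission bits
        os.replace(tmp_path, target)    # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def demo_replace():
    """Show what an already-open handle sees versus a fresh open."""
    workdir = tempfile.mkdtemp()
    try:
        live = os.path.join(workdir, "custom-plugin.jar")  # hypothetical name
        incoming = os.path.join(workdir, "incoming.jar")
        with open(live, "wb") as f:
            f.write(b"old release")
        with open(incoming, "wb") as f:
            f.write(b"new release")
        # This open handle stands in for the running JVM's view of the jar.
        with open(live, "rb") as running:
            install_jar_atomically(incoming, live)
            seen_by_running = running.read()
        with open(live, "rb") as fresh:
            seen_by_fresh = fresh.read()
        return seen_by_running, seen_by_fresh
    finally:
        shutil.rmtree(workdir)
```

Because a restart is still required to load the new jar, the instances would presumably be restarted one at a time so that both are never down together.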


Would it be better to use two separate installs of Solr and manage each 
independently or can we safely get away with using a single Solr install 
and then update jars in this instance for both running instances? I'm 
trying to find the best safety practices for release deployment with the 
way we intend to install and use our Solr in our environment. I should 
also mention that running Solr instances cannot be down concurrently due 
to cluster sharding.


Is there any other way through CLASSPATH or other config/runtime methods 
where we could use a single install, but separate the WEB-INF directory 
into two to support each separate running instance? Are there any other 
ideas?


Any advice is appreciated.

Thanks.

--
Signature

*Brian Wright*
*Sr. Systems Engineer *
901 Mariners Island Blvd Suite 200
San Mateo, CA 94404 USA
*Email *bri...@marketo.com <mailto:bri...@marketo.com>
*Phone *+1.650.539.3530**
*www.marketo.com <http://www.marketo.com/>*





Re: Best practices for Solr (how to update jar files safely)

2016-02-19 Thread Brian Wright

Hi Shawn,

Without going into excessive detail on our design, I can't fully answer 
your question as to why. Suffice it to say we plan to deploy this 
indexing for our entire customer base. Because of the size of these 
document collections and the way they will grow over time, doubling the 
number of machines is not feasible in our current infrastructure at this 
time. It may be justified later, but not today. It's less expensive to 
add more CPUs and RAM than to double up on physical machines. 
Additionally, further budgetary constraints in our international 
datacenters prevent us from having identical clusters across the board, 
thus requiring us to double up instances per server. We're not talking 
about 2 or 3 machines here. We're talking about 128 running instances of 
Solr with 64 clusters and many shards.


However, that doesn't preclude the use of something like Docker or KVM 
to allow encapsulation of each Solr environment on a virtual machine 
which is hooked to a fast storage subsystem.


I would also suggest that if the recommendation is not to run two 
instances side-by-side, then the documentation regarding how to set this 
up should be removed and a strong statement put in its place that 
running multiple Solr instances is not a supported configuration. Right 
now, the documentation does not state this and, in fact, implies that it 
is perfectly fine to run multiple instances side by side as long as 
independent disks are used to hold the instances.


Note, this was not my design and I am not a fan of doing this, but I'm not 
the person making this decision. I am the person who's tasked to 
implement this design choice.


Thanks.

On 2/17/16 10:19 PM, Shawn Heisey wrote:

> On 2/17/2016 10:38 PM, Brian Wright wrote:
>
>> We have a new project to use Solr. Our Solr instance will use Jetty
>> rather than Tomcat. We plan to extend the Solr core system by adding
>> additional classes (jar files) to the
>> /opt/solr/server/solr-webapp/webapp/WEB-INF/lib directory to extend
>> features. We also plan to run two instances of Solr on each physical
>> server preferably from a single installed Solr instance. I've read the
>> best practices doc on running two Solr instances, and while it's
>> detailed about how to set up two instances, it doesn't cover our
>> specific use case.
>
> Why do you want to run multiple instances on one server?  Unless you
> have a REALLY good reason to have more than one instance per server,
> don't do it.  One instance can handle many indexes with no problem.
>
> The only valid reason I can think of to run more than one instance per
> machine is when a single instance requires a VERY large heap.  In that
> case, it *might* be better to run two instances that each have a smaller
> heap, so that garbage collection times are lower.  I personally would
> add more machines, rather than run multiple instances.

> Generally the best way to load custom jars (and contrib components like
> the dataimport handler) in Solr is to create a "lib" directory in the
> solr home (where solr.xml lives) and place all extra jars there.  They
> will be loaded once when Solr starts, and all cores will have access to
> them.
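
As a rough illustration of how the solr-home "lib" approach could apply to two instances started from one installation, the Python sketch below only builds directory names and bin/solr start command lines; it does not run anything. It assumes each instance gets its own Solr home (passed with -s) and its own port (passed with -p), so each instance has its own lib directory for custom jars. The paths under /var/solr and the helper name node_layout are hypothetical.

```python
from pathlib import PurePosixPath


def node_layout(install_dir, data_root, ports):
    """Describe one Solr home, lib directory and start command per port.

    install_dir is the single shared Solr installation; data_root is a
    hypothetical parent directory holding one Solr home per instance.
    """
    solr_script = PurePosixPath(install_dir) / "bin" / "solr"
    nodes = []
    for port in ports:
        home = PurePosixPath(data_root) / f"node-{port}"
        nodes.append({
            "solr_home": str(home),
            # Custom jars for this instance only go here.
            "lib_dir": str(home / "lib"),
            # -p sets the port, -s sets the Solr home directory.
            "start_cmd": [str(solr_script), "start",
                          "-p", str(port), "-s", str(home)],
        })
    return nodes


layout = node_layout("/opt/solr", "/var/solr", [8983, 8984])
for node in layout:
    print(" ".join(node["start_cmd"]), "  # jars in", node["lib_dir"])
```

With separate lib directories, jars for one instance can be swapped and that instance restarted while the other keeps running on its existing jars.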

> The rest of your email was concerned with running multiple instances.
> If you *REALLY* want to go against advice and do this, here's the
> recommended way:
>
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-RunningmultipleSolrnodesperhost
>
> It is very likely possible to run multiple instances out of the same
> installation directory, but I am not sure how to do it.
>
> Thanks,
> Shawn







Re: Best practices for Solr (how to update jar files safely)

2016-02-19 Thread Brian Wright

Hi Shawn,

Thanks for the information.

On 2/19/16 3:32 PM, Shawn Heisey wrote:

> You will use fewer resources if you only run one Solr instance on each
> machine.  You can still have different considerations for different
> hardware with one instance -- on the servers with more resources,
> configure a larger Java heap and run more indexes.

Yes, I do realize performance on this box would be better with 
a single instance. However, jump down to the bottom comment for more detail.



> If the plan is to run SolrCloud, having only one Solr instance per
> physical machine will ensure that SolrCloud never places more than one
> replica for a shard on the same physical host, and it will do this
> without special configuration.

I will confirm the use of SolrCloud in this environment. This hasn't 
been mentioned so far, but I don't always get all information from the 
software architects immediately when I'm still in build mode.
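
To make the replica placement point concrete, here is a small Python sketch that flags shards with more than one replica on the same physical host. It assumes node names of the form host:port_solr and a hand-written placement map; the hostnames, shard names, and function names are made up for illustration and are not read from a real cluster.

```python
from collections import defaultdict


def physical_host(node_name):
    """Return the host part of a node name such as 'box1:8983_solr'."""
    return node_name.split(":", 1)[0]


def colocated_replicas(placement):
    """Find shards whose replicas share a physical host.

    placement maps a shard name to the list of node names holding its
    replicas. The result maps each affected shard to the hosts that
    carry more than one of its replicas.
    """
    problems = {}
    for shard, nodes in placement.items():
        by_host = defaultdict(list)
        for node in nodes:
            by_host[physical_host(node)].append(node)
        shared = {host: held for host, held in by_host.items() if len(held) > 1}
        if shared:
            problems[shard] = shared
    return problems


# Hypothetical layout: two Solr instances (ports 8983 and 8984) per box.
example = {
    "shard1": ["box1:8983_solr", "box1:8984_solr"],  # both replicas on box1
    "shard2": ["box1:8983_solr", "box2:8984_solr"],  # spread across boxes
}
print(colocated_replicas(example))
```

In this made-up layout, losing box1 would take out every replica of shard1, which is the situation a one-instance-per-host setup avoids without extra configuration.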



> The documentation is driven by what users ask for.  A lot of users ask
> how to run multiple instances on one machine.  Your idea above would be
> my preference on how to handle the documentation.  Or perhaps leave the
> instructions in there, but include a strong warning indicating that one
> instance will usually work better.
>
> Each time I see somebody ask how to run multiple instances, I give them
> the same advice I gave you.  It is often ignored.


Here's the fundamental issue when talking about use of software like 
Solr in a corporate environment. As a systems engineer at Marketo, my 
best practices recommendations and justifications are based on the 
documentation provided by the project owners. If the project's docs 
state that something is feasible without any warnings, not only will the 
software engineer latch onto that documentation and want to use it in 
that way, my hands will be tied as a systems engineer when I'm expected 
to roll that architecture to production, as I have nothing to argue 
against that use case.


So, here's the word of caution to project owners. Many companies work 
just like Marketo. The software team will design and build a platform 
based on what the documentation states is possible. My job (in addition 
to building a functional system) is to not only read through the install 
docs to ensure Marketo is following best practices, but also to be the 
voice of clarity when concepts are cloudy. If a doc explicitly says 
"don't do this" or "warning: not supported", I have immediate 
justification to recommend not following that path when going into 
production. However, when docs are written that state that something is 
possible without a hint of "but this is bad and here's why", my hands 
are tied when talking to management. They need hard facts and 
reasons not to go into production with a specific design. The management 
here are fact driven and need to see reasons from the project owners / 
developers (not from me) why any given setup is not recommended.


I am personally not at all fond of doubling up services of any type when 
going into production simply for the reason that either of the two 
processes could bring down the whole box and take down both instances. 
But, that argument alone isn't strong enough to justify not doing it. 
This issue isn't limited to Solr / Java. This type of failure can happen 
with any application of any type. However, for management to take a 
design change seriously, I need technical ammo (in the form of docs that 
state major performance degradation) that I can take to them and say, 
"Hey, look at this. It says not to do this because  " and then we 
come up with an alternative design.


I should also like to note that we have a current environment, although 
much smaller, which has been running two instances per box seemingly 
without issues for several years. For me to attempt to argue that it's 
not possible or is a bad idea goes against the experience we've already 
accrued when running two Solr JVMs side-by-side. So, if there are 
legitimate benchmarks that, for example, show side-by-side instances 
degrade each other worse than running two Xen VMs on the same systems or 
when running alone, then I have a strong reason to suggest using an 
alternative design. Unfortunately, our past use of Solr already 
justifies the use of two instances in this new replacement system.


Thanks.


> Thanks,
> Shawn






SOLR_HOME vs solr.xml (solrcloud config)

2016-03-11 Thread Brian Wright

Hi,

Another question regarding documentation of Solr and Zookeeper.

The manual states:


 Solr Hostname

Use the |SOLR_HOST| variable in the include file to set the hostname of 
the Solr server.


|SOLR_HOST=solr1.example.com|

Setting the hostname of the Solr server is recommended, especially when 
running in SolrCloud mode, as this determines the address of the node 
when it registers with ZooKeeper.



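Yet, in the example solr.xml, the <solrcloud> stanza defines ...

  <solrcloud>

    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>

    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>

    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>

  </solrcloud>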


More specifically, I'm asking about the ${host:} variable. When this 
variable is filled in at startup, ZooKeeper seems to obtain the correct 
hostname even though SOLR_HOST has not been set. Yet, the docs recommend 
setting (or additionally setting?) SOLR_HOST with an explicit hostname.


If the <solrcloud> stanza in solr.xml is there specifically to define 
the setup of SolrCloud, is it still recommended to define SOLR_HOST 
separately, and what benefit does this provide over relying on ${host:} 
in solr.xml? This ${host:} variable at least works without making an 
explicit declaration of an FQDN. At the same time, ${host:} will 
auto-populate if the hostname of the box changes. If you hardcode SOLR_HOST 
into a config, this won't dynamically update should the hostname of the 
box change (not that I'm going to run around changing hostnames, but I 
can see how explicitly defining SOLR_HOST could become a problem when 
someone doesn't know that it is defined).
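
To make the question concrete, here is a minimal Python sketch of how a ${name:default} placeholder such as ${host:} could resolve. It assumes that a non-empty SOLR_HOST reaches the JVM as a system property named host; that link is an assumption about the startup script, not something stated in the manual text quoted above. The function names resolve and props_from_env are hypothetical, and the sketch only models the substitution syntax, not what Solr does when the resolved value is empty.

```python
import re

# Matches ${name} or ${name:default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")


def resolve(text, props):
    """Replace each ${name:default} with props[name], else the default."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in props:
            return props[name]
        if default is None:
            raise KeyError(f"no value and no default for {name!r}")
        return default
    return _PLACEHOLDER.sub(substitute, text)


def props_from_env(env):
    """Assumption: SOLR_HOST, when set, becomes the 'host' property."""
    props = {}
    if env.get("SOLR_HOST"):
        props["host"] = env["SOLR_HOST"]
    return props


with_host = resolve("${host:}", props_from_env({"SOLR_HOST": "solr1.example.com"}))
without_host = resolve("${host:}", props_from_env({}))
default_port = resolve("${jetty.port:8983}", {})
```

Under that assumption, SOLR_HOST and ${host:} are two ends of the same setting: with SOLR_HOST unset the placeholder resolves to an empty string, and whatever address gets registered is then chosen some other way.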


What is the best practice here?

Thanks.



Re: SOLR_HOST vs solr.xml (solrcloud config)

2016-03-11 Thread Brian Wright

Correction...

Too fast on the send button. The subject should have been SOLR_HOST, not 
SOLR_HOME.  Sorry for any confusion. Though, the body is correct.






