Best practices for Solr (how to update jar files safely)
Hello,

We have a new project to use Solr. Our Solr instance will use Jetty rather than Tomcat. We plan to extend the Solr core system by adding additional classes (jar files) to the /opt/solr/server/solr-webapp/webapp/WEB-INF/lib directory to extend features. We also plan to run two instances of Solr on each physical server, preferably from a single installed Solr instance. I've read the best practices doc on running two Solr instances, and while it's detailed about how to set up two instances, it doesn't cover our specific use case.

For ease of our custom jar deployments, we would prefer to have only one install of Solr. However, I have read that the JVM's default class loader provides lazy class loading. This means that should we replace a jar file (during new release deployments) and the JVM goes looking for the old jar after replacement, the JVM could crash and dump.

Is lazy class loading a concern with Solr and specifically when custom built jars are used? Would it be better to use two separate installs of Solr and manage each independently, or can we safely get away with using a single Solr install and then update jars in this instance for both running instances?

I'm trying to find the best safety practices for release deployment with the way we intend to install and use our Solr in our environment. I should also mention that running Solr instances cannot be down concurrently due to cluster sharding.

Is there any other way through CLASSPATH or other config/runtime methods where we could use a single install, but separate the WEB-INF directory into two to support each separate running instance? Are there any other ideas?

Any advice is appreciated.

Thanks.

--
Brian Wright
Sr. Systems Engineer
901 Mariners Island Blvd Suite 200
San Mateo, CA 94404 USA
Email: bri...@marketo.com
Phone: +1.650.539.3530
www.marketo.com
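On the jar replacement worry above: one way to reduce the risk, sketched here under assumptions, is to never overwrite a jar in place. Copying the new jar to a temporary name in the same directory and then renaming it over the old one means a process that already has the old file open keeps reading the old contents, while the path points at the new file for the next restart. `install_jar` is a hypothetical helper name, not part of Solr, and the sketch assumes a POSIX filesystem where a rename within one directory is atomic.

```shell
#!/bin/sh
# Sketch only: install_jar is a hypothetical helper, not a Solr tool.
# It replaces DEST_DIR/<name>.jar by rename instead of overwriting the file,
# so a process that already has the old jar open keeps the old contents.

# install_jar SRC_JAR DEST_DIR
install_jar() {
    ij_src=$1
    ij_dir=$2
    ij_name=$(basename "$ij_src")
    ij_tmp="$ij_dir/.$ij_name.tmp.$$"

    # Copy under a temporary name first, so a partial copy is never
    # visible under the real jar name.
    cp "$ij_src" "$ij_tmp" || { rm -f "$ij_tmp"; return 1; }

    # A rename within one directory switches the name to the new file in a
    # single step; the old file lives on until its last reader closes it.
    mv -f "$ij_tmp" "$ij_dir/$ij_name"
}
```

A call might look like `install_jar build/custom-plugin.jar /var/solr/node1/lib` (both paths are made up for illustration). A running instance still has to be restarted before it loads the new classes, so the instances would be restarted one at a time.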
Re: Best practices for Solr (how to update jar files safely)
Hi Shawn,

Without going into excessive detail on our design, I won't be able to sufficiently justify an answer to your question as to the why of it. Suffice it to say we plan to deploy this indexing for our entire customer base. Because of the size of these document collections and the way that they will grow over time, doubling up on machines is not feasible in our current infrastructure at this time. It may be justified later, but not today. It's less expensive to add more CPUs and RAM than to double up on physical machines. Additionally, there are further budgetary constraints going into our international datacenters which prevent us from having identical clusters across the board, thus requiring doubling up. We're not talking about 2 or 3 machines here. We're talking 128 running instances of Solr with 64 clusters and many shards. However, that doesn't preclude the use of something like Docker or KVM to allow encapsulation of each Solr environment on a virtual machine which is hooked to a fast storage subsystem.

I would also suggest that if the recommendation is not to run two instances side-by-side, then the documentation regarding how to set this up should be removed and a strong statement put in its place that running multiple Solr instances is not a supported configuration. Right now, the documentation does not state this and, in fact, implies that it is perfectly fine to run multiple instances side by side as long as independent disks are used to hold the instances.

Note, this was not my design and I am not a fan of doing this, but I'm not the person making this decision. I am the person who's tasked to implement this design choice.

Thanks.

On 2/17/16 10:19 PM, Shawn Heisey wrote:
> On 2/17/2016 10:38 PM, Brian Wright wrote:
>> We have a new project to use Solr. Our Solr instance will use Jetty rather than Tomcat. We plan to extend the Solr core system by adding additional classes (jar files) to the /opt/solr/server/solr-webapp/webapp/WEB-INF/lib directory to extend features. We also plan to run two instances of Solr on each physical server preferably from a single installed Solr instance. I've read the best practices doc on running two Solr instances, and while it's detailed about how to set up two instances, it doesn't cover our specific use case.
>
> Why do you want to run multiple instances on one server? Unless you have a REALLY good reason to have more than one instance per server, don't do it. One instance can handle many indexes with no problem.
>
> The only valid reason I can think of to run more than one instance per machine is when a single instance requires a VERY large heap. In that case, it *might* be better to run two instances that each have a smaller heap, so that garbage collection times are lower. I personally would add more machines, rather than run multiple instances.
>
> Generally the best way to load custom jars (and contrib components like the dataimport handler) in Solr is to create a "lib" directory in the solr home (where solr.xml lives) and place all extra jars there. They will be loaded once when Solr starts, and all cores will have access to them.
>
> The rest of your email was concerned with running multiple instances. If you *REALLY* want to go against advice and do this, here's the recommended way:
>
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-RunningmultipleSolrnodesperhost
>
> It is very likely possible to run multiple instances out of the same installation directory, but I am not sure how to do it.
>
> Thanks,
> Shawn
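Shawn's suggestion of a "lib" directory in the Solr home also gives each instance its own jar directory when both instances are started from one install. The following is a minimal sketch, assuming the bin/solr script's -p (port) and -s (Solr home) options as found in Solr 5.x. The paths, ports, and the helper names `make_node_home` and `update_node` are hypothetical, and any ZooKeeper settings are assumed to come from the include file.

```shell
#!/bin/sh
# Sketch only: two Solr nodes started from ONE install, each with its own
# Solr home and therefore its own lib/ directory of custom jars.
# Assumed layout (not from the thread): install in /opt/solr, homes such as
# /var/solr/node1 and /var/solr/node2, ports 8983 and 8984.
SOLR_BIN=${SOLR_BIN:-/opt/solr/bin/solr}

# make_node_home HOME SOLR_XML
# Create a Solr home containing solr.xml and an empty lib/ directory.
make_node_home() {
    mkdir -p "$1/lib" || return 1
    [ -f "$1/solr.xml" ] || cp "$2" "$1/solr.xml"
}

# update_node HOME PORT JAR_DIR
# Stop one node, replace the jars in its own lib/, then start it again.
# The other node is not touched, so both are never down at the same time.
update_node() {
    un_home=$1
    un_port=$2
    un_jars=$3
    "$SOLR_BIN" stop -p "$un_port" || return 1
    rm -f "$un_home"/lib/*.jar
    cp "$un_jars"/*.jar "$un_home/lib/" || return 1
    "$SOLR_BIN" start -p "$un_port" -s "$un_home"
}
```

A release would then be rolled out with `update_node /var/solr/node1 8983 /tmp/release` followed by `update_node /var/solr/node2 8984 /tmp/release`, ideally waiting for the first node's replicas to recover before touching the second. Because jars are only changed while the owning JVM is stopped, a running JVM never sees a jar change underneath it.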
Re: Best practices for Solr (how to update jar files safely)
Hi Shawn,

Thanks for the information.

On 2/19/16 3:32 PM, Shawn Heisey wrote:
> You will use fewer resources if you only run one Solr instance on each machine. You can still have different considerations for different hardware with one instance -- on the servers with more resources, configure a larger Java heap and run more indexes.

Yes, I do realize that this box would perform better with a single instance. However, jump down to the bottom comment for more detail.

> If the plan is to run SolrCloud, having only one Solr instance per physical machine will ensure that SolrCloud never places more than one replica for a shard on the same physical host, and it will do this without special configuration.

I will confirm the use of SolrCloud in this environment. This hasn't been mentioned so far, but I don't always get all information from the software architects immediately when I'm still in build mode.

> The documentation is driven by what users ask for. A lot of users ask how to run multiple instances on one machine. Your idea above would be my preference on how to handle the documentation. Or perhaps leave the instructions in there, but include a strong warning indicating that one instance will usually work better. Each time I see somebody ask how to run multiple instances, I give them the same advice I gave you. It is often ignored.

Here's the fundamental issue when talking about use of software like Solr in a corporate environment. As a systems engineer at Marketo, my best practices recommendations and justifications are based on the documentation provided by the project owners. If the project's docs state that something is feasible without any warnings, not only will the software engineer latch onto that documentation and want to use it in that way, my hands will be tied as a systems engineer when I'm expected to roll that architecture to production, as I have nothing to argue against that use case.

So, here's the word of caution to project owners. Many companies work just like Marketo. The software team will design and build a platform based on what the documentation states is possible. My job (in addition to building a functional system) is not only to read through the install docs to ensure Marketo is following best practices, but also to be the voice of clarity when concepts are cloudy. If a doc explicitly says "don't do this" or "warning: not supported", I have immediate justification to recommend not following that path when going into production. However, when docs are written that state that something is possible without a hint of "but this is bad and here's why", my hands are tied when talking to management. They need hard facts, with reasons, not to go into production with a specific design. The management here are fact driven and need to see reasons from the project owners / developers (not from me) why any given setup is not recommended.

I am personally not at all fond of doubling up services of any type when going into production, simply because either of the two processes could bring down the whole box and take down both instances. But that argument alone isn't strong enough to justify not doing it. This issue isn't limited to Solr / Java. This type of failure can happen with any application of any type. However, for management to take a design change seriously, I need technical ammo (in the form of docs that state major performance degradation) that I can take to them and say, "Hey, look at this. It says not to do this because " and then we come up with an alternative design.

I should also like to note that we have a current environment, although much smaller, which has been running two instances per box seemingly without issues for several years. For me to attempt to argue that it's not possible or is a bad idea goes against the experience we've already accrued when running two Solr JVMs side-by-side.

So, if there are legitimate benchmarks that, for example, show side-by-side instances degrade each other worse than running two Xen VMs on the same systems or when running alone, then I have a strong reason to suggest using an alternative design. Unfortunately, our past use of Solr already justifies the use of two instances in this new replacement system.

Thanks.

> Thanks,
> Shawn
SOLR_HOME vs solr.xml (solrcloud config)
Hi,

Another question regarding documentation of Solr and Zookeeper. The manual states:

    Solr Hostname

    Use the SOLR_HOST variable in the include file to set the hostname of the Solr server.

    SOLR_HOST=solr1.example.com

    Setting the hostname of the Solr server is recommended, especially when running in SolrCloud mode, as this determines the address of the node when it registers with ZooKeeper.

Yet, in the example solr.xml, the stanza defines ...

    <solrcloud>
      <str name="host">${host:}</str>
      <int name="hostPort">${jetty.port:8983}</int>
      <str name="hostContext">${hostContext:solr}</str>
      <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
      <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
      <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
      <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
    </solrcloud>

More specifically the ${host:} variable. When this variable is filled in from startup execution, Zookeeper seems to obtain the correct hostname in spite of SOLR_HOST not having been set. Yet, the docs recommend setting (or additionally setting?) SOLR_HOST with an explicit hostname.

If the stanza in solr.xml is there to specifically define the setup of solrcloud, is it still recommended to define SOLR_HOST separately, and what benefit does this provide over relying on ${host:} in solr.xml? This ${host:} variable at least works without making an explicit declaration of an FQDN. At the same time, ${host:} will auto populate if the hostname of the box changes. If you hardcode SOLR_HOST into a config, this won't dynamically update should the hostname of the box change (not that I'm going to run around changing hostnames, but I can see how explicitly defining SOLR_HOST could become a problem when someone doesn't know that it is defined).

What is the best practice here?

Thanks.
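On the hard-coding concern above: the include file is a shell script that bin/solr reads at startup, so SOLR_HOST does not have to be a fixed string. The following is a sketch under that assumption, and under the further assumption that bin/solr hands SOLR_HOST to Solr as the `host` system property that ${host:} reads. `resolve_solr_host` and `SOLR_HOST_OVERRIDE` are hypothetical names introduced only for this illustration.

```shell
#!/bin/sh
# Sketch only, intended for the include file (for example solr.in.sh).
# resolve_solr_host and SOLR_HOST_OVERRIDE are hypothetical names.

# resolve_solr_host [EXPLICIT_NAME]
# Print the explicit name when one is given; otherwise ask the OS for the
# fully qualified name, falling back to the plain node name.
resolve_solr_host() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"
    else
        hostname -f 2>/dev/null || uname -n
    fi
}

# Evaluated each time Solr starts, so a renamed box is picked up on the
# next restart without editing this file.
SOLR_HOST=$(resolve_solr_host "${SOLR_HOST_OVERRIDE:-}")
```

This keeps the explicit setting that the manual recommends while still following the machine's current name. Setting `SOLR_HOST_OVERRIDE=solr1.example.com` in the environment would pin the name for a host where the OS-reported name is not the one other nodes should use.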
Re: SOLR_HOST vs solr.xml (solrcloud config)
Correction... Too fast on the send button. The subject should have been SOLR_HOST, not SOLR_HOME. Sorry for any confusion. Though, the body is correct.