Atomic updates and POST command?
Hi... I'm trying to get atomic updates working and am seeing some strangeness. Here's my JSON with the data to update ..

[{"id":"/unique/path/id", "field1":{"set":"newvalue1"}, "field2":{"set":"newvalue2"} }]

If I use the REST API via curl, it works fine. With the following command, the field1 and field2 fields get the new values, and all's well.

curl 'http://localhost:8983/solr/core01/update/json?commit=true' --data-binary @test1.json -H 'Content-type:application/json'

BUT, if I use the post command ..

./bin/post -c core01 /home/xtech/solrtest/test1.json

.. the record gets updated with new fields named "field1.set" and "field2.set", and the managed-schema file is modified to include these new field definitions. Not at all what I'd expect or want.

Is there some setting or switch that will let the post command work "properly", or am I misunderstanding what's correct? I can use curl, but our current workflow uses the post command, so I thought that might do the job.

Any thoughts are welcome!

Thanks,
...scott
BUMP: Atomic updates and POST command?
Just bumping this post from a few days ago. Is anyone using atomic updates? If so, how are you passing the updates to Solr? I'm seeing a significant difference between the REST API and the post command .. is this to be expected? What's the recommended method for doing the update?

Thanks!
...scott
Re: BUMP: Atomic updates and POST command?
Hmm. That makes sense .. but where do you provide the endpoint to post? Is that additional commands within the JSON, or a parameter at the command line?

Thanks,
...scott

On 8/31/18 4:48 PM, Alexandre Rafalovitch wrote:
I think you are using different endpoints there: /update by default vs. /update/json. So I think the post gets treated as generic JSON parsing. Can you try the same endpoint?

Regards,
Alex
Re: BUMP: Atomic updates and POST command?
Ah .. is this done with the -url parameter? As in ..

./bin/post -url http://localhost:8983/solr/core01/update/json /home/xtech/solrtest/test1.json

Will test.

Thanks,
...scott
Re: BUMP: Atomic updates and POST command?
Nope. That's not it. It complains about this path not being found ..

/solr/core01/update/json/json/docs

So, I changed the -url value to "http://localhost:8983/solr/core01/update" .. which was "successful", but created the same odd index structure of "field.set". I'm clearly flailing. If you have any thoughts on this, do let me know.

Thanks!
...scott
Re: BUMP: Atomic updates and POST command?
Yup. That does the trick! Here's my command line ..

$ ./bin/post -c core01 -format solr /home/xtech/solrtest/test1b.json

I saw that "-format solr" option, but it wasn't clear what it did. It's still not clear to me how that changes the endpoint to allow for updates. But it's nice to see that it works!

Thanks for your help!
...scott

On 8/31/18 6:04 PM, Alexandre Rafalovitch wrote:
Ok. Try "-format solr" instead of "-url ...".

Regards,
Alex
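For comparison, the "-format solr" run above should be roughly equivalent to posting the same Solr-format JSON straight to the /update handler with curl. A minimal sketch, assuming the same test1b.json used above:

curl 'http://localhost:8983/solr/core01/update?commit=true' --data-binary @test1b.json -H 'Content-type:application/json'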
Re: BUMP: Atomic updates and POST command?
Thanks, Shawn. That helps with the meaning of the "solr" format. Our needs are pretty basic. We have some upstream processes that crawl the data and generate a JSON feed that works with the default post command. So far this works well and keeps things simple.

Thanks!
...scott

On 9/1/18 9:26 PM, Shawn Heisey wrote:
I think the assumption with JSON-style updates and the post tool is that you are sending "generic" JSON documents, not Solr-formatted JSON commands. So the post tool sends to the /update/json/docs handler, which can handle those easily. I believe that telling it that the format is "solr" means that the JSON input is instructions to Solr, not just document content. It very likely sends it to /update/json or /update when that's the case.

I don't know if you know this, but the bin/post command calls something in Solr that is named SimplePostTool. It is, as that name suggests, a very simple tool. Although you CAN use it in production, a large percentage of users find that they outgrow its capabilities and must move to writing their own indexing system.

Thanks,
Shawn
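To make the distinction above concrete, here is a sketch of the two JSON shapes in play, based on the examples in this thread (field names are illustrative only).

Generic document, as /update/json/docs expects (nested keys end up flattened into new fields such as "field1.set"):

  {"id":"/unique/path/id", "field1":"value1", "field2":"value2"}

Solr-format command, as /update or /update/json expects ("set" is interpreted as an atomic-update operation):

  [{"id":"/unique/path/id", "field1":{"set":"newvalue1"}, "field2":{"set":"newvalue2"}}]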
SolrCloud installation troubles...
Using Solr 7.2.0 and Zookeeper 3.4.11.

In an effort to move to a more robust Solr environment, I'm setting up a prototype system of 3 Solr servers and 3 Zookeeper servers. For now, this is all on one machine, but it will eventually be 3 machines. This works fine on a Ubuntu 5.4.0-6 VM on my local system, but when I do the same setup on the company's network machine (a Red Hat 4.8.5-16 VM), I'm unable to create a collection.

To keep things simple, I'm not using our custom schema yet, but just creating a collection through the Solr Admin UI using Collections > Add Collection, with the "_default" config set. On the Ubuntu system, I can create various collections .. 1 shard w/ 1 replication .. 2 shards w/ 3 replications .. 3 shards w/ 4 replications .. all seem alive and well.

But when I do the same thing on the Red Hat system, it fails. Through the UI, it'll first time out with this message ..

Connection to Solr lost

Then after a refresh, the collection appears to have been partially created, but it's in the "Gone" state, and after some time it is deleted by an apparent cleanup process. If I try to create one through the command line ..

./bin/solr create -c test99 -n _default -s 2 -rf 2

.. I get this response ..

ERROR: Failed to create collection 'test99' due to: {10.6.208.31:8984_solr=org.apache.solr.client.solrj.SolrServerException:IOException occured when talking to server at: http://10.6.208.31:8984/solr, 10.6.208.31:8985_solr=org.apache.solr.client.solrj.SolrServerException:IOException occured when talking to server at: http://10.6.208.31:8985/solr, 10.6.208.31:8983_solr=org.apache.solr.client.solrj.SolrServerException:IOException occured when talking to server at: http://10.6.208.31:8983/solr}

I've seen other reports of errors like this, but no solutions that seem to apply to my situation. Any thoughts?

Thanks!
...scott
Re: SolrCloud installation troubles...
On 1/29/18 12:44 PM, Shawn Heisey wrote:
This sounds like either network connectivity problems or possibly issues caused by extreme garbage collection pauses that result in timeouts.

Thanks,
Shawn

Thanks, Shawn. I was wondering if there was something going on with IP redirection that was causing confusion. Any thoughts on how to debug? And what do you mean by "extreme garbage collection pauses"? Is that Solr garbage collection or the OS itself? There's really nothing happening on this machine; it's purely for testing, so there shouldn't be any extra load from other processes.

Thanks!
...scott
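For the "how to debug" part, one low-tech connectivity check (a sketch using the host and ports from the error message above) is to hit each node's system-info endpoint from the box itself and confirm each one responds:

curl 'http://10.6.208.31:8983/solr/admin/info/system?wt=json'
curl 'http://10.6.208.31:8984/solr/admin/info/system?wt=json'
curl 'http://10.6.208.31:8985/solr/admin/info/system?wt=json'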
Re: SolrCloud installation troubles...
Interesting. I am using "localhost" in the config files (using the IP caused things to break even worse). But perhaps I should check with IT to make sure the ports are all open.

Thanks,
...scott

On 1/29/18 12:57 PM, Davis, Daniel (NIH/NLM) [C] wrote:
To expand on that answer, you have to wonder what ports are open in the server system's port-based firewall. I have to ask my systems team to open ports for everything I'm using, especially when I move from localhost to outside. You should be able to "fake it out" if you set up your zookeeper configuration to use localhost ports.
Re: SolrCloud installation troubles...
Looks like 2888 and 2890 are not open. At least they are not reported with a netstat -plunt .. could be the problem.

Thanks, all!
...scott

On 1/29/18 1:10 PM, Davis, Daniel (NIH/NLM) [C] wrote:
Trying 127.0.0.1 could help. We kind of tend to think localhost is always 127.0.0.1, but I've seen localhost start to resolve to ::1, the IPv6 equivalent of 127.0.0.1. I guess some environments can be strict enough to restrict communication on localhost; it seems hard to imagine, but it does happen.
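For reference, a quick way to check the ZooKeeper ports mentioned above, and to open them if firewalld is in use (an assumption; the thread doesn't say how this Red Hat box is firewalled, and the exact ports depend on your zoo.cfg):

# list listening ports, filtering for the ZooKeeper quorum/election ports used in this setup
netstat -plunt | grep -E '2888|2890'

# open them in firewalld, then reload the rules
sudo firewall-cmd --permanent --add-port=2888/tcp --add-port=2890/tcp
sudo firewall-cmd --reload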
Re: SolrCloud installation troubles...
On 1/29/18 1:31 PM, Shawn Heisey wrote:
Garbage collection is one of the primary features of Java's memory management. It's not Solr or the OS. If the Java heap is really enormous, you can end up with long pauses, but I wouldn't expect them to be frequent unless the index is also really huge. A very common issue that can cause even worse pause issues than a large heap is a heap that's too small, but not quite small enough to cause Java to completely run out of heap memory. The default max heap size in recent Solr versions is 512MB, which is very small. A Java program (which Solr is) can never use more heap memory than the maximum it is configured with, even if the machine has more memory available.

This paragraph is included because you mentioned IP redirection: Extreme care must be used when setting up SolrCloud on virtual machines where accessing the VM has to go through any kind of IP translation. SolrCloud keeps track of how to reach each server in the cloud, and if it stores an untranslated address when you need the translated address (or vice versa), things are not going to work. Generally speaking, translated addresses are going to be problematic for SolrCloud and should not be used.

Thanks,
Shawn

Thanks for the clarification. Yes, we're just using the default heap size for Solr, but there's no index (yet) and nothing really going on, so I'd hope that garbage collection isn't the problem. I'm putting my money on some IP translation issues (this is on a tightly controlled corporate network) or the fact that the 2888 and 2890 ports appear to not be open. I'll dig down the network issue path for now and see where that gets me.

Thanks,
...scott
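If a larger heap is ever needed later, the usual place to set it is SOLR_HEAP in solr.in.sh. A minimal sketch (the "2g" value is only an example, not a recommendation from this thread, and the file location depends on how Solr was installed):

# in /etc/default/solr.in.sh (service install) or bin/solr.in.sh (plain extract)
SOLR_HEAP="2g"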
Shard replica labels in Solr Admin graph?
We initially tested our Solr Cloud implementation on a single VM with 3 Solr servers and 3 Zookeeper servers. Once that seemed good, we moved to 3 VMs with 1 Solr/Zookeeper on each. That's all looking good, but in the Solr Admin > Cloud > Graph, all of my shard replicas are on "127.0.1.1" .. with the single-VM setup, it listed the port number, so you could tell which "server" each replica was on.

Is there some way to get the shard replicas to list with the actual IPs of the server that the replica is on, rather than 127.0.1.1?

Thanks!
...scott
Re: Shard replica labels in Solr Admin graph?
Thanks Shawn! I made the adjustment to /etc/hosts, and now all's well. This also fixed an underlying problem that I hadn't noticed at the time I sent my query .. that only one Solr server was actually running. It turns out that Zookeeper saw them all as 127.0.1.1 and didn't let the other instances fully start up. These were brand new, fresh Ubuntu installs. Strange that the /etc/hosts isn't set up to handle this.

Cheers,
...scott

On 2/28/18 8:48 PM, Shawn Heisey wrote:
That is not going to work if those are separate machines. There are two ways to fix this.

One is to figure out why Java is choosing a loopback address when it attempts to detect the machine's hostname. I'm almost certain that /etc/hosts is set up incorrectly. In my opinion, a typical /etc/hosts file should have two IPv4 lines: one defining localhost as 127.0.0.1, and another defining the machine's actual IP address as both the fully qualified domain name and the short hostname. An example:

127.0.0.1 localhost
192.168.1.200 smeagol.REDACTED.com smeagol

The machine's hostname should not be found on any line that does not have a real IP address on it.

The other way to solve the problem is to specify the "host" system property to override Java's detection of the machine address/hostname. You can either add a commandline option to set the property, or add it to solr.xml. Note that if your solr.xml file is in zookeeper, then you can't use solr.xml. This is because with solr.xml in zookeeper, every machine would have the same host definition, and that won't work.

https://lucene.apache.org/solr/guide/6_6/format-of-solr-xml.html#the-code-solrcloud-code-element

Thanks,
Shawn
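For anyone following along, a minimal sketch of the second option Shawn mentions, setting the host each node registers in ZooKeeper (the IP here is illustrative):

# per-node, in solr.in.sh
SOLR_HOST="192.168.1.200"

The same value can also be passed on the command line with the -h option of bin/solr start.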
Scoping SolrCloud setup
We're in the process of moving from 12 single-core collections (non-cloud Solr) on 3 VMs to a SolrCloud setup. Our collections aren't huge, ranging in size from 50K to 150K documents, with one at 1.2M docs. Our max query frequency is rather low .. probably no more than 10-20/min. We do update frequently, maybe 10-100 documents every 10 mins.

Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've got each collection split into 2 shards with 3 replicas (one per VM). Also, Zookeeper is running on each VM. I understand that it's best to have each ZK server on a separate machine, but I'm hoping this will work for now.

This all seemed like a good place to start, but after reading lots of articles and posts, I'm thinking that maybe our smaller collections (under 100K docs) should just be one shard each, and maybe the 1.2M collection should be more like 6 shards. How do you decide how many shards is right?

Also, our current live system is separated into dev/stage/prod tiers; now, all of these tiers are together on each of the cloud VMs. This bothers some people, who think it may make our production environment less stable. I know that in an ideal world we'd have them all on separate systems, but with the replication, it seems like we're going to make the overall system more stable. Is this a correct understanding?

I'm just wondering if anyone has opinions on whether we're going in a reasonable direction or not. Are there any articles that discuss these initial sizing/scoping issues?

Thanks!
...scott
Re: Scoping SolrCloud setup
Greg... Thanks. That's very helpful, and is in line with what I've been seeing. So, to be clear, you're saying that the size of all collections on a server should be less than the available RAM. It looks like we've got about 13GB of documents in all (and growing), so if we're restricted to 16GB on each VM, I'm thinking that it probably makes sense to split the collections over multiple VMs rather than having them all on one. Perhaps instead of all indexes replicated on 3 VMs, we should split things up over 4 VMs and go down to just 2 replicas. We can add 2 more VMs to go up to 3 replicas if that seems necessary at some point.

Thanks,
...scott

On 3/13/18 6:15 PM, Greg Roodt wrote:
A single shard is much simpler conceptually and also cheaper to query. I would say that even your 1.2M collection can be a single shard. I'm running a single-shard setup 4X that size. You can still have replicas of this shard for redundancy / availability purposes. I'm not an expert, but I think one of the deciding factors is whether your index can fit into RAM (not JVM heap, but OS cache). What are the sizes of your indexes?
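As a quick way to answer Greg's question about index sizes, something like the following might help (a sketch; the data directory path is an assumption and depends on where SOLR_HOME points on your installs):

# total on-disk size of each core's Lucene index under the Solr data directory
du -sh /var/solr/data/*/data/index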
Re: Scoping SolrCloud setup
Emir... Thanks for the input. Our larger collections are localized content, so it may make sense to shard those so we can target the specific index. I'll need to confirm how it's being used .. whether queries are always within a language or if they are cross-language. Thanks also for the link .. very helpful!

All the best,
...scott

On 3/14/18 2:21 AM, Emir Arnautović wrote:
Hi Scott,
There is no definite answer - it depends on your documents and query patterns. Sharding does come with an overhead, but it also allows Solr to parallelise search. Query latency is usually what tells you if you need to split a collection into multiple shards or not. If you are ok with the latency, there is no need to split. Another scenario where shards make sense is when routing is used in the majority of queries, since that enables you to query only a subset of documents. There is also an indexing aspect where sharding helps - when high indexing throughput is needed, having multiple shards will spread the indexing load across multiple servers. It seems to me that there is no high indexing throughput requirement here, so the main criterion should be query latency. Here is another blog post talking about this subject: http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
Zookeeper service?
We might be going at this wrong, but we've got Solr set up as a service, so if the machine goes down it'll restart. But without Zookeeper running as a service, that's not much help.

I found the zookeeperd install, which in theory seems like it should do the trick, but that installs a new instance of ZK and isn't using our zoo.cfg. I guess we can make the config adjustments to the new ZK installation and use that, but I was hoping to use our existing ZK installation. Or maybe I can hack the zookeeper startup script to get it to run as a service?

There's the SOLR_WAIT_FOR_ZK config parameter, so Solr will wait for ZK to fire up .. it seems like it's partially there. This all seems like too many hoops to jump through to be the "right" way to go. I assume that others have run into a similar situation? Thoughts?

Thanks!
...scott
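For what it's worth, a minimal sketch of setting the SOLR_WAIT_FOR_ZK parameter mentioned above, assuming solr.in.sh is where your startup config lives (the 30-second value is just an example):

# in solr.in.sh: seconds for Solr to wait for ZooKeeper to become available at startup
SOLR_WAIT_FOR_ZK="30"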
Re: Scoping SolrCloud setup
Erick... Thanks. Yes, I think we were just going shard-happy without really understanding the purpose. I think we'll start by keeping things simple .. no shards, fewer replicas, maybe a bit more RAM. Then we can assess the performance and make adjustments as needed. And yes, that's the main reason for moving from our current non-cloud Solr setup to SolrCloud .. future flexibility as well as greater stability.

Thanks!
...scott

On 3/14/18 11:34 AM, Erick Erickson wrote:
Scott:

Eventually you'll hit the limit of your hardware, regardless of VMs. I've seen multiple VMs help a lot when you have really beefy hardware, as in 32 cores, 128G memory and the like. Otherwise it's iffier.

re: sharding or not. As others wrote, sharding is only useful when a single collection grows past the limits of your hardware. Until that point, it's usually a better bet to get better hardware than to shard. I've seen 300M docs fit in a single shard. I've also seen 10M strain pretty beefy hardware .. but from what you've said, multiple shards are really not something you need to worry about.

About balancing all this across VMs and/or machines: you have a _lot_ of room to balance things. Let's say you put all your collections on one physical machine to start (not recommending, just sayin'). Six months from now you need to move collections 1-10 to another machine due to growth. You:

1> spin up a new machine
2> build out collections 1-10 on that machine by using the Collections API ADDREPLICA
3> once the new replicas are healthy, DELETEREPLICA on the old hardware

No down time. No configuration to deal with; SolrCloud will take care of it for you.

Best,
Erick
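A rough sketch of the Collections API calls Erick describes (the host, collection, shard, and replica names here are placeholders, not values from this setup):

# step 2: add a replica of shard1 on the new node
curl 'http://newhost:8983/solr/admin/collections?action=ADDREPLICA&collection=collection01&shard=shard1&node=newhost:8983_solr'

# step 3: once the new replica is healthy, remove the old one
curl 'http://newhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=collection01&shard=shard1&replica=core_node1'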
Re: Zookeeper service?
Yeah .. I knew it was a different Apache project, but figured that since it was so tightly integrated with SolrCloud, others may have run into this issue.

I did some poking around and have (for now) ended up with this .. implemented it as a service through the "systemd.unit" configuration. Created the following unit file here .. /etc/systemd/system/zookeeper.service

-----
[Unit]
Description=Zookeeper Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/apps/local_data/apps/solr/zk_installation/zookeeper/bin/zkServer.sh start /apps/local_data/apps/solr/zk_distribution/zoo.cfg
ExecStop=/apps/local_data/apps/solr/zk_installation/zookeeper/bin/zkServer.sh stop /apps/local_data/apps/solr/zk_distribution/zoo.cfg
TimeoutSec=30
Restart=on-failure

[Install]
WantedBy=multi-user.target
-----

Rebooted the server and it seems to work. Your implementation sounds reasonable as well. I may post a query to the ZK list to see what other options are out there. The zookeeperd install seemed like it was going to require more hacking than I wanted to do, since our config was already set up and working.

Thanks Shawn!
...scott

On 3/14/18 5:17 PM, Shawn Heisey wrote:
You're probably going to be very unhappy to be told this ... but ZooKeeper is a completely separate Apache project. This mailing list handles Solr. While SolrCloud does require ZK, setting it up is outside the scope of this mailing list.

I can tell you what I did to get it running as a service on CentOS 6. It works, but it's not very robust, and if you ask the zookeeper user mailing list, they may have better options. I do strongly recommend that you ask that mailing list.

I extracted the .tar.gz file to the "/opt" folder. Then I renamed the zookeeper-X.Y.Z directory to something else. I used "mbzoo" ... which only makes sense if you're familiar with our locally developed software. Next I created a very small shell script, and saved it as /usr/local/sbin/zkrun (script is below between the lines of equal signs):

=====
#!/bin/sh
# chkconfig: - 75 50
# description: Starts and stops ZK
cd /opt/mbzoo
bin/zkServer.sh $1
=====

I made that script executable and created a symlink for init.d:

chmod +x /usr/local/sbin/zkrun
ln -s /usr/local/sbin/zkrun /etc/init.d/zookeeper

Then all I had to do was activate the init script:

chkconfig --add zookeeper
chkconfig zookeeper on

Once that's done, a "service zookeeper start" command should work. On debian/ubuntu/mint and similar distros, you'd probably use update-rc.d instead of chkconfig, with different options. If you're on an OS other than Linux, everything I've described might need changes. If you're on Windows, chances are that you'll end up using a program named NSSM. If you can get your company to accept using it once they find out the FULL program name.

Thanks,
Shawn
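As a usage note for the unit file above, the service still needs to be enabled so systemd starts it at boot; the standard commands are:

sudo systemctl daemon-reload
sudo systemctl enable zookeeper
sudo systemctl start zookeeper
sudo systemctl status zookeeper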
Re: Scoping SolrCloud setup
Walter... Thanks for the additional data points. Clearly we're a long way from needing anything too complex.

Cheers!
...scott

On 3/14/18 1:12 PM, Walter Underwood wrote:
That would be my recommendation for a first setup. One Solr instance per host, one shard per collection.

We run 5 million document cores with 8 GB of heap for the JVM. We size the RAM so that all the indexes fit in OS filesystem buffers.

Our big cluster is 32 hosts, 21 million documents in four shards. Each host is a 36 processor Amazon instance. Each host has one 8 GB Solr process (Solr 6.6.2, Java 8u121, G1 collector). No faceting, but we get very long queries; average length is 25 terms.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)