Atomic updates and POST command?

2018-08-29 Thread Scott Prentice

Hi...

I'm trying to get atomic updates working and am seeing some strangeness. 
Here's my JSON with the data to update ..


[{"id":"/unique/path/id",
  "field1":{"set","newvalue1"},
  "field2":{"set","newvalue2"}
}]

If I use the REST API via curl it works fine. With the following 
command, the field1 and field2 fields get the new values, and all's well.


curl 'http://localhost:8983/solr/core01/update/json?commit=true' 
--data-binary @test1.json -H 'Content-type:application/json'


BUT, if I use the post command ..

./bin/post -c core01 /home/xtech/solrtest/test1.json

.. the record gets updated with new fields named "field1.set" and 
"field2.set", and the managed-schema file is modified to include these 
new field definitions. Not at all what I'd expect or want. Is there some 
setting or switch that will let the post command work "properly", or am 
I misunderstanding what's correct? I can use curl, but our current 
workflow uses the post command so I thought that might do the job.


Any thoughts are welcome!

Thanks,
...scott


BUMP: Atomic updates and POST command?

2018-08-31 Thread Scott Prentice

Just bumping this post from a few days ago.

Is anyone using atomic updates? If so, how are you passing the updates 
to Solr? I'm seeing a significant difference between the REST API and 
the post command .. is this to be expected? What's the recommended 
method for doing the update?


Thanks!
...scott


On 8/29/18 3:02 PM, Scott Prentice wrote:

Hi...

I'm trying to get atomic updates working and am seeing some 
strangeness. Here's my JSON with the data to update ..


[{"id":"/unique/path/id",
  "field1":{"set","newvalue1"},
  "field2":{"set","newvalue2"}
}]

If I use the REST API via curl it works fine. With the following 
command, the field1 and field2 fields get the new values, and all's well.


curl 'http://localhost:8983/solr/core01/update/json?commit=true' 
--data-binary @test1.json -H 'Content-type:application/json'


BUT, if I use the post command ..

./bin/post -c core01 /home/xtech/solrtest/test1.json

.. the record gets updated with new fields named "field1.set" and 
"field2.set", and the managed-schema file is modified to include these 
new field definitions. Not at all what I'd expect or want. Is there 
some setting or switch that will let the post command work "properly", 
or am I misunderstanding what's correct? I can use curl, but our 
current workflow uses the post command so I thought that might do the 
job.


Any thoughts are welcome!

Thanks,
...scott


Re: BUMP: Atomic updates and POST command?

2018-08-31 Thread Scott Prentice
Hmm. That makes sense .. but where do you provide the endpoint to post? 
Is that additional commands within the JSON or a parameter at the 
command line?


Thanks,
...scott


On 8/31/18 4:48 PM, Alexandre Rafalovitch wrote:

I think you are using different endpoints there: /update by default vs
/update/json.

So I think the post gets treated as generic JSON parsing.

Can you try the same endpoint?

Regards,
  Alex
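
For reference, the two handlers in question can be exercised directly with 
curl. A minimal sketch reusing the core and test file from this thread (the 
mapping of behavior to endpoint is an assumption based on Alex's point):

    # Solr-command handler: atomic-update syntax is interpreted
    curl 'http://localhost:8983/solr/core01/update?commit=true' \
      --data-binary @test1.json -H 'Content-type:application/json'

    # generic-document handler: the same payload is flattened into
    # literal fields like "field1.set"
    curl 'http://localhost:8983/solr/core01/update/json/docs?commit=true' \
      --data-binary @test1.json -H 'Content-type:application/json'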


Re: BUMP: Atomic updates and POST command?

2018-08-31 Thread Scott Prentice

Ah .. is this done with the -url parameter? As in ..

./bin/post -url http://localhost:8983/solr/core01/update/json 
/home/xtech/solrtest/test1.json


Will test.

Thanks,
...scott


On 8/31/18 5:15 PM, Scott Prentice wrote:
Hmm. That makes sense .. but where do you provide the endpoint to 
post? Is that additional commands within the JSON or a parameter at 
the command line?


Thanks,
...scott




Re: BUMP: Atomic updates and POST command?

2018-08-31 Thread Scott Prentice

Nope. That's not it. It complains about this path not being found ..

    /solr/core01/update/json/json/docs

So, I changed the -url value to this 
"http://localhost:8983/solr/core01/update"; .. which was "successful", 
but created the same odd index structure of "field.set".


I'm clearly flailing. If you have any thoughts on this, do let me know.

Thanks!
...scott


On 8/31/18 5:20 PM, Scott Prentice wrote:

Ah .. is this done with the -url parameter? As in ..

./bin/post -url http://localhost:8983/solr/core01/update/json 
/home/xtech/solrtest/test1.json


Will test.

Thanks,
...scott


Re: BUMP: Atomic updates and POST command?

2018-08-31 Thread Scott Prentice

Yup. That does the trick! Here's my command line ..

    $ ./bin/post -c core01 -format solr /home/xtech/solrtest/test1b.json

I saw that "-format solr" option, but it wasn't clear what it did. It's 
still not clear to me how that changes the endpoint to allow for 
updates. But nice to see that it works!


Thanks for your help!
...scott


On 8/31/18 6:04 PM, Alexandre Rafalovitch wrote:

Ok,

Try "-format solr" instead of "-url ...".

Regards,
Alex.



Re: BUMP: Atomic updates and POST command?

2018-09-03 Thread Scott Prentice

Thanks, Shawn. That helps with the meaning of the "solr" format.

Our needs are pretty basic. We have some upstream processes that crawl 
the data and generate a JSON feed that works with the default post 
command. So far this works well and keeps things simple.


Thanks!
...scott


On 9/1/18 9:26 PM, Shawn Heisey wrote:

I think the assumption with JSON-style updates and the post tool is 
that you are sending "generic" json documents, not Solr-formatted json 
commands.  So the post tool sends to the /update/json/docs handler, 
which can handle those easily.  I believe that telling it that the 
format is "solr" means that the JSON input is instructions to Solr, 
not just document content.  It very likely sends it to /update/json or 
/update when that's the case.
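
A minimal illustration of the two behaviors, using the commands from this 
thread (the endpoint mapping is my reading of the above, not verified 
against the post tool's source):

    # default for .json files: sent to /update/json/docs and indexed as
    # generic documents, so {"set":...} maps become literal "field1.set" fields
    ./bin/post -c core01 /home/xtech/solrtest/test1.json

    # -format solr: sent as Solr update commands, so atomic-update
    # syntax like {"field1":{"set":"newvalue1"}} is interpreted
    ./bin/post -c core01 -format solr /home/xtech/solrtest/test1.json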


I don't know if you know this, but the bin/post command calls 
something in Solr that is named SimplePostTool.  It is, as that name 
suggests, a very simple tool.  Although you CAN use it in production, 
a large percentage of users find that they outgrow its capabilities 
and must move to writing their own indexing system.


Thanks,
Shawn


SolrCloud installation troubles...

2018-01-29 Thread Scott Prentice

Using Solr 7.2.0 and Zookeeper 3.4.11

In an effort to move to a more robust Solr environment, I'm setting up a 
prototype system of 3 Solr servers and 3 Zookeeper servers. For now, 
this is all on one machine, but will eventually be 3 machines.


This works fine on an Ubuntu 5.4.0-6 VM on my local system, but when I do 
the same setup on the company's network machine (a Red Hat 4.8.5-16 VM), 
I'm unable to create a collection. To keep things simple, I'm not using 
our custom schema yet, but just creating a collection through the Solr 
Admin UI using Collections > Add Collection, using the "_default" config 
set. On the Ubuntu system, I can create various collections .. 1 shard 
w/ 1 replication .. 2 shards w/ 3 replications .. 3 shards w/ 4 
replications .. all seem alive and well.


But when I do the same thing on the Red Hat system it fails. Through the 
UI, it'll first time out with this message ..


    Connection to Solr lost

Then after a refresh, the collection appears to have been partially 
created, but it's in the "Gone" state, and after some time, is deleted 
by an apparent cleanup process. If I try to create one through the 
command line ..


    ./bin/solr create -c test99 -n _default -s 2 -rf 2

I get this response ..

ERROR: Failed to create collection 'test99' due to: 
{10.6.208.31:8984_solr=org.apache.solr.client.solrj.SolrServerException:IOException 
occured when talking to server at: http://10.6.208.31:8984/solr, 
10.6.208.31:8985_solr=org.apache.solr.client.solrj.SolrServerException:IOException 
occured when talking to server at: http://10.6.208.31:8985/solr, 
10.6.208.31:8983_solr=org.apache.solr.client.solrj.SolrServerException:IOException 
occured when talking to server at: http://10.6.208.31:8983/solr}


I've seen other reports of errors like this but no solutions that seem 
to apply to my situation. Any thoughts?


Thanks!
...scott




Re: SolrCloud installation troubles...

2018-01-29 Thread Scott Prentice


On 1/29/18 12:44 PM, Shawn Heisey wrote:

This sounds like either network connectivity problems or possibly 
issues caused by extreme garbage collection pauses that result in 
timeouts.


Thanks,
Shawn

Thanks, Shawn. I was wondering if there was something going on with IP 
redirection that was causing confusion. Any thoughts on how to debug? 
And what do you mean by "extreme garbage collection pauses"? Is that 
Solr garbage collection or the OS itself? There's really nothing 
happening on this machine; it's purely for testing, so there shouldn't be 
any extra load from other processes.


Thanks!
...scott


Re: SolrCloud installation troubles...

2018-01-29 Thread Scott Prentice
Interesting. I am using "localhost" in the config files (using the IP 
caused things to break even worse). But perhaps I should check with IT 
to make sure the ports are all open.


Thanks,
...scott


On 1/29/18 12:57 PM, Davis, Daniel (NIH/NLM) [C] wrote:

To expand on that answer, you have to wonder what ports are open in the server 
system's port-based firewall. I have to ask my systems team to open ports 
for everything I'm using, especially when I move from localhost to outside.

You should be able to "fake it out" if you set up your zookeeper configuration 
to use localhost ports.
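
A sketch of what the shared server list might look like in each node's 
zoo.cfg (hypothetical ports; each of the three nodes also needs its own 
clientPort and dataDir in its own config file):

    server.1=localhost:2888:3888
    server.2=localhost:2889:3889
    server.3=localhost:2890:3890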



Re: SolrCloud installation troubles...

2018-01-29 Thread Scott Prentice
Looks like 2888 and 2890 are not open. At least they are not reported 
with a netstat -plunt .. could be the problem.
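
For anyone checking the same thing, a couple of quick probes (assuming 
net-tools and netcat are installed, and the default ZooKeeper client 
port of 2181):

    # list listening ports, filtered to the ZK peer/election range
    netstat -plunt | grep -E '288[89]|2890'

    # ask a ZooKeeper node directly; it replies "imok" when healthy
    echo ruok | nc localhost 2181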


Thanks, all!

...scott


On 1/29/18 1:10 PM, Davis, Daniel (NIH/NLM) [C] wrote:

Trying 127.0.0.1 could help. We kind of tend to think localhost is always 
127.0.0.1, but I've seen localhost start to resolve to ::1, the IPv6 equivalent 
of 127.0.0.1.

I guess some environments can be strict enough to restrict communication on 
localhost; seems hard to imagine, but it does happen.



Re: SolrCloud installation troubles...

2018-01-29 Thread Scott Prentice


On 1/29/18 1:31 PM, Shawn Heisey wrote:


Garbage collection is one of the primary features of Java's memory 
management.  It's not Solr or the OS.


If the java heap is really enormous, you can end up with long pauses, 
but I wouldn't expect them to be frequent unless the index is also 
really huge.


A very common issue that can cause even worse pause issues than a 
large heap is a heap that's too small, but not quite small enough to 
cause Java to completely run out of heap memory.  The default max heap 
size in recent Solr versions is 512MB, which is very small.  A Java 
program (which Solr is) can never use more heap memory than the 
maximum it is configured with, even if the machine has more memory 
available.
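
For example, a larger heap can be set at startup or persistently (a sketch; 
the 2g value is an arbitrary placeholder, not a recommendation):

    # one-off: the -m flag sets both -Xms and -Xmx
    ./bin/solr start -m 2g

    # persistent: uncomment/set in bin/solr.in.sh
    SOLR_HEAP="2g"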


This paragraph is included because you mentioned IP redirection: 
Extreme care must be used when setting up SolrCloud on virtual 
machines where accessing the VM has to go through any kind of IP 
translation.  SolrCloud keeps track of how to reach each server in the 
cloud and if it stores an untranslated address when you need the 
translated address (or vice-versa), things are not going to work.  
Generally speaking translated addresses are going to be problematic 
for SolrCloud, and should not be used.


Thanks,
Shawn

Thanks for the clarification. Yes, we're just using the default heap 
size for Solr, but there's no index (yet) and nothing really going on, 
so I'd hope that garbage collection isn't the problem.


I'm putting my money on some IP translation issues (this is on a tightly 
controlled corporate network) or the fact that ports 2888 and 2890 
appear not to be open. I'll dig into the network issues for now and 
see where that gets me.


Thanks,
...scott




Shard replica labels in Solr Admin graph?

2018-02-28 Thread Scott Prentice
We initially tested our Solr Cloud implementation on a single VM with 3 
Solr servers and 3 Zookeeper servers. Once that seemed good, we moved to 
3 VMs with 1 Solr/Zookeeper on each. That's all looking good, but in the 
Solr Admin > Cloud > Graph, all of my shard replicas are on "127.0.1.1" 
.. with the single VM setup it listed the port number so you could tell 
which "server" it was on.


Is there some way to get the shard replicas to list with the actual IPs 
of the server that the replica is on, rather than 127.0.1.1?


Thanks!
...scott



Re: Shard replica labels in Solr Admin graph?

2018-03-02 Thread Scott Prentice

Thanks Shawn!

I made the adjustment to /etc/hosts, and now all's well. This also fixed 
an underlying problem that I hadn't noticed at the time I sent my query 
.. that only one Solr server was actually running. Turns out that 
Zookeeper saw them all as 127.0.1.1 and didn't let the other instances 
fully start up.


These were brand new, fresh, Ubuntu installs. Strange that the 
/etc/hosts isn't set up to handle this.


Cheers,
...scott


On 2/28/18 8:48 PM, Shawn Heisey wrote:

That is not going to work if those are separate machines.

There are two ways to fix this.

One is to figure out why Java is choosing a loopback address when it 
attempts to detect the machine's hostname.  I'm almost certain that 
/etc/hosts is set up incorrectly.  In my opinion, a typical /etc/hosts 
file should have two IPv4 lines, one defining localhost as 127.0.0.1, 
and another defining the machine's actual IP address as both the fully 
qualified domain name and the short hostname. An example:


127.0.0.1   localhost
192.168.1.200   smeagol.REDACTED.com    smeagol

The machine's hostname should not be found on any line that does not 
have a real IP address on it.


The other way to solve the problem is to specify the "host" system 
property to override Java's detection of the machine 
address/hostname.  You can either add a commandline option to set the 
property, or add it to solr.xml.  Note that if your solr.xml file is 
in zookeeper, then you can't use solr.xml.  This is because with 
solr.xml in zookeeper, every machine would have the same host 
definition, and that won't work.


https://lucene.apache.org/solr/guide/6_6/format-of-solr-xml.html#the-code-solrcloud-code-element 
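
For example, the override can go in bin/solr.in.sh on each node, or on the 
command line (the address below is a placeholder):

    # in bin/solr.in.sh, per node:
    SOLR_HOST="192.168.1.200"

    # or at startup:
    ./bin/solr start -c -h 192.168.1.200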



Thanks,
Shawn



Scoping SolrCloud setup

2018-03-13 Thread Scott Prentice
We're in the process of moving from 12 single-core collections 
(non-cloud Solr) on 3 VMs to a SolrCloud setup. Our collections aren't 
huge, ranging in size from 50K to 150K documents with one at 1.2M docs. 
Our max query frequency is rather low .. probably no more than 
10-20/min. We do update frequently, maybe 10-100 documents every 10 mins.


Our prototype setup is using 3 VMs (4 core, 16GB RAM each), and we've 
got each collection split into 2 shards with 3 replicas (one per VM). 
Also, Zookeeper is running on each VM. I understand that it's best to 
have each ZK server on a separate machine, but hoping this will work for 
now.


This all seemed like a good place to start, but after reading lots of 
articles and posts, I'm thinking that maybe our smaller collections 
(under 100K docs) should just be one shard each, and maybe the 1.2M 
collection should be more like 6 shards. How do you decide how many 
shards is right?


Also, our current live system is separated into dev/stage/prod tiers; 
now all of these tiers are together on each of the cloud VMs. This 
bothers some people, thinking that it may make our production 
environment less stable. I know that in an ideal world, we'd have them 
all on separate systems, but with the replication, it seems like we're 
going to make the overall system more stable. Is this a correct 
understanding?


I'm just wondering if anyone has opinions on whether we're going in a 
reasonable direction or not. Are there any articles that discuss these 
initial sizing/scoping issues?


Thanks!
...scott




Re: Scoping SolrCloud setup

2018-03-14 Thread Scott Prentice

Greg...

Thanks. That's very helpful, and is in line with what I've been seeing.

So, to be clear, you're saying that the total size of all collections on a 
server should be less than the available RAM. It looks like we've got 
about 13GB of documents in all (and growing), so if we're restricted to 
16GB on each VM, it probably makes sense to split the 
collections over multiple VMs rather than having them all on one. 
Perhaps instead of all indexes replicated on 3 VMs, we should split 
things up over 4 VMs and go down to just 2 replicas. We can add 2 more 
VMs to go up to 3 replicas if that seems necessary at some point.


Thanks,
...scott


On 3/13/18 6:15 PM, Greg Roodt wrote:

A single shard is much simpler conceptually and also cheaper to query. I
would say that even your 1.2M collection can be a single shard. I'm running
a single shard setup 4X that size. You can still have replicas of this
shard for redundancy / availability purposes.

I'm not an expert, but I think one of the deciding factors is if your index
can fit into RAM (not JVM Heap, but OS cache). What are the sizes of your
indexes?
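
One rough way to measure that, assuming a stock Solr home layout (adjust 
the path to your install):

    # on-disk footprint of each core's index, as a proxy for RAM needed
    du -sh /var/solr/data/*/data/index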



Re: Scoping SolrCloud setup

2018-03-14 Thread Scott Prentice

Emir...

Thanks for the input. Our larger collections are localized content, so 
it may make sense to shard those so we can target the specific index. 
I'll need to confirm how it's being used, if queries are always within a 
language or if they are cross-language.


Thanks also for the link .. very helpful!

All the best,
...scott



On 3/14/18 2:21 AM, Emir Arnautović wrote:

Hi Scott,
There is no definite answer - it depends on your documents and query patterns. 
Sharding does come with an overhead, but it also allows Solr to parallelise 
search. Query latency is usually what tells you whether you need to split a 
collection into multiple shards or not. If you are OK with the latency, there 
is no need to split. The other scenario where shards make sense is when routing 
is used in the majority of queries, since that enables you to query only a 
subset of documents.
Also, there is an indexing aspect where sharding helps - when high indexing 
throughput is needed, having multiple shards will spread the indexing load 
across multiple servers.
It seems to me that you have no high-indexing-throughput requirement, so the 
main criterion should be query latency.
Here is another blog post talking about this subject: 
http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html
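
A sketch of the routing idea with compositeId keys (hypothetical collection 
and route key; the _route_ parameter restricts the query to the shards that 
hold those documents):

    # index time: prefix the id with a route key, e.g. "en!doc123"
    # query time: only the shards holding "en!" documents are consulted
    curl 'http://localhost:8983/solr/collection1/select?q=title:foo&_route_=en!'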

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Zookeeper service?

2018-03-14 Thread Scott Prentice
We might be going at this wrong, but we've got Solr set up as a service, 
so if the machine goes down it'll restart. But without Zookeeper running 
as a service, that's not much help. I found the zookeeperd install, 
which in theory seems like it should do the trick, but that installs a 
new instance of ZK and isn't using our zoo.cfg. I guess we can make the 
config adjustments to the new ZK installation and use that, but was 
hoping to use our existing ZK installation. Or maybe I can hack the 
zookeeper startup script to get it to run as a service?


There's the SOLR_WAIT_FOR_ZK config parameter, so Solr will wait for ZK 
to fire up .. seems like it's partially there.


This all seems like too many hoops to jump through to be the "right" way 
to go. I assume that others have run into a similar situation?


Thoughts?

Thanks!
...scott




Re: Scoping SolrCloud setup

2018-03-14 Thread Scott Prentice

Erick...

Thanks. Yes. I think we were just going shard-happy without really 
understanding the purpose. I think we'll start by keeping things simple 
.. no shards, fewer replicas, maybe a bit more RAM. Then we can assess 
the performance and make adjustments as needed.


Yes, that's the main reason for moving from our current non-cloud Solr 
setup to SolrCloud .. future flexibility as well as greater stability.


Thanks!
...scott


On 3/14/18 11:34 AM, Erick Erickson wrote:

Scott:

Eventually you'll hit the limit of your hardware, regardless of VMs.
I've seen multiple VMs help a lot when you have really beefy hardware,
as in 32 cores, 128G memory and the like. Otherwise it's iffier.

re: sharding or not. As others wrote, sharding is only useful when a
single collection grows past the limits of your hardware. Until that
point, it's usually a better bet to get better hardware than shard.
I've seen 300M docs fit in a single shard. I've also seen 10M strain
pretty beefy hardware.. but from what you've said multiple shards are
really not something you need to worry about.

About balancing all this across VMs and/or machines. You have a _lot_
of room to balance things. Let's say you put all your collections on
one physical machine to start (not recommending, just sayin'). 6
months from now you need to move collections 1-10 to another machine
due to growth. You:
1> spin up a new machine
2> build out collections 1-10 on those machines by using the
Collections API ADDREPLICA.
3> once the new replicas are healthy, DELETEREPLICA on the old hardware.

No down time. No configuration to deal with, SolrCloud will take care
of it for you.
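
A sketch of steps 2 and 3 with the Collections API over HTTP (the 
collection, shard, node, and replica names here are hypothetical):

    # add a replica of shard1 on the new node
    curl "http://oldhost:8983/solr/admin/collections?action=ADDREPLICA&collection=coll1&shard=shard1&node=newhost:8983_solr"

    # once the new replica is active, drop the one on the old hardware
    curl "http://oldhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=coll1&shard=shard1&replica=core_node3"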

Best,
Erick



Re: Zookeeper service?

2018-03-14 Thread Scott Prentice
Yeah .. I knew it was a different Apache project, but figured that since 
it was so tightly integrated with SolrCloud that others may have run 
into this issue.


I did some poking around and have (for now) ended up with this .. 
implemented it as a service through the "systemd.unit" configuration. 
Created the following unit file here .. 
/etc/systemd/system/zookeeper.service


-
[Unit]
Description=Zookeeper Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/apps/local_data/apps/solr/zk_installation/zookeeper/bin/zkServer.sh start /apps/local_data/apps/solr/zk_distribution/zoo.cfg
ExecStop=/apps/local_data/apps/solr/zk_installation/zookeeper/bin/zkServer.sh stop /apps/local_data/apps/solr/zk_distribution/zoo.cfg

TimeoutSec=30
Restart=on-failure

[Install]
WantedBy=multi-user.target
-
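
With the unit file in place, the usual systemd follow-up would be (assuming 
systemctl is available; enabling makes it start on boot):

    sudo systemctl daemon-reload
    sudo systemctl enable zookeeper
    sudo systemctl start zookeeper
    systemctl status zookeeper    # verify it's running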

Rebooted the server and it seems to work. Your implementation sounds 
reasonable as well. I may post a query to the ZK list to see what other 
options are out there. The zookeeperd install seemed like it was going 
to require more hacking than I wanted to do since our config was already 
set up and working.


Thanks Shawn!

...scott



On 3/14/18 5:17 PM, Shawn Heisey wrote:

On 3/14/2018 12:24 PM, Scott Prentice wrote:

We might be going at this wrong, but we've got Solr set up as a
service, so if the machine goes down it'll restart. But without
Zookeeper running as a service, that's not much help.

You're probably going to be very unhappy to be told this ... but
ZooKeeper is a completely separate Apache project.  This mailing list
handles Solr.  While SolrCloud does require ZK, setting it up is outside
the scope of this mailing list.

I can tell you what I did to get it running as a service on CentOS 6.
It works, but it's not very robust, and if you ask the zookeeper user
mailing list, they may have better options.  I do strongly recommend
that you ask that mailing list.

-

I extracted the .tar.gz file to the "/opt" folder.  Then I renamed the
zookeeper-X.Y.Z directory to something else.  I used "mbzoo" ... which
only makes sense if you're familiar with our locally developed software.

Next I created a very small shell script, and saved it as
/usr/local/sbin/zkrun (script is below between the lines of equal signs):

=
#!/bin/sh

# chkconfig: - 75 50
# description: Starts and stops ZK

cd /opt/mbzoo
bin/zkServer.sh $1
=

I made that script executable and created a symlink for init.d:

chmod +x /usr/local/sbin/zkrun
ln -s /usr/local/sbin/zkrun /etc/init.d/zookeeper

Then all I had to do was activate the init script:

chkconfig --add zookeeper
chkconfig zookeeper on

Once that's done, a "service zookeeper start" command should work.

On debian/ubuntu/mint and similar distros, you'd probably use
update-rc.d instead of chkconfig, with different options.  If you're on
an OS other than Linux, everything I've described might need changes.

If you're on Windows, chances are that you'll end up using a program
named NSSM.  If you can get your company to accept using it once they
find out the FULL program name.

Thanks,
Shawn


Re: Scoping SolrCloud setup

2018-03-14 Thread Scott Prentice

Walter...

Thanks for the additional data points. Clearly we're a long way from 
needing anything too complex.


Cheers!
...scott


On 3/14/18 1:12 PM, Walter Underwood wrote:

That would be my recommendation for a first setup. One Solr instance per host, 
one shard per collection. We run 5 million document cores with 8 GB of heap for 
the JVM. We size the RAM so that all the indexes fit in OS filesystem buffers.

Our big cluster is 32 hosts, 21 million documents in four shards. Each host is 
a 36 processor Amazon instance. Each host has one 8 GB Solr process (Solr 
6.6.2, java 8u121, G1 collector). No faceting, but we get very long queries, 
average length is 25 terms.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

