possible to dump "routing table" from a single Solr node?

2016-02-03 Thread Ian Rose
Hi all,

We're running into a situation where our SolrCloud cluster often gets into a bad
state where our solr nodes frequently respond with "no servers hosting
shard" even though the node that hosts that shard is clearly up.  We
suspect that this is a state bug where some servers are somehow ending up
with an incorrect view of the network (e.g. which nodes are up/down, which
shards are hosted on which nodes).  Is it possible to somehow get a "dump"
of the current "routing table" (i.e. documents with prefixes in this range
in this collection are stored in this shard on this node)?  That would help
immensely when debugging.
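
For reference, one way to get exactly that dump - every collection's shards, their
hash ranges, and which node hosts each replica - is the Collections API
CLUSTERSTATUS call (available since Solr 4.8).  A minimal sketch in Java, with the
host a placeholder:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ClusterStatusDump {
    public static void main(String[] args) throws Exception {
        // CLUSTERSTATUS lists every collection, its shards and hash ranges,
        // and the node/core/state of each replica, plus live_nodes.
        URL url = new URL("http://localhost:8983/solr/admin/collections"
                + "?action=CLUSTERSTATUS&wt=json");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // the raw JSON "routing table"
            }
        }
    }
}

The same information is also readable straight from ZooKeeper (clusterstate.json /
each collection's state.json), which is what a client-side router would watch.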

Thanks!
- Ian


TooManyBasicQueries?

2015-03-13 Thread Ian Rose
I sometimes see the following in my logs:

ERROR org.apache.solr.core.SolrCore  –
org.apache.lucene.queryparser.surround.query.TooManyBasicQueries: Exceeded
maximum of 1000 basic queries.


What does this mean?  Does this mean that we have issued a query with too
many terms?  Or that the number of concurrent queries running on the server
is too high?

Also, is this a builtin limit or something set in a config file?

Thanks!
- Ian


Re: TooManyBasicQueries?

2015-03-24 Thread Ian Rose
Hi Erik -

Sorry, I totally missed your reply.  To the best of my knowledge, we are
not using any surround queries (have to admit I had never heard of them
until now).  We use solr.SearchHandler for all of our queries.

Does that answer the question?

Cheers,
Ian


On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher 
wrote:

> It results from a surround query with too many terms.   Says the javadoc:
>
> * Exception thrown when {@link BasicQueryFactory} would exceed the limit
> * of query clauses.
>
> I’m curious, are you issuing a large {!surround} query or is it expanding
> to hit that limit?
>
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>
>
>
> > On Mar 13, 2015, at 9:44 AM, Ian Rose  wrote:
> >
> > I sometimes see the following in my logs:
> >
> > ERROR org.apache.solr.core.SolrCore  –
> > org.apache.lucene.queryparser.surround.query.TooManyBasicQueries:
> Exceeded
> > maximum of 1000 basic queries.
> >
> >
> > What does this mean?  Does this mean that we have issued a query with too
> > many terms?  Or that the number of concurrent queries running on the
> server
> > is too high?
> >
> > Also, is this a builtin limit or something set in a config file?
> >
> > Thanks!
> > - Ian
>
>


rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Hi all -

I'm sure this topic has been covered before but I was unable to find any
clear references online or in the mailing list.

Are there any rules of thumb for how many cores (aka shards, since I am
using SolrCloud) is "too many" for one machine?  I realize there is no one
answer (depends on size of the machine, etc.) so I'm just looking for a
rough idea.  Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
server without any problems.
* I have never heard of anyone successfully running X cores/shards on a
single machine, even if you throw a lot of hardware at it.

Thanks!
- Ian


Re: TooManyBasicQueries?

2015-03-24 Thread Ian Rose
Ah yes, right you are.  I had thought that `surround` required a different
endpoint, but I see now that someone is using a surround query.
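
For anyone landing on this thread later: the 1000-query cap is a default built into
the surround query parser itself, not a solrconfig.xml setting.  If the surround
query is legitimate, the limit can (as far as I recall) be raised per request with
the parser's maxBasicQueries local param, along the lines of

    q={!surround maxBasicQueries=10000}3w(solr*, cloud*)

where the value and the operands here are purely illustrative, not taken from this
thread; wildcard expansion is the usual way a small query ends up exceeding the
limit.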

Many thanks!

On Tue, Mar 24, 2015 at 10:02 AM, Erik Hatcher 
wrote:

> Somehow a surround query is being constructed along the way.  Search your
> logs for “surround” and see if someone is maybe sneaking a q={!surround}…
> in there.  If you’re passing input directly through from your application
> to Solr’s q parameter without any sanitizing or filtering, it’s possible a
> surround query parser could be asked for.
>
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com <http://www.lucidworks.com/>
>
>
>
>
> > On Mar 24, 2015, at 8:55 AM, Ian Rose  wrote:
> >
> > Hi Erik -
> >
> > Sorry, I totally missed your reply.  To the best of my knowledge, we are
> > not using any surround queries (have to admit I had never heard of them
> > until now).  We use solr.SearchHandler for all of our queries.
> >
> > Does that answer the question?
> >
> > Cheers,
> > Ian
> >
> >
> > On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher 
> > wrote:
> >
> >> It results from a surround query with too many terms.   Says the
> javadoc:
> >>
> >> * Exception thrown when {@link BasicQueryFactory} would exceed the limit
> >> * of query clauses.
> >>
> >> I’m curious, are you issuing a large {!surround} query or is it
> expanding
> >> to hit that limit?
> >>
> >>
> >> —
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com
> >>
> >>
> >>
> >>
> >>> On Mar 13, 2015, at 9:44 AM, Ian Rose  wrote:
> >>>
> >>> I sometimes see the following in my logs:
> >>>
> >>> ERROR org.apache.solr.core.SolrCore  –
> >>> org.apache.lucene.queryparser.surround.query.TooManyBasicQueries:
> >> Exceeded
> >>> maximum of 1000 basic queries.
> >>>
> >>>
> >>> What does this mean?  Does this mean that we have issued a query with
> too
> >>> many terms?  Or that the number of concurrent queries running on the
> >> server
> >>> is too high?
> >>>
> >>> Also, is this a builtin limit or something set in a config file?
> >>>
> >>> Thanks!
> >>> - Ian
> >>
> >>
>
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Let me give a bit of background.  Our Solr cluster is multi-tenant, where
we use one collection for each of our customers.  In many cases, these
customers are very tiny, so their collection consists of just a single
shard on a single Solr node.  In fact, a non-trivial number of them are
totally empty (e.g. trial customers that never did anything with their
trial account).  However there are also some customers that are larger,
requiring their collection to be sharded.  Our strategy is to try to keep
the total documents in any one shard under 20 million (honestly not sure
where my coworker got that number from - I am open to alternatives but I
realize this is heavily app-specific).

So my original question is not related to indexing or query traffic, but
just the sheer number of cores.  For example, if I have 10 active cores on
a machine and everything is working fine, should I expect that everything
will still work fine if I add 10 nearly-idle cores to that machine?  What
about 100?  1000?  I figure the overhead of each core is probably fairly
low but at some point starts to matter.

Does that make sense?
- Ian


On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky 
wrote:

> Shards per collection, or across all collections on the node?
>
> It will all depend on:
>
> 1. Your ingestion/indexing rate. High, medium or low?
> 2. Your query access pattern. Note that a typical query fans out to all
> shards, so having more shards than CPU cores means less parallelism.
> 3. How many collections you will have per node.
>
> In short, it depends on what you want to achieve, not some limit of Solr
> per se.
>
> Why are you even sharding the node anyway? Why not just run with a single
> shard per node, and do sharding by having separate nodes, to maximize
> parallel processing and availability?
>
> Also be careful to be clear about using the Solr term "shard" (a slice,
> across all replica nodes) as distinct from the Elasticsearch term "shard"
> (a single slice of an index for a single replica, analogous to a Solr
> "core".)
>
>
> -- Jack Krupansky
>
> On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose  wrote:
>
> > Hi all -
> >
> > I'm sure this topic has been covered before but I was unable to find any
> > clear references online or in the mailing list.
> >
> > Are there any rules of thumb for how many cores (aka shards, since I am
> > using SolrCloud) is "too many" for one machine?  I realize there is no
> one
> > answer (depends on size of the machine, etc.) so I'm just looking for a
> > rough idea.  Something like the following would be very useful:
> >
> > * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
> > server without any problems.
> > * I have never heard of anyone successfully running X cores/shards on a
> > single machine, even if you throw a lot of hardware at it.
> >
> > Thanks!
> > - Ian
> >
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
First off thanks everyone for the very useful replies thus far.

Shawn - thanks for the list of items to check.  #1 and #2 should be fine
for us and I'll check our ulimit for #3.

To add a bit of clarification, we are indeed using SolrCloud.  Our current
setup is to create a new collection for each customer.  For now we allow
SolrCloud to decide for itself where to locate the initial shard(s) but in
time we expect to refine this such that our system will automatically
choose the least loaded nodes according to some metric(s).

Having more than one business entity controlling the configuration of a
> single (Solr) server is a recipe for disaster. Solr works well if there is
> an architect for the system.


Jack, can you explain a bit what you mean here?  It looks like Toke caught
your meaning but I'm afraid it missed me.  What do you mean by "business
entity"?  Is your concern that with automatic creation of collections they
will be distributed willy-nilly across the cluster, leading to uneven load
across nodes?  If it is relevant, the schema and solrconfig are controlled
entirely by me and are the same for all collections.  Thus theoretically we
could actually just use one single collection for all of our customers
(adding a 'customer:<id>' type fq to all queries) but since we never
need to query across customers it seemed more performant (as well as safer
- less chance of accidentally leaking data across customers) to use
separate collections.

Better to give each tenant a separate Solr instance that you spin up and
> spin down based on demand.


Regarding this, if by tenant you mean "customer", this is not viable for us
from a cost perspective.  As I mentioned initially, many of our customers
are very small so dedicating an entire machine to each of them would not be
economical (or efficient).  Or perhaps I am not understanding what your
definition of "tenant" is?

Cheers,
Ian



On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen 
wrote:

> Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > I'm sure that I am quite unqualified to describe his hypothetical setup.
> I
> > mean, he's the one using the term multi-tenancy, so it's for him to be
> > clear.
>
> It was my understanding that Ian used them interchangeably, but of course
> Ian it the only one that knows.
>
> > For me, it's a question of who has control over the config and schema and
> > collection creation. Having more than one business entity controlling the
> > configuration of a single (Solr) server is a recipe for disaster.
>
> Thank you. Now your post makes a lot more sense. I will not argue against
> that.
>
> - Toke Eskildsen
>


Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Ian Rose
Per - Wow, 1 trillion documents stored is pretty impressive.  One
clarification: when you say that you have 2 replica per collection on each
machine, what exactly does that mean?  Do you mean that each collection is
sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards
per machine)?  Or are some of these slave replicas (e.g. 25x sharding with
1 replica per shard)?

Thanks!

On Wed, Mar 25, 2015 at 5:13 AM, Per Steffensen  wrote:

> In one of our production environments we use 32GB, 4-core, 3T RAID0
> spinning disk Dell servers (do not remember the exact model). We have about
> 25 collections with 2 replica (shard-instances) per collection on each
> machine - 25 machines. Total of 25 coll * 2 replica/coll/machine * 25
> machines = 1250 replica. Each replica contains about 800 million pretty
> small documents - thats about 1000 billion (do not know the english word
> for it) documents all in all. We index about 1.5 billion new documents
> every day (mainly into one of the collections = 50 replica across 25
> machine) and keep a history of 2 years on the data. Shifting the "index
> into" collection every month. We can fairly easy keep up with the indexing
> load. We have almost non of the data on the heap, but of course a small
> fraction of the data in the files will at any time be in OS file-cache.
> Compared to our indexing frequency we do not do a lot of searches. We have
> about 10 users searching the system from time to time - anything from major
> extracts to small quick searches. Depending on the nature of the search we
> have response-times between 1 sec and 5 min. But of course that is very
> dependent on "clever" choice on each field wrt index, store, doc-value etc.
> BUT we are not using out-of-box Apache Solr. We have made quite a lot of
> performance tweaks ourselves.
> Please note that, even though you disable all Solr caches, each replica
> will use heap-memory linearly dependent on the number of documents (and
> their size) in that replica. But not much, so you can get pretty far with
> relatively little RAM.
> Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it
> did not get worse in newer releases.
>
> Just to give you some idea of what can at least be achieved - in the
> high-end of #replica and #docs, I guess
>
> Regards, Per Steffensen
>
>
> On 24/03/15 14:02, Ian Rose wrote:
>
>> Hi all -
>>
>> I'm sure this topic has been covered before but I was unable to find any
>> clear references online or in the mailing list.
>>
>> Are there any rules of thumb for how many cores (aka shards, since I am
>> using SolrCloud) is "too many" for one machine?  I realize there is no one
>> answer (depends on size of the machine, etc.) so I'm just looking for a
>> rough idea.  Something like the following would be very useful:
>>
>> * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
>> server without any problems.
>> * I have never heard of anyone successfully running X cores/shards on a
>> single machine, even if you throw a lot of hardware at it.
>>
>> Thanks!
>> - Ian
>>
>>
>


Re: Help understanding addreplica error message re: maxShardsPerNode

2015-04-08 Thread Ian Rose
Wups - sorry folks, I sent this prematurely.  After typing this out I think
I have it figured out - although SPLITSHARD ignores maxShardsPerNode,
ADDREPLICA does not.  So ADDREPLICA fails because I already have too many
shards on a single node.
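
(Decoding the error message quoted below: the check is roughly that
maxShardsPerNode x liveNodes must cover numShards x replicationFactor, i.e.
1 x 2 = 2 placeable cores versus 4 x 1 = 4 required, hence the rejection.)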

On Wed, Apr 8, 2015 at 11:18 PM, Ian Rose  wrote:

> On my local machine I have the following test setup:
>
> * 2 "nodes" (JVMs)
> * 1 collection named "testdrive", that was originally created with
> numShards=1 and maxShardsPerNode=1.
> * After a series of SPLITSHARD commands, I now have 4 shards, as follows:
>
> testdrive_shard1_0_0_replica1 (L) Active 115
> testdrive_shard1_0_1_replica1 (L) Active 0
> testdrive_shard1_1_0_replica1 (L) Active 5
> testdrive_shard1_1_1_replica1 (L) Active 88
>
> The number in the last column is the number of documents.  The 4 shards
> are all on the same node; the second node holds nothing for this collection.
>
> Already, this situation is a little strange because I have 4 shards on one
> node, despite the fact that maxShardsPerNode is 1.  My guess is that
> SPLITSHARD ignores the maxShardsPerNode value - is that right?
>
> Now, if I issue an ADDREPLICA command
> with collection=testdrive&shard=shard1_0_0, I get the following error:
>
> "Cannot create shards testdrive. Value of maxShardsPerNode is 1, and the
> number of live nodes is 2. This allows a maximum of 2 to be created. Value
> of numShards is 4 and value of replicationFactor is 1. This requires 4
> shards to be created (higher than the allowed number)"
>
> I don't totally understand this.
>
>
>


Re: change maxShardsPerNode for existing collection?

2015-04-08 Thread Ian Rose
Thanks, I figured that might be the case (hand-editing clusterstate.json).

- Ian
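
(A later note: newer Solr releases added a Collections API MODIFYCOLLECTION action
that can change maxShardsPerNode - which appears to be what SOLR-5132, referenced
below, tracked - so on those versions something along the lines of

    /solr/admin/collections?action=MODIFYCOLLECTION&collection=foo&maxShardsPerNode=10

avoids hand-editing cluster state; the collection name and value here are
placeholders.)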


On Wed, Apr 8, 2015 at 11:46 PM, ralph tice  wrote:

> It looks like there's a patch available:
> https://issues.apache.org/jira/browse/SOLR-5132
>
> Currently the only way without that patch is to hand-edit
> clusterstate.json, which is very ill advised.  If you absolutely must,
> it's best to stop all your Solr nodes, backup the current clusterstate
> in ZK, modify it, and then start your nodes.
>
> On Wed, Apr 8, 2015 at 10:21 PM, Ian Rose  wrote:
> > I previously created several collections with maxShardsPerNode=1 but I
> > would now like to change that (to "unlimited" if that is an option).  Is
> > changing this value possible?
> >
> > Cheers,
> > - Ian
>


Help understanding addreplica error message re: maxShardsPerNode

2015-04-08 Thread Ian Rose
On my local machine I have the following test setup:

* 2 "nodes" (JVMs)
* 1 collection named "testdrive", that was originally created with
numShards=1 and maxShardsPerNode=1.
* After a series of SPLITSHARD commands, I now have 4 shards, as follows:

testdrive_shard1_0_0_replica1 (L) Active 115
testdrive_shard1_0_1_replica1 (L) Active 0
testdrive_shard1_1_0_replica1 (L) Active 5
testdrive_shard1_1_1_replica1 (L) Active 88

The number in the last column is the number of documents.  The 4 shards are
all on the same node; the second node holds nothing for this collection.

Already, this situation is a little strange because I have 4 shards on one
node, despite the fact that maxShardsPerNode is 1.  My guess is that
SPLITSHARD ignores the maxShardsPerNode value - is that right?

Now, if I issue an ADDREPLICA command
with collection=testdrive&shard=shard1_0_0, I get the following error:

"Cannot create shards testdrive. Value of maxShardsPerNode is 1, and the
number of live nodes is 2. This allows a maximum of 2 to be created. Value
of numShards is 4 and value of replicationFactor is 1. This requires 4
shards to be created (higher than the allowed number)"

I don't totally understand this.


change maxShardsPerNode for existing collection?

2015-04-08 Thread Ian Rose
I previously created several collections with maxShardsPerNode=1 but I
would now like to change that (to "unlimited" if that is an option).  Is
changing this value possible?

Cheers,
- Ian


proper routing (from non-Java client) in solr cloud 5.0.0

2015-04-14 Thread Ian Rose
Hi all -

I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0.  Our
client is written in Go, for which I am not aware of an existing client library, so we wrote
our own.  One tricky bit for this was the routing logic; if a document has
routing prefix X and belongs to collection Y, we need to know which solr
node to connect to.  Previously we accomplished this by watching the
clusterstate.json
file in zookeeper - at startup and whenever it changes, the client parses
the file contents to build a routing table.
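
Not the Go code, but for comparison, a minimal sketch of that pre-5.0 approach
using the plain ZooKeeper Java client (the ZK address and the JSON parsing are
placeholders, and session-establishment/reconnect handling is omitted for brevity);
ZK watches are one-shot, so the watch is re-armed on every read:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ClusterStateWatcher implements Watcher {
    private final ZooKeeper zk;

    public ClusterStateWatcher(String zkHost) throws Exception {
        this.zk = new ZooKeeper(zkHost, 30000, this);
        reload();
    }

    private void reload() throws Exception {
        // Read clusterstate.json and re-arm the watch in one call.
        byte[] data = zk.getData("/clusterstate.json", this, null);
        String json = new String(data, StandardCharsets.UTF_8);
        // ... parse json and rebuild the collection -> shard -> node routing table ...
        System.out.println("clusterstate.json: " + json.length() + " bytes");
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                reload();   // state changed: rebuild the routing table
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}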

However in 5.0 newly created collections do not show up in clusterstate.json
but instead have their own state.json document.  Are there any
recommendations for how to handle this from the client?  The obvious answer
is to watch every collection's state.json document, but we run a lot of
collections (~1000 currently, and growing) so I'm concerned about keeping
that many watches open at the same time (should I be?).  How does the SolrJ
client handle this?

Thanks!
- Ian


Re: proper routing (from non-Java client) in solr cloud 5.0.0

2015-04-14 Thread Ian Rose
Hi Hrishikesh,

Thanks for the pointers - I had not looked at SOLR-5474
<https://issues.apache.org/jira/browse/SOLR-5474> previously.  Interesting
approach...  I think we will stick with trying to keep zk watches open from
all clients to all collections for now, but if that starts to be a
bottleneck it's good to know the route that SolrJ has chosen...

cheers,
Ian



On Tue, Apr 14, 2015 at 3:56 PM, Hrishikesh Gadre 
wrote:

> Hi Ian,
>
> As per my understanding, Solrj does not use Zookeeper watches but instead
> caches the information (along with a TTL). You can find more information
> here,
>
> https://issues.apache.org/jira/browse/SOLR-5473
> https://issues.apache.org/jira/browse/SOLR-5474
>
> Regards
> Hrishikesh
>
>
> On Tue, Apr 14, 2015 at 8:49 AM, Ian Rose  wrote:
>
> > Hi all -
> >
> > I've just upgraded my dev install of Solr (cloud) from 4.10 to 5.0.  Our
> > client is written in Go, for which I am not aware of a client, so we
> wrote
> > our own.  One tricky bit for this was the routing logic; if a document
> has
> > routing prefix X and belong to collection Y, we need to know which solr
> > node to connect to.  Previously we accomplished this by watching the
> > clusterstate.json
> > file in zookeeper - at startup and whenever it changes, the client parses
> > the file contents to build a routing table.
> >
> > However in 5.0 newly create collections do not show up in
> clusterstate.json
> > but instead have their own state.json document.  Are there any
> > recommendations for how to handle this from the client?  The obvious
> answer
> > is to watch every collection's state.json document, but we run a lot of
> > collections (~1000 currently, and growing) so I'm concerned about keeping
> > that many watches open at the same time (should I be?).  How does the
> SolrJ
> > client handle this?
> >
> > Thanks!
> > - Ian
> >
>


Async deleteshard commands?

2015-04-28 Thread Ian Rose
Is it possible to run DELETESHARD commands in async mode?  Google searches
seem to indicate yes, but not definitively.

My local experience indicates otherwise.  If I start with an async
SPLITSHARD like so:

http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1

Then I get back the expected response format, with
<str name="requestid">12-foo-1</str>

And I can later query for the result via REQUESTSTATUS.

However if I try an async DELETESHARD like so:

http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4

The response includes the command result, indicating that the command was
not run async:




0
16




And in addition REQUESTSTATUS calls for that requestId fail with "Did not
find taskid [12-foo-4] in any tasks queue".

Synchronous deletes are causing problems for me in production as they are
timing out in some cases.

Thanks,
Ian


p.s. I'm on version 5.0.0


Re: Async deleteshard commands?

2015-04-28 Thread Ian Rose
Done!

https://issues.apache.org/jira/browse/SOLR-7481


On Tue, Apr 28, 2015 at 11:09 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> This is a bug. Can you please open a Jira issue?
>
> On Tue, Apr 28, 2015 at 8:35 PM, Ian Rose  wrote:
>
> > Is it possible to run DELETESHARD commands in async mode?  Google
> searches
> > seem to indicate yes, but not definitively.
> >
> > My local experience indicates otherwise.  If I start with an async
> > SPLITSHARD like so:
> >
> >
> >
> http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1
> >
> > Then I get back the expected response format, with 
> > 12-foo-1
> >
> > And I can later query for the result via REQUESTSTATUS.
> >
> > However if I try an async DELETESHARD like so:
> >
> >
> >
> http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4
> >
> > The response includes the command result, indicating that the command was
> > not run async:
> >
> > 
> > 
> > 
> > 0
> > 16
> > 
> > 
> > 
> >
> > And in addition REQUESTSTATUS calls for that requestId fail with "Did not
> > find taskid [12-foo-4] in any tasks queue".
> >
> > Synchronous deletes are causing problems for me in production as they are
> > timing out in some cases.
> >
> > Thanks,
> > Ian
> >
> >
> > p.s. I'm on version 5.0.0
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Async deleteshard commands?

2015-04-28 Thread Ian Rose
Hi Anshum,

FWIW I find that page is not entirely accurate with regard to async
params.  For example, my testing shows that DELETEREPLICA *does* support
the async param, although that is not listed here:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9

Cheers,
Ian


On Tue, Apr 28, 2015 at 12:47 PM, Anshum Gupta 
wrote:

> Hi Ian,
>
> DELETESHARD doesn't support ASYNC calls officially. We could certainly do
> with a better response but I believe with most of the Collections API calls
> at this time in Solr, you could send random params which would get ignored.
> Therefore, in this case, I believe that the async param gets ignored.
>
> The go-to reference point to check what's supported is the official
> reference guide:
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api7
>
> This doesn't mentioned support for async DELETESHARD calls.
>
> On Tue, Apr 28, 2015 at 8:05 AM, Ian Rose  wrote:
>
> > Is it possible to run DELETESHARD commands in async mode?  Google
> searches
> > seem to indicate yes, but not definitively.
> >
> > My local experience indicates otherwise.  If I start with an async
> > SPLITSHARD like so:
> >
> >
> >
> http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1
> >
> > Then I get back the expected response format, with 
> > 12-foo-1
> >
> > And I can later query for the result via REQUESTSTATUS.
> >
> > However if I try an async DELETESHARD like so:
> >
> >
> >
> http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4
> >
> > The response includes the command result, indicating that the command was
> > not run async:
> >
> > 
> > 
> > 
> > 0
> > 16
> > 
> > 
> > 
> >
> > And in addition REQUESTSTATUS calls for that requestId fail with "Did not
> > find taskid [12-foo-4] in any tasks queue".
> >
> > Synchronous deletes are causing problems for me in production as they are
> > timing out in some cases.
> >
> > Thanks,
> > Ian
> >
> >
> > p.s. I'm on version 5.0.0
> >
>
>
>
> --
> Anshum Gupta
>


Re: Async deleteshard commands?

2015-04-28 Thread Ian Rose
Sure.  Here is an example of ADDREPLICA in synchronous mode:

http://localhost:8983/solr/admin/collections?action=addreplica&collection=293&shard=shard1_1

response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1168</int>
  </lst>
  <lst name="success">
    <lst>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1158</int>
      </lst>
      <str name="core">293_shard1_1_replica2</str>
    </lst>
  </lst>
</response>

And here is the same in asynchronous mode:

http://localhost:8983/solr/admin/collections?action=addreplica&collection=293&shard=shard1_1&async=foo99

response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
  </lst>
  <str name="requestid">foo99</str>
</response>

Note that the format of this response does NOT match the response format
that I got from the attempt at an async DELETESHARD in my earlier email.

Also note that I am now able to query for the status of this request:

http://localhost:8983/solr/admin/collections?action=requeststatus&requestid=foo99

response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="status">
    <str name="state">completed</str>
    <str name="msg">found foo99 in completed tasks</str>
  </lst>
</response>

On Tue, Apr 28, 2015 at 2:06 PM, Anshum Gupta 
wrote:

> Hi Ian,
>
> What do you mean by "*my testing shows*" ? Can you elaborate on the steps
> and how did you confirm that the call was indeed *async* ?
> I may be wrong but I think what you're seeing is a normal DELETEREPLICA
> call succeeding behind the scenes. It is not treated or processed as an
> async call.
>
> Also, that page is the official reference guide and might need fixing if
> it's out of sync.
>
>
> On Tue, Apr 28, 2015 at 10:47 AM, Ian Rose  wrote:
>
> > Hi Anshum,
> >
> > FWIW I find that page is not entirely accurate with regard to async
> > params.  For example, my testing shows that DELETEREPLICA *does* support
> > the async param, although that is not listed here:
> >
> >
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api9
> >
> > Cheers,
> > Ian
> >
> >
> > On Tue, Apr 28, 2015 at 12:47 PM, Anshum Gupta 
> > wrote:
> >
> > > Hi Ian,
> > >
> > > DELETESHARD doesn't support ASYNC calls officially. We could certainly
> do
> > > with a better response but I believe with most of the Collections API
> > calls
> > > at this time in Solr, you could send random params which would get
> > ignored.
> > > Therefore, in this case, I believe that the async param gets ignored.
> > >
> > > The go-to reference point to check what's supported is the official
> > > reference guide:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api7
> > >
> > > This doesn't mentioned support for async DELETESHARD calls.
> > >
> > > On Tue, Apr 28, 2015 at 8:05 AM, Ian Rose 
> wrote:
> > >
> > > > Is it possible to run DELETESHARD commands in async mode?  Google
> > > searches
> > > > seem to indicate yes, but not definitively.
> > > >
> > > > My local experience indicates otherwise.  If I start with an async
> > > > SPLITSHARD like so:
> > > >
> > > >
> > > >
> > >
> >
> http://localhost:8983/solr/admin/collections?action=splitshard&collection=2Gp&shard=shard1_0_0&async=12-foo-1
> > > >
> > > > Then I get back the expected response format, with  > name="requestid">
> > > > 12-foo-1
> > > >
> > > > And I can later query for the result via REQUESTSTATUS.
> > > >
> > > > However if I try an async DELETESHARD like so:
> > > >
> > > >
> > > >
> > >
> >
> http://localhost:8983/solr/admin/collections?action=deleteshard&collection=2Gp&shard=shard1_0_0&async=12-foo-4
> > > >
> > > > The response includes the command result, indicating that the command
> > was
> > > > not run async:
> > > >
> > > > 
> > > > 
> > > > 
> > > > 0
> > > > 16
> > > > 
> > > > 
> > > > 
> > > >
> > > > And in addition REQUESTSTATUS calls for that requestId fail with "Did
> > not
> > > > find taskid [12-foo-4] in any tasks queue".
> > > >
> > > > Synchronous deletes are causing problems for me in production as they
> > are
> > > > timing out in some cases.
> > > >
> > > > Thanks,
> > > > Ian
> > > >
> > > >
> > > > p.s. I'm on version 5.0.0
> > > >
> > >
> > >
> > >
> > > --
> > > Anshum Gupta
> > >
> >
>
>
>
> --
> Anshum Gupta
>


Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Howdy all -

The short version is: We are not seeing Solr Cloud performance scale (even
close to) linearly as we add nodes. Can anyone suggest good diagnostics for
finding scaling bottlenecks? Are there known 'gotchas' that make Solr Cloud
fail to scale?

In detail:

We have used Solr (in non-Cloud mode) for over a year and are now beginning
a transition to SolrCloud.  To this end I have been running some basic load
tests to figure out what kind of capacity we should expect to provision.
In short, I am seeing very poor scalability (increase in effective QPS) as
I add Solr nodes.  I'm hoping to get some ideas on where I should be
looking to debug this.  Apologies in advance for the length of this email;
I'm trying to be comprehensive and provide all relevant information.

Our setup:

1 load generating client
 - generates tiny, fake documents with unique IDs
 - performs only writes (no queries at all)
 - chooses a random solr server for each ADD request (with 1 doc per add
request)

N collections spread over K solr servers
 - every collection is sharded K times (so every solr instance has 1 shard
from every collection)
 - no replicas
 - external zookeeper server (not using zkRun)
 - autoCommit maxTime=6
 - autoSoftCommit maxTime =15000

Everything is running within a single zone on Google Compute Engine, so
high quality gigabit network links between all machines (ping times < 1ms).

My methodology is as follows.
1. Start up a K solr servers.
2. Remove all existing collections.
3. Create N collections, with numShards=K for each.
4. Start load testing.  Every minute, print the number of successful
updates and the number of failed updates.
5. Keep increasing the offered load (via simulated users) until the qps
flatlines.
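
For concreteness, step 3 maps onto one Collections API call per collection, along
these lines (collection name is a placeholder, config params omitted):

    http://localhost:8983/solr/admin/collections?action=CREATE&name=coll_1&numShards=K&replicationFactor=1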

In brief (more detailed results at the bottom of email), I find that for
any number of nodes between 2 and 5, the QPS always caps out at ~3000.
Obviously something must be wrong here, as there should be a trend of the
QPS scaling (roughly) linearly with the number of nodes.  Or at the very
least going up at all!

So my question is what else should I be looking at here?

* CPU on the loadtest client is well under 100%
* No other obvious bottlenecks on loadtest client (running 2 clients leads
to ~1/2 qps on each)
* In many cases, CPU on the solr servers is quite low as well (e.g. with
100 users hitting 5 solr nodes, all nodes are >50% idle)
* Network bandwidth is a few MB/s, well under the gigabit capacity of our
network
* Disk bandwidth (< 2 MB/s) and iops (< 20/s) are low.

Any ideas?  Thanks very much!
- Ian


p.s. Here is my raw data broken out by number of nodes and number of
simulated users:


Num Nodes   Num Users   QPS
1           1           1020
1           5           3180
1           10          3825
1           15          3900
1           20          4050
1           40          4100
2           1           472
2           5           1790
2           10          2290
2           15          2850
2           20          2900
2           40          3210
2           60          3200
2           80          3210
2           100         3180
3           1           385
3           5           1580
3           10          2090
3           15          2560
3           20          2760
3           25          2890
3           80          3050
4           1           375
4           5           1560
4           10          2200
4           15          2500
4           20          2700
4           25          2800
4           30          2850
5           15          2450
5           20          2640
5           25          2790
5           30          2840
5           100         2900
5           200         2810


Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
>
> If you want to increase QPS, you should not be increasing numShards.
> You need to increase replicationFactor.  When your numShards matches the
> number of servers, every single server will be doing part of the work
> for every query.



I think this is true only for actual queries, right?  I am not issuing any
queries, only writes (document inserts).  In the case of writes, increasing
the number of shards should increase my throughput (in ops/sec) more or
less linearly, right?


On Thu, Oct 30, 2014 at 4:50 PM, Shawn Heisey  wrote:

> On 10/30/2014 2:23 PM, Ian Rose wrote:
> > My methodology is as follows.
> > 1. Start up a K solr servers.
> > 2. Remove all existing collections.
> > 3. Create N collections, with numShards=K for each.
> > 4. Start load testing.  Every minute, print the number of successful
> > updates and the number of failed updates.
> > 5. Keep increasing the offered load (via simulated users) until the qps
> > flatlines.
>
> If you want to increase QPS, you should not be increasing numShards.
> You need to increase replicationFactor.  When your numShards matches the
> number of servers, every single server will be doing part of the work
> for every query.  If you increase replicationFactor instead, then each
> server can be doing a different query in parallel.
>
> Sharding the index is what you need to do when you need to scale the
> size of the index, so each server does not get overwhelmed by dealing
> with every document for every query.
>
> Getting a high QPS with a big index requires increasing both numShards
> *AND* replicationFactor.
>
> Thanks,
> Shawn
>
>


Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Thanks for the suggestions so for, all.

1) We are not using SolrJ on the client (not using Java at all) but I am
working on writing a "smart" router so that we can always send to the
correct node.  I am certainly curious to see how that changes things.
Nonetheless even with the overhead of extra routing hops, the observed
behavior (no increase in performance with more nodes) doesn't make any
sense to me.

2) Commits: we are using autoCommit with openSearcher=false (maxTime=6)
and autoSoftCommit (maxTime=15000).

3) Suggestions to batch documents certainly make sense for production code
but in this case I am not real concerned with absolute performance; I just
want to see the *relative* performance as we use more Solr nodes.  So I
don't think batching or not really matters.

4) "No, that won't affect indexing speed all that much.  The way to
increase indexing speed is to increase the number of processes or threads
that are indexing at the same time.  Instead of having one client
sending update
requests, try five of them."

Can you elaborate on this some?  I'm worried I might be misunderstanding
something fundamental.  A cluster of 3 shards over 3 Solr nodes
*should* support
a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
behind sharding.  Regarding your comment of "increase the number of
processes or threads", note that for each value of K (number of Solr nodes)
I measured with several different numbers of simulated users so that I
could find a "saturation point".  For example, take a look at my data for
K=2:

Num Nodes   Num Users   QPS
2           1           472
2           5           1790
2           10          2290
2           15          2850
2           20          2900
2           40          3210
2           60          3200
2           80          3210
2           100         3180

It's clear that once the load test client has ~40 simulated users, the Solr
cluster is saturated.  Creating more users just increases the average
request latency, such that the total QPS remained (nearly) constant.  So I
feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps.
The problem is that I am finding roughly this same "max point", no matter
how many simulated users the load test client created, for any value of K
(> 1).

Cheers,
- Ian


On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson 
wrote:

> Your indexing client, if written in SolrJ, should use CloudSolrServer
> which is, in Matt's terms "leader aware". It divides up the
> documents to be indexed into packets that where each doc in
> the packet belongs on the same shard, and then sends the packet
> to the shard leader. This avoids a lot of re-routing and should
> scale essentially linearly. You may have to add more clients
> though, depending upon how hard the document-generator is
> working.
>
> Also, make sure that you send batches of documents as Shawn
> suggests, I use 1,000 as a starting point.
>
> Best,
> Erick
>
> On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey  wrote:
> > On 10/30/2014 2:56 PM, Ian Rose wrote:
> >> I think this is true only for actual queries, right? I am not issuing
> >> any queries, only writes (document inserts). In the case of writes,
> >> increasing the number of shards should increase my throughput (in
> >> ops/sec) more or less linearly, right?
> >
> > No, that won't affect indexing speed all that much.  The way to increase
> > indexing speed is to increase the number of processes or threads that
> > are indexing at the same time.  Instead of having one client sending
> > update requests, try five of them.  Also, index many documents with each
> > update request.  Sending one document at a time is very inefficient.
> >
> > You didn't say how you're doing commits, but those need to be as
> > infrequent as you can manage.  Ideally, you would use autoCommit with
> > openSearcher=false on an interval of about five minutes, and send an
> > explicit commit (with the default openSearcher=true) after all the
> > indexing is done.
> >
> > You may have requirements regarding document visibility that this won't
> > satisfy, but try to avoid doing commits with openSearcher=true (soft
> > commits qualify for this) extremely frequently, like once a second.
> > Once a minute is much more realistic.  Opening a new searcher is an
> > expensive operation, especially if you have cache warming configured.
> >
> > Thanks,
> > Shawn
> >
>


Re: Ideas for debugging poor SolrCloud scalability

2014-10-31 Thread Ian Rose
Hi Erick -

Thanks for the detailed response and apologies for my confusing
terminology.  I should have said "WPS" (writes per second) instead of QPS
but I didn't want to introduce a weird new acronym since QPS is well
known.  Clearly a bad decision on my part.  To clarify: I am doing
*only* writes
(document adds).  Whenever I wrote "QPS" I was referring to writes.

It seems clear at this point that I should wrap up the code to do "smart"
routing rather than choose Solr nodes randomly.  And then see if that
changes things.  I must admit that although I understand that random node
selection will impose a performance hit, theoretically it seems to me that
the system should still scale up as you add more nodes (albeit at lower
absolute level of performance than if you used a smart router).
Nonetheless, I'm just theorycrafting here so the better thing to do is just
try it experimentally.  I hope to have that working today - will report
back on my findings.

Cheers,
- Ian

p.s. To clarify why we are rolling our own smart router code, we use Go
over here rather than Java.  Although if we still get bad performance with
our custom Go router I may try a pure Java load client using
CloudSolrServer to eliminate the possibility of bugs in our implementation.
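
For reference, a minimal SolrJ load client is only a few lines.  A sketch, assuming
SolrJ 4.x/5.0, with the ZK address, collection name, field names, and batch size
all placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class LoadClient {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("localhost:2181"); // ZK ensemble
        solr.setDefaultCollection("loadtest");                        // placeholder
        try {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {          // batch of 1,000 as Erick suggests
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("body_t", "tiny fake document " + i);
                batch.add(doc);
            }
            // CloudSolrServer is "leader aware": it splits the batch per shard
            // and sends each sub-batch straight to that shard's leader.
            solr.add(batch);
            // No explicit commit; rely on autoCommit/autoSoftCommit as above.
        } finally {
            solr.shutdown();
        }
    }
}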


On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson 
wrote:

> I'm really confused:
>
> bq: I am not issuing any queries, only writes (document inserts)
>
> bq: It's clear that once the load test client has ~40 simulated users
>
> bq: A cluster of 3 shards over 3 Solr nodes *should* support
> a higher QPS than 2 shards over 2 Solr nodes, right
>
> QPS is usually used to mean "Queries Per Second", which is different from
> the statement that "I am not issuing any queries". And what do the
> number of users have to do with inserting documents?
>
> You also state: " In many cases, CPU on the solr servers is quite low as
> well"
>
> So let's talk about indexing first. Indexing should scale nearly
> linearly as long as
> 1> you are routing your docs to the correct leader, which happens with
> SolrJ
> and the CloudSolrSever automatically. Rather than rolling your own, I
> strongly
> suggest you try this out.
> 2> you have enough clients feeding the cluster to push CPU utilization
> on them all.
> Very often "slow indexing", or in your case "lack of scaling" is a
> result of document
> acquisition or, in your case, your doc generator is spending all it's
> time waiting for
> the individual documents to get to Solr and come back.
>
> bq: "chooses a random solr server for each ADD request (with 1 doc per add
> request)"
>
> Probably your culprit right there. Each and every document requires that
> you
> have to cross the network (and forward that doc to the correct leader). So
> given
> that you're not seeing high CPU utilization, I suspect that you're not
> sending
> enough docs to SolrCloud fast enough to see scaling. You need to batch up
> multiple docs, I generally send 1,000 docs at a time.
>
> But even if you do solve this, the inter-node routing will prevent
> linear scaling.
> When a doc (or a batch of docs) goes to a random Solr node, here's what
> happens:
> 1> the docs are re-packaged into groups based on which shard they're
> destined for
> 2> the sub-packets are forwarded to the leader for each shard
> 3> the responses are gathered back and returned to the client.
>
> This set of operations will eventually degrade the scaling.
>
> bq:  A cluster of 3 shards over 3 Solr nodes *should* support
> a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
> behind sharding.
>
> If we're talking search requests, the answer is no. Sharding is
> what you do when your collection no longer fits on a single node.
> If it _does_ fit on a single node, then you'll usually get better query
> performance by adding a bunch of replicas to a single shard. When
> the number of  docs on each shard grows large enough that you
> no longer get good query performance, _then_ you shard. And
> take the query hit.
>
> If we're talking about inserts, then see above. I suspect your problem is
> that you're _not_ "saturating the SolrCloud cluster", you're sending
> docs to Solr very inefficiently and waiting on I/O. Batching docs and
> sending them to the right leader should scale pretty linearly until you
> start saturating your network.
>
> Best,
> Erick
>
> On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose  wrote:
> > Thanks for the suggestions so for, all.
> >
> > 1) We are not using SolrJ on the client (not using Java at all) but I 

Re: Ideas for debugging poor SolrCloud scalability

2014-11-01 Thread Ian Rose
Erick,

Just to make sure I am thinking about this right: batching will certainly
make a big difference in performance, but it should be more or less a
constant factor no matter how many Solr nodes you are using, right?  Right
now in my load tests, I'm not actually that concerned about the absolute
performance numbers; instead I'm just trying to figure out why relative
performance (no matter how bad it is since I am not batching) does not go
up with more Solr nodes.  Once I get that part figured out and we are
seeing more writes per sec when we add nodes, then I'll turn on batching in
the client to see what kind of additional performance gain that gets us.

Cheers,
Ian


On Fri, Oct 31, 2014 at 3:43 PM, Peter Keegan 
wrote:

> Yes, I was inadvertently sending them to a replica. When I sent them to the
> leader, the leader reported (1000 adds) and the replica reported only 1 add
> per document. So, it looks like the leader forwards the batched jobs
> individually to the replicas.
>
> On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson 
> wrote:
>
> > Internally, the docs are batched up into smaller buckets (10 as I
> > remember) and forwarded to the correct shard leader. I suspect that's
> > what you're seeing.
> >
> > Erick
> >
> > On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan 
> > wrote:
> > > Regarding batch indexing:
> > > When I send batches of 1000 docs to a standalone Solr server, the log
> > file
> > > reports "(1000 adds)" in LogUpdateProcessor. But when I send them to
> the
> > > leader of a replicated index, the leader log file reports much smaller
> > > numbers, usually "(12 adds)". Why do the batches appear to be broken
> up?
> > >
> > > Peter
> > >
> > > On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson <
> > erickerick...@gmail.com>
> > > wrote:
> > >
> > >> NP, just making sure.
> > >>
> > >> I suspect you'll get lots more bang for the buck, and
> > >> results much more closely matching your expectations if
> > >>
> > >> 1> you batch up a bunch of docs at once rather than
> > >> sending them one at a time. That's probably the easiest
> > >> thing to try. Sending docs one at a time is something of
> > >> an anti-pattern. I usually start with batches of 1,000.
> > >>
> > >> And just to check.. You're not issuing any commits from the
> > >> client, right? Performance will be terrible if you issue commits
> > >> after every doc, that's totally an anti-pattern. Doubly so for
> > >> optimizes Since you showed us your solrconfig  autocommit
> > >> settings I'm assuming not but want to be sure.
> > >>
> > >> 2> use a leader-aware client. I'm totally unfamiliar with Go,
> > >> so I have no suggestions whatsoever to offer there But you'll
> > >> want to batch in this case too.
> > >>
> > >> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose 
> > wrote:
> > >> > Hi Erick -
> > >> >
> > >> > Thanks for the detailed response and apologies for my confusing
> > >> > terminology.  I should have said "WPS" (writes per second) instead
> of
> > QPS
> > >> > but I didn't want to introduce a weird new acronym since QPS is well
> > >> > known.  Clearly a bad decision on my part.  To clarify: I am doing
> > >> > *only* writes
> > >> > (document adds).  Whenever I wrote "QPS" I was referring to writes.
> > >> >
> > >> > It seems clear at this point that I should wrap up the code to do
> > "smart"
> > >> > routing rather than choose Solr nodes randomly.  And then see if
> that
> > >> > changes things.  I must admit that although I understand that random
> > node
> > >> > selection will impose a performance hit, theoretically it seems to
> me
> > >> that
> > >> > the system should still scale up as you add more nodes (albeit at
> > lower
> > >> > absolute level of performance than if you used a smart router).
> > >> > Nonetheless, I'm just theorycrafting here so the better thing to do
> is
> > >> just
> > >> > try it experimentally.  I hope to have that working today - will
> > report
> > >> > back on my findings.
> > >> >
> > >> > Cheers,
> > >> > - Ian
> > >> &

any difference between using collection vs. shard in URL?

2014-11-05 Thread Ian Rose
If I add some documents to a SolrCloud shard in a collection "alpha", I can
post them to "/solr/alpha/update".  However I notice that you can also post
them using the shard name, e.g. "/solr/alpha_shard4_replica1/update" - in
fact this is what Solr seems to do internally (like if you send documents
to the wrong node so Solr needs to forward them over to the leader of the
correct shard).

Assuming you *do* always post your documents to the correct shard, is there
any difference between these two, performance or otherwise?

Thanks!
- Ian


Re: any difference between using collection vs. shard in URL?

2014-11-05 Thread Ian Rose
Awesome, thanks.  That's what I was hoping.

Cheers,
Ian


On Wed, Nov 5, 2014 at 10:33 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> There's no difference between the two. Even if you send updates to a shard
> url, it will still be forwarded to the right shard leader according to the
> hash of the id (assuming you're using the default compositeId router). Of
> course, if you happen to hit the right shard leader then it is just an
> internal forward and not an extra network hop.
>
> The advantage with using the collection name is that you can hit any
> SolrCloud node (even the ones not hosting this collection) and it will
> still work. So for a non Java client, a load balancer can be setup in front
> of the entire cluster and things will just work.
>
> On Wed, Nov 5, 2014 at 8:50 PM, Ian Rose  wrote:
>
> > If I add some documents to a SolrCloud shard in a collection "alpha", I
> can
> > post them to "/solr/alpha/update".  However I notice that you can also
> post
> > them using the shard name, e.g. "/solr/alpha_shard4_replica1/update" - in
> > fact this is what Solr seems to do internally (like if you send documents
> > to the wrong node so Solr needs to forward them over to the leader of the
> > correct shard).
> >
> > Assuming you *do* always post your documents to the correct shard, is
> there
> > any difference between these two, performance or otherwise?
> >
> > Thanks!
> > - Ian
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Ideas for debugging poor SolrCloud scalability

2014-11-07 Thread Ian Rose
Hi again, all -

Since several people were kind enough to jump in to offer advice on this
thread, I wanted to follow up in case anyone finds this useful in the
future.

*tl;dr: *Routing updates to a random Solr node (and then letting it forward
the docs to where they need to go) is very expensive, more than I
expected.  Using a "smart" router that uses the cluster config to route
documents directly to their shard results in (near) linear scaling for us.

*Expository version:*

We use Go on our client, for which (to my knowledge) there is no SolrCloud
router implementation.  So we started by just routing updates to a random
Solr node and letting it forward the docs to where they need to go.  My
theory was that this would lead to a constant amount of additional work
(and thus still linear scaling).  This was based on the observation that if
you send an update of K documents to a Solr node in a N node cluster, in
the worst case scenario, all K documents will need to be forwarded on to
other nodes.  Since Solr nodes have perfect knowledge of where docs belong,
each doc would only take 1 additional hop to get to its replica.  So random
routing (in the limit) imposes 1 additional network hop for each document.

In practice, however, we find that (for small networks, at least) per-node
performance falls as you add shards.  In fact, the client performance (in
writes/sec) was essentially constant no matter how many shards we added.  I
do have a working theory as to why this might be (i.e. where the flaw is in
my logic above) but as this is merely an unverified theory I don't want to
lead anyone astray by writing it up here.

However, by writing a "smart" router that retrieves the clusterstate.json
file from Zookeeper and uses that to "perfectly" route documents to their
proper shard, we were able to achieve much better scalability.  Using a
synthetic workload, we were able to achieve 141.7 writes/sec to a cluster
of size 1 and 2506 writes/sec to a cluster of size 20 (125
writes/sec/node).  So a dropoff of ~12% which is not too bad.  We are
hoping to continue our tests with larger clusters to ensure that the
per-node write performance levels off and does not continue to drop as the
cluster scales.
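
The router itself is in Go, but the lookup it performs is small enough to sketch.
A rough Java equivalent, where the clusterstate.json parsing and the 32-bit
document hash (Solr's compositeId router derives it with MurmurHash3) are assumed
to be supplied by the caller:

import java.util.LinkedHashMap;
import java.util.Map;

public class ShardPicker {
    // Shard name -> hash range as it appears in clusterstate.json, e.g. "80000000-b332ffff".
    private final Map<String, int[]> ranges = new LinkedHashMap<String, int[]>();

    public void addShard(String shardName, String rangeSpec) {
        String[] parts = rangeSpec.split("-");
        int min = (int) Long.parseLong(parts[0], 16);  // parse unsigned hex, keep as signed int
        int max = (int) Long.parseLong(parts[1], 16);
        ranges.put(shardName, new int[] { min, max });
    }

    /** Returns the shard whose [min, max] range contains the document's 32-bit hash. */
    public String pick(int docHash) {
        for (Map.Entry<String, int[]> e : ranges.entrySet()) {
            if (docHash >= e.getValue()[0] && docHash <= e.getValue()[1]) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("no shard covers hash " + Integer.toHexString(docHash));
    }
}

The chosen shard name then maps to a node URL using the replica entries from the
same clusterstate.json data.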

I will also note that we initially had several bugs in our "smart" router
implementation so if you follow a similar path and see bad performance look
to your router implementation as you might not be routing correctly.  We
ended up writing a simple proxy that we ran in front of Solr to observe all
requests which helped immensely when verifying and debugging our router.
Yes tcpdump does something similar but viewing HTTP-level traffic is way
more convenient than TCP-level.  Plus Go makes little proxies like this
super easy to do.

Hope all that is useful to someone.  Thanks again to the posters above for
providing suggestions!

- Ian



On Sat, Nov 1, 2014 at 7:13 PM, Erick Erickson 
wrote:

> bq: but it should be more or less a constant factor no matter how many
> Solr nodes you are using, right?
>
> Not really. You've stated that you're not driving Solr very hard in
> your tests. Therefore you're waiting on I/O. Therefore your tests just
> aren't going to scale linearly with the number of shards. This is a
> simplification, but
>
> Your network utilization is pretty much irrelevant. I send a packet
> somewhere. "somewhere" does some stuff and sends me back an
> acknowledgement. While I'm waiting, the network is getting no traffic,
> so. If the network traffic was in the 90% range that would be
> different, so it's a good thing to monitor.
>
> Really, use a "leader aware" client and rack enough clients together
> that you're driving Solr hard. Then double the number of shards. Then
> rack enough _more_ clients to drive Solr at the same level. In this
> case I'll go out on a limb and predict near 2x throughput increases.
>
> One additional note, though. When you add _replicas_ to shards expect
> to see a drop in throughput that may be quite significant, 20-40%
> anecdotally...
>
> Best,
> Erick
>
> On Sat, Nov 1, 2014 at 9:23 AM, Shawn Heisey  wrote:
> > On 11/1/2014 9:52 AM, Ian Rose wrote:
> >> Just to make sure I am thinking about this right: batching will
> certainly
> >> make a big difference in performance, but it should be more or less a
> >> constant factor no matter how many Solr nodes you are using, right?
> Right
> >> now in my load tests, I'm not actually that concerned about the absolute
> >> performance numbers; instead I'm just trying to figure out why relative
> >> performance (no matter how bad it is since I am not batching) does not
> go
> >> up with more Solr nodes.  Once I get that part figured out and we are
> >

Migrating shards

2014-11-07 Thread Ian Rose
Howdy -

What is the current best practice for migrating shards to another machine?
I have heard suggestions that it is "add replica on new machine, wait for
it to catch up, delete original replica on old machine".  But I wanted to
check to make sure...

And if that is the best method, two follow-up questions:

1. Is there a best practice for knowing when the new replica has "caught
up" or do you just do a "*:*" query on both, compare counts, and call it a
day when they are the same (or nearly so, since the slave replica might lag
a little bit)?

2. When deleting the original (old) replica, since that one could be the
leader, is the replica deletion done in a safe manner such that no
documents will be lost (e.g. ones that were recently received by the leader
and not yet synced over to the slave replica before the leader is deleted)?
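
For concreteness, the two Collections API calls involved look roughly like this
(collection, shard, node, and replica names are placeholders), with the delete
issued only after the new replica shows as active:

    /solr/admin/collections?action=ADDREPLICA&collection=foo&shard=shard1&node=newhost:8983_solr
    /solr/admin/collections?action=DELETEREPLICA&collection=foo&shard=shard1&replica=core_node2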

Thanks as always,
Ian


Re: Migrating shards

2014-11-07 Thread Ian Rose
Sounds great - thanks all.

On Fri, Nov 7, 2014 at 2:06 PM, Erick Erickson 
wrote:

> bq: I think ADD/DELETE replica APIs are best for within a SolrCloud
>
> I second this, if for no other reason than I'd expect this to get
> more attention than the underlying core admin API.
>
> That said, I believe ADD/DELETE replica just makes use of the core
> admin API under the covers, in which case you'd get all the goodness
> baked into the core admin API plus whatever extra is written into
> the collections api processing.
>
> Best,
> Erick
>
> On Fri, Nov 7, 2014 at 8:28 AM, ralph tice  wrote:
> > I think ADD/DELETE replica APIs are best for within a SolrCloud,
> > however if you need to move data across SolrClouds you will have to
> > resort to older APIs, which I didn't find good documentation of but
> > many references to.  So I wrote up the instructions to do so here:
> > https://gist.github.com/ralph-tice/887414a7f8082a0cb828
> >
> > I haven't had much time to think about how to translate this to more
> > generic documentation for inclusion in the community wiki but I would
> > love to hear some feedback if anyone else has a similar use case for
> > moving Solr indexes across SolrClouds.
> >
> >
> >
> > On Fri, Nov 7, 2014 at 10:18 AM, Michael Della Bitta
> >  wrote:
> >> 1. The new replica will not begin serving data until it's all there and
> >> caught up. You can watch the replica status on the Cloud screen to see
> it
> >> catch up; when it's green, you're done. If you're trying to automate
> this,
> >> you're going to look for the replica that says "recovering" in
> >> clusterstate.json and wait until it's "active."
> >>
> >> 2. I believe this to be the case, but I'll wait for someone else to
> chime in
> >> who knows better. Also, I wonder if there's a difference between
> >> DELETEREPLICA and unloading the core directly.
> >>
> >> Michael
> >>
> >>
> >>
> >> On 11/7/14 10:24, Ian Rose wrote:
> >>>
> >>> Howdy -
> >>>
> >>> What is the current best practice for migrating shards to another
> machine?
> >>> I have heard suggestions that it is "add replica on new machine, wait
> for
> >>> it to catch up, delete original replica on old machine".  But I wanted
> to
> >>> check to make sure...
> >>>
> >>> And if that is the best method, two follow-up questions:
> >>>
> >>> 1. Is there a best practice for knowing when the new replica has
> "caught
> >>> up" or do you just do a "*:*" query on both, compare counts, and call
> it a
> >>> day when they are the same (or nearly so, since the slave replica might
> >>> lag
> >>> a little bit)?
> >>>
> >>> 2. When deleting the original (old) replica, since that one could be
> the
> >>> leader, is the replica deletion done in a safe manner such that no
> >>> documents will be lost (e.g. ones that were recently received by the
> >>> leader
> >>> and not yet synced over to the slave replica before the leader is
> >>> deleted)?
> >>>
> >>> Thanks as always,
> >>> Ian
> >>>
> >>
>


clarification regarding shard splitting and composite IDs

2014-11-10 Thread Ian Rose
Howdy -

We are using composite IDs of the form <userId>!<eventId>.  This ensures that
all events for a user are stored in the same shard.

I'm assuming from the description of how composite ID routing works, that
if you split a shard the "split point" of the hash range for that shard is
chosen to maintain the invariant that all documents that share a routing
prefix (before the "!") will still map to the same (new) shard.  Is that
accurate?
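
(Background on the mechanism, as I understand the compositeId router: for a
two-part id like "userA!doc1" it takes the top 16 bits of the 32-bit hash from
"userA" and the bottom 16 bits from "doc1", so every id sharing a prefix falls in
one contiguous block of 65,536 hash values; the question is whether SPLITSHARD
picks split points that keep each such block on one side.)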

A naive shard-split implementation (e.g. that chose the hash range split
point arbitrarily) could end up with "child" shards that split a routing
prefix.

Thanks,
Ian


Re: Using Zookeeper with REST URL

2014-11-19 Thread Ian Rose
I don't think zookeeper has a REST api.  You'll need to use a Zookeeper
client library in your language (or roll one yourself).
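
For example, with the plain ZooKeeper Java client the list of live Solr nodes is
one call away.  A minimal sketch, with the ZK address a placeholder:

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodes {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // Wait for the session to be established before issuing requests.
        ZooKeeper zk = new ZooKeeper("10.0.1.8:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        try {
            // SolrCloud registers each live node as an ephemeral child of
            // /live_nodes, e.g. "10.0.1.9:8983_solr".
            for (String node : zk.getChildren("/live_nodes", false)) {
                System.out.println("http://" + node.replace('_', '/'));
            }
        } finally {
            zk.close();
        }
    }
}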

On Wed, Nov 19, 2014 at 9:48 AM, nabil Kouici  wrote:

> Hi All,
>
> I'm connecting to solr using REST API (No library like SolJ). As my solr
> configuration is in cloud using Zookeeper ensemble, I don't know how to get
> available Solr server from ZooKeeper to be used in my URL Call. With SolrJ
> I can do:
>
> String zkHostString = "10.0.1.8:2181";
> CloudSolrServer solr = new CloudSolrServer(zkHostString);
> solr.connect();
>
> SolrQuery solrQuery = new SolrQuery("*:*");
> solrQuery.setRows(10);
> QueryResponse resp = solr.query(solrQuery);
>
> Any help.
>
> Regards,
> Nabil