maxReplicasPerNode

2015-03-24 Thread Shai Erera
Hi

I saw that we can define maxShardsPerNode when creating a collection, but I
don't see that I can set something similar for replicas. My scenario is the
following:

   - I set up one Solr node
   - Create collection with numShards=1 and replicationFactor=2
   - Hopefully, one replica is created on that node
   - When I bring up the second Solr node, the second replica will be
   created

What I see is that both replicas are created on the first node, and when I
bring up the second Solr node, none of the replicas are moved.

I know that I can "move" one replica by calling ADDREPLICA on node2, then
DELETEREPLICA on node1, but I was wondering if there's an automated way to
do that.
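
(For reference, by "move" I mean something like the following, assuming a
collection named 'mycollection' with a single shard, where core_node1 is
whatever name the cluster state shows for the replica on node1:

http://node2:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=node2:8983_solr
http://node1:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node1
)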

I've also considered creating the collection with replicationFactor=1;
when the second node comes up, it will look for shards w/ one replica only
and assign itself as a replica. But that means I have to own that piece
of logic, and if Solr already does that, that's better.

Also, from what I understand, if I create a collection w/ rf=2 and there
are two nodes, then each node is assigned a replica. If one of the nodes
comes down, and a 3rd node comes up, it will be assigned a replica -- is
that correct?

Another related question, if there are two replicas on node1 and node2, and
node2 goes down -- will node1 be assigned the second replica as well?

If this is explained somewhere, I'd appreciate if you can give me a pointer.

Shai


Re: maxReplicasPerNode

2015-03-24 Thread Anshum Gupta
Hi Shai,

As of now, all replicas for a collection are created to meet the specified
replication factor at the time of collection creation. There's no way to
defer that until more nodes are up. Your best bet is to have the nodes
already up before you CREATE the collection, or to create the collection with
a lower replication factor and then use ADDREPLICA.
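
For example, something like this (hypothetical collection name, default
port):

http://node1:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=1

and then, once the second node is up, an ADDREPLICA call like the one you
described.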

About auto-addition of replicas, that's kind of supported when using a shared
file system (HDFS) to host the index. It doesn't truly work for your
use-case, i.e. it doesn't consider the intended replication factor but only
brings up a replica in case all replicas for a node are down, so that
SolrCloud continues to be usable. It also doesn't auto-remove the replica when
the old node comes back up. You can read more about this in the
"Automatically Add Replicas in SolrCloud" section here:
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
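
If you do go the HDFS route, that behavior is enabled at collection-creation
time, e.g. (per that page):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=1&autoAddReplicas=true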

About #3, in line with my answer to the previous question, Solr wouldn't
auto-add a replica to meet the replication factor when a node goes down.




-- 
Anshum Gupta


Re: maxReplicasPerNode

2015-03-24 Thread Shai Erera
Thanks Anshum,

About #3, in line with my answer to the previous question, Solr wouldn't
> auto-add a Replica to meet the replication factor when a node goes down.
>

Just to make sure the answer applies to both these cases:

   1. There are two replicas on node1 and node2. Solr won't add a replica
   to node1 when node2 goes down.
   2. The collection was created with rf=2, Solr creates replicas on node1
   and node2. If node2 goes down and a node3 comes up instead, will it be
   assigned a replica, or does Solr not do that either?

In short, is there any scenario where Solr would auto-add replicas (aside
from running on HDFS) to meet the 'rf' setting, or after the collection has
been created, is ensuring RF is met my responsibility?

Shai



_text

2015-03-24 Thread phiroc

Hello,

my SOLR 5 Admin Panel displays the following error:

23/03/2015 15:05:05 ERROR   SolrCore
org.apache.solr.common.SolrException: undefined field: "_text"

How should _text be defined in schema.xml?

Many thanks.

Philippe


Re: _text

2015-03-24 Thread Zheng Lin Edwin Yeo
Hi Philippe,

Are you using the default schemaFactory, in which your setting in
solrconfig.xml is <schemaFactory class="ManagedIndexSchemaFactory">, or have
you used your own defined schema.xml, in which your setting in
solrconfig.xml should be <schemaFactory class="ClassicIndexSchemaFactory"/>?


Regards,
Edwin




Re: _text

2015-03-24 Thread phiroc
Hi Zheng,

I copied the SOLR 5 schema.xml file from Github (?), which contains the
following line:

<field name="_text" type="text_general" indexed="true" stored="false" multiValued="true"/>



Re: _text

2015-03-24 Thread Zheng Lin Edwin Yeo
Hi Philippe,

That means you're using the physical schema.xml. You can check the file in
your collection, under the conf folder. For mine, I don't have the _text field
in my schema.xml. If you don't require it in your setup, you can try
removing it and see if it's ok.

Else you can use the schema.xml or the entire conf folder from the
techproducts example located at
{SOLR_HOME}\server\solr\configsets\sample_techproducts_configs\conf, which
comes together with the Solr 5.0 package.






Read or Capture Solr Logs

2015-03-24 Thread Nitin Solanki
Hello,
I want to read or capture all the queries that users search for.
Any help on this?


RE: Read or Capture Solr Logs

2015-03-24 Thread Markus Jelsma
Hello, you can either process the logs, or make a simple SearchComponent 
implementation that reads SolrQueryRequest.

Markus

 
 


Re: Read or Capture Solr Logs

2015-03-24 Thread Nitin Solanki
Hi Markus,
  Can you please help me with how to do that, using both approaches:
"process the logs" and "make a simple SearchComponent implementation that
reads SolrQueryRequest"?





How to verify a document is indexed by all replicas

2015-03-24 Thread Shai Erera
Hi

Is there a recommended, preferably fast, way to check that a document is
indexed by all replicas? I currently do that by issuing a search request to
each replica, but was wondering if there's a faster way.

Even better, is there a way to verify all replicas of a shard are
"up-to-date", e.g. by comparing their version or something? By "up-to-date"
I mean that they've all processed the same update requests that came
through.

If there's a replica lagging behind, I'd like to wait for it to catch up,
something like a checkpoint(), before I continue sending more updates.

Shai


Set search query logs into Solr

2015-03-24 Thread Nitin Solanki
Hello,
 I want to insert searched queries into the Solr log to track the
input of users. I googled a lot but didn't find anything. Please help.
Your help will be appreciated...


RE: Read or Capture Solr Logs

2015-03-24 Thread Markus Jelsma
Hello - "process the logs" means you have to build your own program that reads
and processes the logs, and does whatever you need it to. In a custom
SearchComponent you can implement e.g. process() [1] and read the query, and
do something with it.

[1]: 
http://lucene.apache.org/solr/5_0_0/solr-core/org/apache/solr/handler/component/SearchComponent.html#process%28org.apache.solr.handler.component.ResponseBuilder%29
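
A bare-bones sketch of such a component (class name and logging are just
placeholders, written against the 5.0 API as I remember it, so verify the
method signatures against your version):

import java.io.IOException;

import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class QueryCaptureComponent extends SearchComponent {
  private static final Logger log = LoggerFactory.getLogger(QueryCaptureComponent.class);

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // nothing to prepare - we only observe the request
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // read the raw user query from the SolrQueryRequest and do something with it
    String q = rb.req.getParams().get(CommonParams.Q);
    if (q != null) {
      log.info("captured query: {}", q);
    }
  }

  @Override
  public String getDescription() {
    return "captures user queries";
  }

  @Override
  public String getSource() {
    return null;
  }
}

Register it in solrconfig.xml, e.g.:

<searchComponent name="querycapture" class="com.example.QueryCaptureComponent"/>

and add <str>querycapture</str> to the last-components of your /select
handler.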
 


Re: TooManyBasicQueries?

2015-03-24 Thread Ian Rose
Hi Erik -

Sorry, I totally missed your reply.  To the best of my knowledge, we are
not using any surround queries (have to admit I had never heard of them
until now).  We use solr.SearchHandler for all of our queries.

Does that answer the question?

Cheers,
Ian


On Fri, Mar 13, 2015 at 10:08 AM, Erik Hatcher 
wrote:

> It results from a surround query with too many terms.   Says the javadoc:
>
> * Exception thrown when {@link BasicQueryFactory} would exceed the limit
> * of query clauses.
>
> I’m curious, are you issuing a large {!surround} query or is it expanding
> to hit that limit?
>
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>
>
>
> > On Mar 13, 2015, at 9:44 AM, Ian Rose  wrote:
> >
> > I sometimes see the following in my logs:
> >
> > ERROR org.apache.solr.core.SolrCore  –
> > org.apache.lucene.queryparser.surround.query.TooManyBasicQueries:
> Exceeded
> > maximum of 1000 basic queries.
> >
> >
> > What does this mean?  Does this mean that we have issued a query with too
> > many terms?  Or that the number of concurrent queries running on the
> server
> > is too high?
> >
> > Also, is this a builtin limit or something set in a config file?
> >
> > Thanks!
> > - Ian
>
>


Solr replicas going in recovering state during heavy indexing

2015-03-24 Thread Gopal Jee
Hi
We have a large solrcloud cluster. We have observed that during heavy
indexing, large number of replicas go to recovering or down state.
What could be the possible reason and/or fix for the issue.

Gopal


rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Hi all -

I'm sure this topic has been covered before but I was unable to find any
clear references online or in the mailing list.

Are there any rules of thumb for how many cores (aka shards, since I am
using SolrCloud) is "too many" for one machine?  I realize there is no one
answer (depends on size of the machine, etc.) so I'm just looking for a
rough idea.  Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
server without any problems.
* I have never heard of anyone successfully running X cores/shards on a
single machine, even if you throw a lot of hardware at it.

Thanks!
- Ian


Issues to create new core

2015-03-24 Thread Alejandro Jesus Mariño Molerio
Dear Solr Community: 
I just began to work with Solr. I chose Solr 5.0, but when I try to create a
new core with the GUI, it shows the following error: "Error CREATEing SolrCore
'datos': Unable to create core [datos] Caused by: Can't find resource
'solrconfig.xml' in classpath or 'C:\solr\server\solr\datos\conf'". My question
is simple: how can I fix this problem?

Thanks in advance for your consideration. 
Alejandro. 


Re: TooManyBasicQueries?

2015-03-24 Thread Erik Hatcher
Somehow a surround query is being constructed along the way.  Search your logs 
for “surround” and see if someone is maybe sneaking a q={!surround}… in there.  
If you’re passing input directly through from your application to Solr’s q 
parameter without any sanitizing or filtering, it’s possible a surround query 
parser could be asked for.


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com 







Re: maxReplicasPerNode

2015-03-24 Thread Shawn Heisey
On 3/24/2015 3:22 AM, Shai Erera wrote:
>>> If this is explained somewhere, I'd appreciate if you can give me a
>>> pointer.

I don't think it's explained anywhere, so that's a gap in the
documentation.

One problem with automatic replica addition in response to cluster
problems is that there is no mechanism (currently, at least) to indicate
that a node disappearance is intentional and temporary, and no way to
configure a minimum time interval before taking automatic action.  It
would be necessary to have these mechanisms before any kind of automatic
repair ability could be implemented.

Thanks,
Shawn



Re: How to remove an Alert

2015-03-24 Thread Shawn Heisey
On 3/23/2015 2:35 PM, jack.met...@hp.com wrote:
> I have a problem with [ ... briefly describe your problem here ... ]
> 
>   [ ... insert additional info here - keep it short and to the point ... ]
> 
> Below are some SPM graphs showing the state of my system.
> Here's the 'Threads' graph:
>   https://apps.sematext.com/spm-reports/s/aFUIR1fecb

You've used some kind of boilerplate help request, but forgot to edit it
for your specific problem.

Solr doesn't send alerts, so the subject of your message makes no sense
in a Solr context, and you haven't indicated how it connects with the
SPM graph you linked.

You'll need to ask an actual question and provide relevant details from
your system to support your question.

Thanks,
Shawn



Re: document contained more than 100000 characters

2015-03-24 Thread Shawn Heisey
On 3/23/2015 3:08 AM, Srinivas wrote:
> Present in my project we are using apache tika for reading metadata of the
> file. So whenever we handle large files (containing more than 100000
> characters), Tika generates the error that the file contained more than
> 100000 characters. So is it possible to handle large files using
> Tika? Please let me know.

This sounds like a Tika problem.  This is a solr mailing list.  You may
find some Tika expertise here, but this is the incorrect place for a
question about Tika.

Solr does use the Tika parser, in the contrib module for the
ExtractingRequestHandler.  I have never heard of such a limitation in
the context of the ExtractingRequestHandler, and I've heard some people
complain about OutOfMemory exceptions when they index 4 gigabyte PDF
files with our extracting handler ... so I am guessing that you are
using Tika in your own software.  If that is correct, you'll need to ask
your question on a Tika mailing list.

Thanks,
Shawn



Re: TooManyBasicQueries?

2015-03-24 Thread Ian Rose
Ah yes, right you are.  I had thought that `surround` required a different
endpoint, but I see now that someone is using a surround query.
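
(In case it helps anyone who finds this later: since the query turned out to
be legitimate in our case, I believe the limit can also be raised
per-request with a local param rather than a config file, e.g.
q={!surround maxBasicQueries=5000}... - but double-check that against your
Solr version.)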

Many thanks!



Re: How to verify a document is indexed by all replicas

2015-03-24 Thread Erick Erickson
You can always issue a *:* query, but it'd have to be at least your
autoSoftCommit interval ago since the soft commit trigger will have
slightly different wall clock times.

But it shouldn't be necessary to wait I don't think. Since the
indexing request doesn't succeed until the docs have been written to
the tlogs, and since the tlogs will be replayed in the event of a
problem your data should be fine. Of course if you're indexing at a
very fast rate and your tlog is huge, it'll take a while
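
If you do want to spot-check a single replica, you can also hit its core
directly and add distrib=false so the request isn't routed anywhere else,
something like (core name is illustrative):

http://host:8983/solr/mycollection_shard1_replica1/select?q=*:*&rows=0&distrib=false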

FWIW,
Erick



Re: Solr replicas going in recovering state during heavy indexing

2015-03-24 Thread Erick Erickson
What do the Solr logs show happens on those servers when they go into
recovery? What have you tried to do to diagnose the problem? You might
review: http://wiki.apache.org/solr/UsingMailingLists

The first thing I'd check, though, is whether you're seeing large GC
pauses that exceed the Zookeeper timeout, thus ZK thinks the replica
is down and puts it into recovery. You can get this info by tracking
the GC cycles as here:
https://lucidworks.com/blog/garbage-collection-bootcamp-1-0/, the
section "getting a view into garbage collection"

Best,
Erick



Re: Auto naming replicas via ADDREPLICA

2015-03-24 Thread Shawn Heisey
On 3/23/2015 10:48 AM, Shai Erera wrote:
> The 'name' param isn't set when I send the URL request (and it's also not
> specified in the reference guide), but only when I add the replica using
> SolrJ. I then tweaked my code to do the following:
> 
>   final CollectionAdminRequest.AddReplica addReplicaRequest = new
> CollectionAdminRequest.AddReplica() {
> @Override
> public SolrParams getParams() {
>   final ModifiableSolrParams params = (ModifiableSolrParams)
> super.getParams();
>   params.remove(CoreAdminParams.NAME);
>   return params;
> }
>   };
> 
> And voila, the core is now also named mycollection_shard1_replica2, and I'm
> even able to add as many replicas as I want on this node (where before it
> failed since 'mycollection' already existed).
> 
> The 'name' parameter is added by
> CollectionSpecificAdminRequest.getParams(). So how would you suggest to fix
> it:
> 
>1. Remove it in AddReplica.getParams() -- replicas will always be
>auto-named. It makes sense as users didn't have control over it before, and
>maybe they shouldn't.
>2. Add a setCoreName to AddReplica request -- this would be nice if
>someone wanted to control the name of the added replica, but otherwise
>should not be included in the request
> 
> Or maybe we fix the bug by doing #1 and consider #2 as a new feature "allow
> naming replicas"?

Doing both sounds like a good solution to me.  I'm trying to think of
some cautionary text for the javadoc on the new method, but I'm not
really sure what it should say.  Perhaps something like "when this
method is not used, the new core will receive a name like
collection_shardN_replicaN, be aware that if you override it,
understanding the collection layout may be more difficult."

I'm hoping Mark and/or Yonik (or someone else, if they know) can comment
about why the AddReplica code had that behavior and whether this is a
good idea in the larger SolrCloud environment.

Thanks,
Shawn



Re: Unable to setup solr cloud with multiple collections.

2015-03-24 Thread Erick Erickson
Why are you doing this in the first place? SolrCloud and master/slave
are fundamentally different. When running in SolrCloud mode, there is
no need whatsoever to configure replication as per the Wiki link
you've outlined above, that's for the older style master/slave setups.

Just change it back and watch the magic would be my advice.

So if you'd tell us why you thought this was necessary, perhaps we can
suggest alternatives because from a quick glance it looks unnecessary,
and in fact harmful.

Best,
Erick

On Mon, Mar 23, 2015 at 10:08 PM, sthita  wrote:
> I have newly created a new collection and activated the replication for 4
> nodes(Including masters).
> After doing the config changes as suggested on
> http://wiki.apache.org/solr/SolrReplication
> 
> The nodes of the newly created collections are down on solr cloud. We are
> not able to add or remove any document on newly created core i.e dict_cn in
> our case. All the configuration files  look ok on solr cloud
>
> 
>
> This is my replication changes on solrconfig.xml
>
> <requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy">
>   <lst name="master">
>     <str name="replicateAfter">commit</str>
>     <str name="replicateAfter">startup</str>
>     <str name="confFiles">solrconfig_cn.xml,schema_cn.xml</str>
>   </lst>
>   <lst name="slave">
>     <str name="masterUrl">http://mail:8983/solr/dict_cn</str>
>   </lst>
> </requestHandler>
>
> Note: I am using solr 4.4.0, zookeeper-3.4.5
>
> Can anyone help me on this ?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Unable-to-setup-solr-cloud-with-multiple-collections-tp4194833.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Erick Erickson
Well, there's a ticket out there for "thousands of collections on a
single machine", although this is way out there. I often see 10-20 small
cores on a 4-8 core machine if they're reasonably small (a few million
docs). I see a single replica strain a 128G, 16-core machine if it has
300M docs

Which is a way of saying "ya gotta test with your data/query mix".

Wish there was a better answer.
Erick



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Shards per collection, or across all collections on the node?

It will all depend on:

1. Your ingestion/indexing rate. High, medium or low?
2. Your query access pattern. Note that a typical query fans out to all
shards, so having more shards than CPU cores means less parallelism.
3. How many collections you will have per node.

In short, it depends on what you want to achieve, not some limit of Solr
per se.

Why are you even sharding the node anyway? Why not just run with a single
shard per node, and do sharding by having separate nodes, to maximize
parallel processing and availability?

Also be careful to be clear about using the Solr term "shard" (a slice,
across all replica nodes) as distinct from the Elasticsearch term "shard"
(a single slice of an index for a single replica, analogous to a Solr
"core".)


-- Jack Krupansky



Re: Issues to create new core

2015-03-24 Thread Erick Erickson
Tell us all the steps you went through to do this. Note that you
should _not_ be using the core admin in the admin UI if you're working
with SolrCloud.

For stand-alone Solr, the message above is probably caused by your not
having a conf directory set up already. The core admin UI expects that
you have a pre-existing directory with a "conf" directory that
contains solrconfig.xml, schema.xml, and all the rest of the
configuration files. You can specify this via some of the parameters
on the admin UI screen (see instanceDir and dataDir). Each core must
be in a separate directory or Bad Things Happen.
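
On Solr 5.0 the easier route is usually the bin/solr script, which copies a
named configset into place before creating the core; something like this
(run from the Solr install dir, with -d naming one of the configsets under
server/solr/configsets):

bin/solr create_core -c datos -d sample_techproducts_configs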

HTH,
Erick



Re: How to verify a document is indexed by all replicas

2015-03-24 Thread Shai Erera
Thanks Erick,

When a replica is down, no updates are sent to it. When it comes back up,
it discovers that it needs to catch-up with the leader. If there are many
events it falls back to index replication (slower). During this period of
time, is the replica considered ACTIVE or RECOVERING?

And, can I assume that at any given moment (aside from ZK connection
timeouts etc.) when I check the replicas' state, all the ones that report
ACTIVE are in sync with each other?

Shai



Setting up SOLR 5 from an RPM

2015-03-24 Thread Tom Evans
Hi all

We're migrating to SOLR 5 (from 4.8), and our infrastructure guys
would prefer we installed SOLR from an RPM rather than extracting the
tarball where we need it. They are creating the RPM file themselves,
and it installs an init.d script and the equivalent of the tarball to
/opt/solr.

We're having problems running SOLR from the installed files, as SOLR
wants to (I think) extract the WAR file and create various temporary
files below /opt/solr/server.

We currently have this structure:

/data/solr - root directory of our solr instance
/data/solr/{logs,run} - log/run directories
/data/solr/cores - configuration for our cores and solr.in.sh
/opt/solr - the RPM installed solr 5

The user running solr can modify anything under /data/solr, but
nothing under /opt/solr.

Is this sort of configuration supported? Am I missing some variable in
our solr.in.sh that sets where temporary files can be extracted? We
currently set:

SOLR_PID_DIR=/data/solr/run
SOLR_HOME=/data/solr/cores
SOLR_LOGS_DIR=/data/solr/logs


Cheers

Tom


Re: maxReplicasPerNode

2015-03-24 Thread Anshum Gupta
Yes, it applies to both. Solr wouldn't auto-add replicas in either of those
cases (or any other case) to meet the rf specified at create time.




-- 
Anshum Gupta


Re: Auto naming replicas via ADDREPLICA

2015-03-24 Thread Anshum Gupta
It certainly looks like a bug; the name shouldn't be added to the
request automatically.
Can you confirm what version of Solr are you using?

If it turns out to be a bug in 5x/trunk I'll create a JIRA and fix it per
both #1 and #2.

On Mon, Mar 23, 2015 at 9:48 AM, Shai Erera  wrote:

> Shawn, that was a great tip!
>
> When I tried the URL, the core was named as expected
> (mycollection_shard1_replica2). I then compared the URLs as reported in the
> logs, and I believe I found the bug:
>
> SolrJ: [admin] webapp=null path=/admin/collections params={shard=shard1&
> *name=mycollection*&action=ADDREPLICA&*collection=mycollection*
> &wt=javabin&version=2}
>
> The 'name' param isn't set when I send the URL request (and it's also not
> specified in the reference guide), but only when I add the replica using
> SolrJ. I then tweaked my code to do the following:
>
>   final CollectionAdminRequest.AddReplica addReplicaRequest = new
> CollectionAdminRequest.AddReplica() {
> @Override
> public SolrParams getParams() {
>   final ModifiableSolrParams params = (ModifiableSolrParams)
> super.getParams();
>   params.remove(CoreAdminParams.NAME);
>   return params;
> }
>   };
>
> And voila, the core is now also named mycollection_shard1_replica2, and I'm
> even able to add as many replicas as I want on this node (where before it
> failed since 'mycollection' already existed).
>
> The 'name' parameter is added by
> CollectionSpecificAdminRequest.getParams(). So how would you suggest to fix
> it:
>
>1. Remove it in AddReplica.getParams() -- replicas will always be
>auto-named. It makes sense as users didn't have control over it before,
> and
>maybe they shouldn't.
>2. Add a setCoreName to AddReplica request -- this would be nice if
>someone wanted to control the name of the added replica, but otherwise
>should not be included in the request
>
> Or maybe we fix the bug by doing #1 and consider #2 as a new feature "allow
> naming replicas"?
>
> Shai
>
>
> On Mon, Mar 23, 2015 at 6:14 PM, Shawn Heisey  wrote:
>
> > On 3/23/2015 9:27 AM, Shai Erera wrote:
> > > I have a Solr cluster started (all programmatically) with one Solr
> node,
> > > one collection and one shard. I set the replicationFactor to 1. The
> name
> > of
> > > the result core was set to mycollection_shard1_replica1.
> > >
> > > I then start a second Solr node and issue an ADDREPLICA command as
> > > described in the reference guide, using following code:
> > >
> > >   final CollectionAdminRequest.AddReplica addReplicaRequest = new
> > > CollectionAdminRequest.AddReplica();
> > >   addReplicaRequest.setCollectionName("mycollection");
> > >   addReplicaRequest.setShardName("shard1");
> > >   final CollectionAdminResponse response =
> > > addReplicaRequest.process(solrClient);
> > >
> > > The replica is added under a core named "mycollection" and not e.g.
> > > mycollection_shard1_replica2.
> >
> > I'd call that a bug.
> >
> > > BTW, the example in the reference guide shows that issuing the request:
> > >
> > >
> >
> http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard2&node=192.167.1.2:8983_solr
> > >
> > > Results in
> > >
> > > 
> > >   
> > > 0
> > > 3764
> > >   
> > >   
> > > 
> > >   
> > > 0
> > > 3450
> > >   
> > > *  test2_shard2_replica4*
> >
> > Did you try out a URL like that to see whether it also results in the
> > misnamed core, or if it behaves correctly as the reference guide
> indicates?
> >
> > If the URL behaves correctly, I'd be curious what Solr logs for the URL
> > request and the SolrJ request.
> >
> > Thanks,
> > Shawn
> >
> >
>



-- 
Anshum Gupta


Re: maxReplicasPerNode

2015-03-24 Thread Shai Erera
Thanks guys, this makes sense I guess, from Solr's side.

Perhaps we can have a new Collections API like REDIRECTREPLICA or
something, that will redirect a replica to the new node.
This API can simply do ADDREPLICA on the new node, and DELETEREPLICA of the
node that doesn't exist anymore.

I guess I need to implement that for my use case now (I know that if a node
goes down, it won't ever come back up again - there will be a new node
replacing it), so I'll see how it plays out and if it works well, I'll open
a JIRA issue. In my case, when the new node comes up, it can check the
cluster's status, and if it detects an orphaned replica, it will add
itself as a new replica and delete the orphaned one.
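
Roughly, I'm thinking of something like this with SolrJ (a sketch only,
assuming a CloudSolrClient and a collection named 'mycollection'):

ZkStateReader zkStateReader = cloudClient.getZkStateReader();
ClusterState clusterState = zkStateReader.getClusterState();
Set<String> liveNodes = clusterState.getLiveNodes();
for (Slice slice : clusterState.getSlices("mycollection")) {
  for (Replica replica : slice.getReplicas()) {
    // a replica whose node is no longer live is "orphaned"
    if (!liveNodes.contains(replica.getNodeName())) {
      // ADDREPLICA on the new node, then DELETEREPLICA for 'replica'
    }
  }
}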

Let me know if you see a problem with how I intend to address that.

Shai


Re: Auto naming replicas via ADDREPLICA

2015-03-24 Thread Shai Erera
I use vanilla 5.0. I intended to fix it myself, but if you want to go
ahead, I'd be happy to review the patch.

Shai



Re: Solr 5.0 --> "IllegalStateException: unexpected docvalues type NONE" on result grouping

2015-03-24 Thread Shawn Heisey

On 3/12/2015 3:36 PM, Alexandre Rafalovitch wrote:
> Manual optimize is no longer needed for modern Solr. It does great
> optimization automatically. The only reason I recommended it here is
> to make sure that all segments are brought up to the latest version
> and the deleted documents are purged. That's something that also would
> happen automatically eventually, but "eventually" was not an option
> for you.
>
> I am glad this helped. I am not 100% sure if you have to do it on each
> shard in SolrCloud mode, but I suspect so.


In SolrCloud, whenever you send an optimize command to any shard replica 
in a collection, the entire collection will be optimized.  SolrCloud 
will do the optimization sequentially, not in parallel.  There is 
currently no way to optimize only one shard replica, and as far as I 
know, there is no way to ask for a parallel optimization.


Alexandre's comments about the necessity of optimization (whether it's 
SolrCloud or not) are spot on.  The only time that optimization should be 
done on a modern Solr index is when you have a lot of deleted documents 
and want to clean those up, either to reclaim disk space or remove them 
from the relevancy calculation.
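
(Triggering one is just an update request with the optimize flag, e.g.
http://localhost:8983/solr/mycollection/update?optimize=true -- and in
SolrCloud that single call will walk the whole collection as described
above.)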


Most people do see a performance boost on an optimized index compared to 
a non-optimized index, but with a modern Solr install, you might 
actually see better performance on a multi-segment index when the 
indexing rate is high, because Lucene is moving to a model where there 
are per-segment caches that are not invalidated at commit time, only at 
merge time.


Thanks,
Shawn



Re: How to verify a document is indexed by all replicas

2015-03-24 Thread Shalin Shekhar Mangar
Hi Shai,

To your original question on how to know if a document has been indexed at
all replicas -- You can add a min_rf=true parameter to your indexing
request and then Solr will add information to the response about how many
replicas gave an ack to the leader. So if the returned number is equal to
the number of replicas, you can be sure that the doc has been indexed
everywhere.
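
A rough SolrJ sketch (from memory, so treat the names as approximate; the
parameter is the minimum factor you want acknowledged, and the achieved
factor comes back as 'rf' in the response):

UpdateRequest req = new UpdateRequest();
req.add(doc);
req.setParam(UpdateRequest.MIN_REPFACT, "2"); // i.e. min_rf=2
NamedList<Object> rsp = cloudClient.request(req);
int achievedRf = cloudClient.getMinAchievedReplicationFactor("mycollection", rsp);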

More comments inline:

On Tue, Mar 24, 2015 at 8:18 AM, Shai Erera  wrote:

> Thanks Erick,
>
> When a replica is down, no updates are sent to it. When it comes back up,
> it discovers that it needs to catch-up with the leader. If there are many
> events it falls back to index replication (slower). During this period of
> time, is the replica considered ACTIVE or RECOVERING?
>
>
It is marked as recovering.


> And, can I assume that at any given moment (aside from ZK connection
> timeouts etc.) when I check the replicas' state, all the ones that report
> ACTIVE are in sync with each other?
>
>
Yes, 'active' replicas should be in sync but autoCommits can cause
inconsistency between replicas as to what is visible to searchers (even if
all replicas have indexed the same data). Also, checking the state of the
replica is not enough, one should always check for the state=active and
live-ness of the replica i.e. the node is marked live under /live_nodes in
ZK.


> Shai
>
> On Tue, Mar 24, 2015 at 5:04 PM, Erick Erickson 
> wrote:
>
> > You can always issue a *:* query, but it'd have to be at least your
> > autoSoftCommit interval ago since the soft commit trigger will have
> > slightly different wall clock times.
> >
> > But it shouldn't be necessary to wait I don't think. Since the
> > indexing request doesn't succeed until the docs have been written to
> > the tlogs, and since the tlogs will be replayed in the event of a
> > problem your data should be fine. Of course if you're indexing at a
> > very fast rate and your tlog is huge, it'll take a while
> >
> > FWIW,
> > Erick
> >
> > On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera  wrote:
> > > Hi
> > >
> > > Is there a recommended, preferably fast, way to check that a document
> is
> > > indexed by all replicas? I currently do that by issuing a search
> request
> > to
> > > each replica, but was wondering if there's a faster way.
> > >
> > > Even better, is there a way to verify all replicas of a shard are
> > > "up-to-date", e.g. by comparing their version or something? By
> > "up-to-date"
> > > I mean that they've all processed the same update requests that came
> > > through.
> > >
> > > If there's a replica lagging behind, I'd like to wait for it to catch
> up,
> > > something like a checkpoint(), before I continue sending more updates.
> > >
> > > Shai
> >
>



-- 
Regards,
Shalin Shekhar Mangar.


One of three cores is missing userData and lastModified fields from /admin/cores

2015-03-24 Thread Aaron Daubman
Hey All,

On a Solr server running 4.10.2 with three cores, two return the expected
info from /solr/admin/cores?wt=json but the third is missing userData and
lastModified.

The first (artists) and third (tracks) cores from the linked screenshot are
the ones I care about. Unfortunately, the third (tracks) is the one missing
lastModified.

As far as I can see, that comes from:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_2/solr/core/src/java/org/apache/solr/handler/admin/LukeRequestHandler.java#L568

I can't trace back to see what would possibly cause getUserData() to return
an empty Object, but that appears to be what is happening?

For these servers, indexes that are pre-optimized are shipped over to the
server and the server is re-started... nothing is actually ever committed
on these live servers. This should behave exactly the same for artists and
tracks, even though tracks is the one always missing lastModified.

Here's the output in img format, I'll paste the full JSON[1] below:
http://monosnap.com/image/XMyAfk5z3AvHgY39m0qAKAGlc3RACI.png

I'd like to be able to provide access for clients to grab the lastModified
time for both indices so that they can see how old/stale the data they are
getting results back from is...

...alternately, is there any other way to expose easily how old (last
modified time?) the index for a core is?
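
(For what it's worth, the same status can be pulled programmatically; a
rough SolrJ sketch, with the base URL and core name as placeholders, and
assuming the SolrJ 5.x API -- on 4.x the client class is HttpSolrServer:)

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.client.solrj.response.CoreAdminResponse;
import org.apache.solr.common.util.NamedList;

public class CoreStatusCheck {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr")) {
      CoreAdminResponse status = CoreAdminRequest.getStatus("tracks", client);
      // The per-core status carries an "index" section; on a healthy core it
      // should contain "userData" and "lastModified".
      @SuppressWarnings("unchecked")
      NamedList<Object> index =
          (NamedList<Object>) status.getCoreStatus("tracks").get("index");
      System.out.println(index.get("lastModified"));
    }
  }
}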

Thanks,
  Aaron

1: Full JSON
---snip---
{
  "responseHeader": {
"status": 0,
"QTime": 10
  },
  "defaultCoreName": "collection1",
  "initFailures": {
  },
  "status": {
"artists": {
  "name": "artists",
  "isDefaultCore": false,
  "instanceDir": "/opt/solr/search/solr/artists/",
  "dataDir": "/opt/solr/search/solr/artists/",
  "config": "solrconfig.xml",
  "schema": "schema.xml",
  "startTime": "2015-03-24T14:12:23.667Z",
  "uptime": 7335696,
  "index": {
"numDocs": 3360380,
"maxDoc": 3360380,
"deletedDocs": 0,
"indexHeapUsageBytes": 63366952,
"version": 421,
"segmentCount": 1,
"current": true,
"hasDeletions": false,
"directory":
"org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/artists/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/artists/index",
"userData": {
  "commitTimeMSec": "1427133705908"
},
"lastModified": "2015-03-23T18:01:45.908Z",
"sizeInBytes": 25341305528,
"size": "23.6 GB"
  }
},
"banana-int": {
  "name": "banana-int",
  "isDefaultCore": false,
  "instanceDir": "/opt/solr/search/solr/banana-int/",
  "dataDir": "/opt/solr/search/solr/banana-int/data/",
  "config": "solrconfig.xml",
  "schema": "schema.xml",
  "startTime": "2015-03-24T14:12:22.895Z",
  "uptime": 7336472,
  "index": {
"numDocs": 3,
"maxDoc": 3,
"deletedDocs": 0,
"indexHeapUsageBytes": 17448,
"version": 135,
"segmentCount": 3,
"current": true,
"hasDeletions": false,
"directory":
"org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/opt/solr/search/solr/banana-int/data/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/banana-int/data/index;
maxCacheMB=48.0 maxMergeSizeMB=4.0)",
"userData": {
  "commitTimeMSec": "1412796723183"
},
"lastModified": "2014-10-08T19:32:03.183Z",
"sizeInBytes": 16196,
"size": "15.82 KB"
  }
},
"tracks": {
  "name": "tracks",
  "isDefaultCore": false,
  "instanceDir": "/opt/solr/search/solr/tracks/",
  "dataDir": "/opt/solr/search/solr/tracks/",
  "config": "solrconfig.xml",
  "schema": "schema.xml",
  "startTime": "2015-03-24T14:12:23.656Z",
  "uptime": 7335713,
  "index": {
"numDocs": 53268126,
"maxDoc": 53268126,
"deletedDocs": 0,
"indexHeapUsageBytes": 517650552,
"version": 100,
"segmentCount": 1,
"current": true,
"hasDeletions": false,
"directory":
"org.apache.lucene.store.MMapDirectory:MMapDirectory@/opt/solr/search/solr/tracks/index
lockFactory=NativeFSLockFactory@/opt/solr/search/solr/tracks/index",
"userData": {
},
"sizeInBytes": 122892905007,
"size": "114.45 GB"
  }
}
  }
}
---snip---


Re: Auto naming replicas via ADDREPLICA

2015-03-24 Thread Anshum Gupta
Either of them works for me. If you want to get your hands dirty, please go
ahead.
I can review/provide feedback if you need anything there. I'll just create
a JIRA to begin with.

On Tue, Mar 24, 2015 at 9:15 AM, Shai Erera  wrote:

> I use vanilla 5.0. I intended to fix it myself, but if you want to go
> ahead, I'd be happy to review the patch.
>
> Shai
>
> On Tue, Mar 24, 2015 at 6:11 PM, Anshum Gupta 
> wrote:
>
> > It certainly looks like a bug and the name shouldn't be added to the
> > request automatically.
> > Can you confirm what version of Solr you are using?
> >
> > If it turns out to be a bug in 5x/trunk I'll create a JIRA and fix it for
> > both #1 and #2.
> >
> > On Mon, Mar 23, 2015 at 9:48 AM, Shai Erera  wrote:
> >
> > > Shawn, that was a great tip!
> > >
> > > When I tried the URL, the core was named as expected
> > > (mycollection_shard1_replica2). I then compared the URLs as reported in
> > the
> > > logs, and I believe I found the bug:
> > >
> > > SolrJ: [admin] webapp=null path=/admin/collections
> params={shard=shard1&
> > > *name=mycollection*&action=ADDREPLICA&*collection=mycollection*
> > > &wt=javabin&version=2}
> > >
> > > The 'name' param isn't set when I send the URL request (and it's also
> not
> > > specified in the reference guide), but only when I add the replica
> using
> > > SolrJ. I then tweaked my code to do the following:
> > >
> > >   final CollectionAdminRequest.AddReplica addReplicaRequest = new
> > > CollectionAdminRequest.AddReplica() {
> > > @Override
> > > public SolrParams getParams() {
> > >   final ModifiableSolrParams params = (ModifiableSolrParams)
> > > super.getParams();
> > >   params.remove(CoreAdminParams.NAME);
> > >   return params;
> > > }
> > >   };
> > >
> > > And voila, the core is now also named mycollection_shard1_replica2, and
> > I'm
> > > even able to add as many replicas as I want on this node (where before
> it
> > > failed since 'mycollection' already existed).
> > >
> > > The 'name' parameter is added by
> > > CollectionSpecificAdminRequest.getParams(). So how would you suggest to
> > fix
> > > it:
> > >
> > >1. Remove it in AddReplica.getParams() -- replicas will always be
> > >auto-named. It makes sense as users didn't have control over it
> > before,
> > > and
> > >maybe they shouldn't.
> > >2. Add a setCoreName to AddReplica request -- this would be nice if
> > >someone wanted to control the name of the added replica, but
> otherwise
> > >should not be included in the request
> > >
> > > Or maybe we fix the bug by doing #1 and consider #2 as a new feature
> > "allow
> > > naming replicas"?
> > >
> > > Shai
> > >
> > >
> > > On Mon, Mar 23, 2015 at 6:14 PM, Shawn Heisey 
> > wrote:
> > >
> > > > On 3/23/2015 9:27 AM, Shai Erera wrote:
> > > > > I have a Solr cluster started (all programmatically) with one Solr
> > > node,
> > > > > one collection and one shard. I set the replicationFactor to 1. The
> > > name
> > > > of
> > > > > the result core was set to mycollection_shard1_replica1.
> > > > >
> > > > > I then start a second Solr node and issue an ADDREPLICA command as
> > > > > described in the reference guide, using following code:
> > > > >
> > > > >   final CollectionAdminRequest.AddReplica addReplicaRequest = new
> > > > > CollectionAdminRequest.AddReplica();
> > > > >   addReplicaRequest.setCollectionName("mycollection");
> > > > >   addReplicaRequest.setShardName("shard1");
> > > > >   final CollectionAdminResponse response =
> > > > > addReplicaRequest.process(solrClient);
> > > > >
> > > > > The replica is added under a core named "mycollection" and not e.g.
> > > > > mycollection_shard1_replica2.
> > > >
> > > > I'd call that a bug.
> > > >
> > > > > BTW, the example in the reference guide shows that issuing the
> > request:
> > > > >
> > > > >
> > > >
> > >
> >
> http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard2&node=192.167.1.2:8983_solr
> > > > >
> > > > > Results in
> > > > >
> > > > > <response>
> > > > >   <lst name="responseHeader">
> > > > >     <int name="status">0</int>
> > > > >     <int name="QTime">3764</int>
> > > > >   </lst>
> > > > >   <lst name="success">
> > > > >     <lst>
> > > > >       <lst name="responseHeader">
> > > > >         <int name="status">0</int>
> > > > >         <int name="QTime">3450</int>
> > > > >       </lst>
> > > > >       <str name="core">test2_shard2_replica4</str>
> > > > >     </lst>
> > > > >   </lst>
> > > > > </response>
> > > >
> > > > Did you try out a URL like that to see whether it also results in the
> > > > misnamed core, or if it behaves correctly as the reference guide
> > > indicates?
> > > >
> > > > If the URL behaves correctly, I'd be curious what Solr logs for the
> URL
> > > > request and the SolrJ request.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Anshum Gupta
> >
>



-- 
Anshum Gupta


Re: How to verify a document is indexed by all replicas

2015-03-24 Thread Shai Erera
>
> You can add a min_rf parameter to your indexing
>

Yeah, I read about it, but it doesn't help me in this case: I'm
implementing a monitoring component over a SolrCloud instance, so I have
no handle on the indexing client. I would like the monitor to check the
replicas and report whether all replicas are in sync, some are not in
sync, or e.g. replicas 2 and 3 are further ahead than replica1.

Also, checking the state of the
> replica is not enough, one should always check for the state=active and
> live-ness of the replica i.e. the node is marked live under /live_nodes in
> ZK.
>

Thanks, I've looked at code samples in tests and saw this is done, so I
copied the logic. E.g. an isReplicaAlive(Replica replica) method checks both
the replica's state, as well as that the node it's on is in the cluster
state's live nodes.
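
For reference, roughly what that helper looks like (a minimal sketch against
the SolrJ cluster-state API; the class name and exact checks are my own):

import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.ZkStateReader;

public class ReplicaChecks {
  // A replica only counts as alive if its state is "active" AND the node
  // hosting it is listed under /live_nodes in the cluster state.
  public static boolean isReplicaAlive(ClusterState clusterState, Replica replica) {
    boolean active = "active".equals(replica.getStr(ZkStateReader.STATE_PROP));
    boolean live = clusterState.getLiveNodes().contains(replica.getNodeName());
    return active && live;
  }
}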

Also, verifying replicas are in sync via searching is not the best solution
at all. Apart from not being that fast, it also doesn't factor in documents
that are in the tlog, or in IW's RAM buffer, or even that a document may
have been updated. So I will change my test to ensure that all replicas
of a slice are in state active (and on a live node) and rely on that being
OK.

Shai

On Tue, Mar 24, 2015 at 6:39 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Hi Shai,
>
> To your original question on how to know if a document has been indexed at
> all replicas -- you can add a min_rf parameter (set to the minimum
> replication factor you want acknowledged) to your indexing request and then
> Solr will add information to the response about how many replicas gave an
> ack to the leader. So if the returned number is equal to the number of
> replicas, you can be sure that the doc has been indexed everywhere.
>
> More comments inline:
>
> On Tue, Mar 24, 2015 at 8:18 AM, Shai Erera  wrote:
>
> > Thanks Erick,
> >
> > When a replica is down, no updates are sent to it. When it comes back up,
> > it discovers that it needs to catch-up with the leader. If there are many
> > events it falls back to index replication (slower). During this period of
> > time, is the replica considered ACTIVE or RECOVERING?
> >
> >
> It is marked as recovering.
>
>
> > And, can I assume that at any given moment (aside from ZK connection
> > timeouts etc.) when I check the replicas' state, all the ones that report
> > ACTIVE are in sync with each other?
> >
> >
> Yes, 'active' replicas should be in sync but autoCommits can cause
> inconsistency between replicas as to what is visible to searchers (even if
> all replicas have indexed the same data). Also, checking the state of the
> replica is not enough, one should always check for the state=active and
> live-ness of the replica i.e. the node is marked live under /live_nodes in
> ZK.
>
>
> > Shai
> >
> > On Tue, Mar 24, 2015 at 5:04 PM, Erick Erickson  >
> > wrote:
> >
> > > You can always issue a *:* query, but it'd have to be at least your
> > > autoSoftCommit interval ago since the soft commit trigger will have
> > > slightly different wall clock times.
> > >
> > > But it shouldn't be necessary to wait I don't think. Since the
> > > indexing request doesn't succeed until the docs have been written to
> > > the tlogs, and since the tlogs will be replayed in the event of a
> > > problem your data should be fine. Of course if you're indexing at a
> > > very fast rate and your tlog is huge, it'll take a while
> > >
> > > FWIW,
> > > Erick
> > >
> > > On Tue, Mar 24, 2015 at 4:59 AM, Shai Erera  wrote:
> > > > Hi
> > > >
> > > > Is there a recommended, preferably fast, way to check that a document
> > is
> > > > indexed by all replicas? I currently do that by issuing a search
> > request
> > > to
> > > > each replica, but was wondering if there's a faster way.
> > > >
> > > > Even better, is there a way to verify all replicas of a shard are
> > > > "up-to-date", e.g. by comparing their version or something? By
> > > "up-to-date"
> > > > I mean that they've all processed the same update requests that came
> > > > through.
> > > >
> > > > If there's a replica lagging behind, I'd like to wait for it to catch
> > up,
> > > > something like a checkpoint(), before I continue sending more
> updates.
> > > >
> > > > Shai
> > >
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Regarding detection of duplication

2015-03-24 Thread Iniyan
Hi,

My requirement is to detect duplication in titles after removing punctuation
marks, stop words, and accented characters.

I am trying to do an exact match first. After that I am thinking of applying
filters.

I have tried solr.KeywordTokenizerFactory. It does exact matching. But
when I add a stop filter to the field type, the stop filter is not working.

But if I apply solr.StandardTokenizerFactory, I am not getting the exact
match.


Title:

What is a apple?
What is an apple?
What is the apple?

When I type "What is a apple" I need to get all the above.

Could you please let me know whether there is any tokenizer/filter matching
my requirement?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Regarding-detection-of-duplication-tp4194975.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
Let me give a bit of background.  Our Solr cluster is multi-tenant, where
we use one collection for each of our customers.  In many cases, these
customers are very tiny, so their collection consists of just a single
shard on a single Solr node.  In fact, a non-trivial number of them are
totally empty (e.g. trial customers that never did anything with their
trial account).  However there are also some customers that are larger,
requiring their collection to be sharded.  Our strategy is to try to keep
the total documents in any one shard under 20 million (honestly not sure
where my coworker got that number from - I am open to alternatives but I
realize this is heavily app-specific).

So my original question is not related to indexing or query traffic, but
just the sheer number of cores.  For example, if I have 10 active cores on
a machine and everything is working fine, should I expect that everything
will still work fine if I add 10 nearly-idle cores to that machine?  What
about 100?  1000?  I figure the overhead of each core is probably fairly
low but at some point starts to matter.

Does that make sense?
- Ian


On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky 
wrote:

> Shards per collection, or across all collections on the node?
>
> It will all depend on:
>
> 1. Your ingestion/indexing rate. High, medium or low?
> 2. Your query access pattern. Note that a typical query fans out to all
> shards, so having more shards than CPU cores means less parallelism.
> 3. How many collections you will have per node.
>
> In short, it depends on what you want to achieve, not some limit of Solr
> per se.
>
> Why are you even sharding the node anyway? Why not just run with a single
> shard per node, and do sharding by having separate nodes, to maximize
> parallel processing and availability?
>
> Also be careful to be clear about using the Solr term "shard" (a slice,
> across all replica nodes) as distinct from the Elasticsearch term "shard"
> (a single slice of an index for a single replica, analogous to a Solr
> "core".)
>
>
> -- Jack Krupansky
>
> On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose  wrote:
>
> > Hi all -
> >
> > I'm sure this topic has been covered before but I was unable to find any
> > clear references online or in the mailing list.
> >
> > Are there any rules of thumb for how many cores (aka shards, since I am
> > using SolrCloud) is "too many" for one machine?  I realize there is no
> one
> > answer (depends on size of the machine, etc.) so I'm just looking for a
> > rough idea.  Something like the following would be very useful:
> >
> > * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
> > server without any problems.
> > * I have never heard of anyone successfully running X cores/shards on a
> > single machine, even if you throw a lot of hardware at it.
> >
> > Thanks!
> > - Ian
> >
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Multi-tenancy is a bad idea for a single Solr cluster. Better to give each
tenant a separate Solr instance that you spin up and spin down based on
demand.

Think about it: If there are a small number of tenants, just giving each
their own machine will be cheaper than the effort spent managing a
multi-tenant cluster, and if there are a large number of tenants, or even a
moderate number of large tenants, you can't expect them to all run
reasonably on a relatively small cluster. Think about scalability.


-- Jack Krupansky

On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose  wrote:

> Let me give a bit of background.  Our Solr cluster is multi-tenant, where
> we use one collection for each of our customers.  In many cases, these
> customers are very tiny, so their collection consists of just a single
> shard on a single Solr node.  In fact, a non-trivial number of them are
> totally empty (e.g. trial customers that never did anything with their
> trial account).  However there are also some customers that are larger,
> requiring their collection to be sharded.  Our strategy is to try to keep
> the total documents in any one shard under 20 million (honestly not sure
> where my coworker got that number from - I am open to alternatives but I
> realize this is heavily app-specific).
>
> So my original question is not related to indexing or query traffic, but
> just the sheer number of cores.  For example, if I have 10 active cores on
> a machine and everything is working fine, should I expect that everything
> will still work fine if I add 10 nearly-idle cores to that machine?  What
> about 100?  1000?  I figure the overhead of each core is probably fairly
> low but at some point starts to matter.
>
> Does that make sense?
> - Ian
>
>
> On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky  >
> wrote:
>
> > Shards per collection, or across all collections on the node?
> >
> > It will all depend on:
> >
> > 1. Your ingestion/indexing rate. High, medium or low?
> > 2. Your query access pattern. Note that a typical query fans out to all
> > shards, so having more shards than CPU cores means less parallelism.
> > 3. How many collections you will have per node.
> >
> > In short, it depends on what you want to achieve, not some limit of Solr
> > per se.
> >
> > Why are you even sharding the node anyway? Why not just run with a single
> > shard per node, and do sharding by having separate nodes, to maximize
> > parallel processing and availability?
> >
> > Also be careful to be clear about using the Solr term "shard" (a slice,
> > across all replica nodes) as distinct from the Elasticsearch term "shard"
> > (a single slice of an index for a single replica, analogous to a Solr
> > "core".)
> >
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose  wrote:
> >
> > > Hi all -
> > >
> > > I'm sure this topic has been covered before but I was unable to find
> any
> > > clear references online or in the mailing list.
> > >
> > > Are there any rules of thumb for how many cores (aka shards, since I am
> > > using SolrCloud) is "too many" for one machine?  I realize there is no
> > one
> > > answer (depends on size of the machine, etc.) so I'm just looking for a
> > > rough idea.  Something like the following would be very useful:
> > >
> > > * People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
> > > server without any problems.
> > > * I have never heard of anyone successfully running X cores/shards on a
> > > single machine, even if you throw a lot of hardware at it.
> > >
> > > Thanks!
> > > - Ian
> > >
> >
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Shalin Shekhar Mangar
Sorry Jack. That doesn't scale when you have millions of customers. And
these are good problems to have!

On Tue, Mar 24, 2015 at 10:47 AM, Jack Krupansky 
wrote:

> Multi-tenancy is a bad idea for a single Solr cluster. Better to give each
> tenant a separate Solr instance that you spin up and spin down based on
> demand.
>
> Think about it: If there are a small number of tenants, just giving each
> their own machine will be cheaper than the effort spent managing a
> multi-tenant cluster, and if there are a large number of tenants, or even a
> moderate number of large tenants, you can't expect them to all run
> reasonably on a relatively small cluster. Think about scalability.
>
>
> -- Jack Krupansky
>
> On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose  wrote:
>
> > Let me give a bit of background.  Our Solr cluster is multi-tenant, where
> > we use one collection for each of our customers.  In many cases, these
> > customers are very tiny, so their collection consists of just a single
> > shard on a single Solr node.  In fact, a non-trivial number of them are
> > totally empty (e.g. trial customers that never did anything with their
> > trial account).  However there are also some customers that are larger,
> > requiring their collection to be sharded.  Our strategy is to try to keep
> > the total documents in any one shard under 20 million (honestly not sure
> > where my coworker got that number from - I am open to alternatives but I
> > realize this is heavily app-specific).
> >
> > So my original question is not related to indexing or query traffic, but
> > just the sheer number of cores.  For example, if I have 10 active cores
> on
> > a machine and everything is working fine, should I expect that everything
> > will still work fine if I add 10 nearly-idle cores to that machine?  What
> > about 100?  1000?  I figure the overhead of each core is probably fairly
> > low but at some point starts to matter.
> >
> > Does that make sense?
> > - Ian
> >
> >
> > On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky <
> jack.krupan...@gmail.com
> > >
> > wrote:
> >
> > > Shards per collection, or across all collections on the node?
> > >
> > > It will all depend on:
> > >
> > > 1. Your ingestion/indexing rate. High, medium or low?
> > > 2. Your query access pattern. Note that a typical query fans out to all
> > > shards, so having more shards than CPU cores means less parallelism.
> > > 3. How many collections you will have per node.
> > >
> > > In short, it depends on what you want to achieve, not some limit of
> Solr
> > > per se.
> > >
> > > Why are you even sharding the node anyway? Why not just run with a
> single
> > > shard per node, and do sharding by having separate nodes, to maximize
> > > parallel processing and availability?
> > >
> > > Also be careful to be clear about using the Solr term "shard" (a slice,
> > > across all replica nodes) as distinct from the Elasticsearch term
> "shard"
> > > (a single slice of an index for a single replica, analogous to a Solr
> > > "core".)
> > >
> > >
> > > -- Jack Krupansky
> > >
> > > On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose 
> wrote:
> > >
> > > > Hi all -
> > > >
> > > > I'm sure this topic has been covered before but I was unable to find
> > any
> > > > clear references online or in the mailing list.
> > > >
> > > > Are there any rules of thumb for how many cores (aka shards, since I
> am
> > > > using SolrCloud) is "too many" for one machine?  I realize there is
> no
> > > one
> > > > answer (depends on size of the machine, etc.) so I'm just looking
> for a
> > > > rough idea.  Something like the following would be very useful:
> > > >
> > > > * People commonly run up to X cores/shards on a mid-sized (4 or 8
> core)
> > > > server without any problems.
> > > > * I have never heard of anyone successfully running X cores/shards
> on a
> > > > single machine, even if you throw a lot of hardware at it.
> > > >
> > > > Thanks!
> > > > - Ian
> > > >
> > >
> >
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
Don't confuse customers and tenants.

-- Jack Krupansky

On Tue, Mar 24, 2015 at 2:24 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Sorry Jack. That doesn't scale when you have millions of customers. And
> these are good problems to have!
>
> On Tue, Mar 24, 2015 at 10:47 AM, Jack Krupansky  >
> wrote:
>
> > Multi-tenancy is a bad idea for a single Solr cluster. Better to give each
> > tenant a separate Solr instance that you spin up and spin down based on
> > demand.
> >
> > Think about it: If there are a small number of tenants, just giving each
> > their own machine will be cheaper than the effort spent managing a
> > multi-tenant cluster, and if there are a large number of tenants, or even a
> > moderate number of large tenants, you can't expect them to all run
> > reasonably on a relatively small cluster. Think about scalability.
> >
> >
> > -- Jack Krupansky
> >
> > On Tue, Mar 24, 2015 at 1:22 PM, Ian Rose  wrote:
> >
> > > Let me give a bit of background.  Our Solr cluster is multi-tenant,
> where
> > > we use one collection for each of our customers.  In many cases, these
> > > customers are very tiny, so their collection consists of just a single
> > > shard on a single Solr node.  In fact, a non-trivial number of them are
> > > totally empty (e.g. trial customers that never did anything with their
> > > trial account).  However there are also some customers that are larger,
> > > requiring their collection to be sharded.  Our strategy is to try to
> keep
> > > the total documents in any one shard under 20 million (honestly not
> sure
> > > where my coworker got that number from - I am open to alternatives but
> I
> > > realize this is heavily app-specific).
> > >
> > > So my original question is not related to indexing or query traffic,
> but
> > > just the sheer number of cores.  For example, if I have 10 active cores
> > on
> > > a machine and everything is working fine, should I expect that
> everything
> > > will still work fine if I add 10 nearly-idle cores to that machine?
> What
> > > about 100?  1000?  I figure the overhead of each core is probably
> fairly
> > > low but at some point starts to matter.
> > >
> > > Does that make sense?
> > > - Ian
> > >
> > >
> > > On Tue, Mar 24, 2015 at 11:12 AM, Jack Krupansky <
> > jack.krupan...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Shards per collection, or across all collections on the node?
> > > >
> > > > It will all depend on:
> > > >
> > > > 1. Your ingestion/indexing rate. High, medium or low?
> > > > 2. Your query access pattern. Note that a typical query fans out to
> all
> > > > shards, so having more shards than CPU cores means less parallelism.
> > > > 3. How many collections you will have per node.
> > > >
> > > > In short, it depends on what you want to achieve, not some limit of
> > Solr
> > > > per se.
> > > >
> > > > Why are you even sharding the node anyway? Why not just run with a
> > single
> > > > shard per node, and do sharding by having separate nodes, to maximize
> > > > parallel processing and availability?
> > > >
> > > > Also be careful to be clear about using the Solr term "shard" (a
> slice,
> > > > across all replica nodes) as distinct from the Elasticsearch term
> > "shard"
> > > > (a single slice of an index for a single replica, analogous to a Solr
> > > > "core".)
> > > >
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Tue, Mar 24, 2015 at 9:02 AM, Ian Rose 
> > wrote:
> > > >
> > > > > Hi all -
> > > > >
> > > > > I'm sure this topic has been covered before but I was unable to
> find
> > > any
> > > > > clear references online or in the mailing list.
> > > > >
> > > > > Are there any rules of thumb for how many cores (aka shards, since
> I
> > am
> > > > > using SolrCloud) is "too many" for one machine?  I realize there is
> > no
> > > > one
> > > > > answer (depends on size of the machine, etc.) so I'm just looking
> > for a
> > > > > rough idea.  Something like the following would be very useful:
> > > > >
> > > > > * People commonly run up to X cores/shards on a mid-sized (4 or 8
> > core)
> > > > > server without any problems.
> > > > > * I have never heard of anyone successfully running X cores/shards
> > on a
> > > > > single machine, even if you throw a lot of hardware at it.
> > > > >
> > > > > Thanks!
> > > > > - Ian
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Solr and HDFS configuration

2015-03-24 Thread Joseph Obernberger
Hi All - does it make sense to run a Solr shard on a node within a 
Hadoop cluster that is not a data node?  In that case all the data that 
node processes would need to come over the network, but you get the 
benefit of more CPU for things like faceting.

Thank you!

-Joe


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Shawn Heisey
On 3/24/2015 11:22 AM, Ian Rose wrote:
> Let me give a bit of background.  Our Solr cluster is multi-tenant, where
> we use one collection for each of our customers.  In many cases, these
> customers are very tiny, so their collection consists of just a single
> shard on a single Solr node.  In fact, a non-trivial number of them are
> totally empty (e.g. trial customers that never did anything with their
> trial account).  However there are also some customers that are larger,
> requiring their collection to be sharded.  Our strategy is to try to keep
> the total documents in any one shard under 20 million (honestly not sure
> where my coworker got that number from - I am open to alternatives but I
> realize this is heavily app-specific).
>
> So my original question is not related to indexing or query traffic, but
> just the sheer number of cores.  For example, if I have 10 active cores on
> a machine and everything is working fine, should I expect that everything
> will still work fine if I add 10 nearly-idle cores to that machine?  What
> about 100?  1000?  I figure the overhead of each core is probably fairly
> low but at some point starts to matter.

One resource that may be exhausted faster than any other when you have a
lot of cores on a Solr instance (especially when they are not idle) is
Java heap memory, so you might need to increase the Java heap.  Memory
in the server is one of the most important resources you have for Solr
performance, and here I am talking about memory that is *not* used in
the Java heap (or any other program) -- the OS must be able to
effectively cache your index data or Solr performance will be terrible.

You have said "Solr cluster" and "collection" ... so that makes me think
you're running SolrCloud.  In cloud mode, you can't really use the
LotsOfCores functionality, where you mark cores transient and tell Solr
how many cores you'd like to have resident at the same time.  If you are
NOT in cloud mode, then you can use this feature:

http://wiki.apache.org/solr/LotsOfCores

In general, there are three resources other than memory which might
become exhausted with a large number of cores:

One resource is the "maximum open files" limit in the operating system,
which typically defaults to 1024.  Each core will typically have several
dozen files in its index, so it's very easy to reach 1024 open files.

The second resource is the maximum allowed threads in your servlet
container config -- each core you add requires more threads.  The
default maxThreads value in most containers is 200.  The Jetty container
included in the Solr download is preconfigured with a maxThreads value
of 10000, effectively removing the limit for most setups.

The third resource is related to the second -- some operating systems
implement threads as hidden processes, and many operating systems will
limit the number of processes that a user may start.  On Linux, this
limit is typically 1024, and may need to be increased.

I really need to add this kind of info to the wiki.

Thanks,
Shawn



Re: Solr and HDFS configuration

2015-03-24 Thread Michael Della Bitta
The ultimate answer is that you need to test your configuration with your
expected workflow.

However, the thing that mitigates the remote IO factor (hopefully) is that
the Solr HDFS stuff features a blockcache that should (when tuned
correctly) cache in RAM the blocks your Solr process needs the most.

Solr on HDFS currently doesn't have any sort of rack locality like there is
with say HBase colocated on the HDFS nodes. So you can expect that even
with Solr installed on the same nodes as your datanodes for HDFS, that
there will be remote IO.



Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions

w: appinions.com 

On Tue, Mar 24, 2015 at 2:47 PM, Joseph Obernberger  wrote:

> Hi All - does it make sense to run a Solr shard on a node within a Hadoop
> cluster that is not a data node?  In that case all the data that node
> processes would need to come over the network, but you get the benefit of
> more CPU for things like faceting.
> Thank you!
>
> -Joe
>


Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-24 Thread Martin Wunderlich
Hi Alex, 

Thanks again for the reply. See my response below inline. 

> On 22.03.2015 at 20:14, Alexandre Rafalovitch wrote:
> 
> I am not entirely sure your problem is at the XSL level yet?
> 
> *) I see problems with quotes in two places (in datasource, and in
> outer entity). Did you paste definitions from MSWord by any chance?

The file was created in a text editor. I am not sure which quotes you are 
referring to. They look fine to me and the XML file validates alright. Could you 
perhaps be more specific?

> *) I see that you declare outer entity to be rootEntity=true, so you
> will not get anything from inner documents

That’s correct, I have set the value to „false" now 

> *) I don't see any XPath definitions in the inner entity, so the
> processor does not know how to actually map to the fields (that's
> different for SQLEntityProcessor which auto-maps).

As far as I know, the explicit mappings are not required when the result of the 
transformation is in the Solr default import format. The documentation says: 
useSolrAddSchema

- Set this to true if the content is in the form of the standard Solr update 
XML schema.

(https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler)

But maybe my interpretation here is incorrect. I was assuming that setting this 
attribute to „true“ will allow the DIH to directly process the resulting XML 
file as if I was importing it with the command line Java tool. 

> 
> I would step back from inner DIH entity and make sure your outer
> entity actually captures something. Maybe by enabling dynamicField "*"
> with stored=true. See what you get into the schema. Then, add XPath
> against original XML, just to make sure you capture _something_. Then,
> XSLT and XPath.

OK, I will try to debug the DIH like this. Thanks again. 

Cheers, 

Martin
 
 


> 
> Regards,
>   Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
> 
> 
> On 22 March 2015 at 12:36, Martin Wunderlich  wrote:
>> Hi Alex,
>> 
>> Thanks a lot for the reply and apologies for being unclear. The 
>> XPathEntityProcessor provides an option to specify an XSLT file that should 
>> be applied to the XML input prior to the actual data import. I am including 
>> my current configuration below, with the respective attribute highlighted.
>> 
>> I have checked various forums and documentation bits, but the config XML 
>> seems ok to me. And yet, nothing gets imported.
>> 
>> Cheers,
>> 
>> Martin
>> 
>> 
>> 
>> <dataConfig>
>>   <dataSource type=„FileDataSource />
>>   <document>
>>     <entity name="pickupdir"
>>         processor="FileListEntityProcessor"
>>         rootEntity="true"
>>         fileName=".*xml"
>>         baseDir=„/abs/path/to/source/dir/for/import/"
>>         recursive="true"
>>         newerThan="${dataimporter.last_index_time}"
>>         dataSource="null">
>>
>>       <entity name="xml"
>>           processor="XPathEntityProcessor"
>>           stream="false"
>>           useSolrAddSchema="true"
>>           url="${pickupdir.fileAbsolutePath}"
>>           xsl="/abs/path/to/xslt/file/in/myCore/conf/transform.xsl">
>>       </entity>
>>     </entity>
>>   </document>
>> </dataConfig>
>>
>>> On 22.03.2015 at 01:18, Alexandre Rafalovitch wrote:
>>> 
>>> What do you mean using DIH with XSLT together? DIH uses a basic XPath
>>> parser, but not full XSLT.
>>> 
>>> So, it's not very clear what the question actually means. How did you
>>> configure it all?
>>> 
>>> Regards,
>>>  Alex.
>>> 
>>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>>> http://www.solr-start.com/ 
>>> 
>>> 
>>> On 21 March 2015 at 14:14, Martin Wunderlich  wrote:
 Hi all,
 
 I am trying to create a data import handler (DIH) to import XML files. The 
 source XML should be transformed using XSLT into the standard Solr import 
 format. I have tested the XSLT and successfully imported data using the 
 Java-based simple import tool. However, when I try to import the same XML 
 files with the same XSLT pre-processing using a DIH configured in 
 solrconfig.xml, it doesn’t work. I can execute the DIH from the admin 
 interface, but no documents get imported. The logging console doesn’t give 
 any errors.
 
 Could someone who has managed to successfully set up a similar 
 configuration (XML import via DIH with XSL pre-processing), provide with 
 the basic configuration, so that I can check what might be wrong in mine?
 
 Thanks a lot.
 
 Cheers,
 
 Martin
 
 
>> 



Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-24 Thread Alexandre Rafalovitch
>type=„FileDataSource />

I am seeing both a missing closing quote and an unusual opening quote (it
"aligns on the bottom"). But your response email also does that, so maybe
you are using some "smart" editor. Try checking this conversation in a web
archive if you can't see the unusual quotes.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 March 2015 at 15:41, Martin Wunderlich  wrote:
> Hi Alex,
>
> Thanks again for the reply. See my response below inline.
>
>> On 22.03.2015 at 20:14, Alexandre Rafalovitch wrote:
>>
>> I am not entirely sure your problem is at the XSL level yet?
>>
>> *) I see problems with quotes in two places (in datasource, and in
>> outer entity). Did you paste definitions from MSWord by any chance?
>
> The file was created in a text editor. I am not sure which quotes you are 
> referring to. They look fine to me and the XML file validates alright. Could 
> you perhaps be more specific?
>
>> *) I see that you declare outer entity to be rootEntity=true, so you
>> will not get anything from inner documents
>
> That’s correct, I have set the value to „false" now
>
>> *) I don't see any XPath definitions in the inner entity, so the
>> processor does not know how to actually map to the fields (that's
>> different for SQLEntityProcessor which auto-maps).
>
> As far as I know, the explicit mappings are not required when the result of 
> the transformation is in the Solr default import format. The documentation 
> says:
> useSolrAddSchema
>
> - Set this to true if the content is in the form of the standard Solr update 
> XML schema.
>
> (https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler)
>
> But maybe my interpretation here is incorrect. I was assuming that setting 
> this attribute to „true“ will allow the DIH to directly process the resulting 
> XML file as if I was importing it with the command line Java tool.
>
>>
>> I would step back from inner DIH entity and make sure your outer
>> entity actually captures something. Maybe by enabling dynamicField "*"
>> with stored=true. See what you get into the schema. Then, add XPath
>> against original XML, just to make sure you capture _something_. Then,
>> XSLT and XPath.
>
> OK, I will try to debug the DIH like this. Thanks again.
>
> Cheers,
>
> Martin
>
>
>
>
>>
>> Regards,
>>   Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 22 March 2015 at 12:36, Martin Wunderlich  wrote:
>>> Hi Alex,
>>>
>>> Thanks a lot for the reply and apologies for being unclear. The 
>>> XPathEntityProcessor provides an option to specify an XSLT file that should 
>>> be applied to the XML input prior to the actual data import. I am including 
>>> my current configuration below, with the respective attribute highlighted.
>>>
>>> I have checked various forums and documentation bits, but the config XML 
>>> seems ok to me. And yet, nothing gets imported.
>>>
>>> Cheers,
>>>
>>> Martin
>>>
>>>
>>> 
>>> <dataConfig>
>>>   <dataSource type=„FileDataSource />
>>>   <document>
>>>     <entity name="pickupdir"
>>>         processor="FileListEntityProcessor"
>>>         rootEntity="true"
>>>         fileName=".*xml"
>>>         baseDir=„/abs/path/to/source/dir/for/import/"
>>>         recursive="true"
>>>         newerThan="${dataimporter.last_index_time}"
>>>         dataSource="null">
>>>
>>>       <entity name="xml"
>>>           processor="XPathEntityProcessor"
>>>           stream="false"
>>>           useSolrAddSchema="true"
>>>           url="${pickupdir.fileAbsolutePath}"
>>>           xsl="/abs/path/to/xslt/file/in/myCore/conf/transform.xsl">
>>>       </entity>
>>>     </entity>
>>>   </document>
>>> </dataConfig>
>>>
 On 22.03.2015 at 01:18, Alexandre Rafalovitch wrote:

 What do you mean using DIH with XSLT together? DIH uses a basic XPath
 parser, but not full XSLT.

 So, it's not very clear what the question actually means. How did you
 configure it all?

 Regards,
  Alex.
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/ 


 On 21 March 2015 at 14:14, Martin Wunderlich  wrote:
> Hi all,
>
> I am trying to create a data import handler (DIH) to import XML files. 
> The source XML should be transformed using XSLT into the standard Solr 
> import format. I have tested the XSLT and successfully imported data 
> using the Java-based simple import tool. However, when I try to import 
> the same XML files with the same XSLT pre-processing using a DIH 
> configured in solrconfig.xml, it doesn’t work. I can execute the DIH from
> the admin interface, but no documents get imported. The logging console
> doesn’t give any errors.

RE: rough maximum cores (shards) per machine?

2015-03-24 Thread Toke Eskildsen
Jack Krupansky [jack.krupan...@gmail.com] wrote:
> Don't confuse customers and tenants.

Perhaps you could explain what you mean by multi-tenant in the context of Ian's 
setup? It is not clear to me what the distinction is in this case.

- Toke Eskildsen


Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-24 Thread Shawn Heisey
On 3/24/2015 1:41 PM, Martin Wunderlich wrote:
> The file was created in a text editor. I am not sure which quotes you
> are referring to. They look fine to me and the XML file valides
> alright. Could you perhaps be more specific?

This partial screenshot is your email to the list showing your
dataconfig, as I see it in Thunderbird, with the unusual quote
characters clearly indicated:

https://www.dropbox.com/s/rbycy69xq4bn42l/solr-user-martin-wunderlich-quote-problem.png?dl=0

Thanks,
Shawn



Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-24 Thread Martin Wunderlich
Very interesting. Thanks, Shawn. Here is what the config file looks like in the 
Solr admin console: 

https://www.dropbox.com/s/qtfclbvs8oze7lp/Bildschirmfoto%202015-03-24%20um%2021.11.12.png?dl=0
 


No problems with quotes here. It might have been Apple Mail that converted 
them. 

Cheers, 

Martin
 

> On 24.03.2015 at 20:59, Shawn Heisey wrote:
> 
> On 3/24/2015 1:41 PM, Martin Wunderlich wrote:
>> The file was created in a text editor. I am not sure which quotes you
>> are referring to. They look fine to me and the XML file valides
>> alright. Could you perhaps be more specific?
> 
> This partial screenshot is your email to the list showing your
> dataconfig, as I see it in Thunderbird, with the unusual quote
> characters clearly indicated:
> 
> https://www.dropbox.com/s/rbycy69xq4bn42l/solr-user-martin-wunderlich-quote-problem.png?dl=0
> 
> Thanks,
> Shawn
> 



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Jack Krupansky
I'm sure that I am quite unqualified to describe his hypothetical setup. I
mean, he's the one using the term multi-tenancy, so it's for him to be
clear.

For me, it's a question of who has control over the config and schema and
collection creation. Having more than one business entity controlling the
configuration of a single (Solr) server is a recipe for disaster. Solr
works well if there is an architect for the system. Ever hear the old
saying "Too many cooks spoil the stew"?

-- Jack Krupansky

On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen 
wrote:

> Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > Don't confuse customers and tenants.
>
> Perhaps you could explain what you mean by multi-tenant in the context of
> Ian's setup? It is not clear to me what the distinction is in this case.
>
> - Toke Eskildsen
>


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Test Test
Hi there, 

I'm trying to create my own TokenizerFactory (from the Taming Text book).
After setting up schema.xml and adding the path in solrconfig.xml, I start
Solr. I get this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType "text": Plugin init failure for [schema.xml]
analyzer/tokenizer: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is
.../conf/schema.xml
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
  at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
  at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
  at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
  ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType "text": Plugin init failure for [schema.xml]
analyzer/tokenizer: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
  ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] analyzer/tokenizer: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  ... 13 more
Caused by: java.lang.ClassCastException: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at java.lang.Class.asSubclass(Class.java:3208)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
  at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Can someone help?
Thanks. Regards.


 On Tuesday, 24 March 2015 at 21:24, Jack Krupansky wrote:

 I'm sure that I am quite unqualified to describe his hypothetical setup. I
mean, he's the one using the term multi-tenancy, so it's for him to be
clear.

For me, it's a question of who has control over the config and schema and
collection creation. Having more than one business entity controlling the
configuration of a single (Solr) server is a recipe for disaster. Solr
works well if there is an architect for the system. Ever hear the old
saying "Too many cooks spoil the stew"?

-- Jack Krupansky

On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen 
wrote:

> Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > Don't confuse customers and tenants.
>
> Perhaps you could explain what you mean by multi-tenant in the context of
> Ian's setup? It is not clear to me what the distinction is in this case.
>
> - Toke Eskildsen
>


  

RE: rough maximum cores (shards) per machine?

2015-03-24 Thread Toke Eskildsen
Jack Krupansky [jack.krupan...@gmail.com] wrote:
> I'm sure that I am quite unqualified to describe his hypothetical setup. I
> mean, he's the one using the term multi-tenancy, so it's for him to be
> clear.

It was my understanding that Ian used them interchangeably, but of course Ian 
is the only one who knows.

> For me, it's a question of who has control over the config and schema and
> collection creation. Having more than one business entity controlling the
> configuration of a single (Solr) server is a recipe for disaster.

Thank you. Now your post makes a lot more sense. I will not argue against that.

- Toke Eskildsen


Problem with Terms Query Parser

2015-03-24 Thread Shamik Bandopadhyay
Hi,

  I'm trying to use the Terms Query Parser for one of my use cases where I use
an implicit filter on a bunch of sources.

When I'm trying to run the following query,

fq={!terms f=Source}help,documentation,sfdc

I'm getting the following error.

Unknown query parser 'terms' (code 400)

What am I missing here? I'm using Solr 5.0.

Any pointers will be appreciated.

Regards,
Shamik


Custom TokenFilter

2015-03-24 Thread Test Test
Hi there, 

I'm trying to create my own TokenizerFactory (from the Taming Text book).
After setting up schema.xml and adding the path in solrconfig.xml, I start
Solr. I get this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType "text": Plugin init failure for [schema.xml]
analyzer/tokenizer: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is
.../conf/schema.xml
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
  at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
  at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
  at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
  ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] fieldType "text": Plugin init failure for [schema.xml]
analyzer/tokenizer: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
  ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for
[schema.xml] analyzer/tokenizer: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  ... 13 more
Caused by: java.lang.ClassCastException: class
com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at java.lang.Class.asSubclass(Class.java:3208)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
  at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Can someone help?
Thanks. Regards.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Erick Erickson
Test Test:

From Hossman's apache page:

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.

Also, please format your stack trace for readability. On a quick
glance, you probably
have mis-matched jars in your classpath.

On Tue, Mar 24, 2015 at 1:35 PM, Test Test  wrote:
> Hi there,
> I'm trying to create my own TokenizerFactory (from tamingtext's book). After
> setting schema.xml and adding the path in solrconfig.xml, I start Solr and
> get this error message:
>
> Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
>     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
>     at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
>     at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
>     at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
>     at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
>     at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
>     ... 7 more
> Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
>     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
>     ... 12 more
> Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
>     at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
>     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
>     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
>     ... 13 more
> Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
>     at java.lang.Class.asSubclass(Class.java:3208)
>     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
>     at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
>     at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
>     at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
>
> Can someone help?
> Thanks. Regards.
>
>
> On Tuesday, March 24, 2015 at 9:24 PM, Jack Krupansky wrote:
>
>
>  I'm sure that I am quite unqualified to describe his hypothetical setup. I
> mean, he's the one using the term multi-tenancy, so it's for him to be
> clear.
>
> For me, it's a question of who has control over the config and schema and
> collection creation. Having more than one business entity controlling the
> configuration of a single (Solr) server is a recipe for disaster. Solr
> works well if there is an architect for the system. Ever hear the old
> saying "Too many cooks spoil the stew"?
>
> -- Jack Krupansky
>
> On Tue, Mar 24, 2015 at 3:54 PM, Toke Eskildsen 
> wrote:
>
>> Jack Krupansky [jack.krupan...@gmail.com] wrote:
>> > Don't confuse customers and tenants.
>>
>> Perhaps you could explain what you mean by multi-tenant in the context of
>> Ian's setup? It is not clear to me what the distinction is in this case.
>>
>> - Toke Eskildsen
>>
>
>
>


Re: Custom TokenFilter

2015-03-24 Thread Erick Erickson
bq: Caused by: java.lang.ClassCastException: class
com.tamingtext.texttamer.solr.

This usually means you have jar files from different versions of Solr
in your classpath.
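
For what it's worth, here is a minimal sketch of what a custom factory has to
look like on the Solr/Lucene 4.x API. The package and class names are made up,
and the returned tokenizer is a stand-in rather than the book's sentence
tokenizer; the key point is that the class must extend
org.apache.lucene.analysis.util.TokenizerFactory from the *same* Lucene
version the server runs, or Class.asSubclass() fails with exactly this
ClassCastException:

import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;
import org.apache.lucene.util.Version;

// Compile this against the SAME solr/lucene jars the server runs.
public class MySentenceTokenizerFactory extends TokenizerFactory {

  public MySentenceTokenizerFactory(Map<String, String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory, Reader input) {
    // Stand-in: a real implementation would return the sentence-aware
    // Tokenizer from the book here.
    return new WhitespaceTokenizer(Version.LUCENE_47, factory, input);
  }
}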

Best,
Erick

On Tue, Mar 24, 2015 at 2:38 PM, Test Test  wrote:
> Hi there,
> I'm trying to create my own TokenizerFactory (from tamingtext's book). After
> setting schema.xml and adding the path in solrconfig.xml, I start Solr and
> get this error message:
>
> Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
>     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
>     at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
>     at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
>     at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
>     at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
>     at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
>     ... 7 more
> Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
>     at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
>     ... 12 more
> Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
>     at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
>     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
>     at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
>     ... 13 more
> Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
>     at java.lang.Class.asSubclass(Class.java:3208)
>     at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
>     at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
>     at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
>     at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
>     at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
>
> Can someone help?
> Thanks. Regards.


Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Ian Rose
First off thanks everyone for the very useful replies thus far.

Shawn - thanks for the list of items to check.  #1 and #2 should be fine
for us and I'll check our ulimit for #3.

To add a bit of clarification, we are indeed using SolrCloud.  Our current
setup is to create a new collection for each customer.  For now we allow
SolrCloud to decide for itself where to locate the initial shard(s) but in
time we expect to refine this such that our system will automatically
choose the least loaded nodes according to some metric(s).
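
(For what it's worth, the Collections API already allows pinning a new
collection to chosen nodes via the createNodeSet parameter; the collection
and node names below are hypothetical:

http://localhost:8983/solr/admin/collections?action=CREATE
    &name=customer_acme&numShards=1&replicationFactor=2
    &createNodeSet=node1:8983_solr,node2:8983_solr

The "least loaded" logic would then just compute that node list before
calling CREATE.)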

Having more than one business entity controlling the configuration of a
> single (Solr) server is a recipe for disaster. Solr works well if there is
> an architect for the system.


Jack, can you explain a bit what you mean here?  It looks like Toke caught
your meaning but I'm afraid it missed me.  What do you mean by "business
entity"?  Is your concern that with automatic creation of collections they
will be distributed willy-nilly across the cluster, leading to uneven load
across nodes?  If it is relevant, the schema and solrconfig are controlled
entirely by me and are the same for all collections.  Thus theoretically we
could actually just use one single collection for all of our customers
(adding a 'customer:' type fq to all queries) but since we never
need to query across customers it seemed more performant (as well as safer
- less chance of accidentally leaking data across customers) to use
separate collections.
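
To illustrate that single-collection alternative, here is a minimal SolrJ
sketch; the collection URL and the customer_id field are hypothetical, and it
assumes SolrJ 4.x:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TenantQueryExample {
  public static void main(String[] args) throws Exception {
    // One shared collection; every request is fenced to a single tenant.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/shared");
    SolrQuery q = new SolrQuery("*:*");
    // Forgetting this filter is exactly the cross-tenant leak risk above.
    q.addFilterQuery("customer_id:acme42");
    QueryResponse rsp = solr.query(q);
    System.out.println(rsp.getResults().getNumFound() + " docs for this tenant");
  }
}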

Better to give each tenant a separate Solr instance that you spin up and
> spin down based on demand.


Regarding this, if by tenant you mean "customer", this is not viable for us
from a cost perspective.  As I mentioned initially, many of our customers
are very small so dedicating an entire machine to each of them would not be
economical (or efficient).  Or perhaps I am not understanding what your
definition of "tenant" is?

Cheers,
Ian



On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen 
wrote:

> Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > I'm sure that I am quite unqualified to describe his hypothetical setup.
> I
> > mean, he's the one using the term multi-tenancy, so it's for him to be
> > clear.
>
> It was my understanding that Ian used them interchangeably, but of course
> Ian is the only one who knows.
>
> > For me, it's a question of who has control over the config and schema and
> > collection creation. Having more than one business entity controlling the
> > configuration of a single (Solr) server is a recipe for disaster.
>
> Thank you. Now your post makes a lot more sense. I will not argue against
> that.
>
> - Toke Eskildsen
>


Re: Unable to setup solr cloud with multiple collections.

2015-03-24 Thread sthita
Thanks Erick for your reply.
I am trying to create a new core, i.e. dict_cn, which is totally different in
terms of index data, configs, etc. from the existing core "abc".
The core is created successfully on my master (i.e. mail) and I can run Solr
queries against this newly created core.
All the config files (schema.xml and solrconfig.xml) are on the mail server,
and ZooKeeper shares those config files with the other collections.
I did a similar setup for the other collection so that the newly created core
would be available to all the collections, but it is still showing as down.
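
(For reference, on Solr 4.8 or later the per-collection replica states can be
checked with the Collections API; the host and port here are placeholders:

http://mail:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=dict_cn

On older versions, the same replica states are visible in /clusterstate.json
in ZooKeeper.)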






Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Damien Kamerman
From my experience on a high-end server (256GB memory, 40-core CPU) testing
collection numbers with one shard and two replicas, the maximum that would
work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
half of that), depending on your startup-time requirements. (Though I have
settled on 6,000 collection maximum with some patching. See SOLR-7191). You
could create multiple clouds after that, and choose the cloud least used to
create your collection.

Regarding memory usage, I'd pencil in 6MB of java heap overhead (no docs) per
collection.
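
As simple arithmetic on those numbers, assuming the ~6MB figure holds: 1,500
collections * 6MB ≈ 9GB of java heap consumed before a single document is
indexed.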

On 25 March 2015 at 13:46, Ian Rose  wrote:

> First off thanks everyone for the very useful replies thus far.
>
> Shawn - thanks for the list of items to check.  #1 and #2 should be fine
> for us and I'll check our ulimit for #3.
>
> To add a bit of clarification, we are indeed using SolrCloud.  Our current
> setup is to create a new collection for each customer.  For now we allow
> SolrCloud to decide for itself where to locate the initial shard(s) but in
> time we expect to refine this such that our system will automatically
> choose the least loaded nodes according to some metric(s).
>
> Having more than one business entity controlling the configuration of a
> > single (Solr) server is a recipe for disaster. Solr works well if there
> is
> > an architect for the system.
>
>
> Jack, can you explain a bit what you mean here?  It looks like Toke caught
> your meaning but I'm afraid it missed me.  What do you mean by "business
> entity"?  Is your concern that with automatic creation of collections they
> will be distributed willy-nilly across the cluster, leading to uneven load
> across nodes?  If it is relevant, the schema and solrconfig are controlled
> entirely by me and are the same for all collections.  Thus theoretically we
> could actually just use one single collection for all of our customers
> (adding a 'customer:' type fq to all queries) but since we never
> need to query across customers it seemed more performant (as well as safer
> - less chance of accidentally leaking data across customers) to use
> separate collections.
>
> Better to give each tenant a separate Solr instance that you spin up and
> > spin down based on demand.
>
>
> Regarding this, if by tenant you mean "customer", this is not viable for us
> from a cost perspective.  As I mentioned initially, many of our customers
> are very small so dedicating an entire machine to each of them would not be
> economical (or efficient).  Or perhaps I am not understanding what your
> definition of "tenant" is?
>
> Cheers,
> Ian
>
>
>
> On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen 
> wrote:
>
> > Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > > I'm sure that I am quite unqualified to describe his hypothetical
> setup.
> > I
> > > mean, he's the one using the term multi-tenancy, so it's for him to be
> > > clear.
> >
> > It was my understanding that Ian used them interchangeably, but of course
> > Ian is the only one who knows.
> >
> > > For me, it's a question of who has control over the config and schema
> and
> > > collection creation. Having more than one business entity controlling
> the
> > > configuration of a single (Solr) server is a recipe for disaster.
> >
> > Thank you. Now your post makes a lot more sense. I will not argue against
> > that.
> >
> > - Toke Eskildsen
> >
>



-- 
Damien Kamerman


Re: Using G1 with Apache Solr

2015-03-24 Thread Shawn Heisey
On 3/24/2015 3:48 PM, Kamran Khawaja wrote:
> I'm running Solr 4.7.2 with Java 7u75 with the following JVM params:
> 
> -verbose:gc 
> -XX:+PrintGCDateStamps 
> -XX:+PrintGCDetails 
> -XX:+PrintAdaptiveSizePolicy 
> -XX:+PrintReferenceGC 
> -Xmx3072m 
> -Xms3072m 
> -XX:+UseG1GC 
> -XX:+UseLargePages 
> -XX:+AggressiveOpts 
> -XX:+ParallelRefProcEnabled 
> -XX:G1HeapRegionSize=8m 
> -XX:InitiatingHeapOccupancyPercent=35 
> 
> 
> What I'm currently seeing is that many of the GC pauses are under an
> acceptable 0.25 seconds, but there are way too many full GCs with an average
> stop time of 3.2 seconds.
> 
> You can find the gc logs
> here: https://www.dropbox.com/s/v04b336v2k5l05e/g1_gc_7u75.log.gz?dl=0
> 
> I initially tested without specifying the HeapRegionSize but that
> resulted in the "humongous" message in the gc logs and a ton of full gc
> pauses.

This is similar to the settings I've been working on that I've
documented on my wiki page, with better results than you are seeing, and
a larger heap than you have configured:

https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector

You have one additional option that I don't --
InitiatingHeapOccupancyPercent.  I would suggest running without that
option to see how it affects your GC times.
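
For example, that would leave the flag set as follows -- just Kamran's list
with that one option removed, not a tested configuration:

-Xms3072m -Xmx3072m
-XX:+UseG1GC
-XX:+UseLargePages
-XX:+AggressiveOpts
-XX:+ParallelRefProcEnabled
-XX:G1HeapRegionSize=8m
-verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC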

I'm curious what OS you're running under, whether the OS and Java are
64-bit, and whether you have actually enabled huge pages in your
operating system.  If it's Linux and you have enabled huge pages, have
you turned off transparent huge pages as documented by Oracle:

https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

On my servers, I do *not* have huge pages configured in the operating
system, so the UseLargePages java option isn't doing anything.

One final thing ... Oracle developers have claimed that Java 8u40 has
some major improvements to the G1 collector, particularly for programs
that allocate very large objects.  Can you try 8u40?

Thanks,
Shawn



Re: Using G1 with Apache Solr

2015-03-24 Thread Shawn Heisey
On 3/24/2015 9:52 PM, Shawn Heisey wrote:
> On 3/24/2015 3:48 PM, Kamran Khawaja wrote:
>> I'm running Solr 4.7.2 with Java 7u75 with the following JVM params:

I really got my wires crossed.  Kamran sent his message to the
hotspot-gc-use mailing list, not the solr-user list!

Thanks,
Shawn



Re: rough maximum cores (shards) per machine?

2015-03-24 Thread Shai Erera
While it's hard to answer this question because, as others have said, "it
depends", I think it would be good if we could quantify or assess the cost of
running a SolrCore.

For instance, let's say that a server can handle a load of 10M indexed
documents (I omit search load on purpose for now) in a single SolrCore.
Would the same server be able to handle the same number of documents if we
indexed 1000 docs per SolrCore, in a total of 10,000 SolrCores? If the answer
is no, then it means there is some cost that comes w/ each SolrCore, and we
may at least be able to give an upper bound -- on a server with X amount
of storage, Y GB RAM and Z CPU cores you can run up to maxSolrCores(X, Y, Z).

Another way to look at it: if I were to create empty SolrCores, would I be
able to create an infinite number of cores if storage were infinite? Or do
even empty cores take their toll on CPU and RAM?

I know from the Lucene side of things that each SolrCore (which carries a
Lucene index) has a per-index cost -- the lexicon, IW's RAM buffer, Codecs
that store things in memory, etc. For instance, one downside of splitting a
10M-doc core into 10,000 cores is that the cost of holding the total
lexicon (dictionary of indexed words) goes up drastically, since now every
word (just the byte[] of the word) is potentially represented in memory
10,000 times.
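
To put rough -- entirely invented -- numbers on that duplication effect:

public class LexiconOverhead {
  public static void main(String[] args) {
    long uniqueTerms = 1_000_000; // assumed: unique terms in the combined lexicon
    long avgTermBytes = 8;        // assumed: average term length in bytes
    long oneBigCore = uniqueTerms * avgTermBytes;
    // If each of 10,000 tiny cores still holds just 1% of those terms:
    long tenThousandCores = 10_000L * (uniqueTerms / 100) * avgTermBytes;
    System.out.println("one core:     " + oneBigCore / (1024 * 1024) + " MB");       // ~7 MB
    System.out.println("10,000 cores: " + tenThousandCores / (1024 * 1024) + " MB"); // ~762 MB
  }
}

Same corpus, two orders of magnitude more term-dictionary footprint.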

What other RAM/CPU/Storage costs does a SolrCore carry with it? There are
the caches of course, which really depend on how many documents are
indexed. Any other non-trivial or constant cost?

So yes, there isn't a single answer to this question. It's just like asking
how many documents a single Lucene index can handle efficiently. But if we
can come up with basic numbers as I outlined above, it might help people
doing rough estimates. That doesn't mean people shouldn't benchmark, as that
upper bound may be way too high for their data set, query workload and
search needs.

Shai

On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman  wrote:

> From my experience on a high-end server (256GB memory, 40-core CPU) testing
> collection numbers with one shard and two replicas, the maximum that would
> work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps
> half of that), depending on your startup-time requirements. (Though I have
> settled on 6,000 collection maximum with some patching. See SOLR-7191). You
> could create multiple clouds after that, and choose the cloud least used to
> create your collection.
>
> Regarding memory usage, I'd pencil in 6MB of java heap overhead (no docs) per
> collection.
>
> On 25 March 2015 at 13:46, Ian Rose  wrote:
>
> > First off thanks everyone for the very useful replies thus far.
> >
> > Shawn - thanks for the list of items to check.  #1 and #2 should be fine
> > for us and I'll check our ulimit for #3.
> >
> > To add a bit of clarification, we are indeed using SolrCloud.  Our
> current
> > setup is to create a new collection for each customer.  For now we allow
> > SolrCloud to decide for itself where to locate the initial shard(s) but
> in
> > time we expect to refine this such that our system will automatically
> > choose the least loaded nodes according to some metric(s).
> >
> > Having more than one business entity controlling the configuration of a
> > > single (Solr) server is a recipe for disaster. Solr works well if there
> > is
> > > an architect for the system.
> >
> >
> > Jack, can you explain a bit what you mean here?  It looks like Toke
> caught
> > your meaning but I'm afraid it missed me.  What do you mean by "business
> > entity"?  Is your concern that with automatic creation of collections
> they
> > will be distributed willy-nilly across the cluster, leading to uneven
> load
> > across nodes?  If it is relevant, the schema and solrconfig are
> controlled
> > entirely by me and are the same for all collections.  Thus theoretically
> we
> > could actually just use one single collection for all of our customers
> > (adding a 'customer:' type fq to all queries) but since we
> never
> > need to query across customers it seemed more performant (as well as
> safer
> > - less chance of accidentally leaking data across customers) to use
> > separate collections.
> >
> > Better to give each tenant a separate Solr instance that you spin up and
> > > spin down based on demand.
> >
> >
> > Regarding this, if by tenant you mean "customer", this is not viable for
> us
> > from a cost perspective.  As I mentioned initially, many of our customers
> > are very small so dedicating an entire machine to each of them would not
> be
> > economical (or efficient).  Or perhaps I am not understanding what your
> > definition of "tenant" is?
> >
> > Cheers,
> > Ian
> >
> >
> >
> > On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen 
> > wrote:
> >
> > > Jack Krupansky [jack.krupan...@gmail.com] wrote:
> > > > I'm sure that I am quite unqualified to describe his hypothetical
> > setup.
> > > I
> > > > mean, he's the one using the term multi-tenancy,