Re: 2 VM setup for SOLRCLOUD?

2013-06-01 Thread Daniel Collins
Document updates will fail with fewer than a quorum of ZK nodes, so you won't 
be able to index anything while 1 server is down.


It's the one area that always seems counterintuitive (to me at any rate): 
after all, you have your 2 instances on 1 server, so you have all the shard 
data, and logically you should be able to index using just that (and if you 
had a single ZK running on that server it would indeed be fine)...  However, 
ZK needs a 3rd instance running somewhere in order to maintain its majority 
rule.


The consensus I've seen tends to be: run a ZK on each of your cloud servers, 
and then run some "outside" the cloud on other machines.  If you had a 3rd VM 
that just ran ZK and nothing else, you could lose any 1 of the 3 machines 
and still be ok. But if you lose 2 you are in trouble.


-Original Message- 
From: James Dulin

Sent: Friday, May 31, 2013 10:28 PM
To: solr-user@lucene.apache.org
Subject: RE: 2 VM setup for SOLRCLOUD?

Thanks. When you say updates will fail, do you mean document updates will 
fail, or updates to the cluster, like adding a new node?  If adding new 
data will fail, I will definitely need to figure out a different way to set 
this up.


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, May 31, 2013 4:33 PM
To: solr-user@lucene.apache.org
Subject: Re: 2 VM setup for SOLRCLOUD?

Be really careful here. Zookeeper requires a quorum, which is ((zk
nodes)/2) + 1. So the problem here is that if (zk nodes) is 2, both of them 
need to be up. If either of them is down, searches will still work, but 
updates will fail.
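The quorum arithmetic Erick describes can be sketched in a few lines of plain Java (illustrative only, not ZooKeeper code):

```java
public class ZkQuorum {
    // ZooKeeper needs a strict majority: floor(n/2) + 1 nodes up.
    static int quorum(int zkNodes) {
        return zkNodes / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorum(2)); // 2 -> both nodes must be up
        System.out.println(quorum(3)); // 2 -> survives one node failure
        System.out.println(quorum(5)); // 3 -> survives two node failures
    }
}
```

This is why a 2-node ZK ensemble is actually *less* available than a single node: losing either machine stops updates.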


Best
Erick

On Fri, May 31, 2013 at 11:39 AM, James Dulin  wrote:


Thanks, I think that the load balancer will be simple enough to set up in 
Azure.   My only other current concern is having the ZooKeepers on the 
same VMs as Solr.  While not ideal, we basically just need simple 
redundancy, so my theory is that if VM1 goes down, VM2 will have the 
shard, node, and zookeeper to keep everything running smoothly.



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, May 31, 2013 8:07 AM
To: solr-user@lucene.apache.org
Subject: Re: 2 VM setup for SOLRCLOUD?

Actually, you don't technically _need_ a load balancer, you could hard 
code all requests to the same node and internally, everything would "just 
work". But then you'd be _creating_ a single point of failure if that node 
went down, so a fronting LB is usually indicated.


Perhaps the thing you're missing is that Zookeeper is there explicitly for 
the purpose of knowing where all the nodes are and what their state is. 
Solr communicates with ZK, and any incoming requests (update or query) are 
handled appropriately; hence Jason's comment that once a request gets to any 
node in the cluster, things are handled automatically.


All that said, if you're using SolrJ and use CloudSolrServer exclusively, 
then the load balancer isn't necessary. Internally CloudSolrServer (the 
client) reads the list of accessible nodes from Zookeeper and will be 
fault tolerant and load balance internally.
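Roughly, the client-side failover that CloudSolrServer performs looks like the toy sketch below (plain Java, *not* the actual SolrJ implementation; the real client reads the live-node list from ZooKeeper, here it is just passed in):

```java
import java.util.List;
import java.util.Random;
import java.util.function.Function;

// Toy version of client-side load balancing: pick a random live node,
// fail over to the next one if the request throws.
public class ToyCloudClient {
    private final Random rnd = new Random();

    String request(List<String> liveNodes, Function<String, String> send) {
        int start = rnd.nextInt(liveNodes.size());
        for (int i = 0; i < liveNodes.size(); i++) {
            String node = liveNodes.get((start + i) % liveNodes.size());
            try {
                return send.apply(node);
            } catch (RuntimeException e) {
                // node unreachable -> try the next one
            }
        }
        throw new RuntimeException("no live nodes reachable");
    }
}
```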


Best
Erick

On Thu, May 30, 2013 at 3:51 PM, Jason Hellman 
 wrote:

Jamey,

You will need a load balancer on the front end to direct traffic into one 
of your SolrCore entry points.  Technically it doesn't matter which one, 
though you may find benefits in narrowing traffic to fewer entry points (for 
purposes of better cache management).


Internally SolrCloud will round-robin distribute requests to other shards 
once a query begins execution.  But you do need an entry point externally 
to be defined through your load balancer.


Hope this is useful!

Jason

On May 30, 2013, at 12:48 PM, James Dulin  wrote:


Working to setup SolrCloud in Windows Azure.  I have read over the
solr Cloud wiki, but am a little confused about some of the
deployment options.  I am attaching an image for what I am thinking
we want to do.  2 VM's that will have 2 shards spanning across them.
4 Nodes total across the two machines, and a zookeeper on each VM.
I think this is feasible, but, I am a little confused about how each
node knows how to respond to requests (do I need a load balancer in
front, or can we just reference the "collection" etc.)



Thanks!

Jamey








Re: Shard Keys and Distributed Search

2013-06-01 Thread Daniel Collins
Yes, it is doing a distributed search; SolrCloud will do that by default 
unless you say distrib=false.


My understanding of Solr's internal load balancer is that it picks a random 
instance from the list of available instances serving each shard.

So in your example:

1. Query comes in to Server 1; server 1 deconstructs it and works out which 
shards it needs to query. It then gets a list (from ZK) of all the instances 
in that collection which can service each shard, and the LB in Solr just 
picks one (at random).

2. It has picked Server 3 in your case, so the request goes there.
3. The request is still a 2-stage process (in terms of what you see in the 
logs): one query to get the docIds (using your query data), and then a second 
"query" to fetch the stored fields once it has the correct list of docs. 
This is necessary because, in a general multi-shard query, the responses have 
to go back to server 1 and be consolidated (not 100% sure of this area, 
but I believe this is true and it makes logical sense to me). So if you had 
a query for 10 records that needed to access 4 shards, it would ask for the 
"top 10" from each shard, combine/sort them to get the overall "top 10", and 
then fetch the stored fields for those 10 (which might be 5 from shard 1, 
2 from shard 2, and 3 from shard 3, nothing from shard 4, for example).
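The merge in step 3 can be sketched like this (toy code, not Solr's actual merge implementation): each shard returns its local top-k of (docId, score) pairs, and the coordinating node combines and re-sorts them to get the global top-k before fetching stored fields.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopKMerge {
    record Hit(String docId, float score) {}

    // Combine each shard's local top-k and keep the global top-k by score.
    static List<Hit> merge(List<List<Hit>> perShardTopK, int k) {
        List<Hit> all = new ArrayList<>();
        perShardTopK.forEach(all::addAll);
        all.sort(Comparator.comparingDouble(Hit::score).reversed());
        return all.subList(0, Math.min(k, all.size()));
    }
}
```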


You are right that it seems counterintuitive from the user's perspective, 
but I don't think SolrCloud currently has any logic to favour a local 
instance over a remote one; I guess that would be a change to 
CloudSolrServer? Alternatively, you can do it in your client by sending a 
non-distributed query, i.e. append 
"distrib=false&shards=localhost:8983/solr,localhost:7574/solr".


-Original Message- 
From: Niran Fajemisin

Sent: Friday, May 31, 2013 5:00 PM
To: Solr User
Subject: Shard Keys and Distributed Search

Hi all,

I'm trying to make sure that I understand under what circumstances a 
distributed search is performed against Solr, and whether my general 
understanding of what constitutes a distributed search is correct.


I have a Solr collection that was created using the Collections API with the 
following parameters: numShards=5, maxShardsPerNode=5, replicationFactor=4. 
Given that we have 4 servers, this results in a replica of each of the 5 
shards on every server. All documents indexed into Solr have a shard key 
specified as part of their document id, so we can use the same shard key 
prefix as part of our query by specifying: shard.keys=myshardkey!
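For context, Solr's compositeId router picks the shard by hashing the part of the id before the "!", so ids sharing a prefix land on the same shard. A toy illustration of the prefix extraction (not Solr's actual routing/hashing code):

```java
public class RouteKey {
    // For a compositeId like "myshardkey!doc42", the route key is
    // "myshardkey"; all ids sharing that prefix hash to the same shard.
    static String routeKeyOf(String docId) {
        int bang = docId.indexOf('!');
        return bang < 0 ? docId : docId.substring(0, bang);
    }
}
```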


My assumption was that when the search request is submitted, given that my 
deployment topology has every shard available on each server, there would be 
no need to call out to other servers in the cluster to fulfill the search. 
What I am noticing is the following:


1. Submit a search to Server 1 with the shard.keys parameter specified. 
(Note again that replicas for shard 1-5 are all available on the Server 1.)
2. The request is forwarded to a server other than Server 1, for example 
Server 3.
3. The /select request handler of Server 3 is invoked. This proceeds to 
execute the /select request, asking for the id and score fields for each 
document that matches the submitted query. I also noticed that it passes 
the shard.url parameter but states that distrib=false.
4. Then *another* request is executed on Server 3 for the /select request 
handler *again*. This time the ids returned from the previous search are 
passed in as the ids parameters.
5. Finally the results are passed back to the caller through the original 
server, Server 1.


This appears to be a full-blown distributed search being performed. My 
expectation was that the search would be localized to the original server 
(Server 1 in the example used above), given that it *should* be able to 
deduce that the current server has a replica that can fulfill the requested 
search. At the very least, it could localize the search to the shards on 
Server 1 instead of going against the entire Solr cluster.


My hope was that we would not have to go across the network, paying the 
network transport penalty, for a search that could have been fulfilled from 
the original Solr node, when the shard.keys param is specified.


Any insight that can be provided will be greatly appreciated.

Thanks all! 



Re: installing & configuring solr over ms sql server - tutorial needed

2013-06-01 Thread Mysurf Mail
My problem was with SQL Server.
This is a great "step by step" guide.


On Sat, Jun 1, 2013 at 2:06 AM, bbarani  wrote:

> Why don't you follow this tutorial to set up Solr on Tomcat:
>
> http://wiki.apache.org/solr/SolrTomcat
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/installing-configuring-solr-over-ms-sql-server-tutorial-needed-tp4067344p4067488.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: whole index in memory

2013-06-01 Thread Ramkumar R. Aiyengar
In general, just increasing the cache sizes to make everything fit in
memory might not always give you best results. Do keep in mind that the
caches are in Java memory and that incurs the penalty of garbage collection
and other housekeeping Java's memory management might have to do.

Reasonably recent Solr distributions should default to memory mapping your
collections on most platforms. What that means is that if you have
sufficient free memory available on your server for the operating system to
use, it would do the caching for you and that invariably ends up being much
better in terms of performance. From that angle, it's preferable to keep the
caches as small as possible so that the OS has more memory to cache with.

That said, as always, YMMV. The ultimate test in all this is to try it out
yourself with various configurations and see the performance differences
for yourself.
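For example, "keeping the caches small" means modest sizes in solrconfig.xml; the sizes below are arbitrary placeholders, not recommendations, and should be tuned against your own workload:

```xml
<!-- solrconfig.xml fragment: deliberately small Solr caches so that the
     OS page cache, rather than the JVM heap, holds most of the index. -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="64"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512"/>
```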
On 1 Jun 2013 01:34,  wrote:

> Hello,
>
> I have a solr index of size 5GB. I am thinking of increasing  cache size
> to 5 GB, expecting Solr will put whole index into memory.
>
> 1. Will Solr indeed put whole index into memory?
> 2. What are drawbacks of this approach?
>
> Thanks in advance.
> Alex.
>


Custom Response Handler

2013-06-01 Thread vibhoreng04
Hi All,

I have a requirement where I need to retrieve candidates from Solr, do some 
calculation on the basis of the search results, and return the calculated 
values along with the Solr document.
I am planning to use a custom response handler for this.
Can anybody guide me on the best approach for this?

I am using solr 4.2.1.

Regards,
Vibhor Jaiswal



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Custom-Response-Handler-tp4067558.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom Response Handler

2013-06-01 Thread Erik Hatcher
If the values are per document, a DocTransformer is probably the best hook 
point for you.  If you're computing things across the results, a 
SearchComponent is perhaps the right fit.  From what you've said so far, a 
RequestHandler or QueryResponseWriter is probably not the right fit for what 
you're trying to do.
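As a concept sketch of the per-document case (plain Java on maps, *not* the actual Solr API; the real hook is a subclass of org.apache.solr.response.transform.DocTransformer, and "qty"/"price"/"lineTotal" are made-up field names):

```java
import java.util.HashMap;
import java.util.Map;

// Concept only: a per-document transform computes a value from each
// document's existing fields and attaches it to the response document.
public class LineTotalTransform {
    static Map<String, Object> transform(Map<String, Object> doc) {
        Map<String, Object> out = new HashMap<>(doc);
        // Hypothetical calculation: derive a new field from existing ones.
        int qty = (Integer) doc.getOrDefault("qty", 0);
        double price = (Double) doc.getOrDefault("price", 0.0);
        out.put("lineTotal", qty * price);
        return out;
    }
}
```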

Erik

On Jun 1, 2013, at 10:00 , vibhoreng04 wrote:

> Hi All,
> 
> I have a requirement where I need to retrieve the candidates from the solr,
> do some calculation on the basis of the search result and return  the
> calculated values along with the solr document.
> I am planning to use Custom Response Handlers for this .
> Anybody can guide me what will be the best approach for this.
> 
> I am using solr 4.2.1.
> 
> Regards,
> Vibhor Jaiswal
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Custom-Response-Handler-tp4067558.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: 2 VM setup for SOLRCLOUD?

2013-06-01 Thread Walter Underwood
Running ZK on all the cloud servers makes it very, very hard to add a new Solr 
node. You have to reconfigure every ZK server to do that.

Manage the ZK cluster and the Solr cluster separately.

I'm not sure it is worth configuring Solr Cloud if you are only going to run 
two servers. Instead, run one server as live, and use simple replication to the 
second as a hot backup.

If you need four or more Solr servers and you need NRT, run Solr Cloud. 

wunder

On Jun 1, 2013, at 1:55 AM, Daniel Collins wrote:

> Document updates will fail with less than the quorum of ZKs, so you won't be 
> able to index anything when 1 server is down.
> 
> Its the one area that always seems counter intuitive (to me at any rate), 
> after all you have your 2 instances on 1 server, so you have all the shard 
> data, logically you should be able to index just using that (and if you had a 
> single ZK running on that server it would indeed be fine)...  However, ZK 
> needs a 3rd instance running somewhere in order to maintain its majority rule.
> 
> The consensus I've seen tends to be run a ZK on all your cloud servers, and 
> then run some "outside" the cloud on other machines.  If you had a 3rd VM 
> that just ran ZK and nothing else, you could lose any 1 of the 3 machines and 
> still be ok. But if you lose 2 you are in trouble.

Estimating the required volume to

2013-06-01 Thread Mysurf Mail
Hi,

I am just starting to learn about solr.
I want to test it in my env working with ms sql server.

I have followed the tutorial and imported some rows to the Solr.
Now I have a few noob question regarding the benefits of implementing Solr
on a sql environment.

1. As I understand it, when I send a query request over HTTP, I receive a
result with IDs from the Solr system and then I query the full object row
from the db.
Is that right?
Is there a comparison with MS SQL full-text search, which retrieves the
full object in the same select?
Is there a comparison that relates to db/server clusters and multiple
machines?
2. Is there a technique that will help me estimate the volume size I
will need for the indexed data (obviously, based on the indexed data's
properties)?


Delta Import failing - DataImportHandler SOLR 4.2

2013-06-01 Thread PeriS
I have configured the delta query properly, but I'm not sure why the DIH is 
throwing the following error:

SEVERE: Delta Import Failed
java.lang.RuntimeException: java.lang.NullPointerException
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:266)
        at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:451)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:489)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)
Caused by: java.lang.NullPointerException
        at org.apache.solr.handler.dataimport.DocBuilder.findMatchingPkColumn(DocBuilder.java:718)
        at org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:783)
        at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:334)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:219)
        ... 3 more
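An NPE in findMatchingPkColumn during collectDelta typically points at a mismatch between the entity's pk attribute and the column name returned by deltaQuery (including case). A sketch of the shape that works; the table and column names here are placeholders, not from the original post:

```xml
<!-- data-config.xml sketch: the column selected by deltaQuery must match
     the entity's pk exactly, and deltaImportQuery refers to it via
     ${dih.delta.<column>}. -->
<entity name="item" pk="ID"
        query="SELECT ID, name FROM item"
        deltaQuery="SELECT ID FROM item
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT ID, name FROM item
                          WHERE ID = '${dih.delta.ID}'"/>
```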


Thanks
-Peri.S