Re: SolR performance problem

2014-01-31 Thread Furkan KAMACI
Hi;

Could you give more information about your hardware infrastructure and JVM
settings?

Thanks;
Furkan KAMACI


2014-01-30 MayurPanchal :

> Hi,
>
> I am working on Solr 4.2.1 with Jetty and we are facing some performance issues
> and a heap memory overflow issue as well. So I am searching for the actual cause
> of these exceptions. I applied a load test with different Solr queries and
> after a few minutes got the errors below.
>
> WARN:oejs.Response:Committed before 500 {msg=Software caused connection
> abort: socket write
>
> Caused by: java.net.SocketException: Software caused connection abort:
> socket write error
>
> SEVERE: null:org.eclipse.jetty.io.EofException
>
>
> I also tried to set the maxIdleTime to 30 milliseconds, but I am still
> getting the same error.
>
> Any ideas?
> Please help, how to tackle this.
>
> Thanks,
> Mayur
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolR-performance-problem-tp4114459.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Realtimeget SolrCloud

2014-01-31 Thread StrW_dev
Hello,

I am currently experimenting with moving our Solr instance to a SolrCloud
setup.
I am getting an error trying to access the realtime get handler:

HTTP ERROR 404
Problem accessing /solr/collection1/get. Reason:
Not Found

They work fine in the normal Solr setup. Do I need some changes in
configurations? Am I missing something? Or is it not supported in the cloud
environment?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Realtimeget-SolrCloud-tp4114595.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Realtimeget SolrCloud

2014-01-31 Thread Rafał Kuć
Hello!

Do you have realtime get handler defined in your solrconfig.xml? This
part should be present:

  <requestHandler name="/get" class="solr.RealTimeGetHandler">
     <lst name="defaults">
       <str name="omitHeader">true</str>
       <str name="wt">json</str>
       <str name="indent">true</str>
     </lst>
  </requestHandler>
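
Once that handler is registered, a single document can be fetched by id with a
request along these lines (host, port and id here are only illustrative):

http://localhost:8983/solr/collection1/get?id=mydoc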

-- 
Regards,
 Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


> Hello,

> I am currently experimenting to move our Solr instance into a SolrCloud
> setup. 
> I am getting an error trying to access the realtimeget handlers:

> HTTP ERROR 404
> Problem accessing /solr/collection1/get. Reason:
> Not Found

> They work fine in the normal Solr setup. Do I need some changes in
> configurations? Am I missing something? Or is it not supported in the cloud
> environment?



> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Realtimeget-SolrCloud-tp4114595.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Realtimeget SolrCloud

2014-01-31 Thread StrW_dev

That seemed to be the issue.

I had several other request handlers as I wasn't using the simple /get, but
apparently in SolrCloud this handler must be present in order to use the
class RealTimeGetHandler at all.

Thank you!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Realtimeget-SolrCloud-tp4114595p4114598.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regarding Solr Faceting on the query response.

2014-01-31 Thread Mikhail Khludnev
On Thu, Jan 30, 2014 at 9:35 PM, Kuchekar  wrote:

>
> "docs": [ { "id": "ABC123", "company": [ "APPLE" ] },
> { "id": "ABC1234", "company": [ "APPLE" ] },
> { "id": "ABC1235", "company": [ "APPLE" ] },
> { "id": "ABC1236", "company": [ "APPLE" ] } ] }, "facet_counts": { "
> facet_queries": { "p_company:ucsf\n": 1 }, "facet_fields": { "company": [
> "APPLE", 4, ] }, "facet_dates": {}, "facet_ranges": {} }
>

Is this your 'expected' result? If it is, please post the 'actual' result once
again. I have one idea in mind; please answer so I can confirm my guess. I'll
share my idea after that.


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Regarding Solr Faceting on the query response.

2014-01-31 Thread Jérôme Étévé
On 30 January 2014 17:35, Kuchekar  wrote:
> Hi Mikhail,
>
> I would like my faceting to run only on my result set as
> returned (i.e. only on the numFound documents), rather than the whole index.

As far as I know, unless you define filter tagging and exclusion, this
is the default facet behaviour.
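
For example, a plain request of this shape (field names taken from the mail quoted
below, values illustrative) computes the company counts only over the documents
matching q (and any fq):

q=company:Apple AND technologies:java&facet=true&facet.field=company&facet.mincount=1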

Are you sure no such things are defined?

Can you send your full solr query?

J.


> In the example, even when I specify the query 'company:Apple' .. it gives
> me faceted results for other companies. This means that it is querying
> against the whole index, rather than just the result set.
>
> Using facet.mincount=1 will give me facet values with counts greater than
> 1, but that will again retrieve all the distinct values (Apple, Bose,
> Chevron, ... Oracle ...) of the facet field (company) by querying the whole index.
>
> What I would like to do is ... facet only on the resultset.
>
> i.e. my query (q=company:Apple AND technologies:java) should return only
> the facet details for 'Apple', since that is the only value present in the result
> set. But it provides me the list of other company names ... which makes me
> believe that it is querying the whole index to get the distinct values for
> the company field.
>
> "docs": [ { "id": "ABC123", "company": [ "APPLE" ] },
> { "id": "ABC1234", "company": [ "APPLE" ] },
> { "id": "ABC1235", "company": [ "APPLE" ] },
> { "id": "ABC1236", "company": [ "APPLE" ] } ] }, "facet_counts": { "
> facet_queries": { "p_company:ucsf\n": 1 }, "facet_fields": { "company": [
> "APPLE", 4, ] }, "facet_dates": {}, "facet_ranges": {} }
>
>
>  Thanks.
> Kuchekar, Nilesh
>
>
> On Thu, Jan 30, 2014 at 2:13 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
>> Hello
>> Do you mean setting
>> http://wiki.apache.org/solr/SimpleFacetParameters#facet.mincount to 1 or
>> you want to facet only returned page (rows) instead of full resultset
>> (numFound) ?
>>
>>
>> On Thu, Jan 30, 2014 at 6:24 AM, Nilesh Kuchekar
>> wrote:
>>
>> > Yeah it's a typo... I meant company:Apple
>> >
>> > Thanks
>> > Nilesh
>> >
>> > > On Jan 29, 2014, at 8:59 PM, Alexandre Rafalovitch wrote:
>> > >
>> > >> On Thu, Jan 30, 2014 at 3:43 AM, Kuchekar wrote:
>> > >> company=Apple
>> > > Did you mean company:Apple ?
>> > >
>> > > Otherwise, that could be the issue.
>> > >
>> > > Regards,
>> > >   Alex.
>> > >
>> > >
>> > > Personal website: http://www.outerthoughts.com/
>> > > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> > > - Time is the quality of nature that keeps events from happening all
>> > > at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> > > book)
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>>
>> 
>>  
>>



-- 
Jerome Eteve
+44(0)7738864546
http://www.eteve.net/


Re: Realtimeget SolrCloud

2014-01-31 Thread Rafał Kuć
Hello!

No problem. Also remember that you need the _version_ field to be
present in your schema.
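
For reference, the stock example schema.xml defines it like this:

<field name="_version_" type="long" indexed="true" stored="true"/>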

-- 
Regards,
 Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



> That seemed to be the issue.

> I had several other request handlers as I wasn't using the simple /get, but
> apparently in SolrCloud this handler must be present in order to use the
> class RealTimeGetHandler at all.

> Thank you!



> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Realtimeget-SolrCloud-tp4114595p4114598.html
> Sent from the Solr - User mailing list archive at Nabble.com.



MoreLikeThis

2014-01-31 Thread rubenboada
Hi everybody,

I'm working on DSpace 3.2 and I want to change the 'Related Documents'
functionality, which is based on Solr MoreLikeThis. Now when I open an item,
a 'Related Documents' section appears below its metadata; it should show
other items by the current item's author, but it shows items by other authors
instead. I have observed that if a name or surname matches the current author's
name or surname, DSpace shows the wrong items. For example, if an item's author
is called Jorge Perez Diaz and DSpace has another author called Marta Fernandez
Perez, the first surname of Jorge matches the second surname of Marta, and
DSpace shows Marta Fernandez Perez's item in the Related Documents of the Jorge
Perez Diaz item.

I read http://wiki.apache.org/solr/MoreLikeThis and I tried to modify the
values of mindf and mintf, but I didn't manage to resolve this problem.

Does anyone know the solution?

Thanks in advance
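
For reference, the knobs mentioned above are MoreLikeThis request parameters; a
request of roughly this shape (handler path, field name and values are purely
illustrative) is what controls the related-item matching:

q=id:123&mlt=true&mlt.fl=author&mlt.mintf=2&mlt.mindf=5&mlt.count=5

mlt.mintf is the minimum term frequency (within the source document) and mlt.mindf
the minimum document frequency below which terms are ignored when picking the
"interesting" terms.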



--
View this message in context: 
http://lucene.472066.n3.nabble.com/MoreLikeThis-tp4114605.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Realtimeget SolrCloud

2014-01-31 Thread Jack Krupansky
The reason is that although you can configure handlers with any name you 
want, internal requests to other shards (other Solr servers) will assume 
that the handlers have the default handler names, like "/get". Probably that 
should be configurable, or have some way of determining what the original 
handler name was.


-- Jack Krupansky

-Original Message- 
From: StrW_dev

Sent: Friday, January 31, 2014 4:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Realtimeget SolrCloud


That seemed to be the issue.

I had several other request handlers as I wasn't using the simple /get, but
apparently in SolrCloud this handler must be present in order to use the
class RealTimeGetHandler at all.

Thank you!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Realtimeget-SolrCloud-tp4114595p4114598.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Re: how to write an efficient query with a subquery to restrict the search space?

2014-01-31 Thread svante karlsson
It seems to be faster to first restrict the search space and then do the
scoring, compared to just using the full query and letting Solr handle everything.

For example, in my application one of the scoring fields effectively hits
1/12 of the database (a month field), and if we have 100'' items in the
database then this matters.

/svante
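
One common way to express that kind of restriction without affecting scoring is a
filter query; a sketch using the field names from the original mail (values
illustrative):

q=field1:val1 OR field2:val2 OR field3:val3 OR field4:val4&fq=field2:val2 OR field4:val4&rows=100&fl=*

The fq clause restricts the candidate set (and is cached independently of the main
query), while the q clause is still the one used for scoring.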


2014-01-30 Jack Krupansky :

> Lucene's default scoring should give you much of what you want - ranking
> hits of low-frequency terms higher - without any special query syntax -
> just list out your terms and use "OR" as your default operator.
>
> -- Jack Krupansky
>
> -Original Message- From: svante karlsson
> Sent: Thursday, January 23, 2014 6:42 AM
> To: solr-user@lucene.apache.org
> Subject: how to write an efficient query with a subquery to restrict the
> search space?
>
>
> I have a solr db containing 1 billion records that I'm trying to use in a
> NoSQL fashion.
>
> What I want to do is find the best matches using all search terms but
> restrict the search space to the most unique terms
>
> In this example I know that val2 and val4 are rare terms and val1 and val3
> are more common. In my real scenario I'll have 20 fields that I want to
> include or exclude in the inner query depending on the uniqueness of the
> requested value.
>
>
> my first approach was:
> q=field1:val1 OR field2:val2 OR field3:val3 OR field4:val4 AND (field2:val2
> OR field4:val4)&rows=100&fl=*
>
> but what I think I get is
> field4:val4 AND (field2:val2 OR field4:val4), and this result is then
> OR'ed with the rest
>
> if I write
> q=(field1:val1 OR field2:val2 OR field3:val3 OR field4:val4) AND
> (field2:val2 OR field4:val4)&rows=100&fl=*
>
> then what I think I get is two sub-queries that are evaluated separately and
> then joined; performance-wise this is bad.
>
> What's the best way to write these types of queries?
>
>
> Are there any performance issues when running it on several solrcloud nodes
> vs a single instance or should it scale?
>
>
>
> /svante
>


Re: Storing ranges on documents and searching all document with specific value included

2014-01-31 Thread Jack Krupansky
What does your actual query look like? Is it two range queries and an AND? 
Also, you have spaces in your field names, so that makes it more difficult 
to write queries since they need to be escaped.


-- Jack Krupansky

-Original Message- 
From: Avner Levy

Sent: Saturday, January 18, 2014 1:01 AM
To: 'solr-user@lucene.apache.org'
Subject: Storing ranges on documents and searching all document with 
specific value included


I have millions of documents with the following fields:
name (string), start version (int), end version (int).



I need to efficiently query all records that answer the query:
select all documents where version >= "start version" and version <= "end
version"


Running the above query took 50-100 ms, while a similar query that tags each
version took only 15 ms.
My question is how efficiently Solr can handle such queries (since it isn't a
classic FTS query).

Do I need to define something special in order to optimize performance?
Any alternate solutions will be welcomed.
The field values / types can be changed if needed.
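
As a sketch of the kind of query being described (with illustrative field names
start_version and end_version, since the real field names contain spaces):

q=*:*&fq=start_version:[* TO 42] AND end_version:[42 TO *]

i.e. a document matches when the requested version (42 here) falls inside its
stored range.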



List and Edit Config Files at Zookeeper from a Client Application

2014-01-31 Thread Furkan KAMACI
Hi;

I am developing an application that will have the ability to list and edit
SolrCloud config files at Zookeeper. Basically, an operator will be able to see
stopwords and synonyms (also the elevation config). The operator will edit them
from my dashboard and these files will be updated at Zookeeper.

Currently I use the cloud-scripts that are shipped with Solr. SolrZkClient and
ResourceLoader are options to use. What is the most convenient way for my
list and edit purpose?
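
A rough sketch of what listing and updating a config file through SolrZkClient
could look like (the zk host string, timeout and paths are illustrative, and error
handling is omitted):

import java.util.List;
import org.apache.solr.common.cloud.SolrZkClient;

public class ZkConfigEditor {
    public static void main(String[] args) throws Exception {
        SolrZkClient zkClient = new SolrZkClient("zk1:2181,zk2:2181,zk3:2181", 15000);
        try {
            // list the files stored for a collection's config set
            List<String> files = zkClient.getChildren("/configs/collection1", null, true);
            System.out.println("config files: " + files);

            // read one of them, e.g. the stopwords file
            byte[] data = zkClient.getData("/configs/collection1/stopwords.txt", null, null, true);
            String stopwords = new String(data, "UTF-8");

            // ... let the operator edit it, then write it back ...
            zkClient.setData("/configs/collection1/stopwords.txt",
                stopwords.getBytes("UTF-8"), true);
        } finally {
            zkClient.close();
        }
    }
}

Note that cores typically need to be reloaded before such changes take effect.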

Thanks;
Furkan KAMACI


shard1 gone missing ...

2014-01-31 Thread David Santamauro


Hi,

I have a strange situation. I created a collection with 4 nodes
(separate servers, numShards=4), then proceeded to index data ... all
had been going seemingly well until this morning, when I had to reboot one of
the nodes.


After reboot, the node I rebooted went into recovery mode! This is 
completely illogical as there is 1 shard per node (no replicas).


What could have possibly happened to 1) trigger a recovery, and 2) have
the node think it has a replica to even recover from?


Looking at the graph from the SOLR admin page it shows that shard1 
disappeared and the server that was rebooted appears in a recovering 
state under the server home to shard2.


I then looked at clusterstate.json and it confirms that shard1 is 
completely missing and shard2 now has a replica. ... I'm baffled, 
confused, dismayed.


Versions:
Solr 4.4 (4 nodes with tomcat container)
zookeeper-3.4.5 (5-node ensemble)

Oh, and I'm assuming shard1 is completely corrupt.

I'd really appreciate any insight.

David

PS I have a copy of all the shards backed up. Is there a way to possibly 
rsync shard1 back into place and "fix" clusterstate.json manually?


Re: shard1 gone missing ...

2014-01-31 Thread Mark Miller
Would probably need to see some logs to have an idea of what happened.

Would also be nice to see the after state of zk in a text dump.

You should be able to fix it, as long as you have the index on a disk, just 
make sure it is where it is expected and manually update the clusterstate.json. 
Would be good to take a look at the logs and see if it tells anything first 
though.

I’d also highly recommend you try moving to Solr 4.6.1 when you can though. We 
have fixed many, many, many bugs around SolrCloud in the 4 releases since 4.4. 
You can follow the progress in the CHANGES file we update for each release.

I wrote a little about the 4.6.1 as it relates to SolrCloud here: 
https://plus.google.com/+MarkMillerMan/posts/CigxUPN4hbA

- Mark

http://about.me/markrmiller

On Jan 31, 2014, at 10:13 AM, David Santamauro  
wrote:

> 
> Hi,
> 
> I have a strange situation. I created a collection with 4 nodes (separate
> servers, numShards=4), I then proceeded to index data ... all has been 
> seemingly well until this morning when I had to reboot one of the nodes.
> 
> After reboot, the node I rebooted went into recovery mode! This is completely 
> illogical as there is 1 shard per node (no replicas).
> 
> What could have possibly happened to 1) trigger a recovery and; 2) have the 
> node think it has a replica to even recover from?
> 
> Looking at the graph from the SOLR admin page it shows that shard1 
> disappeared and the server that was rebooted appears in a recovering state 
> under the server home to shard2.
> 
> I then looked at clusterstate.json and it confirms that shard1 is completely 
> missing and shard2 now has a replica. ... I'm baffled, confused, dismayed.
> 
> Versions:
> Solr 4.4 (4 nodes with tomcat container)
> zookeeper-3.4.5 (5-node ensemble)
> 
> Oh, and I'm assuming shard1 is completely corrupt.
> 
> I'd really appreciate any insight.
> 
> David
> 
> PS I have a copy of all the shards backed up. Is there a way to possibly 
> rsync shard1 back into place and "fix" clusterstate.json manually?



Re: shard1 gone missing ...

2014-01-31 Thread Mark Miller


On Jan 31, 2014, at 10:13 AM, David Santamauro  
wrote:

> Oh, and I'm assuming shard1 is completely corrupt.

Seems unlikely by the way. Sounds like what probably happened is that for some 
reason it thought when you restarted the shard that you were creating it with 
numShards=2 instead of 1.

In that case, a new entry in zk would be created. The first entry *could*
still look like it was active (we did not always try and publish a DOWN state 
on a clean shutdown) and the node that it is on will appear live (because it 
is).

In this case, the node would try to recover from itself.

I actually wouldn't expect that to easily corrupt the index. It's easy enough to
check though. Simply try starting a Solr instance against it and take a look.

- Mark

http://about.me/markrmiller

Re: shard1 gone missing ...

2014-01-31 Thread Mark Miller



On Jan 31, 2014, at 10:31 AM, Mark Miller  wrote:

> Seems unlikely by the way. Sounds like what probably happened is that for 
> some reason it thought when you restarted the shard that you were creating it 
> with numShards=2 instead of 1.

No, that’s not right. Sorry.

It must have got assigned a new core node name. numShards would still have to 
be seen as 1 for it to try and be a replica. Brain lapse.

Are you using a custom coreNodeName or taking the default? Can you post your 
solr.xml so we can see your genericCoreNodeNames and coreNodeName settings?

One possibility is that you got assigned a coreNodeName, but for some reason it 
was not persisted in solr.xml.

- Mark

http://about.me/markrmiller

Re: JVM heap constraints and garbage collection

2014-01-31 Thread Michael Della Bitta
Here at Appinions, we use mostly m2.2xlarges, but the new i2.xlarges look
pretty tasty primarily because of the SSD, and I'll probably push for a
switch to those when our reservations run out.

http://www.ec2instances.info/

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Thu, Jan 30, 2014 at 7:43 PM, Shawn Heisey  wrote:

> On 1/30/2014 3:20 PM, Joseph Hagerty wrote:
>
>> I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.
>>
>
> 
>
>
>  - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
>>
>
> One detail that you did not provide was how much of your 7.5GB RAM you are
> allocating to the Java heap for Solr, but I actually don't think I need
> that information, because for your index size, you simply don't have
> enough. If you're sticking with Amazon, you'll want one of the instances
> with at least 30GB of RAM, and you might want to consider more memory than
> that.
>
> An ideal RAM size for Solr is equal to the size of on-disk data plus the
> heap space used by Solr and other programs.  This means that if your java
> heap for Solr is 4GB and there are no other significant programs running on
> the same server, you'd want a minimum of 34GB of RAM for an ideal setup
> with your index.  4GB of that would be for Solr itself, the remainder would
> be for the operating system to fully cache your index in the OS disk cache.
>
> Depending on your query patterns and how your schema is arranged, you
> *might* be able to get away as little as half of your index size just for
> the OS disk cache, but it's better to make it big enough for the whole
> index, plus room for growth.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Many people are *shocked* when they are told this information, but if you
> think about the relative speeds of getting a chunk of data from a hard disk
> vs. getting the same information from memory, it's not all that shocking.
>
> Thanks,
> Shawn
>
>


Re: JVM heap constraints and garbage collection

2014-01-31 Thread Joseph Hagerty
Thanks, Shawn. This information is actually not all that shocking to me.
It's always been in the back of my mind that I was "getting away with
something" in serving from the m1.large. Remarkably, however, it has served
me well for nearly two years; also, although the index has not always been
30GB, it has always been much larger than the RAM on the box. As you
suggested, I can only suppose that usage patterns and the index schema have
in some way facilitated minimal heap usage, up to this point.

For now, we're going to increase the heap size on the instance and see
where that gets us; if it still doesn't suffice for now, then we'll upgrade
to a more powerful instance.

Michael, thanks for weighing in. Those i2 instances look delicious indeed.
Just curious -- have you struggled with garbage collection pausing at all?



On Thu, Jan 30, 2014 at 7:43 PM, Shawn Heisey  wrote:

> On 1/30/2014 3:20 PM, Joseph Hagerty wrote:
>
>> I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.
>>
>
> 
>
>
>  - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
>>
>
> One detail that you did not provide was how much of your 7.5GB RAM you are
> allocating to the Java heap for Solr, but I actually don't think I need
> that information, because for your index size, you simply don't have
> enough. If you're sticking with Amazon, you'll want one of the instances
> with at least 30GB of RAM, and you might want to consider more memory than
> that.
>
> An ideal RAM size for Solr is equal to the size of on-disk data plus the
> heap space used by Solr and other programs.  This means that if your java
> heap for Solr is 4GB and there are no other significant programs running on
> the same server, you'd want a minimum of 34GB of RAM for an ideal setup
> with your index.  4GB of that would be for Solr itself, the remainder would
> be for the operating system to fully cache your index in the OS disk cache.
>
> Depending on your query patterns and how your schema is arranged, you
> *might* be able to get away as little as half of your index size just for
> the OS disk cache, but it's better to make it big enough for the whole
> index, plus room for growth.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Many people are *shocked* when they are told this information, but if you
> think about the relative speeds of getting a chunk of data from a hard disk
> vs. getting the same information from memory, it's not all that shocking.
>
> Thanks,
> Shawn
>
>


-- 
- Joe


Re: shard1 gone missing ...

2014-01-31 Thread David Santamauro

On 01/31/2014 10:35 AM, Mark Miller wrote:




On Jan 31, 2014, at 10:31 AM, Mark Miller  wrote:


Seems unlikely by the way. Sounds like what probably happened is that for some 
reason it thought when you restarted the shard that you were creating it with 
numShards=2 instead of 1.


No, that’s not right. Sorry.

It must have got assigned a new core node name. numShards would still have to 
be seen as 1 for it to try and be a replica. Brain lapse.

Are you using a custom coreNodeName or taking the default? Can you post your 
solr.xml so we can see your genericCoreNodeNames and coreNodeName settings?

One possibility is that you got assigned a coreNodeName, but for some reason it 
was not persisted in solr.xml.

- Mark

http://about.me/markrmiller



There is nothing of note in the zookeeper logs. My solr.xml (sanitized
for privacy) is below and is identical on all 4 nodes.


<solr
  zkHost="xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181">
  <cores host="${host:}"
         hostPort="8080"
         hostContext="${hostContext:/x}"
         zkClientTimeout="${zkClientTimeout:15000}"
         defaultCoreName="c1"
         shareSchema="true" >
    <core collection="col1"
          instanceDir="/dir/x"
          config="solrconfig.xml"
          dataDir="/dir/x/data/y"
    />
  </cores>
</solr>

I don't specify coreNodeName nor a genericCoreNodeNames default value 
...  should I?


The tomcat log is basically just a replay of what happened.

16443 [coreLoadExecutor-4-thread-2] INFO 
org.apache.solr.core.CoreContainer  ? registering core: ...


# this is, I think what you are talking about above with new coreNodeName
16444 [coreLoadExecutor-4-thread-2] INFO 
org.apache.solr.cloud.ZkController  ? Register replica - core:c1 
address:http://xx.xx.xx.xx:8080/x collection: col1 shard:shard4


16453 [coreLoadExecutor-4-thread-2] INFO 
org.apache.solr.client.solrj.impl.HttpClientUtil  ? Creating new http 
client, 
config:maxConnections=1&maxConnectionsPerHost=20&connTimeout=3&socketTimeout=3&retry=false


16505 [coreLoadExecutor-4-thread-2] INFO 
org.apache.solr.cloud.ZkController  ? We are http://node1:8080/x and 
leader is http://node2:8080/x


Then it just starts replicating.

If there is anything specific I should be groking for in these logs, let 
me know.


Also, given that my clusterstate.json now looks like this:

assume:
  node1=xx.xx.xx.1
  node2=xx.xx.xx.2

"shard4":{
"range":"2000-3fff",
"state":"active",
"replicas":{
  "node2:8080_x_col1":{
"state":"active",
"core":"c1",
"node_name":"node2:8080_x",
"base_url":"http://node2:8080/x";,
"leader":"true"},
 this should not be a replica of shard2 but its own shard1
  "node1:8080_x_col1":{
"state":"recovering",
"core":"c1",
"node_name":"node1:8080_x",
"base_url":"http://node1:8080/x"}},

Can I just recreate shard1

"shard1":{
* NOTE: range is assumed based on ranges of other nodes
"range":"0-1fff",
"state":"active",
"replicas":{
  "node1:8080_x_col1":{
"state":"active",
"core":"c1",
"node_name":"node1:8080_x",
"base_url":"http://node1:8080/x";,
"leader":"true"}},

... and then remove the replica ..
"shard4":{
"range":"2000-3fff",
"state":"active",
"replicas":{
  "node2:8080_x_col1":{
"state":"active",
"core":"c1",
"node_name":"node2:8080_x",
"base_url":"http://node2:8080/x";,
"leader":"true"}},

That would be great...

thanks for your help

David



Re: shard1 gone missing ...

2014-01-31 Thread David Santamauro

On 01/31/2014 10:22 AM, Mark Miller wrote:


I’d also highly recommend you try moving to Solr 4.6.1 when you can though. We 
have fixed many, many, many bugs around SolrCloud in the 4 releases since 4.4. 
You can follow the progress in the CHANGES file we update for each release.


Can I do a drop-in replacement of 4.4.0 ?




Re: shard1 gone missing ...

2014-01-31 Thread Mark Miller
http://about.me/markrmiller

On Jan 31, 2014, at 11:11 AM, David Santamauro  
wrote:

> 
> There is nothing of note in the zookeeper logs. My solr.xml (sanitized for 
> privacy) and identical on all 4 nodes.
> <solr
>   zkHost="xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181,xx.xx.xx.xx:2181">
>   <cores host="${host:}"
>          hostPort="8080"
>          hostContext="${hostContext:/x}"
>          zkClientTimeout="${zkClientTimeout:15000}"
>          defaultCoreName="c1"
>          shareSchema="true" >
>     <core collection="col1"
>           instanceDir="/dir/x"
>           config="solrconfig.xml"
>           dataDir="/dir/x/data/y"
>     />
>   </cores>
> </solr>
> 
> I don't specify coreNodeName nor a genericCoreNodeNames default value ...  
> should I?
> 
> The tomcat log is basically just a replay of what happened.
> 
> 16443 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.core.CoreContainer  
> ? registering core: ...
> 
> # this is, I think what you are talking about above with new coreNodeName
> 16444 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.cloud.ZkController  
> ? Register replica - core:c1 address:http://xx.xx.xx.xx:8080/x collection: 
> col1 shard:shard4
> 
> 16453 [coreLoadExecutor-4-thread-2] INFO 
> org.apache.solr.client.solrj.impl.HttpClientUtil  ? Creating new http client, 
> config:maxConnections=1&maxConnectionsPerHost=20&connTimeout=3&socketTimeout=3&retry=false
> 
> 16505 [coreLoadExecutor-4-thread-2] INFO org.apache.solr.cloud.ZkController  
> ? We are http://node1:8080/x and leader is http://node2:8080/x
> 
> Then it just starts replicating.
> 
> If there is anything specific I should be groking for in these logs, let me 
> know.
> 
> Also, given that my clusterstate.json now looks like this:
> 
> assume:
>  node1=xx.xx.xx.1
>  node2=xx.xx.xx.2
> 
> "shard4":{
>"range":"2000-3fff",
>"state":"active",
>"replicas":{
>  "node2:8080_x_col1":{
>"state":"active",
>"core":"c1",
>"node_name":"node2:8080_x",
>"base_url":"http://node2:8080/x";,
>"leader":"true"},
>  this should not be a replica of shard2 but its own shard1
>  "node1:8080_x_col1":{
>"state":"recovering",
>"core":"c1",
>"node_name":"node1:8080_x",
>"base_url":"http://node1:8080/x"}},
> 
> Can I just recreate shard1
> 
> "shard1":{
> * NOTE: range is assumed based on ranges of other nodes
>"range":"0-1fff",
>"state":"active",
>"replicas":{
>  "node1:8080_x_col1":{
>"state":"active",
>"core":"c1",
>"node_name":"node1:8080_x",
>"base_url":"http://node1:8080/x";,
>"leader":"true"}},
> 
> ... and then remove the replica ..
> "shard4":{
>"range":"2000-3fff",
>"state":"active",
>"replicas":{
>  "node2:8080_x_col1":{
>"state":"active",
>"core":"c1",
>"node_name":"node2:8080_x",
>"base_url":"http://node2:8080/x";,
>"leader":"true"}},
> 
> That would be great...
> 
> thanks for your help
> 
> David
> 



Re: shard1 gone missing ...

2014-01-31 Thread Mark Miller


On Jan 31, 2014, at 11:15 AM, David Santamauro  
wrote:

> On 01/31/2014 10:22 AM, Mark Miller wrote:
> 
>> I’d also highly recommend you try moving to Solr 4.6.1 when you can though. 
>> We have fixed many, many, many bugs around SolrCloud in the 4 releases since 
>> 4.4. You can follow the progress in the CHANGES file we update for each 
>> release.
> 
> Can I do a drop-in replacement of 4.4.0 ?
> 
> 

It should be a drop-in replacement. For some that use deep APIs in plugins,
sometimes you might have to make a couple of small changes to your code.

Always best to do a test with a copy of your index, but for most, it should be a
drop-in replacement.

- Mark

http://about.me/markrmiller

Re: JVM heap constraints and garbage collection

2014-01-31 Thread Michael Della Bitta
Joseph:

Not so much after using some of the settings available on Shawn's Solr Wiki
page: https://wiki.apache.org/solr/ShawnHeisey

This is what we're running with right now:

-Xmx6g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=80



Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Fri, Jan 31, 2014 at 10:58 AM, Joseph Hagerty  wrote:

> Thanks, Shawn. This information is actually not all that shocking to me.
> It's always been in the back of my mind that I was "getting away with
> something" in serving from the m1.large. Remarkably, however, it has served
> me well for nearly two years; also, although the index has not always been
> 30GB, it has always been much larger than the RAM on the box. As you
> suggested, I can only suppose that usage patterns and the index schema have
> in some way facilitated minimal heap usage, up to this point.
>
> For now, we're going to increase the heap size on the instance and see
> where that gets us; if it still doesn't suffice for now, then we'll upgrade
> to a more powerful instance.
>
> Michael, thanks for weighing in. Those i2 instances look delicious indeed.
> Just curious -- have you struggled with garbage collection pausing at all?
>
>
>
> On Thu, Jan 30, 2014 at 7:43 PM, Shawn Heisey  wrote:
>
> > On 1/30/2014 3:20 PM, Joseph Hagerty wrote:
> >
> >> I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.
> >>
> >
> > 
> >
> >
> >  - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
> >>
> >
> > One detail that you did not provide was how much of your 7.5GB RAM you
> are
> > allocating to the Java heap for Solr, but I actually don't think I need
> > that information, because for your index size, you simply don't have
> > enough. If you're sticking with Amazon, you'll want one of the instances
> > with at least 30GB of RAM, and you might want to consider more memory
> than
> > that.
> >
> > An ideal RAM size for Solr is equal to the size of on-disk data plus the
> > heap space used by Solr and other programs.  This means that if your java
> > heap for Solr is 4GB and there are no other significant programs running
> on
> > the same server, you'd want a minimum of 34GB of RAM for an ideal setup
> > with your index.  4GB of that would be for Solr itself, the remainder
> would
> > be for the operating system to fully cache your index in the OS disk
> cache.
> >
> > Depending on your query patterns and how your schema is arranged, you
> > *might* be able to get away as little as half of your index size just for
> > the OS disk cache, but it's better to make it big enough for the whole
> > index, plus room for growth.
> >
> > http://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Many people are *shocked* when they are told this information, but if you
> > think about the relative speeds of getting a chunk of data from a hard
> disk
> > vs. getting the same information from memory, it's not all that shocking.
> >
> > Thanks,
> > Shawn
> >
> >
>
>
> --
> - Joe
>


SolrCloudServer questions

2014-01-31 Thread Software Dev
Can someone clarify what the following options are:

- updatesToLeaders
- shutdownLBHttpSolrServer
- parallelUpdates

Also, I remember in older versions of Solr there was an efficient format
that was used between SolrJ and Solr that is more compact. Does this still
exist in the latest version of Solr? If so, is it the default?

Thanks


Disabling Commit/Auto-Commit (SolrCloud)

2014-01-31 Thread Software Dev
Is there a way to disable commit/hard-commit at runtime? For example, we
usually have our hard commit and soft-commit set really low but when we do
bulk indexing we would like to disable this to increase performance. If
there isn't an easy way of doing this, would simply pushing a new
solrconfig to SolrCloud work?


Re: Disabling Commit/Auto-Commit (SolrCloud)

2014-01-31 Thread Alexei Martchenko
Why don't you set both solrconfig commits to very high values and issue a
commit command in sparse, small updates?

I've been doing this for ages and it works perfectly for me.


alexei martchenko
Facebook  |
Linkedin|
Steam  |
4sq| Skype: alexeiramone |
Github  | (11) 9 7613.0966 |


2014-01-31 Software Dev :

> Is there a way to disable commit/hard-commit at runtime? For example, we
> usually have our hard commit and soft-commit set really low but when we do
> bulk indexing we would like to disable this to increase performance. If
> there isn't a an easy way of doing this would simply pushing a new
> solrconfig to solrcloud work?
>


Re: Regarding Solr Faceting on the query response.

2014-01-31 Thread Kuchekar
Hi Mikhail,

The actual result is as follows:

"facet_counts": {
 "facet_queries": {}, "facet_fields": { "company": [ "Apple", 215, "BOSE", 0,
"Walmart", 0, "Oracle", 25,

 ...
 ...
 ...
 ...   "Microsoft",
 34, "ATT", 45
] }, "facet_dates": {}, "facet_ranges": {} }


The expected result would be:

"facet_counts": {
 "facet_queries": {}, "facet_fields": { "company": [ "Apple", 215
] }, "facet_dates": {}, "facet_ranges": {} }

Thanks
Kuchekar, Nilesh


On Fri, Jan 31, 2014 at 5:24 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> On Thu, Jan 30, 2014 at 9:35 PM, Kuchekar 
> wrote:
>
> >
> > "docs": [ { "id": "ABC123", "company": [ "APPLE" ] },
> > { "id": "ABC1234", "company": [ "APPLE" ] },
> > { "id": "ABC1235", "company": [ "APPLE" ] },
> > { "id": "ABC1236", "company": [ "APPLE" ] } ] }, "facet_counts": { "
> > facet_queries": { "p_company:ucsf\n": 1 }, "facet_fields": { "company": [
> > "APPLE", 4, ] }, "facet_dates": {}, "facet_ranges": {} }
> >
>
> Is it your 'expected' result? If it is, please repeat an 'actual' once
> again. I have one idea in mind, please answer to let me confirm my guess. I
> share my idea after that.
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
>  
>


Re: Disabling Commit/Auto-Commit (SolrCloud)

2014-01-31 Thread Mark Miller
It’s not a good idea to disable hard commit because the transaction log can grow
without limit in RAM.

Also, try some performance tests. I’ve never seen it matter if it’s set to like 
a minute, both for bulk and NRT.

As far as soft commit, you could turn it off and control visibility when adding 
docs via commitWithin.

- Mark

http://about.me/markrmiller
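
A minimal SolrJ sketch of the commitWithin approach (it assumes an already-configured
SolrServer instance named "server"; the 60-second window is just an example value):

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
// commitWithin: the doc becomes searchable within 60 seconds, no explicit commit
server.add(doc, 60000);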

On Jan 31, 2014, at 12:45 PM, Software Dev  wrote:

> Is there a way to disable commit/hard-commit at runtime? For example, we
> usually have our hard commit and soft-commit set really low but when we do
> bulk indexing we would like to disable this to increase performance. If
> there isn't a an easy way of doing this would simply pushing a new
> solrconfig to solrcloud work?



Special character search in Solr and boosting without altering the resultset

2014-01-31 Thread abhishek jain
Hi friends,

I am facing a strange problem. When I search for a term, e.g. .Net, Solr
searches for Net and does not include the '.'.

Is dot a special character in Solr? I tried escaping it with a backslash in
the URL call to Solr, but no use, same result set.

Also, is there a way to boost some terms within a result set?

I mean I want to boost a term within a result and I don't want to fire a
separate query. I couldn't use the OR operator as it would modify the result set.
I want to use a single query and boost. I don't want to use the dismax query
either.

Please advise.

 

Thanks,

Abhishek



Re: Disabling Commit/Auto-Commit (SolrCloud)

2014-01-31 Thread Alexei Martchenko
I didn't mean to disable it, just to put some high value there. I have a
script that updates my Solr in batches of thousands, so I set my commit to
100,000 because when it runs it updates 100,000 records in a short time.

The other script updates in batches of hundreds and it's not so fast, so its
internal loops issue a commit after X loops and/or when it finishes
processing.


alexei martchenko
Facebook  |
Linkedin|
Steam  |
4sq| Skype: alexeiramone |
Github  | (11) 9 7613.0966 |


2014-01-31 Mark Miller :

> It's not a good idea to disable hard commit because the transaction can
> grow without limit in RAM.
>
> Also, try some performance tests. I've never seen it matter if it's set to
> like a minute, both for bulk and NRT.
>
> As far as soft commit, you could turn it off and control visibility when
> adding docs via commitWithin.
>
> - Mark
>
> http://about.me/markrmiller
>
> On Jan 31, 2014, at 12:45 PM, Software Dev 
> wrote:
>
> > Is there a way to disable commit/hard-commit at runtime? For example, we
> > usually have our hard commit and soft-commit set really low but when we
> do
> > bulk indexing we would like to disable this to increase performance. If
> > there isn't a an easy way of doing this would simply pushing a new
> > solrconfig to solrcloud work?
>
>


Re: SolrCloudServer questions

2014-01-31 Thread Greg Walters
I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore my 
response.

> -updatesToLeaders

Only send documents to shard leaders while indexing. This saves cross-talk 
between slaves and leaders which results in more efficient document routing.

> shutdownLBHttpSolrServer

CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute 
requests (that aren't updates directly to leaders). Where did you find this? I 
don't see this in the javadoc anywhere but it is a boolean in the 
CloudSolrServer class. It looks like when you create a new CloudSolrServer and 
pass it your own LBHttpSolrServer the boolean gets set to false and the 
CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down.

> parallelUpdates

The javadocs don't have any description for this one, but I checked out the code
for CloudSolrServer, and if parallelUpdates is set it looks like it executes update
statements to multiple shards at the same time.

I'm no dev but I can read so please excuse any errors on my part.

Thanks,
Greg

On Jan 31, 2014, at 11:40 AM, Software Dev  wrote:

> Can someone clarify what the following options are:
> 
> - updatesToLeaders
> - shutdownLBHttpSolrServer
> - parallelUpdates
> 
> Also, I remember in older version of Solr there was an efficient format
> that was used between SolrJ and Solr that is more compact. Does this sill
> exist in the latest version of Solr? If so, is it the default?
> 
> Thanks



Re: SolrCloudServer questions

2014-01-31 Thread Mark Miller


On Jan 31, 2014, at 1:56 PM, Greg Walters  wrote:

> I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore my 
> response.
> 
>> -updatesToLeaders
> 
> Only send documents to shard leaders while indexing. This saves cross-talk 
> between slaves and leaders which results in more efficient document routing.

Right, but recently this has less of an effect because CloudSolrServer can now
hash documents and directly send them to the right place. This option has 
become more historical. Just make sure you set the correct id field on the 
CloudSolrServer instance for this hashing to work (I think it defaults to “id”).

> 
>> shutdownLBHttpSolrServer
> 
> CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute 
> requests (that aren't updates directly to leaders). Where did you find this? 
> I don't see this in the javadoc anywhere but it is a boolean in the 
> CloudSolrServer class. It looks like when you create a new CloudSolrServer 
> and pass it your own LBHttpSolrServer the boolean gets set to false and the 
> CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down.
> 
>> parellelUpdates
> 
> The javadoc's done have any description for this one but I checked out the 
> code for CloudSolrServer and if parallelUpdates it looks like it executes 
> update statements to multiple shards at the same time.

Right, we should def add some javadoc, but this sends updates to shards in 
parallel rather than with a single thread. Can really increase update speed. 
Still not as powerful as using CloudSolrServer from multiple threads, but a 
nice improvement nonetheless.


- Mark

http://about.me/markrmiller

> 
> I'm no dev but I can read so please excuse any errors on my part.
> 
> Thanks,
> Greg
> 
> On Jan 31, 2014, at 11:40 AM, Software Dev  wrote:
> 
>> Can someone clarify what the following options are:
>> 
>> - updatesToLeaders
>> - shutdownLBHttpSolrServer
>> - parallelUpdates
>> 
>> Also, I remember in older version of Solr there was an efficient format
>> that was used between SolrJ and Solr that is more compact. Does this sill
>> exist in the latest version of Solr? If so, is it the default?
>> 
>> Thanks
> 



RE: Geospatial clustering + zoom in/out help

2014-01-31 Thread Smiley, David W.
Hi Bojan.

You've got some good ideas here, along the lines of some that others have tried.
I've thrown together a page on the wiki about this subject some time ago that
I'm sure you will find interesting.  It references a relevant Stack Overflow
post, and also a presentation at DrupalCon which had a segment from a guy using
the same approach you suggest here, involving field collapsing and/or the stats
component.  The video shows it in action.

http://wiki.apache.org/solr/SpatialClustering

It would be helpful for everyone if you share your experience with whatever you 
choose, once you give an approach a try.

~ David

From: Bojan Šmid [bos...@gmail.com]
Sent: Thursday, January 30, 2014 1:15 PM
To: solr-user@lucene.apache.org
Subject: Geospatial clustering + zoom in/out help

Hi,

I have an index with 300K docs with lat,lon. I need to cluster the docs
based on lat,lon for display in the UI. The user then needs to be able to
click on any cluster and zoom in (up to 11 levels deep).

I'm using Solr 4.6 and I'm wondering how best to implement this efficiently?

A bit more specific questions below.

I need to:

1) cluster data points at different zoom levels

2) click on a specific cluster and zoom in

3) be able to select a region (bounding box or polygon) and show clusters
in the selected area

What's the best way to implement this so that queries are fast?

What I thought I would try, but maybe there are better ways:

* divide the world in NxM large squares and then each of these squares into
4 more squares, and so on - 11 levels deep

* at index time figure out all squares (at all 11 levels) each data point
belongs to and index that info into 11 different fields: e.g.


* at search time, use field collapsing on zoomX field to get which docs
belong to which square on particular level

* calculate center point of each square (by calculating mean value of
positions for all points in that square) using StatsComponent (facet on
zoomX field, avg on lat and lon fields) - I would consider those squares as
separate clusters (one square is one cluster) and center points of those
squares as center points of clusters derived from them

I *think* the problem with this approach is that:

* there will be many unique fields for bigger zoom levels, which means
field collapsing / StatsComponent maaay not work fast enough

* clusters will not look very natural because I would have many clusters on
each zoom level and what are "real" geographical clusters would be
displayed as multiple clusters since their points would in some cases be
dispersed into multiple squares. But that may be OK

* a lot will depend on how the squares are calculated - linearly dividing
360 degrees by N to get "equal" size squares in degrees would produce
issues with "real" square sizes and counts of points in each of them


So I'm wondering if there is a better way?

Thanks,


  Bojan
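
One way to compute the per-level grid cell ids described above (purely a sketch of
the linear division idea, with 2^level cells per axis, ignoring the projection
issues already mentioned):

static String cellId(double lat, double lon, int level) {
    int n = 1 << level;                                   // cells per axis at this level
    int x = (int) Math.floor((lon + 180.0) / 360.0 * n);  // 0 .. n-1, west to east
    int y = (int) Math.floor((90.0 - lat) / 180.0 * n);   // 0 .. n-1, north to south
    x = Math.min(x, n - 1);                               // clamp lon = 180 / lat = -90
    y = Math.min(y, n - 1);
    return x + "_" + y;                                   // value indexed into the zoom<level> field
}

Each document would carry one such value per zoom field, and faceting on the field
for the current level gives the per-cell counts.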


Re: Special character search in Solr and boosting without altering the resultset

2014-01-31 Thread Ahmet Arslan
Hi Abhishek,

dot is not a special character. Your field type / analyzer is stripping that 
character. 
Please see similar discussions and alternative solutions.

http://search-lucene.com/m/6dbI9zMSob1
http://search-lucene.com/m/Ac71G0KlGz
http://search-lucene.com/m/RRD2D1p1mi

Ahmet



On Friday, January 31, 2014 8:23 PM, abhishek jain  
wrote:
Hi friends,

I am facing a strange problem, When I search a term eg     .Net   , the solr
searches for Net and not includes '.'  

Is dot a special character in Solr? I tried escaping it with backslash in
the url call to solr, but no use same resultset,



Also , is there a way to boost some terms within a resultset.

I mean I want to boost a term within a result and I don't want to fire a
separate query. I couldn't use OR operator as it will modify the resultset.
I want to use a single query and boost. I don't want to use dismax query as
well,



Please advice.



Thanks,

Abhishek


Solr 4.x EdgeNGramFilterFactory and highlighting

2014-01-31 Thread Dmitriy Shvadskiy
Hello,

We are using EdgeNGramFilterFactory to provide partial match on the search
phrase for type ahead/autocomplete. Field type definition


 
   

   
   
   
 
 
   
   
   
 


Field is defined as following
 

Query string
q=22&qt=%2Fedismax&qf=ac_source_id&fl=ac_source_id&hl=true&hl.mergeContiguous=true&hl.useFastVectorHighlighter=true&hl.fl=ac_source_id

This is what comes back in highlighting response

  2282372


What I expect is 2282372, as it was in Solr 3.6. Also, I noticed that when
I run Analysis on the field, the start and end positions are the same even though
EdgeNGramFilterFactory generates multiple ngrams (see image attached).
Questions:
1. How do I accomplish highlighting on partial matches in Solr 4.x?
2. Did the behavior of EdgeNGramFilterFactory change between 3.6 and 4.x?
Highlighting on partial matches works fine in Solr 3.6.

Thank you,
Dmitriy Shvadskiy
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-tp4114748.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloudServer questions

2014-01-31 Thread Software Dev
Which, if any, of these settings would be beneficial when bulk uploading?


On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller  wrote:

>
>
> On Jan 31, 2014, at 1:56 PM, Greg Walters 
> wrote:
>
> > I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore
> my response.
> >
> >> -updatesToLeaders
> >
> > Only send documents to shard leaders while indexing. This saves
> cross-talk between slaves and leaders which results in more efficient
> document routing.
>
> Right, but recently this has less of an affect because CloudSolrServer can
> now hash documents and directly send them to the right place. This option
> has become more historical. Just make sure you set the correct id field on
> the CloudSolrServer instance for this hashing to work (I think it defaults
> to "id").
>
> >
> >> shutdownLBHttpSolrServer
> >
> > CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute
> requests (that aren't updates directly to leaders). Where did you find
> this? I don't see this in the javadoc anywhere but it is a boolean in the
> CloudSolrServer class. It looks like when you create a new CloudSolrServer
> and pass it your own LBHttpSolrServer the boolean gets set to false and the
> CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down.
> >
> >> parellelUpdates
> >
> > The javadoc's done have any description for this one but I checked out
> the code for CloudSolrServer and if parallelUpdates it looks like it
> executes update statements to multiple shards at the same time.
>
> Right, we should def add some javadoc, but this sends updates to shards in
> parallel rather than with a single thread. Can really increase update
> speed. Still not as powerful as using CloudSolrServer from multiple
> threads, but a nice improvement non the less.
>
>
> - Mark
>
> http://about.me/markrmiller
>
> >
> > I'm no dev but I can read so please excuse any errors on my part.
> >
> > Thanks,
> > Greg
> >
> > On Jan 31, 2014, at 11:40 AM, Software Dev 
> wrote:
> >
> >> Can someone clarify what the following options are:
> >>
> >> - updatesToLeaders
> >> - shutdownLBHttpSolrServer
> >> - parallelUpdates
> >>
> >> Also, I remember in older version of Solr there was an efficient format
> >> that was used between SolrJ and Solr that is more compact. Does this
> sill
> >> exist in the latest version of Solr? If so, is it the default?
> >>
> >> Thanks
> >
>
>


Re: SolrCloudServer questions

2014-01-31 Thread Mark Miller
Just make sure parallel updates is set to true.

If you want to load even faster, you can use the bulk add methods, or if you 
need more fine grained responses, use the single add from multiple threads 
(though bulk add can also be done via multiple threads if you really want to 
try and push the max).

- Mark

http://about.me/markrmiller
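
A rough SolrJ sketch of a bulk load with those settings (the zk host string,
collection name and documents are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");
        server.setIdField("id");            // field used for document routing
        server.setParallelUpdates(true);    // send to the shard leaders in parallel

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            batch.add(doc);
        }
        server.add(batch);                  // one bulk add instead of 1000 single adds
        server.commit();
        server.shutdown();
    }
}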

On Jan 31, 2014, at 3:50 PM, Software Dev  wrote:

> Which of any of these settings would be beneficial when bulk uploading?
> 
> 
> On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller  wrote:
> 
>> 
>> 
>> On Jan 31, 2014, at 1:56 PM, Greg Walters 
>> wrote:
>> 
>>> I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore
>> my response.
>>> 
 -updatesToLeaders
>>> 
>>> Only send documents to shard leaders while indexing. This saves
>> cross-talk between slaves and leaders which results in more efficient
>> document routing.
>> 
>> Right, but recently this has less of an affect because CloudSolrServer can
>> now hash documents and directly send them to the right place. This option
>> has become more historical. Just make sure you set the correct id field on
>> the CloudSolrServer instance for this hashing to work (I think it defaults
>> to "id").
>> 
>>> 
 shutdownLBHttpSolrServer
>>> 
>>> CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute
>> requests (that aren't updates directly to leaders). Where did you find
>> this? I don't see this in the javadoc anywhere but it is a boolean in the
>> CloudSolrServer class. It looks like when you create a new CloudSolrServer
>> and pass it your own LBHttpSolrServer the boolean gets set to false and the
>> CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down.
>>> 
 parellelUpdates
>>> 
>>> The javadoc's done have any description for this one but I checked out
>> the code for CloudSolrServer and if parallelUpdates it looks like it
>> executes update statements to multiple shards at the same time.
>> 
>> Right, we should def add some javadoc, but this sends updates to shards in
>> parallel rather than with a single thread. Can really increase update
>> speed. Still not as powerful as using CloudSolrServer from multiple
>> threads, but a nice improvement non the less.
>> 
>> 
>> - Mark
>> 
>> http://about.me/markrmiller
>> 
>>> 
>>> I'm no dev but I can read so please excuse any errors on my part.
>>> 
>>> Thanks,
>>> Greg
>>> 
>>> On Jan 31, 2014, at 11:40 AM, Software Dev 
>> wrote:
>>> 
 Can someone clarify what the following options are:
 
 - updatesToLeaders
 - shutdownLBHttpSolrServer
 - parallelUpdates
 
 Also, I remember in older version of Solr there was an efficient format
 that was used between SolrJ and Solr that is more compact. Does this
>> sill
 exist in the latest version of Solr? If so, is it the default?
 
 Thanks
>>> 
>> 
>> 



Removing last replica from a SolrCloud collection

2014-01-31 Thread David Smiley (@MITRE.org)
Hi,

If I issue either a core UNLOAD command, or a collection DELETEREPLICA
command,  (which both seem pretty much equivalent) it works but if there are
no other replicas for the shard, then the metadata for the shard is
completely gone in clusterstate.json!  That's pretty disconcerting because
you're basically hosed.  Of course, why would I even want to do that?  Well
I'm experimenting with ways to restore a backed-up replica to replace
existing data for the shard.

If this is unexpected behavior then I'll file a bug.

~ David



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Removing-last-replica-from-a-SolrCloud-collection-tp4114772.html
Sent from the Solr - User mailing list archive at Nabble.com.


Clone (or Restore) Solrcloud

2014-01-31 Thread David Smiley (@MITRE.org)
Hi,

I'm attempting to come up with a SolrCloud restore / clone process for
either recover to a known good state or to clone the environment for
experimentation.  At the moment my process involves either creating a new
zookeeper environment or at least deleting the existing Collection so that I
can create a new one.  This works; I use the Core API; the first command
defines the collection parameters, and I invoke it once for each replica.  I
don't use the Collection API because I want SolrCloud to go off trying to
create all the replicas -- I know where each one is pre-positioned.

What I'm concerned about is what happens once I start wanting to use Shard
splitting, *especially* if I don't want to split all shards because shards
are uneven due to custom routing (e.g. id:"customer!myid").  In this case I
don't know how to create the collection with the hash ranges post-shard
split.  Solr doesn't have an API for me to explicitly say what the hash
ranges should be on each shard (to match up with a backup).  And I'm
concerned about undocumented pitfalls that may exist in manually
constructing a clusterstate.json, as another approach.

Any ideas?

~ David



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Clone-or-Restore-Solrcloud-tp4114773.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: JVM heap constraints and garbage collection

2014-01-31 Thread Erick Erickson
Be a little careful when looking at on-disk index sizes.
The *.fdt and *.fdx files are pretty irrelevant for the in-memory
requirements. They are just read to assemble the response (usually
10-20 docs). That said, you can _make_ them more relevant by
specifying very large document cache sizes.

Best,
Erick

On Fri, Jan 31, 2014 at 9:49 AM, Michael Della Bitta
 wrote:
> Joesph:
>
> Not so much after using some of the settings available on Shawn's Solr Wiki
> page: https://wiki.apache.org/solr/ShawnHeisey
>
> This is what we're running with right now:
>
> -Xmx6g
> -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=80
>
>
>
> Michael Della Bitta
>
> Applications Developer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> "The Science of Influence Marketing"
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions  | g+:
> plus.google.com/appinions
> w: appinions.com 
>
>
> On Fri, Jan 31, 2014 at 10:58 AM, Joseph Hagerty  wrote:
>
>> Thanks, Shawn. This information is actually not all that shocking to me.
>> It's always been in the back of my mind that I was "getting away with
>> something" in serving from the m1.large. Remarkably, however, it has served
>> me well for nearly two years; also, although the index has not always been
>> 30GB, it has always been much larger than the RAM on the box. As you
>> suggested, I can only suppose that usage patterns and the index schema have
>> in some way facilitated minimal heap usage, up to this point.
>>
>> For now, we're going to increase the heap size on the instance and see
>> where that gets us; if it still doesn't suffice for now, then we'll upgrade
>> to a more powerful instance.
>>
>> Michael, thanks for weighing in. Those i2 instances look delicious indeed.
>> Just curious -- have you struggled with garbage collection pausing at all?
>>
>>
>>
>> On Thu, Jan 30, 2014 at 7:43 PM, Shawn Heisey  wrote:
>>
>> > On 1/30/2014 3:20 PM, Joseph Hagerty wrote:
>> >
>> >> I'm using Solr 3.5 over Tomcat 6. My index has reached 30G.
>> >>
>> >
>> > 
>> >
>> >
>> >  - The box is an m1.large on AWS EC2. 2 virtual CPUs, 4 ECU, 7.5 GiB RAM
>> >>
>> >
>> > One detail that you did not provide was how much of your 7.5GB RAM you
>> are
>> > allocating to the Java heap for Solr, but I actually don't think I need
>> > that information, because for your index size, you simply don't have
>> > enough. If you're sticking with Amazon, you'll want one of the instances
>> > with at least 30GB of RAM, and you might want to consider more memory
>> than
>> > that.
>> >
>> > An ideal RAM size for Solr is equal to the size of on-disk data plus the
>> > heap space used by Solr and other programs.  This means that if your java
>> > heap for Solr is 4GB and there are no other significant programs running
>> on
>> > the same server, you'd want a minimum of 34GB of RAM for an ideal setup
>> > with your index.  4GB of that would be for Solr itself, the remainder
>> would
>> > be for the operating system to fully cache your index in the OS disk
>> cache.
>> >
>> > Depending on your query patterns and how your schema is arranged, you
>> > *might* be able to get away as little as half of your index size just for
>> > the OS disk cache, but it's better to make it big enough for the whole
>> > index, plus room for growth.
>> >
>> > http://wiki.apache.org/solr/SolrPerformanceProblems
>> >
>> > Many people are *shocked* when they are told this information, but if you
>> > think about the relative speeds of getting a chunk of data from a hard
>> disk
>> > vs. getting the same information from memory, it's not all that shocking.
>> >
>> > Thanks,
>> > Shawn
>> >
>> >
>>
>>
>> --
>> - Joe
>>


facet.prefix or separation?

2014-01-31 Thread William Bell
What should be better for performance to get those facets that begins with
A?

1.
facet=true&facet.field=conditions&facet.prefix=A

2.
When indexing create a new field conditions_A, and use it:
facet=true&facet.field=conditions_A

Thoughts?



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: facet.prefix or separation?

2014-01-31 Thread Alexandre Rafalovitch
I am quite sure that the binary flag will be faster, as you will just
get a gigantic vector pre-loaded into memory. The problem starts if
you are going to have lots of those prefixes; then the memory
requirements may become an issue. The facet.prefix approach, on the
other hand, is more flexible, as it uses the same list for any arbitrary prefix.

Those are my thoughts (as requested). I haven't tested this in production.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sat, Feb 1, 2014 at 11:20 AM, William Bell  wrote:
> What should be better for performance to get those facets that begins with
> A?
>
> 1.
> facet=true&facet.field=conditions&facet.prefix=A
>
> 2.
> When indexing create a new field conditions_A, and use it:
> facet=true&facet.field=conditions_A
>
> Thoughts?
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076


Re: facet.prefix or separation?

2014-01-31 Thread William Bell
Just to be perfectly clear, it is not a binary field.

conditions = "A west side story"
conditions = "The edge of reason"

I look for the strings beginning with A and put those in conditions_A:

conditions_A = "A west side story"
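
So, to make the indexing side concrete (the values here are just the examples
above), a document whose value starts with A carries both fields:

  conditions   = "A west side story"
  conditions_A = "A west side story"

while "The edge of reason" keeps only conditions and gets no conditions_A at all.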

OK?


On Fri, Jan 31, 2014 at 9:29 PM, Alexandre Rafalovitch
wrote:

> I am quite sure that the binary flag will be faster as you will just
> get a gigantic vector pre-loaded into memory. The problem starts if
> you are going to have lots of those prefixes. Then, the memory
> requirements may become an issue. Then, the facet becomes more
> flexible as it uses the same list for any arbitrary prefix.
>
> There are my thoughts (as requested). I haven't tested this in production.
>
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>
>
> On Sat, Feb 1, 2014 at 11:20 AM, William Bell  wrote:
> > What should be better for performance to get those facets that begins
> with
> > A?
> >
> > 1.
> > facet=true&facet.field=conditions&facet.prefix=A
> >
> > 2.
> > When indexing create a new field conditions_A, and use it:
> > facet=true&facet.field=conditions_A
> >
> > Thoughts?
> >
> >
> >
> > --
> > Bill Bell
> > billnb...@gmail.com
> > cell 720-256-8076
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: facet.prefix or separation?

2014-01-31 Thread Alexandre Rafalovitch
Ok, so you are pre-partitioning the facet field based on initial
letter, so all the texts that start with A will go into conditions_A
and all the texts that start with C will go into conditions_C.
Interesting approach. Ignore whatever I said before.

If this does not cause other issues, then it is possible that the
partitioned approach will be slightly more efficient, because it does
not need to load the field cache for non-A content into memory. But
that is probably more of a memory-size gain than a speed one.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Sat, Feb 1, 2014 at 11:33 AM, William Bell  wrote:
> Just to be perfectly clear, it is not a binary field.
>
> conditions = "A west side story"
> conditions = "The edge of reason"
>
> I look for those strings beginning with A and set that in conditions_A:
>
> conditions_A = "A west side story"
>
> OK?
>
>
> On Fri, Jan 31, 2014 at 9:29 PM, Alexandre Rafalovitch
> wrote:
>
>> I am quite sure that the binary flag will be faster as you will just
>> get a gigantic vector pre-loaded into memory. The problem starts if
>> you are going to have lots of those prefixes. Then, the memory
>> requirements may become an issue. Then, the facet becomes more
>> flexible as it uses the same list for any arbitrary prefix.
>>
>> There are my thoughts (as requested). I haven't tested this in production.
>>
>> Regards,
>>Alex.
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all
>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)
>>
>>
>> On Sat, Feb 1, 2014 at 11:20 AM, William Bell  wrote:
>> > What should be better for performance to get those facets that begins
>> with
>> > A?
>> >
>> > 1.
>> > facet=true&facet.field=conditions&facet.prefix=A
>> >
>> > 2.
>> > When indexing create a new field conditions_A, and use it:
>> > facet=true&facet.field=conditions_A
>> >
>> > Thoughts?
>> >
>> >
>> >
>> > --
>> > Bill Bell
>> > billnb...@gmail.com
>> > cell 720-256-8076
>>
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076


Re: facet.prefix or separation?

2014-01-31 Thread William Bell
This is the approach for "words that begin with" using an alpha-span on the
site:

A B C D E F G ...

The user clicks "A" and I would use conditions_A.
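
So when "A" is clicked, the request would be roughly this (other params
omitted, and the field names are from the earlier example):

  q=*:*&rows=0&facet=true&facet.field=conditions_A&facet.mincount=1&facet.limit=-1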



On Fri, Jan 31, 2014 at 9:42 PM, Alexandre Rafalovitch
wrote:

> Ok, so you are pre-partitioning the facet field based on initial
> letter. So all the texts that start from A will go into conditions_A
> and all the texts that start from C will go into conditions_C.
> Interesting approach. Ignore whatever I said before.
>
> If this does not cause other issues, than it is possible that the
> partitioned approach will be slightly more efficient because it does
> not need to load into memory the field cache for non_A content. But
> that could be more memory size efficiency than the speed one.
>
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>
>
> On Sat, Feb 1, 2014 at 11:33 AM, William Bell  wrote:
> > Just to be perfectly clear, it is not a binary field.
> >
> > conditions = "A west side story"
> > conditions = "The edge of reason"
> >
> > I look for those strings beginning with A and set that in conditions_A:
> >
> > conditions_A = "A west side story"
> >
> > OK?
> >
> >
> > On Fri, Jan 31, 2014 at 9:29 PM, Alexandre Rafalovitch
> > wrote:
> >
> >> I am quite sure that the binary flag will be faster as you will just
> >> get a gigantic vector pre-loaded into memory. The problem starts if
> >> you are going to have lots of those prefixes. Then, the memory
> >> requirements may become an issue. Then, the facet becomes more
> >> flexible as it uses the same list for any arbitrary prefix.
> >>
> >> There are my thoughts (as requested). I haven't tested this in
> production.
> >>
> >> Regards,
> >>Alex.
> >> Personal website: http://www.outerthoughts.com/
> >> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> >> - Time is the quality of nature that keeps events from happening all
> >> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> >> book)
> >>
> >>
> >> On Sat, Feb 1, 2014 at 11:20 AM, William Bell 
> wrote:
> >> > What should be better for performance to get those facets that begins
> >> with
> >> > A?
> >> >
> >> > 1.
> >> > facet=true&facet.field=conditions&facet.prefix=A
> >> >
> >> > 2.
> >> > When indexing create a new field conditions_A, and use it:
> >> > facet=true&facet.field=conditions_A
> >> >
> >> > Thoughts?
> >> >
> >> >
> >> >
> >> > --
> >> > Bill Bell
> >> > billnb...@gmail.com
> >> > cell 720-256-8076
> >>
> >
> >
> >
> > --
> > Bill Bell
> > billnb...@gmail.com
> > cell 720-256-8076
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


solr joins

2014-01-31 Thread anand chandak
Folks, I have a basic question regarding Solr joins. The wiki
states:


"Fields or other properties of the documents being joined "from" are not 
available for use in processing of the resulting set of "to" documents 
(ie: you can not return fields in the "from" documents as if they were a 
multivalued field on the "to" documents)".


I am finding it hard to understand the above limitation of the Solr
join. Does it mean that, unlike traditional RDBMS joins, which can return
columns from both the TO and FROM tables, Solr joins will only return
fields from the TO documents? Is my understanding correct?
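
For example, with a request roughly like this (the field names are made up):

  q={!join from=parent_id to=id}color:red

my reading is that only the "to" documents (those whose id matches a
parent_id of a color:red match) come back, and only their own stored fields;
nothing from the "from" documents can be returned alongside them. Is that right?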


Also, there is some difference with respect to scoring; on that, the
wiki says:

"The Join query produces constant scores for all documents that match --
scores computed by the nested query for the "from" documents are not
available to use in scoring the "to" documents"

Does it mean the subquery's score is not available to the main query? Is
this behaviour true for the Lucene join too?


Basically, I am trying to understand where and how Solr joins differ
from Lucene joins. Any pointers are much appreciated.


I have posted the same question on Stack Overflow.



Thanks,

Anand