Clarity on Stable Release

2020-01-29 Thread Jeff
TL;DR: I am having difficulty deciding on a stable release to use and would
like this to be easier.

Recently it has been rather difficult to figure out which release to use
based on its stability. This is probably due in part to the rapid release
cadence and to the versioning scheme applied to releases.

To demonstrate what I mean, let me walk through some of the process we've
had for determining what version to use starting at version 8.1.0:
1) 8.1.0 could not be used because of an NPE (SOLR-13475), so we upgraded to
8.1.1.
2) 8.1.1 could not be used because of intermittent 401s (SOLR-13510), so we
looked for a patch version 8.1.2 - which does not exist. So instead we
looked into upgrading to 8.2.0 (which includes new features and
improvements alongside bug fixes).
3) 8.2.0 is fine except for CVE-2019-12409, caused by a bad configuration.
It is still a good stable candidate if the configuration is simply
changed (or Solr is properly secured through networking measures anyway).
4) 8.3.0 contains a bug that causes data loss during inter-node updates
(SOLR-13963), so we must use patch version 8.3.1.
5) Versions 8.4.0 and 8.4.1 have since been released, and they seem stable
so far.

Now we are considering 8.2.0, 8.3.1, or 8.4.1, as they seem to be
stable. But it is hard to determine whether we should be using the bleeding
edge or a few minor versions back, since each of these includes many bug fixes.
It is unclear to me why some fixes get back-ported to patch releases and why
some are released only in new minor versions (which include some hefty
improvements and features).

To clarify, I am mostly asking for guidance on which versions *should*
be used for a stable system, and whether we can somehow make this clearer in
the future. I am not trying to point the finger at specific bugs; I am
simply using them as examples of why it is hard to judge a release's
stability.

If anybody has insight on this, please let me know.


Re: Clarity on Stable Release

2020-01-29 Thread Jeff
Thanks Shawn! Your answer is very helpful, especially your note about
keeping up to date with the latest major version once it has had a number of releases.

On Wed, Jan 29, 2020 at 6:35 PM Shawn Heisey  wrote:

> On 1/29/2020 11:24 AM, Jeff wrote:
> > Now we are considering 8.2.0, 8.3.1, or 8.4.1, as they seem to be
> > stable. But it is hard to determine whether we should be using the
> > bleeding edge or a few minor versions back, since each of these
> > includes many bug fixes. It is unclear to me why some fixes get
> > back-ported to patch releases and why some are released only in new
> > minor versions (which include some hefty improvements and features).
>
> 
>
> >
> > To clarify, I am mostly asking for guidance on which versions *should*
> > be used for a stable system, and whether we can somehow make this
> > clearer in the future. I am not trying to point the finger at specific
> > bugs; I am simply using them as examples of why it is hard to judge a
> > release's stability.
> >
> > If anybody has insight on this, please let me know.
>
> My personal thought about any particular major version is that before
> using that version, it's a good idea to wait for a few releases, so that
> somebody braver than me can find the really big problems.
>
> If 8.x were still brand new, I'd run the latest version of 7.x.  Since
> 8.x has had a number of releases, my current thought for a new
> deployment would be to run the latest version of 8.x.  I would also plan
> on watching for new issues and being aggressive about upgrading to
> future 8.x versions.  I would maintain a test environment to qualify
> those releases.
>
> All releases are called "stable".  That is the intent with any release
> -- for it to be good enough for anyone to use in production.  Sometimes
> we find problems after release.  When a problem is noted, we almost
> always create a test that will alert us if that problem should resurface.
>
> What you refer to as "bleeding edge" is the master branch, and that
> branch is never used to create releases.
>
> Thanks,
> Shawn
>


Re: How to check when a search exceeds the threshold of timeAllowed parameter

2015-12-23 Thread Jeff Wartes
Looks like it’ll set partialResults=true on your results if you hit the 
timeout. 

https://issues.apache.org/jira/browse/SOLR-502

https://issues.apache.org/jira/browse/SOLR-5986
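
For example (a quick sketch; the exact header layout can vary by version), a 
query that hits the limit comes back flagged in the response header:

curl 'http://localhost:8983/solr/collection1/select?q=*:*&timeAllowed=100&wt=json'

"responseHeader": {
  "status": 0,
  "QTime": 103,
  "partialResults": true,
  ...
}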






On 12/22/15, 5:43 PM, "Vincenzo D'Amore"  wrote:

>Well... I can write everything, but is all this really just to understand
>when the timeAllowed parameter triggers a partial answer? I mean, isn't
>there anything set in the response when it is partial?
>
>On Wed, Dec 23, 2015 at 2:38 AM, Walter Underwood 
>wrote:
>
>> We need to know a LOT more about your site. Number of documents, size of
>> index, frequency of updates, length of queries, approximate size of server
>> (CPUs, RAM, type of disk), version of Solr, version of Java, and features
>> you are using (faceting, highlighting, etc.).
>>
>> After that, we’ll have more questions.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Dec 22, 2015, at 4:58 PM, Vincenzo D'Amore 
>> wrote:
>> >
>> > Hi All,
>> >
>> > my website is under pressure; there is a big number of concurrent
>> > searches. When there are too many connected users, the searches become
>> > so slow that in some cases users have to wait many seconds.
>> > The queue of searches becomes so long that, in some cases, servers are
>> > blocked trying to serve all these requests.
>> > As far as I can tell, some searches are very expensive, and when many
>> > expensive searches clog the queue the server becomes unresponsive.
>> >
>> > In order to quickly work around this herd effect, I have added a
>> > default timeAllowed of 15 seconds, and this seems to help a lot.
>> >
>> > But during stress tests I'm unable to understand when and which
>> > requests are affected by the timeAllowed parameter.
>> >
>> > Just to be clear, I have configured the timeAllowed parameter in a
>> > SolrCloud environment. Given that partial results may be returned (if
>> > there are any), how can I know when this happens? When does the
>> > timeAllowed parameter trigger a partial answer?
>> >
>> > Best regards,
>> > Vincenzo
>> >
>> >
>> >
>> > --
>> > Vincenzo D'Amore
>> > email: v.dam...@gmail.com
>> > skype: free.dev
>> > mobile: +39 349 8513251
>>
>>
>
>
>-- 
>Vincenzo D'Amore
>email: v.dam...@gmail.com
>skype: free.dev
>mobile: +39 349 8513251


Error importing data - java.util.concurrent.RejectedExecutionException

2015-12-30 Thread Jeff Chastain
I will preface this with the fact that I am still pretty new to both Solr and 
Tomcat, so hopefully this is something obvious to somebody out there.  I have 
two 4.3.10 Solr servers set up in separate contexts, running on a Tomcat 7 
application server on Windows 2012.  When I attempt to import data from a SQL 
server into a collection on one of the Solr instances, no documents are created, 
and the log files, when run at full debug level, show the following:

--

DEBUG - 2015-12-30 13:24:53.469; 
org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE add{,id=216885} 
{{params(optimize=true&indent=true&clean=true&commit=true&verbose=false&command=full-import&debug=false&wt=json),defaults(config=db-data-config.xml)}}
WARN  - 2015-12-30 13:24:53.469; org.apache.solr.handler.dataimport.SolrWriter; 
Error creating document : SolrInputDocument(fields: [memberId=**, 
location=**,**, longitude=**, lastName=**, status=**, 
latitude=**, id=**, firstName=**, _version_=1522019276914950145])
org.apache.solr.common.SolrException: Exception writing document id ** to the index; possible analysis error.
 at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:168)
 at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
 at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
 at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:926)
 at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1080)
 at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:692)
 at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
 at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:71)
 at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:265)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:511)
 at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
 at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
 at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
 at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
 at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
 at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:461)
Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@15afacfe rejected from java.util.concurrent.ScheduledThreadPoolExecutor@132e86ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
 at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
 at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(Unknown Source)
 at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(Unknown Source)
 at org.apache.solr.update.CommitTracker._scheduleCommitWithin(CommitTracker.java:150)
 at org.apache.solr.update.CommitTracker._scheduleCommitWithinIfNeeded(CommitTracker.java:118)
 at org.apache.solr.update.CommitTracker.addedDocument(CommitTracker.java:169)
 at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:275)
 at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
 ... 15 more

--

I am not sure where to even start looking here, but the server appears to be 
running fine with ample CPU and memory headroom.  I have doubled the RAM 
available to the Tomcat server (1024 MB on start, 4096 MB max).

On the Solr side, I have checked the data shown against the schema for the 
collection and everything appears to line up.

I am at a loss here ... can anybody offer a pointer?

Thanks,
-- Jeff


Re: SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Jeff Wartes


I’m pretty sure you could change the name when you ADDREPLICA using a core.name 
property. I don’t know if you can when you initially create the collection 
though.

The CLUSTERSTATUS command will tell you the core names: 
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18
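
For example (hypothetical host and names), you can pull the core/replica names 
out of CLUSTERSTATUS and then delete by replica name:

curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1&wt=json'
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=collection1&shard=shard2&replica=core_node1'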
 

That said, this tool might make things easier.
https://github.com/whitepages/solrcloud_manager


# shows cluster status, including core names:
java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp


# deletes a replica by node/collection/shard (figures out the core name under 
the hood)
java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z 
zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice 
shard2


I mention this tool every now and then on this list because I like it, but I’m 
the author, so take that with a pretty big grain of salt. Feedback is very 
welcome.







On 1/8/16, 1:18 PM, "Robert Brown"  wrote:

>Hi,
>
>I'm having trouble identifying a replica to delete...
>
>I've created a 3-shard cluster, all 3 created on a single host, then 
>added a replica for shard2 onto another host, no problem so far.
>
>Now I want to delete the original shard, but got this error when trying 
>a *replica* param value I thought would work...
>
>shard2/uk available replicas are core_node1,core_node4
>
>I can't find any mention of core_node1 or core_node4 via the admin UI, 
>how would I know/find the name of each one?
>
>Is it possible to set these names explicitly myself for easier maintenance?
>
>Many thanks for any guidance,
>Rob
>


Re: SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Jeff Wartes

Honestly, I have no idea which is "old". The solr source itself uses slice 
pretty consistently, so I stuck with that when I started the project last year. 
And logically, a shard being an instance of a slice makes sense to me. But one 
significant place where the word shard is exposed is the default names of the 
slices, so it’s a mixed bag.


See here:
  https://github.com/whitepages/solrcloud_manager#terminology






On 1/8/16, 2:34 PM, "Robert Brown"  wrote:

>Thanks for the pointer Jeff,
>
>For SolrCloud it turned out to be...
>
>&property.coreNodeName=xxx
>
>btw, for your app, isn't "slice" old notation?
>
>
>
>
>On 08/01/16 22:05, Jeff Wartes wrote:
>>
>> I’m pretty sure you could change the name when you ADDREPLICA using a 
>> core.name property. I don’t know if you can when you initially create the 
>> collection though.
>>
>> The CLUSTERSTATUS command will tell you the core names: 
>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18
>>
>> That said, this tool might make things easier.
>> https://github.com/whitepages/solrcloud_manager
>>
>>
>> # shows cluster status, including core names:
>> java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp
>>
>>
>> # deletes a replica by node/collection/shard (figures out the core name 
>> under the hood)
>> java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z 
>> zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice 
>> shard2
>>
>>
>> I mention this tool every now and then on this list because I like it, but 
>> I’m the author, so take that with a pretty big grain of salt. Feedback is 
>> very welcome.
>>
>>
>>
>>
>>
>>
>>
>> On 1/8/16, 1:18 PM, "Robert Brown"  wrote:
>>
>>> Hi,
>>>
>>> I'm having trouble identifying a replica to delete...
>>>
>>> I've created a 3-shard cluster, all 3 created on a single host, then
>>> added a replica for shard2 onto another host, no problem so far.
>>>
>>> Now I want to delete the original shard, but got this error when trying
>>> a *replica* param value I thought would work...
>>>
>>> shard2/uk available replicas are core_node1,core_node4
>>>
>>> I can't find any mention of core_node1 or core_node4 via the admin UI,
>>> how would I know/find the name of each one?
>>>
>>> Is it possible to set these names explicitly myself for easier maintenance?
>>>
>>> Many thanks for any guidance,
>>> Rob
>>>
>


Re: collection configuration stored in Zoo Keeper with solrCloud

2016-01-11 Thread Jeff Courtade
Yes, it's stored in the directories configured in zoo.cfg, so ZooKeeper can
restore everything from disk after a restart.
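
For reference, the relevant zoo.cfg settings look like this sketch (paths are
illustrative):

# zoo.cfg
dataDir=/var/lib/zookeeper         # snapshots (and txn logs if dataLogDir is unset)
dataLogDir=/var/lib/zookeeper/log  # transaction logs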

Jeff Courtade
M: 240.507.6116
On Jan 11, 2016 1:16 PM, "Jim Shi"  wrote:

> Hi, I have question regarding collection configurations stored Zoo Keeper
> with solrCloud.
> All collection configurations are stored at Zoo Keeper. What happens if
> you want to restart all Zoo Keeper instances? Does the Zoo Keeper persists
> data on disk and can restore all configurations from disk?


Re: SolrCloud replicas out of sync

2016-01-26 Thread Jeff Wartes

My understanding is that the "version" represents the timestamp the searcher 
was opened, so it doesn’t really offer any assurances about your data.

Although you could probably bounce a node and get your document counts back in 
sync (by provoking a check), it’s interesting that you’re in this situation. It 
implies to me that at some point the leader couldn’t write a doc to one of the 
replicas, but that the replica didn’t consider itself down enough to check 
itself.

You might watch the achieved replication factor of your updates and see if it 
ever changes:
https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
 (See Achieved Replication Factor/min_rf)
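
For example (a sketch; the document body is made up), pass min_rf on the update 
and check the rf value that comes back in the response:

curl 'http://localhost:8983/solr/collection1/update?min_rf=3&wt=json' \
  -H 'Content-Type: application/json' -d '[{"id":"doc1"}]'

# If the returned rf is less than min_rf, the update reached fewer
# replicas than you asked for, and you may want to retry it.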

If it does, that might give you clues about how this is happening. Also, it 
might allow you to work around the issue by trying the write again.






On 1/22/16, 10:52 AM, "David Smith"  wrote:

>I have a SolrCloud v5.4 collection with 3 replicas that appear to have fallen 
>permanently out of sync.  Users started to complain that the same search, 
>executed twice, sometimes returned different result counts.  Sure enough, our 
>replicas are not identical:
>
>>> shard1_replica1:  89867 documents / version 1453479763194
>>> shard1_replica2:  89866 documents / version 1453479763194
>>> shard1_replica3:  89867 documents / version 1453479763191
>
>I do not think this discrepancy is going to resolve itself.  The Solr Admin 
>screen reports all 3 replicas as “Current”.  The last modification to this 
>collection was 2 hours before I captured this information, and our auto commit 
>time is 60 seconds.  
>
>I have a lot of concerns here, but my first question is if anyone else has had 
>problems with out of sync replicas, and if so, what they have done to correct 
>this?
>
>Kind Regards,
>
>David
>


Re: SolrCloud replicas out of sync

2016-01-26 Thread Jeff Wartes

Ah, perhaps you fell into something like this then? 
https://issues.apache.org/jira/browse/SOLR-7844

That says it’s fixed in 5.4, but that would be an example of a split-brain type 
incident, where different documents were accepted by different replicas who 
each thought they were the leader. If this is the case, and you actually have 
different data on each replica, I’m not aware of any way to fix the problem 
short of reindexing those documents. Before that, you’ll probably need to 
choose a replica and just force the others to get in sync with it. I’d choose 
the current leader, since that’s slightly easier.

Typically, a leader writes an update to its transaction log, then sends the 
request to all replicas, and when those all finish it acknowledges the update. 
If a replica gets restarted, and is less than N documents behind, the leader 
will only replay that transaction log. (Where N is the numRecordsToKeep 
configured in the updateLog section of solrconfig.xml)
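
For reference, a sketch of that setting in solrconfig.xml (the 500 here is 
made up; the default is much smaller):

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numRecordsToKeep">500</int>
</updateLog>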

What you want is to provoke the heavy-duty process normally invoked if a 
replica has missed more than N docs, which essentially does a checksum and file 
copy on all the raw index files. FetchIndex would probably work, but it’s a 
replication handler API originally designed for master/slave replication, so 
take care: https://wiki.apache.org/solr/SolrReplication#HTTP_API
Probably a lot easier would be to just delete the replica and re-create it. 
That will also trigger a full file copy of the index from the leader onto the 
new replica.
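
If you do try FetchIndex instead, it looks roughly like this (core names and 
hosts are placeholders):

curl 'http://replica-host:8983/solr/collection1_shard1_replica2/replication?command=fetchindex&masterUrl=http://leader-host:8983/solr/collection1_shard1_replica1/replication'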

I think design decisions around Solr generally use CP as a goal. (I sometimes 
wish I could get more AP behavior!) See posts like this: 
http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/
 
So the fact that you encountered this sounds like a bug to me.
That said, another general recommendation (of mine) is that you not use Solr as 
your primary data source, so you can rebuild your index from scratch if you 
really need to. 






On 1/26/16, 1:10 PM, "David Smith"  wrote:

>Thanks Jeff!  A few comments
>
>>>
>>> Although you could probably bounce a node and get your document counts back 
>>> in sync (by provoking a check)
>>>
> 
>
>If the check is a simple doc count, that will not work. We have found that 
>replica1 and replica3, although they contain the same doc count, don’t have 
>the SAME docs.  They each missed at least one update, but of different docs.  
>This also means none of our three replicas are complete.
>
>>>
>>>it’s interesting that you’re in this situation. It implies to me that at 
>>>some point the leader couldn’t write a doc to one of the replicas,
>>>
>
>That is our belief as well. We experienced a datacenter-wide network 
>disruption of a few seconds, and user complaints started the first workday 
>after that event.  
>
>The most interesting log entry during the outage is this:
>
>"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessorRequest says it is 
>coming from leader,​ but we are the leader: 
>update.distrib=FROMLEADER&distrib.from=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/&wt=javabin&version=2"
>
>>>
>>> You might watch the achieved replication factor of your updates and see if 
>>> it ever changes
>>>
>
>This is a good tip. I’m not sure I like the implication that any failure to 
>write all 3 of our replicas must be retried at the app layer.  Is this really 
>how SolrCloud applications must be built to survive network partitions without 
>data loss? 
>
>Regards,
>
>David
>
>
>On 1/26/16, 12:20 PM, "Jeff Wartes"  wrote:
>
>>
>>My understanding is that the "version" represents the timestamp the searcher 
>>was opened, so it doesn’t really offer any assurances about your data.
>>
>>Although you could probably bounce a node and get your document counts back 
>>in sync (by provoking a check), it’s interesting that you’re in this 
>>situation. It implies to me that at some point the leader couldn’t write a 
>>doc to one of the replicas, but that the replica didn’t consider itself down 
>>enough to check itself.
>>
>>You might watch the achieved replication factor of your updates and see if it 
>>ever changes:
>>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>> (See Achieved Replication Factor/min_rf)
>>
>>If it does, that might give you clues about how this is happening. Also, it 
>>might allow you to work around the issue by trying the write again.
>>
>>
>>
>>
>>
>>
>>On 1/22/16, 10:52 AM, "David Smith"  wrote:
>>
>>>I have a SolrCloud v5.4 collection with 3 replica

Re: SolrCloud replicas out of sync

2016-01-27 Thread Jeff Wartes

On 1/27/16, 8:28 AM, "Shawn Heisey"  wrote:





>
>I don't think any documentation states this, but it seems like a good
>idea to me use an alias from day one, so that you always have the option
>of swapping the "real" collection that you are using without needing to
>change anything else.  I'll need to ask some people if they think this
>is a good documentation addition, and think of a good place to mention
>it in the reference guide.


+1 - I recommend this at every opportunity. 

I’ve even considered creating 10 aliases for a single collection and having the 
client select one of the aliases randomly per query. This would allow 
transparently shifting traffic between collections in 10% increments.
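
A sketch of the idea (untested, names made up):

# point all ten aliases at the current collection
for i in $(seq 1 10); do
  curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=search$i&collections=collection_v1"
done
# repoint one alias to shift ~10% of query traffic to a new collection
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=search1&collections=collection_v2'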




Re: SolrCloud replicas out of sync

2016-01-27 Thread Jeff Wartes

If you can identify the problem documents, you can just re-index those after 
forcing a sync. Might save a full rebuild and downtime.

You might describe your cluster setup, including ZK. It sounds like you’ve done 
your research, but improper ZK node distribution could certainly invalidate 
some of Solr’s assumptions.




On 1/27/16, 7:59 AM, "David Smith"  wrote:

>Jeff, again, very much appreciate your feedback.  
>
>It is interesting — the article you linked to by Shalin is exactly why we 
>picked SolrCloud over ES, because (eventual) consistency is critical for our 
>application and we will sacrifice availability for it.  To be clear, after the 
>outage, NONE of our three replicas are correct or complete.
>
>So we definitely don’t have CP yet — our very first network outage resulted in 
>multiple overlapped lost updates.  As a result, I can’t pick one replica and 
>make it the new “master”.  I must rebuild this collection from scratch, which 
>I can do, but that requires downtime which is a problem in our app (24/7 High 
>Availability with few maintenance windows).
>
>
>So, I definitely need to “fix” this somehow.  I wish I could outline a 
>reproducible test case, but as the root cause is likely very tight timing 
>issues and complicated interactions with Zookeeper, that is not really an 
>option.  I’m happy to share the full logs of all 3 replicas though if that 
>helps.
>
>I am curious though if the thoughts have changed since 
>https://issues.apache.org/jira/browse/SOLR-5468 of seriously considering a 
>“majority quorum” model, with rollback?  Done properly, this should be free of 
>all lost update problems, at the cost of availability.  Some SolrCloud users 
>(like us!!!) would gladly accept that tradeoff.  
>
>Regards
>
>David
>
>


Re: collection aliasing

2016-01-28 Thread Jeff Wartes
I enjoy using collection aliases in all client references, because that allows 
me to change the collection all clients use without updating the clients. I 
just move the alias. 
This is particularly useful if I’m doing a full index rebuild and want an 
atomic, zero-downtime switchover.
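
Concretely (hypothetical names), the switchover is just re-issuing CREATEALIAS, 
since creating an alias with an existing name repoints it:

# clients always query the "products" alias
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_20160101'
# after rebuilding into a new collection, atomically repoint the alias
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_20160201'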





On 1/28/16, 6:07 AM, "Shawn Heisey"  wrote:

>On 1/28/2016 2:59 AM, vidya wrote:
>> Hi
>>
>> Then what is the difference between collection aliasing and the shards
>> parameter mentioned in the request handler of solrconfig.xml?
>>
>> In the request handler of the new collection's solrconfig.xml:
>>shards =
>> http://localhost:8983/solr/collection1,http://localhost:8983/solr/collection1
>> I can query the data of both collection1 and collection2 in the new
>> collection, which is the same as collection aliasing.
>>
>> Is my understanding correct? If so, what is the special characteristic
>> of collection aliasing? Please help me.
>
>Collection aliasing handles it completely automatically, no need to put
>a shards parameter *anywhere*.  That is the main difference.
>
>The shards parameter is the old way of doing distributed searches. 
>SolrCloud completely automates the process so that neither the admin nor
>the user has to worry about it.  Aliases are part of that automation.
>
>Thanks,
>Shawn
>


Re: Restoring backups of solrcores

2016-02-01 Thread Jeff Wartes

Aliases work when indexing too.

Create collection: collection1
Create alias: this_week -> collection1
Index to: this_week

Next week...

Create collection: collection2
Create (Move) alias: this_week -> collection2
Index to: this_week




On 2/1/16, 2:14 AM, "vidya"  wrote:

>Hi 
>
>How can that be useful? Can you please explain?
>I want to have the same collection name every time I index data, i.e.,
>current_collection.
>
>By collection aliasing, I can create a new collection and point my alias
>(say ALIAS) to the new collection, but I cannot rename that collection to
>the same current_collection which I created and indexed the previous week.
>
>So, are you asking me to create a collection with whatever name I want,
>point an alias with the name I want at it, move that alias to each new
>collection I create, and query using the alias name?
>
>Please help me on this.
>
>Thanks in advance
>
>
>
>--
>View this message in context: 
>http://lucene.472066.n3.nabble.com/Restoring-backups-of-solrcores-tp4254080p4254366.html
>Sent from the Solr - User mailing list archive at Nabble.com.


Re: Shard allocation across nodes

2016-02-01 Thread Jeff Wartes

You could write your own snitch: 
https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement

Or, it would be more annoying, but you can always add/remove replicas manually 
and juggle things yourself after you create the initial collection.
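
For example (a sketch using the documented "at most one replica of a shard per 
node" rule; spreading across physical VM hosts would need a snitch that exposes 
a host tag):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=3&replicationFactor=2&rule=shard:*,replica:<2,node:*'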




On 2/1/16, 8:42 AM, "Tom Evans"  wrote:

>Hi all
>
>We're setting up a solr cloud cluster, and unfortunately some of our
>VMs may be physically located on the same VM host. Is there a way of
>ensuring that all copies of a shard are not located on the same
>physical server?
>
>If they do end up in that state, is there a way of rebalancing them?
>
>Cheers
>
>Tom


Re: Adding nodes

2016-02-17 Thread Jeff Wartes
Solrcloud does not come with any autoscaling functionality. If you want such a 
thing, you’ll need to write it yourself.

https://github.com/whitepages/solrcloud_manager might be a useful head start 
though, particularly the “fill” and “cleancollection” commands. I don’t do 
*auto* scaling, but I do use this for all my cluster management, which 
certantly involves moving collections/shards around among nodes, adding 
capacity, and removing capacity.






On 2/14/16, 11:17 AM, "McCallick, Paul"  wrote:

>These are excellent questions and give me a good sense of why you suggest 
>using the collections api.
>
>In our case we have 8 shards of product data with an even distribution of data 
>per shard, no hot spots. We have very different load at different points in 
>the year (cyber monday), and we tend to have very little traffic at night. I'm 
>thinking of two use cases:
>
>1) we are seeing increased latency due to load and want to add 8 more replicas 
>to handle the query volume.  Once the volume subsides, we'd remove the nodes. 
>
>2) we lose a node due to some unexpected failure (ec2 tends to do this). We 
>want auto scaling to detect the failure and add a node to replace the failed 
>one. 
>
>In both cases the core api makes it easy. It adds nodes to the shards evenly. 
>Otherwise we have to write a fairly involved script that is subject to race 
>conditions to determine which shard to add nodes to. 
>
>Let me know if I'm making dangerous or uninformed assumptions, as I'm new to 
>solr. 
>
>Thanks,
>Paul
>
>> On Feb 14, 2016, at 10:35 AM, Susheel Kumar  wrote:
>> 
>> Hi Paul,
>> 
>> 
>> For auto-scaling, it depends on how you are thinking to design it and
>> what/how you want to scale. Which scenario do you think makes the coreadmin
>> API easy to use for a sharded SolrCloud environment?
>> 
>> Isn't it the case that, in a sharded environment (assume 3 shards A, B & C),
>> if shard B has a higher load, you want to add a replica for shard B to
>> distribute the load, or if a particular shard replica goes down you
>> want to add another replica back for that shard, in which case ADDREPLICA
>> requires a shard name?
>> 
>> Can you describe your scenario / provide more detail?
>> 
>> Thanks,
>> Susheel
>> 
>> 
>> 
>> On Sun, Feb 14, 2016 at 11:51 AM, McCallick, Paul <
>> paul.e.mccall...@nordstrom.com> wrote:
>> 
>>> Hi all,
>>> 
>>> 
>>> This doesn’t really answer the following question:
>>> 
>>> What is the suggested way to add a new node to a collection via the
>>> apis?  I  am specifically thinking of autoscale scenarios where a node has
>>> gone down or more nodes are needed to handle load.
>>> 
>>> 
>>> The coreadmin api makes this easy.  The collections api (ADDREPLICA),
>>> makes this very difficult.
>>> 
>>> 
 On 2/14/16, 8:19 AM, "Susheel Kumar"  wrote:
 
 Hi Paul,
 
 Shawn is referring to using the Collections API
 https://cwiki.apache.org/confluence/display/solr/Collections+API rather than
 the Core Admin API
 https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API
 for SolrCloud.
 
 Hope that clarifies. You mentioned ADDREPLICA, which is a Collections API
 call, so you are on the right track.
 
 Thanks,
 Susheel
 
 
 
 On Sun, Feb 14, 2016 at 10:51 AM, McCallick, Paul <
 paul.e.mccall...@nordstrom.com> wrote:
 
> Then what is the suggested way to add a new node to a collection via the
> apis?  I  am specifically thinking of autoscale scenarios where a node
>>> has
> gone down or more nodes are needed to handle load.
> 
> Note that the ADDREPLICA endpoint requires a shard name, which puts the
> onus of how to scale out on the user. This can be challenging in an
> autoscale scenario.
> 
> Thanks,
> Paul
> 
>> On Feb 14, 2016, at 12:25 AM, Shawn Heisey 
>>> wrote:
>> 
>>> On 2/13/2016 6:01 PM, McCallick, Paul wrote:
>>> - When creating a new collection, SOLRCloud will use all available
> nodes for the collection, adding cores to each.  This assumes that you
>>> do
> not specify a replicationFactor.
>> 
>> The number of nodes that will be used is numShards multipled by
>> replicationFactor.  The default value for replicationFactor is 1.  If
>> you do not specify numShards, there is no default -- the CREATE call
>> will fail.  The value of maxShardsPerNode can also affect the overall
>> result.
>> 
>>> - When adding new nodes to the cluster AFTER the collection is
>>> created,
> one must use the core admin api to add the node to the collection.
>> 
>> Using the CoreAdmin API is strongly discouraged when running
>>> SolrCloud.
>> It works, but it is an expert API when in cloud mode, and can cause
>> serious problems if not used correctly.  Instead, use the Collections
>> API.  It can handle all normal maintenance needs.
>> 
>>> I would really like to see the second case behave more like the
>>> first.

Re: very slow frequent updates

2016-02-23 Thread Jeff Wartes

My suggestion would be to split your problem domain. Use Solr exclusively for 
search - index the id and only those fields you need to search on. Then use 
some other data store for retrieval. Get the id’s from the solr results, and 
look them up in the data store to get the rest of your fields. This allows you 
to keep your solr docs as small as possible, and you only need to update them 
when a *searchable* field changes.

Every “update" in solr is a delete/insert. Even the "atomic update” feature is 
just a shortcut for that. It requires stored fields because the data from the 
stored fields gets copied into the new insert.
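
For reference, an atomic update request looks like this sketch (id and field 
are made up) - under the hood Solr still rewrites the whole document from its 
stored fields:

curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"book1","price":{"set":9.99}}]'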





On 2/22/16, 12:21 PM, "Roland Szűcs"  wrote:

>Hi folks,
>
>We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of the
>fields do not change at all (content, author, publisher, ...); only the
>price field changes frequently.
>
>We let the customers make full text searches, so we indexed the content
>field. Due to the frequency of the price updates we use the atomic update
>feature. As a requirement of atomic updates we have to store all the
>fields, even the content field, which is 1 MB/document and which we did
>not want to store, just index.
>
>When we updated 100 documents with atomic updates, it took about 3
>minutes. Taking into account that our metadata/document is 1 KB and our
>content field/document is 1 MB, we use 1000x more memory to accelerate the
>update process.
>
>I am almost 100% sure that we are doing something wrong.
>
>What is the best practice for frequent updates when 99% of a given
>document is constant forever?
>
>Thanks in advance
>
>-- 
> Roland Szűcs
> Connect with
>me on Linkedin 
>
>CEO Phone: +36 1 210 81 13
>Bookandwalk.hu 


Re: very slow frequent updates

2016-02-24 Thread Jeff Wartes

I suspect your problem is the intersection of “very large document” and “high 
rate of change”. Either of those alone would be fine.

You’re correct, if the thing you need to search or sort by is the thing with a 
high change rate, you probably aren’t going to be able to peel those things out 
of your index. 

Perhaps you could work something out with join queries? So you have two kinds 
of documents - book content and book price - and your high-frequency change is 
limited to documents with very little data.
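
A sketch of that (field names invented): small "price" docs carry a book_id and 
a price, and a join maps matching prices back to the big content docs:

q={!join from=book_id to=id}price:[* TO 20]

Note the usual join caveats apply - for example, you can't sort the book docs 
by a field that only exists on the price docs.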





On 2/24/16, 4:01 AM, "roland.sz...@booknwalk.com on behalf of Szűcs Roland" 
 wrote:

>I have checked it already in the ref. guide. It is stated that you cannot
>search on external fields:
>https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes
>
>Really, I am very curious whether my problem is unusual, or whether SOLR
>mainly focuses on search and not on a kind of end-to-end support.
>How does this approach work with 1 million documents with frequently
>changing prices?
>
>Thanks your time,
>
>Roland
>
>2016-02-24 12:39 GMT+01:00 Stefan Matheis :
>
>> Depending of what features you do actually need, might be worth a look
>> on "External File Fields" Roland?
>>
>> -Stefan
>>
>> On Wed, Feb 24, 2016 at 12:24 PM, Szűcs Roland
>>  wrote:
>> > Thanks Jeff your help,
>> >
>> > Can it work in a production environment? Imagine my customer initiates
>> > a query with 1,000 docs in the result set. I cannot use the pagination
>> > of SOLR, as the field which is the basis of the sort (for example the
>> > price) is not included in the schema. The customer wants the list in
>> > descending order of the price.
>> >
>> > So I have to get all 1,000 doc ids from Solr and find their metadata in
>> > a SQL database, or in a cache in the best case. Is this the way you
>> > suggested? Is it not too slow?
>> >
>> > Regards,
>> > Roland
>> >
>> > 2016-02-23 19:29 GMT+01:00 Jeff Wartes :
>> >
>> >>
>> >> My suggestion would be to split your problem domain. Use Solr
>> exclusively
>> >> for search - index the id and only those fields you need to search on.
>> Then
>> >> use some other data store for retrieval. Get the id’s from the solr
>> >> results, and look them up in the data store to get the rest of your
>> fields.
>> >> This allows you to keep your solr docs as small as possible, and you
>> only
>> >> need to update them when a *searchable* field changes.
>> >>
>> >> Every “update" in solr is a delete/insert. Even the "atomic update”
>> >> feature is just a shortcut for that. It requires stored fields because
>> the
>> >> data from the stored fields gets copied into the new insert.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 2/22/16, 12:21 PM, "Roland Szűcs" 
>> wrote:
>> >>
>> >> >Hi folks,
>> >> >
>> >> >We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of the
>> >> >fields do not change at all like content, author, publisher Only
>> the
>> >> >price field changes frequently.
>> >> >
>> >> >We let the customers to make full text search so we indexed the content
>> >> >field. Due to the frequency of the price updates we use the atomic
>> update
>> >> >feature. As a requirement of the atomic updates we have to store all
>> the
>> >> >fields even the content field which is 1MB/document and we did not
>> want to
>> >> >store it just index it.
>> >> >
>> >> >As we wanted to update 100 documents with atomic update it took about 3
>> >> >minutes. Taking into account that our metadata /document is 1 Kb and
>> our
>> >> >content field / document is 1MB we use 1000 more memory to accelerate
>> the
>> >> >update process.
>> >> >
>> >> >I am almost 100% sure that we make something wrong.
>> >> >
>> >> >What is the best practice of the frequent updates when 99% part of a
>> given
>> >> >document is constant forever?
>> >> >
>> >> >Thank in advance
>> >> >
>> >> >--
>> >> ><https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu> Roland
>> >> Szűcs
>> >> ><https:/

Re: Shard State vs Replica State

2016-02-26 Thread Jeff Wartes

I believe the shard state is a reflection of whether that shard is still in use 
by the collection, and has nothing to do with the state of the replicas. I 
think doing a split-shard operation would create two new shards, and mark the 
old one as inactive, for example.




On 2/26/16, 8:50 AM, "Dennis Gove"  wrote:

>In clusterstate.json (or just state.json in new versions) I'm seeing the
>following
>
>"shard1":{
>"range":"8000-d554",
>"state":"active",
>"replicas":{
>  "core_node7":{
>"core":"people_shard1_replica3",
>"base_url":"http://192.168.2.32:8983/solr";,
>"node_name":"192.168.2.32:8983_solr",
>"state":"down"},
>  "core_node9":{
>"core":"people_shard1_replica2",
>"base_url":"http://192.168.2.32:8983/solr";,
>"node_name":"192.168.2.32:8983_solr",
>"state":"down"},
>  "core_node2":{
>"core":"people_shard1_replica1",
>"base_url":"http://192.168.2.32:8983/solr";,
>"node_name":"192.168.2.32:8983_solr",
>"state":"down"}
>}
>},
>
>All replicas are down (I hosed the index for one of the replicas on purpose
>to simulate this) and each replica is showing its state accurately as
>"down". But the shard state is still showing "active". I would expect the
>shard state to reflect the availability of that shard (ie, the best state
>across all the replicas). For example, if one replica is active then the
>shard state is active, if two replicas are recovering and one is down then
>the shard state shows recovering, etc...
>
>What I'm seeing, however, doesn't match my expectation so I'm wondering
>what is shard state showing?
>
>Thanks,
>Dennis


Re: Prevent the SSL Keystore and Truststore password from showing up in the Solr Admin and Linux processes (Solr 5.2.1)

2016-02-29 Thread Jeff Wu
Hi Katherine, we had exactly the same issue; we need to protect our passwords.
Anyone who can access the solr server can do "ps -elf|grep java" to grep the
solr command line, and it has all the passwords in plain text.

The /bin/solr shell will set 10 related system properties:
 SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \
-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \
-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \
-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \
-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \
-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"
  SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_KEY_STORE \
  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_KEY_STORE_PASSWORD \
  -Djavax.net.ssl.trustStore=$SOLR_SSL_TRUST_STORE \
  -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_TRUST_STORE_PASSWORD"
and also
   SOLR_JETTY_CONFIG+=("--module=https")

The questions we have:
1. We suspect "OBF:XYZ" does not work when set in solr.in.sh - the
javax.net.ssl properties can't work with Jetty OBF values. What we saw is an
incorrect-password error:

Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect
 at sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:780)
 at sun.security.provider.JavaKeyStore$JKS.engineLoad(JavaKeyStore.java:56)
 at sun.security.provider.KeyStoreDelegator.engineLoad(KeyStoreDelegator.java:225)
 at sun.security.provider.JavaKeyStore$DualFormatJKS.engineLoad(JavaKeyStore.java:70)
 at java.security.KeyStore.load(KeyStore.java:1445)
 at sun.security.ssl.SSLContextImpl$DefaultSSLContext.getDefaultKeyManager(SSLContextImpl.java:852)
 at sun.security.ssl.SSLContextImpl$DefaultSSLContext.<init>(SSLContextImpl.java:732)
 at sun.reflect.GeneratedConstructorAccessor280.newInstance(Unknown Source)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
 at java.security.Provider$Service.newInstance(Provider.java:1595)

2. Is there any good sample we can reference for configuring jetty-https.xml
and jetty-ssl.xml to leverage Jetty OBF?
Katherine, can you share your jetty-ssl.xml and jetty-https.xml?
From this link:
http://www.eclipse.org/jetty/documentation/9.2.6.v20141205/configuring-ssl.html,
did you put the keystore files under the jetty home
and follow this sample?

  <Set name="KeyStorePath">/etc/keystore</Set>
  <Set name="KeyStorePassword">OBF:1vny1zlo1x8e1vnw1vn61x8g1zlu1vn4</Set>
  <Set name="KeyManagerPassword">OBF:1u2u1wml1z7s1z7a1wnl1u2g</Set>
  <Set name="TrustStorePath">/etc/keystore</Set>
  <Set name="TrustStorePassword">OBF:1vny1zlo1x8e1vnw1vn61x8g1zlu1vn4</Set>





2016-02-15 13:23 GMT-05:00 Katherine Mora :

> Hello All,
>
> I've configured Solr 5.2.1 to enable SSL by following the instructions
> listed in the Wiki in Enabling SSL
> (https://cwiki.apache.org/confluence/display/solr/Enabling+SSL). This is
> working fine. However, if I go to the Solr Admin (Dashboard -> JVM -> Args)
> or if I list the processes running in the computer, I can see the password
> that I set in the solr.in.sh script for SOLR_SSL_KEY_STORE_PASSWORD and
> SOLR_SSL_TRUST_STORE_PASSWORD:
>
> -Dsolr.jetty.truststore.password=XYZ
> -Dsolr.jetty.keystore.password=XYZ
> -Djavax.net.ssl.trustStorePassword=XYZ
> -Djavax.net.ssl.keyStorePassword=XYZ
>
>
> I have tried securing the passwords using Jetty's Password utility:
>
> java -cp jetty-util-9.2.10.v20150310.jar
> org.eclipse.jetty.util.security.Password XYZ
>
> And using the "OBF:XYZ" password in solr.in.sh instead but I get an
> exception java.security.NoSuchAlgorithmException -> java.io.IOException:
> Keystore was tampered with, or password was incorrect (I'm listing the
> complete exception below as well)
>
> Additionally, I have tried to remove the lines in the "bin/solr" script
> that set the passwords in SOLR_SSL_OPTS and eventually in SOLR_OPTS
> instead, setting the passwords directly in the jetty configuration files
> located under "server/etc". However, when I do this, I get an exception
> saying the password cannot be null. It seems like there is a setting that
> is not listed in the jetty files. I found that "keyManagerPassword" is not
> listed in the jetty-ssl.xml file and I added it, but I keep getting the
> same error.
>
> Does anyone know how to prevent the SSL keystore and trust store password
> from showing up in the Solr Admin by doing the configuration in the jetty
> files or by securing the passwords?
>
> Thanks in advance for any help you can provide.
>
>
> Caused by: java.net.SocketException: java.security.NoSuchAlgorithmException:
> Error constructing implementation (algorithm: Default, provider: SunJSSE,
> class: sun.security.ssl.SSLContextImpl$DefaultSSLContext)
> at javax.net.ssl.DefaultSSLSocketFactory.throwException(SSLSocketFactory.java:198)
> at javax.net.ssl.DefaultSSLSocketFactory.createSocket(SSLSocketFactory.java:205)
> at org.apache.http.conn.ssl.SSLSocketFactory.createSocket(SSLSocketFactory.java:513)
> at org.apache.http.conn.ssl.SSLSock

Re: SolrCloud - Strategy for recovering cluster states

2016-03-01 Thread Jeff Wartes

I’ve been running SolrCloud clusters in various versions for a few years here, 
and I can only think of two or three cases that the ZK-stored cluster state was 
broken in a way that I had to manually intervene by hand-editing the contents 
of ZK. I think I’ve seen Solr fixes go by for those cases, too. I’ve never 
completely wiped ZK. (Although granted, my ZK cluster has been pretty stable, 
and my collection count is smaller than yours)

My philosophy is that ZK is the source of cluster configuration, not the 
collection of core.properties files on the nodes. 
Currently, cluster state is shared between ZK and core directories. I’d prefer, 
and I think Solr development is going this way, (SOLR-7269) that all cluster 
state exist and be managed via ZK, and all state be removed from the local disk 
of the cluster nodes. The fact that a node uses local disk based configuration 
to figure out what collections/replicas it has is something that should be 
fixed, in my opinion.

If you’re frequently getting into bad states due to ZK issues, I’d suggest you 
file bugs against Solr for the fact that you got into the state, and then fix 
your ZK cluster.

Failing that, can you just periodically back up your ZK data and restore it if 
something breaks? I wrote a little tool to watch clusterstate.json and write 
every version to a local git repo a few years ago. I was mostly interested 
because I wanted to see changes that happened pretty fast, but it could also 
serve as a backup approach. Here’s a link, although I clearly haven’t touched 
it lately. Feel free to ask if you have issues: 
https://github.com/randomstatistic/git_zk_monitor
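
Even a plain cron job with Solr's zkcli works for simple snapshots (paths are 
illustrative):

server/scripts/cloud-scripts/zkcli.sh -zkhost zk1.example.com:2181 \
  -cmd getfile /clusterstate.json /backups/clusterstate-$(date +%F).json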




On 3/1/16, 12:09 PM, "danny teichthal"  wrote:

>Hi,
>Just summarizing my questions, in case the long mail is a little intimidating:
>1. Is there a best practice/automated tool for overcoming problems in
>cluster state coming from zookeeper disconnections?
>2. Creating a collection via core admin is discouraged; is that true also
>for core.properties discovery?
>
>I would like to be able to specify collection.configName in the
>core.properties and when starting server, the collection will be created
>and linked to the config name specified.
>
>
>
>On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal 
>wrote:
>
>> Hi,
>>
>>
>> I would like to describe a process we use for overcoming problems in
>> cluster state when we have networking issues. Would appreciate if anyone
>> can answer about what are the flaws on this solution and what is the best
>> practice for recovery in case of network problems involving zookeeper.
>> I'm working with Solr Cloud with version 5.2.1
>> ~100 collections in a cluster of 6 machines.
>>
>> This is the short procedure:
>> 1. Bring all the cluster down.
>> 2. Clear all data from zookeeper.
>> 3. Upload configuration.
>> 4. Restart the cluster.
>>
>> We rely on the fact that a collection is created during the core discovery
>> process if it does not exist. It gives us much flexibility.
>> When the cluster comes up, it reads from core.properties and creates the
>> collections if needed.
>> Since we have only one configuration, the collections are automatically
>> linked to it and the cores inherit it from the collection.
>> This is a very robust procedure that helped us overcome many problems
>> while we stabilized our cluster, which is now pretty stable.
>> I know that the leader might change in such a case and updates may be lost, but
>> it is ok.
>>
>>
>> The problem is that today I want to add a new config set.
>> When I add it and clear zookeeper, the cores cannot be created because
>> there are 2 configurations. This breaks my recovery procedure.
>>
>> I thought about a few options:
>> 1. Put the config Name in core.properties - this doesn't work. (It is
>> supported in CoreAdminHandler, but  is discouraged according to
>> documentation)
>> 2. Change recovery procedure to not delete all data from zookeeper, but
>> only relevant parts.
>> 3. Change recovery procedure to delete all, but recreate and link
>> configurations for all collections before startup.
>>
>> Option #1 is my favorite, because it is very simple, it is currently not
>> supported, but from looking on code it looked like it is not complex to
>> implement.
>>
>>
>>
>> My questions are:
>> 1. Is there something wrong in the recovery procedure that I described ?
>> 2. What is the best way to fix problems in cluster state, except from
>> editing clusterstate.json manually? Is there an automated tool for that? We
>> have about 100 collections in a cluster, so editing is not really a
>> solution.
>> 3.Is creating a collection via core.properties is also discouraged?
>>
>>
>>
>> Would very appreciate any answers/ thoughts on that.
>>
>>
>> Thanks,
>>
>>
>>
>>
>>
>>


Re: SolrCloud - Strategy for recovering cluster states

2016-03-02 Thread Jeff Wartes
Well, with the understanding that someone who isn’t involved in the process is 
describing something that isn’t built yet...

I could imagine changes like:
 - Core discovery ignores cores that aren’t present in the ZK cluster state
 - New cores are automatically created to bring a node in line with ZK cluster 
state (addreplica, essentially) 
 
So if the clusterstate said “node XYZ has a replica of shard3 of collection1 
and that’s all”, and you downed node XYZ and deleted the data directory, it’d 
get restored when you started the node again. And if you copied the core 
directory for shard1 of collection2 in there and restarted the node, it’d get 
ignored because the clusterstate says node XYZ doesn’t have that.

More importantly, if you completely destroyed a node and rebuilt it from an 
image, (AWS?) that image wouldn't need any special core directories specific to 
that node. As long as the node name was the same, Solr would handle bringing 
that node back to where it was in the cluster.

Back to opinions, I think mixing the cluster definition between local disk on 
the nodes and ZK clusterstate is just confusing. It should really be one or the 
other. Specifically, I think it should be local disk for non-SolrCloud, and ZK 
for SolrCloud.





On 3/2/16, 12:13 AM, "danny teichthal"  wrote:

>Thanks Jeff,
>I understand your philosophy and it sounds correct.
>Since we had many problems with zookeeper when switching to Solr Cloud, we
>couldn't treat it as a source of truth and had to rely on a more stable
>source.
>The issue is that when we got such a zookeeper event, it brought our
>system down, and in that case clearing the core.properties files was a
>life saver.
>We've managed to make it pretty stable now, but we will always need a
>"dooms day" weapon.
>
>I looked into the related JIRA and it confused me a little, and raised a
>few other questions:
>1. What exactly defines zookeeper as a truth?
>2. What is the role of core.properties if the state is only in zookeeper?
>
>
>
>Your tool is very interesting; I had just thought about writing such a tool
>myself.
>From the sources I understand that you represent each ZK node as a path in
>the git repository.
>So I guess that for restore purposes I will have to go in the opposite
>direction and create a node for every path entry.
>
>
>
>
>On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes  wrote:
>
>>
>> I’ve been running SolrCloud clusters in various versions for a few years
>> here, and I can only think of two or three cases that the ZK-stored cluster
>> state was broken in a way that I had to manually intervene by hand-editing
>> the contents of ZK. I think I’ve seen Solr fixes go by for those cases,
>> too. I’ve never completely wiped ZK. (Although granted, my ZK cluster has
>> been pretty stable, and my collection count is smaller than yours)
>>
>> My philosophy is that ZK is the source of cluster configuration, not the
>> collection of core.properties files on the nodes.
>> Currently, cluster state is shared between ZK and core directories. I’d
>> prefer, and I think Solr development is going this way, (SOLR-7269) that
>> all cluster state exist and be managed via ZK, and all state be removed
>> from the local disk of the cluster nodes. The fact that a node uses local
>> disk based configuration to figure out what collections/replicas it has is
>> something that should be fixed, in my opinion.
>>
>> If you’re frequently getting into bad states due to ZK issues, I’d suggest
>> you file bugs against Solr for the fact that you got into the state, and
>> then fix your ZK cluster.
>>
>> Failing that, can you just periodically back up your ZK data and restore
>> it if something breaks? I wrote a little tool to watch clusterstate.json
>> and write every version to a local git repo a few years ago. I was mostly
>> interested because I wanted to see changes that happened pretty fast, but
>> it could also serve as a backup approach. Here’s a link, although I clearly
>> haven’t touched it lately. Feel free to ask if you have issues:
>> https://github.com/randomstatistic/git_zk_monitor
>>
>>
>>
>>
>> On 3/1/16, 12:09 PM, "danny teichthal"  wrote:
>>
>> >Hi,
>> >Just summarizing my questions if the long mail is a little intimidating:
>> >1. Is there a best practice/automated tool for overcoming problems in
>> >cluster state coming from zookeeper disconnections?
>> >2. Creating a collection via core admin is discouraged, is it true also
>> for
>> >core.properties discovery?
>> >
>> >I would like to be able to specify collection.configName in the
>> >co

Re: XX:ParGCCardsPerStrideChunk

2016-03-03 Thread Jeff Wartes

I've experimented with that a bit, and Shawn added my comments in IRC to his 
Solr/GC page here: https://wiki.apache.org/solr/ShawnHeisey

The relevant bit:
"With values of 4096 and 32768, the IRC user was able to achieve 15% and 19% 
reductions in average pause time, respectively, with the maximum pause cut 
nearly in half. That option was added to the options you can find in the Solr 
5.x start script, on a 12GB heap."


This was with CMS, obviously. I haven’t repeated the experiment recently, but 
I’m using the option in production at this point.
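
For reference, it's a diagnostic JVM option, so it has to be unlocked first. A 
sketch of the relevant GC_TUNE additions in solr.in.sh:

GC_TUNE="$GC_TUNE -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=4096"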






On 3/2/16, 9:39 PM, "William Bell"  wrote:

>Has anyone tried -XX:ParGCCardsPerStrideChunk with Solr?
>
>There has been reports of improved GC times.
>
>-- 
>Bill Bell
>billnb...@gmail.com
>cell 720-256-8076


Re: Separating cores from Solr home

2016-03-03 Thread Jeff Wartes
It’s a bit backwards feeling, but I’ve had luck setting the install dir and 
solr home, instead of the data dir.

Something like:
-Dsolr.solr.home=/data/solr 
-Dsolr.install.dir=/opt/solr


So all of the Solr files are in in /opt/solr and all of the index/core related 
files end up in /data/solr.




On 3/3/16, 3:58 AM, "Tom Evans"  wrote:

>Hmm, I've worked around this by setting the directory where the
>indexes should live to be the actual solr home, and symlink the files
>from the current release in to that directory, but it feels icky.
>
>Any better ideas?
>
>Cheers
>
>Tom
>
>On Thu, Mar 3, 2016 at 11:12 AM, Tom Evans  wrote:
>> Hi all
>>
>> I'm struggling to configure solr cloud to put the index files and
>> core.properties in the correct places in SolrCloud 5.5. Let me explain
>> what I am trying to achieve:
>>
>> * solr is installed in /opt/solr
>> * the user who runs solr only has read only access to that tree
>> * the solr home files - custom libraries, log4j.properties, solr.in.sh
>> and solr.xml - live in /data/project/solr/releases/, which
>> is then the target of a symlink /data/project/solr/releases/current
>> * releasing a new version of the solr home (eg adding/changing
>> libraries, changing logging options) is done by checking out a fresh
>> copy of the solr home, switching the symlink and restarting solr
>> * the solr core.properties and any data live in /data/project/indexes,
>> so they are preserved when new solr home is released
>>
>> Setting core specific dataDir with absolute paths in solrconfig.xml
>> only gets me part of the way, as the core.properties for each shard is
>> created inside the solr home.
>>
>> This is obviously no good, as when releasing a new version of the solr
>> home, they will no longer be in the current solr home.
>>
>> Cheers
>>
>> Tom


Reset JMX counters for monitoring without restarting

2016-04-02 Thread Jeff Courtade
Hi,

I am putting together some monitors for various things.

The counters seem to be ... from the beginning of the process.

This makes many of them not so useful for long-term monitoring and alerting.

Is there a way to reset the counters without restarting Solr or reloading a
core?

For instance: these seem to be from the time the process started.

java -jar /opt/scripts/pkg/cmdline-jmxclient-0.10.3.jar - localhost:9010
solr/collection1:id=org.apache.solr.handler.component.SearchHandler,type=/select
$NAME

avgTimePerRequest.value  363.66010984870064
medianRequestTime.value  1240.416114498
75thPcRequestTime.value  1614.2324915
95thPcRequestTime.value  3048.37888109
99thPcRequestTime.value  5930.183086690001



--
Thanks,

Jeff Courtade
M: 240.507.6116


Re: Reset JMX counters for monitoring without restarting

2016-04-02 Thread Jeff Courtade
Thanks,

I was hoping there was a way without a core reload.

Do you know what is different with cloud? I need to do this in both.

Jeff Courtade
M: 240.507.6116
On Apr 2, 2016 1:37 PM, "Shawn Heisey"  wrote:

> On 4/2/2016 11:06 AM, Jeff Courtade wrote:
> > I am putting together some monitors for various things.
> >
> > The counters seem to be  ... from the beginning of the process.
> >
> > This makes many of them not so useful for long term monitoring and
> alerting.
> >
> > Is there a way to reset the counters without restarting solr or reloading
> a
> > core?
> >
> > For instance: these seem to be from the time the process started.
> >
> > java -jar /opt/scripts/pkg/cmdline-jmxclient-0.10.3.jar - localhost:9010
> >
> solr/collection1:id=org.apache.solr.handler.component.SearchHandler,type=/select
> > $NAME
> >
> > avgTimePerRequest.value  363.66010984870064
> > medianRequestTime.value  1240.416114498
> > 75thPcRequestTime.value  1614.2324915
> > 95thPcRequestTime.value  3048.37888109
> > 99thPcRequestTime.value  5930.183086690001
>
> Some of the information you find in statistics might indeed go back to
> the last Solr start.
>
> The specific stats that you have indicated are most useful over a long
> period of time, ideally for the entire time Solr has been running, but
> if you do want to reset them, you can do so with a core reload.  If
> you're running cloud (doesn't look like you are), then you would reload
> the whole collection, probably not the core.
>
> Thanks,
> Shawn
>
>
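
For reference, the reloads Shawn describes are plain HTTP calls; a sketch,
assuming a core/collection named collection1 on localhost:

# standalone: reload one core
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"

# SolrCloud: reload the whole collection
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1"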


Re: Reset JMX counters for monitoring without restarting

2016-04-02 Thread Jeff Courtade
Thanks very much.

Jeff Courtade
M: 240.507.6116
On Apr 2, 2016 3:03 PM, "Otis Gospodnetić" 
wrote:

> Hi Jeff,
>
> With info that Solr provides in JMX you have to keep track of things
> yourself, do subtractions and counting yourself.
> If you don't feel like reinventing that, see
> https://sematext.com/spm/integrations/solr-monitoring/
>
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
> On Sat, Apr 2, 2016 at 1:06 PM, Jeff Courtade 
> wrote:
>
> > Hi,
> >
> > I am putting together some monitors for various things.
> >
> > The counters seem to be  ... from the beginning of the process.
> >
> > This makes many of them not so useful for long term monitoring and
> > alerting.
> >
> > Is there a way to reset the counters without restarting solr or reloading
> a
> > core?
> >
> > For instance: these seem to be from the time the process started.
> >
> > java -jar /opt/scripts/pkg/cmdline-jmxclient-0.10.3.jar - localhost:9010
> >
> >
> solr/collection1:id=org.apache.solr.handler.component.SearchHandler,type=/select
> > $NAME
> >
> > avgTimePerRequest.value  363.66010984870064
> > medianRequestTime.value  1240.416114498
> > 75thPcRequestTime.value  1614.2324915
> > 95thPcRequestTime.value  3048.37888109
> > 99thPcRequestTime.value  5930.183086690001
> >
> >
> >
> > --
> > Thanks,
> >
> > Jeff Courtade
> > M: 240.507.6116
> >
>


Re: SolrCloud no leader for collection

2016-04-05 Thread Jeff Wartes
I recall I had some luck fixing a leader-less shard (after a ZK quorum failure) 
by forcibly removing the records for the down-state replicas from the leader 
election list, and then forcing an election. 
The ZK path looks like collections/<collection>/leader_elect/shardX/election. 
Usually you’ll find the down-state one that keeps getting elected is the first 
one. Delete that, then try the force-election collections api command again.
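
A sketch of that sequence, assuming a collection named collection1 and shard1
(the znode to delete is whatever the first/oldest entry turns out to be, and
FORCELEADER requires Solr 5.4+):

zkCli.sh -server zk1:2181
  ls /collections/collection1/leader_elect/shard1/election
  delete /collections/collection1/leader_elect/shard1/election/<oldest entry>

curl "http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=collection1&shard=shard1"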



On 4/5/16, 3:15 AM, "Tom Evans"  wrote:

>Hi all, I have an 8 node SolrCloud 5.5 cluster with 11 collections,
>most of them in a 1 shard x 8 replicas configuration. We have 5 ZK
>nodes.
>
>During the night, we attempted to reindex one of the larger
>collections. We reindex by pushing json docs to the update handler
>from a number of processes. It seemed this overwhelmed the servers,
>and caused all of the collections to fail and end up in either a down
>or a recovering state, often with no leader.
>
>Restarting and rebooting the servers brought a lot of the collections
>back online, but we are left with a few collections for which all the
>nodes hosting those replicas are up, but the replica reports as either
>"active" or "down", and with no leader.
>
>Trying to force a leader election has no effect, it keeps choosing a
>leader that is in "down" state. Removing all the nodes that are in
>"down" state and forcing a leader election also has no effect.
>
>
>Any ideas? The only viable option I see is to create a new collection,
>index it and then remove the old collection and alias it in.
>
>Cheers
>
>Tom


Re: SolrCloud backup/restore

2016-04-05 Thread Jeff Wartes

There is some automation around this process in the backup commands here:
https://github.com/whitepages/solrcloud_manager


It’s been tested with 5.4, and will restore arbitrary replication factors, 
always assuming the shared filesystem for backups, of course.
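
The underlying per-shard commands are the replication handler's backup and
restore (restore being the SOLR-6637 work mentioned below); a sketch against
one shard leader, with host, core names, and location as assumptions:

curl "http://host:8983/solr/collection1_shard1_replica1/replication?command=backup&location=/mnt/backups&name=shard1"

# later, on the corresponding shard leader of the new collection:
curl "http://newhost:8983/solr/restored_shard1_replica1/replication?command=restore&location=/mnt/backups&name=shard1"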



On 4/5/16, 3:18 AM, "Reth RM"  wrote:

>Yes. It should be backing up each shard leader of the collection. For each
>collection, for each shard, find the leader and request a backup command on
>that. Then restore this onto the new collection, into its respective shard, and
>go on adding new replicas, which will duly pull the index from the newly added
>shards.
>
>
>On Mon, Apr 4, 2016 at 10:32 PM, Zisis Tachtsidis 
>wrote:
>
>> I've tested backup/restore successfully in a SolrCloud installation with a
>> single node (no replicas). This has been achieved in
>> https://issues.apache.org/jira/browse/SOLR-6637
>> Can you do something similar when more replicas are involved? What I'm
>> looking for is a restore command that will restore index in all replicas of
>> a collection.
>> Judging from the code in /ReplicationHandler.java/ and
>> https://issues.apache.org/jira/browse/SOLR-5750 I assume that more work
>> needs to be done to achieve this.
>>
>> Is my understanding correct? If the situation is like this I guess an
>> alternative would be to just create a new collection, restore index and
>> then
>> add replicas. (I'm using Solr 5.5.0)
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/SolrCloud-backup-restore-tp4267954.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>


DIH with Nested Documents - Configuration Issue

2016-04-14 Thread Jeff Chastain
I am working on a project where the specification requires a parent - child 
relationship within the Solr data collection ... i.e. a user and the collection 
of languages they speak (each of which is made up of multiple data fields).  My 
production system is a 4.10 Solr implementation but I have a 5.5 implementation 
at my disposal as well.  Thus far, I am not getting this to work on either one 
and I have yet to find a complete documentation source on how to implement this.

The goal is to get a resulting document from Solr that looks like this:

   {
   "id": 123,
   "firstName": "John",
   "lastName": "Doe",
   "languagesSpoken": [
  {
 "id": 243,
 "abbreviation": "en",
 "name": "English"
  },
  {
 "id": 442,
 "abbreviation": "fr",
 "name": "French"
  }
   ]
}

In my schema.xml, I have flattened out all of the fields as follows:

   <field name="id" ... />
   <field name="firstName" ... />
   <field name="lastName" ... />
   <field name="languagesSpoken_id" ... />
   <field name="languagesSpoken_abbreviation" ... />
   <field name="languagesSpoken_name" ... />

The latest rendition of my db-data-config.xml looks like this:

<dataConfig>
   <dataSource ... />
   <document>
      <entity name="user" query="...">
         <field column="id" name="id" />
         <field column="firstName" name="firstName" />
         <field column="lastName" name="lastName" />
         <entity name="languagesSpoken" child="true" query="...">
            <field column="id" name="languagesSpoken_id" />
            <field column="abbreviation" name="languagesSpoken_abbreviation" />
            <field column="name" name="languagesSpoken_name" />
         </entity>
      </entity>
   </document>
   ...

On the 4.10 server, when the data comes out of Solr, I get one flat document 
record with the fields for one language inline with the firstName and lastName 
like this:

   {
   "id": 123,
   "firstName": "John",
   "lastName": "Doe",
   "languagesSpoken_id": 243,
   "languagesSpoken_abbreviation ": "en",
   "languagesSpoken_name": "English"
}

On the 5.5 server, when the data comes out, I get separate documents for the 
root client document and the child language documents with no relationship 
between them like this:

   {
   "id": 123,
   "firstName": "John",
   "lastName": "Doe"
},
{
   "languagesSpoken_id": 243,
   "languagesSpoken_abbreviation": "en",
   "languagesSpoken_name": "English"
},
{
   "languagesSpoken_id": 442,
   "languagesSpoken_abbreviation": "fr",
   "languagesSpoken_name": "French"
}

I have spent several days now trying to figure out what is going on here to no 
avail.  Can anybody provide me with a pointer as to what I am missing here?

Thanks,
-- Jeff



Re: HTTP Client Only

2016-04-14 Thread Jeff Wartes


If you’re already using java, just use the CloudSolrClient. 
If you’re using the default router, (CompositeId) it’ll figure out the leaders 
and send documents to the right place for you.

If you’re not using java, then I’d still look there for hints on how to 
duplicate the functionality.
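
For the status-based approach described below, CLUSTERSTATUS already marks the
leaders; a sketch, with the collection name and the jq filter as assumptions:

curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycoll&wt=json" \
  | jq '.cluster.collections.mycoll.shards[].replicas[] | select(.leader=="true") | .base_url'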



On 4/14/16, 1:27 PM, "Robert Brown"  wrote:

>Hi,
>
>I have a collection with 2 shards, 1 replica each.
>
>When I send updates, I currently /admin/ping each of the nodes, and then 
>pick one at random.
>
>I'm guessing it makes more sense to only send updates to one of the 
>leaders, so I'm contemplating getting the collection status instead, and 
>filter out the leaders.
>
>Is there anything else I should be aware of, apart from using a Java 
>client, etc.
>
>I guess the ping becomes redundant?
>
>Thanks,
>Rob
>
>
>


Re: Adding replica on solr - 5.50

2016-04-14 Thread Jeff Wartes
I’m all for finding another way to make something work, but I feel like this is 
the wrong advice. 

There are two options:
1) You are doing something wrong. In which case, you should probably invest in 
figuring out what.
2) Solr is doing something wrong. In which case, you should probably invest in 
figuring out what, and then file a bug so it doesn’t happen to anyone else.

Adding a replica is a pretty basic operation, so whichever option is the case, 
I feel like you’ll just encounter other problems down the road if you don’t 
figure out what’s going on.

I’d probably start by creating the single-replica collection, and then 
inspecting the live_nodes list in Zookeeper to confirm that the (live) node 
list is actually what you think it is.
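
A sketch of that check (the ZK address is an assumption):

zkCli.sh -server zk1:2181 ls /live_nodes

# entries look like 10.0.0.1:8983_solr; the "node" parameter passed to
# ADDREPLICA has to match one of these strings exactly.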





On 4/14/16, 4:04 PM, "John Bickerstaff"  wrote:

>5.4
>
>This problem drove me insane for about a month...
>
>I'll send you the doc.
>
>On Thu, Apr 14, 2016 at 5:02 PM, Jay Potharaju 
>wrote:
>
>> Thanks John, which version of solr are you using?
>>
>> On Thu, Apr 14, 2016 at 3:59 PM, John Bickerstaff <
>> j...@johnbickerstaff.com>
>> wrote:
>>
>> > su - solr -c "/opt/solr/bin/solr create -c statdx -d /home/john/conf
>> > -shards 1 -replicationFactor 2"
>> >
>> > However, this won't work by itself.  There is some preparation
>> > necessary...  I'll send you the doc.
>> >
>> > On Thu, Apr 14, 2016 at 4:55 PM, Jay Potharaju 
>> > wrote:
>> >
>> > > Curious what command did you use?
>> > >
>> > > On Thu, Apr 14, 2016 at 3:48 PM, John Bickerstaff <
>> > > j...@johnbickerstaff.com>
>> > > wrote:
>> > >
>> > > > I had a hard time getting replicas made via the API, once I had
>> created
>> > > the
>> > > > collection for the first time although that may have been
>> ignorance
>> > > on
>> > > > my part.
>> > > >
>> > > > I was able to get it done fairly easily on the Linux command line.
>> If
>> > > > that's an option and you're interested, let me know - I have a rough
>> > but
>> > > > accurate document. But perhaps others on the list will have the
>> > specific
>> > > > answer you're looking for.
>> > > >
>> > > > On Thu, Apr 14, 2016 at 4:19 PM, Jay Potharaju <
>> jspothar...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi,
>> > > > > I am using solr 5.5 and testing adding a new replica when a solr
>> > > instance
>> > > > > comes up. When I run the following command I get an error. I have 1
>> > > > replica
>> > > > > and trying to add another replica.
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> http://x.x.x.x:8984/solr/admin/collections?action=ADDREPLICA&collection=test2&shard=shard1&node=x.x.x.x:9001_solr
>> > > > >
>> > > > > Error:
>> > > > > org.apache.solr.common.SolrException: At least one of the node(s)
>> > > > > specified are not currently active, no action taken.
>> > > > > (status 400, error-class org.apache.solr.common.SolrException,
>> > > > > root-error-class org.apache.solr.common.SolrException)
>> > > > >
>> > > > >
>> > > > > But when i create a new collection with 2 replicas it works fine.
>> > > > > As a side note my clusterstate.json is not updating correctly. Not
>> > sure
>> > > > if
>> > > > > that is causing an issue.
>> > > > >
>> > > > >  Any suggestions why the Addreplica command is not working. And is
>> it
>> > > > > related to the clusterstate.json? If yes, how can i fix it?
>> > > > >
>> > > > > --
>> > > > > Thanks
>> > > > > Jay
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Thanks
>> > > Jay Potharaju
>> > >
>> >
>>
>>
>>
>> --
>> Thanks
>> Jay Potharaju
>>


Re: Indexing 700 docs per second

2016-04-19 Thread Jeff Wartes

I have no numbers to back this up, but I’d expect Atomic Updates to be slightly 
slower than a full update, since the atomic approach has to retrieve the fields 
you didn't specify before it can write the new (updated) document.
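
For what it's worth, the two shapes look like this on the wire (collection and
field names are made up; batch many docs per request either way):

# full update: resend the whole document
curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycoll/update" \
  -d '[{"id":"1","price_i":42,"title_s":"widget"}]'

# atomic update: send only the changed field, but Solr must fetch the rest
curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycoll/update" \
  -d '[{"id":"1","price_i":{"set":43}}]'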




On 4/19/16, 11:54 AM, "Tim Robertson"  wrote:

>Hi Mark,
>
>We were putting in and updating docs of around 20-25 indexed fields (mainly
>INTs, but some Strings and multivalue fields) at >1000/sec on far lesser
>hardware and a total of 600 million docs (batch updates of course) while
>also serving live queries for a website which had about 30 concurrent users
>steady state (not all hitting SOLR though).
>
>It seems realistic with that kind of hardware in my experience, but you
>didn't mention what else was going on that might affect it (e.g. reads).
>
>HTH,
>Tim
>
>
>On Tue, Apr 19, 2016 at 7:12 PM, Erick Erickson 
>wrote:
>
>> Make very sure you batch updates though.
>> Here's a benchmark I ran:
>> https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>>
>> NOTE: it's not entirely clear that you want to
>> put 122M docs on a single shard. Depending on the queries
>> you'll run you may want 2 or more shards, but that depends
>> on the query pattern and your SLAs. Here's the long version
>> of "you really have to load test this":
>>
>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> Best,
>> Erick
>>
>> On Tue, Apr 19, 2016 at 6:48 AM, Susheel Kumar 
>> wrote:
>> >  It sounds achievable with your machine configuration and i would suggest
>> > to try out atomic update.  Use SolrJ with multi-threaded indexing for
>> > higher indexing rate.
>> >
>> > Thanks,
>> > Susheel
>> >
>> >
>> >
>> > On Tue, Apr 19, 2016 at 9:27 AM, Tom Evans 
>> wrote:
>> >
>> >> On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson <
>> mark123lea...@gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I have a requirement to index (mainly updation) 700 docs per second.
>> >> > Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around
>> 260
>> >> > byes (6 fields out of which only 2 will undergo updation at the above
>> >> > rate). This collection has around 122Million docs and that count is
>> >> pretty
>> >> > much a constant.
>> >> >
>> >> > 1. Can I manage this updation rate with a non-sharded ie single Solr
>> >> > instance set up?
>> >> > 2. Also is atomic update or a full update (the whole doc) of the
>> changed
>> >> > records the better approach in this case.
>> >> >
>> >> > Could some one please share their views/ experience?
>> >>
>> >> Try it and see - everyone's data/schemas are different and can affect
>> >> indexing speed. It certainly sounds achievable enough - presumably you
>> >> can at least produce the documents at that rate?
>> >>
>> >> Cheers
>> >>
>> >> Tom
>> >>
>>


Re: Replicas for same shard not in sync

2016-04-26 Thread Jeff Wartes

At the risk of thread hijacking, this is an area where I don’t know I fully 
understand, so I want to make sure.

I understand the case where a node is marked “down” in the clusterstate, but 
what if it’s down for less than the ZK heartbeat? That’s not unreasonable, I’ve 
seen some recommendations for really high ZK timeouts. Let’s assume there’s 
some big GC pause, or some other ephemeral service interruption that recovers 
very quickly.

So,
1. leader gets an update request
2. leader makes update requests to all live nodes
3. leader gets success responses from all but one replica
4. leader gets failure response from one replica

At this point we have different replicas with different data sets. Does 
anything signal that the failure-response node has now diverged? Does the 
leader attempt to roll back the other replicas? I’ve seen references to 
leader-initiated-recovery, is this that?

And regardless, is the update request considered a success (and reported as 
such to the client) by the leader?



On 4/25/16, 12:14 PM, "Erick Erickson"  wrote:

>Ted:
>Yes, deleting and re-adding the replica will be fine.
>
>Having commits happen from the client when you _also_ have
>autocommits that frequently (10 seconds and 1 second are pretty
>aggressive BTW) is usually not recommended or necessary.
>
>David:
>
>bq: if one or more replicas are down, updates presented to the leader
>still succeed, right?  If so, tedsolr is correct that the Solr client
>app needs to re-issue update
>
>Absolutely not the case. When the replicas are down, they're marked as
>down by Zookeeper. When then come back up they find the leader through
>Zookeeper magic and ask, essentially "Did I miss any updates"? If the
>replica did miss any updates it gets them from the leader either
>through the leader replaying the updates from its transaction log to
>the replica or by replicating the entire index from the leader. Which
>path is followed is a function of how far behind the replica is.
>
>In this latter case, any updates that come in to the leader while the
>replication is happening are buffered and replayed on top of the index
>when the full replication finishes.
>
>The net-net here is that you should not have to track whether updates
>got to all the replicas or not. One of the major advantages of
>SolrCloud is to remove that worry from the indexing client...
>
>Best,
>Erick
>
>On Mon, Apr 25, 2016 at 11:39 AM, David Smith
> wrote:
>> Erick,
>>
>> So that my understanding is correct, let me ask, if one or more replicas are 
>> down, updates presented to the leader still succeed, right?  If so, tedsolr 
>> is correct that the Solr client app needs to re-issue updates, if it wants 
>> stronger guarantees on replica consistency than what Solr provides.
>>
>> The “Write Fault Tolerance” section of the Solr Wiki makes what I believe is 
>> the same point:
>>
>> "On the client side, if the achieved replication factor is less than the 
>> acceptable level, then the client application can take additional measures 
>> to handle the degraded state. For instance, a client application may want to 
>> keep a log of which update requests were sent while the state of the 
>> collection was degraded and then resend the updates once the problem has 
>> been resolved."
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>>
>>
>> Kind Regards,
>>
>> David
>>
>>
>>
>>
>> On 4/25/16, 11:57 AM, "Erick Erickson"  wrote:
>>
>>>bq: I also read that it's up to the
>>>client to keep track of updates in case commits don't happen on all the
>>>replicas.
>>>
>>>This is not true. Or if it is it's a bug.
>>>
>>>The update cycle is this:
>>>1> updates get to the leader
>>>2> updates are sent to all followers and indexed on the leader as well
>>>3> each replica writes the updates to the local transaction log
>>>4> all the replicas ack back to the leader
>>>5> the leader responds to the client.
>>>
>>>At this point, all the replicas for the shard have the docs locally
>>>and can take over as leader.
>>>
>>>You may be confusing indexing in batches and having errors with
>>>updates getting to replicas. When you send a batch of docs to Solr,
>>>if one of them fails indexing some of the rest of the docs may not
>>>be indexed. See SOLR-445 for some work on this front.
>>>
>>>That said, bouncing servers willy-nilly during heavy indexing, especially
>>>if the indexer doesn't know enough to retry if an indexing attempt fails may
>>>be the root cause here. Have you verified that your indexing program
>>>retries in the event of failure?
>>>
>>>Best,
>>>Erick
>>>
>>>On Mon, Apr 25, 2016 at 6:13 AM, tedsolr  wrote:
 I've done a bit of reading - found some other posts with similar questions.
 So I gather "Optimizing" a collection is rarely a good idea. It does not
 need to be condensed to a single segment. I also read that it's up to the
 client to keep track of updates in case commits don't happen on all the
 replicas. S

Re: Replicas for same shard not in sync

2016-04-27 Thread Jeff Wartes
I didn’t leave it out, I was asking what it was. I’ve been reading around some 
more this morning though, and here’s what I’ve come up with, feel free to 
correct.

Continuing my scenario:

If you did NOT specify min_rf
5. leader sets leader_initiated_recovery in ZK for the replica with the 
failure. Hopefully that replica notices and re-syncs at some point, because it 
can’t become a leader until it does. (SOLR-5495, SOLR-8034)
6. leader returns success to the client (http://bit.ly/1UhB2cF)

If you specified a min_rf and it WAS achieved:

5. leader sets leader_initiated_recovery in ZK for the replica with the failure.

6. leader returns success (and the achieved rf) to the client (SOLR-5468, 
SOLR-8062)


If you specified a min_rf and it WASN'T achieved:
5. leader does NOT set leader_initiated_recovery (SOLR-8034)
6. leader returns success (and the achieved rf) to the client (SOLR-5468, 
SOLR-8062)

I couldn’t seem to find anyplace that’d cause an error return to the client, 
aside from race conditions around who the leader should be, or if the update 
couldn’t be applied to the leader itself.
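
For completeness, min_rf is just a request parameter, and the achieved rf comes
back in the response header; a sketch (collection and doc are made up):

curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/mycoll/update?min_rf=2" \
  -d '[{"id":"1","title_s":"example"}]'

# the response header includes the achieved "rf"; it is up to the client to
# compare that against min_rf and re-send if it fell short.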






On 4/26/16, 8:22 PM, "Erick Erickson"  wrote:

>You left out step 5... leader responds with fail for the update to the
>client. At this point, the client is in charge of retrying the docs.
>Retrying will update all the docs that were successfully indexed in
>the failed packet, but that's not unusual.
>
>There's no real rollback semantics that I know of. This is analogous
>to not hitting minRF, see:
>https://support.lucidworks.com/hc/en-us/articles/212834227-How-does-indexing-work-in-SolrCloud.
>In particular the bit about "it is the client's responsibility to
>re-send it"...
>
>There's some retry logic in the code that distributes the updates from
>the leader as well.
>
>Best,
>Erick
>
>On Tue, Apr 26, 2016 at 12:51 PM, Jeff Wartes  wrote:
>>
>> At the risk of thread hijacking, this is an area where I don’t know I fully 
>> understand, so I want to make sure.
>>
>> I understand the case where a node is marked “down” in the clusterstate, but 
>> what if it’s down for less than the ZK heartbeat? That’s not unreasonable, 
>> I’ve seen some recommendations for really high ZK timeouts. Let’s assume 
>> there’s some big GC pause, or some other ephemeral service interruption that 
>> recovers very quickly.
>>
>> So,
>> 1. leader gets an update request
>> 2. leader makes update requests to all live nodes
>> 3. leader gets success responses from all but one replica
>> 4. leader gets failure response from one replica
>>
>> At this point we have different replicas with different data sets. Does 
>> anything signal that the failure-response node has now diverged? Does the 
>> leader attempt to roll back the other replicas? I’ve seen references to 
>> leader-initiated-recovery, is this that?
>>
>> And regardless, is the update request considered a success (and reported as 
>> such to the client) by the leader?
>>
>>
>>
>> On 4/25/16, 12:14 PM, "Erick Erickson"  wrote:
>>
>>>Ted:
>>>Yes, deleting and re-adding the replica will be fine.
>>>
>>>Having commits happen from the client when you _also_ have
>>>autocommits that frequently (10 seconds and 1 second are pretty
>>>aggressive BTW) is usually not recommended or necessary.
>>>
>>>David:
>>>
>>>bq: if one or more replicas are down, updates presented to the leader
>>>still succeed, right?  If so, tedsolr is correct that the Solr client
>>>app needs to re-issue update
>>>
>>>Absolutely not the case. When the replicas are down, they're marked as
>>>down by Zookeeper. When then come back up they find the leader through
>>>Zookeeper magic and ask, essentially "Did I miss any updates"? If the
>>>replica did miss any updates it gets them from the leader either
>>>through the leader replaying the updates from its transaction log to
>>>the replica or by replicating the entire index from the leader. Which
>>>path is followed is a function of how far behind the replica is.
>>>
>>>In this latter case, any updates that come in to the leader while the
>>>replication is happening are buffered and replayed on top of the index
>>>when the full replication finishes.
>>>
>>>The net-net here is that you should not have to track whether updates
>>>got to all the replicas or not. One of the major advantages of
>>>SolrCloud is to remove that worry from the indexing client...
>>>
>>>Best,
>>>Erick
>>>
>>>O

Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Jeff Wartes

Shawn Heisey’s page is the usual reference guide for GC settings: 
https://wiki.apache.org/solr/ShawnHeisey
Most of the learnings from that are in the Solr 5.x startup scripts already, 
but your heap is bigger, so your mileage may vary.

Some tools I’ve used while doing GC tuning:

* VisualVM - Comes with the JDK. It has a Visual GC plug-in that’s pretty nice 
for visualizing what’s going on in realtime, but you need to connect it via 
jstatd for that to work (see the jstatd sketch after this list).
* GCViewer - Visualizes a GC log. The UI leaves a lot to be desired, but it’s 
the best tool I’ve found for this purpose. Use this fork for jdk 6+ - 
https://github.com/chewiebug/GCViewer
* Swiss Java Knife has a bunch of useful features - 
https://github.com/aragozin/jvm-tools
* YourKit - I’ve been using this lately to analyze where garbage comes from. 
It’s not free though. 
* Eclipse Memory Analyzer - I used this to analyze heap dumps before I got a 
YourKit license: http://www.eclipse.org/mat/
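
The jstatd recipe mentioned above, since it refuses to start without a security
policy (the blanket grant is the usual quick-start; tighten it as needed):

cat > /tmp/jstatd.policy <<'EOF'
grant codebase "file:${java.home}/../lib/tools.jar" {
   permission java.security.AllPermission;
};
EOF
jstatd -J-Djava.security.policy=/tmp/jstatd.policy &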

Good luck!






On 4/28/16, 9:27 AM, "Yonik Seeley"  wrote:

>On Thu, Apr 28, 2016 at 12:21 PM, Nick Vasilyev
> wrote:
>> Hi Yonik,
>>
>> There are a lot of logistics involved with re-indexing and naturally
>> upgrading Solr. I was hoping that there is an easier alternative since this
>> is only a single back end script that is having problems.
>>
>> Is there any room for improvement with tweaking GC params?
>
>There always is ;-)  But I'm not a GC tuning expert.  I prefer to
>attack memory problems more head-on (i.e. with code to use less
>memory).
>
>-Yonik


Re: Passing Ids in query takes more time

2016-05-05 Thread Jeff Wartes

An ID lookup is a very simple and fast query, for one ID. Or’ing a lookup for 
80k ids though is basically 80k searches as far as Solr is concerned, so it’s 
not altogether surprising that it takes a while. Your complaint seems to be 
that the query planner doesn’t know in advance that <your other criteria> should be 
run first, and then the id selection applied to the reduced set. 

So, I can think of a few things for you to look at, in no particular order:

1. TermsQueryParser is designed for lists of terms; you might get better 
results from that (see the sketch after this list): 
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser

2. If your <other criteria> is the real discriminating factor in your search, 
you could just search for <other criteria> and then apply your ID list as a 
PostFilter: http://yonik.com/advanced-filter-caching-in-solr/
I guess that’d look something like &fq={!terms f=doc_id cache=false cost=101 
v="<your ids>"}, since a cost >= 100 should qualify it as a post filter, which 
only operates on an already-found result set instead of the full index. (Note: 
I haven’t confirmed that the Terms query parser supports post filtering.)

3. I’m not really aware of any storage engine that’ll love doing a filter on 
80k ids at once, but a key-value store like Cassandra might work out better for 
that.

4. There is a thing called a JoinQParserPlugin 
(https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-JoinQueryParser)
 that can join to another collection 
(https://issues.apache.org/jira/browse/SOLR-4905). But I’ve never used it, and 
there are some significant restrictions.
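
A sketch of options 1 and 2 combined (the field name doc_id is from your
example; the q clause and the ids are placeholders, and the post-filter caveat
from option 2 still applies):

curl "http://localhost:8983/solr/mycoll/select" \
  --data-urlencode 'q=type_s:book' \
  --data-urlencode 'fq={!terms f=doc_id cache=false cost=101}111,222,333,444'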




On 5/5/16, 2:46 AM, "Bhaumik Joshi"  wrote:

>Hi,
>
>
>I am retrieving ids from collection1 based on some query and passing those ids 
>as a query to collection2, so the query to collection2 which contains ids in it 
>takes much more time compared to a normal query.
>
>
>Que. 1 - While passing ids in a query, why does it take more time compared to a 
>normal query, even though we are narrowing the criteria by passing ids?
>
>e.g.  query-1: doc_id:(111 222 333 444 ...) AND <criteria> is slower 
>(passing 80k ids takes 7-9 sec) than query-2: <criteria> only (700-800 
>ms). Both return 250 records with the same set of fields.
>
>
>Que. 2 - Any idea on how I can achieve the above (get ids from one collection and 
>pass those ids to the other one) in an efficient manner, or any other way to get 
>data from one collection based on the response of the other collection?
>
>
>Thanks & Regards,
>
>Bhaumik Joshi


SOLR cloud node has a 4k index directory

2015-08-17 Thread Jeff Courtade
Hi,

I have SOLR cloud running on SOLR 4.7.2
2 shards one replica each.

The size of the index directories is odd on ps03


ps01 shard1 leader

41G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150815024352580


ps03 shard 1 replica

59G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20140906125148419
18G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150129181248396
24G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150511233039042
24G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150527121503268
41G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150806034052366
4.0K
 /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150814152030017


ps02 shard 2 leader

31G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150527161148429
39G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150815151640598


ps04 shard 2 replica

61G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20140820212651780
39G
/opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150815170546642


what can i do to remedy this?

--
Thanks,

Jeff Courtade
M: 240.507.6116


Re: SOLR cloud node has a 4k index directory

2015-08-17 Thread Jeff Courtade
Hi,

So it turns out that the index directory name has nothing to do with what index
is actually in use.
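
A quick way to check which directory is live, since replication records it in
index.properties (the timestamp below is illustrative):

cat /opt/solr/solr-4.7.2/example/solr/collection1/data/index.properties
index=index.20150815024352580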

I found that we had mismatched version numbers on our shards so this is
what we had to do to fix that.


In production today we discovered that our shard replicas were on different
version numbers.
This means that the shards had some differences in data between leader and
replica.

we have two shards

shard1 ps01 ps03
shard2 ps02 ps04

Checking this URL showed different version numbers on a given shard. Both
leader and replica in a shard should have the same version number.


http://ps0X:8983/solr/#/~cores/collection1

shard1 ps01 7752045 ps03 7752095

shard2 ps02 7792045 ps04 7790323

So to fix this we did the following.

Stop ingestion/aspire; no updates should be made while you are doing
this.


For each shard, stop the server with the lowest version number.
In this case it is ps01 for shard1 and ps04 for shard2.

So stop Solr on ps01.
ps03 will become leader, if it is not already, per the cloud console:

http://ps0X:8983/solr/#/~cloud

Then move or remove everything in this directory. It should be empty.


/opt/solr/solr-4.7.2/example/solr/collection1/data/

Restart Solr on ps01.

Watch that data directory; it should get a few files and an
index.201508X directory where the index is downloaded from the leader.

du -sh should show that growing.

In the cloud console ps01 will show as recovering while this is going on,
until it is complete. Once it is done it will go green in the cloud console.

Once it is green, check the version numbers on ps01 and ps03; they should be
the same now.

Repeat this for shard2 and you are done.


--
Thanks,

Jeff Courtade
M: 240.507.6116

On Mon, Aug 17, 2015 at 10:57 AM, Jeff Courtade 
wrote:

> Hi,
>
> I have SOLR cloud running on SOLR 4.7.2
> 2 shards one replica each.
>
> The size of the index directories is odd on ps03
>
>
> ps01 shard1 leader
>
> 41G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150815024352580
>
>
> ps03 shard 1 replica
>
> 59G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20140906125148419
> 18G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150129181248396
> 24G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150511233039042
> 24G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150527121503268
> 41G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150806034052366
> 4.0K
>  /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150814152030017
>
>
> ps02 shard 2 leader
>
> 31G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150527161148429
> 39G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150815151640598
>
>
> ps04 shard 2 replica
>
> 61G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20140820212651780
> 39G
> /opt/solr/solr-4.7.2/example/solr/collection1/data/index.20150815170546642
>
>
> what can i do to remedy this?
>
> --
> Thanks,
>
> Jeff Courtade
> M: 240.507.6116
>


Solr leader and replica version mismatch 4.7.2

2015-08-19 Thread Jeff Courtade
We are running SOLR 4.7.2
SolrCloud with 2 shards
one Leader and one replica per shard.

the "Version" of the replica and leader differ displayed here as...

curl http://ps01:8983/solr/admin/cores?action=STATUS |sed 's/>\n7753045


However the commitTimeMSec, lastModified and sizeInBytes match on leader
and replica:

curl "http://ps01:8983/solr/admin/cores?action=STATUS" | sed 's/></>\n</g'

commitTimeMSec: 1439974815928
lastModified:   2015-08-19T09:00:15.928Z
sizeInBytes:    43691759309
size:           40.69 GB



If that number, date, and size match on the leader and the replicas, I believe
we are in sync.

Can anyone verify this?



--
Thanks,

Jeff Courtade
M: 240.507.6116


Re: Solr leader and replica version mismatch 4.7.2

2015-08-19 Thread Jeff Courtade
What I am trying to determine is a way to validate, for instance if a leader
dies (as in completely unrecoverable), that the data on the replica is an
exact match to what the leader had.

I need to be able to monitor it and have confidence that it is working as
expected.

I had assumed the version number is what I was interested in.

Should the version number then be expected to differ in SolrCloud, as it is
deprecated?


--
Thanks,

Jeff Courtade
M: 240.507.6116

On Wed, Aug 19, 2015 at 10:08 AM, Shawn Heisey  wrote:

> On 8/19/2015 7:52 AM, Jeff Courtade wrote:
> > We are running SOLR 4.7.2
> > SolrCloud with 2 shards
> > one Leader and one replica per shard.
> >
> > the "Version" of the replica and leader differ displayed here as...
> >
> > curl http://ps01:8983/solr/admin/cores?action=STATUS |sed 's/>\n >
> > 7753045
> >
> >
> > However the commitTimeMSec, lastModified and sizeInBytes match on Leader
> > and replica
>
> SolrCloud works very differently than the old master-slave replication.
>  The index is NOT copied from the leader to the other replicas, except
> in extreme recovery circumstances.
>
> Each replica builds its own copy of the index independently from the
> others.  Due to slight timing differences in the indexing operations,
> and possible actions related to transaction log replay on node restart,
> each replica may end up with a different index layout.  There also could
> be differences in the number of deleted documents.  Unless something
> goes really wrong, all replicas should contain the same live documents.
>
> Thanks,
> Shawn
>
>


splitting shards on 4.7.2 with custom plugins

2015-08-25 Thread Jeff Courtade
I am getting failures when trying to split shards on Solr 4.7.2 with
custom plugins.

It fails regularly; it cannot find the jar files for the plugins when creating
the new cores/shards.

Ideas?

--
Thanks,

Jeff Courtade
M: 240.507.6116


Re: splitting shards on 4.7.2 with custom plugins

2015-08-26 Thread Jeff Courtade
Hi,


So I got the shards to split, but they are very unbalanced.


 7204922 total docs on the original collection

shard1_0 numdocs 3661699

shard1_1 numdocs 3543132

shard2_0 numdocs 0

shard2_1 numdocs 0

Any ideas?

This is what I had to do to get this to split with the custom libs.

I got shard1 to split successfully and it created replicas on the other
servers in the cloud for the new shard/shards.


This is the jist of it.


When you split a shard, Solr creates 2 new cores.

When creating a core, it uses the solr/solr.xml settings for classpath,
etc.

This is why searches etc. work fine and can find the opa plugins, but when we
called SPLITSHARD it could not.


I had to move the custom jars outside of the collection directory and add
this to solr/solr.xml on the 4 nodes.


info here  https://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond






<str name="sharedLib">${sharedLib:../lib}</str>


When you restart, you can see it in the log loading the jars from the new
location.



INFO  - 2015-08-25 23:40:52.297; org.apache.solr.core.CoreContainer;
loading shared library: /opt/solr/solr-4.7.2/solr01/solr/../lib

INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/commons-pool-1.6.jar' to
classloader

INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/query-processing-language-0.2-SNAPSHOT.jar'
to classloader

INFO  - 2015-08-25 23:40:52.299; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-continuation-8.1.10.v20130312.jar'
to classloader

INFO  - 2015-08-25 23:40:52.301; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/groovy-all-2.0.4.jar' to
classloader

INFO  - 2015-08-25 23:40:52.302; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/qpl-solr472-0.2-SNAPSHOT.jar'
to classloader

INFO  - 2015-08-25 23:40:52.302; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-jmx-8.1.10.v20130312.jar' to
classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-deploy-8.1.10.v20130312.jar' to
classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/ext/' to classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-xml-8.1.10.v20130312.jar' to
classloader

So I then ran the split and checked on it in the morning:

http://dj01.aws.narasearch.us:8981/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1


it succeeded and created replicas.

ls /opt/solr/solr-4.7.2/solr0*/solr/

/opt/solr/solr-4.7.2/solr01/solr/:
bin  collection1_shard1_0_replica1  README.txt  zoo.cfg
collection1  collection1_shard1_1_replica1  solr.xml

/opt/solr/solr-4.7.2/solr02/solr/:
bin  collection1  README.txt  solr.xml  zoo.cfg

/opt/solr/solr-4.7.2/solr03/solr/:
bin  collection1  collection1_shard1_0_replica2  README.txt  solr.xml
 zoo.cfg

/opt/solr/solr-4.7.2/solr04/solr/:
bin  collection1  collection1_shard1_1_replica2  README.txt  solr.xml
 zoo.cfg


and it actually distributed it

[root@dj01 solr]# du -sh *
4.0Kbin
41G collection1
18G collection1_shard1_0_replica1
16G collection1_shard1_1_replica1
4.0KREADME.txt
4.0Ksolr.xml
4.0Kzoo.cfg
[root@dj01 solr]# du -sh
/opt/solr/solr-4.7.2/solr04/solr/collection1_shard1_1_replica2
16G /opt/solr/solr-4.7.2/solr04/solr/collection1_shard1_1_replica2
[root@dj01 solr]# du -sh
/opt/solr/solr-4.7.2/solr03/solr/collection1_shard1_0_replica2
18G /opt/solr/solr-4.7.2/solr03/solr/collection1_shard1_0_replica2


Jeff Courtade
M: 240.507.6116
On Aug 25, 2015 11:09 PM, "Anshum Gupta"  wrote:

> Can you elaborate a bit more on the setup, what do the custom plugins do,
> what error do you get ? It seems like a classloader/classpath issue to me
> which doesn't really relate to Shard splitting.
>
>
> On Tue, Aug 25, 2015 at 7:59 PM, Jeff Courtade 
> wrote:
>
> > I am getting failures when trying to split shards on Solr 4.7.2 with
> > custom plugins.
> >
> > It fails regularly; it cannot find the jar files for the plugins when
> > creating the new cores/shards.
> >
> > Ideas?
> >
> > --
> > Thanks,
> >
> > Jeff Courtade
> > M: 240.507.6116
> >
>
>
>
> --
> Anshum Gupta
>


Re: splitting shards on 4.7.2 with custom plugins

2015-08-26 Thread Jeff Courtade
I'm looking at the clusterstate.json to see why it is doing this; I really
don't understand it, though...

{"collection1":{
"shards":{
  "shard1":{
"range":"8000-",
"state":"active",
"replicas":{
  "core_node1":{
"state":"active",
"base_url":"http://10.135.2.153:8981/solr";,
"core":"collection1",
"node_name":"10.135.2.153:8981_solr",
"leader":"true"},
  "core_node10":{
"state":"active",
"base_url":"http://10.135.2.153:8982/solr";,
"core":"collection1",
"node_name":"10.135.2.153:8982_solr"}}},
  "shard2":{
"range":"0-7fff",
"state":"inactive",
"replicas":{
  "core_node9":{
"state":"active",
"base_url":"http://10.135.2.153:8984/solr";,
"core":"collection1",
"node_name":"10.135.2.153:8984_solr",
"leader":"true"},
  "core_node11":{
"state":"active",
"base_url":"http://10.135.2.153:8983/solr";,
"core":"collection1",
"node_name":"10.135.2.153:8983_solr"}}},
  "shard1_1":{
"range":null,
"state":"active",
"parent":null,
"replicas":{
  "core_node6":{
"state":"active",
"base_url":"http://10.135.2.153:8981/solr";,
"core":"collection1_shard1_1_replica1",
"node_name":"10.135.2.153:8981_solr",
"leader":"true"},
  "core_node8":{
"state":"active",
"base_url":"http://10.135.2.153:8984/solr";,
"core":"collection1_shard1_1_replica2",
"node_name":"10.135.2.153:8984_solr"}}},
  "shard1_0":{
"range":null,
"state":"active",
"parent":null,
"replicas":{
  "core_node5":{
"state":"active",
"base_url":"http://10.135.2.153:8981/solr";,
"core":"collection1_shard1_0_replica1",
"node_name":"10.135.2.153:8981_solr",
"leader":"true"},
  "core_node7":{
"state":"active",
"base_url":"http://10.135.2.153:8983/solr";,
"core":"collection1_shard1_0_replica2",
"node_name":"10.135.2.153:8983_solr"}}},
  "shard2_0":{
"range":"0-3fff",
"state":"active",
"replicas":{
  "core_node13":{
    "state":"active",
"base_url":"http://10.135.2.153:8984/solr";,
"core":"collection1_shard2_0_replica1",
"node_name":"10.135.2.153:8984_solr",
"leader":"true"},
  "core_node14":{
"state":"active",
"base_url":"http://10.135.2.153:8982/solr";,
"core":"collection1_shard2_0_replica2",
"node_name":"10.135.2.153:8982_solr"}}},
  "shard2_1":{
"range":"4000-7fff",
"state":"active",
"replicas":{
  "core_node12":{
"state":"active",
"base_url":"http://10.135.2.153:8984/solr";,
"core":"collection1_shard2_1_replica1",
"node_name":"10.135.2.153:8984_solr",
"leader":"true"},
  "core_node15":{
"state":"active",
"base_url":"http://10.135.2.153:8981/solr";,
"core":"collection1_shard2_1_replica2",
"node_name":

Cached fq decreases performance

2015-09-03 Thread Jeff Wartes

I have a query like:

q=<complicated stuff>&fq=enabled:true

For purposes of this conversation, "fq=enabled:true" is set for every query, I 
never open a new searcher, and this is the only fq I ever use, so the filter 
cache size is 1, and the hit ratio is 1.
The fq=enabled:true clause matches about 15% of my documents. I have some 20M 
documents per shard, in a 5.3 solrcloud cluster.

Under these circumstances, this alternate version of the query averages about 
1/3 faster, consumes less CPU, and generates less garbage:

q=<complicated stuff> +enabled:true

So it appears I have a case where using the cached fq result is more expensive 
than just putting the same restriction in the query.
Does someone have a clear mental model of how “q” and “fq” interact?
Naively, I’d expect that either the “q” operates within the set matched by the 
fq (in which case it’s doing "complicated stuff" on only a subset and should be 
faster) or that Solr takes the intersection of the q & fq sets (in which case 
putting the restriction in the “q” means that set needs to be generated instead 
of retrieved from cache, and should be slower).
This has me wondering, if you want fq cache speed boosts, but also want ranking 
involved, can you do that? Would something like q=<complicated stuff> AND 
<fq stuff>&fq=<fq stuff> help, or just be more work?

Thanks.


Re: Cached fq decreases performance

2015-09-03 Thread Jeff Wartes

I’m measuring performance in the aggregate, over several minutes and tens
of thousands of distinct queries that all use this specific fq.
The cache hit count reported is roughly identical to the number of queries
I’ve sent, so no, this isn’t a first-query cache-miss situation.

The fq result will be large, 15% of my documents qualify, so if solr is
intelligent enough to ignore that restriction in the main query until it’s
found a much smaller set to scan for that criteria, I could see how simply
processing the intersection of the full fq cache value could be time
consuming. Is that the kind of thing you’re talking about with
intersection hopping?


On 9/3/15, 2:00 PM, "Alexandre Rafalovitch"  wrote:

>FQ has to calculate the result bit set for every document to be able
>to cache it. Q will only calculate it for the documents it matches on
>and there is some intersection hopping going on.
>
>Are you seeing this performance hit on first query only or or every
>one? I would expect on first query only unless your filter cache size
>assumptions are somehow wrong.
>
>Regards,
>   Alex.
>
>
>Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>http://www.solr-start.com/
>
>
>On 3 September 2015 at 16:45, Jeff Wartes  wrote:
>>
>> I have a query like:
>>
>> q=<complicated stuff>&fq=enabled:true
>>
>> For purposes of this conversation, "fq=enabled:true" is set for every
>>query, I never open a new searcher, and this is the only fq I ever use,
>>so the filter cache size is 1, and the hit ratio is 1.
>> The fq=enabled:true clause matches about 15% of my documents. I have
>>some 20M documents per shard, in a 5.3 solrcloud cluster.
>>
>> Under these circumstances, this alternate version of the query averages
>>about 1/3 faster, consumes less CPU, and generates less garbage:
>>
>> q=<complicated stuff> +enabled:true
>>
>> So it appears I have a case where using the cached fq result is more
>>expensive than just putting the same restriction in the query.
>> Does someone have a clear mental model of how “q” and “fq” interact?
>> Naively, I’d expect that either the “q” operates within the set matched
>>by the fq (in which case it’s doing "complicated stuff" on only a subset
>>and should be faster) or that Solr takes the intersection of the q & fq
>>sets (in which case putting the restriction in the “q” means that set
>>needs to be generated instead of retrieved from cache, and should be
>>slower).
>> This has me wondering, if you want fq cache speed boosts, but also want
>>ranking involved, can you do that? Would something like q=<complicated stuff> AND <fq stuff>&fq=<fq stuff> help, or just be
>>more work?
>>
>> Thanks.



Re: Cached fq decreases performance

2015-09-04 Thread Jeff Wartes


On 9/4/15, 7:06 AM, "Yonik Seeley"  wrote:
>
>Lucene seems to always be changing it's execution model, so it can be
>difficult to keep up.  What version of Solr are you using?
>Lucene also changed how filters work,  so now, a filter is
>incorporated with the query like so:
>
>query = new BooleanQuery.Builder()
>.add(query, Occur.MUST)
>.add(pf.filter, Occur.FILTER)
>.build();
>
>It may be that term queries are no longer worth caching... if this is
>the case, we could automatically not cache them.
>
>It also may be the structure of the query that is making the
>difference.  Solr is creating
>
>(complicated stuff) +(filter(enabled:true))
>
>If you added +enabled:true directly to an existing boolean query, that
>may be more efficient for lucene to process (flatter structure).
>
>If you haven't already, could you try putting parens around your
>(complicated stuff) to see if it makes any difference?
>
>-Yonik


I’ll reply at this point in the thread, since it’s addressed to me, but I
strongly agree with some of the later comments in the thread about knowing
what’s going on. The whole point of this post is that this situation
violated my mental heuristics about how to craft a query.

In answer to the question, this is a Solrcloud 5.3 cluster. I can provide
a little more detail on (complicated stuff) too if that’s helpful. I have
not tried putting everything else in parens, but it’s a couple of distinct
paren clauses anyway:

q=+(_query_:"{!dismax (several fields)}") +(_query_:"{!dismax (several
fields)}") +(_query_:"{!dismax (several fields)}") +enabled:true

So to be clear, that query template outperforms this one:
q=+(_query_:"{!dismax (several fields)}") +(_query_:"{!dismax (several
fields)}") +(_query_:"{!dismax (several fields)}")&fq=enabled:true


Your comments remind me that I migrated from 5.2.1 to 5.3 while I’ve been
doing my performance testing, and I thought I noticed a performance
degradation in that transition, but I never followed though to confirm
that. I hadn’t tested moving that FQ clause into the Q on 5.2.1, only 5.3.






solr4.7: leader core does not get elected to other active core after Solr OS shutdown, known issue?

2015-09-21 Thread Jeff Wu
Our environment still runs Solr 4.7. Recently we noticed in a test that when
we stopped one Solr server (solr02, via an OS shutdown), all the cores of
solr02 were shown as "down", but a few cores still remained leaders. After
that, we quickly saw all other servers still sending requests to
that down Solr server, and therefore we saw a lot of TCP waiting threads in
the thread pools of the other Solr servers, since solr02 was already down.

"shard53":{
"range":"2666-2998",
"state":"active",
"replicas":{
  "core_node102":{
"state":"down",
"base_url":"https://solr02.myhost/solr";,
"core":"collection2_shard53_replica1",
"node_name":"https://solr02.myhost_solr";,
"leader":"true"},
  "core_node104":{
"state":"active",
"base_url":"https://solr04.myhost/solr";,
"core":"collection2_shard53_replica2",
"node_name":"https://solr04.myhost/solr_solr"}}},

Is this a known bug in 4.7 that was later fixed? Any reference JIRA we
can study?  If the Solr service is stopped gracefully, we can see that
leader election happens and leadership switches to another active core. But if we
just directly shut down a Solr OS, we can reproduce in our environment that
some "down" cores remain "leader" in the ZK clusterstate.json.


Solr4.7: tlog replay has a major delay before starting transaction log replay

2015-09-21 Thread Jeff Wu
Our environment runs Solr 4.7. We recently hit a core recovery failure, and
then it retried to recover from the tlog.

We noticed that after the 20:05:22 log line said "Recovery failed", the Solr
server waited a long time before it started tlog replay. During that time, we
had about 32 cores doing such tlog replay. It took over 40 minutes to bring the
whole service back.

Some questions we want to know:
1. Is tlog replay a single-threaded activity? Can we configure it to use
multiple threads, since in our deployment we have 64 cores on each Solr
server?

2. What might cause the tlog replay thread to wait for over 15 minutes
before actual tlog replay?  The actual replay seems very quick.

3. The last message "Log replay finished" does not tell which core it
finished. Given 32 cores to recover, we cannot know which core the log is
reporting on.

4. We know 4.7 is pretty old; we'd like to know whether this is a known issue
that was fixed in a later release. Any related JIRA?

Line 4120: ERROR - 2015-09-16 20:05:22.396;
org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again...
(0) core=collection3_shard11_replica2
WARN  - 2015-09-16 20:22:50.343;
org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0120498
refcount=2} active=true starting pos=25981
WARN  - 2015-09-16 20:22:53.301;
org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0
positionOfStart=25981}

Thank you all~


Re: solr4.7: leader core does not get elected to other active core after Solr OS shutdown, known issue?

2015-09-21 Thread Jeff Wu
Hi Shalin,  thank you for the response.

We waited longer than the ZK session timeout, and it still did
not kick off any leader election for these remaining down-leader cores.
That's the question I'm actually asking.

Our test scenario:

Each Solr server has 64 cores, and they are all active, and all leader
cores.
Shut down the Linux OS.
Monitor clusterstate.json in ZK, after waiting well past the ZK session timeout.
We noticed leader election happened for some cores, but still saw some down
cores remain leader.

2015-09-21 9:15 GMT-04:00 Shalin Shekhar Mangar :

> Hi Jeff,
>
> The leader election relies on ephemeral nodes in Zookeeper to detect
> when leader or other nodes have gone down (abruptly). These ephemeral
> nodes are automatically deleted by ZooKeeper after the ZK session
> timeout which is by default 30 seconds. So if you kill a node then it
> can take up to 30 seconds for the cluster to detect it and start a new
> leader election. This won't be necessary during a graceful shutdown
> because on shutdown the node will give up leader position so that a
> new one can be elected. You could tune the zk session timeout to a
> lower value but then it makes the cluster more sensitive to GC pauses
> which can also trigger new leader elections.
>
> On Mon, Sep 21, 2015 at 5:55 PM, Jeff Wu  wrote:
> > Our environment still run with Solr4.7. Recently we noticed in a test.
> When
> > we stopped 1 solr server(solr02, which did OS shutdown), all the cores of
> > solr02 are shown as "down", but remains a few cores still as leaders.
> After
> > that, we quickly seeing all other servers are still sending requests to
> > that down solr server, and therefore we saw a lot of TCP waiting threads
> in
> > thread pool of other solr servers since solr02 already down.
> >
> > "shard53":{
> > "range":"2666-2998",
> > "state":"active",
> > "replicas":{
> >   "core_node102":{
> > "state":"down",
> > "base_url":"https://solr02.myhost/solr";,
> > "core":"collection2_shard53_replica1",
> > "node_name":"https://solr02.myhost_solr";,
> > "leader":"true"},
> >   "core_node104":{
> > "state":"active",
> > "base_url":"https://solr04.myhost/solr";,
> > "core":"collection2_shard53_replica2",
> > "node_name":"https://solr04.myhost/solr_solr"}}},
> >
> > Is this something known bug in 4.7 and late on fixed? Any reference JIRA
> we
> > can study about?  If the solr service is stopped gracefully, we can see
> > leader core election happens and switched to other active core. But if we
> > just directly shutdown a Solr OS, we can reproduce in our environment
> that
> > some "Down" cores remains "leader" at ZK clusterstate.json
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Solr4.7: tlog replay has a major delay before it starts replaying the transaction log

2015-09-21 Thread Jeff Wu
>
> Before tlog replay, the replica will replicate any missing index files
> from the leader. I think that is what is causing the time between the
> two log messages. You have INFO logging turned off so there are no
> messages from the replication handler about it.


I did not monitor network throughput during that timeframe, and I thought
the first log line already showed that the peersync had failed, so I am
trying to understand where that time was spent.

Also, in our solr.log I did not see entries reporting Recovery - retry(1),
Recovery - retry(2), Recovery - give up, etc. before it told us about the
"tlog replay"

2015-09-21 9:07 GMT-04:00 Shalin Shekhar Mangar :

> Hi Jeff,
>
> Comments inline:
>
> On Mon, Sep 21, 2015 at 6:06 PM, Jeff Wu  wrote:
> > Our environment ran in Solr4.7. Recently hit a core recovery failure and
> > then it retries to recover from tlog.
> >
> > We noticed after  20:05:22 said Recovery failed, Solr server waited a
> long
> > time before it started tlog replay. During that time, we have about 32
> > cores doing such tlog relay. The service took over 40 minutes to make
> whole
> > service back.
> >
> > Some questions we want to know:
> > 1. Is tlog replay a single thread activity? Can we configure to have
> > multiple threads since in our deployment we have 64 cores for each solr
> > server.
>
> Each core gets a separate recovery thread but each individual log
> replay is single-threaded
>
> >
> > 2. What might cause the tlog replay thread to wait for over 15 minutes
> > before actual tlog replay?  The actual replay seems very quick.
>
> Before tlog replay, the replica will replicate any missing index files
> from the leader. I think that is what is causing the time between the
> two log messages. You have INFO logging turned off so there are no
> messages from the replication handler about it.
>
> >
> > 3. The last message "Log replay finished" does not tell which core it is
> > finished. Given 32 cores to recover, we can not know which core the log
> is
> > reporting.
>
> Yeah, many such issues were fixed in recent 5.x releases where we use
> MDC to log collection, shard, core etc for each message. Furthermore,
> tlog replay progress/status is also logged since 5.0
>
> >
> > 4. We know 4.7 is pretty old, we'd like to know is this known issue and
> > fixed in late release, any related JIRA?
> >
> > Line 4120: ERROR - 2015-09-16 20:05:22.396;
> > org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again...
> > (0) core=collection3_shard11_replica2
> > WARN  - 2015-09-16 20:22:50.343;
> > org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
> >
> tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0120498
> > refcount=2} active=true starting pos=25981
> > WARN  - 2015-09-16 20:22:53.301;
> > org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
> > recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0
> > positionOfStart=25981}
> >
> > Thank you all~
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Jeff Wu
---
CSDL Beijing, China


Re: solr4.7: leader role does not get elected to another active core after Solr OS shutdown, known issue?

2015-09-21 Thread Jeff Wu
Hi Shai, still the same question: other peer cores that are active did not
claim leadership even after a long time. However, some of the peer cores
did claim leadership earlier, while the server was stopping. Those results
are inconsistent.

2015-09-21 10:52 GMT-04:00 Shai Erera :

> I don't think the process Shalin describes applies to clusterstate.json.
> That JSON object reflects the status Solr "knows" about, or "last known
> status". When Solr is properly shutdown, I believe those attributes are
> cleared from clusterstate.json, as well the leaders give up their lease.
>
> However, when Solr is killed, it takes ZK the 30 seconds or so timeout to
> kill the ephemeral node and release the leader lease. ZK is unaware of
> Solr's clusterstate.json and cannot update the 'leader' property to false.
> It simply releases the lease, so that other cores may claim it.
>
> Perhaps that explains the confusion?
>
> Shai
>
> On Mon, Sep 21, 2015 at 4:36 PM, Jeff Wu  wrote:
>
> > Hi Shalin,  thank you for the response.
> >
> > We waited longer enough than the ZK session timeout time, and it still
> did
> > not kick off any leader election for these "remained down-leader" cores.
> > That's the question I'm actually asking.
> >
> > Our test scenario:
> >
> > Each solr server has 64 cores, and they are all active, and all leader
> > cores.
> > Shutdown the linux OS.
> > Monitor clusterstate.json over ZK, after enough ZK session timeout value.
> > We noticed some cores has leader election happened. But still saw some
> down
> > cores remains leader.
> >
> > 2015-09-21 9:15 GMT-04:00 Shalin Shekhar Mangar  >:
> >
> > > Hi Jeff,
> > >
> > > The leader election relies on ephemeral nodes in Zookeeper to detect
> > > when leader or other nodes have gone down (abruptly). These ephemeral
> > > nodes are automatically deleted by ZooKeeper after the ZK session
> > > timeout which is by default 30 seconds. So if you kill a node then it
> > > can take up to 30 seconds for the cluster to detect it and start a new
> > > leader election. This won't be necessary during a graceful shutdown
> > > because on shutdown the node will give up leader position so that a
> > > new one can be elected. You could tune the zk session timeout to a
> > > lower value but then it makes the cluster more sensitive to GC pauses
> > > which can also trigger new leader elections.
> > >
> > > On Mon, Sep 21, 2015 at 5:55 PM, Jeff Wu  wrote:
> > > > Our environment still run with Solr4.7. Recently we noticed in a
> test.
> > > When
> > > > we stopped 1 solr server(solr02, which did OS shutdown), all the
> cores
> > of
> > > > solr02 are shown as "down", but remains a few cores still as leaders.
> > > After
> > > > that, we quickly seeing all other servers are still sending requests
> to
> > > > that down solr server, and therefore we saw a lot of TCP waiting
> > threads
> > > in
> > > > thread pool of other solr servers since solr02 already down.
> > > >
> > > > "shard53":{
> > > > "range":"2666-2998",
> > > > "state":"active",
> > > > "replicas":{
> > > >   "core_node102":{
> > > > "state":"down",
> > > > "base_url":"https://solr02.myhost/solr";,
> > > > "core":"collection2_shard53_replica1",
> > > > "node_name":"https://solr02.myhost_solr";,
> > > > "leader":"true"},
> > > >   "core_node104":{
> > > > "state":"active",
> > > > "base_url":"https://solr04.myhost/solr";,
> > > > "core":"collection2_shard53_replica2",
> > > > "node_name":"https://solr04.myhost/solr_solr"}}},
> > > >
> > > > Is this something known bug in 4.7 and late on fixed? Any reference
> > JIRA
> > > we
> > > > can study about?  If the solr service is stopped gracefully, we can
> see
> > > > leader core election happens and switched to other active core. But
> if
> > we
> > > > just directly shutdown a Solr OS, we can reproduce in our environment
> > > that
> > > > some "Down" cores remains "leader" at ZK clusterstate.json
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Shalin Shekhar Mangar.
> > >
> >
>



-- 
Jeff Wu
---
CSDL Beijing, China


Autowarm and filtercache invalidation

2015-09-24 Thread Jeff Wartes

If I configure my filterCache like this:

<filterCache ... autowarmCount="10"/>

and I have <= 10 distinct filter queries I ever use, does that mean I’ve
effectively disabled cache invalidation? So my cached filter query results
will never change? (short of JVM restart)

I’m unclear on whether autowarm simply copies the value into the new
searcher’s cache or whether it tries to rebuild the results of the cached
filter query based on the new searcher’s view of the data.



Re: Autowarm and filtercache invalidation

2015-09-24 Thread Jeff Wartes
Answering my own question: it looks like the default filterCache regenerator
uses the old cache's keys to re-execute queries in the context of the new
searcher, and does nothing with the old cache values.

So, the new searcher’s cache contents will be consistent with that
searcher’s view, regardless of whether it was populated via autowarm.
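
A minimal sketch for verifying this (core name is a placeholder; uses the
stock mbeans handler): watch the filterCache counters and warmup time
across a commit.

curl -s 'http://localhost:8983/solr/collection1/admin/mbeans?stats=true&cat=CACHE&wt=json&indent=true' \
  | grep -E '"(lookups|hits|inserts|size|warmupTime)"'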


On 9/24/15, 11:28 AM, "Jeff Wartes"  wrote:

>
>If I configure my filterCache like this:
><filterCache ... autowarmCount="10"/>
>
>and I have <= 10 distinct filter queries I ever use, does that mean I’ve
>effectively disabled cache invalidation? So my cached filter query results
>will never change? (short of JVM restart)
>
>I’m unclear on whether autowarm simply copies the value into the new
>searcher’s cache or whether it tries to rebuild the results of the cached
>filter query based on the new searcher’s view of the data.
>



Re: How to know index file in OS Cache

2015-09-25 Thread Jeff Wartes


I’ve been relying on this:
https://code.google.com/archive/p/linux-ftools/


fincore will tell you what percentage of a given file is in cache, and
fadvise can suggest to the OS that a file be cached.

All of the solr start scripts at my company first call fadvise
(FADV_WILLNEED) on all the files in the index directories. It works great
if you’re on a linux system.
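
A minimal sketch of both calls (the index path is a placeholder, and the
exact option syntax comes from the project README - check --help on your
build):

# report how much of each index file is resident in the page cache
for f in /var/solr/data/collection1/data/index/*; do linux-fincore "$f"; done
# then ask the kernel to pre-read the same files
for f in /var/solr/data/collection1/data/index/*; do linux-fadvise "$f" POSIX_FADV_WILLNEED; done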



On 9/25/15, 8:41 AM, "Gili Nachum"  wrote:

>Gonna try Mikhail's suggestion, but just for fun you can also empirically
>"test" how much of a file is in the OS cache with:
>time cat <file> > /dev/null
>
>The faster it completes, the more blocks are cached. You can take a baseline
>after manually purging the cache - I don't recall the command. Note that
>running the command by itself encourages caching of the file.
>On Sep 25, 2015 12:39, "Aman Tandon"  wrote:
>
>> Awesome thank you Mikhail. This is what I was looking for.
>>
>> This was just a random question poped up in my mind. So I just asked
>>this
>> on the group.
>>
>> With Regards
>> Aman Tandon
>>
>> On Fri, Sep 25, 2015 at 2:49 PM, Mikhail Khludnev <
>> mkhlud...@griddynamics.com> wrote:
>>
>> > What about Linux:
>> > $less /proc//maps
>> > $pmap 
>> >
>> > On Fri, Sep 25, 2015 at 10:57 AM, Markus Jelsma <
>> > markus.jel...@openindex.io>
>> > wrote:
>> >
>> > > Hello - as far as i remember, you don't. A file itself is not the
>>unit
>> to
>> > > cache, but blocks are.
>> > > Markus
>> > >
>> > >
>> > > -Original message-
>> > > > From:Aman Tandon 
>> > > > Sent: Friday 25th September 2015 5:56
>> > > > To: solr-user@lucene.apache.org
>> > > > Subject: How to know index file in OS Cache
>> > > >
>> > > > Hi,
>> > > >
>> > > > Is there any way to know that the index file/s is present in the
>>OS
>> > cache
>> > > > or RAM. I want to check if the index is present in the RAM or in
>>OS
>> > cache
>> > > > and which files are not in either of them.
>> > > >
>> > > > With Regards
>> > > > Aman Tandon
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Sincerely yours
>> > Mikhail Khludnev
>> > Principal Engineer,
>> > Grid Dynamics
>> >
>> > 
>> > 
>> >
>>



Re: Cost of having multiple search handlers?

2015-09-28 Thread Jeff Wartes

One would hope that https://issues.apache.org/jira/browse/SOLR-4735 will
be done by then. 


On 9/28/15, 11:39 AM, "Walter Underwood"  wrote:

>We did the same thing, but reporting performance metrics to Graphite.
>
>But we won’t be able to add servlet filters in 6.x, because it won’t be a
>webapp.
>
>wunder
>Walter Underwood
>wun...@wunderwood.org
>http://observer.wunderwood.org/  (my blog)
>
>
>> On Sep 28, 2015, at 11:32 AM, Gili Nachum  wrote:
>> 
>> A different solution to the same need: I'm measuring response times of
>> different collections measuring  online/batch queries apart using New
>> Relic. I've added a servlet filter that analyses the request and makes
>>this
>> info available to new relic over a request argument.
>> 
>> The built in new relic solr plug in doesn't provide much.
>> On Sep 28, 2015 17:16, "Shawn Heisey"  wrote:
>> 
>>> On 9/28/2015 6:30 AM, Oliver Schrenk wrote:
 I want to register multiple but identical search handler to have
>>> multiple buckets to measure performance for our different apis and
>>> consumers (and to find out who is actually using Solr).
 
 What are there some costs associated with having multiple search
>>> handlers? Are they neglible?
>>> 
>>> Unless you are creating hundreds or thousands of them, I doubt you'll
>>> notice any significant increase in resource usage from additional
>>> handlers.  Each handler definition creates an additional URL endpoint
>>> within the servlet container, additional object creation within Solr,
>>> and perhaps an additional thread pool and threads to go with it, so
>>>it's
>>> not free, but I doubt that it's significant.  The resources required
>>>for
>>> actually handling a request is likely to dwarf what's required for more
>>> handlers.
>>> 
>>> Disclaimer: I have not delved into the code to figure out exactly what
>>> gets created with a search handler config, so I don't know exactly what
>>> happens.  I'm basing this on general knowledge about how Java programs
>>> are constructed by expert developers, not specifics about Solr.
>>> 
>>> There are others on the list who have a much better idea than I do, so
>>> if I'm wrong, I'm sure one of them will let me know.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
>



Re: Cost of having multiple search handlers?

2015-09-29 Thread Jeff Wartes

At the risk of going increasingly off-thread, yes, please do.
I’ve been using this:
https://dropwizard.github.io/metrics/3.1.0/manual/jetty/, which is
convenient, but doesn’t even have request-handler-level resolution.

Something I’ve started doing for issues that don’t seem likely to get
pulled in but also don’t really need changes in solr/lucene source code is
to publish a free-standing project (with tests) that builds the necessary
jar. For example, https://github.com/whitepages/SOLR-4449.
Seems like a decent middle ground where people can easily use or
contribute changes, and then if it gets popular enough, that’s a strong
signal it should be in the solr distribution itself.



On 9/28/15, 6:05 PM, "Walter Underwood"  wrote:

>We built our own because there was no movement on that. Don’t hold your
>breath.
>
>Glad to contribute it. We’ve been running it in production for a year,
>but the config is pretty manual.
>
>wunder
>Walter Underwood
>wun...@wunderwood.org
>http://observer.wunderwood.org/  (my blog)
>
>
>> On Sep 28, 2015, at 4:41 PM, Jeff Wartes  wrote:
>> 
>> 
>> One would hope that https://issues.apache.org/jira/browse/SOLR-4735 will
>> be done by then.
>> 
>> 
>> On 9/28/15, 11:39 AM, "Walter Underwood"  wrote:
>> 
>>> We did the same thing, but reporting performance metrics to Graphite.
>>> 
>>> But we won’t be able to add servlet filters in 6.x, because it won’t
>>>be a
>>> webapp.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Sep 28, 2015, at 11:32 AM, Gili Nachum 
>>>>wrote:
>>>> 
>>>> A different solution to the same need: I'm measuring response times of
>>>> different collections measuring  online/batch queries apart using New
>>>> Relic. I've added a servlet filter that analyses the request and makes
>>>> this
>>>> info available to new relic over a request argument.
>>>> 
>>>> The built in new relic solr plug in doesn't provide much.
>>>> On Sep 28, 2015 17:16, "Shawn Heisey"  wrote:
>>>> 
>>>>> On 9/28/2015 6:30 AM, Oliver Schrenk wrote:
>>>>>> I want to register multiple but identical search handler to have
>>>>> multiple buckets to measure performance for our different apis and
>>>>> consumers (and to find out who is actually using Solr).
>>>>>> 
>>>>>> What are there some costs associated with having multiple search
>>>>> handlers? Are they neglible?
>>>>> 
>>>>> Unless you are creating hundreds or thousands of them, I doubt you'll
>>>>> notice any significant increase in resource usage from additional
>>>>> handlers.  Each handler definition creates an additional URL endpoint
>>>>> within the servlet container, additional object creation within Solr,
>>>>> and perhaps an additional thread pool and threads to go with it, so
>>>>> it's
>>>>> not free, but I doubt that it's significant.  The resources required
>>>>> for
>>>>> actually handling a request is likely to dwarf what's required for
>>>>>more
>>>>> handlers.
>>>>> 
>>>>> Disclaimer: I have not delved into the code to figure out exactly
>>>>>what
>>>>> gets created with a search handler config, so I don't know exactly
>>>>>what
>>>>> happens.  I'm basing this on general knowledge about how Java
>>>>>programs
>>>>> are constructed by expert developers, not specifics about Solr.
>>>>> 
>>>>> There are others on the list who have a much better idea than I do,
>>>>>so
>>>>> if I'm wrong, I'm sure one of them will let me know.
>>>>> 
>>>>> Thanks,
>>>>> Shawn
>>>>> 
>>>>> 
>>> 
>> 
>



Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes

I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
index on fields like this:

<field name="city" ... docValues="true"/>


Re: Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes

No change - it still shows an insert per request, as does a simplified
request with only the facet params
"&facet.field=city&facet=true"

It’s definitely facet related though, facet=false eliminates the insert.



On 10/1/15, 1:50 PM, "Mikhail Khludnev"  wrote:

>what if you set f.city.facet.limit=-1 ?
>
>On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes 
>wrote:
>
>>
>> I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
>> index on fields like this:
>>
>> <field name="city" ... docValues="true"/>
>>
>> that look something like this:
>> 
>> q=...&fl=id,score&facet.field=city&facet=true&f.city.facet.mincount=1&f.city.facet.limit=50&rows=0&start=0&facet.method=fc
>>
>> (no, NOT facet.method=enum - the usage of the filterCache there is
>>pretty
>> well documented)
>>
>> Watching the filterCache stats, it appears that every one of these
>>queries
>> causes the "inserts" counter to be incremented by one. Distinct "q="
>> queries also increase the "size", and eviction happens as normal. If I
>> repeat the same query a few times, "lookups" is not incremented, so
>>these
>> entries generally appear to be completely wasted. (Although when
>>running a
>> lot of these queries, it appears as though a very small set also
>>increment
>> the "lookups" counter, but only a small set, and I haven’t figured out
>>why
>> some are special.)
>>
>> So the question is, why does this facet query have anything to do with
>>the
>> filterCache? This causes a huge amount of filterCache churn with no
>> apparent benefit.
>>
>>
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
><http://www.griddynamics.com>
>



Re: Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes
It still inserts if I address the core directly and use distrib=false.

I’ve got a few collections sharing the same config, so it’s surprisingly
annoying to
change solrconfig.xml right now, but it seemed pretty clear the query is
the thing being cached, since
the cache size only changes when the query does.



On 10/1/15, 3:01 PM, "Mikhail Khludnev"  wrote:

>hm..
>This option was useful for introspecting cache content
>https://wiki.apache.org/solr/SolrCaching#showItems It might help you to
>find-out a cause.
>I'm still blaming distributed requests, it expained here
>https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Over-Re
>questParameters
>eg does it happen if you run with distrib=false?
>
>On Fri, Oct 2, 2015 at 12:27 AM, Jeff Wartes 
>wrote:
>
>>
>> No change, still shows an insert per-request. As does a simplified
>>request
>> with only the facet params
>> "&facet.field=city&facet=true"
>>
>by default it's 100
>https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Theface
>t.limitParameter
>and can cause filtering by values, it can be seen in logs, btw.
>
>>
>> It’s definitely facet related though, facet=false eliminates the insert.
>>
>>
>>
>> On 10/1/15, 1:50 PM, "Mikhail Khludnev" 
>> wrote:
>>
>> >what if you set f.city.facet.limit=-1 ?
>> >
>> >On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes 
>> >wrote:
>> >
>> >>
>> >> I’m doing some fairly simple facet queries in a two-shard 5.3
>>SolrCloud
>> >> index on fields like this:
>> >>
>> >> > >> docValues="true”/>
>> >>
>> >> that look something like this:
>> >>
>> 
>>>>q=...&fl=id,score&facet.field=city&facet=true&f.city.facet.mincount=1&f
>>>>.c
>> >>it
>> >> y.facet.limit=50&rows=0&start=0&facet.method=fc
>> >>
>> >> (no, NOT facet.method=enum - the usage of the filterCache there is
>> >>pretty
>> >> well documented)
>> >>
>> >> Watching the filterCache stats, it appears that every one of these
>> >>queries
>> >> causes the "inserts" counter to be incremented by one. Distinct "q="
>> >> queries also increase the "size", and eviction happens as normal. If
>>I
>> >> repeat the same query a few times, "lookups" is not incremented, so
>> >>these
>> >> entries generally appear to be completely wasted. (Although when
>> >>running a
>> >> lot of these queries, it appears as though a very small set also
>> >>increment
>> >> the "lookups" counter, but only a small set, and I haven’t figured
>>out
>> >>why
>> >> some are special.)
>> >>
>> >> So the question is, why does this facet query have anything to do
>>with
>> >>the
>> >> filterCache? This causes a huge amount of filterCache churn with no
>> >> apparent benefit.
>> >>
>> >>
>> >
>> >
>> >--
>> >Sincerely yours
>> >Mikhail Khludnev
>> >Principal Engineer,
>> >Grid Dynamics
>> >
>> ><http://www.griddynamics.com>
>> >
>>
>>
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
><http://www.griddynamics.com>
>



Re: Facet queries blow out the filterCache

2015-10-02 Thread Jeff Wartes

I backed up a bit. I took the stock solr download and did this:

solr-5.3.1>$ bin/solr -e techproducts

So, no SolrCloud, default example config, about as basic as you get. I
didn’t even bother indexing any docs. Then I issued this query:

http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=true
&facet.field=popularity&facet.mincount=0&facet.limit=-1


This still causes an insert into the filterCache.

The only real difference I’m noticing vs my solrcloud collection is that
repeating the query increments cache lookups and hits. It’s still odd
though, because issuing new distinct queries causes a reported insert, but
not a lookup, so the cache hit ratio is always exactly 1.



On 10/2/15, 4:18 AM, "Toke Eskildsen"  wrote:

>On Thu, 2015-10-01 at 22:31 +, Jeff Wartes wrote:
>> It still inserts if I address the core directly and use distrib=false.
>
>It is quite strange that is is triggered with the direct access. If that
>can be reproduced in test, it looks like a performance optimization to
>be done.
>
>Anyway, operating under the assumption that the single-core facet
>request for some reason acts as a distributed call, the key to avoid the
>fine-counting is to ensure that _all_ possibly relevant term counts has
>been returned in the first facet phase.
>
>Try setting both facet.mincount=0 and facet.limit=-1.
>
>- Toke Eskildsen, State and University Library, Denmark
>
>



Re: Facet queries blow out the filterCache

2015-10-06 Thread Jeff Wartes

I dug far enough yesterday to find the GET_DOCSET, but not far enough to
find why. Thanks, a little context is really helpful sometimes.


So, starting with an empty filterCache...

http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=true
&facet.field=popularity

New values: lookups: 0, hits: 0, inserts: 1, size: 1

So for the reasons you explained, "inserts" is incremented for this new
search

http://localhost:8983/solr/techproducts/select?q=name:boo&rows=1&facet=true
&facet.field=popularity

New values: lookups: 0, hits: 0, inserts: 2, size: 2


Another new search, another new insert. No "lookups" though, so how does
it know name:boo wasn’t cached?

http://localhost:8983/solr/techproducts/select?q=name:boo&rows=1&facet=true
&facet.field=popularity
New values: lookups: 1, hits: 1, inserts: 2, size: 2


But it clearly does know - when I repeat the search, I get both a lookup
and a hit. (and no insert) So is this just
a bug in the stats reporting, perhaps?


When I first started looking at this, it was in a solrcloud cluster, and
one interesting thing about that cluster is that it was configured with
the queryResultCache turned off, so let’s repeat the above experiment
without the queryResultCache. (I’m just commenting it out in the
techproducts config for this run.)


Starting with an empty filterCache...

http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=true
&facet.field=popularity
New values: lookups: 0, hits: 0, inserts: 1, size: 1

Same as before...

http://localhost:8983/solr/techproducts/select?q=name:boo&rows=1&facet=true
&facet.field=popularity
New values: lookups: 0, hits: 0, inserts: 2, size: 2

Same as before...

http://localhost:8983/solr/techproducts/select?q=name:boo&rows=1&facet=true
&facet.field=popularity
New values: lookups: 0, hits: 0, inserts: 3, size: 2

No cache hit! We get an insert instead, but it’s already in there, so the
size doesn’t change. So disabling the queryResultCache apparently causes
facet queries to be unable to use the filterCache?




I’m increasingly thinking that different use cases need different
filterCaches, rather than trying to bundle every explicit or unexpected
use case under one cache with one size and one regenerator.






On 10/6/15, 2:45 PM, "Chris Hostetter"  wrote:

>: So, no SolrCloud, default example config, about as basic as you get. I
>: didn’t even bother indexing any docs. Then I issued this query:
>: 
>: 
>http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=tru
>e
>: &facet.field=popularity&facet.mincount=0&facet.limit=-1
>
>: This still causes an insert into the filterCache.
>
>the faceting component is a type of operation that indicates in the
>QueryCommand that it needs to GET_DOCSET for the set of all documents
>matching the query (independent of pagination) -- the point of this
>DocSet 
>is so the faceting logic can then compute the intersection of the set of
>all matching documents with the set of documents matching each facet
>constraint.  the cached DocSet will be re-used both within the context
>of the current request, and in future facet requests over the
>same query+filters.
>
>: The only real difference I’m noticing vs my solrcloud collection is that
>: repeating the query increments cache lookups and hits. It’s still odd
>: though, because issuing new distinct queries causes a reported insert,
>but
>: not a lookup, so the cache hit ratio is always exactly 1.
>
>i'm not following what you are saying at all ... can you give some
>concrete examples (ie: "starting with an empty cache i do this request,
>then i see these cache stats, then i do this identical/different query
>and 
>then the cache stats look like this...")
>
>
>
>-Hoss
>http://www.lucidworks.com/



Re: are there any SolrCloud supervisors?

2015-10-14 Thread Jeff Wartes

I’m aware of two public administration tools:
This was announced to the list just recently:
https://github.com/bloomreach/solrcloud-haft
And I’ve been working in this:
https://github.com/whitepages/solrcloud_manager

Both of these hook the Solrcloud client’s ZK access to inspect the cluster
state and execute more complex cluster-aware operations. I was also a bit
amused, because it looks like we both independently arrived at the same
replication-handler-based copy-collection operation. (Which suggests to me
that the functionality should be pushed into the collections API.)

Neither of these is a supervisor though, they merely provide a way to
execute cluster aware commands. Another monitor-oriented mechanism would
be needed to detect when to perform those commands, and I’ve not seen
anything existing along those lines.



On 10/13/15, 5:35 AM, "Susheel Kumar"  wrote:

>Sounds interesting...
>
>On Tue, Oct 13, 2015 at 12:58 AM, Trey Grainger 
>wrote:
>
>> I'd be very interested in taking a look if you post the code.
>>
>> Trey Grainger
>> Co-Author, Solr in Action
>> Director of Engineering, Search & Recommendations @ CareerBuilder
>>
>> On Fri, Oct 2, 2015 at 3:09 PM, r b  wrote:
>>
>> > I've been working on something that just monitors ZooKeeper to add and
>> > remove nodes from collections. the use case being I put SolrCloud in
>> > an autoscaling group on EC2 and as instances go up and down, I need
>> > them added to the collection. It's something I've built for work and
>> > could clean up to share on GitHub if there is much interest.
>> >
>> > I asked in the IRC about a SolrCloud supervisor utility but wanted to
>> > extend that question to this list. are there any more "full featured"
>> > supervisors out there?
>> >
>> >
>> > -renning
>> >
>>



solr4.7: truncated log output in grouping.CommandHandler?

2015-10-19 Thread Jeff Wu
Our Solr 4.7 server recently reported the WARN message below, followed by a
long GC pause. Sometimes this forces the Solr server to disconnect from the
ZK server.

Solr 4.7.0, got this warning message:
WARN  - 2015-10-19 02:23:24.503;
org.apache.solr.search.grouping.CommandHandler; Query: +(+owner:testUser)
+(+directoryUUID:x +softFlag:`^H^@^@^@^@); Elapsed time:
20Exceeded allowed search time: 1 ms.

Our question is: the log string contains something like `^H^@^@^@^@). This
looks like a truncated string. Is this expected output, or does it indicate
something wrong with this query?


Anyone users IBM J9 JVM with 32G max heap ? Tuning recommendations?

2015-10-19 Thread Jeff Wu
Hi all,

we are running Solr 4.7 on the IBM J9 JVM (Java 7), with a 32G max heap and
64G of system RAM.

JVM parameters: -Xgcpolicy:balanced -verbose:gc -Xms12228m -Xmx32768m
-XX:PermSize=128m -XX:MaxPermSize=512m

We are facing one issue: we set the zkClient timeout to 30 seconds. With the
balanced GC policy we sometimes hit a global GC pause of more than 30
seconds, so the Solr server disconnects from ZK, and /update requests on
that Solr node stay disabled after the ZK disconnect. We have to restart the
Solr server to recover.

Staying with the IBM JVM, does anyone have recommendations? Average heap
usage on our Solr servers is around 26G, so we'd like to keep the 32G max
heap, but we want to tune the JVM for fewer long global GC pauses.
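
For illustration only, a hedged sketch of the alternative most often
suggested - the gencon policy with an explicit nursery. Every flag below is
an assumption to verify against IBM's J9 documentation for your exact JVM
build, and the variable name is a placeholder for wherever your start
script sets JVM args:

# gencon with a large nursery; verbose GC to a log so pauses can be measured
SOLR_JVM_OPTS="-Xgcpolicy:gencon -Xmn8g -Xms32g -Xmx32g \
  -verbose:gc -Xverbosegclog:/var/log/solr/solr_gc.log"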


Re: DevOps question : auto deployment/setup of Solr & Zookeeper on medium-large clusters

2015-10-20 Thread Jeff Wartes

If you’re using AWS, there’s this:
https://github.com/LucidWorks/solr-scale-tk
If you’re using chef, there’s this:
https://github.com/vkhatri/chef-solrcloud

(There are several other chef cookbooks for Solr out there, but this is
the only one I’m aware of that supports Solr 5.3.)

For ZK, I’m less familiar, but if you’re using chef there’s this:
https://github.com/SimpleFinance/chef-zookeeper
And this might be handy to know about too:
https://github.com/Netflix/exhibitor/wiki


On 10/20/15, 6:37 AM, "Davis, Daniel (NIH/NLM) [C]" 
wrote:

>Waste of money in my opinion.   I would point you towards other tools -
>bash scripts and free configuration managers such as puppet, chef, salt,
>or ansible.Depending on what development you are doing, you may want
>a continuous integration environment.   For a small company starting out,
>using a free CI, maybe SaaS, is a good choice.   A professional version
>such as Bamboo, TeamCity, Jenkins are almost essential in a large
>enterprise if you are doing diverse builds.
>
>When you create a VM, you can generally specify a script to run after the
>VM is mostly created.   There is a protocol (PXE Boot) that enables this
>- a PXE server listens and hears that a new server with such-and-such
>Ethernet Address is starting.   The PXE server makes it boot like a
>CD-ROM/DVD install, booting from installation media on the network and
>installing.Once that install is down, a custom script may be invoked.
>  This script is typically a bash script, because you may not be able to
>count on too much else being installed.   However, python/perl are also
>reasonable choices - just be careful that the modules/libraries you are
>using for the script are present.The same PXE protocol is used in
>large on-premises installations (vCenter) and in the cloud (AWS/Digital
>Ocean).  We don't care about the PXE server - the point is that you can
>generally run a bash script after your install.
>
>The bash script can bootstrap other services such as puppet, chef, or
>salt, and/or setup keys so that push configuration management tools such
>as ansible can reach the server.   The bash script may even be smart
>enough to do all of the setup you need, depending on what other servers
>you need to configure.   Smart bash scripts are good for a small company,
>but for large setups, I'd use puppet, chef, salt, and/or ansible.
>
>What I tend to do is to deploy things in such a way that puppet (because
>it is what we use here) can setup things so that a "solradm" account can
>setup everything else, and solr and zookeeper are running as a "solrapp"
>user using puppet.Then, my continuous integration server, which is
>Atlassian Bamboo (you can also use tools such as Jenkins, TeamCity,
>BuildBot), installs solr as "solradm" and sets it up to run as "solrapp".
>
>I am not a systems administrator, and I'm not really in "DevOps", my job
>is to be above all of that and do "systems architecture" which I am lucky
>still involves coding both in system administration and applications
>development.   So, that's my 2 cents.
>
>Dan Davis, Systems/Applications Architect (Contractor),
>Office of Computer and Communications Systems,
>National Library of Medicine, NIH
>
>-Original Message-
>From: Susheel Kumar [mailto:susheel2...@gmail.com]
>Sent: Tuesday, October 20, 2015 9:19 AM
>To: solr-user@lucene.apache.org
>Subject: DevOps question : auto deployment/setup of Solr & Zookeeper on
>medium-large clusters
>
>Hello,
>
>Resending to see opinion from Dev-Ops perspective on the tools for
>installing/deployment of Solr & ZK on large no of machines and
>maintaining them. I have heard Bladelogic or HP OO (commercial tools)
>etc. being used.
>Please share your experience or pros / cons of such tools.
>
>Thanks,
>Susheel
>
>On Mon, Oct 19, 2015 at 3:32 PM, Susheel Kumar 
>wrote:
>
>> Hi,
>>
>> I am trying to find the best practises for setting up Solr on new 20+
>> machines  & ZK (5+) and repeating same on other environments.  What's
>> the best way to download, extract, setup Solr & ZK in an automated way
>> along with other dependencies like java etc.  Among shell scripts or
>> puppet or docker or imaged vm's what is being used & suggested from
>> Dev-Ops perspective.
>>
>> Thanks,
>> Susheel
>>
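
As a minimal sketch of the post-install bash bootstrap described above
(version, paths, and user name are placeholders; check the
install_solr_service.sh flags against your Solr release):

#!/usr/bin/env bash
set -euo pipefail
V=5.3.1
curl -sLO "https://archive.apache.org/dist/lucene/solr/$V/solr-$V.tgz"
# extract only the installer script from the archive
tar xzf "solr-$V.tgz" "solr-$V/bin/install_solr_service.sh" --strip-components=2
# install Solr as a service running under the solrapp user, data under /var/solr
sudo bash install_solr_service.sh "solr-$V.tgz" -u solrapp -d /var/solr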



Re: copy data between collection

2015-10-26 Thread Jeff Wartes

The “copy” command in this tool automatically does what Upayavira
describes, including bringing the replicas (if any) up to date.
https://github.com/whitepages/solrcloud_manager


I’ve been using it as a mechanism for copying a collection into a new
cluster (different ZK), but it should work within
a cluster too. The same caveats apply - see the entry in the README.

I’ve also been doing some collection backup/restore stuff that could be
used to copy a collection within a cluster, (back up your collection, then
restore into a new collection with a different name) but I only just
pushed that, and haven’t bundled it into a release yet.

In all cases, you’re responsible for managing the actual collection
definitions yourself.

An alternative tool I’m aware of is this one:
https://github.com/bloomreach/solrcloud-haft

This says it’s only tested with Solr 4.6, but I’d think it should work.
The Solr APIs for replication haven’t changed much. I haven’t used it, but
it looks like it has some stuff around saving ZK data that could be
useful, and that’s one thing I haven’t focused on myself yet.



On 10/26/15, 4:46 AM, "Upayavira"  wrote:

>Hi Shani,
>
>There isn't a SolrCloud way to do it. A proper 'clone this collection'
>feature would be a very useful thing.
>
>However, I have managed to do it, in a way that involves some caveats:
> * you should only do this on a collection that has no replicas. Add
> replicas *after* cloning the index
> * if you must do it on a sharded index, then you will need to do it
> once for each shard. No guarantees though
>
>All SolrCloud nodes are all already enabled as 'replication masters' so
>that new replicas can pull a full index from the current leader. We're
>gonna use this feature to pull our index (assuming single shard):
>
>http://<newhost>:8983/solr/<newcollection>_shard1_replica1/replication?command=fetchindex&masterUrl=http://<oldhost>:8983/solr/<oldcollection>_shard1_replica1/replication
>
>This basically says to the core behind your new collection: "Go to the
>core behind the old collection, and pull its entire index".
>
>This worked for me. I added a replica afterwards, and the index cloned
>correctly. However, when I did it against a collection that had a
>replica already, the replica *didn't* notice, meaning the leader/replica
>were now out of sync, i.e: Really make sure you do this replication
>before you add replicas to your new collection.
>
>Hope this helps.
>
>Upayavira
>
>On Mon, Oct 26, 2015, at 11:21 AM, Chaushu, Shani wrote:
>> Hi,
>> Is there an API to copy all the documents from one collection to another
>> collection in the same solr server simply?
>> I'm using solr cloud 4.10
>> Thanks,
>> Shani
>> 
>> -
>> Intel Electronics Ltd.
>> 
>> This e-mail and any attachments may contain confidential material for
>> the sole use of the intended recipient(s). Any review or distribution
>> by others is strictly prohibited. If you are not the intended
>> recipient, please contact the sender and delete all copies.
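
A minimal sketch of the add-replicas-afterwards step Upayavira describes
(host and names are placeholders; ADDREPLICA is available from Solr 4.8 on):

curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=newcollection&shard=shard1'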



Re: replica recovery

2015-10-27 Thread Jeff Wartes

On the face of it, your scenario seems plausible. I can offer two pieces
of info that may or may not help you:

1. A write request to Solr will not be acknowledged until an attempt has
been made to write to all relevant replicas. So, B won’t ever be missing
updates that were applied to A, unless communication with B was disrupted
somehow at the time of the update request. You can add a min_rf param to
your write request, in which case the response will tell you how many
replicas received the update, but it’s still up to your indexer client to
decide what to do if that’s less than your replication factor.

See 
https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+
Tolerance for more info.

2. There are two forms of replication. The usual thing is for the leader
for each shard to write an update to all replicas before acknowledging the
write itself, as above. If a replica is less than N docs behind the
leader, the leader can replay those docs to the replica from its
transaction log. If a replica is more than N docs behind though, it falls
back to the replication handler recovery mode you mention, and attempts to
re-sync the whole shard from the leader.
The default N for this is 100, which is pretty low for a high-update-rate
index. It can be changed by increasing the size of the transaction log,
(via numRecordsToKeep) but be aware that a large transaction log size can
delay node restart.

See 
https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConf
ig#UpdateHandlersinSolrConfig-TransactionLog for more info.


Hope some of that helps, I don’t know a way to say
delete-first-on-recovery.
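
A minimal sketch of the min_rf parameter from point 1 (host, collection, and
document are placeholders). The response reports the achieved replication
factor; the indexing client decides what to do when it is below target:

curl 'http://localhost:8983/solr/collection1/update?min_rf=2&wt=json' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc-1"}]'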



On 10/27/15, 5:21 PM, "Brian Scholl"  wrote:

>Whoops, in the description of my setup that should say 2 replicas per
>shard.  Every server has a replica.
>
>
>> On Oct 27, 2015, at 20:16, Brian Scholl  wrote:
>> 
>> Hello,
>> 
>> I am experiencing a failure mode where a replica is unable to recover
>>and it will try to do so forever.  In writing this email I want to make
>>sure that I haven't missed anything obvious or missed a configurable
>>option that could help.  If something about this looks funny, I would
>>really like to hear from you.
>> 
>> Relevant details:
>> - solr 5.3.1
>> - java 1.8
>> - ubuntu linux 14.04 lts
>> - the cluster is composed of 1 SolrCloud collection with 100 shards
>>backed by a 3 node zookeeper ensemble
>> - there are 200 solr servers in the cluster, 1 replica per shard
>> - a shard replica is larger than 50% of the available disk
>> - ~40M docs added per day, total indexing time is 8-10 hours spread
>>over the day
>> - autoCommit is set to 15s
>> - softCommit is not defined
>>  
>> I think I have traced the failure to the following set of events but
>>would appreciate feedback:
>> 
>> 1. new documents are being indexed
>> 2. the leader of a shard, server A, fails for any reason (java crashes,
>>times out with zookeeper, etc)
>> 3. zookeeper promotes the other replica of the shard, server B, to the
>>leader position and indexing resumes
>> 4. server A comes back online (typically 10s of seconds later) and
>>reports to zookeeper
>> 5. zookeeper tells server A that it is no longer the leader and to sync
>>with server B
>> 6. server A checks with server B but finds that server B's index
>>version is different from its own
>> 7. server A begins replicating a new copy of the index from server B
>>using the (legacy?) replication handler
>> 8. the original index on server A was not deleted so it runs out of
>>disk space mid-replication
>> 9. server A throws an error, deletes the partially replicated index,
>>and then tries to replicate again
>> 
>> At this point I think steps 6  => 9 will loop forever
>> 
>> If the actual errors from solr.log are useful let me know, not doing
>>that now for brevity since this email is already pretty long.  In a
>>nutshell and in order, on server A I can find the error that took it
>>down, the post-recovery instruction from ZK to unregister itself as a
>>leader, the corrupt index error message, and then the (start - whoops,
>>out of disk- stop) loop of the replication messages.
>> 
>> I first want to ask if what I described is possible or did I get lost
>>somewhere along the way reading the docs?  Is there any reason to think
>>that solr should not do this?
>> 
>> If my version of events is feasible I have a few other questions:
>> 
>> 1. What happens to the docs that were indexed on server A but never
>>replicated to server B before the failure?  Assuming that the replica on
>>server A were to complete the recovery process would those docs appear
>>in the index or are they gone for good?
>> 
>> 2. I am guessing that the corrupt replica on server A is not deleted
>>because it is still viable, if server B had a catastrophic failure you
>>could pick up the pieces from server A.  If so is this a configurable
>>option somewhere?  I'd rather take my chances on server B going down
>>before replication finishes than be stuck in th

Re: Facet queries blow out the filterCache

2015-10-28 Thread Jeff Wartes

FWIW, since it seemed like there was at least one bug here (and possibly
more), I filed
https://issues.apache.org/jira/browse/SOLR-8171



On 10/6/15, 3:58 PM, "Jeff Wartes"  wrote:

>
>I dug far enough yesterday to find the GET_DOCSET, but not far enough to
>find why. Thanks, a little context is really helpful sometimes.
>
>
>So, starting with an empty filterCache...
>
>http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=true&facet.field=popularity
>
>New values: lookups: 0, hits: 0, inserts: 1, size: 1
>
>So for the reasons you explained, "inserts" is incremented for this new
>search
>
>http://localhost:8983/solr/techproducts/select?q=name:boo&rows=1&facet=true&facet.field=popularity
>
>New values: lookups: 0, hits: 0, inserts: 2, size: 2
>
>
>Another new search, another new insert. No "lookups" though, so how does
>it know name:boo wasn’t cached?
>
>http://localhost:8983/solr/techproducts/select?q=name:boo&rows=1&facet=true&facet.field=popularity
>New values: lookups: 1, hits: 1, inserts: 2, size: 2
>
>
>But it clearly does know - when I repeat the search, I get both a lookup
>and a hit. (and no insert) So is this just
>a bug in the stats reporting, perhaps?
>
>
>When I first started looking at this, it was in a solrcloud cluster, and
>one interesting thing about that cluster is that it was configured with
>the queryResultCache turned off, so let’s repeat the above experiment
>without the queryResultCache. (I’m just commenting it out in the
>techproducts config for this run.)
>
>
>Starting with an empty filterCache...
>
>http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=true&facet.field=popularity
>New values: lookups: 0, hits: 0, inserts: 1, size: 1
>
>Same as before...
>
>http://localhost:8983/solr/techproducts/select?q=name:boo&rows=1&facet=true&facet.field=popularity
>New values: lookups: 0, hits: 0, inserts: 2, size: 2
>
>Same as before...
>
>http://localhost:8983/solr/techproducts/select?q=name:boo&rows=1&facet=true&facet.field=popularity
>New values: lookups: 0, hits: 0, inserts: 3, size: 2
>
>No cache hit! We get an insert instead, but it’s already in there, so the
>size doesn’t change. So disabling the queryResultCache apparently causes
>facet queries to be unable to use the filterCache?
>
>
>
>
>I’m increasingly thinking that different use cases need different
>filterCaches, rather than try to bundle every explicit or unexpected
>use-case under one cache with one size and one regenerator.
>
>
>
>
>
>
>On 10/6/15, 2:45 PM, "Chris Hostetter"  wrote:
>
>>: So, no SolrCloud, default example config, about as basic as you get. I
>>: didn’t even bother indexing any docs. Then I issued this query:
>>: 
>>: 
>>http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=tr
>>u
>>e
>>: &facet.field=popularity&facet.mincount=0&facet.limit=-1
>>
>>: This still causes an insert into the filterCache.
>>
>>the faceting component is a type of operation that indicates in the
>>QueryCommand that it needs to GET_DOCSET for the set of all documents
>>matching the query (independent of pagination) -- the point of this
>>DocSet 
>>is so the faceting logic can then compute the intersection of the set of
>>all matching documents with the set of documents matching each facet
>>constraint.  the cached DocSet will be re-used both within the context
>>of the current request, and in future facet requests over the
>>same query+filters.
>>
>>: The only real difference I’m noticing vs my solrcloud collection is
>>that
>>: repeating the query increments cache lookups and hits. It’s still odd
>>: though, because issuing new distinct queries causes a reported insert,
>>but
>>: not a lookup, so the cache hit ratio is always exactly 1.
>>
>>i'm not following what you are saying at all ... can you give some
>>concrete examples (ie: "starting with an empty cache i do this request,
>>then i see these cache stats, then i do this identical/different query
>>and 
>>then the cache stats look like this...")
>>
>>
>>
>>-Hoss
>>http://www.lucidworks.com/
>



Re: Data Import Handler / Backup indexes

2015-11-17 Thread Jeff Wartes

https://github.com/whitepages/solrcloud_manager supports 5.x, and I added
some backup/restore functionality similar to SOLR-5750 in the last
release. 
Like SOLR-5750, this backup strategy requires a shared filesystem, but
note that unlike SOLR-5750, I haven’t yet added any backup functionality
for the contents of ZK. I’m currently working on some parts of that.


Making a copy of a collection is supported too, with some caveats.


On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:

>Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>
>
>
>On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>
>> afaik Data import handler does not offer backups. You can try using the
>> replication handler to backup data as you wish to any custom end point.
>>
>> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>>This
>> helps backup solr indices across clusters.
>>
>> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi  wrote:
>>
>> > I am using Data Import Handler to retrieve data from a database with
>> >
>> > full-import, clean = true, commit = true and optimize = true
>> >
>> > This has always worked correctly without any errors.
>> >
>> > But just to be on the safe side, I am thinking that we should do a
>>backup
>> > before initiating Data Import Handler. And just in case something
>>happens
>> > restore the backup.
>> >
>> > Can backup be done automatically (before initiating Data Import
>>Handler)?
>> >
>> > Thanks
>> >
>>



Re: replica recovery

2015-11-19 Thread Jeff Wartes

I completely agree with the other comments on this thread with regard to
needing more disk space asap, but I thought I’d add a few comments
regarding the specific questions here.

If your goal is to prevent full recovery requests, you only need to cover
the duration you expect a replica to be unavailable.

If your common issue is GC due to bad queries, you probably don’t need to
cover more than the number of docs you write in your typical full GC
pause. I suspect this is less than 10M.
If your common issue is the length of time it takes you to notice a server
crashed and restart it, you may need to cover something like 10 minutes
worth of docs. I suspect this is still less than 10M.

You certainly don’t need to keep an entire day’s transaction logs. If your
servers routinely go down for a whole day, solve that by fixing your
servers. :)

With respect to ulimit, if Solr is the only thing of significance on the
box, there’s no reason not to bump that up. I usually just set something
like 32k and stop thinking about it. I get hit by a low ulimit every now
and then, but I can’t recall ever having had an issue with it being too
high.
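
A minimal sketch of persisting a higher limit on Ubuntu (user name and value
are placeholders; log in again or restart Solr for it to take effect):

cat <<'EOF' | sudo tee -a /etc/security/limits.conf
solr  soft  nofile  32768
solr  hard  nofile  32768
EOF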



On 11/19/15, 6:21 AM, "Brian Scholl"  wrote:

>I have opted to modify the number and size of transaction logs that I
>keep to resolve the original issue I described.  In so doing I think I
>have created a new problem, feedback is appreciated.
>
>Here are the new updateLog settings:
>
><updateLog>
>  <str name="dir">${solr.ulog.dir:}</str>
>  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
>  <int name="numRecordsToKeep">1000</int>
>  <int name="maxNumLogsToKeep">5760</int>
></updateLog>
>
>First I want to make sure I understand what these settings do:
>   numRecordsToKeep: per transaction log file keep this number of documents
>   maxNumLogsToKeep: retain this number of transaction log files total
>
>During my testing I thought I observed that a new tlog is created every
>time auto-commit is triggered (every 15 seconds in my case) so I set
>maxNumLogsToKeep high enough to contain an entire days worth of updates.
> Knowing that I could potentially need to bulk load some data I set
>numRecordsToKeep higher than my max throughput per replica for 15 seconds.
>
>The problem that I think this has created is I am now running out of file
>descriptors on the servers.  After indexing new documents for a couple
>hours a some servers (not all) will start logging this error rapidly:
>
>73021439 WARN  
>(qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0
>.0.0.0:8983}) [   ] o.e.j.s.ServerConnector
>java.io.IOException: Too many open files
>   at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>   at 
>sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422
>)
>   at 
>sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250
>)
>   at 
>org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
>   at 
>org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.
>java:500)
>   at 
>org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.jav
>a:635)
>   at 
>org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java
>:555)
>   at java.lang.Thread.run(Thread.java:745)
>
>The output of ulimit -n for the user running the solr process is 1024.  I
>am pretty sure I can prevent this error from occurring  by increasing the
>limit on each server but it isn't clear to me how high it should be or if
>raising the limit will cause new problems.
>
>Any advice you could provide in this situation would be awesome!
>
>Cheers,
>Brian
>
>
>
>> On Oct 27, 2015, at 20:50, Jeff Wartes  wrote:
>> 
>> 
>> On the face of it, your scenario seems plausible. I can offer two pieces
>> of info that may or may not help you:
>> 
>> 1. A write request to Solr will not be acknowledged until an attempt has
>> been made to write to all relevant replicas. So, B won’t ever be missing
>> updates that were applied to A, unless communication with B was
>>disrupted
>> somehow at the time of the update request. You can add a min_rf param to
>> your write request, in which case the response will tell you how many
>> replicas received the update, but it’s still up to your indexer client
>>to
>> decide what to do if that’s less than your replication factor.
>> 
>> See 
>> 
>>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Faul
>>t+
>> Tolerance for more info.
>> 
>> 2. There are two forms of replication. The usual thing is for the leader
>> for each shard to write an update to all replicas before acknowledging
>>the
>> write itself, as above. If a rep

Re: Data Import Handler / Backup indexes

2015-11-23 Thread Jeff Wartes

The backup/restore approach in SOLR-5750 and in solrcloud_manager is
really just that - copying the index files.
On backup, it saves your index directories, and on restore, it puts them
in the data dir, moves a pointer for the current index dir, and opens a
new searcher. Both are mostly just wrappers on the proper Solr
replication-handler commands, since Solr already has some lower level APIs
for these operations.

There is a shared filesystem requirement for backup/restore though, which
is to account for the fact that when you make the backup you don’t know
which nodes will need to restore a given shard.

The commands would look something like:

java -jar solrcloud_manager-assembly-1.4.0.jar backupindex -z
zk0.example.com:2181/myapp -c collection1 --dir 
java -jar solrcloud_manager-assembly-1.4.0.jar restoreindex -z
zk0.example.com:2181/myapp -c collection1 --dir 

Or you could restore into a new collection:
java -jar solrcloud_manager-assembly-1.4.0.jar backupindex -z
zk0.example.com:2181/myapp -c collection1 --dir 
java -jar solrcloud_manager-assembly-1.4.0.jar clonecollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1
java -jar solrcloud_manager-assembly-1.4.0.jar restoreindex -z
zk0.example.com:2181/myapp -c newcollection --dir 
--restoreFrom collection1

If you don’t have a shared filesystem, you can still do the copy
collection route:
java -jar solrcloud_manager-assembly-1.4.0.jar clonecollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1

java -jar solrcloud_manager-assembly-1.4.0.jar copycollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1

This creates a new collection with the same settings, (clonecollection)
and triggers a one-shot “replication” into it. (copycollection) Again,
this is just framework for the proper (largely undocumented) Solr API
commands, to work around the lack of a convenient collections-level API
command.

One nice thing about using copy collection is that it can be used to keep
a backup collection up to date, only copying if necessary. Honestly
though, I don’t have as much experience with this use case as KNitin does
in solrcloud-haft, which is why I suggest using an empty collection in the
README right now. If you try that use case with solrcloud_manager, I’d be
interested in your experience. It should work, but you’ll need to disable
the verification with --skipCheck and check manually.


Having said all that though, yes, with your simple use case and small
collection, you can do everything you want with just cp. The easiest way
would be to make a backup copy of your index dir. If you need to restore,
shut down solr, nuke your index dir, and copy the backup in there. You’d
probably need to do this on all nodes at once though, to prevent a
non-leader from coming up and re-syncing with a piece of the index you
hadn’t restored yet.
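
A minimal sketch of that cp-based restore (service name and paths are
placeholders; run the same steps on every node before starting any of them):

sudo service solr stop
mv /var/solr/data/collection1_shard1_replica1/data/index /var/solr/data/collection1_shard1_replica1/data/index.bad
cp -a /backups/collection1_shard1_replica1/index /var/solr/data/collection1_shard1_replica1/data/index
sudo service solr start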




On 11/21/15, 10:12 PM, "Brian Narsi"  wrote:

>What are the caveats regarding the copy of a collection?
>
>At this time DIH takes only about 10 minutes. So in case of accidental
>delete we can just re-run the DIH. The reason I am thinking about backup
>is
>just in case records are deleted accidentally and the DIH cannot be run
>because the database is unavailable.
>
>Our collection is simple: 2 nodes - 1 collection - 2 shards with 2
>replicas
>each
>
>So a simple copy (cp command) for both the nodes/shards might work for us?
>How do I restore the data back?
>
>
>
>On Tue, Nov 17, 2015 at 4:56 PM, Jeff Wartes 
>wrote:
>
>>
>> https://github.com/whitepages/solrcloud_manager supports 5.x, and I
>>added
>> some backup/restore functionality similar to SOLR-5750 in the last
>> release.
>> Like SOLR-5750, this backup strategy requires a shared filesystem, but
>> note that unlike SOLR-5750, I haven’t yet added any backup functionality
>> for the contents of ZK. I’m currently working on some parts of that.
>>
>>
>> Making a copy of a collection is supported too, with some caveats.
>>
>>
>> On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:
>>
>> >Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>> >
>> >
>> >
>> >On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>> >
>> >> afaik Data import handler does not offer backups. You can try using
>>the
>> >> replication handler to backup data as you wish to any custom end
>>point.
>> >>
>> >> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>> >>This
>> >> helps backup solr indices across clusters.
>> >>
>> >> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi 
>> wrote:
>> >>
>> >> > I am using

Method to fix issue when you get KeeperErrorCode = NoAuth when Zookeeper ACL enabled

2015-12-02 Thread Jeff Wu
We have being following this wiki to enable ZooKeeper ACL control
https://cwiki.apache.org/confluence/display/solr/ZooKeeper+Access+Control#ZooKeeperAccessControl-AboutZooKeeperACLs

It works fine for Solr service itself, but when you try to
use scripts/cloud-scripts/zkcli.sh to put a zNode, it throws such exception:

   Exception in thread "main"
org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode =
NoAuth for /security.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1270)
at
org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:362)
at
org.apache.solr.common.cloud.SolrZkClient$8.execute(SolrZkClient.java:359)
at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:359)

To fix the problem, the wiki needs to be updated (however, I cannot post a
comment on that wiki):

SOLR_ZK_CREDS_AND_ACLS="-DzkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider
-DzkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider
-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password
-DzkDigestReadonlyUsername=readonly-user
-DzkDigestReadonlyPassword=readonly-password"


I think the reason is zkcli.sh is a standard JVM process and does not read
settings from solr.xml, so we must explicitly provide these parameters in
order to make ZK ACL work.
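
In other words, an invocation like this (ZK address and credentials here
are placeholders, untested):

export SOLR_ZK_CREDS_AND_ACLS="-DzkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider \
-DzkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider \
-DzkDigestUsername=admin-user -DzkDigestPassword=admin-password"
scripts/cloud-scripts/zkcli.sh -zkhost zk1.example.com:2181 -cmd putfile /security.json security.json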

Could someone help notify the wiki editor to update this? Right now the
wiki leads people to a dead end when using zkcli.sh to put content into a
ZK ensemble with ACLs enabled.


Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Jeff Wartes
I’ve never used the managed schema, so I’m probably biased, but I’ve never
seen much of a point to the Schema API.

I need to make changes sometimes to solrconfig.xml, in addition to
schema.xml and other config files, and there’s no API for those, so my
process has been like:

1. Put the entire config directory used by a collection in source control
somewhere. solrconfig.xml, schema.xml, synonyms.txt, everything.
2. Make changes, test, commit
3. “Release” by uploading the whole config dir at a specific commit to ZK
(overwriting any existing files) and issuing a collections API “reload”.


This has the downside that I can upload a broken config and take down my
collection, but with the whole config dir in source control,
I can also easily roll back to any point by uploading an old commit.
You still have to be aware of how the changes you’re making will affect
your current index, but that’s unavoidable.
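
Step 3 boils down to something like this (ZK address and names are
placeholders):

scripts/cloud-scripts/zkcli.sh -zkhost zk1.example.com:2181 -cmd upconfig -confdir ./conf -confname myconfig
curl 'http://solr1.example.com:8983/solr/admin/collections?action=RELOAD&name=collection1'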


On 12/3/15, 7:09 AM, "Kelly, Frank"  wrote:

>Just wondering if folks have any suggestions on using Schema.xml vs.
>Managed Schema going forward.
>
>Our deployment will be
>> 3 Zk, 3 Shards, 3 replicas
>> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>> Planning at least 1 Billion objects indexed (currently < 100 million)
>
>I'm sure our schema.xml will have changes and fixes and just wondering
>which approach (schema.xml vs. managed)
>will be easier to deploy / maintain?
>
>Cheers!
>
>-Frank
>
>
>Frank Kelly
>Principal Software Engineer
>Predictive Analytics Team (SCBE/HAC/CDA)
>
>
>
>
>
>
>
>



Re: How to list all collections in solr-4.7.2

2015-12-03 Thread Jeff Wartes
Looks like LIST was added in 4.8, so I guess you’re stuck looking at ZK,
or finding some tool that looks in ZK for you.

The zkCli.sh that ships with zookeeper would probably suffice for a
one-off manual inspection:
https://zookeeper.apache.org/doc/trunk/zookeeperStarted.html#sc_ConnectingToZooKeeper
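
A one-off check could look like this (ZK address is a placeholder):

bin/zkCli.sh -server zk1.example.com:2181
ls /collections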



On 12/3/15, 12:05 PM, "Pushkar Raste"  wrote:

>Will 'wget http://host:port/solr/admin/collections?action=LIST' help?
>
>On 3 December 2015 at 12:12, rashi gandhi  wrote:
>
>> Hi all,
>>
>> I have setup two solr-4.7.2 server instances on two diff machines with 3
>> zookeeper severs in solrcloud mode.
>>
>> Now, I want to retrieve list of all the collections that I have created
>>in
>> solrcloud mode.
>>
>> I tried the LIST command of the collections API, but it's not working with
>> solr-4.7.2.
>> Error: unknown command LIST
>>
>> Please suggest me the command, that I can use.
>>
>> Thanks.
>>



Re: Solrcloud: 1 server, 1 configset, multiple collections, multiple schemas

2015-12-04 Thread Jeff Wartes

If you want two different collections to have two different schemas, those
collections need to reference two different configsets.
So you need another copy of your config available using a different name,
and to reference that other name when you create the second collection.
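
Roughly (ZK address and configset name are placeholders):

scripts/cloud-scripts/zkcli.sh -zkhost 127.0.0.1:2181 -cmd upconfig -confdir /home/me/configSet2 -confname configset2
http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=1&collection.configName=configset2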


On 12/4/15, 6:26 AM, "bengates"  wrote:

>Hello,
>
>I'm having usage issues with *Solrcloud*.
>
>What I want to do:
>- Manage a solr server *only with the API* (create / reload / delete
>collections, create / replace / delete fields, etc).
>- A new collection should* start with pre-defined default fields,
>fieldTypes
>and copyFields* (let's say, field1 and field2 for fields).
>- Each collection must *have its own schema*.
>
>What I've setup yet:
>- Installed a *Solr 5.3.1* in //opt/solr/ on an Ubuntu 14.04 server
>- Installed *Zookeeper 3.4.6* in //opt/zookeeper/ as described in the solr
>wiki
>- Added line "server.1=127.0.0.1:2888:3888" in
>//opt/zookeeper/conf/zoo.cfg/
>- Added line "127.0.0.1:2181" in
>//var/solr/data/solr.xml/
>- Told solr or zookeeper somewhere (don't remember where I setup this) to
>use //home/me/configSet/managed-schema.xml/ and
>//home/me/configSet/solrconfig.xml/ for configSet
>- Run solr on port 8983
>
>My //home/me/configSet/managed-schema.xml/ contains *field1* and *field2*.
>
>Now let's create a collection:
>http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1
>- *collection1 *is created, with *field1 *and *field2*. Perfect.
>
>Let's create another collection:
>http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=collection2&numShards=1
>- *collection2 *is created, with *field1 *and *field2*. Perfect.
>
>Now, if I *add some fields* on *collection1 *by POSTing to:
>/http://my.remote.addr:8983/solr/collection1/schema/ the following:
>
>
>- *field3 *and *field4 *are successfully added to *collection1*
>- ... but they are *also added* to *collection2* (verified by GETting
>/http://my.remote.addr:8983/solr/collection2/schema/fields/)
>
>How to prevent this behavior, since my collections have *different kind of
>datas*, and may have the same field names but not the same types?
>
>Thanks,
>Ben
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Solrcloud-1-server-1-configset-multiple-collections-multiple-schemas-tp4243584.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Re: Fully automated replica creation in AWS

2015-12-09 Thread Jeff Wartes

It’s a pretty common misperception that since solr scales, you can just
spin up new nodes and be done. Amazon ElasticSearch and older solrcloud
getting-started docs encourage this misperception, as does the HDFS-only
autoAddReplicas flag.

I agree that auto-scaling should be approached carefully, and
per-collection, but the question also comes up a lot, so the lack of
available/blessed solr tools hurts, in my opinion.
https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement
helps, but you still need to say when and how many.

Anyway, the last time this came up
(https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201510.mbox/%3Ccae+cwktalxicdfc2zlfxvxregvnt-56yyqct3u-onhqchxa...@mail.gmail.com%3E) I
suggested this as a place to get started - it’s a tool that knows where to
look: https://github.com/whitepages/solrcloud_manager

Using this, scaling up a collection is pretty easy. Add some nodes, then:
// Add replicas as “cluster space” allows. (assumes you’re not using
built-in rule-based placement)
// Respects current maxShardsPerNode for the collection, prevents
adding replicas that already exist on a node,
// adds replicas for shards with a lower replication factor first.
java -jar solrcloud_manager-assembly-1.4.0.jar fill -z
zk0.example.com:2181/myapp -c collection1

Scaling down is also a one-liner. Turn off some nodes and run:
// removes any replicas in the collection that are not marked “active”
from the cluster state
java -jar solrcloud_manager-assembly-1.4.0.jar cleancollection -z
zk0.example.com:2181/myapp -c collection1
BUT, you still need to take care not to take down all of the nodes with a
given shard at once. This can be tricky to figure out if your collection
has shards spread across many nodes.


Another downscaling option would be to proactively delete the replicas off
of a node before turning it off:
// deletes all replicas for the collection on the node(s),
// but refuses to delete a replica if that’d bring the
replicationFactor for that shard below 2
java -jar solrcloud_manager-assembly-1.4.0.jar clean -z
zk0.example.com:2181/myapp -c collection1 --nodes abc.example.com
--safetyFactor 2



On 12/9/15, 12:09 PM, "Erick Erickson"  wrote:

>bq: As a side note, we do this for our
>customers as that's baked into our cloud provisioning software,
>
>Exactly, nothing OOB is there, but all the data is available, you
>"just" have to write a tool that knows where to look ;) That said,
>this would certainly be something that would have to be optional
>and, IMO, not on by default. It'd need some design effort, as I'd
>guess it'd be on a per-collection basis thus turned on in solrconfig.xml.
>Hmmm, or perhaps this would be some kind of list of nodes
>in Zookeeper. Or..
>
>So when a new Solr node came up for the first time, it'd need
>to find out which collections were configured with this option
>either through enumerating all the collections and checking
>their states or looking in their solrconfig files or enumerating
>a list of children in Zookeeper, or... Plus you'd need some
>kind of safeguards in place to handle bringing up, say, 10 new
>Solr instances one at a time so all the new replicas didn't
>end up on the first node you added. And so on
>
>The more I think about all that the less I like it; it seems that
>custom utilities on a per-collection basis make more sense.
>
>And yeah, autoAddReplicas is exactly for that on HDFS. Using
>autoAddReplicas for non shared filesystems doesn't really
>make sense IMO. The supposition is that the Solr node has
>gone away. How would some _other_ Solr instance on some _other_
>node know where to look for the index?
>
>Best,
>Erick
>
>On Wed, Dec 9, 2015 at 11:37 AM, Sameer Maggon
> wrote:
>> Erick,
>>
>> Typically, while creating collections, a replicationFactor is specified.
>> Thus, the meta data about the collection does have information about
>>what
>> the "desired" replicationFactor is for the collection. If that's the
>>case,
>> when a Solr node joins the cluster, there could be a pro-active
>>add-replica
>> operation that can be initiated if the Solr detects that the current
>> replicas are less than the desired replicationFactor and pull the
>> collection data from the leader.
>>
>> Isn't that what the attribute "autoAddReplicas" does for HDFS - can
>>this be
>> done for non-shared filesystem? As a side note, we do this for our
>> customers as that's baked into our cloud provisioning software, but it
>> would be nice if Solr supports that OOTB. Are there any underlying
>>flaws of
>> doing that?
>>
>> Thanks,
>> --
>>
>> *Sameer Maggon*
>> www.measuredsearch.com
>> 
>> Deploy, Scale & Manage Solr in the cloud of your choice.
>>
>>
>> On Wed, Dec 9, 2015 at 11:19 AM, Erick Erickson
>>
>> wrote:
>>
>>> Not that I know of. The two systems

Re: Moving to SolrCloud, specifying dataDir correctly

2015-12-14 Thread Jeff Wartes

Don’t set solr.data.dir. Instead, set the install dir. Something like:
-Dsolr.solr.home=/data/solr
-Dsolr.install.dir=/opt/solr

I have many solrcloud collections, and separate data/install dirs, and
I’ve never had to do anything with manual per-collection or per-replica
data dirs.

That said, it’s been a while since I set this up, and I may not remember
all the pieces. 
You might need something like this too, for example:

-Djetty.home=/opt/solr/server
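
If you’re using the bin/solr scripts, the same thing can be expressed as
(flags as in the 5.x scripts, untested):

bin/solr start -c -z zk1.example.com:2181 -s /data/solr

where -s sets the solr home (or set SOLR_HOME in solr.in.sh).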


On 12/14/15, 3:11 PM, "Erick Erickson"  wrote:

>Currently, it'll be a little tedious but here's what you can do (going
>partly from memory)...
>
>When you create the collection, specify the special value EMPTY for
>createNodeSet (Solr 5.3+).
>Use ADDREPLICA to add each individual replica. When you do this, you
>can add a dataDir for
>each individual replica and thus keep them separate, i.e. for a
>particular box the first
>replica would get /data/solr/collection1_shard1_replica1, the second
>/data/solr/collection1_shard2_replica1 and so forth.
>
>If you don't have Solr 5.3+, you can still do the same thing, except
>you create your collection letting
>the replicas fall where they will. Then do the ADDREPLICA as above.
>When that's all done,
>DELETEREPLICA for the original replicas.
>
>Best,
>Erick
>
>On Mon, Dec 14, 2015 at 2:21 PM, Tom Evans 
>wrote:
>> On Mon, Dec 14, 2015 at 1:22 PM, Shawn Heisey 
>>wrote:
>>> On 12/14/2015 10:49 AM, Tom Evans wrote:
 When I tried this in SolrCloud mode, specifying
 "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine
 for the first collection, but then the second collection tried to use
 the same directory to store its index, which obviously failed. I fixed
 this by changing solrconfig.xml in each collection to specify a
 specific directory, like so:

   <dataDir>${solr.data.dir:}products</dataDir>

 Looking back after the weekend, I'm not a big fan of this. Is there a
 way to add a core.properties to ZK, or a way to specify
 core.baseDatadir on the command line, or just a better way of handling
 this that I'm not aware of?
>>>
>>> Since you're running SolrCloud, just let Solr handle the dataDir, don't
>>> try to override it.  It will default to "data" relative to the
>>> instanceDir.  Each instanceDir is likely to be in the solr home.
>>>
>>> With SolrCloud, your cores will not contain a "conf" directory (unless
>>> you create it manually), therefore the on-disk locations will be *only*
>>> data, there's not really any need to have separate locations for
>>> instanceDir and dataDir.  All active configuration information for
>>> SolrCloud is in zookeeper.
>>>
>>
>> That makes sense, but I guess I was asking the wrong question :)
>>
>> We have our SSDs mounted on /data/solr, which is where our indexes
>> should go, but our solr install is on /opt/solr, with the default solr
>> home in /opt/solr/server/solr. How do we change where the indexes get
>> put so they end up on the fast storage?
>>
>> Cheers
>>
>> Tom



state.json being downloaded every 10 seconds

2016-05-16 Thread Jeff Wartes

I have a solr 5.4 cluster with three collections, A, B, C.
Nodes either host replicas for collection A, or B and C. Collections B and C 
are not currently used - no inserts or queries. Collection A is getting 
significant query traffic, but no insert traffic, and queries are only directed 
to nodes hosting replicas for collection A. ZK timeout is set to 15 seconds.

I’ve noticed via tcpdump that, every 10 seconds exactly, several of the nodes 
(but not all) hosting collection A re-download the state.json for collections B 
and C. This behavior survives JVM restart.

This isn’t a huge deal, the extra traffic isn’t very meaningful, but it’s odd 
and smells like a bug somewhere. Anyone seen something like this?




Re: state.json being downloaded every 10 seconds

2016-05-16 Thread Jeff Wartes

Ah, I tracked this down to an haproxy that was set up on a load server during 
development and still running. It was configured with a health check every 10 
seconds, so that’s pretty clearly the cause. Thanks for the pointer.

One thing that still feels a bit odd though is that the health check query was 
referencing a collection that no longer existed in the cluster. So it seems 
like it was downloading the state for ALL non-hosted collections, not just
the requested one.

This touches a bit on a sore point with me. I dislike that those 
collection-not-here proxy requests aren’t logged on the server doing the proxy, 
because you end up with traffic visible at the http interface but not the solr 
level. Honestly, I dislike that transparent proxy approach in general, because 
it means I lose the ability to dedicate entire nodes to the fan-out and 
shard-aggregation process like I could pre-solrcloud.




On 5/16/16, 8:56 PM, "Erick Erickson"  wrote:

>With the per-collection state.json, if "something" goes to a node that doesn't
>host a replica for a collection, it downloads the state for the "other"
>collection then
>throws it away.
>
>In this case, "something" is apparently asking the nodes hosting collectionA to
>do "something" with collections B and/or C. Some support for this would
>be if further investigation shows that the nodes that _do_ re-download the
>info did _not_ have replicas B and C.
>
>What the "something" is that sends requests I'm not quite sure, but
>that's a place
>to start.
>
>Best,
>Erick
>
>On Mon, May 16, 2016 at 11:08 AM, Jeff Wartes  wrote:
>>
>> I have a solr 5.4 cluster with three collections, A, B, C.
>> Nodes either host replicas for collection A, or B and C. Collections B and C 
>> are not currently used - no inserts or queries. Collection A is getting 
>> significant query traffic, but no insert traffic, and queries are only 
>> directed to nodes hosting replicas for collection A. ZK timeout is set to 15 
>> seconds.
>>
>> I’ve noticed via tcpdump that, every 10 seconds exactly, several of the 
>> nodes (but not all) hosting collection A re-download the state.json for 
>> collections B and C. This behavior survives JVM restart.
>>
>> This isn’t a huge deal, the extra traffic isn’t very meaningful, but it’s 
>> odd and smells like a bug somewhere. Anyone seen something like this?
>>
>>


Re: SolrCloud replicas consistently out of sync

2016-05-19 Thread Jeff Wartes
That case related to consistency after a ZK outage or network connectivity 
issue. Your case is standard operation, so I’m not sure that’s really the same 
thing. I’m aware of a few issues that can happen if ZK connectivity goes wonky, 
that I hope are fixed in SOLR-8697.

This one might be a closer match to your problem though: 
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3CCAOWq+=iePCJjnQiSqxgDVEPv42Pi7RUtw0X0=9f67mpcm99...@mail.gmail.com%3E




On 5/19/16, 9:10 AM, "Aleksey Mezhva"  wrote:

>Bump.
>
>this thread is with someone having a similar issue:
>
>https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3c09fdab82-7600-49e0-b639-9cb9db937...@yahoo.com%3E
>
>It seems like this is not really fixed in 5.4/6.0?
>
>
>Aleksey
>
>From: Steve Weiss 
>Date: Tuesday, May 17, 2016 at 7:25 PM
>To: "solr-user@lucene.apache.org" 
>Cc: Aleksey Mezhva , Hans Zhou 
>Subject: Re: SolrCloud replicas consistently out of sync
>
>Gotcha - well that's nice.  Still, we seem to be permanently out of sync.
>
>I see this thread with someone having a similar issue:
>
>https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3c09fdab82-7600-49e0-b639-9cb9db937...@yahoo.com%3E
>
>It seems like this is not really fixed in 5.4/6.0?  Is there any version of 
>SolrCloud where this wasn't yet a problem that we could downgrade to?
>
>--
>Steve
>
>On Tue, May 17, 2016 at 6:23 PM, Markus Jelsma 
>mailto:markus.jel...@openindex.io>> wrote:
>Hi, thats a known issue and unrelated:
>https://issues.apache.org/jira/browse/SOLR-9120
>
>M.
>
>
>-Original message-
>> From:Stephen Weiss mailto:steve.we...@wgsn.com>>
>> Sent: Tuesday 17th May 2016 23:10
>> To: solr-user@lucene.apache.org; Aleksey 
>> Mezhva mailto:aleksey.mez...@wgsn.com>>; Hans Zhou 
>> mailto:hans.z...@wgsn.com>>
>> Subject: Re: SolrCloud replicas consistently out of sync
>>
>> I should add - looking back through the logs, we're seeing frequent errors 
>> like this now:
>>
>> 78819692 WARN  (qtp110456297-1145) [   ] o.a.s.h.a.LukeRequestHandler Error 
>> getting file length for [segments_4o]
>> java.nio.file.NoSuchFileException: 
>> /var/solr/data/instock_shard5_replica1/data/index.20160516230059221/segments_4o
>>
>> --
>> Steve
>>
>>
>> On Tue, May 17, 2016 at 5:07 PM, Stephen Weiss 
>> mailto:steve.we...@wgsn.com>>>
>>  wrote:
>> OK, so we did as you suggest, read through that article, and we reconfigured 
>> the autocommit to:
>>
>> <autoCommit>
>>   <maxTime>${solr.autoCommit.maxTime:3}</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>>
>> <autoSoftCommit>
>>   <maxTime>${solr.autoSoftCommit.maxTime:60}</maxTime>
>> </autoSoftCommit>
>>
>> However, we see no change, aside from the fact that it's clearly committing 
>> more frequently.  I will say on our end, we clearly misunderstood the 
>> difference between soft and hard commit, but even now having it configured 
>> this way, we are still totally out of sync, long after all indexing has 
>> completed (it's been about 30 minutes now).  We manually pushed through a 
>> commit on the whole collection as suggested, however, all we get back for 
>> that is o.a.s.u.DirectUpdateHandler2 No uncommitted changes. Skipping 
>> IW.commit., which makes sense, because it was all committed already anyway.
>>
>> We still currently have all shards mismatched:
>>
>> instock_shard1   replica 1: 30788491 replica 2: 30778865
>> instock_shard10   replica 1: 30973059 replica 2: 30971874
>> instock_shard11   replica 2: 31036815 replica 1: 31034715
>> instock_shard12   replica 2: 30177084 replica 1: 30170511
>> instock_shard13   replica 2: 30608225 replica 1: 30603923
>> instock_shard14   replica 2: 30755739 replica 1: 30753191
>> instock_shard15   replica 2: 30891713 replica 1: 30891528
>> instock_shard16   replica 1: 30818567 replica 2: 30817152
>> instock_shard17   replica 1: 30423877 replica 2: 30422742
>> instock_shard18   replica 2: 30874979 replica 1: 30872223
>> instock_shard19   replica 2: 30917208 replica 1: 3090
>> instock_shard2   replica 1: 31062339 replica 2: 31060575
>> instock_shard20   replica 1: 30192046 replica 2: 30190893
>> instock_shard21   replica 2: 30793817 replica 1: 30791135
>> instock_shard22   replica 2: 30821521 replica 1: 30818836
>> instock_shard23   replica 2: 30553773 replica 1: 30547336
>> instock_shard24   replica 1: 30975564 replica 2: 30971170
>> instock_shard25   replica 1: 30734696 replica 2: 30731682
>> instock_shard26   replica 1: 31465696 replica 2: 31464738
>> instock_shard27   replica 1: 30844884 replica 2: 30842445
>> instock_shard28   replica 2: 30549826 replica 1: 30547405
>> instock_shard29   replica 2: 3063 replica 1: 30634091
>> instock_shard3   replica 1: 30930723 replica 2: 30926483
>> instock_shard30   replica 2: 30904528 replica 1: 30902649
>> instock_shard31   replica 2: 31175813 replica 1: 31174921
>> instock_shard32   replica 2: 30932837 replica 1: 30926456
>> instock_shard4   replica 2: 30758100 replica 1: 30754129
>> instoc

Re: How to stop searches to solr while full data import is going in SOLR

2016-05-23 Thread Jeff Wartes
The PingRequestHandler contains support for a file check, which allows you to 
control whether the ping request succeeds based on the presence/absence of a 
file on disk on the node.

http://lucene.apache.org/solr/6_0_0/solr-core/org/apache/solr/handler/PingRequestHandler.html

I suppose you could try using this to configure a load balancer.
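
A sketch of what that looks like in solrconfig.xml (the file name is
arbitrary):

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>

With that in place, /admin/ping fails until the file exists, and you can
flip it remotely with /admin/ping?action=enable or action=disable, so a
load balancer health check against /admin/ping would drop the node while
the import runs.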


On 5/20/16, 3:21 PM, "Erick Erickson"  wrote:

>There really isn't any good way to do this built in that I know of.
>
>If your "clusters" are separate Solr collections (SolrCloud), you
>can use collection aliasing to point queries at one or the other
>atomically.
>
>This presumes you have some control over when DIH runs
>however. The idea is that you have an alias that you use for
>searching. You switch this alias to the "cold" cluster (the one
>that's not changing) and trigger a DIH run to to the "hot" cluster.
>Once that's done, change the alias to point to it or, perhaps,
>both.
>
>Best,
>Erick
>
>On Wed, May 18, 2016 at 11:27 PM, preeti kumari  wrote:
>> Hi,
>>
>> I am using solr 5.2.1. I have two clusters Primary A and Primary B.
>> I was pinging servers to check whether they are up or not to route the
>> searches to working cluster A or B.
>>
>> But while I am running a full data import in Primary cluster A, not all
>> the data is there yet, and pinging servers will not help since my solr
>> servers would still be responding.
>>
>> But I want my searches to go to Cluster B instead of A.
>>
>> Please help me with a way from solr which can say solr not ready to support
>> searches as full data import is running there.
>>
>> Thanks
>> Preeti



Re: SolrCloud increase replication factor

2016-05-23 Thread Jeff Wartes

https://github.com/whitepages/solrcloud_manager was designed to provide some 
easier operations for common kinds of cluster operation. 
It hasn’t been tested with 6.0 though, so if you try it, please let me know 
your experience.


On 5/23/16, 6:28 AM, "Tom Evans"  wrote:

>On Mon, May 23, 2016 at 10:37 AM, Hendrik Haddorp
> wrote:
>> Hi,
>>
>> I have a SolrCloud 6.0 setup and created my collection with a
>> replication factor of 1. Now I want to increase the replication factor
>> but would like the replicas for the same shard to be on different nodes,
>> so that my collection does not fail when one node fails. I tried two
>> approaches so far:
>>
>> 1) When I use the collections API with the MODIFYCOLLECTION action [1] I
>> can set the replication factor but that did not result in the creation
>> of additional replicas. The Solr Admin UI showed that my replication
>> factor changed but otherwise nothing happened. A reload of the
>> collection did also result in no change.
>>
>> 2) Using the ADDREPLICA action [2] from the collections API I have to
>> add the replicas to the shard individually, which is a bit more
>> complicated but otherwise worked. During testing this did however at
>> least once result in the replica being created on the same node. My
>> collection was split in 4 shards and for 2 of them all replicas ended up
>> on the same node.
>>
>> So is the only option to create the replicas manually and also pick the
>> nodes manually or is the perceived behavior wrong?
>>
>> regards,
>> Hendrik
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-modifycoll
>> [2]
>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica
>
>
>With ADDREPLICA, you can specify the node to create the replica on. If
>you are using a script to increase/remove replicas, you can simply
>incorporate the logic you desire in to your script - you can also use
>CLUSTERSTATUS to get a list of nodes/collections/shards etc in order
>to inform the logic in the script. This is the approach we took, we
>have a fabric script to add/remove extra nodes to/from the cluster, it
>works well.
>
>The alternative is to put the logic in to Solr itself, using what Solr
>calls a "snitch" to define the rules on where replicas are created.
>The snitch is specified at collection creation time, or you can use
>MODIFYCOLLECTION to set it after the fact. See this wiki patch for
>details:
>
>https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement
>
>Cheers
>
>Tom



Re: Solr cloud with Grouping query gives inconsistent results

2016-05-23 Thread Jeff Wartes
My first thought is that you haven’t indexed such that all values of the field 
you’re grouping on are found in the same cores.

See the end of the article here: (Distributed Result Grouping Caveats)
https://cwiki.apache.org/confluence/display/solr/Result+Grouping

And the “Document Routing” section here:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud

If I’m right, you haven’t used the “amid” field as part of your doc routing 
policy.
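
With the default compositeId router, that would mean prefixing each
document id with its amid value, something like (the id format below is
just an illustration):

"id": "AMID123!doc456"   (the part before the ! is hashed to pick the shard)

That way all docs sharing an amid land on the same shard and can be
grouped correctly.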



On 5/23/16, 3:57 AM, "preeti kumari"  wrote:

>Hi All,
>
>I am using grouping query with solr cloud version 5.2.1 .
>Parameters added in my query is
>&q=SIM*&group=true&group.field=amid&group.limit=1&group.main=true. But each
>time I hit the query i get different results i.e top 10 results are
>different each time.
>
>Why is it so ? Please help me with this.
>Is there any way by which I can get consistent results from grouping query
>in solr cloud.
>
>Thanks
>Preeti



Re: What if adding 3rd node exceeds replication Factor? [scottchu]

2016-05-25 Thread Jeff Wartes
SolrCloud never creates replicas automatically, unless perhaps you’re using the 
HDFS-only autoAddReplicas option. Start the new node using the same ZK, and 
then use the Collections API 
(https://cwiki.apache.org/confluence/display/solr/Collections+API) to 
ADDREPLICA.
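
For example, after starting the new node (host names are placeholders):

http://node3.example.com:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1&node=node3.example.com:8983_solr

repeated for each shard you want a replica of on that node.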

The replicationFactor you specified at collection creation is pretty much 
ignored after creation.


On 5/25/16, 8:53 AM, "Scott Chu"  wrote:

>I start 2 nodes and a zk ensemble to manager these nodes. Then I create a 
>collection with numShards=1 and replicationFactor=2 on 1st node. It spread 
>onto 2 nodes (meaning 1 leader, 1 replica). Now I want to add 3rd node but 
>don't do splitsharding. Before I try I want to ask some questions: Do I just 
>start node3 and join in same zk ensemble, Solrcloud will automatically create 
>replica on 3rd node? Or do I have to manually call some API to add replica to 
>3rd node? Either way, doesn't this exceed the replicationFactor?
>
>Scott Chu,scott@udngroup.com
>2016/5/25 (週三)



Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser

2016-05-26 Thread Jeff Wartes
Oh, interesting. I’ve certainly encountered issues with multi-word synonyms, 
but I hadn’t come across this. If you end up using it with a recent solr 
verison, I’d be glad to hear your experience.

I haven’t used it, but I am aware of one other project in this vein that you 
might be interested in looking at: 
https://github.com/LucidWorks/auto-phrase-tokenfilter


On 5/26/16, 9:29 AM, "John Bickerstaff"  wrote:

>Ahh - for question #3 I may have spoken too soon.  This line from the
>github repository readme suggests a way.
>
>Update: We have tested to run with the jar in $SOLR_HOME/lib as well, and
>it works (Jetty).
>
>I'll try that and only respond back if that doesn't work.
>
>Questions 1 and 2 still stand of course...  If anyone on the list has
>experience in this area...
>
>Thanks.
>
>On Thu, May 26, 2016 at 10:25 AM, John Bickerstaff > wrote:
>
>> Hi all,
>>
>> I'm creating a Solr Cloud that will index and search medical text.
>> Multi-word synonyms are a pretty important factor.
>>
>> I find that there are some challenges around multi-word synonyms and I
>> also found on the wiki that there is a recommended 3rd-party parser
>> (synonym_edismax parser) created by Nolan Lawson and found here:
>> https://github.com/healthonnet/hon-lucene-synonyms
>>
>> Here's the thing - the instructions on the github site involve bringing
>> the jar file into the war file - which is not applicable any more... at
>> least I think it's not...
>>
>> I have three questions:
>>
>> 1. Is this still a good solution for multi-word synonyms (I.e. Solr Cloud
>> doesn't break it in some way)
>> 2. Is there a tool or plug-in out there that the contributors would
>> recommend above this one?
>> 3. Assuming 1 = yes and 2 = no, can anyone tell me an updated procedure
>> for bringing it in to Solr Cloud (I'm running 5.4.x)
>>
>> Thanks
>>



Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser

2016-05-31 Thread Jeff Wartes
yms
>> can
>> > > > never
>> > > > >> match and add alternatives.  See <
>> > > > >> https://issues.apache.org/jira/browse/LUCENE-2605>, where I’ve
>> > > posted a
>> > > > >> patch to directly address that problem - note that it’s still a
>> work
>> > > in
>> > > > >> progress.
>> > > > >>
>> > > > >> Once LUCENE-2605 has been fixed, there is still work to do getting
>> > > > >> (e)dismax to work with the modified Lucene QueryParser, and
>> > addressing
>> > > > >> problems with how queries are constructed from Lucene’s
>> “sausagized”
>> > > > token
>> > > > >> stream.
>> > > > >>
>> > > > >> --
>> > > > >> Steve
>> > > > >> www.lucidworks.com
>> > > > >>
>> > > > >> > On May 26, 2016, at 2:21 PM, John Bickerstaff <
>> > > > j...@johnbickerstaff.com>
>> > > > >> wrote:
>> > > > >> >
>> > > > >> > Thanks Chris --
>> > > > >> >
>> > > > >> > The two projects I'm aware of are:
>> > > > >> >
>> > > > >> > https://github.com/healthonnet/hon-lucene-synonyms
>> > > > >> >
>> > > > >> > and the one referenced from the Lucidworks page here:
>> > > > >> >
>> > > > >>
>> > > >
>> > >
>> >
>> https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>> > > > >> >
>> > > > >> > ... which is here :
>> > > > >> https://github.com/LucidWorks/auto-phrase-tokenfilter
>> > > > >> >
>> > > > >> > Is there anything else out there that you would recommend I look
>> > at?
>> > > > >> >
>> > > > >> > On Thu, May 26, 2016 at 12:01 PM, Chris Morley <
>> > ch...@depahelix.com
>> > > >
>> > > > >> wrote:
>> > > > >> >
>> > > > >> >> Chris Morley here, from Wayfair.  (Depahelix = my domain)
>> > > > >> >>
>> > > > >> >> Suyash Sonawane and I have worked on multiple word synonyms at
>> > > > Wayfair.
>> > > > >> >> We worked mostly off of Ted Sullivan's work and also off of
>> some
>> > > > >> >> suggestions from Koorosh Vakhshoori.  We have gotten to a point
>> > > where
>> > > > >> we
>> > > > >> >> have a more sophisticated internal implementation, however,
>> we've
>> > > > found
>> > > > >> >> that it is very difficult to make it do what you want it to do,
>> > and
>> > > > >> also be
>> > > > >> >> sufficiently performant.  Watch out for exceptional situations
>> > with
>> > > > mm
>> > > > >> >> (minimum should match).
>> > > > >> >>
>> > > > >> >> Trey Grainger (now at Lucidworks) and Simon Hughes of Dice.com
>> > have
>> > > > >> also
>> > > > >> >> done work in this area.
>> > > > >> >>
>> > > > >> >> It should be very possible to get this kind of thing working on
>> > > > >> >> SolrCloud.  I haven't tried it yet but I think theoretically,
>> it
>> > > > should
>> > > > >> >> just work.  The synonyms stuff is mostly about doing things at
>> > > index
>> > > > >> time
>> > > > >> >> and query time.  The index time stuff should translate to
>> > SolrCloud
>> > > > >> >> directly, while the query time stuff might pose some issues,
>> but
>> > > > >> probably
>> > > > >> >> not too bad, if there are any issues at all.
>> > > > >> >>
>> > > > >> >> I've had decent luck porting our various plugins from 4.10.x to
>> > > 5.5.0
>> > > > >> >> because a lot of stuff is just Java, and it still works within
>> > the
>> > > > >> Jetty
>> > > > >> >> context.
>> > > > >> >>
>> > > > >> >> -Chris.
>> > > > >> >>
>> > > > >> >>
>> > > > >> >>
>> > > > >> >>
>> > > > >> >> 
>> > > > >> >> From: "John Bickerstaff" 
>> > > > >> >> Sent: Thursday, May 26, 2016 1:51 PM
>> > > > >> >> To: solr-user@lucene.apache.org
>> > > > >> >> Subject: Re: Solr Cloud and Multi-word Synonyms ::
>> > synonym_edismax
>> > > > >> parser
>> > > > >> >> Hey Jeff (or anyone interested in multi-word synonyms) here are
>> > > some
>> > > > >> >> potentially interesting links...
>> > > > >> >>
>> > > > >> >> http://wiki.apache.org/solr/QueryParser (search the page for
>> > > > >> >> synonum_edismax)
>> > > > >> >>
>> > > > >> >>
>> > > https://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
>> > > > >> (blog
>> > > > >> >> post about what became the synonym_edissmax Query Parser)
>> > > > >> >>
>> > > > >> >>
>> > > > >> >>
>> > > > >>
>> > > >
>> > >
>> >
>> https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>> > > > >> >>
>> > > > >> >> This last was useful for lots of reasons and contains links to
>> > > other
>> > > > >> >> interesting, related web pages...
>> > > > >> >>
>> > > > >> >> On Thu, May 26, 2016 at 11:45 AM, Jeff Wartes <
>> > > > jwar...@whitepages.com>
>> > > > >> >> wrote:
>> > > > >> >>
>> > > > >> >>> Oh, interesting. I've certainly encountered issues with
>> > multi-word
>> > > > >> >>> synonyms, but I hadn't come across this. If you end up using
>> it
>> > > > with a
>> > > > >> >>> recent solr verison, I'd be glad to hear your experience.
>> > > > >> >>>
>> > > > >> >>> I haven't used it, but I am aware of one other project in this
>> > > vein
>> > > > >> that
>> > > > >> >>> you might be interested in looking at:
>> > > > >> >>> https://github.com/LucidWorks/auto-phrase-tokenfilter
>> > > > >> >>>
>> > > > >> >>>
>> > > > >> >>> On 5/26/16, 9:29 AM, "John Bickerstaff" <
>> > j...@johnbickerstaff.com
>> > > >
>> > > > >> >> wrote:
>> > > > >> >>>
>> > > > >> >>>> Ahh - for question #3 I may have spoken too soon. This line
>> > from
>> > > > the
>> > > > >> >>>> github repository readme suggests a way.
>> > > > >> >>>>
>> > > > >> >>>> Update: We have tested to run with the jar in $SOLR_HOME/lib
>> as
>> > > > well,
>> > > > >> >> and
>> > > > >> >>>> it works (Jetty).
>> > > > >> >>>>
>> > > > >> >>>> I'll try that and only respond back if that doesn't work.
>> > > > >> >>>>
>> > > > >> >>>> Questions 1 and 2 still stand of course... If anyone on the
>> > list
>> > > > has
>> > > > >> >>>> experience in this area...
>> > > > >> >>>>
>> > > > >> >>>> Thanks.
>> > > > >> >>>>
>> > > > >> >>>> On Thu, May 26, 2016 at 10:25 AM, John Bickerstaff <
>> > > > >> >>> j...@johnbickerstaff.com
>> > > > >> >>>>> wrote:
>> > > > >> >>>>
>> > > > >> >>>>> Hi all,
>> > > > >> >>>>>
>> > > > >> >>>>> I'm creating a Solr Cloud that will index and search medical
>> > > text.
>> > > > >> >>>>> Multi-word synonyms are a pretty important factor.
>> > > > >> >>>>>
>> > > > >> >>>>> I find that there are some challenges around multi-word
>> > synonyms
>> > > > >> and I
>> > > > >> >>>>> also found on the wiki that there is a recommended 3rd-party
>> > > > parser
>> > > > >> >>>>> (synonym_edismax parser) created by Nolan Lawson and found
>> > here:
>> > > > >> >>>>> https://github.com/healthonnet/hon-lucene-synonyms
>> > > > >> >>>>>
>> > > > >> >>>>> Here's the thing - the instructions on the github site
>> involve
>> > > > >> >> bringing
>> > > > >> >>>>> the jar file into the war file - which is not applicable any
>> > > > more...
>> > > > >> >> at
>> > > > >> >>>>> least I think it's not...
>> > > > >> >>>>>
>> > > > >> >>>>> I have three questions:
>> > > > >> >>>>>
>> > > > >> >>>>> 1. Is this still a good solution for multi-word synonyms
>> (I.e.
>> > > > Solr
>> > > > >> >>> Cloud
>> > > > >> >>>>> doesn't break it in some way)
>> > > > >> >>>>> 2. Is there a tool or plug-in out there that the
>> contributors
>> > > > would
>> > > > >> >>>>> recommend above this one?
>> > > > >> >>>>> 3. Assuming 1 = yes and 2 = no, can anyone tell me an
>> updated
>> > > > >> >> procedure
>> > > > >> >>>>> for bringing it in to Solr Cloud (I'm running 5.4.x)
>> > > > >> >>>>>
>> > > > >> >>>>> Thanks
>> > > > >> >>>>>
>> > > > >> >>>
>> > > > >> >>>
>> > > > >> >>
>> > > > >> >>
>> > > > >> >>
>> > > > >>
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>



Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser

2016-06-01 Thread Jeff Wartes
In the interests of the specific questions to me:

I’m using 5.4, solrcloud. 
I’ve never used the blob store thing, didn’t even know it existed before this 
thread.

I’m uncertain how not finding the class could be specific to hon, it really 
feels like a general solr config issue, but you could try some other foreign 
jar and see if that works. 
Here’s one I use: https://github.com/whitepages/SOLR-4449 (although this one is 
also why I use WEB-INF/lib, because it overrides a protected method, so it 
might not be the greatest example)


On 5/31/16, 4:02 PM, "John Bickerstaff"  wrote:

>Thanks Jeff,
>
>I believe I tried that, and it still refused to load..  But I'd sure love
>it to work since the other process is a bit convoluted - although I see
>it's value in a large Solr installation.
>
>When I "locate" the jar on the linux command line I get:
>
>/opt/solr-5.4.0/server/solr-webapp/webapp/WEB-INF/lib/hon-lucene-synonyms-2.0.0.jar
>
>But the log file is still carrying class not found exceptions when I
>restart...
>
>Are you in "Cloud" mode?  What version of Solr are you using?
>
>On Tue, May 31, 2016 at 4:08 PM, Jeff Wartes  wrote:
>
>> I’ve generally been dropping foreign plugin jars in this dir:
>> server/solr-webapp/webapp/WEB-INF/lib/
>> This is because it then gets loaded by the same classloader as Solr
>> itself, which can be useful if you’re, say, overriding some
>> solr-protected-space method.
>>
>> If you don’t care about the classloader, I believe you can use whatever
>> dir you want, with the appropriate bit of solrconfig.xml to load it.
>> Something like:
>> <lib dir="/path/to/plugin/jars" regex=".*\.jar" />
>>
>>
>> On 5/31/16, 2:13 PM, "John Bickerstaff"  wrote:
>>
>> >All --
>> >
>> >I'm now attempting to use the hon_lucene_synonyms project from github.
>> >
>> >I found the documents that were infered by the dead links on the readme in
>> >the repository -- however, given that I'm using Solr 5.4.x, I no longer
>> >have the need to integrate into a war file (as far as I can see).
>> >
>> >The suggestion on the readme is that I can drop the hon_lucene_synonyms
>> jar
>> >file into the $SOLR_HOME directory, but this does not seem to be working -
>> >I'm getting class not found exceptions.
>> >
>> >Does anyone on this list have direct experience with getting this plugin
>> to
>> >work in Solr 5.x?
>> >
>> >Thanks in advance...
>> >
>> >On Mon, May 30, 2016 at 6:57 PM, MaryJo Sminkey 
>> wrote:
>> >
>> >> It's been awhile since I installed it so I really can't say. I'm more
>> of a
>> >> code monkey than a server gal (particularly Linux... I'm amazed I got
>> Solr
>> >> installed in the first place, LOL!) So I had asked our network guy to
>> look
>> >> it over recently and see if it looked like I did it okay. He said since
>> it
>> >> shows up in the list of jars in the Solr admin that it's installed
>> if
>> >> that's not necessarily true, I probably need to point him in the right
>> >> direction for what else to do since he really doesn't know Solr well
>> >> either.
>> >>
>> >> Mary Jo
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, May 30, 2016 at 7:49 PM, John Bickerstaff <
>> >> j...@johnbickerstaff.com>
>> >> wrote:
>> >>
>> >> > Thanks for the comment Mary Jo...
>> >> >
>> >> > The error loading the class rings a bell - did you find and follow
>> >> > instructions for adding that to the WAR file?  I vaguely remember
>> seeing
>> >> > something about that.
>> >> >
>> >> > I'm going to try my own tests on the auto phrasing one..  If I'm
>> >> > successful, I'll post back.
>> >> >
>> >> > On Mon, May 30, 2016 at 3:45 PM, MaryJo Sminkey 
>> >> > wrote:
>> >> >
>> >> > > This is a very timely discussion for me as well as we're trying to
>> >> tackle
>> >> > > the multi term synonym issue as well and have not been able to
>> >> hon-lucene
>> >> > > plugin to work, the jar shows up as installed but when we set up the
>> >> > sample
>> >> > > request handler it throws this error:
>> >> > >
>> >> > >
>> >> >
>> >>
&

Re: Solr off-heap FieldCache & HelioSearch

2016-06-03 Thread Jeff Wartes

For what it’s worth, I’d suggest you go into a conversation with Azul with a 
more explicit “I’m looking to buy” approach. I reached out to them with a more 
“I’m exploring my options” attitude, and never even got a trial. I get the 
impression their business model involves a fairly expensive (to them) trial 
process, so they’re looking for more urgency on the part of the client than I 
was expressing.

Instead, I spent a few weeks analyzing how my specific index allocated memory. 
This turned out to be quite worthwhile. Armed with that information, I was able 
to file a few patches (coming in 6.1, perhaps?) that reduced allocations by a 
pretty decent amount on large indexes. (SOLR-8922, particularly) It also 
straight-up ruled out certain things Solr supports, because the allocations 
were just too heavy. (SOLR-9125)

I suppose the next thing I’m considering is using multiple JVMs per host, 
essentially one per shard. This wouldn’t change the allocation rate, but does 
serve to reduce the worst-case GC pause, since each JVM can have a smaller 
heap. I’d be trading a little p50 latency for some p90 latency reduction, I’d 
expect. Of course, that adds a bunch of headache to managing replica locations 
too.


On 6/2/16, 6:30 PM, "Phillip Peleshok"  wrote:

>Fantastic! I'm sorry I couldn't find that JIRA before and for getting you
>to track it down.
>
>Yup, I noticed that for the docvalues with the ordinal map and I'm
>definitely leveraging all that but I'm hitting the terms limit now and that
>ends up pushing me over.  I'll see about giving Zing/Azul a try.  From all
>my readings using theUnsafe seemed a little sketchy (
>http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/) so
>I'm glad that seemed to be the point of contention bringing it in and not
>anything else.
>
>Thank you very much for the info,
>Phil
>
>On Thu, Jun 2, 2016 at 6:14 PM, Erick Erickson 
>wrote:
>
>> Basically it never reached consensus, see the discussion at:
>> https://issues.apache.org/jira/browse/SOLR-6638
>>
>> If you can afford it I've seen people with very good results
>> using Zing/Azul, but that can be expensive.
>>
>> DocValues can help for fields you facet and sort on,
>> those essentially move memory into the OS
>> cache.
>>
>> But memory is an ongoing struggle I'm afraid.
>>
>> Best,
>> Erick
>>
>> On Wed, Jun 1, 2016 at 12:34 PM, Phillip Peleshok 
>> wrote:
>> > Hey everyone,
>> >
>> > I've been using Solr for some time now and running into GC issues as most
>> > others have.  Now I've exhausted all the traditional GC settings
>> > recommended by various individuals (ie Shawn Heisey, etc) but neither
>> > proved sufficient.  The one solution that I've seen that proved useful is
>> > Heliosearch and the off-heap implementation.
>> >
>> > My question is this, why wasn't the off-heap FieldCache implementation (
>> > http://yonik.com/hs-solr-off-heap-fieldcache-performance/) ever rolled
>> into
>> > Solr when the other HelioSearch improvement were merged? Was there a
>> > fundamental design problem or just a matter of time/testing that would be
>> > incurred by the move?
>> >
>> > Thanks,
>> > Phil
>>



Re: Multiple calls across the distributed nodes for a query

2016-06-15 Thread Jeff Wartes
Any distributed query falls into the two-phase process. Actually, I think some 
components may require a third phase. (faceting?)

However, there are also cases where only a single pass is required. A 
fl=id,score will only be a single pass, for example, since it doesn’t need to 
get the field values.

https://issues.apache.org/jira/browse/SOLR-5768 would be a good place to read 
about some of this, and provides a way to help force a one-pass even if you 
need other fields.
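
For example (host and collection are placeholders; distrib.singlePass is
the parameter that ticket added):

http://solr1.example.com:8983/solr/collection1/select?q=foo&fl=id,score
http://solr1.example.com:8983/solr/collection1/select?q=foo&fl=id,title&distrib.singlePass=true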


On 6/15/16, 7:31 AM, "Raveendra Yerraguntla"  
wrote:

>I need help in understanding a query in solr cloud.
>When user issues a query , there are are two phases of query - one with the
>purpose(from debug info) of GET_TOP_FIELDS and another with GET_FIELDS.
>
>This is having an effect on end to end performance of the application.
>
>- what triggers (any components like facet, highlight, spellchecker ??? )
>the two calls
>- How can I make a query to be executed only with GET_FIELDS only .



Re: Long STW GCs with Solr Cloud

2016-06-16 Thread Jeff Wartes
Check your gc log for CMS “concurrent mode failure” messages. 

If a concurrent CMS collection fails, it does a stop-the-world pause while it 
cleans up using a *single thread*. This means the stop-the-world CMS collection 
in the failure case is typically several times slower than a concurrent CMS 
collection. The single-thread business means it will also be several times 
slower than the Parallel collector, which is probably what you’re seeing. I 
understand that it needs to stop the world in this case, but I really wish the 
CMS failure would fall back to a Parallel collector run instead.
The Parallel collector is always going to be the fastest at getting rid of 
garbage, but only because it stops all the application threads while it runs, 
so it’s got less complexity to deal with. That said, it’s probably not going to 
be orders of magnitude faster than a (successfully) concurrent CMS collection.

Regardless, the bigger the heap, the bigger the pause.

If your application is generating a lot of garbage, or can generate a lot of 
garbage very suddenly, CMS concurrent mode failures are more likely. You can 
turn down the  -XX:CMSInitiatingOccupancyFraction value in order to give the 
CMS collection more of a head start at the cost of more frequent collections. 
If that doesn’t work, you can try using a bigger heap, but you may eventually 
find yourself trying to figure out what about your query load generates so much 
garbage (or causes garbage spikes) and trying to address that. Even G1 won’t 
protect you from highly unpredictable garbage generation rates.

In my case, for example, I found that a very small subset of my queries were 
using the CollapseQParserPlugin, which requires quite a lot of memory 
allocations, especially on a large index. Although generally this was fine, if 
I got several of these rare queries in a very short window, it would always 
spike enough garbage to cause CMS concurrent mode failures. The single-threaded 
concurrent-mode failure would then take long enough that the ZK heartbeat would 
fail, and things would just go downhill from there.



On 6/15/16, 3:57 PM, "Cas Rusnov"  wrote:

>Hey Shawn! Thanks for replying.
>
>Yes I meant HugePages not HugeTable, brain fart. I will give the
>transparent off option a go.
>
>I have attempted to use your CMS configs as is and also the default
>settings and the cluster dies under our load (basically a node will get a
>35-60s GC STW and then the others in the shard will take the load, and they
>will in turn get long STWs until the shard dies), which is why basically in
>a fit of desperation I tried out ParallelGC and found it to be half-way
>acceptable. I will run a test using your configs (and the defaults) again
>just to be sure (since I'm certain the machine config has changed since we
>used your unaltered settings).
>
>Thanks!
>Cas
>
>
>On Wed, Jun 15, 2016 at 3:41 PM, Shawn Heisey  wrote:
>
>> On 6/15/2016 3:05 PM, Cas Rusnov wrote:
>> > After trying many of the off the shelf configurations (including CMS
>> > configurations but excluding G1GC, which we're still taking the
>> > warnings about seriously), numerous tweaks, rumors, various instance
>> > sizes, and all the rest, most of which regardless of heap size and
>> > newspace size resulted in frequent 30+ second STW GCs, we settled on
>> > the following configuration which leads to occasional high GCs but
>> > mostly stays between 10-20 second STWs every few minutes (which is
>> > almost acceptable): -XX:+AggressiveOpts -XX:+UnlockDiagnosticVMOptions
>> > -XX:+UseAdaptiveSizePolicy -XX:+UseLargePages -XX:+UseParallelGC
>> > -XX:+UseParallelOldGC -XX:MaxGCPauseMillis=15000 -XX:MaxNewSize=12000m
>> > -XX:ParGCCardsPerStrideChunk=4096 -XX:ParallelGCThreads=16 -Xms31000m
>> > -Xmx31000m
>>
>> You mentioned something called "HugeTable" ... I assume you're talking
>> about huge pages.  If that's what you're talking about, have you also
>> turned off transparent huge pages?  If you haven't, you might want to
>> completely disable huge pages in your OS.  There's evidence that the
>> transparent option can affect performance.
>>
>> I assume you've probably looked at my GC info at the following URL:
>>
>> http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
>>
>> The parallel collector is most definitely not a good choice.  It does
>> not optimize for latency.  It's my understanding that it actually
>> prefers full GCs, because it is optimized for throughput.  Solr thrives
>> on good latency, throughput doesn't matter very much.
>>
>> If you want to continue avoiding G1, you should definitely be using
>> CMS.  My recommendation right now would be to try the G1 settings on my
>> wiki page under the heading "Current experiments" or the CMS settings
>> just below that.
>>
>> The out-of-the-box GC tuning included with Solr 6 is probably a better
>> option than the parallel collector you've got configured now.
>>
>> Thanks,
>> Shawn
>>
>>
>
>
>-- 
>
>Cas Rusnov,
>
>Engineer

Re: Long STW GCs with Solr Cloud

2016-06-17 Thread Jeff Wartes
For what it’s worth, I looked into reducing the allocation footprint of 
CollapsingQParserPlugin a bit, but without success. See 
https://issues.apache.org/jira/browse/SOLR-9125

As it happened, I was collapsing on a field with such high cardinality that the 
chances of a query even doing much collapsing of interest was pretty low. That 
allowed me to use a vastly stripped-down version of CollapsingQParserPlugin 
with a *much* lower memory footprint, in exchange for collapsed document heads 
essentially being picked at random. (That is, when collapsing two documents, 
the one that gets returned is random.)

If that’s of interest, I could probably throw the code someplace public.


On 6/16/16, 3:39 PM, "Cas Rusnov"  wrote:

>Hey thanks for your reply.
>
>Looks like running the suggested CMS config from Shawn, we're getting some
>nodes with 30+sec pauses, I gather due to large heap, interestingly enough
>while the scenario Jeff talked about is remarkably similar (we use field
>collapsing), including the performance aspects of it, we are getting
>concurrent mode failures both due to new space allocation failures and due
>to promotion failures. I suspect there's a lot of garbage building up.
>We're going to run tests with field collapsing disabled and see if that
>makes a difference.
>
>Cas
>
>
>On Thu, Jun 16, 2016 at 1:08 PM, Jeff Wartes  wrote:
>
>> Check your gc log for CMS “concurrent mode failure” messages.
>>
>> If a concurrent CMS collection fails, it does a stop-the-world pause while
>> it cleans up using a *single thread*. This means the stop-the-world CMS
>> collection in the failure case is typically several times slower than a
>> concurrent CMS collection. The single-thread business means it will also be
>> several times slower than the Parallel collector, which is probably what
>> you’re seeing. I understand that it needs to stop the world in this case,
>> but I really wish the CMS failure would fall back to a Parallel collector
>> run instead.
>> The Parallel collector is always going to be the fastest at getting rid of
>> garbage, but only because it stops all the application threads while it
>> runs, so it’s got less complexity to deal with. That said, it’s probably
>> not going to be orders of magnitude faster than a (successfully) concurrent
>> CMS collection.
>>
>> Regardless, the bigger the heap, the bigger the pause.
>>
>> If your application is generating a lot of garbage, or can generate a lot
>> of garbage very suddenly, CMS concurrent mode failures are more likely. You
>> can turn down the  -XX:CMSInitiatingOccupancyFraction value in order to
>> give the CMS collection more of a head start at the cost of more frequent
>> collections. If that doesn’t work, you can try using a bigger heap, but you
>> may eventually find yourself trying to figure out what about your query
>> load generates so much garbage (or causes garbage spikes) and trying to
>> address that. Even G1 won’t protect you from highly unpredictable garbage
>> generation rates.
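>>
>> As a rough sketch of that head start (the values here are illustrative
>> assumptions, not a recommendation), the relevant flags look like:
>>
>>   -XX:+UseConcMarkSweepGC
>>   -XX:+UseParNewGC
>>   -XX:CMSInitiatingOccupancyFraction=60
>>   -XX:+UseCMSInitiatingOccupancyOnly
>>
>> UseCMSInitiatingOccupancyOnly keeps the JVM from adapting the threshold on
>> its own, so the fraction you set is the fraction that actually triggers
>> concurrent collections.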
>>
>> In my case, for example, I found that a very small subset of my queries
>> were using the CollapseQParserPlugin, which requires quite a lot of memory
>> allocations, especially on a large index. Although generally this was fine,
>> if I got several of these rare queries in a very short window, it would
>> always spike enough garbage to cause CMS concurrent mode failures. The
>> single-threaded concurrent-mode failure would then take long enough that
>> the ZK heartbeat would fail, and things would just go downhill from there.
>>
>>
>>
>> On 6/15/16, 3:57 PM, "Cas Rusnov"  wrote:
>>
>> >Hey Shawn! Thanks for replying.
>> >
>> >Yes I meant HugePages not HugeTable, brain fart. I will give the
>> >transparent off option a go.
>> >
>> >I have attempted to use your CMS configs as-is and also the default
>> >settings, and the cluster dies under our load (basically a node will get
>> >a 35-60s GC STW, then the others in the shard will take the load and
>> >will in turn get long STWs until the shard dies), which is why, in a fit
>> >of desperation, I tried out ParallelGC and found it to be half-way
>> >acceptable. I will run a test using your configs (and the defaults) again
>> >just to be sure (since I'm certain the machine config has changed since we
>> >used your unaltered settings).
>> >
>> >Thanks!
>> >Cas
>> >
>> >
>> >On Wed, Jun 15, 2016 at 3:41 PM, Shawn Heisey 
>> wrote:
>> >
>> >> On 6/15/201

Re: SolrCloud: Adding a very large collection to a pre-existing cluster

2016-06-21 Thread Jeff Wartes

There’s no official way of doing #1, but there are some less official ways:
1. The Backup/Restore API provides some hooks into loading pre-existing data 
dirs into an existing collection. Lots of caveats.
2. If you don’t have many shards, there’s always rsync/reload.
3. There are some third-party tools that help with this kind of thing:
a. https://github.com/whitepages/solrcloud_manager (primarily a command line 
tool)
b. https://github.com/bloomreach/solrcloud-haft (primarily a library)

For #2, absolutely. Spin up some new nodes in your cluster, and then use the 
“createNodeSet” parameter when creating the new collection to restrict to those 
new nodes:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
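
For example (host names, ports, and collection/config names here are
hypothetical), the createNodeSet restriction looks like:

  curl "http://localhost:8983/solr/admin/collections?action=CREATE\
  &name=newcollection&numShards=2&replicationFactor=2\
  &collection.configName=myconfig\
  &createNodeSet=host5:8983_solr,host6:8983_solr"

The createNodeSet values are node names exactly as they appear under
live_nodes in ZooKeeper.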




On 6/21/16, 12:33 PM, "Kelly, Frank"  wrote:

>We have about 200 million documents (~70 GB) we need to keep indexed across 3 
>collections.
>
>Currently 2 of the 3 collections are already indexed (roughly 90m docs).
>
>We'd like to create the remaining collection (about 100 m documents) while 
>minimizing the performance impact on the existing collections on the Solr 
>servers during that time.
>
>Is there some way to do this either by
>
>  1.  Creating the collection in another environment and shipping the 
> (underlying Lucene) index files
>  2.  Creating the collection on (dedicated) new machines that we add to the 
> SolrCloud cluster?
>
>Thoughts, comments or suggestions appreciated,
>
>Best
>
>-Frank Kelly
>



Re: Help with recovering shard range after zookeeper disaster

2016-06-28 Thread Jeff Wartes
This might come a little late to be helpful, but I had a similar situation with 
Solr 5.4 once.

We ended up finding a ZK snapshot we could restore, but we did also get the 
cluster back up for most of the interim by taking the now-empty ZK cluster, 
re-uploading the configs that the collections used, and then restarting the 
nodes one at a time. I did find that the cluster state re-wrote itself, 
including my shard ranges. (Although I didn’t have anything special in there, 
no shard-splitting history, etc).
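
For reference, re-uploading a config set to the now-empty ensemble looks
roughly like this (zkhost, path, and config name are placeholders):

  server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 \
    -cmd upconfig -confname myconfig -confdir /path/to/conf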

There were a few other side effects: aliases needed to be recreated, there were 
issues around leader election, and there was an odd increase in latency until 
we got the backup restored. 

Good luck.


On 6/27/16, 11:24 PM, "pramodEbay"  wrote:

>We recently experienced a case where the zookeeper snapshot became corrupt and
>zookeeper would not restart. 
>zkCli.sh (of zookeeper) would fail with an error unable to connect to /
>
>We have a solr cloud with two shards (Keys are autosharded) (Solr version
>4.10.1)
>
>Unfortunately, we did not have a good snapshot to recover from. We are
>planning on creating a brand new zookeeper ensemble and have the solr nodes
>reconnect. We do not have a good clusterstate.json to upload to zookeeper.
>
>Our current state is: all solr nodes are operating in read-only mode. No
>updates are possible. 
>
>This is what we are planning on doing now:
>1. Delete snapshot and logs from zookeepers
>2. Create brand new data folder
>3. Upload solr configurations into zookeepers
>4. With solr nodes running, have them reconnect to zookeeper.
>
>What I am not clear on is: as each solr node attempts to reconnect, will it
>identify which shard it originally belonged to? Will the
>clusterstate.json get created? I don't know the hash ranges since there is
>no clusterstate.json. Or do I need to manually create a clusterstate.json
>and upload it to zookeeper?
>
>What is our best recourse now. Any help with disaster recovery is much
>appreciated.
>
>Thanks,
>Pramod
>
>--
>View this message in context: 
>http://lucene.472066.n3.nabble.com/Help-with-recovering-shard-range-after-zookeeper-disaster-tp4284645.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Re: Full re-index without downtime

2016-07-06 Thread Jeff Wartes

A variation on #1 here - Use the same cluster, create a new collection, but use 
the createNodeSet option to logically partition your cluster so no node has 
both the old and new collection.

If your clients all reference a collection alias, instead of a collection name, 
then all you need to do when the replacement index is ready is move the alias, 
(instant and atomic) and then clean up by dropping the old collection. Repeat 
as necessary.
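
The swap itself is a single Collections API call, something like this (alias
and collection names are hypothetical):

  curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS\
  &name=products&collections=products_v2"

CREATEALIAS on an existing alias re-points it, which is what makes the swap
atomic from the client’s point of view.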

You say you’re using the CoreAdmin API though, which implies you’re not using 
SolrCloud; SolrCloud is a requirement for the approach above.


On 7/6/16, 10:42 AM, "Steven Bower"  wrote:

>There are two options as I see it..
>
>1. Do something like you describe and create a secondary index, index into
>it, then switch. I personally would create a completely separate solr
>cloud alongside my existing one rather than a new core in the same cloud, as
>you might see some negative impacts on GC caused by the indexing load.
>
>2. Tag each record with a field (eg "generation") that identifies which
>generation of data a record is from. When querying, filter on only the
>generation of data that is complete; new records get a new generation. The
>only problem with this is that changing field types doesn't really work with
>the same field names, but if you used dynamic fields instead of static, the
>field name would change anyway, which isn't a problem then (see the sketch
>below).
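>
>a rough sketch of that pattern (field name and values are made up):
>
>  index each load with its generation:   {"id":"doc1", "generation":42, ...}
>  query only the completed generation:   fq=generation:42
>
>when generation 43 finishes loading, just switch the fq.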
>
>We use both of these patterns in different applications..
>
>steve
>
>On Wed, Jul 6, 2016 at 1:27 PM Steven White  wrote:
>
>> Hi everyone,
>>
>> In my environment, I have use cases where I need to fully re-index my
>> data.  This happens because Solr's schema requires changes based on changes
>> made to my data source, the DB.  For example, my DB schema may change so
>> that it now has a whole new set of field added or removed (on records), or
>> the data type changed (on fields).  When that happens, the only solution I
>> have right now is to drop the current Solr index, update Solr's schema.xml,
>> re-index my data (I use Solr's core admin to dynamical do all this).
>>
>> The issue with my current solution is that during the re-indexing, which right
>> now takes 10 hours (I expect it to take over 30 hours as my data keeps on
>> growing), search via Solr is not available.  Sure, I can enable search while
>> the data is being re-indexed, but then I get partial results.
>>
>> My question is this: how can I avoid this so there is minimal downtime,
>> under 1 min.?  I was thinking of creating a second core (again dynamically)
>> and re-index into it (after setting up the new schema) and once the
>> re-index is fully done, switch over to the new core and drop the index from
>> the old core and then delete the old core, and rename the new core to the
>> old core (original core).
>>
>> Would the above work or is there a better way to do this?  How do you guys
>> solve this problem?
>>
>> Again, my goal is to minimize downtime during re-indexing when Solr's
>> schema is drastically changed (requiring re-indexing).
>>
>> Thanks in advanced.
>>
>> Steve
>>



Re: solrcloud consumes more time than solr when write index

2016-07-12 Thread Jeff Wartes
Well, two thoughts:


1. If you’re not using solrcloud, presumably you don’t have any replicas. If 
you are, presumably you do. This makes for a biased comparison, because 
SolrCloud won’t acknowledge a write until it’s been safely written to all 
replicas. In short, solrcloud write time is max(per-replica write time). The 
more replicas you add, the bigger the chance some replica randomly takes longer 
(gc pause, perhaps?), and the longer your overall write time, assuming a fixed 
number of indexing threads.
2. The parallelism of the optimize operation across replicas has gone back and 
forth a bit, and I’m not sure what it was doing in 4.9. However, at one point 
the optimize happened per-replica, serially. So it’d do shard1_replica1, then 
when that was done, do shard1_replica2, then shard2_replica1, etc. Other 
versions of Solr would do those at the same time. Again, I don’t know if you’re 
comparing to a non-replicated solr index, but that could explain some of the 
difference.

There’s a sort of an obligatory comment at this point that optimize doesn’t 
necessarily save you a lot. There are certainly cases where it does, but if you 
haven’t already, you’ll want to validate that you have one of them and that 
you’re not just doing unnecessary work.


On 7/12/16, 7:41 AM, "Kent Mu"  wrote:

>hello, has anybody else come across this issue? can anybody help me?
>
>2016-07-11 23:17 GMT+08:00 Kent Mu :
>
>> Hi friends!
>>
>> solr version: 4.9.0.
>>
>> we use solr and solrcloud in our project, that means we use solr and
>> solrcloud at the same time.
>> but we find that solrcloud consumes more time than solr when
>> writing the index. it takes nearly 5 or more times longer. I wonder why that is?
>>
>> in our project, we have a scheduled job to add index data, and then execute
>> the method "optimize(false, true, 2)" to optimize the added index.
>> I wonder if it is caused by solrcloud internals: when writing the index,
>> solrcloud needs to decide which shard a document should be stored on? and when
>> optimizing, the replica needs to take some time to synchronize the data
>> from the leader?
>>
>> and I wonder what about query?  will solrcloud also take more time than
>> solr when querying data?
>>



Re: solrcloud consumes more time than solr when write index

2016-07-13 Thread Jeff Wartes
There’s another thread on this list going on right now touching on the need to 
optimize, might be worth reading.
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201607.mbox/%3c61f3d01f-c3ef-2d71-7112-6a88b0145...@elyograg.org%3E


On 7/12/16, 6:25 PM, "Kent Mu"  wrote:

>Dear Mr. Wartes,
>Thanks for your reply. well, I see. for solr we do have replicas, and for
>solrcloud, we have 5 shards, each with one leader and one
>replica. and the document count is nearly 100 million. you mean we do not need
>to optimize the index data?
>
>Thanks!
>Kent
>
>2016-07-12 23:02 GMT+08:00 Jeff Wartes :
>
>> Well, two thoughts:
>>
>>
>> 1. If you’re not using solrcloud, presumably you don’t have any replicas.
>> If you are, presumably you do. This makes for a biased comparison, because
>> SolrCloud won’t acknowledge a write until it’s been safely written to all
>> replicas. In short, solrcloud write time is max(per-replica write time).
>> The more replicas you add, the bigger the chance some replica randomly
>> takes longer (gc pause, perhaps?), and the longer your overall write time,
>> assuming a fixed number of indexing threads.
>> 2. The parallelism of the optimize operation across replicas has gone back
>> and forth a bit, and I’m not sure what it was doing in 4.9. However, at one
>> point the optimize happened per-replica, serially. So it’d do
>> shard1_replica1, then when that was done, do shard1_replica2, then
>> shard2_replica1, etc. Other versions of Solr would do those at the same
>> time. Again, I don’t know if you’re comparing to a non-replicated solr
>> index, but that could explain some of the difference.
>>
>> There’s a sort of an obligatory comment at this point that optimize
>> doesn’t necessarily save you a lot. There are certainly cases where it
>> does, but if you haven’t already, you’ll want to validate that you have one
>> of them and that you’re not just doing unnecessary work.
>>
>>
>> On 7/12/16, 7:41 AM, "Kent Mu"  wrote:
>>
>> >hello, has anybody else come across this issue? can anybody help me?
>> >
>> >2016-07-11 23:17 GMT+08:00 Kent Mu :
>> >
>> >> Hi friends!
>> >>
>> >> solr version: 4.9.0.
>> >>
>> >> we use solr and solrcloud in our project, that means we use solr and
>> >> solrcloud at the same time.
>> >> but we find that solrcloud consumes more time than solr when
>> >> writing the index. it takes nearly 5 or more times longer. I wonder why
>> >> that is?
>> >>
>> >> in our project, we have a scheduled job to add index data, and then
>> >> execute the method "optimize(false, true, 2)" to optimize the added index.
>> >> I wonder if it is caused by solrcloud internals: when writing the index,
>> >> solrcloud needs to decide which shard a document should be stored on? and
>> >> when optimizing, the replica needs to take some time to synchronize the
>> >> data from the leader?
>> >>
>> >> and I wonder what about query?  will solrcloud also take more time than
>> >> solr when querying data?
>> >>
>>
>>



Re: Node not recovering, leader elections not occuring

2016-07-19 Thread Jeff Wartes
It sounds like the node-local version of the ZK clusterstate has diverged from 
the ZK cluster state. You should check the contents of zookeeper and verify the 
state there looks sane. I’ve had issues (v5.4) on a few occasions where leader 
election got screwed up to the point where I had to delete data from ZK 
manually. (Usually after a ZK issue.)

Particularly, take a look in
collections/<collection>/leader_elect/shard<N>/election

Non-sane states would be the same core_nodeX listed more than once, or fewer 
entries than (up) replicas. If you’re having trouble getting a sane election, 
you can try deleting the lowest-numbered entries (as well as any lower-numbered 
duplicates) and trying forceelection again. Possibly followed by restarting the 
node with that lowest-numbered entry.
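
As an illustration with ZooKeeper’s own zkCli.sh (collection, shard, and node
names here are made up; election entries follow the pattern
<session>-<core_node>-n_<sequence>):

  zkCli.sh -server zk1:2181
  ls /collections/mycollection/leader_elect/shard1/election
  delete /collections/mycollection/leader_elect/shard1/election/96245041148168196-core_node3-n_0000000042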

Also make sure that this exists and has the expected replica:
collections/<collection>/leaders/shard<N>

collections/<collection>/leader_initiated_recovery 
can be informative too, this represents replicas that the *leader* thinks are 
out of sync, usually due to a failed update request.


On 7/19/16, 9:20 AM, "Tom Evans"  wrote:

On the nodes that have the replica in a recovering state we now see:

19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to
recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms , collection:
lookups slice: shard1
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
at 
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

19-07-2016 16:18:28 INFO  RecoveryStrategy:444 - Replay not started,
or was not successful... still buffering updates.
19-07-2016 16:18:28 ERROR RecoveryStrategy:481 - Recovery failed -
trying again... (164)
19-07-2016 16:18:28 INFO  RecoveryStrategy:503 - Wait [12.0] seconds
before trying to recover again (attempt=165)


This is with the "leader that is not the leader" shut down.

Issuing a FORCELEADER via collections API doesn't in fact force a
leader election to occur.

Is there any other way to prompt Solr to have an election?

Cheers

Tom

On Tue, Jul 19, 2016 at 5:10 PM, Tom Evans  wrote:
> There are 11 collections, each only has one shard, and each node has
> 10 replicas (9 collections are on every node, 2 are just on one node).
> We're not seeing any OOM errors on restart.
>
> I think we're being patient waiting for the leader election to occur.
> We stopped the troublesome "leader that is not the leader" server
> about 15-20 minutes ago, but we still have not had a leader election.
>
> Cheers
>
> Tom
>
> On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson  wrote:
>> How many replicas per Solr JVM? And do you
>> see any OOM errors when you bounce a server?
>> And how patient are you being, because it can
>> take 3 minutes for a leaderless shard to decide
>> it needs to elect a leader.
>>
>> See SOLR-7280 and SOLR-7191 for the case
>> where lots of replicas are in the same JVM,
>> the tell-tale symptom is errors in the log as you
>> bring Solr up saying something like
>> "OutOfMemory error unable to create native thread"
>>
>> SOLR-7280 has patches for 6x and 7x, with a 5x one
>> being added momentarily.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans  wrote:
>>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>>> of the collections on it marked as "Recovering" or "Recovery Failed".
>>> It attempts to recover from the leader, but the leader responds with:
>>>
>>> Error while trying to recover.
>>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>> at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.ja

Effects of insert order on query performance

2016-08-11 Thread Jeff Wartes

This isn’t really a question, although some validation would be nice. It’s more 
of a warning.

Tldr is that the insert order of documents in my collection appears to have had 
a huge effect on my query speed.


I have a very large (sharded) SolrCloud 5.4 index. One aspect of this index is 
a multi-valued field (“permissions”) that for 90% of docs contains one 
particular value (“A”) and for 10% of docs contains another distinct value 
(“B”). It’s intended to represent something like permissions, so more values are 
possible in the future, but none are present currently. In fact, the addition of 
docs with value “B” to this index was very recent; previously all docs had value 
“A”. All queries, in addition to various other Boolean-query type restrictions, 
have a terms query on this field, like {!terms f=permissions v=A} or {!terms 
f=permissions v=A,B}

Last week, I tried to re-index the whole collection from scratch, using source 
data. Query performance on the resulting re-index proved to be abysmal, I could 
get barely 10% of my previous query throughput, and even that was at latencies 
that were orders of magnitude higher than what I had in production.

I hooked up some CPU profiling to a server that had shards from both the old 
and new version of the collection, and eventually it looked like the 
significant difference in processing the two collections was coming from 
ConstantWeight.scorer()
Specifically, this line
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/solr/core/src/java/org/apache/solr/search/SolrConstantScoreQuery.java#L102
was far more expensive in my re-indexed collection. From there, the call chain 
goes through an LRUQueryCache, down to a BulkScorer, and ends up with the extra 
work happening here:
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/lucene/core/src/java/org/apache/lucene/search/Weight.java#L169

I don’t pretend to understand all that code, but the difference in my re-index 
appears to have something to do either with that cache, or with the aggregate 
docIdSets that need weights generated simply being much bigger in my re-index.


But the queries didn’t change, and the data is basically the same, what else 
could have changed?

The documents with the “B” distinct value were added recently to the 
high-performance collection, but the A’s and the B’s were all mixed up in the 
source data dump I used to re-index. On a hunch, I manually ordered the docs 
such that the A’s were all first and re-indexed again, and performance is great!

Here’s my theory: Using TieredMergePolicy, the vast majority of the documents 
in an index are contained in the largest segments. I’m guessing there’s an 
optimization somewhere that says something like “This segment only has A’s”. By 
indexing all the A’s first, those biggest segments only contain A’s, and only 
the smallest, newest segments are unable to make use of that optimization.

Here’s the scary part: Although my re-index is now performing well, if this 
theory is right, some random insert (or a deliberate optimize) at some random 
point in the future could cascade a segment merge such that the largest 
segment(s) now contain both A’s and B’s, and performance suddenly goes over a 
cliff. I have no way to prevent this possibility except to stop doing inserts.

My current thinking is that I need to pull the terms-query part out of the 
query and do a filter query for it instead. Probably as a post-filter, since 
I’ve had bad luck with very large filter queries and the filter cache. I’d 
tested this originally (when I only had A’s), but found the performance was a 
bit worse than just leaving it in the query. I’ll take a bit worse and 
predictability over a bit better and a time bomb though, if those are my 
choices.
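
Concretely, that would mean moving the clause out of q and into an fq along
these lines (parameters illustrative; note that true post-filtering only
applies to parsers that implement PostFilter, which I’d still need to verify
for {!terms}):

  q=<main query>&fq={!terms f=permissions v=A,B cache=false cost=200}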


If anyone has any comments refuting or supporting this theory, I’d certainly 
like to hear it. This is the first time I’ve encountered anything about insert 
order mattering from a performance perspective, and it becomes a general-form 
question around how to handle low-cardinality fields.



Re: Effects of insert order on query performance

2016-08-12 Thread Jeff Wartes
Thanks Emir. I’m unfortunately already using a routing key that needs to be at 
the top level, since I’m collapsing on that field. 

Adding a sub-key won’t help much if my theory is correct, as even a single 
shard (distrib=false) showed serious performance degradation, and query latency 
is the max(shard latency). I’d need a routing scheme that assured that a given 
shard has *only* A’s, or *only* B’s.

Even if I could use “permissions” as the top-level routing key though, this is 
a very low cardinality field, so I’d expect to end up with very large 
differences between the sizes of the shards in that case. That’s fine from a 
SolrCloud query perspective of course, but it makes for more difficult resource 
provisioning.
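
For reference, composite hash routing is expressed in the document id itself
(the ids here are made up):

  A!doc1      routes by the hash of “A”, so all “A” docs co-locate
  A/2!doc1    takes only 2 bits from the “A” hash, spreading “A” docs over
              1/4 of the collection’s hash range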


On 8/12/16, 1:39 AM, "Emir Arnautovic"  wrote:

    Hi Jeff,

I will not comment on your theory (will let that to guys more familiar 
with Lucene code) but will point to one alternative solution: routing. 
You can use routing to split documents with different permission to 
different shards and use composite hash routing to split "A" (and maybe 
"B" as well) documents to multiple shards. That will make sure all doc 
with the same permission are on the same shard and on query time only 
those will be queried (less shards to query) and there is no need to 
include term query or filter query at all.

Here is blog explaining benefits of composite hash routing: 
https://sematext.com/blog/2015/09/29/solrcloud-large-tenants-and-routing/

Regards,
Emir

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On 11.08.2016 19:39, Jeff Wartes wrote:
> This isn’t really a question, although some validation would be nice. 
It’s more of a warning.
>
> Tldr is that the insert order of documents in my collection appears to 
have had a huge effect on my query speed.
>
>
> I have a very large (sharded) SolrCloud 5.4 index. One aspect of this 
index is a multi-valued field (“permissions”) that for 90% of docs contains one 
particular value (“A”) and for 10% of docs contains another distinct value 
(“B”). It’s intended to represent something like permissions, so more values are 
possible in the future, but none are present currently. In fact, the addition of 
docs with value “B” to this index was very recent; previously all docs had value 
“A”. All queries, in addition to various other Boolean-query type restrictions, 
have a terms query on this field, like {!terms f=permissions v=A} or {!terms 
f=permissions v=A,B}
>
> Last week, I tried to re-index the whole collection from scratch, using 
source data. Query performance on the resulting re-index proved to be abysmal, 
I could get barely 10% of my previous query throughput, and even that was at 
latencies that were orders of magnitude higher than what I had in production.
>
> I hooked up some CPU profiling to a server that had shards from both the 
old and new version of the collection, and eventually it looked like the 
significant difference in processing the two collections was coming from 
ConstantWeight.scorer()
> Specifically, this line
> 
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/solr/core/src/java/org/apache/solr/search/SolrConstantScoreQuery.java#L102
> was far more expensive in my re-indexed collection. From there, the call 
chain goes through an LRUQueryCache, down to a BulkScorer, and ends up with the 
extra work happening here:
> 
https://github.com/apache/lucene-solr/blob/0a1dd10d5262153f4188dfa14a08ba28ec4ccb60/lucene/core/src/java/org/apache/lucene/search/Weight.java#L169
>
> I don’t pretend to understand all that code, but the difference in my 
re-index appears to have something to do either with that cache, or with the 
aggregate docIdSets that need weights generated simply being much bigger in my 
re-index.
>
>
> But the queries didn’t change, and the data is basically the same, what 
else could have changed?
>
> The documents with the “B” distinct value were added recently to the 
high-performance collection, but the A’s and the B’s were all mixed up in the 
source data dump I used to re-index. On a hunch, I manually ordered the docs 
such that the A’s were all first and re-indexed again, and performance is great!
>
> Here’s my theory: Using TieredMergePolicy, the vast majority of the 
documents in an index are contained in the largest segments. I’m guessing 
there’s an optimization somewhere that says something like “This segment only 
has A’s”. By indexing all the A’s first, those biggest segments only contain 
A’s, and only the smallest, newest segments are unable to make use of that 
optimization.
>
> Here’s the scary part: Although my re-

Re: Result Grouping vs. Collapsing Query Parser -- Can one be deprecated?

2016-10-20 Thread Jeff Wartes
I’ll also mention that collapse chose to improve processing speed by allocating 
more memory, which increases the importance of GC tuning. This bit me when I 
tried using it on a larger index. 
https://issues.apache.org/jira/browse/SOLR-9125

I don’t know if the result grouping feature shares the same issue. Probably.
I actually never bothered trying it, since the comments I’d read made it seem 
like a non-starter.


On 10/19/16, 4:34 PM, "Joel Bernstein"  wrote:

Also as you consider using collapse you'll want to keep in mind the feature
compromises that were made to achieve the higher performance:

1) Collapse does not directly support faceting. It simply collapses the
results and the faceting components compute facets on the collapsed result
set. Grouping has direct support for faceting, which can be slow, but it
has options other than just computing facets on the collapsed result set.

2) Originally collapse only supported selecting group heads with min/max
value of a numeric field. It did not support using the sort parameter for
selecting the group head. Recently the sort parameter was added to
collapse, but this likely is not nearly as fast as using the min/max for
selecting group heads.
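
To make the difference concrete (field and sort names hypothetical):

  fq={!collapse field=groupId max=popularity}     original min/max group heads
  fq={!collapse field=groupId sort='price asc'}   newer sort syntax, likely slower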



Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Oct 19, 2016 at 7:20 PM, Joel Bernstein  wrote:

> Originally collapsing was designed with a very small feature set and one
> goal in mind: High performance collapsing on high cardinality fields. To
> avoid having to compromise on that goal, it was developed as a separate
> feature.
>
> The trick in combining grouping and collapsing into one feature is to do
> it in a way that does not hurt the original performance goal of collapse.
> Otherwise we'll be back to just having slow grouping.
>
> Perhaps the new APIs that are being worked on could have a facade over
> grouping and collapsing so they would share the same API.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Oct 19, 2016 at 6:51 PM, Mike Lissner  wrote:
>
>> Hi all,
>>
>> I've had a rotten day today because of Solr. I want to share my 
experience
>> and perhaps see if we can do something to fix this particular situation 
in
>> the future.
>>
>> Solr currently has two ways to get grouped results (so far!). You can
>> either use Result Grouping or you can use the Collapsing Query Parser.
>> Result grouping seems like the obvious way to go. It's well documented,
>> the
>> parameters are clear, it doesn't use a bunch of weird syntax (ie,
>> {!collapse blah=foo}), and it uses the feature name from SQL (so it comes
>> up in Google).
>>
>> OTOH, if you use faceting with result grouping, which I imagine many
>> people
>> do, you get terrible performance. In our case it went from subsecond to
>> 10-120 seconds for big queries. Insanely bad.
>>
>> Collapsing Query Parser looks like a good way forward for us, and we'll 
be
>> investigating that, but it uses the Expand component that our library
>> doesn't support, to say nothing of the truly bizarre syntax. So this will
>> be a fair amount of effort to switch.
>>
>> I'm curious if there is anything we can do to clean up this situation.
>> What
>> I'd really like to do is:
>>
>> 1. Put a HUGE warning on the Result Grouping docs directing people away
>> from the feature if they plan to use faceting (or perhaps directing them
>> away no matter what?)
>>
>> 2. Work towards eliminating one or the other of these features. They're
>> nearly completely compatible, except for their syntax and performance. 
The
>> collapsing query parser apparently was only written because the result
>> grouping had such bad performance -- In other words, it doesn't exist to
>> provide unique features, it exists to be faster than the old way. Maybe 
we
>> can get rid of one or the other of these, taking the best parts from each
>> (syntax from Result Grouping, and performance from Collapse Query 
Parser)?
>>
>> Thanks,
>>
>> Mike
>>
>> PS -- For some extra context, I want to share some other reasons this is
>> frustrating:
>>
>> 1. I just spent a week upgrading a third-party library so it would 
support
>> grouped results, and another week implementing the feature in our code
>> with
>> tests and everything. That was a waste.
>> 2. It's hard to notice performance issues until after you deploy to a big
>> data environment. This creates a bad situation for users until you detect
>> it and revert the new features.
>> 3. The documentation *could* say something about the fact that a new
>> feature was developed to provide better performance for grouping. It 
could
>> say that us
