RE: Solr java.lang.OutOfMemoryError: Java heap space

2015-09-28 Thread will martin
http://opensourceconnections.com/blog/2014/07/13/reindexing-collections-with-solrs-cursor-support/
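For what it's worth: the deep start is what kills you, because TopDocsCollector has to build a priority queue of start+rows entries, which is why the trace below dies in topDocs() once start hits 50M. The cursor approach in that post avoids the queue entirely, but it needs Solr 4.7+, so it means upgrading from 4.4. A minimal SolrJ sketch of the cursor loop (the URL, core name and "id" uniqueKey are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorDump {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(10000);
        // cursorMark requires a deterministic sort ending on the uniqueKey field
        q.setSort(SolrQuery.SortClause.asc("id"));
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = server.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                // append doc to the current 10M-document split file here
            }
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) break; // unchanged cursor == end of index
            cursor = next;
        }
        server.shutdown();
    }
}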



-Original Message-
From: Ajinkya Kale [mailto:kaleajin...@gmail.com] 
Sent: Monday, September 28, 2015 2:46 PM
To: solr-user@lucene.apache.org; java-u...@lucene.apache.org
Subject: Solr java.lang.OutOfMemoryError: Java heap space

Hi,

I am trying to retrieve all the documents from a solr index in a batched manner.
I have 100M documents. I am retrieving them using the method proposed here 
https://nowontap.wordpress.com/2014/04/04/solr-exporting-an-index-to-an-external-file/
I am dumping splits of 10M documents into each file. I get an "OutOfMemoryError" once 
start reaches 50M; I get the same error even with rows=10 at start=50M.
A curl with start=0 and rows=50M in one go works fine, but things go bad when start 
is at 50M.
My Solr version is 4.4.0.

Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.search.TopDocsCollector.topDocs(TopDocsCollector.java:146)
    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1502)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1363)
    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:474)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:434)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)

--aj



Re: Keeping faster and slower solr slaves alined with the same index version

2016-11-11 Thread Will Martin
Csongor:

If session locking is new to you, here is a comprehensive explanation of 
the "Active - Active multi-region" scenario you're encountering and how 
Netflix resolves the matter. Although I remain confused by a 15-minute 
network transfer of non-optimized segments; or, for that matter, by whether 
you are replicating after an optimize rather than after a commit, so that 
all files are being shipped.

http://techblog.netflix.com/2013/12/active-active-for-multi-regional.html

regards,

will



On 11/7/2016 11:13 AM, Erick Erickson wrote:
> Not that I know of. Can you session lock users to a particular region?
>
> Best,
> Erick
>
> On Sun, Nov 6, 2016 at 7:49 PM, Csongor Gyuricza
>  wrote:
>> We have the following high-level solr setup:
>>
>> region a) 1 solr master + 3 slaves
>> region b) 1 solr repeater (pointing to master in region a) + 3 slaves
>>
>> In region (a) Replication takes about 2 min from the master to the 3
>> slaves. Due to our network topology, replication from the master to the
>> repeater takes about 15 min after which, it takes another 2 min for the
>> replication to occur between the repeater and the slaves in region (b), so
>> the slaves in region (b) are always 15 min behind the slaves in region (a)
>> which is a problem because all slaves are behind a latency-based route53
>> record. Clients are noticing the difference because they are getting
>> inconsistent data during those 15 min.
>>
>> I would like to solve this inconsistency. Is there a way to make the faster
>> slaves in region (a) wait for all slaves in region (b) to complete
>> replication and then have all 6 slaves switch to the new index
>> simultaneously? if not, what is the alternative solution to this problem?
>>
>> - Csongor
>>
>> Note: We are on solr 3.5 (old, yes I know...)



Re: Is it possible to do pivot grouping in SOLR?

2016-11-17 Thread Will Martin
well, a quickly formulated query against some strange kind of 
endpoint...

collapse and expand, with expand.sort.

look it up; it's in the ref guide.
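
For instance, a sketch of the request shape (the field names are assumptions):

q=*:*&fq={!collapse field=category}&expand=true&expand.sort=color asc&expand.rows=10

The collapse filter keeps one representative document per category group; the 
expand component then returns the other members of each collapsed group, sorted 
by expand.sort, which gives the group-within-a-group effect.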

On 11/17/2016 1:42 PM, bbarani wrote:
> Is there a way to do pivot grouping (group within a group) in SOLR?
>
> We initially group the results by category and inturn we are trying to group
> the data under one category based on another field. Is there a way to do
> that?
>
> Categories (group by)
> |--Shop
>|---Color (group by)
> |--Support
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Is-it-possible-to-do-pivot-grouping-in-SOLR-tp4306352.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: field set up help

2016-11-17 Thread Will Martin
don't give up yet kris.

q={!prefix f=metatag.date}2016-10&debugQuery

g'luck

will
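
PS: if the prefix query pans out but you later want real date semantics, the 
ParseDateFieldUpdateProcessorFactory Erik mentions below is configured in 
solrconfig.xml; a minimal sketch (the chain name and format list are 
assumptions, and by default it only converts values for fields that resolve 
to a date type in the schema):

<updateRequestProcessorChain name="parse-date">
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <arr name="format">
      <str>yyyy-MM-dd</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>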

On 11/17/2016 5:56 PM, Kris Musshorn wrote:

This q={!prefix f=metatag.date}2016-10 returns zero records

-Original Message-
From: KRIS MUSSHORN [mailto:mussho...@comcast.net]
Sent: Thursday, November 17, 2016 3:00 PM
To: solr-user@lucene.apache.org
Subject: Re: field set up help

so if the field was named metatag.date q={!prefix f=metatag.date}2016-10

- Original Message -

From: "Erik Hatcher" 
To: solr-user@lucene.apache.org
Sent: Thursday, November 17, 2016 2:46:32 PM
Subject: Re: field set up help

Given what you’ve said, my hunch is you could make the query like this:

q={!prefix f=field_name}2016-10

tada!  ?!

there’s nothing wrong with indexing dates as text like that, as long as your 
queries are performantly possible.   And in the case of the query type you 
mentioned, the text/string’ish indexing you’ve done is suited quite well to 
prefix queries to grab dates by year, year-month, and year-month-day.   But if you 
need to get more sophisticated with date queries (DateRangeField is my new 
favorite), you can leverage ParseDateFieldUpdateProcessorFactory without having to 
change the incoming format.

Erik






On Nov 17, 2016, at 1:55 PM, KRIS MUSSHORN 
 wrote:


I have a field in solr 5.4.1 that has values like:
2016-10-15
2016-09-10
2015-10-12
2010-09-02

Yes it is a date being stored as text.

I am getting the data onto solr via nutch and the metatag plug in.

The data is coming directly from the website I am crawling and I am not able to 
change the data at the source to something more palatable.

The field is set in solr to be of type TextField that is indexed, tokenized, 
stored, multivalued and norms are omitted.

Both the index and query analysis chains contain just the whitespace tokenizer 
factory and the lowercase filter factory.

I need to be able to query for 2016-10 and only match 2016-10-15.

Any ideas on how to set this up?

TIA

Kris










Re: Solr Capacity Planning

2017-06-17 Thread Will Martin
MODERATOR REQUESTED: 

> On Jun 17, 2017, at 3:56 AM, Greenhorn Techie  
> wrote:
> 
> Hi,
> 
> We are planning to setup a Solr cloud for building a search application on
> huge volumes of data points (~hundreds of billions of solr documents) I
> would like to understand if there is any recommendation on how to size the
> infrastructure and hardware requirements for Solr clusters. Also, what are
> the best practices to consider during this setup.
> 
> Thanks

Seriously.
Will Martin



Re: Should zookeeper be run on the worker machines?

2016-11-23 Thread Will Martin
This is laughable; the SO use-case wording and the request here. IMO, of 
course.


On 11/23/2016 11:00 AM, Tech Id wrote:
> Hi,
>
> Can someone please respond to this zookeeper-for-Solr Stack-Overflow
> question: http://stackoverflow.com/questions/40755137/should-
> zookeeper-be-run-on-the-worker-machines
>
> Thanks
> TI
>



Re: SOLR vs mongdb

2016-11-23 Thread Will Martin
MongoDB governance criticisms, and a recognition that the OSS community is 
full of people trying new combinations, or old negated scenarios. So 
governance models have to be coded; Solr has such. This Lucene listserv 
serves that purpose, among other resources, in coaching work in a secure 
and best-practice manner; where does that exist for anything other than Lucene?


On 11/24/2016 12:13 AM, Walter Underwood wrote:
> Sure. Someone sends an HTTP request that deletes all the content. I’m glad to 
> share the curl request.
>
> Or you can put content in with fields that are indexed but not stored. Then 
> the content is “gone” as soon
> as you send it to Solr.
>
> Or you change the schema and need to reindex, but don’t have copies of the 
> original content.
>
> Or there there is some disk problem and some docs are not in the backup 
> because the backups aren’t
> transactional.
>
> I’m sure there are other situations.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Nov 23, 2016, at 9:00 PM, Kris Musshorn  wrote:
>>
>> Will someone please give me a detailed scenario where solr content could 
>> "disappear"?
>>
>> Disappear means what exactly?
>>
>> TIA,
>> Kris
>>
>>
>> -Original Message-
>> From: Walter Underwood [mailto:wun...@wunderwood.org]
>> Sent: Wednesday, November 23, 2016 7:47 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: SOLR vs mongdb
>>
>> Well, I didn’t actually recommend MongoDB as a repository. :-)
>>
>> If you want transactions and search, buy MarkLogic. I worked there for two 
>> years, and that is serious non-muggle technology.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Nov 23, 2016, at 4:43 PM, Alexandre Rafalovitch  
>>> wrote:
>>>
>>> Actually, you need to be ok that your content will disappear when you
>>> use MongoDB as well :-(
>>>
>>> But I understand what you were trying to say.
>>> 
>>> http://www.solr-start.com/ - Resources for Solr users, new and
>>> experienced
>>>
>>>
>>> On 24 November 2016 at 11:34, Walter Underwood  
>>> wrote:
 The choice is simple. Are you OK if all your content disappears and you 
 need to reload?
 If so, use Solr. If not, you need some kind of repository. It can be files 
 in Amazon S3.
 But Solr is not designed to preserve your data.

 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)


> On Nov 23, 2016, at 4:12 PM, Alexandre Rafalovitch  
> wrote:
>
> Solr supports automatic detection of content types for new fields.
> That was - unfortunately - named as schemaless mode. It still is
> typed under the covers and has limitations. Such as needing all
> automatically created fields to be multivalued (by the default
> schemaless definition).
>
> MongoDB is better about actually storing content, especially nested
> content. Solr can store content, but that's not what it is about.
> You can totally turn off all the stored flags in Solr and return
> just document ids, while storing the content in MongoDB.
>
> You can search in Mongo and you can store content in Solr, so for
> simple use cases you can use either one to serve both cause. But you
> can also pound nails with a brick and make holes with a hammer.
>
> Oh, and do not read this as me endorsing MongoDB. I would probably
> look at Postgress with JSON columns instead, as it is more reliable
> and feature rich.
>
> Regards,
> Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and
> experienced
>
>
> On 24 November 2016 at 07:34, Prateek Jain J
>  wrote:
>> SOLR also supports, schemaless behaviour. and my question is same that, 
>> why and where should we prefer mongodb. Web search didn’t helped me on 
>> this.
>>
>>
>> Regards,
>> Prateek Jain
>>
>> -Original Message-
>> From: Rohit Kanchan [mailto:rohitkan2...@gmail.com]
>> Sent: 23 November 2016 07:07 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: SOLR vs mongdb
>>
>> Hi Prateek,
>>
>> I think you are talking about two different animals. Solr(actually
>> embedded
>> lucene) is actually a search engine where you can use different features 
>> like faceting, highlighting etc but it is a document store where for 
>> each text it does create an Inverted index and map that to documents.  
>> Mongodb is also document store but I think it adds basic search 
>> capability.  This is my understanding. We are using mongo for temporary 
>> storage and I think it is good for that where you want to store a key 
>> value document in a collection without any static schema. In Solr you 
>> need to define your schema. In solr you can define dynamic fields too. 
>> Th

Re: XFS or EXT4 on Amazon AWS AMIs

2016-12-22 Thread Will Martin
I'd like to see the MongoDB report(?). The ext4fs design specifications include 
support for large files via allocation placement. MongoDB, the last time I 
checked, does pre-allocation, which gives it the performance benefit of ext4fs's 
multiple design factors (Block and Inode Allocation Policy), but the 
disadvantage of having to rebuild when file lengths are exceeded; at 
which time disk fragmentation may prevent ext4fs from getting the 
allocation pattern it was designed for.

That design feature is going to be unavailable with Solr, where ext4fs dynamic 
allocation is less deterministic. Other performance factors on ext4fs, and 
mutexes (even with guard mutexes), are pretty standard patterns. The point about 
threaded calls sounds like the advantages of the allocation pattern.

Still those statements, *based on a dated reading of mine*, may be out of date 
with the MongoDB report factors.

"ext4 recognizes (better than ext3, anyway) that data locality is generally a 
desirably quality of a filesystem"

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Block_and_Inode_Allocation_Policy

For AWS AMIs, is there an r4 instance type? The c3 and m3 are superseded by *4 
types that have notable improvements in IOPS and don't cost more.

http://howto.unixdev.net/Test_LVM_Trim_Ext4.html   -- not an extended 
performance benchmark, but useful to validate discard/TRIM.

On 12/22/2016 1:32 AM, William Bell wrote:

So what are people recommending for SOLR on AWS on Amazon AMI - ext4 or xfs?

I saw an article about MongoDB - saying performance on Amazon was better
due to a mutex issue on ext4 files and threaded calls.

I have been using ext4 for a long time, but I am moving to r3.* instances
and TRIM / DISCARD support just appears more supported on XFS.








Re: How to train the model using user clicks when use ltr(learning to rank) module?

2017-01-05 Thread Will Martin
http://www.dcc.fc.up.pt/~pribeiro/aulas/na1516/slides/na1516-slides-ir.pdf

  see the relevant sections for good info


On 1/5/2017 3:02 AM, Jeffery Yuan wrote:
> Thanks very much for integrating machine learning to Solr.
> https://github.com/apache/lucene-solr/blob/f62874e47a0c790b9e396f58ef6f14ea04e2280b/solr/contrib/ltr/README.md
>
> In the Assemble training data part: the third column indicates the relative
> importance or relevance of that doc
> Could you please give more info about how to give a score based on what user
> clicks?
>
> I have read
> https://static.aminer.org/pdf/PDF/000/472/865/optimizing_search_engines_using_clickthrough_data.pdf
> http://www.cs.cornell.edu/people/tj/publications/joachims_etal_05a.pdf
> http://alexbenedetti.blogspot.com/2016/07/solr-is-learning-to-rank-better-part-1.html
>
> But still have no clue how to translate the partial pairwise feedback to the
> importance or relevance of that doc.
>
>  From a user's perspective, the steps such as setup the feature and model in
> Solr is simple, but collecting the feedback data and train/update the model
> is much more complex.
>
> It would be great Solr can provide some detailed instruction or sample code
> about how to translate the partial pairwise feedback and use it to train and
> update model.
>
> Thanks again for your help.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-train-the-model-using-user-clicks-when-use-ltr-learning-to-rank-module-tp4312462.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to train the model using user clicks when use ltr(learning to rank) module?

2017-01-05 Thread Will Martin
In the Assemble training data part: the third column indicates the relative
importance or relevance of that doc
Could you please give more info about how to give a score based on what user
clicks?

Hi Jeffery,

Give your questions more detail and there may be more feedback; just a 
suggestion.
About above,

Some examples of assigning "relative" weighting to training data, from user 
click info gathered (all assumed, but similar to Omniture monitoring):
- position in the result list
- above/below the fold
- result page number

As an information engineer, you might see two attributes here: a) user 
perseverance, b) effort to find the result.

From there, the attributes have a correlation relationship that is not 
linear and directly proportional, I think: easy-to-find outweighs user 
perseverance every time, because it reduces the need for such extensive 
perseverance; page #3, for example, doesn't mitigate effort, it drives 
effort towards lower user-perseverance need-value pairs. OK, that is damn 
confusing. But it's what I would want to do: use the pair in a manner that 
reranks a document as if the perseverance and effort were balanced and 
positioned ... "relative" to the other training data. What that equation 
is will take some more effort.

I'm not sure this response is helpful at all, but I'm going to go with it 
because I recognize all of it from AOL, Microsoft and Comcast work, before 
the days of ML in search.
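
To make that concrete, a toy sketch in Java; the weights are invented for 
illustration, not a validated click model:

public class ClickGrade {

    // Map raw click context to a 0-4 relevance grade for a judgment list.
    static int grade(int position, int pageNumber, boolean clicked) {
        if (!clicked) return 0;
        // Effort to find: deeper positions and later pages cost the user more.
        double effort = Math.log(1 + position) + 2.0 * (pageNumber - 1);
        // A click that survives high effort signals perseverance, so it earns
        // a higher relative grade than an easy first-position click.
        return (int) Math.round(Math.min(4.0, 1.0 + Math.min(3.0, effort)));
    }

    public static void main(String[] args) {
        System.out.println(grade(1, 1, true)); // easy find -> grade 2
        System.out.println(grade(8, 3, true)); // persevering user -> grade 4
    }
}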

On 1/5/2017 3:33 PM, Jeffery Yuan wrote:

Thanks , Will Martin.

I checked the pdf; it's great, but it seems not very useful for my question: how
to train the model using user clicks when using the ltr (learning to rank) module.

I know the concept after reading these papers. But still not sure how to
code them.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-train-the-model-using-user-clicks-when-use-ltr-learning-to-rank-module-tp4312462p4312592.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: How to train the model using user clicks when use ltr(learning to rank) module?

2017-01-06 Thread Will Martin
ah. very nice Diego. Thanks.

On 1/6/2017 1:52 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

Hi Jeffery,
I submitted a patch to the README of the learning to rank example folder, 
trying to explain better how to produce a training set given a log with 
interaction data.

Patch is available here: https://issues.apache.org/jira/browse/SOLR-9929
And you can see the new version of the README here:  
https://github.com/bloomberg/lucene-solr/blob/master-ltr/solr/contrib/ltr/example/README.md

Please let me know if you have comments or more questions.
Cheers
Diego


From: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org> At: 
01/06/17 03:57:29
To: solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org>
Subject: Re: How to train the model using user clicks when use ltr(learning to 
rank) module?

In the Assemble training data part: the third column indicates the relative
importance or relevance of that doc
Could you please give more info about how to give a score based on what user
clicks?

Hi Jeffery,

Give your questions more detail and there may be more feedback; just a 
suggestion.
About above,

Some examples of assigning "relative" weighting to training data, from user 
click info gathered (all assumed, but similar to Omniture monitoring):
- position in the result list
- above/below the fold
- result page number

As an information engineer, you might see two attributes here: a) user 
perseverance, b) effort to find the result.

From there, the attributes have a correlation relationship that is not 
linear and directly proportional, I think: easy-to-find outweighs user 
perseverance every time, because it reduces the need for such extensive 
perseverance; page #3, for example, doesn't mitigate effort, it drives 
effort towards lower user-perseverance need-value pairs. OK, that is damn 
confusing. But it's what I would want to do: use the pair in a manner that 
reranks a document as if the perseverance and effort were balanced and 
positioned ... "relative" to the other training data. What that equation 
is will take some more effort.

I'm not sure this response is helpful at all, but I'm going to go with it 
because I recognize all of it from AOL, Microsoft and Comcast work, before 
the days of ML in search.

On 1/5/2017 3:33 PM, Jeffery Yuan wrote:

Thanks , Will Martin.

I checked the pdf; it's great, but it seems not very useful for my question: how
to train the model using user clicks when using the ltr (learning to rank) module.

I know the concept after reading these papers. But still not sure how to
code them.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-train-the-model-using-user-clicks-when-use-ltr-learning-to-rank-module-tp4312462p4312592.html
Sent from the Solr - User mailing list archive at Nabble.com.






Re: CloudSolrStream can't set the setZkClientTimeout and setZkConnectTimeout properties

2017-01-19 Thread Will Martin
Default behavior. Client and server negotiate the session timeout within the 
server's min/max bounds, and the ZooKeeper client then uses roughly 2/3 of the 
negotiated value as its read timeout.

This allows a client time to search for a new server before all of its session 
time is consumed. 2/3 of your 10s is about 6.7s, which lines up with the ~6s 
reconnects you are seeing, so the timeout is being honoured.

 zookeeper user @ apache org 

-will

On 1/19/2017 12:59 PM, Yago Riveiro wrote:

I can see some reconnects in my logs, the process of consuming the stream
doesn't broke and continue as normal.

The timeout is 10s but I can see in logs that after 6s the reconnect is
triggered, I don't know if it's the default behaviour or the zk timeout it's
not honoured.



-
Best regards

/Yago
--
View this message in context: 
http://lucene.472066.n3.nabble.com/CloudSolrStream-can-t-set-the-setZkClientTimeout-and-setZkConnectTimeout-properties-tp4313127p4314899.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Solrj similarity setting

2017-01-22 Thread Will Martin
Farida:

Can you define a DSL (domain-specific language)?

If you stand a RESTful API on the DSL, you can write a BM25F similarity, with 
conditionals maybe controlled by a func query accessing field payloads?

:-)
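
For example (a sketch only; note that per-field similarity requires the global 
<similarity class="solr.SchemaSimilarityFactory"/> in the schema), a per-fieldType 
BM25 as a building block for BM25F-style field weighting:

<fieldType name="text_bm25" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.2</float>
    <float name="b">0.75</float>
  </similarity>
</fieldType>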


On 1/21/2017 6:13 AM, Markus Jelsma wrote:

No, this is not possible. A similarity is an index-time setting because it can 
have index-time properties. There is no way to control this at query time.

M



-Original message-


From:Farida Sabry 
Sent: Saturday 21st January 2017 9:58
To: solr-user@lucene.apache.org
Subject: Solrj similarity setting

Is there a way to set the similarity in Solrj query search like we do for
lucene IndexSearcher e.g. searcher.setSimilarity(sim);
I need to define a custom similarity and at the same time get the fields
provided by SolrInputDocument in the returned results by
SolrDocumentList results =  QueryResponse response.getResults()

Any clues how to do that?





Re: Solr performance on EC2 linux

2017-05-01 Thread Will Martin
Ubuntu 16.04 LTS - Xenial (HVM)

Is this your Xenial version?




On 5/1/2017 6:37 PM, Jeff Wartes wrote:
> I tried a few variations of various things before we found and tried that 
> linux/EC2 tuning page, including:
>- EC2 instance type: r4, c4, and i3
>- Ubuntu version: Xenial and Trusty
>- EBS vs local storage
>- Stock openjdk vs Zulu openjdk (Recent java8 in both cases - I’m aware of 
> the issues with early java8 versions and I’m not using G1)
>
> Most of those attempts were to help reduce differences between the data 
> center and the EC2 cluster. In all cases I re-indexed from scratch. I got the 
> same very high system-time symptom in all cases. With the linux changes in 
> place, we settled on r4/Xenial/EBS/Stock.
>
> Again, this was a slightly modified Solr 5.4, (I added backup requests, and 
> two memory allocation rate tweaks that have long since been merged into 
> mainline - released in 6.2 I think. I can dig up the jira numbers if anyone’s 
> interested) I’ve never used Solr 6.x in production though.
> The only reason I mentioned 6.x at all is because I’m aware that ES 5.x is 
> based on Lucene 6.2. I don’t believe my coworker spent any time on tuning his 
> ES setup, although I think he did try G1.
>
> I definitely do want to binary-search those settings until I understand 
> better what exactly did the trick.
> It’s a long cycle time per test is the problem, but hopefully in the next 
> couple of weeks.
>
>
>
> On 5/1/17, 7:26 AM, "John Bickerstaff"  wrote:
>
>  It's also very important to consider the type of EC2 instance you are
>  using...
>  
>  We settled on the R4.2XL...  The R series is labeled "High-Memory"
>  
>  Which instance type did you end up using?
>  
>  On Mon, May 1, 2017 at 8:22 AM, Shawn Heisey  wrote:
>  
>  > On 4/28/2017 10:09 AM, Jeff Wartes wrote:
>  > > tldr: Recently, I tried moving an existing solrcloud configuration 
> from
>  > a local datacenter to EC2. Performance was roughly 1/10th what I’d
>  > expected, until I applied a bunch of linux tweaks.
>  >
>  > How very strange.  I knew virtualization would have overheard, possibly
>  > even measurable overhead, but that's insane.  Running on bare metal is
>  > always better if you can do it.  I would be curious what would happen 
> on
>  > your original install if you applied similar tuning to that.  Would you
>  > see a speedup there?
>  >
>  > > Interestingly, a coworker playing with a ElasticSearch (ES 5.x, so a
>  > much more recent release) alternate implementation of the same index 
> was
>  > not seeing this high-system-time behavior on EC2, and was getting
>  > throughput consistent with our general expectations.
>  >
>  > That's even weirder.  ES 5.x will likely be using Points field types 
> for
>  > numeric fields, and although those are faster than what Solr currently
>  > uses, I doubt it could explain that difference.  The implication here 
> is
>  > that the ES systems are running with stock EC2 settings, not the tuned
>  > settings ... but I'd like you to confirm that.  Same Java version as
>  > with Solr?  IMHO, Java itself is more likely to cause issues like you
>  > saw than Solr.
>  >
>  > > I’m writing this for a few reasons:
>  > >
>  > > 1.   The performance difference was so crazy I really feel like 
> this
>  > should really be broader knowledge.
>  >
>  > Definitely agree!  I would be very interested in learning which of the
>  > tunables you changed were major contributors to the improvement.  If it
>  > turns out that Solr's code is sub-optimal in some way, maybe we can 
> fix it.
>  >
>  > > 2.   If anyone is aware of anything that changed in Lucene 
> between
>  > 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from
>  > this? If it’s the clocksource that’s the issue, there’s an implication 
> that
>  > Solr was using tons more system calls like gettimeofday that the EC2 
> (xen)
>  > hypervisor doesn’t allow in userspace.
>  >
>  > I had not considered the performance regression in 6.4.0 and 6.4.1 that
>  > Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x 
> version?
>  >
>  > =
>  >
>  > Specific thoughts on the tuning:
>  >
>  > The noatime option is very good to use.  I also use nodiratime on my
>  > systems.  Turning these off can have *massive* impacts on disk
>  > performance.  If these are the source of the speedup, then the machine
>  > doesn't have enough spare memory.
>  >
>  > I'd be wary of the "nobarrier" mount option.  If the underlying storage
>  > has battery-backed write caches, or is SSD without write caching, it
>  > wouldn't be a problem.  Here's info about the "discard" mount option, I
>  > don't know whether it applies to your amazon st

RE: about Solr log file

2014-10-22 Thread Will Martin
Hi Lee:

I'm returning to the Solr/Lucene community and haven't made it to SolrCloud
yet, but with reference to discrete servers:

If you put a logback configuration in place, you can, with configuration
alone and (my choice) gelf4j, send each logging flow to a graylog2 server. In
graylog2 you could create a stream that routes on the "source" field (which
is the hostname) and then be able to review your logs in aggregate or by
host.
http://logback.qos.ch/
https://github.com/realityforge/gelf4j
http://www.graylog2.org/



regards,
Will



-Original Message-
From: Lee Chunki [mailto:lck7...@coupang.com] 
Sent: Wednesday, October 22, 2014 10:50 PM
To: solr-user
Subject: about Solr log file

Hi,

I have two questions about Solr log file.

First,
Is it possible to set log setting to use one log file for each core?
Because of I run many cores on one Solr and log file is getting bigger and
bigger and it makes me to hard to debug when system error.

Second,
Is there any setting to gather Solr Cloud logs at any one server?
I have plan to migrate to Solr Cloud but it seems that each sold node makes
log at their local disk.

Thanks,
Chunki.



RE: How to properly use Levenstein distance with ~ in Java

2014-10-23 Thread Will Martin
In terms of recent work with edit distance (specifically Levenshtein), given 
your expressed interest you might find this paper provocative.

"We measure the keyword similarity between two strings
by lemmatizing them, removing stopwords, and computing
the cosine similarity. We then include the keyword similar-
ity between the query and the input question, the keyword
similarity between the query and the returned evidence, and
an indicator feature for whether the query involves a join.
The evidence features compute KB-specific properties... We compute the join-key 
string similarity mea-
sured using the Levenshtein distance.
"

http://dx.doi.org/10.1145/2623330.2623677
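
For reference, the Levenshtein distance they use for the join keys is the 
textbook dynamic program; a minimal sketch in Java:

public class Levenshtein {
    // Classic DP edit distance: insert, delete, substitute each cost 1.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}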

re
will


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Thursday, October 23, 2014 12:05 PM
To: solr-user
Subject: Re: How to properly use Levenstein distance with ~ in Java

The last real update on that is 2.5 years old. Is there more recent update? I 
am interested in this topic as well.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and 
newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers 
community: https://www.linkedin.com/groups?gid=6713853


On 23 October 2014 10:10, Walter Underwood  wrote:
> We’re reimplementing fuzzy support in edismax on Solr 4.x right now. 
> See: https://issues.apache.org/jira/browse/SOLR-629
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/
>
> On Oct 22, 2014, at 11:08 PM, karsten-s...@gmx.de wrote:
>
>> Hi Aleksander,
>>
>> The Fuzzy Searche '~' is not supported in dismax (defType=dismax) 
>> https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Par
>> ser
>>
>> You are using SearchComponent "spellchecker". This does not change the query 
>> results.
>>
>>
>> btw: It looks like you are using path "/select" with qt=dismax. This normaly 
>> would throw an exception.
>> Is there a tag
>>   > inside your solrconfig.xml ?
>>
>> Best regards
>>
>>   Karsten
>>
>> P.S. in Context: 
>> http://lucene.472066.n3.nabble.com/How-to-properly-use-Levenstein-dis
>> tance-with-in-Java-td4164793.html
>>
>>
>>> On 20 October 2014 11:13, Aleksander Sadecki wrote:
>>>
>>> Ok, thank you for your response. But why I cannot use '~'?
>



RE: OpenExchangeRates.Org rates in solr

2014-10-26 Thread Will Martin
Hi Olivier:

Can you clarify this message? Are you using Solr at the business? Or are you 
giving free access to solr installations?

Thanks,
Will


-Original Message-
From: Olivier Austina [mailto:olivier.aust...@gmail.com] 
Sent: Sunday, October 26, 2014 10:57 AM
To: solr-user@lucene.apache.org
Subject: OpenExchangeRates.Org rates in solr

Hi,

Is there a way to see the OpenExchangeRates.Org rates used in Solr somewhere? 
I have changed the configuration to use these rates. Thank you.
Regards
Olivier



RE: OpenExchangeRates.Org rates in solr

2014-10-26 Thread Will Martin
Cool. After I wrote, it occurred to me that the exchange-rates API might make 
for a very useful contrib/ component.

Good luck.
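
For reference, the hookup lives on the currency field type in schema.xml; 
something like this sketch (the app-id placeholder and refresh interval are 
assumptions):

<fieldType name="currency" class="solr.CurrencyField"
           providerClass="solr.OpenExchangeRatesOrgProvider"
           refreshInterval="60"
           ratesFileLocation="http://www.openexchangerates.org/api/latest.json?app_id=yourAppIdKey"/>

Since Solr just re-reads that JSON every refreshInterval minutes, fetching the 
ratesFileLocation URL directly is the easiest way to see the rates being used.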

-Original Message-
From: Olivier Austina [mailto:olivier.aust...@gmail.com] 
Sent: Sunday, October 26, 2014 3:11 PM
To: solr-user@lucene.apache.org
Subject: Re: OpenExchangeRates.Org rates in solr

Hi Will,

I am learning Solr now. I can use it  later for business or for free access. 
Thank you.

Regards
Olivier


2014-10-26 17:32 GMT+01:00 Will Martin :

> Hi Olivier:
>
> Can you clarify this message? Are you using Solr at the business? Or 
> are you giving free access to solr installations?
>
> Thanks,
> Will
>
>
> -Original Message-
> From: Olivier Austina [mailto:olivier.aust...@gmail.com]
> Sent: Sunday, October 26, 2014 10:57 AM
> To: solr-user@lucene.apache.org
> Subject: OpenExchangeRates.Org rates in solr
>
> Hi,
>
> Is there a way to see the OpenExchangeRates.Org rates used in Solr 
> somewhere? I have changed the configuration to use these rates. Thank you.
> Regards
> Olivier
>
>



RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Will Martin
2 naïve comments, of course.

 

-  Queuing theory

-  Zookeeper logs.

 

From: S.L [mailto:simpleliving...@gmail.com] 
Sent: Monday, October 27, 2014 1:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of 
synch.

 

Please find the clusterstate.json attached.

Also, in this case at least the Shard1 replicas are out of sync, as can be seen 
below.

Shard 1 replica 1 *does not* return a result with distrib=false.

Query:
http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

 

Result:

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="0" start="0"/><lst name="debug"/></response>

 

Shard1 replica 2 *does* return the result with distrib=false.

Query:
http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true

Result:

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="thingURL">http://www.xyz.com</str><str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long name="_version_">1483135330558148608</long></doc></result><lst name="debug"/></response>

 

On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar 
 wrote:

On Mon, Oct 27, 2014 at 9:40 PM, S.L  wrote:

> One is not smaller than the other, because the numDocs is same for both
> "replicas" and essentially they seem to be disjoint sets.
>

That is strange. Can we see your clusterstate.json? With that, please also
specify the two replicas which are out of sync.

>
> Also manually purging the replicas is not option , because this is
> "frequently" indexed index and we need everything to be automated.
>
> What other options do I have now.
>
> 1. Turn of the replication completely in SolrCloud
> 2. Use traditional Master Slave replication model.
> 3. Introduce a "replica" aware field in the index , to figure out which
> "replica" the request should go to from the client.
> 4. Try a distribution like Helios to see if it has any different behavior.
>
> Just think out loud here ..
>
> On Mon, Oct 27, 2014 at 11:56 AM, Markus Jelsma <
> markus.jel...@openindex.io>
> wrote:
>
> > Hi - if there is a very large discrepancy, you could consider to purge
> the
> > smallest replica, it will then resync from the leader.
> >
> >
> > -Original message-
> > > From:S.L 
> > > Sent: Monday 27th October 2014 16:41
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
> replicas
> > out of synch.
> > >
> > > Markus,
> > >
> > > I would like to ignore it too, but whats happening is that the there
> is a
> > > lot of discrepancy between the replicas , queries like
> > > q=*:*&fq=(id:220a8dce-3b31-4d46-8386-da8405595c47) fail depending on
> > which
> > > replica the request goes to, because of huge amount of discrepancy
> > between
> > > the replicas.
> > >
> > > Thank you for confirming that it is a know issue , I was thinking I was
> > the
> > > only one facing this due to my set up.
> > >
> > > On Mon, Oct 27, 2014 at 11:31 AM, Markus Jelsma <
> > markus.jel...@openindex.io>
> > > wrote:
> > >
> > > > It is an ancient issue. One of the major contributors to the issue
> was
> > > > resolved some versions ago but we are still seeing it sometimes too,
> > there
> > > > is nothing to see in the logs. We ignore it and just reindex.
> > > >
> > > > -Original message-
> > > > > From:S.L 
> > > > > Sent: Monday 27th October 2014 16:25
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1
> > replicas
> > > > out of synch.
> > > > >
> > > > > Thank Otis,
> > > > >
> > > > > I have checked the logs , in my case the default catalina.out and I
> > dont
> > > > > see any OOMs or , any other exceptions.
> > > > >
> > > > > What others metrics do you suggest ?
> > > > >
> > > > > On Mon, Oct 27, 2014 at 9:26 AM, Otis Gospodnetic <
> > > > > otis.gospodne...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > You may simply be overwhelming your cluster-nodes. Have you
> checked
> > > > > > various metrics to see if that is the case?
> > > > > >
> > > > > > Otis
> > > > > > --
> > > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> > Management
> > > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Oct 26, 2014, at 9:59 PM, S.L 
> > wrote:
> > > > > > >
> > > > > > > Folks,
> > > > > > >
> > > > > > > I have posted previously about this , I am using SolrCloud
> > 4.10.1 and
> > > > > > have
> > > > > > > a sharded collection with  6 nodes , 3 shards and a r

RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Will Martin
Erick Erickson has a comment on a thread out there that says there's a lot of 
pinging between SolrCloud and ZK. AND if a timeout occurs (which could be 
fallback behavior on that exception), ZK will mark the node down AND SolrCloud 
won't use it until the node gets back online.
Fwiw.


-Original Message-
From: S.L [mailto:simpleliving...@gmail.com] 
Sent: Monday, October 27, 2014 9:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of 
synch.

Good point about ZK logs , I do see the following exceptions intermittently in 
the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client 
/xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection 
from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new 
session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO
[CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14949db9da40037 with negotiated timeout 1 for client
/xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 
0x14949db9da40037, likely client has closed socket
at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:744)

For queuing theory, I don't know of any way to see how fast the requests are 
being served by SolrCloud, or whether a queue is being maintained when the service 
rate is slower than the rate of requests from the incoming multiple threads.

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin  wrote:

> 2 naïve comments, of course.
>
>
>
> -  Queuing theory
>
> -  Zookeeper logs.
>
>
>
> From: S.L [mailto:simpleliving...@gmail.com]
> Sent: Monday, October 27, 2014 1:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 
> replicas out of synch.
>
>
>
> Please find the clusterstate.json attached.
>
> Also in this case atleast the Shard1 replicas are out of sync , as can 
> be seen below.
>
> Shard 1 replica 1 *does not* return a result with distrib=false.
>
> Query:
> http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true
>
>
>
> Result :
>
> <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst>
> <result name="response" numFound="0" start="0"/><lst name="debug"/></response>
>
>
>
> Shard1 replica 2 *does* return the result with distrib=false.
>
> Query:
> http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true
>
> Result:
>
> <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst>
> <result name="response" numFound="1" start="0"><doc><str name="thingURL">http://www.xyz.com</str><str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long name="_version_">1483135330558148608</long></doc></result><lst name="debug"/></response>
>
>
>
> On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar < 
> shalinman...@gmail.com> wrote:
>
> On Mon, Oct 27, 2014 at 9:40 PM, S.L  wrote:
>
> > One is not smaller than the other, because the numDocs is same for 
> > both "replicas" and essentially they seem to be disjoint sets.
> >
>
> That is strange. Can we see your clusterstate.json? With that, please 
> also specify the two replicas which are out of sync.
>
> >
> > Also manually purging the replic

RE: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Will Martin
The easiest, and coarsest, measure of response time [not service time in a 
distributed system] can be picked up in your localhost_access.log file.
You're using Tomcat, right? Look up AccessLogValve in the docs and server.xml. 
You can add configuration to report the payload size and time to service the 
request without touching any code.
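
A sketch of the valve, for reference (the pattern codes are standard Tomcat 
ones: %b is the response payload in bytes, %D the time to process the request 
in milliseconds):

<Valve className="org.apache.catalina.valves.AccessLogValve"
       directory="logs" prefix="localhost_access_log." suffix=".txt"
       pattern="%h %t &quot;%r&quot; %s %b %D"/>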

Queueing theory is what Otis was talking about when he said you've saturated 
your environment. In AWS people just auto-scale up and don't worry about where 
the load comes from; it's dumb if it happens more than 2 times. Capacity 
planning is tough; let's hope it doesn't disappear altogether.

G'luck


-Original Message-
From: S.L [mailto:simpleliving...@gmail.com] 
Sent: Monday, October 27, 2014 9:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of 
synch.

Good point about ZK logs , I do see the following exceptions intermittently in 
the ZK log.

2014-10-27 06:54:14,621 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client 
/xxx.xxx.xxx.xxx:56877 which had sessionid 0x34949dbad580029
2014-10-27 07:00:06,697 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection 
from /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,725 [myid:1] - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new 
session at /xxx.xxx.xxx.xxx:37336
2014-10-27 07:00:06,746 [myid:1] - INFO
[CommitProcessor:1:ZooKeeperServer@617] - Established session
0x14949db9da40037 with negotiated timeout 1 for client
/xxx.xxx.xxx.xxx:37336
2014-10-27 07:01:06,520 [myid:1] - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 
0x14949db9da40037, likely client has closed socket
at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:744)

For queuing theory, I don't know of any way to see how fast the requests are 
being served by SolrCloud, or whether a queue is being maintained when the service 
rate is slower than the rate of requests from the incoming multiple threads.

On Mon, Oct 27, 2014 at 7:09 PM, Will Martin  wrote:

> 2 naïve comments, of course.
>
>
>
> -  Queuing theory
>
> -  Zookeeper logs.
>
>
>
> From: S.L [mailto:simpleliving...@gmail.com]
> Sent: Monday, October 27, 2014 1:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 
> replicas out of synch.
>
>
>
> Please find the clusterstate.json attached.
>
> Also in this case atleast the Shard1 replicas are out of sync , as can 
> be seen below.
>
> Shard 1 replica 1 *does not* return a result with distrib=false.
>
> Query:
> http://server3.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true
>
>
>
> Result :
>
> <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst>
> <result name="response" numFound="0" start="0"/><lst name="debug"/></response>
>
>
>
> Shard1 replica 2 *does* return the result with distrib=false.
>
> Query:
> http://server2.mydomain.com:8082/solr/dyCollection1/select/?q=*:*&fq=%28id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5%29&wt=xml&distrib=false&debug=track&shards.info=true
>
> Result:
>
> <response><lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="q">*:*</str><str name="shards.info">true</str><str name="distrib">false</str><str name="debug">track</str><str name="wt">xml</str><str name="fq">(id:9f4748c0-fe16-4632-b74e-4fee6b80cbf5)</str></lst></lst>
> <result name="response" numFound="1" start="0"><doc><str name="thingURL">http://www.xyz.com</str><str name="id">9f4748c0-fe16-4632-b74e-4fee6b80cbf5</str><long name="_version_">1483135330558148608</long></doc></result><lst name="debug"/></response>
>
>
>
> On Mon, Oct 27, 2014 at 12:19 PM, Shalin Shekhar Mangar < 
> shalinman...@gmail.com> wrote:
>
> On Mon, Oct 27, 2014 at 9:40 PM, S.L  

RE: unable to build solr 4.10.1

2014-10-27 Thread Will Martin
There is no Javadoc jar at that location. Does that help?

-Original Message-
From: Karunakar Reddy [mailto:karunaka...@gmail.com] 
Sent: Tuesday, October 28, 2014 2:41 AM
To: solr-user@lucene.apache.org
Subject: unable to build solr 4.10.1

Hi ,

I am getting below error while doing "ant dist" .

:: problems summary ::
[ivy:retrieve]  WARNINGS
[ivy:retrieve] [FAILED ]
javax.activation#activation;1.1.1!activation.jar(javadoc):  (0ms) 
[ivy:retrieve]  shared: tried [ivy:retrieve]  
/home/.ivy2/shared/javax.activation/activation/1.1.1/javadocs/activation.jar
[ivy:retrieve]  public: tried
[ivy:retrieve]
http://repo1.maven.org/maven2/javax/activation/activation/1.1.1/activation-1.1.1-javadoc.jar
[ivy:retrieve] ::
[ivy:retrieve] ::  FAILED DOWNLOADS::
[ivy:retrieve] :: ^ see resolution messages for details  ^ ::
[ivy:retrieve] ::
[ivy:retrieve] :: javax.activation#activation;1.1.1!activation.jar(javadoc)
[ivy:retrieve] ::
[ivy:retrieve]
[ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

BUILD FAILED
/home/solr_trunk/solr-4.10.1/solr/build.xml:339: The following error occurred 
while executing this line:
/home/solr_trunk/solr-4.10.1/solr/common-build.xml:438: The following error 
occurred while executing this line:
/home/solr_trunk/solr-4.10.1/solr/contrib/contrib-build.xml:52: impossible to 
resolve dependencies:
resolve failed - see output for details


Please tell me if i am doing something wrong.


Thanks,
Karunakar.



RE: Log message "zkClient has disconnected".

2014-10-27 Thread Will Martin
Modassar:

Can you share your hw setup?  And what size are your batches? Can you make them 
smaller; it doesn't mean your throughput will necessarily suffer. 
Re
Will


-Original Message-
From: Modassar Ather [mailto:modather1...@gmail.com] 
Sent: Tuesday, October 28, 2014 2:12 AM
To: solr-user@lucene.apache.org
Subject: Log message "zkClient has disconnected".

Hi,

I am getting following INFO log messages many a times during my indexing.
The indexing process reads records from a database and, using multiple threads, 
sends them for indexing in batches.
There are four shards and one embedded Zookeeper on one of the shards.

org.apache.zookeeper.ClientCnxn$SendThread run
INFO: Client session timed out, have not heard from server in 9276ms for 
sessionid , closing socket connection and attempting reconnect 
org.apache.solr.common.cloud.ConnectionManager process
INFO: Watcher org.apache.solr.common.cloud.ConnectionManager@3debc153
name:ZooKeeperConnection Watcher:: got event WatchedEvent 
state:Disconnected type:None path:null path:null type:None 
org.apache.solr.common.cloud.ConnectionManager process
INFO: zkClient has disconnected

Kindly help me understand the possible cause of Zookeeper state disconnection.

Thanks,
Modassar



RE: Sharding configuration

2014-10-28 Thread Will Martin
Informational only. FYI

 

Machine parallelism has been empirically proven to be application dependent.

See DaCapo benchmarks (lucene indexing and lucene searching) use in  
http://dx.doi.org/10.1145/2479871.2479901

 

" Parallelism profiling and wall-time prediction for multi-threaded 
applications" 2013.

 


-Original Message-
From: Ramkumar R. Aiyengar [mailto:andyetitmo...@gmail.com] 
Sent: Tuesday, October 28, 2014 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Sharding configuration

 

As far as the second option goes, unless you are using a large amount of memory 
and you reach a point where a JVM can't sensibly deal with a GC load, having 
multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't 
reached that point. There are also other shared resources at an instance level 
like connection pools and ZK connections, but those are tunable and you 
probably aren't pushing them as well (I would imagine you are just trying to 
have only a handful of shards given that you aren't sharded at all currently).

 

That leaves single vs multiple machines. Assuming the network isn't a 
bottleneck, and given the same amount of resources overall (number of cores, 
amount of memory, IO bandwidth times number of machines), it shouldn't matter 
between the two. If you are procuring new hardware, I would say buy more, 
smaller machines, but if you already have the hardware, you could serve as much 
as possible off a machine before moving to a second. There's nothing which 
limits the number of shards as long as the underlying machine has the 
sufficient amount of parallelism.

 

Again, this advice is for a small number of shards, if you had a lot more

(hundreds) of shards and significant volume of requests, things start to become 
a bit more fuzzy with other limits kicking in.

On 28 Oct 2014 09:26, "Anca Kopetz" <  
anca.kop...@kelkoo.com> wrote:

 

> Hi,

> 

> We have a SolrCloud configuration of 10 servers, no sharding, 20 

> millions of documents, the index has 26 GB.

> As the number of documents has increased recently, the performance of 

> the cluster decreased.

> 

> We thought of sharding the index, in order to measure the latency. 

> What is the best approach ?

> - to use shard splitting and have several sub-shards on the same 

> server and in the same tomcat instance

> - having several shards on the same server but on different tomcat 

> instances

> - having one shard on each server (for example 2 shards / 5 replicas 

> on

> 10 servers)

> 

> What's the impact of these 3 configuration on performance ?

> 

> Thanks,

> Anca

> 

> --

> 

> Kelkoo SAS

> Société par Actions Simplifiée

> Au capital de € 4.168.964,30

> Siège social : 8, rue du Sentier 75002 Paris

> 425 093 069 RCS Paris

> 

> This message and its attachments are confidential and intended solely 
> for their addressees. If you are not the intended recipient of this 
> message, please delete it and notify the sender.

> 



RE: Solr Memory Usage

2014-10-29 Thread Will Martin
This command only touches OS level caches that hold pages destined for (or
not) the swap cache. Its use means that disk will be hit on future requests,
but in many instances the pages were headed for ejection anyway.

It does not have anything whatsoever to do with Solr caches.  It also is not
fragmentation related; it is a result of the kernel managing virtual pages
in an "as designed manner". The proper command is

#sync; echo 3 >/proc/sys/vm/drop_caches. 

http://linux.die.net/man/5/proc

I have encountered resistance on the use of this on long-running processes
for years ... from people who don't even research the matter.



-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] 
Sent: Wednesday, October 29, 2014 3:06 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Memory Usage

Vijay Kokatnur [kokatnur.vi...@gmail.com] wrote:
> For the Solr Cloud setup, we are running a cron job with following 
> command to clear out the inactive memory.  It  is working as expected.  
> Even though the index size of Cloud is 146GB, the used memory is always
below 55GB.
> Our response times are better and no errors/exceptions are thrown. 
> (This command causes issue in 2 Shard setup)

> echo 3 > /proc/sys/vm/drop_caches

As Shawn points out, this is under normal circumstances a very bad idea,
but...

> Has anyone faced this issue before?

We did have some problems on a 256GB machine churning terabytes of data
through 40 concurrent Tika processes and into Solr. After some days,
performance got really bad. When we did a top, we noticed that most of the
time was used in the kernel (the 'sy' on the '%Cpu(s):'-line). The
drop_caches trick worked for us too. Our systems guys explained that it was
because of virtual memory space fragmentation, so the OS had to spend a lot
of resources just bookkeeping memory.

Try keeping an eye on the fraction of processing power spend on the kernel
from you clear the cache until it performance gets bad again. If it rises
drastically, you might have the same problem.

- Toke Eskildsen



RE: Solr Memory Usage

2014-10-29 Thread Will Martin
Oops. My wording was poor. My reference to those who don't research the
matter was pointing at a large number of engineers I have worked with; not
this list.

-Original Message-
From: Will Martin [mailto:wmartin...@gmail.com] 
Sent: Wednesday, October 29, 2014 6:38 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: Solr Memory Usage

This command only touches OS level caches that hold pages destined for (or
not) the swap cache. Its use means that disk will be hit on future requests,
but in many instances the pages were headed for ejection anyway.

It does not have anything whatsoever to do with Solr caches.  It also is not
fragmentation related; it is a result of the kernel managing virtual pages
in an "as designed manner". The proper command is

#sync; echo 3 >/proc/sys/vm/drop_caches. 

http://linux.die.net/man/5/proc

I have encountered resistance on the use of this on long-running processes
for years ... from people who don't even research the matter.



-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: Wednesday, October 29, 2014 3:06 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Memory Usage

Vijay Kokatnur [kokatnur.vi...@gmail.com] wrote:
> For the Solr Cloud setup, we are running a cron job with following 
> command to clear out the inactive memory.  It  is working as expected.
> Even though the index size of Cloud is 146GB, the used memory is always
below 55GB.
> Our response times are better and no errors/exceptions are thrown. 
> (This command causes issue in 2 Shard setup)

> echo 3 > /proc/sys/vm/drop_caches

As Shawn points out, this is under normal circumstances a very bad idea,
but...

> Has anyone faced this issue before?

We did have some problems on a 256GB machine churning terabytes of data
through 40 concurrent Tika processes and into Solr. After some days,
performance got really bad. When we did a top, we noticed that most of the
time was used in the kernel (the 'sy' on the '%Cpu(s):'-line). The
drop_caches trick worked for us too. Our systems guys explained that it was
because of virtual memory space fragmentation, so the OS had to spend a lot
of resources just bookkeeping memory.

Try keeping an eye on the fraction of processing power spend on the kernel
from you clear the cache until it performance gets bad again. If it rises
drastically, you might have the same problem.

- Toke Eskildsen



Re: exporting to CSV with solrj

2014-10-31 Thread will martin
"Why do you want to use CSV in SolrJ?"  Alexandre are you looking for a
design gig. This kind of question really begs nothing but disdain.
Commodity search exists, not matter what Paul Nelson writes and part of
that problem is due to advanced users always rewriting the reqs and specs
of less experienced users. 

On Fri, Oct 31, 2014 at 1:05 PM, Alexandre Rafalovitch 
wrote:

> Why do you want to use CSV in SolrJ? You would just have to parse it again.
>
> You could just trigger that as a URL call from outside with cURL or as
> just an HTTP (not SolrJ) call from Java client.
>
> Regards,
>Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 31 October 2014 12:34, tedsolr  wrote:
> > Sure thing, but how do I get the results output in CSV format?
> > response.getResults() is a list of SolrDocuments.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166861.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
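
For reference, the URL approach Alexandre describes might look like this (a
sketch; the collection name and field list are placeholders):

# wt=csv makes Solr's response writer emit CSV directly, no client-side parsing
curl "http://localhost:8983/solr/collection1/select?q=*:*&fl=id,name,price&rows=100000&wt=csv" -o export.csv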


RE: How to update SOLR schema from continuous integration environment

2014-11-01 Thread Will Martin
http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Saturday, November 01, 2014 9:46 AM
To: solr-user@lucene.apache.org
Subject: Re: How to update SOLR schema from continuous integration environment

In all honesty, incrementally updating resources of a production server is a 
rather frightening proposition. Parallel testing is always a better way to go - 
bring up any changes in a parallel system for testing and then do an atomic 
"swap" - redirection of requests from the old server to the new server and then 
retire the old server only after the new server has had enough time to burn in 
and get past any infant mortality problems.

That's production. Testing and dev? Who needs the hassle; just tear the old 
server down and bring up the new server from scratch with all resources updated 
from the get-go.

Oh, and the starting point would be keeping your full set of config and 
resource files under source control so that you can carefully review changes 
before they are "pushed", can compare different revisions, and can easily back 
out a revision with confidence rather than "winging it."

That said, a lot of production systems these days are not designed for parallel 
operation and swapping out parallel systems, especially for cloud and cluster 
systems. In these cases the reality is more of a "rolling update", where one 
node at a time is taken down, updated, brought up, tested, brought back into 
production, tested some more, and only after enough burn in time do you move to 
the next node.

This rolling update may also force you to sequence or stage your changes so 
that old and new nodes are at least relatively compatible. So, the first stage 
would update all nodes, one at a time, to the intermediate compatible change, 
and only when that rolling update of all nodes is complete would you move up to 
the next stage of the update to replace the intermediate update with the final 
update. And maybe more than one intermediate stage is required for more complex 
updates.

Some changes might involve upgrading Java jars as well, in a way that might 
cause nodes to give incompatible results, in which case you may need to stage or 
sequence your Java changes as well, so that you don't make the final code 
change until you have verified that all nodes have compatible intermediate code 
that is compatible with both old nodes and new nodes.

Of course, it all depends on the nature of the update. For example, adding more 
synonyms may or may not be harmless with respect to whether existing index data 
becomes invalidated and each node needs to be completely reindexed, or if 
query-time synonyms are incompatible with index-time synonyms. Ditto for just 
about any analysis chain changes - they may be harmless, they may require full 
reindexing, they may simply not work for new data (i.e., a synonym is added in 
response to late-breaking news or an addition to a taxonomy) until nodes are 
updated, or maybe some queries become slightly or somewhat inaccurate until the 
update/reindex is complete.

So, you might want to have two stages of test system - one to just do a raw 
functional test of the changes, like whether your new synonyms work as expected 
or not, and then the pre-production stage which would be updated using exactly 
the same process as the production system, such as a rolling update or staged 
rolling update as required. The closer that pre-production system is run to the 
actual production, the greater the odds that you can have confidence that the 
update won't compromise the production system.

The pre-production test system might have, say, 10% of the production data and 
be only 10% the size of the production system.

In short, for smaller clusters having parallel systems with an atomic 
swap/redirection is probably simplest, while for larger clusters an incremental 
rolling update with thorough testing on a pre-production test cluster is the 
way to go.
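
A bare-bones sketch of that rolling pattern (hostnames, paths, and the ping
handler are placeholders for whatever a given deployment uses):

for node in solr1 solr2 solr3; do
  ssh "$node" 'sudo service solr stop'
  rsync -a conf/ "$node":/var/solr/mycore/conf/     # push the updated resources
  ssh "$node" 'sudo service solr start'
  # wait until the node answers queries before touching the next one
  until curl -sf "http://$node:8983/solr/mycore/admin/ping" >/dev/null; do
    sleep 5
  done
done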

-- Jack Krupansky

-Original Message-
From: Faisal Mansoor
Sent: Saturday, November 1, 2014 12:10 AM
To: solr-user@lucene.apache.org
Subject: How to update SOLR schema from continuous integration environment

Hi,

How do people usually update Solr configuration files from a continuous 
integration environment like TeamCity or Jenkins?

We have multiple development and testing environments and use WebDeploy and 
AwsDeploy type tools to remotely deploy code multiple times a day. To update 
Solr, I wrote a simple node server which accepts a conf folder over HTTP, 
updates the specified core's conf folder, and restarts the Solr service.

Does there exist a standard tool for this use case? I know about the schema 
REST API, but I want to update all the files in the conf folder rather than 
just updating a single file or adding or removing synonyms piecemeal.

Here is the link for the node server I mentioned if anyone is inter

RE: How to update SOLR schema from continuous integration environment

2014-11-01 Thread Will Martin
Well yes. But since there haven't been any devops approaches yet, we really
aren't talking about Continuous Delivery. Continually delivering builds into
production is old hat, and Jack nailed the canonical manners in which it has
been done. It really depends on whether an org is investing in the full
Agile lifecycle. A piece at a time is common.

One possible devop approach:

Once you get near full test automation
: Jenkins builds the target
: chef does due diligence on dependencies
: chef pulls the build over. 
: chef configures the build once it is installed.
: chef takes the machine out of the load-balancer's rotation
: chef puts the machine back in once it is launched and sanity tested (by
chef).
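
A sketch of what that hook could look like when Jenkins drives it (the ELB
name, node, and core here are all illustrative, not a prescription):

#!/bin/bash
set -e
aws elb deregister-instances-from-load-balancer \
    --load-balancer-name solr-lb --instances "$INSTANCE_ID"    # out of rotation
ssh "$NODE" 'sudo chef-client'                 # converge: pull, install, configure
curl -sf "http://$NODE:8983/solr/mycore/admin/ping" >/dev/null # sanity test
aws elb register-instances-with-load-balancer \
    --load-balancer-name solr-lb --instances "$INSTANCE_ID"    # back in rotation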




If you substitute Jack's plan, you get pretty much the same thing; except
that by using devops tools you introduce a little thing called idempotency.



-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Saturday, November 01, 2014 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: How to update SOLR schema from continuous integration
environment

Nice pictures, but that preso does not even begin to answer the question.

With master/slave replication, I do schema migration in two ways, depending
on whether a field is added or removed.

Adding a field:

1. Update the schema on the slaves. A defined field with no data is not a
problem.
2. Update the master.
3. Reindex to populate the field and wait for replication.
4. Update the request handlers or clients to use the new field.

Removing a field is the opposite. I haven't tried lately, but Solr used to
have problems with a field that was in the index but not in the schema.

1. Update the request handlers and clients to stop using the field.
2. Reindex without any data for the field that will be removed, wait for
replication.
3. Update the schema on the master and slaves.

I have not tried to automate this for continuous deployment. It isn't a big
deal for a single server test environment. It is the prod deployment that is
tricky.
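
If one did want to script the wait in step 3, the replication handler reports
index versions; a rough sketch (host and core names are placeholders):

# poll until the slave's index version matches the master's
m=$(curl -s "http://master:8983/solr/mycore/replication?command=indexversion&wt=json")
s=$(curl -s "http://slave1:8983/solr/mycore/replication?command=indexversion&wt=json")
echo "master: $m"; echo "slave: $s"   # repeat until the two versions agree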

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Nov 1, 2014, at 7:29 AM, Will Martin  wrote:

> http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing
> 
> 
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com] 
> Sent: Saturday, November 01, 2014 9:46 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to update SOLR schema from continuous integration
environment
> 
> In all honesty, incrementally updating resources of a production server is
a rather frightening proposition. Parallel testing is always a better way to
go - bring up any changes in a parallel system for testing and then do an
atomic "swap" - redirection of requests from the old server to the new
server and then retire the old server only after the new server has had
enough time to burn in and get past any infant mortality problems.
> 
> That's production. Testing and dev? Who needs the hassle; just tear the
old server down and bring up the new server from scratch with all resources
updated from the get-go.
> 
> Oh, and the starting point would be keeping your full set of config and
resource files under source control so that you can carefully review changes
before they are "pushed", can compare different revisions, and can easily
back out a revision with confidence rather than "winging it."
> 
> That said, a lot of production systems these days are not designed for
parallel operation and swapping out parallel systems, especially for cloud
and cluster systems. In these cases the reality is more of a "rolling
update", where one node at a time is taken down, updated, brought up,
tested, brought back into production, tested some more, and only after
enough burn in time do you move to the next node.
> 
> This rolling update may also force you to sequence or stage your changes
so that old and new nodes are at least relatively compatible. So, the first
stage would update all nodes, one at a time, to the intermediate compatible
change, and only when that rolling update of all nodes is complete would you
move up to the next stage of the update to replace the intermediate update
with the final update. And maybe more than one intermediate stage is
required for more complex updates.
> 
> Some changes might involve upgrading Java jars as well, in a way that
might cause nodes give incompatible results, in which case you may need to
stage or sequence your Java changes as well, so that you don't make the
final code change until you have verified that all nodes have compatible
intermediate code that is compatible with both old nodes and new nodes.
> 
> Of course, it all depends on the nature of the update. For example, adding
more synonyms may or may not be harmless with respect to whether exist

RE: How to update SOLR schema from continuous integration environment

2014-11-02 Thread Will Martin
Well. You don't really think I HAVE a solr installation, do you Walter?  ;-)

No, you're right. The pattern I put out was general.

It depends on the schema change doesn't it?


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Saturday, November 01, 2014 11:42 PM
To: solr-user@lucene.apache.org
Subject: Re: How to update SOLR schema from continuous integration
environment

You do that with schema changes and I'll watch your site crash.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Nov 1, 2014, at 8:31 PM, Will Martin  wrote:

> Well yes. But since there hasn't been any devops approaches yet, we 
> really aren't talking about Continuous Delivery. Continually 
> delivering builds into production is old hat and Jack nailed the 
> canonical manners in which it has been done. It really depends on 
> whether an org is investing in the full Agile lifecycle. A piece at a time
> is common.
> 
> One possible devop approach:
> 
> Once you get near full test automation
> : Jenkins builds the target
> : chef does due diligence on dependencies
> : chef pulls the build over. 
> : chef configures the build once it is installed.
> : chef takes the machine out of the load-balancer's rotation
> : chef puts the machine back in once it is launched and sanity tested 
> (by chef).
> 
> 
> 
> 
> If you substitute Jack's plan, you get pretty much the same thing; 
> except that by using devops tools you introduce a little thing called
idempotency.
> 
> 
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org]
> Sent: Saturday, November 01, 2014 12:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to update SOLR schema from continuous integration 
> environment
> 
> Nice pictures, but that preso does not even begin to answer the question.
> 
> With master/slave replication, I do schema migration in two ways, 
> depending on whether a field is added or removed.
> 
> Adding a field:
> 
> 1. Update the schema on the slaves. A defined field with no data is 
> not a problem.
> 2. Update the master.
> 3. Reindex to populate the field and wait for replication.
> 4. Update the request handlers or clients to use the new field.
> 
> Removing a field is the opposite. I haven't tried lately, but Solr 
> used to have problems with a field that was in the index but not in the
schema.
> 
> 1. Update the request handlers and clients to stop using the field.
> 2. Reindex without any data for the field that will be removed, wait 
> for replication.
> 3. Update the schema on the master and slaves.
> 
> I have not tried to automate this for continuous deployment. It isn't 
> a big deal for a single server test environment. It is the prod 
> deployment that is tricky.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/
> 
> 
> On Nov 1, 2014, at 7:29 AM, Will Martin  wrote:
> 
>> http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing
>> 
>> 
>> -Original Message-
>> From: Jack Krupansky [mailto:j...@basetechnology.com]
>> Sent: Saturday, November 01, 2014 9:46 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to update SOLR schema from continuous integration
> environment
>> 
>> In all honesty, incrementally updating resources of a production 
>> server is
> a rather frightening proposition. Parallel testing is always a better 
> way to go - bring up any changes in a parallel system for testing and 
> then do an atomic "swap" - redirection of requests from the old server 
> to the new server and then retire the old server only after the new 
> server has had enough time to burn in and get past any infant mortality
problems.
>> 
>> That's production. Testing and dev? Who needs the hassle; just tear 
>> the
> old server down and bring up the new server from scratch with all 
> resources updated from the get-go.
>> 
>> Oh, and the starting point would be keeping your full set of config 
>> and
> resource files under source control so that you can carefully review 
> changes before they are "pushed", can compare different revisions, and 
> can easily back out a revision with confidence rather than "winging it."
>> 
>> That said, a lot of production systems these days are not designed 
>> for
> parallel operation and swapping out parallel systems, especially for 
> cloud and cluster systems. In these cases the reality is more of a 
> "rolling update", where one node at a time is taken down, updat

Re: Bad type on operand stack: SolrInputDocument not assignable to SolrDocumentBase

2019-01-28 Thread Will Martin
Hi Shawn:

We are deployed across the globe in many regions with different use patterns.
Spring-data-solr is front and center for us and has proven pretty darn stable.

"There can be some very strange problems when trying to use SolrJ with Spring."

Could you expand on this, so that we might know what might be getting missed?

Will Martin
DEVOPS ENGINEER
540.454.9565

8609 WESTWOOD CENTER DR, SUITE 475
VIENNA, VA 22182
geturgently.com

On Jan 27, 2019, at 2:08 PM, Shawn Heisey  wrote:
There can be some very strange problems when trying to use SolrJ with Spring.

Re: [SECURITY] CVE-2018-8026: XXE vulnerability due to Apache Solr configset upload (exchange rate provider config / enum field config / TIKA parsecontext)

2018-07-04 Thread will martin
The CVE id was reserved in April; the JIRA ticket, a month ago. Is this the
first notice to this list?

Thx

On Wed, Jul 4, 2018, 12:56 PM Uwe Schindler  wrote:

> CVE-2018-8026: XXE vulnerability due to Apache Solr configset upload
> (exchange rate provider config / enum field config / TIKA parsecontext)
>
> Severity: High
>
> Vendor:
> The Apache Software Foundation
>
> Versions Affected:
> Solr 6.0.0 to 6.6.4
> Solr 7.0.0 to 7.3.1
>
> Description:
> The details of this vulnerability were reported by mail to the Apache
> security mailing list.
> This vulnerability relates to an XML external entity expansion (XXE) in
> Solr
> config files (currency.xml, enumsConfig.xml referred from schema.xml,
> TIKA parsecontext config file). In addition, Xinclude functionality
> provided
> in these config files is also affected in a similar way. The vulnerability
> can
> be used as XXE using file/ftp/http protocols in order to read arbitrary
> local files from the Solr server or the internal network. The manipulated
> files can be uploaded as configsets using Solr's API, allowing to exploit
> that vulnerability. See [1] for more details.
>
> Mitigation:
> Users are advised to upgrade to either Solr 6.6.5 or Solr 7.4.0 releases
> both
> of which address the vulnerability. Once upgrade is complete, no other
> steps
> are required. Those releases only allow external entities and Xincludes
> that
> refer to local files / zookeeper resources below the Solr instance
> directory
> (using Solr's ResourceLoader); usage of absolute URLs is denied. Keep in
> mind, that external entities and XInclude are explicitly supported to
> better
> structure config files in large installations. Before Solr 6 this was no
> problem, as config files were not accessible through the APIs.
>
> If users are unable to upgrade to Solr 6.6.5 or Solr 7.4.0 then they are
> advised to make sure that Solr instances are only used locally without
> access
> to public internet, so the vulnerability cannot be exploited. In addition,
> reverse proxies should be guarded to not allow end users to reach the
> configset APIs. Please refer to [2] on how to correctly secure Solr
> servers.
>
> Solr 5.x and earlier are not affected by this vulnerability; those versions
> do not allow to upload configsets via the API. Nevertheless, users should
> upgrade those versions as soon as possible, because there may be other ways
> to inject config files through file upload functionality of the old web
> interface. Those versions are no longer maintained, so no deep analysis was
> done.
>
> Credit:
> Yuyang Xiao, Ishan Chattopadhyaya
>
> References:
> [1] https://issues.apache.org/jira/browse/SOLR-12450
> [2] https://wiki.apache.org/solr/SolrSecurity
>
> -
> Uwe Schindler
> uschind...@apache.org
> ASF Member, Apache Lucene PMC / Committer
> Bremen, Germany
> http://lucene.apache.org/
>
>
>


Re: Cluster with no overseer?

2019-05-21 Thread Will Martin
+1

Will Martin
DEVOPS ENGINEER
540.454.9565

8609 WESTWOOD CENTER DR, SUITE 475
VIENNA, VA 22182
geturgently.com


On Tue, May 21, 2019 at 7:39 PM Walter Underwood 
wrote:

> ADDROLE times out after 180 seconds. This seems to be an unrecoverable
> state for the cluster, so that is a pretty serious bug.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 21, 2019, at 4:10 PM, Walter Underwood 
> wrote:
> >
> > We have a 6.6.2 cluster in prod that appears to have no overseer. In
> /overseer_elect on ZK, there is an election folder, but no leader document.
> An OVERSEERSTATUS request fails with a timeout.
> >
> > I’m going to try ADDROLE, but I’d be delighted to hear any other ideas.
> We’ve diverted all the traffic to the backing cluster, so we can blow this
> one away and rebuild.
> >
> > Looking at the Zookeeper logs, I see a few instances of network failures
> across all three nodes.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
>
>
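
For reference, the ADDROLE call mentioned above looks like this (the node name
is a placeholder):

curl "http://localhost:8983/solr/admin/collections?action=ADDROLE&role=overseer&node=host1:8983_solr"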


Re: Cluster with no overseer?

2019-05-21 Thread Will Martin
Walter. Can I cross-post to zk-dev?
Will Martin
DEVOPS ENGINEER
540.454.9565

8609 WESTWOOD CENTER DR, SUITE 475
VIENNA, VA 22182
geturgently.com

On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly> wrote:

+1

Will Martin
DEVOPS ENGINEER
540.454.9565

8609 WESTWOOD CENTER DR, SUITE 475
VIENNA, VA 22182
geturgently.com

On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org> wrote:
ADDROLE times out after 180 seconds. This seems to be an unrecoverable state for the cluster, so that is a pretty serious bug.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> We have a 6.6.2 cluster in prod that appears to have no overseer. In /overseer_elect on ZK, there is an election folder, but no leader document. An OVERSEERSTATUS request fails with a timeout.
> 
> I’m going to try ADDROLE, but I’d be delighted to hear any other ideas. We’ve diverted all the traffic to the backing cluster, so we can blow this one away and rebuild.
> 
> Looking at the Zookeeper logs, I see a few instances of network failures across all three nodes.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 




Re: Cluster with no overseer?

2019-05-21 Thread Will Martin
Worked with Fusion and Zookeeper at GSA for 18 months: admin role.

Before blowing it away, you could try:

- id a candidate node, with a snapshot you just might think is old enough
to be robust.
- clean data for the other zk nodes.
- bring up the chosen node and wait for it to settle [wish i could remember
why i called what i saw that].
- bring up the other nodes 1 at a time. let each one fully sync as a
follower of the new leader.
- they should each in turn request the snapshot from the leader. Then you
have to:

: align your collections with the ensemble. and for the life of me i can't
remember there being anything particularly tricky about that with fusion,
which means I can't remember what I did... or have it doc'd at home. ;-)
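
A sketch of the per-node mechanics under those assumptions (ZooKeeper 3.4.x
layout; the data path is illustrative):

zkServer.sh stop
mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/version-2.bak  # clear local state
# start the chosen candidate first, then the others one at a time
zkServer.sh start
zkServer.sh status   # wait for "Mode: leader"/"Mode: follower" before the next node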


Will Martin
DEVOPS ENGINEER
540.454.9565

8609 WESTWOOD CENTER DR, SUITE 475
VIENNA, VA 22182
geturgently.com


On Tue, May 21, 2019 at 11:40 PM Walter Underwood 
wrote:

> Yes, please. I have the logs from each of the Zookeepers.
>
> We are running 3.4.12.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 21, 2019, at 6:49 PM, Will Martin  wrote:
> >
> > Walter. Can I cross-post to zk-dev?
> >
> >
> >
> > Will Martin
> > DEVOPS ENGINEER
> > 540.454.9565
> >
> > 
> >
> > 8609 WESTWOOD CENTER DR, SUITE 475
> > VIENNA, VA 22182
> > geturgently.com
> >
> >
> >
> >
> >> On May 21, 2019, at 9:26 PM, Will Martin <wmar...@urgent.ly> wrote:
> >>
> >> +1
> >>
> >> Will Martin
> >> DEVOPS ENGINEER
> >> 540.454.9565
> >>
> >> 8609 WESTWOOD CENTER DR, SUITE 475
> >> VIENNA, VA 22182
> >> geturgently.com
> >>
> >>
> >> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wun...@wunderwood.org> wrote:
> >> ADDROLE times out after 180 seconds. This seems to be an unrecoverable
> state for the cluster, so that is a pretty serious bug.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/ (my blog)
> >>
> >> > On May 21, 2019, at 4:10 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> >> >
> >> > We have a 6.6.2 cluster in prod that appears to have no overseer. In
> /overseer_elect on ZK, there is an election folder, but no leader document.
> An OVERSEERSTATUS request fails with a timeout.
> >> >
> >> > I’m going to try ADDROLE, but I’d be delighted to hear any other
> ideas. We’ve diverted all the traffic to the backing cluster, so we can
> blow this one away and rebuild.
> >> >
> >> > Looking at the Zookeeper logs, I see a few instances of network
> failures across all three nodes.
> >> >
> >> > wunder
> >> > Walter Underwood
> >> > wun...@wunderwood.org
> >> > http://observer.wunderwood.org/ (my blog)
> >> >
> >>
> >
>
>


Re: Solr slave core corrupted and not replicating.

2019-06-05 Thread Will Martin
Varma:

What version of Solr is running? You said master slave so you are not
running a solrcloud? Some mistakenly hold onto the nomenclature as
describing the leadership state.

When you look at the log do you have a window on the replication logging?

Search queries are routed to the master for a given shard, unless you are
using the distrib parameter?



Why does it happen that a core goes corrupt? Any crashes, or shutdowns that
didn't go through solr/bin/stop (service stop)?
Do you have an autocommit? Recent experience with backup issues suggests
that an open commit with partial persistence might cause this if a bounce
happens while that state exists (which is why we use solr/bin/stop or
service stop).
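
If the index directory is still present on the slave, it may be worth forcing
a pull before rebuilding; a sketch (the core name is a guess from your
description):

curl "http://slave:8983/solr/sitecore_analytics_index/replication?command=fetchindex"
curl "http://slave:8983/solr/sitecore_analytics_index/replication?command=details&wt=json"  # check progress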

will martin






On Wed, Jun 5, 2019 at 6:26 PM varma mahesh  wrote:

> ++solr-user@lucene.apache.org
>
> On Thu 6 Jun, 2019, 1:19 AM varma mahesh,  wrote:
>
> > Hi Team,
> >
> >
> > What happens to Sitecore - Solr query handling when a core is corrupted
> in
> > Solr slave in a Master - slave setup?
> >
> > Our Sitecore site's solr search engine is a master-slave setup. One of
> the
> > cores of the Slave is corrupted and is not available at all in Slave.
> >
> > It is not being replicated from Master too (Expecting index replication
> to
> > do this but core is completely missing in Slave). As read in index
> > replication documentation, all the queries are handled by Slave part of
> the
> > set up.
> >
> > What happens to the queries that are handled by this core that is missing
> > in slave?
> >
> > Will they be taken over by Master?
> >
> > Please help me as I can find no info about this anywhere else. For info
> > the core that is missing is of Sitecore
> > analytics index.
> >
> > The error that Solr slave showing us for Analytics core is:
> >
> > org.apache.solr.common.SolrException: Error opening new searcher
> >  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:815)
> >   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:658)
> >at org.apache.solr.core.CoreContainer.create(CoreContainer.java:637)
> >   at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:381)
> >  at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:375)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > at
> >
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
> >  at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > Can you please help us here why this happened? We could not find any info
> > from the logs that is leading us to this error.
> >
> >
> >
>


Re: Discuss: virtual nodes in Solr

2019-06-28 Thread Will Martin
From: S G <sg.online.em...@gmail.com>
Subject: Discuss: virtual nodes in Solr
Date: June 28, 2019 at 8:04:44 PM EDT
To: solr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org

Hi,

Has Solr tried to use vnodes concept like Cassandra:
https://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2

If this can be implemented carefully, we need not live with shard splitting
alone, which can only double the number of shards.
With vnodes, shards can be increased incrementally as the need arises.
What's more, shards can be decreased too when the doc-count/traffic
decreases.

-SG

+1

Carefully? Deliberate would be a better word with this community, imho. How
about an incubation epic story, PMC?





SolrCloud BACKUP single shard collections

2019-07-15 Thread Will Martin
Hi.

I have 8 collections in a 3 node SolrCloud: 6.3

Given the following scenario:
1. preferredleader REPLICAPROP for all collections on core_node2
2. zookeeper->overseer_elect->leader has core_node1
3. BACKUP command always writes to/using core_node1 ???

Notes:

  1.  all collections have exactly one shard.
  2.  note: preferredleader has been set because several collections' shard
leaders had drifted from core_node1
  3.  all collections have been through REBALANCELEADERS, so the shard leaders
are all on core_node2 according to healthcheck

Non-canonical: I know BACKUP is supposed to have a shared fs mounted on each
node, but experimentation shows that when there is only 1 shard, only 1 node
writes to storage; and if that storage is a local fs, there are no issues.

I expected the writes to come from the shard leaders, but they are coming from 
the zookeeper->leader node.

The workflow has been rock-solid as long as we have shard leaders and solrcloud 
leader consistent with each other.

Is my expectation wrong [that writes happen on the shard leader for
single-shard collections]? I need defined behavior so that I know where to
pick up the backup files. This is all implemented in a script, and a
deterministic understanding of what.writes.where will make it a success.
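
For reference, both leaders can be inspected from the Collections API before
each run; a sketch (the collection name is a placeholder):

curl -s "http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json"                    # overseer node
curl -s "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycoll&wt=json"   # shard leader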

thanks

--will


RE: solrcloud backup null pointer exeption

2019-07-26 Thread Will Martin
can you share:
solr version?
zookeeper ensemble type?
number of shards in the collection?
distribution of shard replicas in the SolrCloud?

From there, the most obvious question is whether the stack trace is from the
shard leader for the collection or from localhost, if they are not the same.
There should be significantly more logging associated with this operation.

What version of NFS does your pod run?

-Original Message-
From: rffleaie  
Sent: Thursday, July 25, 2019 6:03 PM
To: solr-user@lucene.apache.org
Subject: solrcloud backup null pointer exeption

I have a solrcloud cluster installed on k8s.
I have created an NFS PVC that is mounted under /backup of every pod of the solr 
cluster.

When I start the backup with 

http://127.0.0.1:8983/solr/admin/collections?action=BACKUP&name=test&collection=collection_name&location=/backup

I receive the following error; has anyone seen the same issue?




  "Operation backup caused
exception:":"org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Could not backup all shards",
  "exception":{
"msg":"Could not backup all shards",
"rspCode":500},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException"],
"msg":"Could not backup all shards",
"trace":"org.apache.solr.common.SolrException: Could not backup all 
shards\n\tat 
org.apache.solr.client.solrj.SolrResponse.getException(SolrResponse.java:53)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:274)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat
org.eclipse.jetty.util.thread.