Re: Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

2019-03-22 Thread Hubert-Price, Neil
Thanks Erick, that makes sense.

However, it does lead me to another conclusion: in Solr prior to 6.0, or with 
sow=true on Solr 6.0+, the ShingleFilter is totally ineffective within query 
analysers. It would be logically equivalent to not having the ShingleFilter 
configured at all.

The point of the ShingleFilter, as I understand it, is to create 
combinations/permutations, but surely none are possible if it receives 
only one pre-split token at a time.
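(A rough, purely illustrative Python sketch of that point -- this is a toy
stand-in for ShingleFilter, not Solr code, with assumed default settings of
2-gram shingles and unigram pass-through:)

```python
def shingles(tokens, min_n=2, max_n=2, sep=" "):
    """Toy version of shingling: emit the input tokens plus n-gram shingles."""
    out = list(tokens)  # unigrams pass through
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(sep.join(tokens[i : i + n]))
    return out

# sow=false: the whole multi-term query reaches the filter, so shingles appear
print(shingles(["9611444530", "9612194002"]))
# -> ['9611444530', '9612194002', '9611444530 9612194002']

# sow=true / Solr 4.x: each pre-split token arrives alone, so nothing is added
print(shingles(["9611444530"]))
# -> ['9611444530']
```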

Going back to my original configuration: to achieve the same result as in 
Solr 4.6, I think I would need to remove ShingleFilterFactory from the query 
analyser config for that field type?

Many Thanks,
Neil

Sent from my iPhone

> On 22 Mar 2019, at 02:38, Erick Erickson  wrote:
> 
> sow was introduced in Solr 6, so it’s just ignored in 4x.
> 
> bq. Surely the tokenizer splits on white space anyway, or it wouldn't work?
> 
> I didn’t work on that code, so I don’t have the details off the top of my 
> head, but I’ll take a stab at it as far as my understanding goes. The result 
> is in your parsed queries.
> 
> Note that in the better-behaved case, you have a bunch of individual 
> tokens ORd together like:
> productdetails_tokens_en:9611444530
> productdetails_tokens_en:9611444530
> 
> and that’s all. IOW, the query parser has split them into individual tokens 
> that are fed one at a time into the analysis chain.
> 
> In the bad case you have a bunch of single tokens as well, but then what look 
> like multiple tokens, but are not:
> +productdetails_tokens_en:9611444500
> +productdetails_tokens_en:9612194002 9612194002 9612194002)
> 
> which is where the explosion is coming from. It’s deceptive, because when 
> shingling, this is a single token "9612194002 9612194002 9612194002" for all 
> it looks like something that’d be split by whitespace.
> 
> If you take a look at your admin UI>>your_core>>schema and select your 
> productdetails_tokens_en from the drop down and then “load terms” you’ll see. 
> If you want to experiment, you can add a tokenSeparator character other than 
> a space to the shinglefilter that’ll make it clearer. Then the clause above 
> that looks like multiple, whitespace-separated tokens would look like what it 
> really is, a single token:
> 
> +productdetails_tokens_en:9612194002_9612194002_9612194002)
> 
> Best,
> Erick
> 
>> On Mar 21, 2019, at 3:10 PM, Hubert-Price, Neil  
>> wrote:
>> 
>> Surely the tokenizer splits on white space anyway, or it wouldn't work?
> 
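(For reference, the tokenSeparator experiment Erick suggests would look
roughly like this in the field type's query analyser -- tokenSeparator is a
documented ShingleFilterFactory attribute, but the field type name and
tokenizer here are illustrative guesses, not the poster's actual config:)

```xml
<fieldType name="text_shingle_en" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- "_" instead of the default space makes each shingle visibly one token -->
    <filter class="solr.ShingleFilterFactory" tokenSeparator="_"/>
  </analyzer>
</fieldType>
```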


Re: [CAUTION] Re: Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

2019-03-22 Thread Hubert-Price, Neil
One other question 

Is there a system level configuration that can change the default for the sow= 
parameter?  Can it be flipped to have the default set to true?

Many Thanks,
Neil





Solr LTR model Performance Issues

2019-03-22 Thread Kamal Kishore Aggarwal
Hi,

I am trying to use LTR with Solr 6.6.2. There are different types of model,
like the Linear Model, Multiple Additive Trees Model and Neural Network Model.

I have tried using the Linear and Multiadditive models and compared the
performance of results. There is a major difference in response time
between the two models: the Multiadditive model is taking far longer than
the Linear model.

Is there a way we can improve the performance here?

Note: The size of Multiadditive model is 136 MB.

Regards
Kamal Kishore




Re: Solr LTR model Performance Issues

2019-03-22 Thread Jörn Franke
Can you share the time needed of the two models? How many documents? What is 
your loading pipeline? Have you observed cpu/memory?

> Am 22.03.2019 um 12:01 schrieb Kamal Kishore Aggarwal :


Solr query high response time

2019-03-22 Thread Rajdeep Sahoo
Hi all,
  My Solr query sometimes takes more than 60 sec to return the response.
Is there any way I can check why it is taking so much time?
  Please let me know if there is any way to analyse this issue (high
response time). Thanks


Why is elevate not working when I convert a request to local parameters?

2019-03-22 Thread Tim Allison
Should probably send this one from an anonymous email... :(

I can see from the results that elevate is working with this:

select?&defType=edismax&q=transcript&qf=my_field

However, elevate is not working with this:

select?&q={!edismax%20v=transcript%20qf=my_field}

This is Solr 4.x...y, I know...

What am I doing wrong?  How can I fix this?

Thank you.

Best,

 Tim


Re: [CAUTION] Re: Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

2019-03-22 Thread Shawn Heisey

On 3/22/2019 2:02 AM, Hubert-Price, Neil wrote:

One other question 

Is there a system level configuration that can change the default for the sow= 
parameter?  Can it be flipped to have the default set to true?


Any parameter can be put into the query handler definition, in the 
defaults, invariants, or appends sections.  Most people only use the 
defaults section.  If you put it into invariants, then clients will not 
be able to change the setting for that parameter.
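(As a sketch of what that looks like in solrconfig.xml -- assuming the
standard /select handler; the handler name and other defaults in your
config may differ:)

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- flip the split-on-whitespace default back to the pre-7.0 behavior -->
    <str name="sow">true</str>
  </lst>
</requestHandler>
```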


In response to another message you sent on the thread:

You can still split on whitespace even if sow=false, which is the 
default setting for sow starting in 7.0.  You just have to do it 
yourself.  The tokenizers that people typically use will split on 
whitespace -- WhitespaceTokenizerFactory and StandardTokenizerFactory 
are two examples of this.  There are also filters that split on whitespace.


I do not know whether you should remove the shingle filter to achieve 
your goals.  You would need to examine all the things you need to happen 
and use the analysis tab in the UI to decide what analysis you need.


Thanks,
Shawn


Re: Solr query high response time

2019-03-22 Thread Shawn Heisey

On 3/22/2019 7:52 AM, Rajdeep Sahoo wrote:

   My Solr query sometimes takes more than 60 sec to return the response.
Is there any way I can check why it is taking so much time?
   Please let me know if there is any way to analyse this issue (high
response time). Thanks


With the information provided, we have nothing to go on to figure out 
what is happening.


In many cases, the problem turns out to be insufficient memory.  If you 
can gather a specific screenshot of a process listing, we MIGHT be able 
to offer some insight.  The way to gather that info is posted here:


https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue

Attachments to messages sent to this list will not make it through to 
us.  You will need to use a file sharing site and provide a link to the 
file.


Some queries, when executed on a very large index, will be slow even 
when there is sufficient memory.  After you send the screenshot 
mentioned above, we might need to ask for additional information.


Thanks,
Shawn


Range query on multivalued string field results in useless highlighting

2019-03-22 Thread Wolf, Karl (NIH/NLM/LHC) [C]
Range queries against multivalued string fields produce useless highlighting, 
even though "hl.highlightMultiTerm":"true" is set.

I have uncovered what I believe is a bug. At the very least it is a difference 
in behavior between Solr v5.1.0 and v7.5.0 (and v7.7.1).

I have a Field defined in my schema as:




I am using a query containing a Range clause and I am using highlighting to get 
the list of values that match the range query.

All examples below were using the appropriate Solr Admin Server Query page.

The range query using Solr v5.1.0 produces CORRECT and useful results:

{
  "responseHeader": {
"status": 0,
"QTime": 366,
"params": {
  "q": "ResourceCorrespondent:[A TO B}",
  "hl": "true",
  "indent": "true",
  "hl.preserveMulti": "true",
  "fl": "ResourceCorrespondent,ResourceID",
  "hl.requireFieldMatch": "true",
  "hl.usePhraseHighlighter": "true",
  "hl.fl": "ResourceCorrespondent",
  "wt": "json",
  "hl.highlightMultiTerm": "true",
  "_": "1553275722025"
}
  },
  "response": {
"numFound": 999,
"start": 0,
"docs": [
  {
"ResourceCorrespondent": [
  "Stanley, Wendell M.",
  "Avery, Roy"
],
"ResourceID": "CCAAHG"
  },
  {
"ResourceCorrespondent": [
  "Avery, Roy"
],
"ResourceID": "CCGMDS"
  },
... lots more docs, then
]
  },
... we get to the highlighting portion of the response
... this tells me which values of each ResourceCorrespondent field
... actually matching the query

  "highlighting": {
"CCAAHG": {
  "ResourceCorrespondent": [
"Avery, Roy"
  ]
},
"CCGMDS": {
  "ResourceCorrespondent": [
"Avery, Roy"
  ]
},
"BBACKV": {
  "ResourceCorrespondent": [
"American Institute of Biological Sciences",
"Albritton, Errett C."
  ]
},
... lots more useful highlight values. Note two matching values
... for document BBACKV.
}

***
***
However, using the exact same parameters with Solr v7.5.0 or v7.7.1, the top 
portion of the response is basically the same, including the number of 
documents found:

{
  "responseHeader":{
"status":0,
"QTime":245,
"params":{
  "q":"ResourceCorrespondent:[A TO B}",
  "hl":"on",
  "hl.preserveMulti":"true",
  "fl":"ResourceID, ResourceCorrespondent",
  "hl.requireFieldMatch":"true",
  "hl.fl":"ResourceCorrespondent",
  "hightlightMultiTerm":"true",
  "wt":"json",
  "_":"1553105129887",
  "usePhraseHighLighter":"true"}},
  "response":{"numFound":999,"start":0,"docs":[

The documents are in a different order, but that doesn't matter.

The problem is with the highlighting, which is effectively empty. I don't 
know what values in each document actually matched the query:

  "highlighting":{
"QQBBLX":{},
"QQBCLN":{},
"QQBCLM":{},
... etc.

*** NOTE: The data is the same for all Solr versions and the Solr indexes were 
rebuilt
for each Solr version.

***
Changing to using "&hl.method=unified", the highlighting looks like:

  "highlighting":{
"QQBBLX":{
  "ResourceCorrespondent":[]},
"QQBCLN":{
  "ResourceCorrespondent":[]},
"QQBCLM":{
  "ResourceCorrespondent":[]},

*** Closer but still no useful values

***
NOTE: if I change only the query to be a wildcard query to 
q="ResourceCorrespondent:A*"

the highlighting is correct in both Solr v7.5.0 and v7.7.1:

  "highlighting":{
"QQBBLX":{
  "ResourceCorrespondent":["American Public Health Association"]},
"QQBCLN":{
  "ResourceCorrespondent":["Abram, Morris B."]},
"QQBCLM":{
  "ResourceCorrespondent":["Abram, Morris B."]},
... etc.

*** This makes me think there is some problem with a Range query feeding the
Highlighter code.

***
No variation of the hl parameters or other query parameters fixes the problem.
The wildcard query is my current workaround, but there is still a problem with
range queries.

So there is some incompatibility among:

1) A multivalued string field AND
2) A range query against that field AND
3) Highlighting

The highlight portion of the response is effectively "empty"

I don't know when this issue was first introduced. I have recently been 
updating from 5.1.0
to 7.5.0 in one big leap. I have attempted to read through the change logs for 
the intervening
versions but I gave up to save my sanity.

--Karl


Re: CDCR issues

2019-03-22 Thread Jay Potharaju
This might be causing the high CPU in 7.7.x.

https://github.com/apache/lucene-solr/commit/eb652b84edf441d8369f5188cdd5e3ae2b151434#diff-e54b251d166135a1afb7938cfe152bb5
That is related to this JDK bug
https://bugs.openjdk.java.net/browse/JDK-8129861.


Thanks
Jay Potharaju



On Thu, Mar 21, 2019 at 10:20 PM Jay Potharaju 
wrote:

> Hi,
> I just enabled CDCR for one collection. I am seeing high CPU usage and a
> high (and increasing) number of tlog files.
> The collection does not have lot of data , just started reindexing of
> data.
> .
> Solr 7.7.0 , implicit sharding 8 shards
> I have enabled buffer on source side and disabled buffer on target side.
> The number of replicators is set to 4.
>  Any suggestions on how to tackle high cpu and growing tlog. The tlog are
> small in size but for the one shard I checked there were about 100 of them.
>
> Thanks
> Jay


Re: Need help on LTR

2019-03-22 Thread Kamuela Lau
I think the issue is that you store the feature as originalScore, but in
your model you refer to it as original_score.
On Wed, Mar 20, 2019 at 1:58 PM Mohomed Rimash  wrote:

> one more thing i noticed is your feature params values doesn't wrap in q or
> qf field. check that as well
>
> On Wed, 20 Mar 2019 at 01:34, Amjad Khan  wrote:
>
> > Did, but same error
> >
> > {
> >   "responseHeader":{
> > "status":400,
> > "QTime":5},
> >   "error":{
> > "metadata":[
> >   "error-class","org.apache.solr.common.SolrException",
> >   "root-error-class","java.lang.NullPointerException"],
> > "msg":"org.apache.solr.ltr.model.ModelException: Model type does not
> > exist org.apache.solr.ltr.model.LinearModel",
> > "code":400}}
> >
> >
> >
> > > On Mar 19, 2019, at 3:26 PM, Mohomed Rimash 
> > wrote:
> > >
> > > Please update the weights values to greater than 0 and less than 1.
> > >
> > > On Wed, 20 Mar 2019 at 00:13, Amjad Khan  wrote:
> > >
> > >> Feature File
> > >> ===
> > >>
> > >> [
> > >>  {
> > >>"store" : "exampleFeatureStore",
> > >>"name" : "isCityName",
> > >>"class" : "org.apache.solr.ltr.feature.FieldValueFeature",
> > >>"params" : { "field" : "CITY_NAME" }
> > >>  },
> > >>  {
> > >>"store" : "exampleFeatureStore",
> > >>"name" : "originalScore",
> > >>"class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
> > >>"params" : {}
> > >>  },
> > >>  {
> > >>"store" : "exampleFeatureStore",
> > >>"name" : "isLat",
> > >>"class" : "org.apache.solr.ltr.feature.FieldValueFeature",
> > >>"params" : { "field" : "LATITUDE" }
> > >>  }
> > >> ]
> > >>
> > >> Model File
> > >> ==
> > >> {
> > >>  "store": "exampleFeatureStore",
> > >>  "class": "org.apache.solr.ltr.model.LinearModel",
> > >>  "name": "exampleModelStore",
> > >>  "features": [{
> > >>  "name": "isCityName"
> > >>},
> > >>{
> > >>  "name": "isLat"
> > >>},
> > >>{
> > >>  "name": "original_score"
> > >>}
> > >>  ],
> > >>  "params": {
> > >>"weights": {
> > >>  "isCityName": 0.0,
> > >>  "isLat": 0.0,
> > >>  "original_score": 1.0
> > >>}
> > >>  }
> > >> }
> > >>
> > >>
> > >>
> > >>> On Mar 19, 2019, at 2:04 PM, Mohomed Rimash 
> > >> wrote:
> > >>>
> > >>> Can you share the feature file and the model file,
> > >>> 1. I had few instances where invalid values for parameters (ie
> weights
> > >> set
> > >>> to more than 1 , with minmaxnormalizer) resulted the above error,
> > >>> 2, Check all the features added to the model has a weight under
> params
> > ->
> > >>> weights in the model
> > >>>
> > >>>
> > >>> On Tue, 19 Mar 2019 at 21:21, Roopa Rao  wrote:
> > >>>
> >  Does your feature definitions and the feature names used in the
> model
> >  match?
> > 
> >  On Tue, Mar 19, 2019 at 10:17 AM Amjad Khan 
> > >> wrote:
> > 
> > > Yes, I did.
> > >
> > > I can see the feature that I created by this
> > > schema/feature-store/exampleFeatureStore and it return me the
> > features
> > >> I
> > > created. But issue is when I try to put store-model.
> > >
> > >> On Mar 19, 2019, at 12:18 AM, Mohomed Rimash <
> rim...@yaalalabs.com>
> > > wrote:
> > >>
> > >> Hi Amjad, After adding the libraries into the path, Did you
> restart
> > >> the
> > >> SOLR ?
> > >>
> > >> On Tue, 19 Mar 2019 at 08:45, Amjad Khan 
> > wrote:
> > >>
> > >>> I followed the Solr LTR Documentation
> > >>>
> > >>> https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html <
> > >>> https://lucene.apache.org/solr/guide/7_4/learning-to-rank.html>
> > >>>
> > >>> 1. Added library into the solr-config
> > >>> 
> > >>>  > >>> regex=".*\.jar" />
> > >>>  > >>> regex="solr-ltr-\d.*\.jar" />
> > >>> 2. Successfully added feature
> > >>> 3. Get schema to see feature is available
> > >>> 4. When I try to push model I see the error below, however I
> added
> > >> the
> > > lib
> > >>> into solr-cofig
> > >>>
> > >>> Response
> > >>> {
> > >>> "responseHeader":{
> > >>>  "status":400,
> > >>>  "QTime":1},
> > >>> "error":{
> > >>>  "metadata":[
> > >>>"error-class","org.apache.solr.common.SolrException",
> > >>>"root-error-class","java.lang.NullPointerException"],
> > >>>  "msg":"org.apache.solr.ltr.model.ModelException: Model type does
> >  not
> > >>> exist org.apache.solr.ltr.model.LinearModel",
> > >>>  "code":400}}
> > >>>
> > >>> Thanks
> > >
> > >
> > 
> > >>
> > >>
> >
> >
>


Re: Solr LTR model Performance Issues

2019-03-22 Thread Kamal Kishore Aggarwal
HI Jörn Franke,

Thanks for the quick reply.

I have performed jmeter load testing on one of the servers for the Linear vs
Multiple Additive Trees model. We are using Lucidworks Fusion.
There is some business logic in the query pipeline, followed by the main Solr
LTR query. The times below are the total time taken by the query pipeline.
Below are the response times:

Threads  Ramp-up  Loops  Model                Requests  Avg Response Time (ms)
                                                        Iter 1  Iter 2  Iter 3
10       1        10     Linear Model         100       2038    1998    1975
25       1        10     Linear Model         250       4329    3961    3726
10       1        10     MultiAdditive Model  100       12721   12631   12567
25       1        10     MultiAdditive Model  250       27924   31420   30758

# of docs: 500K; index size is 10 GB.

As of now, I have not checked the CPU or memory usage, but I did not observe
any errors during the jmeter load test.

Let me know if any other information is required.

Regards
Kamal



On Fri, Mar 22, 2019 at 5:13 PM Jörn Franke  wrote:

> Can you share the time needed of the two models? How many documents? What
> is your loading pipeline? Have you observed cpu/memory?
>