Re: copyfield not working

2019-01-14 Thread Jay Potharaju
Thanks for the info, Andrea!
Thanks
Jay



On Sun, Jan 13, 2019 at 11:53 PM Andrea Gazzarini 
wrote:

> Hi Jay, the text analysis always operates on the indexed content. The
> stored content of a field is left untouched unless you do something
> before it gets indexed (e.g. on the client side or by an
> UpdateRequestProcessor).
>
> Cheers,
> Andrea
>
> On 14/01/2019 08:46, Jay Potharaju wrote:
> > Hi,
> > I have a copyField in which I am copying the contents of a text_en field
> > to another custom field.
> > After indexing I was expecting the special characters in the paragraph
> > to be removed, but it does not look like that is happening: the copied
> > content is the same as what is in the source. I ran analysis, and there
> > the pattern matching works as expected and the special characters are
> > removed.
> >
> > Any suggestions?
> >
> > <fieldType name="..." class="solr.TextField">
> >   <analyzer>
> >     <charFilter class="solr.PatternReplaceCharFilterFactory"
> >       pattern="['!#\$%'\(\)\*+,-\./:;=?@\[\]\^_`{|}~!@#$%^*]" />
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
> >       words="lang/stopwords_en.txt" />
> >     <filter class="solr.EnglishPossessiveFilterFactory"/>
> >     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> >   </analyzer>
> > </fieldType>
> >
> > Thanks
> > Jay
> >
>
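
[To illustrate Andrea's point about UpdateRequestProcessors: a minimal sketch
of a chain, for solrconfig.xml, that clones a field and strips special
characters from the clone before it is stored and indexed. The chain name,
field names, and pattern are illustrative, not from the thread:]

<updateRequestProcessorChain name="clone-and-clean">
  <!-- Clone in the URP chain instead of via schema copyField, so the copy
       can be modified before it is stored. -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">text_en_field</str>
    <str name="dest">custom_field</str>
  </processor>
  <!-- Strip the special characters from the cloned value; this runs before
       indexing, so the stored value is cleaned as well. -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">custom_field</str>
    <str name="pattern">['!#\$%\(\)\*\+,\-\./:;=?@\[\]\^_`{|}~]</str>
    <str name="replacement"></str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>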


RE: Schema.xml, copyField, Slash, ignoreCase ?

2019-01-14 Thread Bruno Mannina
Hi Steve,

Many thanks for this field, I will test it this afternoon on my dev server.

Thanks also for your explanation!

Have a nice day!

Bruno

-Original Message-
From: Steve Rowe [mailto:sar...@gmail.com]
Sent: Friday, January 11, 2019 17:43
To: solr-user@lucene.apache.org
Subject: Re: Schema.xml, copyField, Slash, ignoreCase ?

Hi Bruno,

ignoreCase: Looks like you already have achieved this?

auto truncation: This is caused by inclusion of PorterStemFilterFactory in your 
"text_en" field type.  If you don't want its effects (i.e. treating different 
forms of the same word interchangeably), remove the filter.

process slash char: I think you want the slash to be included in symbol terms 
rather than interpreted as a term separator.  One way to achieve this is to 
first, pre-tokenization, convert the slash to a string that does not include a 
term separator, and then post-tokenization, convert the substituted string back 
to a slash.

Here's a version of your text_en that uses PatternReplaceCharFilterFactory[1] 
to convert slashes inside of symbol-ish terms (the pattern is a guess based on 
the symbol text you've provided; you'll likely need to adjust it) to "_": a 
string unlikely to otherwise occur, and which will not be interpreted by 
StandardTokenizer as a term separator; and then PatternReplaceFilterFactory[1] 
to convert "_" back to slashes.  Note that the patterns for the two are 
slightly different, since the *char filter* is given as input the entire field 
text, while the *filter* is given the text of single terms.

-

  [The fieldType XML in this message was stripped by the mailing-list
  archive; a reconstruction follows below.]

-
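
[A plausible reconstruction of the stripped fieldType, based on Steve's
description above. The patterns are illustrative guesses; as Steve notes,
even the original pattern was guessed from Bruno's symbol text and would
need adjusting:]

<fieldType name="text_en_slash" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Pre-tokenization: rewrite "/" inside symbol-like terms (e.g.
         B65D81/28) to "_", which StandardTokenizer keeps inside a term. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([A-Za-z]\d+[A-Za-z]\d*)/(\d+)" replacement="$1_$2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Post-tokenization: turn "_" back into "/". The pattern differs from
         the char filter's because this filter sees one term at a time. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(\w+)_(\w+)" replacement="$1/$2" replace="all"/>
  </analyzer>
</fieldType>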

[1] 
http://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.4.pdf

--
Steve


> On Jan 11, 2019, at 4:18 AM, Bruno Mannina  
> wrote:
> 
> I need to have a default “text” field with:
> 
> - ignoreCase,
> 
> - no auto truncation,
> 
> - process slash char
> 
> 
> 
> I would like to perform queries only on the field “text”
> 
> Queries can contain: codes or keywords or both.
> 
> 
> 
> I have 2 fields named symbol and title, and 1 alias ti (old field that 
> I can’t delete or modify)
> 
> 
> 
> * Symbol contains codes with a slash (e.g. A62C21/02)
> 
> <field name="symbol" type="string_ci" indexed="true" required="true" stored="true"/>
> 
> 
> 
> * Title contains English text and also symbols
> 
> <field name="title" type="text_en" indexed="true" stored="true"
> termVectors="true" termPositions="true" termOffsets="true"/>
> 
> 
> 
> { "symbol": "B65D81/20",
> 
> "title": [
> 
> "under vacuum or superatmospheric pressure, or in a special 
> atmosphere, e.g. of inert gas  {(B65D81/28  takes precedence; 
> containers with pressurising means for maintaining ball pressure A63B39/025)} 
> "
> 
> ]}
> 
> 
> 
> * Ti is an alias of title
> 
> <field name="ti" type="text_general" indexed="true" stored="true"
> termVectors="true" termPositions="true" termOffsets="true"/>
> 
> 
> 
> * Text is
> 
> <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
> 
> 
> 
> - Aliases are:
> 
> [four copyField definitions stripped by the mailing-list archive]
> 
> 
> 
> If I do these queries:
> 
> * ti:airbag  -> it's ok
> 
> * title:airbag  -> not good for me, because it finds
> airbags
> 
> * ti:b65D81/28  -> not good, debug shows ti:b65d81 OR ti:28
> 
> * ti:"b65D81/28"  -> it's ok
> 
> * symbol:b65D81/28  -> it's ok (even without quotes)
> 
> 
> 
> NOW with the "text" field
> 
> * b65D81/28  -> not good, debug shows text:b65d81 OR
> text:28
> 
> * airbag  -> it's ok
> 
> * "b65D81/28"  -> it's ok
> 
> 
> 
> It would be great if I could enter a symbol without quotes.
> 
> 
> 
> Could you help me to have a text field which solves this problem?
> (Please find below the definitions of all my fields.)
> 
> 
> 
> Many thanks for your help.
> 
> 
> 
> string_ci is my own definition:
> 
> <fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> 
> 
> Best Regards
> 
> Bruno
> 
> 
> 
> 
> 



RE: Schema.xml, copyField, Slash, ignoreCase ?

2019-01-14 Thread Bruno Mannina
Hi Erick,

Thanks for the tip about the Admin UI>>(core)>>analysis page; I will
investigate this afternoon.

Regards,
Bruno

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, January 11, 2019 17:18
To: solr-user
Subject: Re: Schema.xml, copyField, Slash, ignoreCase ?

The admin UI>>(select a core)>>analysis page is your friend here. It'll show 
you exactly what each filter in your analysis chain does and from there you'll 
need to mix and match filters, your tokenizer and the like to support the 
use-cases you need.

My guess is that the field type you're using contains 
WordDelimiterFilterFactory which is splitting up on the slash.
Similarly for your airbag/airbags problem: you probably have one of the 
stemmers in your analysis chain.

See "Filter Descriptions" in your version of the ref guide.

And one caution: The admin>>core>>analysis chain shows you what happens _after_ 
query parsing. So if you enter (without quotes) "bing bong" those tokens will 
be shown. What fools people is that the query _parser_ gets in there first, so 
they'll then wonder why field:bing bong doesn't work. It's because the parser 
made it into field:bing default_field:bong. So you'll still (potentially) have 
to quote or escape some terms on input; it depends on the query parser you're 
using.

Best,
Erick
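
[A concrete illustration of that caution, with a hypothetical "title" field
and "text" as the default field; adding debug=query shows what the parser
did (output abridged):]

q=title:bing bong&debug=query
  -> "parsedquery": "title:bing text:bong"      (second term fell into the default field)

q=title:"bing bong"&debug=query
  -> "parsedquery": "PhraseQuery(title:\"bing bong\")"      (both terms stay in the field)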

On Fri, Jan 11, 2019 at 1:40 AM Bruno Mannina  
wrote:
>
> Hello,
>
>
>
> I’m facing a problem concerning the default field “text” (SOLR 5.4) 
> and queries which contain a / (slash)
>
>
>
> I need to have a default “text” field with:
>
> - ignoreCase,
>
> - no auto truncation,
>
> - process slash char
>
>
>
> I would like to perform queries only on the field “text”
>
> Queries can contain: codes or keywords or both.
>
>
>
> I have 2 fields named symbol and title, and 1 alias ti (old field that 
> I can’t delete or modify)
>
>
>
> * Symbol contains codes with a slash (e.g. A62C21/02)
>
> <field name="symbol" type="string_ci" indexed="true" required="true" stored="true"/>
>
>
>
> * Title contains English text and also symbols
>
> <field name="title" type="text_en" indexed="true" stored="true"
> termVectors="true" termPositions="true" termOffsets="true"/>
>
>
>
> { "symbol": "B65D81/20",
>
> "title": [
>
>  "under vacuum or superatmospheric pressure, or in a special 
> atmosphere, e.g. of inert gas  {(B65D81/28  takes precedence; 
> containers with pressurising means for maintaining ball pressure A63B39/025)} 
> "
>
> ]}
>
>
>
> * Ti is an alias of title
>
> <field name="ti" type="text_general" indexed="true" stored="true"
> termVectors="true" termPositions="true" termOffsets="true"/>
>
>
>
> * Text is
>
> <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
>
>
>
> - Aliases are:
>
> [four copyField definitions stripped by the mailing-list archive]
>
>
>
>
>
>
> If I do these queries:
>
> * ti:airbag  -> it's ok
>
> * title:airbag  -> not good for me, because it finds
> airbags
>
> * ti:b65D81/28  -> not good, debug shows ti:b65d81 OR ti:28
>
> * ti:"b65D81/28"  -> it's ok
>
> * symbol:b65D81/28  -> it's ok (even without quotes)
>
>
>
> NOW with the "text" field
>
> * b65D81/28  -> not good, debug shows text:b65d81 OR
> text:28
>
> * airbag  -> it's ok
>
> * "b65D81/28"  -> it's ok
>
>
>
> It would be great if I could enter a symbol without quotes.
>
>
>
> Could you help me to have a text field which solves this problem?
> (Please find below the definitions of all my fields.)
>
>
>
> Many thanks for your help.
>
>
>
> string_ci is my own definition:
>
> <fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.PorterStemFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
>
>
>
> Best Regards
>
> Bruno
>
>
>
>
>



RE: Delayed/waiting requests

2019-01-14 Thread Gael Jourdan-Weil
I had a look at the GC logs this morning but I'm not sure how to interpret them.


Over a period of 54 minutes, there are:

- Number of pauses: 2739

- Accumulated pauses: 93s => that is 2.86% of the time

- Average pause duration: 0.03s

- Average pause interval: 1.18s

- Accumulated full GCs: 0

I'm not sure if this is a lot or not. What do you think?


Looking more closely at the GC logs with GCViewer, I can see that the high
response time peaks happen at the same time as GC pauses that take 2x longer
(around 0.06s) than average.


Also, we are indeed indexing at the same time, but we have autowarming set.

I don't see any searcher opened at the time we experience slowness.

Nevertheless, our filterCache is set to autowarm 12k entries, which is also
the maxSize.

Could this have any downside?


Thanks,

Gaël



From: Erick Erickson 
Sent: Friday, January 11, 2019 17:21
To: solr-user
Subject: Re: Delayed/waiting requests

Jimi's comment is one of the very common culprits.

Autowarming is another. Are you indexing at the same
time? If so it could well be  you aren't autowarming and
the spikes are caused by using a new IndexSearcher
that has to read much of the index off disk when commits
happen. The "smoking gun" here would be if the spikes
correlate to your commits (soft or hard-with-opensearcher-true).

Best,
Erick

On Fri, Jan 11, 2019 at 1:23 AM Gael Jourdan-Weil
 wrote:
>
> Interesting indeed, we did not see anything with VisualVM but having a look 
> at the GC logs could give us more info, especially on the pauses.
>
> I will collect data over the week-end and look at it.
>
>
> Thanks
>
> 
> From: Hullegård, Jimi 
> Sent: Friday, January 11, 2019 03:46:02
> To: solr-user@lucene.apache.org
> Subject: Re: Delayed/waiting requests
>
> Could be caused by garbage collection in the jvm.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> Go down to the segment called “GC pause problems”
>
> /Jimi
>
> Sent from my iPhone
>
> On 11 Jan 2019, at 05:05, Gael Jourdan-Weil 
> mailto:gael.jourdan-w...@kelkoogroup.com>> 
> wrote:
>
> Hello,
>
> We are experiencing some performance issues on a simple SolrCloud cluster of 
> 3 replicas (1 core) but what we found during our analysis seems a bit odd, so 
> we thought the community could have relevant ideas on this.
>
> Load: between 30 and 40 queries per second, constant over time of analysis
>
> Symptoms: high response times over short periods of time, but quite frequently.
> We are talking about request response times going from 50ms to 5000ms or even
> worse for less than 5 seconds, and then going back to normal.
>
> What we found out: just before the response time increases, requests seem to
> be delayed.
> That is, during 2-3 seconds, requests pile up, no response is sent, and then
> all requests are resolved and responses are all returned to the clients at
> the same time.
> Very much as if there were a lock happening somewhere. But we found no
> "lock" time at either the JVM or system level.
>
> Can someone think of something in the way Solr works that could explain this?
> Or ideas to track down the root cause?
>
> Solr version is 7.2.1.
>
> Thanks for reading,
>
> Gaël Jourdan-Weil
>


Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-14 Thread Alexandre Rafalovitch
I think asking this question on the Tika mailing list may give you better
answers. Then, if the conclusion is that the behavior is configurable,
you can see how to do it in Solr. It may be, however, that you need to
do the parsing outside of Solr with standalone Tika. Standalone Tika is
the usual production advice anyway.

I would suggest the title be something like "How to prefer the text/plain
part of an email message when parsing .eml files".

Regards,
  Alex.

On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo  wrote:
>
> Hi,
>
> I have uploaded a sample EML file here:
> https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
>
> This is what is indexed in the content:
>
> "content":"  font-size: 14pt; font-family: book antiqua,
> palatino, serif;  Hi There,font-size: 14pt; font-family:
> book antiqua, palatino, serif;  My client owns the domain name “
> font-size: 14pt; color: #ff; font-family: arial black, sans-serif;
>  TravelInsuranceEurope.com   font-size: 14pt; font-family: book
> antiqua, palatino, serif;  ” and is considering putting it in market.
> It is keyword rich domain with good search volume,adword bidding and
> type-in-traffic.font-size: 14pt; font-family: book
> antiqua, palatino, serif;  Based on our extensive study, we strongly
> feel that you should consider buying this domain name to improve the
> SEO, Online visibility, brand image, authority and type-in-traffic for
> your business. We also do provide free 1 year hosting and unlimited
> emails along with domain name.font-size: 14pt;
> font-family: book antiqua, palatino, serif;  Besides this, if you need
> any other domain name, web and app designing services and digital
> marketing services (SEO, PPC and SMO) at reasonable charges, feel free
> to contact us.font-size: 14pt; font-family: book antiqua,
> palatino, serif;  Best Regards,font-size: 14pt;
> font-family: book antiqua, palatino, serif;  Josh   ",
>
>
> As you can see, this is taken from the Content-Type: text/html.
> > However, the Content-Type: text/plain looks clean, and that is what we
> > want to be indexed.
>
> How can we configure the Tika in Solr to change the priority to get the
> content from Content-Type: text/plain  instead of Content-Type: text/html?
>
> On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo 
> wrote:
>
> > Hi,
> >
> > I am using Solr 7.5.0 with Tika 1.18.
> >
> > Currently I am facing a situation during the indexing of EML files,
> > whereby the content is being extracted from the Content-type=text/html
> > instead of Content-type=text/plain.
> >
> > The problem with Content-type=text/html is that it contains a lot of words
> > like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> > these get indexed in Solr as well, which makes the content very cluttered,
> > and it also affects the search, as when we search for words like "font",
> > all the contents get returned because of this.
> >
> > Would like to enquire on the following:
> > 1. Why Tika didn't get the text part (text/plain). Is there any way to
> > configure the Tika in Solr to change the priority to get the text part
> > (text/plain) instead of html part (text/html).
> > 2. If that is not possible, as you can see, the content is not clean,
> > which is not right. How can we get this to be clean when Tika is extracting
> > text?
> >
> > Regards,
> > Edwin
> >


Re: DateRangeField requires month?

2019-01-14 Thread Jeremy Smith
Hi Mikhail, thanks for the response.  I'm probably missing something, but what 
makes 2000-11T13 contiguous and 2000T13 not contiguous?  They seem pretty 
similar to me, but only the former is supported.


Thanks,

Jeremy


From: Mikhail Khludnev 
Sent: Sunday, January 13, 2019 12:59:31 AM
To: solr-user
Subject: Re: DateRangeField requires month?

Hello, Jeremy.

See below.

On Mon, Jan 7, 2019 at 5:09 PM Jeremy Smith  wrote:

> Hello,
>
>  I am trying to use the DateRangeField and ran into an interesting
> issue.  According to the documentation (
> https://lucene.apache.org/solr/guide/7_6/working-with-dates.html), these
> are both valid for the DateRangeField: 2000-11 and 2000-11T13.  I can
> confirm this is working in 7.6.  I would also expect to be able to use
> 2000T13, which would mean any time in the year 2000 between 1300 and 1400.


Nope. This is not a range, but multiple ranges. DateRangeField supports
contiguous ranges only.


> However, I get an error when trying to insert this value:
>
>
> "error":{"metadata":
>
>
> ["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],
>
> "msg":"ERROR: Error adding field 'dtRange'='2000T13' msg=Couldn't
> parse date because: Improperly formatted date: 2000T13","code":400
>
> }
>
>
> I am using 7.6 with a super simple schema containing only _version_ and a
> DateRangeField and there's nothing special in my solrconfig.xml.  Is this
> behavior expected?  Should I open a jira issue?
>
>
> Thanks,
>
> Jeremy
>


--
Sincerely yours
Mikhail Khludnev


Re: Search query with & without question mark

2019-01-14 Thread Elizabeth Haubert
Because the standard query parser treats '?' as a single-character wildcard:
https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html

So in the case q="how do I add a field", the word "field" in your document
matches.  In the second case, q="how do I add a field?", it is looking for
tokens like "fields" or "fielde"; the term without a trailing one-character
suffix doesn't match anymore.  That is why it is no longer included in the
scoring.

https://lucene.apache.org/solr/guide/7_6/the-standard-query-parser.html#wildcard-searches

Elizabeth
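
[If users should not be able to trigger wildcards, the '?' can be escaped, or
the phrase quoted, before it reaches the parser, e.g.:]

q=how do i add a field\?      (escaped: the '?' is analyzed as literal text)
q="how do i add a field?"     (quoted: wildcards are not interpreted inside phrases)

[This assumes the standard/lucene or edismax parser; what remains of the
trailing '?' after escaping then depends on the field's analysis chain.]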


On Mon, Jan 14, 2019 at 2:07 AM Jay Potharaju  wrote:

> the parsed query is the same when debugging, but when calculating the scores
> different fields are being taken into consideration. Why would that be the
> case? My guess is that the SuggestStopFilterFactory is not working as I
> expect it to and is causing this weird situation.
>
> Updated field type definition:
> <fieldType name="..." class="solr.TextField">
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>       pattern="['!#\$%'\(\)\*+,-\./:;=?@\[\]\^_`{|}~!@#$%^*]" />
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.SuggestStopFilterFactory" ignoreCase="true"
>       words="lang/stopwords_en.txt" />
>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>   </analyzer>
> </fieldType>
>
> Debug Query:
> *"rawquerystring":"how do i add a field"*,
> "querystring":"how do i add a field",
> "parsedquery":"(+(DisjunctionMaxQuery((topic_title_plain:how))
> DisjunctionMaxQuery((topic_title_plain:do))
> DisjunctionMaxQuery((topic_title_plain:i))
> DisjunctionMaxQuery((topic_title_plain:add))
> DisjunctionMaxQuery((topic_title_plain:a))
> DisjunctionMaxQuery((topic_title_plain:field/no_coord",
> "parsedquery_toString":"+((topic_title_plain:how)
> (topic_title_plain:do) (topic_title_plain:i) (topic_title_plain:add)
> (topic_title_plain:a) (topic_title_plain:field))",
> "explain":{
>   "1":"
> 6.1034017 = sum of:
>   2.0065408 = weight(topic_title_plain:add in 107) [SchemaSimilarity],
> result of:
> 2.0065408 = score(doc=107,freq=1.0 = termFreq=1.0
> ), product of:
>   2.1391609 = idf, computed as log(1 + (docCount - docFreq + 0.5) /
> (docFreq + 0.5)) from:
> 32.0 = docFreq
> 275.0 = docCount
>   0.9380037 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 -
> b + b * fieldLength / avgFieldLength)) from:
> 1.0 = termFreq=1.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 3.4436364 = avgFieldLength
> 4.0 = fieldLength
>   4.096861 = weight(topic_title_plain:field in 107) [SchemaSimilarity],
> result of:
> 4.096861 = score(doc=107,freq=1.0 = termFreq=1.0
> ), product of:
>   4.367638 = idf, computed as log(1 + (docCount - docFreq + 0.5) /
> (docFreq + 0.5)) from:
> 3.0 = docFreq
> 275.0 = docCount
>   0.9380037 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 -
> b + b * fieldLength / avgFieldLength)) from:
> 1.0 = termFreq=1.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 3.4436364 = avgFieldLength
> 4.0 = fieldLength
> "},
>
> *rawquerystring":"how do i add a field?",*
> "querystring":"how do i add a field?",
> "parsedquery":"(+(DisjunctionMaxQuery((topic_title_plain:how))
> DisjunctionMaxQuery((topic_title_plain:do))
> DisjunctionMaxQuery((topic_title_plain:i))
> DisjunctionMaxQuery((topic_title_plain:add))
> DisjunctionMaxQuery((topic_title_plain:a))
> DisjunctionMaxQuery((topic_title_plain:field/no_coord",
> "parsedquery_toString":"+((topic_title_plain:how)
> (topic_title_plain:do) (topic_title_plain:i) (topic_title_plain:add)
> (topic_title_plain:a) (topic_title_plain:field))",
> "explain":{
>   "2":"
> 3.798876 = sum of:
>   2.033249 = weight(topic_title_plain:how in 202) [SchemaSimilarity],
> result of:
> 2.033249 = score(doc=202,freq=1.0 = termFreq=1.0
> ), product of:
>   2.4634004 = idf, computed as log(1 + (docCount - docFreq + 0.5) /
> (docFreq + 0.5)) from:
> 23.0 = docFreq
> 275.0 = docCount
>   0.82538307 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1
> - b + b * fieldLength / avgFieldLength)) from:
> 1.0 = termFreq=1.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 3.4436364 = avgFieldLength
> 5.2244897 = fieldLength
> *  1.7656271 = weight(topic_title_plain:add in 202) [SchemaSimilarity],
> result of:*
> 1.7656271 = score(doc=202,freq=1.0 = termFreq=1.0
> ), product of:
>   2.1391609 = idf, computed as log(1 + (docCount - docFreq + 0.5) /
> (docFreq + 0.5)) from:
> 32.0 = docFreq
> 275.0 = docCount
>   0.82538307 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1
> - b + b * fieldLength / avgFieldLength)) from:
> 1.0 = termFreq=1.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 3.4436364 = avgFieldLength
> 5.2244897 = fieldLength
> "},
> Thanks
> Jay
>
>
>
> 

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-14 Thread Terry Steichen
Using 6.6.0, I am able to index EML files just fine.  The trick is, when
indexing folders containing .eml files, to add "-filetypes eml" to the
command line (note the plural "filetypes").

Terry Steichen
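
[For example, with the bundled post tool, assuming a core named "mail" and a
folder of .eml files (names are illustrative):]

bin/post -c mail -filetypes eml /path/to/mailbox/

[By default the post tool only sends file types on its built-in list, which
is why .eml has to be named explicitly.]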

On 1/13/19 10:18 PM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> I am using Solr 7.5.0 with Tika 1.18.
>
> Currently I am facing a situation during the indexing of EML files, whereby
> the content is being extracted from the Content-type=text/html instead of
> Content-type=text/plain.
>
> The problem with Content-type=text/html is that it contains a lot of words
> like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all of
> these get indexed in Solr as well, which makes the content very cluttered,
> and it also affects the search, as when we search for words like "font",
> all the contents get returned because of this.
>
> Would like to enquire on the following:
> 1. Why Tika didn't get the text part (text/plain). Is there any way to
> configure the Tika in Solr to change the priority to get the text part
> (text/plain) instead of html part (text/html).
> 2. If that is not possible, as you can see, the content is not clean, which
> is not right. How can we get this to be clean when Tika is extracting text?
>
> Regards,
> Edwin
>


Re: Delayed/waiting requests

2019-01-14 Thread Erick Erickson
Gael:

bq. Nevertheless, our filterCache is set to autowarm 12k entries which
is also the maxSize

That is far, far, far too many. Let's assume you actually have 12K
entries in the filterCache.
Every time you open a new searcher, 12K queries are executed _before_
the searcher
accepts any new requests. While being able to re-use a filterCache
entry is useful, one of
the primary purposes is to pre-load index data from disk into memory,
which can be
the event that takes the most time.

The queryResultCache has a similar function. I often find that this
cache doesn't have a
very high hit ratio, but again executing a _few_ of these queries
warms the index from
disk.

I think of both caches as a map, where the key is the "thing", (fq
clause in the case
of filterCache, the whole query in the case of the queryResultCache).
Autowarming
replays the most recently executed N of these entries, essentially
just as though
they were submitted by a user.

Hypothesis: You're massively over-warming, and when that kicks in you see
increased CPU and GC pressure, leading to the anomalies you observe. Further,
you have such excessive autowarming going on that it's hard to see the
associated messages in the log.

Here's what I'd recommend: Set your autowarm counts to something on the order
of 16. If the culprit is just excessive autowarming, I'd expect your spikes to
be much less severe. It _might_ be that your users see some increased (very
temporary) variance in response time. You can tell that the autowarming
configurations are "more art than science"; I can't give you any other
recommendation than "start small and increase until you're happy",
unfortunately.

I usually do this with some kind of load tester in a dev lab of course ;).
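
[For reference, the knob in question is the autowarmCount on the cache
entries in solrconfig.xml; a sketch using the 12K size from Gaël's
description:]

<filterCache class="solr.FastLRUCache"
             size="12288"
             initialSize="512"
             autowarmCount="16"/>

[With this, only the 16 most recently used fq entries are re-executed when a
new searcher opens, instead of all 12K.]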

Finally, if you use the metrics data (see:
https://lucene.apache.org/solr/guide/7_1/metrics-reporting.html)
you can see the autowarm times. Don't get too lost in the page to
start, just hit the "http://localhost:8983/solr/admin/metrics" endpoint
and look for "warmupTime", then refine on how to get _only_
the warmup stats ;).
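
[For example, something along these lines narrows the output to the
filterCache, whose stats include warmupTime (host/port are illustrative):]

curl "http://localhost:8983/solr/admin/metrics?group=core&prefix=CACHE.searcher.filterCache"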

Best,
Erick

On Mon, Jan 14, 2019 at 5:08 AM Gael Jourdan-Weil
 wrote:
>
> I had a look at the GC logs this morning but I'm not sure how to interpret them.
>
>
> Over a period of 54 minutes, there are:
>
> - Number of pauses: 2739
>
> - Accumulated pauses: 93s => that is 2.86% of the time
>
> - Average pause duration: 0.03s
>
> - Average pause interval: 1.18s
>
> - Accumulated full GCs: 0
>
> I'm not sure if this is a lot or not. What do you think?
>
>
> Looking more closely at the GC logs with GCViewer, I can see that the high
> response time peaks happen at the same time as GC pauses that take 2x longer
> (around 0.06s) than average.
>
>
> Also, we are indeed indexing at the same time, but we have autowarming set.
>
> I don't see any searcher opened at the time we experience slowness.
>
> Nevertheless, our filterCache is set to autowarm 12k entries, which is also
> the maxSize.
>
> Could this have any downside?
>
>
> Thanks,
>
> Gaël
>
>
> 
> From: Erick Erickson 
> Sent: Friday, January 11, 2019 17:21
> To: solr-user
> Subject: Re: Delayed/waiting requests
>
> Jimi's comment is one of the very common culprits.
>
> Autowarming is another. Are you indexing at the same
> time? If so it could well be  you aren't autowarming and
> the spikes are caused by using a new IndexSearcher
> that has to read much of the index off disk when commits
> happen. The "smoking gun" here would be if the spikes
> correlate to your commits (soft or hard-with-opensearcher-true).
>
> Best,
> Erick
>
> On Fri, Jan 11, 2019 at 1:23 AM Gael Jourdan-Weil
>  wrote:
> >
> > Interesting indeed, we did not see anything with VisualVM but having a look 
> > at the GC logs could give us more info, especially on the pauses.
> >
> > I will collect data over the week-end and look at it.
> >
> >
> > Thanks
> >
> > 
> > From: Hullegård, Jimi 
> > Sent: Friday, January 11, 2019 03:46:02
> > To: solr-user@lucene.apache.org
> > Subject: Re: Delayed/waiting requests
> >
> > Could be caused by garbage collection in the jvm.
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Go down to the segment called “GC pause problems”
> >
> > /Jimi
> >
> > Sent from my iPhone
> >
> > On 11 Jan 2019, at 05:05, Gael Jourdan-Weil 
> > mailto:gael.jourdan-w...@kelkoogroup.com>>
> >  wrote:
> >
> > Hello,
> >
> > We are experiencing some performance issues on a simple SolrCloud cluster 
> > of 3 replicas (1 core) but what we found during our analysis seems a bit 
> > odd, so we thought the community could have relevant ideas on this.
> >
> > Load: between 30 and 40 queries per second, constant over time of analysis
> >
> > Symptoms: high response times over short periods of time, but quite frequently.
> > We are talking about request response times going from 50ms to 5000ms or
> > even worse for less than 5 seconds, and then going back to normal.
> >

Re: Bootstrapping a Collection on SolrCloud

2019-01-14 Thread Frank Greguska
I've decided to take the approach of waiting for the expected number of nodes
to become available before initializing the collection. Here is the script
I am using:

https://github.com/apache/incubator-sdap-nexus/blob/91b15ce0b123d652eaa1f5eb589a835ae3e77ceb/docker/solr/cloud-init/create-collection.py

This script will be deployed (using kubernetes) alongside every Solr node
and started at the same time as Solr. I utilize a lock in zookeeper to
ensure that only one node ever attempts to create the collection.

I still think this could be done without any actual nodes running so that
when the cluster starts the collection is immediately ready but this seems
to fit my purpose for now.

- Frank

On Wed, Jan 9, 2019 at 7:22 PM Erick Erickson 
wrote:

> First, for a given data set, I can easily double or halve
> the size of the index on disk depending on what options
> I choose for my fields; things like how many times I may
> need to copy fields to support various use-cases,
> whether I need to store the input for some, all or no
> fields, whether I enable docValues, whether I need to
> support phrase queries and on and on
>
> Even assuming you can estimate the eventual size,
> it doesn't help much. As one example, if you choose
> stored="true", the index size will grow by roughly 50% of
> the raw data size. But that data doesn't really affect
> searching that much in that it doesn't need to be
> RAM resident in the same way your terms data needs
> to be. So In  order to be performant I may need anywhere
> from a fraction of the raw index size on disk to multiples
> of the index size on disk in terms of RAM.
>
> So you see where this is going. I'm not against your
> suggestion, but I have strong doubts as to its
> feasibility give all the variables I've seen. We can revisit
> this after you've had a chance to kick the tires, I suspect
> we'll have more shared context on which to base
> the discussion.
>
> Best,
> Erick
>
> On Wed, Jan 9, 2019 at 5:12 PM Frank Greguska  wrote:
> >
> > Thanks, I am no Solr expert so I may be over-simplifying things a bit in
> my
> > ignorance.
> >
> > "No. The replicas are in a "down" state the Solr instances are brought
> back
> > up" Why can't I dictate (at least initially) the "up" state somehow? It
> > seems Solr keeps track of where replicas were deployed so that the
> cluster
> > 'heals' itself when all nodes are back. At deployment, I know which nodes
> > should be available so the collection could be unavailable until all
> > expected nodes are up.
> >
> > Thank you for the pointer to the createNodeSet parameter, that might
> prove
> > useful.
> >
> > "I think the thing I'm getting stuck on is how in the world the
> > Solr code could know enough to "do the right thing". How many
> > docs do you have? How big are they? How much to you expect
> > to grow? What kinds of searches do you want to support?"
> >
> > Solr can't know these things. But me as the deployer/developer might.
> > For example say I know my initial data size and can say the index will be
> > 10 TB. If I have 2 nodes with 5 TB disks well then I have to have 2
> shards
> > because it won't fit on one node. If instead I have 4 nodes with 5 TB
> > disks, well I could still have 2 shards but with replicas. Or I could
> > choose no replicas but more shards. This is what I mean by the
> > shard/replica decision being partially dependent on available hardware;
> > there are some decisions I could make knowing my planned deployment so
> that
> > when I start the cluster it can be immediately functional. Rather than
> > first starting the cluster, then creating the collection, then making it
> > available.
> >
> > You may be right that it is a small and complicated concern because I
> > really only need to care about it once when I am first deploying my
> > cluster. But everyone who needs to stand up a SolrCloud cluster needs to
> do
> > it. My guess is most people either do it manually as a one-time
> operations
> > thing or they write a custom script to do it for them automatically as I
> am
> > attempting. Seems like a good candidate for a new feature.
> >
> > - Frank
> >
> > On Wed, Jan 9, 2019 at 4:18 PM Erick Erickson 
> > wrote:
> >
> > > bq.  do all 100 replicas move to the one remaining node?
> > >
> > > No. The replicas are in a "down" state until the Solr instances
> > > are brought back up (I'm skipping autoscaling here, but
> > > even that wouldn't move all the replicas to the one remaining
> > > node).
> > >
> > > bq.  what the collection *should* look like based on the
> > > hardware I am deploying to.
> > >
> > > With the caveat that the Solr instances have to be up, this
> > > is entirely possible. First of all, you can provide a "createNodeSet"
> > > to the create command to specify exactly what Solr nodes you
> > > want used for your collection. There's a special "EMPTY"
> > > value that _almost_ does what you want, that is it creates
> > > no replicas, just the configuration in ZooKeeper. Thereafter,

Solr Authentication Error - Error trying to proxy request for url

2019-01-14 Thread Ganesh Sethuraman
We are using Solr 7.2.1 in SolrCloud mode, with embedded ZooKeeper for
test purposes. We enabled SSL and authentication, and the admin UI is
working fine with authentication. But queries through the UI or
otherwise are failing with the following error. Please help us resolve
this. Is this related to authentication or SSL? If you can throw some
light on it, it will be of great help to us.

https://solr-node-1:8080/solr//select?q=*:*

Error:
{
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",

 
"root-error-class","sun.security.provider.certpath.SunCertPathBuilderException"],
"msg":"Error trying to proxy request for url:
https://doaminsolr/ba_test/select ",
"trace":"org.apache.solr.common.SolrException: Error trying to proxy
request for url: https://domain/solr/ba_test/select\n\tat
org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:646)\n\tat


Re: Curator in SOLR

2019-01-14 Thread Gus Heck
Unfortunately, the answer is no, we don't quite have the same thing for
TRAs yet. What's needed to bridge the gap are auto-scaling features that
allow such migrations of older collections to different hardware to take
place automatically. Dave Smiley and I have definitely discussed the
possibility while developing TRAs, but neither of us has really had the
ability to prioritize it yet. Other features have taken priority. Feel free
to propose something, or even contribute something. I'll certainly be
interested in reviewing if you come up with a patch :)

I also haven't (yet) taken the time to look super carefully at the existing
auto-scaling features to see if some existing combination can get you there.
-Gus

On Sun, Jan 13, 2019 at 7:50 AM SOLR4189  wrote:

> Hi all,
>
> I want to use a TimeRoutedAlias collection. But first of all I have a
> question: does Solr have something like Curator in Elasticsearch? How
> can I manage/move old read-only collections to "weaker hardware"?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
http://www.the111shift.com


Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-14 Thread Zheng Lin Edwin Yeo
Hi Alex,

Thanks for the suggestions.
Yes, I have posted it in the Tika mailing list too.

Regards,
Edwin

On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch 
wrote:

> I think asking this question on the Tika mailing list may give you better
> answers. Then, if the conclusion is that the behavior is configurable,
> you can see how to do it in Solr. It may be, however, that you need to
> do the parsing outside of Solr with standalone Tika. Standalone Tika
> is the usual production advice anyway.
>
> I would suggest the title be something like "How to prefer the text/plain
> part of an email message when parsing .eml files".
>
> Regards,
>   Alex.
>
> On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo 
> wrote:
> >
> > Hi,
> >
> > I have uploaded a sample EML file here:
> >
> https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
> >
> > This is what is indexed in the content:
> >
> > "content":"  font-size: 14pt; font-family: book antiqua,
> > palatino, serif;  Hi There,font-size: 14pt; font-family:
> > book antiqua, palatino, serif;  My client owns the domain name “
> > font-size: 14pt; color: #ff; font-family: arial black, sans-serif;
> >  TravelInsuranceEurope.com   font-size: 14pt; font-family: book
> > antiqua, palatino, serif;  ” and is considering putting it in market.
> > It is keyword rich domain with good search volume,adword bidding and
> > type-in-traffic.font-size: 14pt; font-family: book
> > antiqua, palatino, serif;  Based on our extensive study, we strongly
> > feel that you should consider buying this domain name to improve the
> > SEO, Online visibility, brand image, authority and type-in-traffic for
> > your business. We also do provide free 1 year hosting and unlimited
> > emails along with domain name.font-size: 14pt;
> > font-family: book antiqua, palatino, serif;  Besides this, if you need
> > any other domain name, web and app designing services and digital
> > marketing services (SEO, PPC and SMO) at reasonable charges, feel free
> > to contact us.font-size: 14pt; font-family: book antiqua,
> > palatino, serif;  Best Regards,font-size: 14pt;
> > font-family: book antiqua, palatino, serif;  Josh   ",
> >
> >
> > As you can see, this is taken from the Content-Type: text/html.
> > However, the Content-Type: text/plain looks clean, and that is what we
> > want to be indexed.
> >
> > How can we configure the Tika in Solr to change the priority to get the
> > content from Content-Type: text/plain  instead of Content-Type:
> text/html?
> >
> > On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo 
> > wrote:
> >
> > > Hi,
> > >
> > > I am using Solr 7.5.0 with Tika 1.18.
> > >
> > > Currently I am facing a situation during the indexing of EML files,
> > > whereby the content is being extracted from the Content-type=text/html
> > > instead of Content-type=text/plain.
> > >
> > > The problem with Content-type=text/html is that it contains a lot of
> > > words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and
> > > all of these get indexed in Solr as well, which makes the content very
> > > cluttered, and it also affects the search, as when we search for words
> > > like "font", all the contents get returned because of this.
> > >
> > > Would like to enquire on the following:
> > > 1. Why Tika didn't get the text part (text/plain). Is there any way to
> > > configure the Tika in Solr to change the priority to get the text part
> > > (text/plain) instead of html part (text/html).
> > > 2. If that is not possible, as you can see, the content is not clean,
> > > which is not right. How can we get this to be clean when Tika is
> extracting
> > > text?
> > >
> > > Regards,
> > > Edwin
> > >
>


Re: Solr Authentication Error - Error trying to proxy request for url

2019-01-14 Thread Zheng Lin Edwin Yeo
Hi,

When you generated the SSL certificate, did you set the IP address in it to
the IP address of your system?

Regards,
Edwin
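
[For reference, the host names and IPs a certificate is valid for are set via
the SAN extension when the keystore is generated, along the lines of the Solr
SSL guide (names, IPs, and passwords here are illustrative):]

keytool -genkeypair -alias solr-ssl -keyalg RSA -keysize 2048 \
  -keypass secret -storepass secret -validity 9999 \
  -keystore solr-ssl.keystore.jks \
  -ext SAN=DNS:solr-node-1,IP:192.168.1.10,IP:127.0.0.1 \
  -dname "CN=solr-node-1, OU=eng, O=example, L=city, ST=state, C=US"

[Note that SunCertPathBuilderException specifically means the certificate is
not trusted: for inter-node (proxied) requests, each node's JVM must also
trust the certificate, e.g. via the SOLR_SSL_TRUST_STORE setting in
solr.in.sh.]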

On Tue, 15 Jan 2019 at 01:31, Ganesh Sethuraman 
wrote:

> We are using Solr 7.2.1 in SolrCloud mode, with embedded ZooKeeper for
> test purposes. We enabled SSL and authentication, and the admin UI is
> working fine with authentication. But queries through the UI or
> otherwise are failing with the following error. Please help us resolve
> this. Is this related to authentication or SSL? If you can throw some
> light on it, it will be of great help to us.
>
> https://solr-node-1:8080/solr//select?q=*:*
>
> Error:
> {
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>
>
>  
> "root-error-class","sun.security.provider.certpath.SunCertPathBuilderException"],
> "msg":"Error trying to proxy request for url:
> https://doaminsolr/ba_test/select ",
> "trace":"org.apache.solr.common.SolrException: Error trying to proxy
> request for url: https://domain/solr/ba_test/select\n\tat
>
> org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:646)\n\tat
>


newbie: a few newbie questions regarding Solr

2019-01-14 Thread Audun Holme
Hi
My magento store (magento 1.9.3.7) have a very slow search, takes 30 seconds to 
show some results, should be down to 1 second...
So I have read about solr and think maybe I need solr.
I found this link:https://github.com/integer-net/solr-magento1
I downloaed that plugin and started to read about how to install it. Unsure as 
I was I sent an email to https://integernet-solr.com/ to ask  a few questions 
to them. They replied  shortly and told me that Solr is a sepate app running on 
the apache server. The plugin I found at gitHub is just a connection between 
the apache and the magento store. Correct?
question:
Where to I get both the apache app? I thought it was included in the module, 
but I have been wrong so many times :) Is it solr_conf folder??

>From the install guide, I read:"Install Solr and create at least one working 
>core"I assume it here means the apache app?? Isn't there a standard app in the 
>module ready to be used?? if so what folder is it in??

Would be great if someone had a quide on how to confiigure it as I get 
confused. 

Audun


Re: newbie: a few newbie questions regarding Solr

2019-01-14 Thread Shawn Heisey

On 1/14/2019 3:18 PM, Audun Holme wrote:

So I have read about Solr and think maybe I need Solr.
I found this link: https://github.com/integer-net/solr-magento1
I downloaded that plugin and started to read about how to install it. Unsure as
I was, I sent an email to https://integernet-solr.com/ to ask a few questions.
They replied shortly and told me that Solr is a separate app running on the
Apache server. The plugin I found at GitHub is just a connection between the
Apache server and the Magento store. Correct?


The project you linked is probably a magento plugin that knows how to 
access a Solr server.  It does not include Solr.



question:
Where do I get the Apache app? I thought it was included in the module, 
but I have been wrong so many times :) Is it the solr_conf folder??


What is "the apache app"?

Solr and Apache are not equivalent in any way.  Solr is a software 
project that is managed within the Apache Software Foundation.  Even though 
the project is known by the full phrase "Apache Solr", it has nothing at all 
to do with the Apache HTTP Server -- that is a completely separate project.



 From the install guide, I read: "Install Solr and create at least one working
core". I assume it here means the Apache app?? Isn't there a standard app in
the module ready to be used?? If so, what folder is it in??


Solr is a completely standalone piece of software.  It does not run as 
part of the Apache HTTP Server.  Normally it is not embedded within ANY 
other software.  In order to embed Solr in another piece of software, 
that software would have to be written in Java.  I'm pretty sure that 
magento is not written in Java.


The instructions in the project you referenced are pretty clear.  
Install Solr, start it, create a core, replace the core's config files 
with those provided in the github project, and then restart Solr.  
Before you can install Solr, you will have to download it.
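
[Concretely, on Linux that sequence might look like this; the version and
core name are illustrative:]

# download and unpack Solr, then start it
tar xzf solr-6.6.5.tgz
cd solr-6.6.5
bin/solr start

# create a core named "magento"
bin/solr create -c magento

# replace server/solr/magento/conf/ with the files from the plugin's
# solr_conf folder, then restart
bin/solr restart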


I would recommend installing Solr on an OS like Linux. There is a 
service installer included in the Solr download that works on most 
operating systems with GNU tools.  There is no service installer 
included for Windows.  Some people have successfully created a Windows 
service for Solr on their own.


Here's information in the Solr documentation for creating a core:

https://lucene.apache.org/solr/guide/6_6/running-solr.html

I used the 6.6 version of the documentation for the above because the 
project you linked says it works with 4.x through 6.x.  I have no idea 
whether their configs are compatible with 7.x versions of Solr.


This is where you can get the latest 6.x version of Solr (at the time I 
write this):


http://archive.apache.org/dist/lucene/solr/6.6.5/

It is available in both tgz and zip formats.  The largest files in the 
directory are the full binary download.


Thanks,
Shawn