Re: Recommended Java Distribution

2020-03-19 Thread Eric Buss
Hi Kaya,

We have been using Amazon Corretto for Solr for the past 6 months without 
issue. We did not notice any difference from running on OpenJDK prior to that.

Cheers
Eric

On 2020-03-19, 6:04 AM, "Jan Høydahl" wrote:

Our official statement is here:

https://lucene.apache.org/solr/guide/8_4/solr-system-requirements.html#sources-for-java

I have no experience with Corretto in production, but Amazon uses it 
heavily for all their Java workloads in the cloud. I believe it is based on 
OpenJDK but with Amazon’s own patches.
I would not hesitate to make such a decision. But perhaps people with 
first-hand experience can share what they found?

Jan

> On 19 Mar 2020, at 11:13, Kayak28 wrote:
> 
> Hello, Solr Community:
> 
> My customer would like to use Amazon Corretto JDK instead of OpenJDK.
> 
> I wonder whether it is OK to say "yes, you can use it" or whether I should
> recommend against it.
> 
> Is anyone in the Community using Amazon Corretto for your Solr?
> 
> Have you ever had any problems with that?
> 
> If you could share any experience, I would really appreciate it.
> 
> 
> -- 
> 
> Sincerely,
> Kaya
> github: https://github.com/28kayak





FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

2019-11-25 Thread Eric Buss
Hi all,

I have been trying to solve an issue where FlattenGraphFilter (FGF) removes 
tokens produced by WordDelimiterGraphFilter (WDGF). As a result, searches that 
contain the contraction "can't" do not match.

This is on Solr version 7.7.1.

The field in question is defined as follows:



And the relevant fieldType "text_general":




















Finally, the relevant entries in synonyms.txt are:

can,cans
cants,cant

Using the Solr console Analysis and "can't" as the Field Value, the following
tokens are produced (find the verbose output at the bottom of this email):

Index
ST    | can't
SF    | can't
WDGF  | cant | can't | can | t
FGF   | cant | can't | can | t
SGF   | cants | cant | can't | | cans | can | t
ICUFF | cants | cant | can't | | cans | can | t
FGF   | cants | cant | can't | | t

Query
ST    | can't
SF    | can't
WDGF  | can | t
SF    | can | t
ICUFF | can | t

As you can see, after the final FGF the tokens "can" and "cans" are pruned, so
the query does not match. Is there a reasonable way to preserve these tokens?
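
To make the mismatch concrete, here is a minimal Python sketch (not Solr code) that compares the final token lists from the Analysis output above. It assumes every query term must be present on the index side for a match, which simplifies how the query parser really builds the query. The variable names are made up for illustration.

```python
# Final tokens after the last FGF at index time, copied from the Index table.
index_tokens = {"cants", "cant", "can't", "t"}

# Final tokens after ICUFF at query time, copied from the Query table.
query_tokens = ["can", "t"]

# Assumption: every query term must exist in the index for the document to match.
missing = [tok for tok in query_tokens if tok not in index_tokens]
matches = not missing

print(missing)  # ['can']
print(matches)  # False
```

Under that assumption, the single missing term "can" is enough to make the search fail.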

My key concern is that I want the "fix" for this to have as little impact on 
other queries as possible.

Some things I have checked/tried:

Searching for similar problems I found this thread: 
https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
Here it is suggested that FGF is not necessary (without any supporting 
evidence). This goes directly against the documentation that states "If you use 
[the SynonymGraphFilter] during indexing, you must follow it with a Flatten 
Graph Filter":
https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
Despite this warning, I tried removing the FGF on a local cluster. It still 
runs and this search now works, but I am paranoid that this will break far 
more things than it fixes.

I have tried adding the FGF as a filter to the query. This does not eliminate 
the "can" term in the query analysis.

I have tested other contracted words. Some have this issue as well; others do
not. "haven't", "shouldn't", "couldn't", "I'll", "weren't" and "ain't" all 
preserve their tokens, while "won't" does not. I believe the pattern here is 
that this problem manifests whenever part of the contraction has synonyms.

Eliminating WDGF is not viable as we rely on this functionality for other uses
of delimiters (such as wi-fi -> wi fi).

Performing WDGF after synonyms is also not viable: when the data contains
"historical-text", we want it to match the search "history text".

The hacky solution I have found is to use the PatternReplaceFilterFactory to
replace "can't" with "cant". Though this technically solves the issue, I hope it
is obvious why this does not feel like an ideal solution.
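
The replacement described above can be sketched in plain Python rather than Solr configuration. The regular expression here is hypothetical, since the actual pattern is not shown in this message. It strips any apostrophe between two letters, which is broader than rewriting only "can't".

```python
import re

# Hypothetical pattern (not taken from the real config): an apostrophe
# that sits between two ASCII letters.
APOSTROPHE = re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])")

def strip_apostrophe(token: str) -> str:
    """Drop in-word apostrophes so that "can't" becomes "cant"."""
    return APOSTROPHE.sub("", token)

print(strip_apostrophe("can't"))  # cant
print(strip_apostrophe("won't"))  # wont
print(strip_apostrophe("wi-fi"))  # wi-fi (unchanged, no apostrophe)
```

A rule this broad touches every contraction, so it would need testing against other queries before use.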

Has anyone encountered this type of issue before? Any advice on how the filter 
use here could be improved to handle this case?

Thanks,
Eric Buss


PS. The verbose output from Analysis of "can't"

Index

ST    | text           | can't            |
      | raw_bytes      | [63 61 6e 27 74] |
      | start          | 0                |
      | end            | 5                |
      | positionLength | 1                |
      | type           |                  |
      | termFrequency  | 1                |
      | position       | 1                |
SF    | text           | can't            |
      | raw_bytes      | [63 61 6e 27 74] |
      | start          | 0                |
      | end            | 5                |
      | positionLength | 1                |
      | type           |                  |
      | termFrequency  | 1                |
      | position       | 1                |
WDGF  | text           | cant          | can't            | can        | t     |
      | raw_bytes      | [63 61 6e 74] | [63 61 6e 27 74] | [63 61 6e] | [74]  |
      | start          | 0             | 0                | 0          | 4     |
      | end            | 5             | 5                | 3          | 5     |
      | positionLength | 2             | 2                | 1          | 1     |
      | type           |               |                  |            |       |
      | termFrequency  | 1             | 1                | 1          | 1     |
      | position       | 1             | 1                | 1          | 2     |
      | keyword        | false         | false            | false      | false |
FGF   | text           | cant          | can't            | can        | t     |
      | raw_bytes      | [63 61 6e 74]

Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

2019-12-05 Thread Eric Buss
Thanks for the reply,

I wouldn't be surprised if the issue you linked is related. I also found 
another similar issue: https://issues.apache.org/jira/browse/LUCENE-8723 

You are absolutely right that the FlattenGraphFilter should only be used once, 
but, as you noted, the issue I am experiencing seems unrelated.

On 2019-12-05, 10:23 AM, "Michael Gibney" wrote:

I wonder if this might be similar/related to the underlying problem
that is intended to be addressed by
https://issues.apache.org/jira/browse/LUCENE-8985?

btw, I think you only want to use FlattenGraphFilter *once* in the
indexing analysis chain, towards the end (after all components that
emit graphs). ...though that's probably *not* what's causing the
problem (based on the fact that the extra FGF doesn't seem to modify
any attributes).


