[
https://issues.apache.org/jira/browse/OAK-11568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Mueller resolved OAK-11568.
----------------------------------
Fix Version/s: 1.78.0
Resolution: Fixed
A first version is now committed. I think more work is needed, but we can open
a new Jira issue for that.
> Elastic: improved compatibility for analyzer definitions
> --------------------------------------------------------
>
> Key: OAK-11568
> URL: https://issues.apache.org/jira/browse/OAK-11568
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: elastic-search
> Reporter: Thomas Mueller
> Assignee: Thomas Mueller
> Priority: Major
> Fix For: 1.78.0
>
>
> Currently, analyzer definitions for Lucene indexes are not fully compatible
> with Elasticsearch. I guess we don't need 100% compatibility, but we should
> still improve it. I have a few cases that are not currently supported, and I
> hope supporting them is possible.
> *Missing Configuration*
> The Lucene Oak index is lenient and doesn't fail if configuration is missing.
> For Elasticsearch, we get "mapping requires either `mappings` or
> `mappings_path` to be configured" in this case. We should also be lenient.
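> That error comes from the Elasticsearch `mapping` character filter, which, unlike its Lucene counterpart, rejects a definition that has neither `mappings` nor `mappings_path`. A sketch of a lenient translation, supplying an empty list when the Oak definition has no mappings (the filter name `oak_mapping` is illustrative):

```json
{
  "analysis": {
    "char_filter": {
      "oak_mapping": {
        "type": "mapping",
        "mappings": []
      }
    }
  }
}
```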
> *NGram*
> We need to translate the configuration to avoid "Unknown tokenizer type
> [n_gram]".
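> Elasticsearch registers the tokenizer under the name `ngram`, so the Lucene-style factory name `NGram` (snake-cased to `n_gram`) has to be mapped to `ngram` when building the index settings. Roughly like this (the tokenizer name and parameter values are illustrative):

```json
{
  "analysis": {
    "tokenizer": {
      "oak_ngram": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 3
      }
    }
  }
}
```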
> *Hyphenation Compound Word*
> For Oak, we support custom XML files in the configuration itself. For
> Elasticsearch, the configuration file needs to already exist on the server.
> Because of that, we cannot support the equivalent of the Lucene
> configuration.
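> For reference, the Elasticsearch `hyphenation_decompounder` token filter only accepts a `hyphenation_patterns_path` that is resolved against the server's config directory, so the patterns file cannot be embedded in the index definition. A sketch (the path and word list are illustrative):

```json
{
  "analysis": {
    "filter": {
      "oak_hyph_decompound": {
        "type": "hyphenation_decompounder",
        "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
        "word_list": ["Kaffee", "tasse"]
      }
    }
  }
}
```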
> *Word Delimiter*
> I found that with recent versions of Elasticsearch, the synonym filter
> cannot be combined with the word delimiter filter or the word delimiter
> graph filter. Doing so can easily result in "IllegalStateException:
> startOffset must be non-negative, and endOffset must be >= startOffset, and
> offsets must not go backwards" when closing the ElasticIndexer object. I
> tried many combinations, but none of them worked:
> * Tokenizer: Standard; filters: LowerCase, WordDelimiter, Synonym, PorterStem
> * Tokenizer: Standard; filters: LowerCase, WordDelimiterGraph, Synonym,
> PorterStem (replacing WordDelimiter with the graph variant)
> * Tokenizer: Standard; filters: LowerCase, Synonym, WordDelimiterGraph,
> PorterStem (reordering)
> * Tokenizer: None; filters: LowerCase, Synonym, WordDelimiterGraph,
> PorterStem (no tokenizer)
> * Filters: Synonym, WordDelimiterGraph (minimum)
> However, using _just_ the Synonym filter, or _just_ the WordDelimiterGraph
> filter, always worked in my tests.
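> For the record, an analyzer along these lines (synonym filter, no word delimiter) worked; the analyzer/filter names and synonym list are illustrative:

```json
{
  "analysis": {
    "filter": {
      "oak_synonyms": {
        "type": "synonym",
        "synonyms": ["car, automobile"]
      }
    },
    "analyzer": {
      "oak_analyzer": {
        "tokenizer": "standard",
        "filter": ["lowercase", "oak_synonyms"]
      }
    }
  }
}
```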
--
This message was sent by Atlassian Jira
(v8.20.10#820010)