[
https://issues.apache.org/jira/browse/OAK-11568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Mueller resolved OAK-11568.
----------------------------------
Fix Version/s: 1.78.0
Resolution: Fixed
A first version is now committed. I think more work is needed, but we can open
a new Jira issue for that.
> Elastic: improved compatibility for analyzer definitions
> --------------------------------------------------------
>
> Key: OAK-11568
> URL: https://issues.apache.org/jira/browse/OAK-11568
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: elastic-search
> Reporter: Thomas Mueller
> Assignee: Thomas Mueller
> Priority: Major
> Fix For: 1.78.0
>
>
> Currently, analyzer definitions for Lucene indexes are not fully compatible
> with Elasticsearch. I guess we don't need 100% compatibility, but we should
> still improve it. I have a few cases that are not currently supported, and I
> hope supporting them is possible.
> *Missing Configuration*
> The Lucene Oak index is lenient and doesn't fail if configuration is missing.
> For Elasticsearch, we get "mapping requires either `mappings` or
> `mappings_path` to be configured" in this case. We should also be lenient.
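> That error comes from the Elasticsearch `mapping` character filter, which, unlike its Lucene counterpart, rejects a definition that has neither `mappings` nor `mappings_path`. A sketch of a lenient translation, supplying an empty list when the Oak definition has no mappings (the filter name `oak_mapping` is illustrative):

```json
{
  "analysis": {
    "char_filter": {
      "oak_mapping": {
        "type": "mapping",
        "mappings": []
      }
    }
  }
}
```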
> *NGram*
> We need to translate the configuration to avoid "Unknown tokenizer type
> [n_gram]".
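> Elasticsearch registers the tokenizer under the name `ngram`, so the Lucene-style factory name `NGram` (snake-cased to `n_gram`) has to be mapped to `ngram` when building the index settings. Roughly like this (the tokenizer name and parameter values are illustrative):

```json
{
  "analysis": {
    "tokenizer": {
      "oak_ngram": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 3
      }
    }
  }
}
```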
> *Hyphenation Compound Word*
> For Oak, we support custom XML files in the configuration itself. For
> Elasticsearch, the configuration file needs to already exist on the server.
> Because of that, we cannot support the equivalent of the Lucene
> configuration.
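> For reference, the Elasticsearch `hyphenation_decompounder` token filter only accepts a `hyphenation_patterns_path` that is resolved against the server's config directory, so the patterns file cannot be embedded in the index definition. A sketch (the path and word list are illustrative):

```json
{
  "analysis": {
    "filter": {
      "oak_hyph_decompound": {
        "type": "hyphenation_decompounder",
        "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
        "word_list": ["Kaffee", "tasse"]
      }
    }
  }
}
```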
> *Word Delimiter*
> I found that with recent versions of Elasticsearch, the synonym filter
> cannot be combined with the word delimiter filter or the word delimiter
> graph filter. Doing so can easily result in "IllegalStateException:
> startOffset must be non-negative, and endOffset must be >= startOffset, and
> offsets must not go backwards" when closing the ElasticIndexer object. I
> tried many combinations, but none of them worked:
> * Tokenizer: Standard; filters: LowerCase, WordDelimiter, Synonym, PorterStem
> * Tokenizer: Standard; filters: LowerCase, WordDelimiterGraph, Synonym,
> PorterStem (replacing WordDelimiter with the graph variant)
> * Tokenizer: Standard; filters: LowerCase, Synonym, WordDelimiterGraph,
> PorterStem (reordering)
> * Tokenizer: None; filters: LowerCase, Synonym, WordDelimiterGraph,
> PorterStem (no tokenizer)
> * Filters: Synonym, WordDelimiterGraph (minimum)
> However, using _just_ the Synonym filter, or _just_ the WordDelimiterGraph
> filter, always worked in my tests.
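> For the record, an analyzer along these lines (synonym filter, no word delimiter) worked; the analyzer/filter names and synonym list are illustrative:

```json
{
  "analysis": {
    "filter": {
      "oak_synonyms": {
        "type": "synonym",
        "synonyms": ["car, automobile"]
      }
    },
    "analyzer": {
      "oak_analyzer": {
        "tokenizer": "standard",
        "filter": ["lowercase", "oak_synonyms"]
      }
    }
  }
}
```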
--
This message was sent by Atlassian Jira
(v8.20.10#820010)