[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

Kazuaki Hiraga (Jira) Fri, 17 Jan 2020 09:58:11 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018159#comment-17018159
 ]


Kazuaki Hiraga edited comment on LUCENE-9123 at 1/17/20 5:56 PM:
-----------------------------------------------------------------

{quote}
This solution would fix Kuromoji to create a simple chain of tokens, all with 
position increment 1 (no overlapping compound tokens)?
{quote}
Yes. Although I may need to test more documents to ensure that the fix will 
produce a simple chain of tokens, it seems to be working fine so far.

{quote}
Would you only use that mode when parsing the synonyms to build the synonym 
filter (or synonym graph filter)? (Since that seems to be where the error is 
occurring here). Or would you also use that as your primary Tokenizer (which 
would mean you don't also get compound words directly out of Kuromoji). 
{quote}

In my case, I use this mode as my primary Tokenizer configuration since I 
usually want to have decompounded tokens.

It would be nice if the synonym filter and the synonym graph filter could work 
with this mode without the patch. However, I don't think there are many 
situations in which we need the original tokens along with the decompounded 
ones (though I cannot say we will never need them). The current workaround for 
this issue is to use normal mode, which does not produce decompounded tokens. 
But then, for example, we cannot retrieve a document that contains 株式会社 with 
the query 会社, because normal mode keeps 株式会社 as a single token instead of 
decompounding it into the two tokens 株式 and 会社 (in this case, we could use an 
n-gram field in addition to the tokenized field to retrieve the document, but 
that has other issues).
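
To make the recall problem concrete, here is a minimal sketch in plain Java. It does not call Lucene at all: the token lists are written by hand to mirror the behaviour described above, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: the token lists are hand-written assumptions that mirror
// the behaviour described in this comment; no Lucene classes are involved.
class DecompoundRecallSketch {

    // Terms assumed to be indexed for the text 株式会社 in each mode.
    static final List<String> NORMAL_MODE = Arrays.asList("株式会社");
    static final List<String> SEARCH_MODE = Arrays.asList("株式", "株式会社", "会社");

    // A single-term query matches only if that exact term was indexed.
    static boolean matches(List<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        System.out.println("normal mode matches 会社: " + matches(NORMAL_MODE, "会社"));
        System.out.println("search mode matches 会社: " + matches(SEARCH_MODE, "会社"));
    }
}
```

In this model the query 会社 misses the normal-mode document and hits the search-mode one, which is why I would rather not fall back to normal mode.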

Therefore, there are two issues. #1 Kuromoji produces both compound and 
decompounded tokens in search mode and extended mode, even though the compound 
token is rarely needed. #2 Neither the synonym filter nor the synonym graph 
filter can work with tokens that overlap in position. 
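
The two issues can be illustrated with another small sketch in plain Java. Everything here is an assumption for illustration: the `Token` class, the hand-written token list, and both helper methods are hypothetical and are not Lucene's implementation. The position increments follow the error message reported in this ticket, where 株式会社 arrives with increment 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: names and token values are assumptions, not Lucene code.
class PositionIncrementSketch {

    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Assumed search-mode output for 株式会社: the compound token is stacked
    // on the first part (increment 0) instead of advancing the position.
    static final List<Token> SEARCH_MODE_TOKENS = Arrays.asList(
            new Token("株式", 1),
            new Token("株式会社", 0),
            new Token("会社", 1));

    // Issue #2: a check that accepts only a simple chain of tokens.
    static void requireSimpleChain(String text, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                throw new IllegalArgumentException("term: " + text
                        + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: "
                        + t.positionIncrement + ")");
            }
        }
    }

    // Issue #1: drop the stacked tokens so that only the chain remains.
    static List<Token> dropStackedTokens(List<Token> tokens) {
        List<Token> chain = new ArrayList<>();
        for (Token t : tokens) {
            if (t.positionIncrement >= 1) {
                chain.add(t);
            }
        }
        return chain;
    }

    public static void main(String[] args) {
        // Passes: only 株式 and 会社 remain, each with increment 1.
        requireSimpleChain("株式会社", dropStackedTokens(SEARCH_MODE_TOKENS));
        // Throws IllegalArgumentException because of the stacked compound.
        requireSimpleChain("株式会社", SEARCH_MODE_TOKENS);
    }
}
```

Addressing #1 makes the check pass without touching the filters, while addressing #2 would mean relaxing the check itself.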

[~mikemccand], I will try to find the ticket for #2. If there isn't one, I will 
create one. I will also change the title of this ticket to focus on #1.


> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as output. With `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens, which maximize the chance of increasing 
> recall.
> Snippet of schema:
> {code:xml}
>     <fieldType name="text_custom_ja" class="solr.TextField" 
> positionIncrementGap="100" autoGeneratePhraseQueries="false">
>       <analyzer>
>         <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>         <filter class="solr.SynonymGraphFilterFactory"
>                     synonyms="lang/synonyms_ja.txt"
>                     tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>         <filter class="solr.JapaneseBaseFormFilterFactory"/>
>         <!-- Removes tokens with certain part-of-speech tags -->
>         <filter class="solr.JapanesePartOfSpeechStopFilterFactory" 
> tags="lang/stoptags_ja.txt" />
>         <!-- Normalizes full-width romaji to half-width and half-width kana 
> to full-width (Unicode NFKC subset) -->
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <!-- Removes common tokens typically not useful for search, but have 
> a negative effect on ranking -->
>         <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="lang/stopwords_ja.txt" /> -->
>         <!-- Normalizes common katakana spelling variations by removing any 
> last long sound character (U+30FC) -->
>         <filter class="solr.JapaneseKatakanaStemFilterFactory" 
> minimumLength="4"/>
>         <!-- Lower-cases romaji characters -->
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

