Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

Emir Arnautovic Thu, 22 Oct 2015 03:21:26 -0700

Hi Scott,

Using PatternReplaceCharFilter is not the same as replacing the raw data (replacing the raw data is not a proper solution, since it does not solve the issue when someone searches with the "other" character). It is part of token standardization, no different from lowercasing. It is the standard approach for Latin characters as well:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
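
A minimal, untested sketch of what this could look like for the 台/臺 case. The field type name "text_cjk_norm" is made up for illustration, and the analyzer chain uses StandardTokenizer plus CJKBigramFilter for the 2-grams rather than the CJKTokenizer mentioned in this thread, so treat both as assumptions to adapt:

```xml
<!-- Hypothetical field type; the name is an assumption, not from the thread. -->
<fieldType name="text_cjk_norm" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> element is used for both indexing and querying. -->
  <analyzer>
    <!-- Rewrite C2 (臺) to C1 (台) in the character stream, before tokenization. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="臺" replacement="台"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Form 2-grams from the already normalized CJK characters. -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the replacement happens before the bigrams are formed, "C1Cm" and "C2Cm" should end up as the same indexed term, and the stored raw data is left untouched.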

A quick search for "MappingCharFilterFactory chinese" shows that it is used this way. You should check whether it suits your case.
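
For the Chinese case, the mapping-based variant might look roughly like the fragment below. The file name "mapping-zh-variants.txt" is hypothetical, and the single rule shown in the comment is only the 台/臺 example from this thread:

```xml
<!-- Assumed mapping file in the core's conf directory (hypothetical name),
     saved as UTF-8, one rule per line, for example:
       "臺" => "台"
-->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-zh-variants.txt"/>
```

The charFilter has to be present in both the index-time and query-time analyzers so that either character matches.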


Thanks,
Emir

On 22.10.2015 11:48, Scott Chu wrote:

Hi solr-user,

Yes, I thought about replacing C1 with C2 in the underlying raw data. However, it's a huge data set (over 10M news articles), so I gave up on that strategy earlier. My current temporary solution is to go back to a 1-gram tokenizer (i.e. StandardTokenizer) so that I only have to set one rule. But it is kind of ugly, especially when highlighting is applied: e.g. when searching for "C1C2", Solr returns a highlight snippet such as "...<em>C1</em><em>C2</em>...".

Scott Chu, scott....@udngroup.com
2015/10/22

    ----- Original Message -----
    *From: *Emir Arnautovic <mailto:emir.arnauto...@sematext.com>
    *To: *solr-user <mailto:solr-user@lucene.apache.org>
    *Date: *2015-10-22, 17:08:26
    *Subject: *Re: Is it possible to specify only one-character term
    synonym for 2-gram tokenizer?

    Hi Scott,
    I don't have experience with Chinese, but SynonymFilter works on
    tokens, so if CJKTokenizer recognizes C1 and Cm as tokens, it should
    work. If not, then you can try configuring PatternReplaceCharFilter
    to replace C1 with C2 during both indexing and searching to get a
    match.

    Thanks,
    Emir

    On 22.10.2015 10:53, Scott Chu wrote:
    > Hi solr-user,
    > I always use CJKTokenizer on a fair amount of Chinese news
    > articles. Say that in Chinese, character C1 has the same meaning as
    > character C2 (e.g. 台 = 臺). Is it possible to add only this line to
    > synonym.txt:
    > C1,C2 (in a real example: 台, 臺)
    > so that, by applying CJKTokenizer and SynonymFilter, I only have to
    > query "C1Cm..." (where Cm is an arbitrary Chinese character) and Solr
    > will return documents that match either "C1Cm" or "C2Cm"?
    > Scott Chu, scott....@udngroup.com
    > 2015/10/22
    >

    --
    Monitoring * Alerting * Anomaly Detection * Centralized Log Management
    Solr & Elasticsearch Support * http://sematext.com/





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
