I found that WordDelimiterFilterFactory has a parameter called
splitOnNumerics, which controls the same kind of splitting that
HMMChineseTokenizer does.

   - splitOnNumerics="1" causes alphabet => number transitions to generate
     a new part [Solr 1.3]:
       - "j2se" => "j" "2" "se"
       - default is true ("1"); set to 0 to turn off
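
If I read that correctly, then with this filter alone (assuming nothing
else in the chain splits the token), the behaviour on one of my codes
would be:

   splitOnNumerics="1":  "1a2b3c4d" => "1" "a" "2" "b" "3" "c" "4" "d"
   splitOnNumerics="0":  "1a2b3c4d" => "1a2b3c4d"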

I suspect that HMMChineseTokenizer has a similar built-in parameter that
splits up the numeric characters, so I guess I have to check
HMMChineseTokenizerFactory to see if it is possible to disable this
parameter, since I believe there is no way to merge the tokens back after
they have been split up?
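
For reference, this is roughly how I understand the setting would look in
the schema (just a sketch; the fieldType name and the other attribute
values here are placeholders, not taken from my actual config):

   <fieldType name="text_code" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <!-- splitOnNumerics="0" stops this filter from splitting on
            letter/digit transitions, so "j2se" stays "j2se" -->
       <filter class="solr.WordDelimiterFilterFactory"
               splitOnNumerics="0"
               generateWordParts="1"
               generateNumberParts="1"/>
     </analyzer>
   </fieldType>

As far as I can tell, though, this only controls what the filter itself
does; it would not re-join tokens that the tokenizer in front of it has
already split apart.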

Regards,
Edwin



On 17 March 2016 at 11:11, Erick Erickson <erickerick...@gmail.com> wrote:

> Yes, there is one and only one tokenizer allowed.
>
> Best,
> Erick
>
> On Wed, Mar 16, 2016 at 7:51 PM, Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> > Thanks Shawn for your reply.
> >
> > Yes, I'm looking to see if we can implement a combination of tokenizers
> > and filters.
> >
> > However, from what I have tried before, we can only implement one
> > tokenizer for each fieldType. So is it true that I can only stick to one
> > tokenizer, and the rest of the implementation has to be done either by
> > filters or by customising the tokenizer to possibly achieve what I want?
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 March 2016 at 09:34, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> >> On 3/16/2016 4:33 AM, Zheng Lin Edwin Yeo wrote:
> >> > I found that HMMChineseTokenizer will split a string that consists of
> >> > numbers and characters (alphanumeric). For example, if I have a code
> >> > that looks like "1a2b3c4d", it will be split into 1 | a | 2 | b | 3 |
> >> > c | 4 | d. This has caused the search query speed to slow down quite
> >> > tremendously (at least 10 seconds slower), as it has to search through
> >> > individual tokens.
> >> >
> >> > Would like to check, is there any way that we can solve this issue
> >> > without re-indexing? We have quite a lot of codes in the index that
> >> > consist of alphanumeric characters, and we have more than 10 million
> >> > documents in the index, so re-indexing with another tokenizer or
> >> > pipeline is quite a huge process.
> >>
> >> ANY change you make to index analysis will require reindexing.
> >>
> >> I have no idea what the advantages and disadvantages are in the various
> >> tokenizers and filters for Asian characters.  There may be a combination
> >> of tokenizer and filters that will do what you want.
> >>
> >> We do have an index for a company in Japan.  I'm using ICUTokenizer with
> >> some of the CJK filters, and in some cases I'm using
> >> ICUFoldingFilterFactory for lowercasing and normalization.  The jars
> >> required for ICU analysis components can be found in the contrib folder
> >> in the Solr download.
> >>
> >> There are ways to create a whole new index and then move it into place
> >> to replace your existing index.  For SolrCloud mode, you would use the
> >> collection alias feature.  For standalone Solr, you can swap cores.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>
