Hi,
     So, is it not advisable to use ClassicTokenizer and ClassicAnalyzer?
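
If ClassicTokenizer turns out to be the problem, here is the chain I would
try instead. This is an untested sketch: it assumes StandardTokenizer
(which, per the javadocs, does implement UAX#29) is the right replacement,
and that ASCIIFoldingFilter actually folds these particular characters.

protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
{
    // Assumes the same imports as my analyzer below, plus
    // org.apache.lucene.analysis.standard.StandardTokenizer,
    // StandardFilter and StandardAnalyzer.
    // StandardTokenizer follows the UAX#29 word-break rules;
    // ClassicTokenizer does not.
    final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
    src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);

    TokenStream tok = new StandardFilter(getVersion(), src);
    tok = new LowerCaseFilter(getVersion(), tok);
    tok = new StopFilter(getVersion(), tok, stopwords);
    tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search

    return new Analyzer.TokenStreamComponents(src, tok);
}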

On Thu, Oct 19, 2017 at 8:29 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Have you looked at the specification to see how it's _supposed_ to work?
>
> From the javadocs:
> "implements Unicode text segmentation, * as specified by UAX#29."
>
> See http://unicode.org/reports/tr29/#Word_Boundaries
>
> If you look at the spec and feel that ClassicAnalyzer incorrectly
> implements the word break rules then perhaps there's a JIRA.
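>
> A quick way to see what a tokenizer actually emits is to run it directly
> and print the terms. An untested sketch (Version.LUCENE_47 is just a
> placeholder for whatever version you're on; IOException handling omitted;
> swap in StandardTokenizer to compare the two):
>
> import java.io.StringReader;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.standard.ClassicTokenizer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.util.Version;
>
> Tokenizer tok = new ClassicTokenizer(Version.LUCENE_47,
>         new StringReader("ⒶeŘꝋꝒɫⱯŋɇ"));
> CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
> tok.reset();
> while (tok.incrementToken()) {
>     System.out.println(term); // each token the tokenizer kept
> }
> tok.end();
> tok.close();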
>
> Best,
> Erick
>
> On Thu, Oct 19, 2017 at 6:39 AM, Chitra <chithu.r...@gmail.com> wrote:
> > Hi,
> >               I indexed the term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane), and it was
> > indexed as "er l n"; some characters were dropped during indexing.
> >
> > Here is my code
> >
> > protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
> > {
> >     final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
> >     src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >
> >     TokenStream tok = new ClassicFilter(src);
> >     tok = new LowerCaseFilter(getVersion(), tok);
> >     tok = new StopFilter(getVersion(), tok, stopwords);
> >     tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search
> >
> >     return new Analyzer.TokenStreamComponents(src, tok)
> >     {
> >         @Override
> >         protected void setReader(final Reader reader) throws IOException
> >         {
> >             src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >             super.setReader(reader);
> >         }
> >     };
> > }
> >
> >
> > Am I missing anything? Is this the expected behavior for my input, or is
> > there some reason behind it?
> >
> >
> > --
> > Regards,
> > Chitra
>



-- 
Regards,
Chitra
