Hi,
So, is it not advisable to use ClassicTokenizer and ClassicAnalyzer?

On Thu, Oct 19, 2017 at 8:29 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Have you looked at the specification to see how it's _supposed_ to work?
>
> From the javadocs:
> "implements Unicode text segmentation, as specified by UAX#29."
>
> See http://unicode.org/reports/tr29/#Word_Boundaries
>
> If you look at the spec and feel that ClassicAnalyzer incorrectly
> implements the word break rules, then perhaps there's a JIRA.
>
> Best,
> Erick
>
> On Thu, Oct 19, 2017 at 6:39 AM, Chitra <chithu.r...@gmail.com> wrote:
> > Hi,
> > I indexed the term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane), and it was indexed
> > as "er l n"; some characters were dropped during indexing.
> >
> > Here is my code:
> >
> > protected Analyzer.TokenStreamComponents createComponents(final String fieldName,
> >                                                           final Reader reader)
> > {
> >     final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
> >     src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >
> >     TokenStream tok = new ClassicFilter(src);
> >     tok = new LowerCaseFilter(getVersion(), tok);
> >     tok = new StopFilter(getVersion(), tok, stopwords);
> >     tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search
> >
> >     return new Analyzer.TokenStreamComponents(src, tok)
> >     {
> >         @Override
> >         protected void setReader(final Reader reader) throws IOException
> >         {
> >             src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >             super.setReader(reader);
> >         }
> >     };
> > }
> >
> > Am I missing anything? Is this expected behavior for my input, or is
> > there some reason behind such abnormal behavior?
> >
> > --
> > Regards,
> > Chitra

--
Regards,
Chitra
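For reference, a minimal sketch contrasting the two tokenizers on the term from this
thread. It assumes a Lucene 4.x-era API with the Version.LUCENE_47 constant (chosen to
match the getVersion()-style constructors in the quoted code); the class name
WordBreakCheck and the dump helper are illustrative names, not part of any Lucene API.

    import java.io.StringReader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class WordBreakCheck
    {
        // Print every token the tokenizer emits, for a side-by-side comparison.
        static void dump(String label, Tokenizer tok) throws Exception
        {
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            System.out.print(label + ": ");
            while (tok.incrementToken())
            {
                System.out.print("[" + term.toString() + "] ");
            }
            tok.end();
            tok.close();
            System.out.println();
        }

        public static void main(String[] args) throws Exception
        {
            final String input = "ⒶeŘꝋꝒɫⱯŋɇ";

            // ClassicTokenizer uses the old (pre-Lucene 3.1 StandardTokenizer)
            // grammar, which treats many of these letters as delimiters, so the
            // word is broken apart before ASCIIFoldingFilter ever sees it.
            dump("classic", new ClassicTokenizer(Version.LUCENE_47,
                    new StringReader(input)));

            // StandardTokenizer implements the UAX#29 word-break rules, so the
            // sequence should survive as a single token that downstream filters
            // (lowercasing, ASCII folding) can then normalize.
            dump("standard", new StandardTokenizer(Version.LUCENE_47,
                    new StringReader(input)));
        }
    }

If the StandardTokenizer output keeps the term whole, then swapping ClassicTokenizer and
ClassicFilter for their Standard counterparts in the createComponents() method above
should stop the characters from being dropped; ClassicAnalyzer is simply the pre-3.1
StandardAnalyzer and makes no UAX#29 guarantees.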