If you are using Solr, you can configure your analysis chain to use the ICUFoldingFilterFactory (https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory) and then view the resulting tokens on the Analysis screen of the Solr admin UI.
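If you only want a quick feel for what this kind of folding does to polytonic Greek before wiring up Solr or Lucene, here is a rough JDK-only sketch. It is not ICUFoldingFilter and will not match it in every case. It assumes that decomposing to NFD, dropping combining marks, lowercasing, and mapping final sigma to plain sigma is close enough for a first look. The class name GreekFoldSketch and the method fold are made up for illustration, and the Greek strings are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: a JDK-only approximation of accent/case folding.
// This is NOT the ICU folding algorithm, only a sketch of the idea.
final class GreekFoldSketch {

    static String fold(String text) {
        // NFD splits precomposed letters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop every combining mark (accents, breathings, diaeresis, iota subscript).
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase, then map final sigma (U+03C2) to plain sigma (U+03C3).
        return stripped.toLowerCase(Locale.ROOT).replace('\u03C2', '\u03C3');
    }

    public static void main(String[] args) {
        // "Organon" with smooth breathing and acute accent on the capital omicron (U+1F4C).
        String accented = "\u1F4C\u03C1\u03B3\u03B1\u03BD\u03BF\u03BD";
        // The same word typed without accents, as in the query from the thread.
        String plain = "\u03BF\u03C1\u03B3\u03B1\u03BD\u03BF\u03BD";
        System.out.println(fold(accented).equals(fold(plain)));
    }
}
```

With this sketch, Ὄργανον and οργανον fold to the same string, so main prints true. For real indexing you would still want the ICU filter on both the index and query side, since it covers far more than this sketch does.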
If you are in pure Lucene (circa version 4.8; some modifications will be required depending on your version):

1) Extend Analyzer:

    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        // version: the Version constant matching your Lucene release
        Tokenizer stream = new StandardTokenizer(version, reader);
        // wrap the tokenizer so every token is ICU-folded
        TokenFilter icu = new ICUFoldingFilter(stream);
        return new TokenStreamComponents(stream, icu);
    }

2) Then iterate through the tokens:

    TokenStream stream = analyzer.tokenStream("", new StringReader(text));
    stream.reset();
    CharTermAttribute cattr = stream.getAttribute(CharTermAttribute.class);
    while (stream.incrementToken()) {
        String token = cattr.toString();
        ...

-----Original Message-----
From: paolo anghileri [mailto:paolo.anghil...@codegeneration.it]
Sent: Saturday, November 22, 2014 11:41 AM
To: Allison, Timothy B.
Subject: Re: Lucene ancient greek normalization

Sorry Timothy for the beginner question, how did you manage to run this test?

Many thanks

Paolo

On 21/11/2014 21:14, Allison, Timothy B. wrote:
> ICU looks promising:
>
> Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος ->
>
> 1. μηνιν
> 2. αειδε
> 3. θεα
> 4. πηληιαδεω
> 5. αχιλληοσ
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, November 21, 2014 3:08 PM
> To: d...@lucene.apache.org
> Subject: Re: Lucene ancient greek normalization
>
> Are you sure that's not something that's already addressed by the ICU
> Filter?
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/icu/ICUTransformFilterFactory.html
>
> If you follow the links to what's possible, the page talks about
> Greek, though not ancient:
> http://userguide.icu-project.org/transforms/general#TOC-Greek
>
> There was also some discussion on:
> https://issues.apache.org/jira/browse/LUCENE-1343
>
> Regards,
> Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 21 November 2014 14:14, paolo anghileri
> <paolo.anghil...@codegeneration.it> wrote:
>> For development purposes I need the ability in Lucene to normalize ancient
>> Greek characters for all the cases of grammatical details such as accents,
>> diacritics and so on.
>>
>> My need is to retrieve ancient Greek words with accents and other
>> grammatical details by the input of the string without accents.
>>
>> For example, the input of οργανον (organon) should also retrieve Ὄργανον.
>>
>>
>> I am not a Lucene committer and I am new to this, so my question is about the
>> best practice to implement this in Lucene, and possibly submit a commit
>> proposal to the Lucene project management committee.
>>
>> I have made some searches and found this file in Lucene-solr:
>>
>>
>> It contains normalization for some chars.
>> My thought would be to add extra normalization here, including all Unicode
>> ancient Greek chars with all grammatical details.
>> I already have all the Unicode values for those chars, so it should not be
>> difficult for me to include them.
>>
>> If my understanding is correct, this should add to Lucene the features
>> described above.
>>
>>
>> As I am new to this, my needs are:
>>
>> To be sure that this is the correct place in Lucene for doing normalization
>> How to post a commit proposal
>>
>>
>> Any help appreciated
>>
>> Kind regards
>>
>> Paolo
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>