RE: Lucene ancient greek normalization

Allison, Timothy B. Mon, 24 Nov 2014 04:26:07 -0800

If you are using Solr, you can configure your analysis chain to use the 
ICUFoldingFilterFactory 
(https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory)
and then view the results on the Analysis screen of the Solr admin UI.


If you are in pure Lucene (circa version 4.8; some modifications will be 
required depending on your version):

1) Extend Analyzer:

        @Override
        protected TokenStreamComponents createComponents(String field, Reader reader) {
                // "version" is the Lucene Version constant you are targeting
                Tokenizer stream = new StandardTokenizer(version, reader);
                // ICU folding strips accents/diacritics and case-folds each token
                TokenFilter icu = new ICUFoldingFilter(stream);
                return new TokenStreamComponents(stream, icu);
        }

2) Then iterate through the tokens:

        TokenStream stream = analyzer.tokenStream("", new StringReader(text));
        stream.reset();
        CharTermAttribute cattr = stream.getAttribute(CharTermAttribute.class);
        while (stream.incrementToken()) {
                String token = cattr.toString();
...
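
If you only want a quick feel for the kind of folding involved without setting up Lucene, here is a rough sketch using nothing but the JDK. To be clear, this is NOT what ICUFoldingFilter does internally; it is an approximation (NFD decomposition, drop combining marks, lowercase, final sigma to medial sigma) that happens to give the same result for simple polytonic Greek words. The class name GreekFoldSketch and the method fold are made up for illustration.

```java
import java.text.Normalizer;
import java.util.Locale;

public class GreekFoldSketch {

    // Hypothetical helper for illustration only; not part of Lucene or ICU.
    static String fold(String input) {
        // NFD splits each precomposed letter into a base letter
        // followed by its combining marks (accents, breathings, ...).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks; remove them all.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase, then map final sigma (U+03C2) to medial sigma (U+03C3)
        // so that both forms compare equal.
        return stripped.toLowerCase(Locale.ROOT).replace('\u03C2', '\u03C3');
    }

    public static void main(String[] args) {
        // Print the folded form of every word given on the command line.
        for (String word : args) {
            System.out.println(fold(word));
        }
    }
}
```

With this sketch, both Ὄργανον and οργανον fold to οργανον, so an unaccented query would match the accented form. For real indexing and searching, the ICU filter is the better choice, since it covers many more cases than this.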
-----Original Message-----
From: paolo anghileri [mailto:paolo.anghil...@codegeneration.it] 
Sent: Saturday, November 22, 2014 11:41 AM
To: Allison, Timothy B.
Subject: Re: Lucene ancient greek normalization

Sorry Timothy for the beginner question, how did you manage to run this 
test?

Many thanks

Paolo

On 21/11/2014 21:14, Allison, Timothy B. wrote:
> ICU looks promising:
>
> Μῆνιν ἄειδε, θεὰ, Πηληϊάδεω Ἀχιλλῆος ->
>
> 1.μηνιν
> 2.αειδε
> 3.θεα
> 4.πηληιαδεω
> 5.αχιλληοσ
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Friday, November 21, 2014 3:08 PM
> To: d...@lucene.apache.org
> Subject: Re: Lucene ancient greek normalization
>
> Are you sure that's not something that's already addressed by the ICU
> Filter? 
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/icu/ICUTransformFilterFactory.html
>
> If you follow the links to what's possible, the page talks about
> Greek, though not ancient:
> http://userguide.icu-project.org/transforms/general#TOC-Greek
>
> There was also some discussion on:
> https://issues.apache.org/jira/browse/LUCENE-1343
>
> Regards,
>     Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On 21 November 2014 14:14, paolo anghileri
> <paolo.anghil...@codegeneration.it> wrote:
>> For development purposes I need the ability in Lucene to normalize ancient
>> Greek characters for all the cases of grammatical details such as accents,
>> diacritics and so on.
>>
>> My need is to retrieve ancient greek words with accents and other
>> grammatical details by the input of the string without accents.
>>
>> For example, the input of οργανον (organon) should also retrieve Ὄργανον.
>>
>>
>> I am not a Lucene committer and I am new to this, so my question is about the
>> best practice to implement this in Lucene, and possibly submit a commit
>> proposal to the Lucene project management committee.
>>
>> I have made some searches and found this file in lucene-solr:
>>
>>
>> It contains normalization for some chars.
>> My thought would be to add extra normalization here, including all Unicode
>> ancient Greek chars with all grammatical details.
>> I already have all the Unicode values for those chars, so it should not be
>> difficult for me to include them.
>>
>> If my understanding is correct, this should add to lucene the features
>> described above.
>>
>>
>> As I am new to this, my needs are:
>>
>>   - To be sure that this is the correct place in Lucene for doing normalization
>>   - How to post a commit proposal
>>
>>
>> Any help appreciated
>>
>> Kind regards
>>
>> Paolo
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
