[ 
https://issues.apache.org/jira/browse/LUCENE-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038532#comment-17038532
 ] 

ASF subversion and git services commented on LUCENE-9220:
---------------------------------------------------------

Commit 0203815ab2bf72a77fe8f58daa0e6e269e00e9a8 in lucene-solr's branch 
refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0203815 ]

LUCENE-9220: regenerate all stemmers/stopwords/test data from snowball 2.0 
(#1262)

Previous situation:

* The snowball base classes (Among, SnowballProgram, etc) had accumulated local 
performance-related changes. There was a task that would also "patch" generated 
classes (e.g. GermanStemmer) after-the-fact.
* Snowball classes had many "non-changes" from the original, such as removal of 
tabs, addition of javadocs, license headers, etc.
* Snowball test data (inputs and expected stems) was incorporated into lucene 
testing, but this was maintained manually. The files had also become large, 
making the test too slow (Nightly).
* Snowball stopwords lists from their website were manually maintained. In some 
cases encoding fixes were manually applied.
* Some generated stemmers (such as Estonian and Armenian) exist in lucene, but 
have no corresponding `.sbl` file in snowball sources at all.

Besides this mess, the snowball project is "moving along" and acquiring new 
languages, adding non-BSD-licensed test data, huge test data, and other 
complexity. So it is time to automate the integration better.

New situation:

* Lucene has a `gradle snowball` regeneration task. It works on Linux or Mac 
only. It checks out the snowball repos, applies the `snowball.patch` in our 
repository, compiles the snowball stemmers, regenerates all java code, and 
applies any adjustments so that our build is happy.
* Test data is automatically regenerated from the commit hash of the snowball 
test data repository. Not all languages are tested from their data: only where 
the license is simple BSD. Test data is also (deterministically) sampled, so 
that we don't have huge files. We just want to make sure our integration works.
* Randomized tests are still set to test every language with generated fake 
words. The regeneration task ensures all languages get tested (it writes a 
simple text file list of them).
* Stopword files are automatically regenerated from the commit hash of the 
snowball website repository.
* The regeneration procedure is idempotent. This way, when something does 
change, you know exactly what happened. For example, if test data changes to a 
different license, you may see a git deletion. Or if a new 
language/stopwords/test data gets added, you will see git additions.
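
To illustrate the deterministic sampling mentioned above, here is a minimal 
sketch in plain Java. It is not the code from this commit: the class name 
`DeterministicSample`, the `sample` method, the seed value, and the 
tab-separated "input, expected stem" line format are all assumptions made for 
the example. The idea is only that a fixed seed yields the same subset on every 
run, so regenerating the data does not produce spurious diffs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Lucene or snowball.
class DeterministicSample {

  /**
   * Returns up to n items chosen by shuffling the indices with a fixed seed.
   * The chosen items are returned in their original order, so the output is
   * stable and easy to diff between runs.
   */
  public static <T> List<T> sample(List<T> items, int n, long seed) {
    List<Integer> indices = new ArrayList<>();
    for (int i = 0; i < items.size(); i++) {
      indices.add(i);
    }
    // Same seed => same permutation => same sample on every run.
    Collections.shuffle(indices, new Random(seed));

    List<Integer> chosen =
        new ArrayList<>(indices.subList(0, Math.min(n, indices.size())));
    Collections.sort(chosen); // restore original ordering of the kept items

    List<T> out = new ArrayList<>();
    for (int idx : chosen) {
      out.add(items.get(idx));
    }
    return out;
  }

  public static void main(String[] args) {
    // Assumed format: one "input<TAB>expected stem" pair per line.
    List<String> pairs = Arrays.asList(
        "running\trun", "cats\tcat", "happily\thappili",
        "jumped\tjump", "houses\thous");
    long seed = 42L; // arbitrary fixed seed, chosen for the example
    for (String line : sample(pairs, 3, seed)) {
      System.out.println(line);
    }
  }
}
```

Calling `sample` twice with the same list, size, and seed returns the same 
lines, which is the property that keeps a sampled test file unchanged across 
regenerations unless the upstream data itself changes.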

> Upgrade Snowball version to 2.0
> -------------------------------
>
>                 Key: LUCENE-9220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9220
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Nguyen Minh Gia Huy
>            Priority: Major
>         Attachments: snowball_53739a805cfa6c.patch, 
> snowball_53739a805cfa6c.patch, snowball_53739a805cfa6c.patch
>
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> When working with Snowball-based stemmers, I realized that Lucene is 
> currently [using a pre-compiled version of 
> Snowball|https://lucene.apache.org/core/8_4_1/analyzers-common/org/apache/lucene/analysis/snowball/package-summary.html],
>  that seems to be from 12 years ago: 
> https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b
> Snowball released v2.0 in 10/2019 with many improvements, new 
> supported languages (Arabic, Indonesian…) and new features (stringdef 
> notation for Unicode codepoints…). Details of the changes can be found 
> here: https://github.com/snowballstem/snowball/blob/master/NEWS. I think 
> these changes to Snowball could have a promising positive impact on Lucene.
> I wonder when Lucene should upgrade Snowball to the latest version (v2.0).


