[ https://issues.apache.org/jira/browse/LUCENE-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038532#comment-17038532 ]
ASF subversion and git services commented on LUCENE-9220:
---------------------------------------------------------

Commit 0203815ab2bf72a77fe8f58daa0e6e269e00e9a8 in lucene-solr's branch refs/heads/master from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0203815 ]

LUCENE-9220: regenerate all stemmers/stopwords/test data from snowball 2.0 (#1262)

Previous situation:

* The snowball base classes (Among, SnowballProgram, etc) had accumulated local performance-related changes. There was a task that would also "patch" generated classes (e.g. GermanStemmer) after-the-fact.
* Snowball classes had many "non-changes" from the original, such as removal of tabs, addition of javadocs, license headers, etc.
* Snowball test data (inputs and expected stems) was incorporated into lucene testing, but this was maintained manually. Also, files had become large, making the test too slow (Nightly).
* Snowball stopwords lists from their website were manually maintained. In some cases encoding fixes were manually applied.
* Some generated stemmers (such as Estonian and Armenian) exist in lucene, but have no corresponding `.sbl` file in snowball sources at all.

Besides this mess, the snowball project is "moving along" and acquiring new languages, adding non-BSD-licensed test data, huge test data, and other complexity. So it is time to automate the integration better.

New situation:

* Lucene has a `gradle snowball` regeneration task. It works on Linux or Mac only. It checks out their repos, applies the `snowball.patch` in our repository, compiles snowball stemmers, regenerates all java code, and applies any adjustments so that our build is happy.
* Test data is automatically regenerated from the commit hash of the snowball test data repository. Not all languages are tested from their data: only where the license is simple BSD. Test data is also (deterministically) sampled, so that we don't have huge files. We just want to make sure our integration works.
* Randomized tests are still set to test every language with generated fake words. The regeneration task ensures all languages get tested (it writes a simple text file list of them).
* Stopword files are automatically regenerated from the commit hash of the snowball website repository.
* The regeneration procedure is idempotent. This way when stuff does change, you know exactly what happened. For example, if test data changes to a different license, you may see a git deletion. Or if a new language/stopwords/test data gets added, you will see git additions.

> Upgrade Snowball version to 2.0
> -------------------------------
>
>                 Key: LUCENE-9220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9220
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Nguyen Minh Gia Huy
>            Priority: Major
>         Attachments: snowball_53739a805cfa6c.patch,
> snowball_53739a805cfa6c.patch, snowball_53739a805cfa6c.patch
>
>          Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> When working with Snowball-based stemmers, I realized that Lucene is
> currently [using a pre-compiled version of
> Snowball|https://lucene.apache.org/core/8_4_1/analyzers-common/org/apache/lucene/analysis/snowball/package-summary.html],
> which seems to be from 12 years ago:
> https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b
> Snowball released v2.0 in 10/2019 with many improvements, newly
> supported languages (Arabic, Indonesian…) and new features (stringdef
> notation for Unicode codepoints…). Details of the changes can be found
> here: https://github.com/snowballstem/snowball/blob/master/NEWS. I think
> these changes to Snowball could have a promising positive impact on Lucene.
> I wonder when Lucene should upgrade Snowball to the latest version (v2.0).
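On the deterministic sampling and idempotent regeneration described in the commit message: the following is a minimal sketch of one way such sampling could work, not the code used by the `gradle snowball` task. The function name `sample_pairs`, the seed value, and the (input word, expected stem) pair layout are assumptions made for illustration only. A fixed seed means repeated runs pick the same entries, so regenerating twice produces no diff.

```python
import random


def sample_pairs(pairs, k, seed=0):
    """Pick k (word, stem) pairs reproducibly, preserving their original order.

    Hypothetical helper for illustration; not part of Lucene or Snowball.
    """
    pairs = list(pairs)
    if k >= len(pairs):
        # Nothing to trim: small data sets are kept whole.
        return pairs
    # A private generator with a fixed seed gives the same choice on every run,
    # independent of any other use of the global random state.
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(pairs)), k))
    return [pairs[i] for i in picked]


if __name__ == "__main__":
    data = [(f"word{i}", f"stem{i}") for i in range(1000)]
    first = sample_pairs(data, 50, seed=42)
    second = sample_pairs(data, 50, seed=42)
    # Same input and seed give the same sample, so a rerun changes nothing.
    print(first == second, len(first))
```

Sampling indices and then sorting them keeps each input word next to its expected stem and keeps the output in source order, which makes any future diff easy to read when the upstream data really does change.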