RE: German Compound Splitter words.fst causing problems.

Christopher Morley Thu, 26 Mar 2015 04:54:13 -0700

Thanks for the tip, Markus.  We are using this filter to decompound German 
words.  Update: I am on the path to victory.  The words.fst file is actually 
built by the plugin; however, there is a basic input/output file format 
mismatch (at the byte level) that doesn't occur with Lucene 4.0.  As soon as 
you try to use Lucene core 4.1 with this particular plugin, it breaks with the 
same error I was getting.  The FST code in Lucene says clearly that there is no 
guaranteed backward compatibility, so there you have it.  I'm probably going to 
need to incorporate some older code from Lucene and/or figure out how to make 
the plugin work with the new Lucene code.
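
For anyone who wants to check which format version a given words.fst carries, 
here is a rough sketch that reads only the file header using plain JDK classes. 
It assumes the header layout that, as far as I know, Lucene's 
CodecUtil.writeHeader produces: a big-endian magic int (0x3fd76c17), the codec 
name as a vInt-prefixed UTF-8 string (expected to be "FST" here), and a 
big-endian version int. The class name FstHeaderPeek and its methods are made 
up for illustration and are not part of Lucene or the plugin.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Lucene or the compound-splitter plugin.
public class FstHeaderPeek {
    // Assumption: the magic number Lucene's CodecUtil writes at the start of a header.
    static final int CODEC_MAGIC = 0x3fd76c17;

    /** Returns "codecName:version", or throws if the stream does not start with the magic. */
    static String describeHeader(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int magic = in.readInt(); // big-endian, matching the assumed header layout
        if (magic != CODEC_MAGIC) {
            throw new IOException("no codec header, first int was 0x"
                    + Integer.toHexString(magic));
        }
        byte[] name = new byte[readVInt(in)];
        in.readFully(name);
        int version = in.readInt();
        return new String(name, StandardCharsets.UTF_8) + ":" + version;
    }

    /** Reads a variable-length int: 7 payload bits per byte, high bit set means "more". */
    static int readVInt(DataInputStream in) throws IOException {
        int value = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.readUnsignedByte();
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IOException("malformed vInt");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FstHeaderPeek <path-to-fst-file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(describeHeader(in));
        }
    }
}
```

If the layout assumption holds, running this against the words.fst on the 
classpath should show whether its version falls inside the range that the 
Lucene jar in use accepts (the error quoted further down in this thread reports 
version 0 against an accepted range of 3 to 4).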


-Chris.

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, March 25, 2015 6:15 PM
To: solr-user@lucene.apache.org
Subject: RE: German Compound Splitter words.fst causing problems.

Hello Chris - I don't know the token filter you mention, but I would like to 
recommend Lucene's HyphenationCompoundWordTokenFilter. It works reasonably well 
if you provide the hyphenation rules and a dictionary. It has some flaws, such 
as decompounding into irrelevant subwords, overlapping subwords, or subwords 
that do not add up to the whole compound word (minus genitives), but these can 
be fixed.

Markus
 
-----Original message-----
> From:Chris Morley <ch...@depahelix.com>
> Sent: Wednesday 25th March 2015 17:59
> To: solr-user@lucene.apache.org
> Subject: German Compound Splitter words.fst causing problems.
> 
> Hello, Chris Morley here, of Wayfair.com. I am working on the German 
> compound-splitter by Dawid Weiss. 
>   
>   I tried to "upgrade" the words.fst file that comes with the German 
> compound-splitter using Solr 3.5, but it doesn't work. Below is the 
> IndexNotFoundException that I get.
>   
>  cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp lucene/build/lucene-core-3.5-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader wordsFst
>  Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
>                  at org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
>                  at org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)
>   
>  The reason I'm attempting this at all is the answer here, 
> http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7,
>  which says to do the upgrade in a two-step process: first using Solr 3.5, 
> and then the latest Solr version (4.10.3).  When I try this and run the unit 
> tests for my modified German compound-splitter, I get this same type of 
> error.  The thing is, this is an FST, not an index, which is a little 
> confusing.  The reason I'm following this answer anyway is that I get that 
> exact same message when building the (modified) project with Maven, at the 
> point where it tries to load words.fst. See below.
>   
>  [main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter 
> - Format version is not supported (resource: 
> com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0 
> (needs to be between 3 and 4). This version of Lucene only supports indexes 
> created with release 3.0 and later.  Failed to initialize static data 
> structures for German compound splitter.
>   
>  Thanks,
>  -Chris.
> 
> 
>
