Re: Lexical analysis tools for German language data

2012-04-12 Thread Tomas Zerolo
On Thu, Apr 12, 2012 at 03:46:56PM +, Michael Ludwig wrote: > > From: Walter Underwood > > > German noun decompounding is a little more complicated than it might > > seem. > > > > There can be transformations or inflections, like the "s" in > > "Weihnachtsbaum" (Weihnachten/Baum). > > I remembe

Re: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the "s" in "Weihnachtsbaum" (Weihnachten/Baum). Internal nouns should be recapitalized, like "Baum" above. Some compounds probably should not be decompounded, like "Fahrrad
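A minimal sketch of the three points above (linking "s", recapitalizing the parts, and an exception list), written in Python for illustration only. The lexicon, the exception set, and the function name `decompound` are assumptions made for this example; none of them come from Lucene, Solr, or the original message, and a real decompounder would need a far larger dictionary and more linking rules.

```python
# Toy lexicon and exception list; both are illustrative assumptions.
LEXICON = {"weihnachten", "baum", "wind", "jacke", "kinder"}
DO_NOT_SPLIT = {"fahrrad"}


def decompound(word, lexicon=LEXICON, keep_whole=DO_NOT_SPLIT):
    """Split a two-part German compound, or return it unchanged in a list."""
    lower = word.lower()
    if lower in keep_whole:
        return [word]
    for i in range(2, len(lower) - 1):
        head, tail = lower[:i], lower[i:]
        if tail not in lexicon:
            continue
        # Try the head as written, then without a linking "s",
        # then with "en" restored (Weihnachts -> Weihnachten).
        candidates = [head]
        if head.endswith("s"):
            candidates += [head[:-1], head[:-1] + "en"]
        for candidate in candidates:
            if candidate in lexicon:
                # Recapitalize both parts, since German nouns are capitalized.
                return [candidate.capitalize(), tail.capitalize()]
    return [word]
```

For example, `decompound("Weihnachtsbaum")` yields the parts "Weihnachten" and "Baum", while `decompound("Fahrrad")` is returned whole because it is on the exception list.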

Re: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma
Hi, We've done a lot of tests with the HyphenationCompoundWordTokenFilter, using a FOP XML file generated from TeX for the Dutch language, and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the dictionary for the Hunspell

Re: Lexical analysis tools for German language data

2012-04-12 Thread Bernd Fehling
Paul, nearly two years ago I requested an evaluation license and tested BASIS Tech Rosette for Lucene & Solr. It worked excellently, but the price was much too high. Yes, they also have compound analysis for several languages including German. Just configure your pipeline in Solr and set up the pr

Re: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht
Bernd, can you please say a little more? I think it is OK for this list to contain some description of commercial solutions that satisfy a request formulated on the list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes,

Re: Lexical analysis tools for German language data

2012-04-12 Thread Valeriy Felberg
If you want the query "jacke" to match a document containing the word "windjacke" or "kinderjacke", you could use a custom update processor. This processor could search the indexed text for words matching the pattern ".*jacke" and inject the word "jacke" into an additional field which you can searc
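The idea can be sketched outside Solr as follows. This is plain Python, not a Solr update processor, and the names `HEAD_WORDS`, `find_heads`, `process_document`, and the field names `text` and `text_heads` are all hypothetical, chosen only to illustrate the pattern-and-inject step described above.

```python
import re

# Hypothetical list of head words to look for; not from the original post.
HEAD_WORDS = ["jacke"]


def find_heads(text, head_words=HEAD_WORDS):
    """Return each head word that ends at least one token of the text."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    for head in head_words:
        # Same shape as the ".*jacke" pattern mentioned in the message.
        pattern = re.compile(r".*" + re.escape(head))
        if any(pattern.fullmatch(token) for token in tokens):
            found.append(head)
    return found


def process_document(doc, source_field="text", target_field="text_heads"):
    """Return a copy of doc with the matched head words in an extra field."""
    enriched = dict(doc)
    enriched[target_field] = find_heads(doc.get(source_field, ""))
    return enriched
```

A query for "jacke" would then be run against the extra field. Pure suffix matching like this also matches any unrelated word that merely ends in the same letters, so the list of head words would need to be chosen with care.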

Re: Lexical analysis tools for German language data

2012-04-12 Thread Bernd Fehling
You might have a look at: http://www.basistech.com/lucene/ On 12.04.2012 11:52, Michael Ludwig wrote: > Given an input of "Windjacke" (probably "wind jacket" in English), I'd > like the code that prepares the data for the index (tokenizer etc) to > understand that this is a "Jacke" ("jacket")

Re: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht
Michael, I have been on this list and the Lucene list for several years and have not found this yet. It has been one of the "neglected topics", to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the de-compound