AW: Lexical analysis tools for German language data

2012-04-13 Thread Michael Ludwig
> From: Tomas Zerolo > > > There can be transformations or inflections, like the "s" in > > > "Weihnachtsbaum" (Weihnachten/Baum). > > > > I remember from my linguistics studies that the terminus technicus > > for these is "Fugenmorphem" (interstitial or joint morpheme) [...] > > IANAL (I am not a l

Re: Lexical analysis tools for German language data

2012-04-12 Thread Tomas Zerolo
On Thu, Apr 12, 2012 at 03:46:56PM +, Michael Ludwig wrote: > > From: Walter Underwood > > > German noun decompounding is a little more complicated than it might > > seem. > > > > There can be transformations or inflections, like the "s" in > > "Weihnachtsbaum" (Weihnachten/Baum). > > I remembe

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 9:00 AM, Paul Libbrecht wrote: > More or less, Fahrrad is generally abbreviated as Rad. > (even though Rad can mean wheel and bike) A synonym could handle this, since "fahren" would not be a good match. It is a judgement call, but this seems more like an equivalence "Fahrrad =

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma
On Thursday 12 April 2012 18:00:14 Paul Libbrecht wrote: > On 12 Apr 2012, at 17:46, Michael Ludwig wrote: > >> Some compounds probably should not be decompounded, like "Fahrrad" > >> (fahren/Rad). With a dictionary-based stemmer, you might decide to > >> avoid decompounding for words in the dic

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
On Apr 12, 2012, at 8:46 AM, Michael Ludwig wrote: > I remember from my linguistics studies that the terminus technicus for > these is "Fugenmorphem" (interstitial or joint morpheme). That is some excellent linguistic jargon. I'll file that with "hapax legomenon". If you don't highlight, you ca

Re: AW: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht
On 12 Apr 2012, at 17:46, Michael Ludwig wrote: >> Some compounds probably should not be decompounded, like "Fahrrad" >> (fahren/Rad). With a dictionary-based stemmer, you might decide to >> avoid decompounding for words in the dictionary. > > Good point. More or less, Fahrrad is generally ab

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> From: Walter Underwood > German noun decompounding is a little more complicated than it might > seem. > > There can be transformations or inflections, like the "s" in > "Weihnachtsbaum" (Weihnachten/Baum). I remember from my linguistics studies that the terminus technicus for these is "Fugenmorph

Re: Lexical analysis tools for German language data

2012-04-12 Thread Walter Underwood
German noun decompounding is a little more complicated than it might seem. There can be transformations or inflections, like the "s" in "Weihnachtsbaum" (Weihnachten/Baum). Internal nouns should be recapitalized, like "Baum" above. Some compounds probably should not be decompounded, like "Fahrrad
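The points above (linking elements, recapitalized inner nouns, compounds that should stay whole) can be illustrated with a minimal sketch. It is not the Lucene or Solr implementation; the function name `decompound`, the list `LINKING_ELEMENTS`, and the tiny vocabulary are assumptions made up for this example, and a real decompounder would need a full dictionary and handling for compounds with more than two parts.

```python
# Hypothetical, incomplete list of linking elements ("Fugenmorpheme").
LINKING_ELEMENTS = ("", "s", "es", "n", "en")


def decompound(word, vocabulary, keep_whole=frozenset()):
    """Split a compound into two known nouns, or return it unchanged.

    vocabulary: lowercase nouns that may appear as compound parts.
    keep_whole: lowercase words that must never be split.
    """
    lower = word.lower()
    if lower in keep_whole:
        return [word]
    for i in range(2, len(lower) - 1):
        head, rest = lower[:i], lower[i:]
        if head not in vocabulary:
            continue
        for link in LINKING_ELEMENTS:
            if not rest.startswith(link):
                continue
            tail = rest[len(link):]
            if tail in vocabulary:
                # Recapitalize both parts, since German nouns are capitalized.
                return [head.capitalize(), tail.capitalize()]
    return [word]


# Illustrative vocabulary only (an assumption, not a real dictionary).
VOCABULARY = {"weihnacht", "baum", "wind", "jacke", "kinder", "rad"}

print(decompound("Weihnachtsbaum", VOCABULARY))
print(decompound("Windjacke", VOCABULARY))
print(decompound("Fahrrad", VOCABULARY, keep_whole={"fahrrad"}))
```

The `keep_whole` set is one possible way to express "do not decompound words that are already in the dictionary"; whether that rule is right for a given index is a judgement call.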

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> From: Markus Jelsma > We've done a lot of tests with the HyphenationCompoundWordTokenFilter > using a FOP XML file generated from TeX for the Dutch language and > have seen decent results. A bonus was that now some tokens can be > stemmed properly because not all compounds are listed in the > dic

Re: Lexical analysis tools for German language data

2012-04-12 Thread Markus Jelsma
Hi, We've done a lot of tests with the HyphenationCompoundWordTokenFilter using a FOP XML file generated from TeX for the Dutch language and have seen decent results. A bonus was that now some tokens can be stemmed properly because not all compounds are listed in the dictionary for the Hunspell

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> From: Valeriy Felberg > If you want the query "jacke" to match a document containing the word > "windjacke" or "kinderjacke", you could use a custom update processor. > This processor could search the indexed text for words matching the > pattern ".*jacke" and inject the word "jacke" into an addi

Re: Lexical analysis tools for German language data

2012-04-12 Thread Bernd Fehling
Paul, nearly two years ago I requested an evaluation license and tested BASIS Tech Rosette for Lucene & Solr. It worked excellently, but the price was much, much too high. Yes, they also have compound analysis for several languages including German. Just configure your pipeline in Solr and set up the pr

Re: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht
Bernd, can you please say a little more? I think this list is ok to contain some description for commercial solutions that satisfy a request formulated on list. Is there any product at BASIS Tech that provides a compound-analyzer with a big dictionary of decomposed compounds in German? If yes,

Re: Lexical analysis tools for German language data

2012-04-12 Thread Valeriy Felberg
If you want the query "jacke" to match a document containing the word "windjacke" or "kinderjacke", you could use a custom update processor. This processor could search the indexed text for words matching the pattern ".*jacke" and inject the word "jacke" into an additional field which you can searc
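The pattern-and-inject idea can be sketched outside of Solr as a plain function over a document dictionary. This is only an illustration of the logic, not Solr's update processor API; the function name `inject_base_terms` and the field names `text` and `base_terms` are assumptions chosen for the example.

```python
import re


def inject_base_terms(doc, text_field, target_field, base_terms):
    """Return a copy of doc with matching base terms added to target_field.

    A base term such as "jacke" is added when any token of the text
    matches the pattern ".*jacke" (case-insensitively).
    """
    tokens = re.findall(r"\w+", doc.get(text_field, "").lower())
    found = []
    for base in base_terms:
        pattern = re.compile(".*" + re.escape(base.lower()))
        if any(pattern.fullmatch(token) for token in tokens):
            if base not in found:
                found.append(base)
    enriched = dict(doc)  # leave the caller's document untouched
    enriched[target_field] = found
    return enriched


doc = {"id": "1", "text": "Leichte Windjacke und Kinderjacke"}
print(inject_base_terms(doc, "text", "base_terms", ["jacke", "hose"]))
```

A query for "jacke" would then be run against the additional field as well as the main text field. The list of base terms has to be maintained by hand in this approach, which is its main limitation compared with dictionary-based decompounding.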

Re: Lexical analysis tools for German language data

2012-04-12 Thread Bernd Fehling
You might have a look at: http://www.basistech.com/lucene/ On 12.04.2012 11:52, Michael Ludwig wrote: > Given an input of "Windjacke" (probably "wind jacket" in English), I'd > like the code that prepares the data for the index (tokenizer etc) to > understand that this is a "Jacke" ("jacket")

Re: Lexical analysis tools for German language data

2012-04-12 Thread Paul Libbrecht
Michael, I've been on this list and the Lucene list for several years and have not found this yet. It's been one of the "neglected topics", to my taste. There is a CompoundAnalyzer but it requires the compounds to be dictionary based, as you indicate. I am convinced there's a way to build the de-compound

AW: Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
> Given an input of "Windjacke" (probably "wind jacket" in English), > I'd like the code that prepares the data for the index (tokenizer > etc) to understand that this is a "Jacke" ("jacket") so that a > query for "Jacke" would include the "Windjacke" document in its > result set. > > It appears t

Lexical analysis tools for German language data

2012-04-12 Thread Michael Ludwig
Given an input of "Windjacke" (probably "wind jacket" in English), I'd like the code that prepares the data for the index (tokenizer etc) to understand that this is a "Jacke" ("jacket") so that a query for "Jacke" would include the "Windjacke" document in its result set. It appears to me that such