Re: Stemmer Question

2012-03-10 Thread Jamie Johnson
Barring the horrible name, I am wondering if folks would be interested in having something like this as an alternative to the standard kstemmer. This is largely based on the SynonymFilter, except it builds tokens using the kstemmer and the original input. I've created a JIRA for this to start discu

Re: Stemmer Question

2012-03-09 Thread Jamie Johnson
So I've thrown something together fairly quickly which is based on what Ahmet had sent that I believe will preserve the original token as well as the stemmed version. I didn't go as far as weighting them differently using the payloads however. I am not sure how to use the preserveOriginal attribu
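
For readers following along, here is a minimal sketch of the idea described above: emit the original token, then the stemmed form stacked at the same position. It is not the filter from this thread and it is not Lucene code. `Token`, `toy_stem`, and `stem_preserving_original` are hypothetical names, and `toy_stem` is only a crude stand-in for a real stemmer such as KStem.

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    text: str
    # 1 = advances to a new position, 0 = stacked on the previous token
    position_increment: int


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer; strips a few suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_preserving_original(
    words: Iterable[str], stem: Callable[[str], str] = toy_stem
) -> Iterator[Token]:
    """Yield each original word, plus its stem at the same position if it differs."""
    for word in words:
        yield Token(word, 1)
        stemmed = stem(word)
        if stemmed != word:
            # Increment 0 keeps phrase positions intact for both forms.
            yield Token(stemmed, 0)


tokens = list(stem_preserving_original(["running", "dogs", "cat"]))
```

With this layout a query can match either the exact word or its stem, at the cost of one extra token per stemmed word. Weighting the two forms differently (for example with payloads) is left out here.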

Re: Stemmer Question

2012-03-09 Thread Jamie Johnson
Further digging leads me to believe this is not the case. The Synonym Filter supports this, but the Stemming Filter does not. Ahmet, would you be willing to provide your filter as well? I wonder if we can make it aware of the preserveOriginal attribute on WordDelimiterFilterFactory? On Fri, Ma

Re: Stemmer Question

2012-03-09 Thread Jamie Johnson
Ok, so I'm digging through the code and I noticed in org.apache.lucene.analysis.synonym.SynonymFilter there are mentions of a keepOrig attribute. Doing some googling led me to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters which speaks of an attribute preserveOriginal="1" on solr.Word

Re: Stemmer Question

2012-03-09 Thread Ahmet Arslan
> I'd be very interested to see how you did this if it is available. Does this seem like something useful to the community at large?
I PMed it to you. Filter is not a big deal. Just modified from {@link org.apache.lucene.wordnet.SynonymTokenFilter}. If requested, I can provide it publicly t

Re: Stemmer Question

2012-03-08 Thread Jamie Johnson
I'd be very interested to see how you did this if it is available. Does this seem like something useful to the community at large?
On Thursday, March 8, 2012, Ahmet Arslan wrote:
>> Thanks, the KeywordMarkerFilterFactory seems to be what I was looking for. I'm still wondering about keeping

Re: Stemmer Question

2012-03-08 Thread Ahmet Arslan
> Thanks, the KeywordMarkerFilterFactory seems to be what I was looking for. I'm still wondering about keeping the unstemmed word as a token though. While I know that this would increase the index size slightly, I wonder what the negative of doing such a thing would be? Just seems

Re: Stemmer Question

2012-03-08 Thread Jamie Johnson
Thanks, the KeywordMarkerFilterFactory seems to be what I was looking for. I'm still wondering about keeping the unstemmed word as a token though. While I know that this would increase the index size slightly, I wonder what the negative of doing such a thing would be? Just seems less destructive s
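
As I understand it, KeywordMarkerFilterFactory marks tokens found in a protected-words list so that a later stemmer leaves them alone. A rough sketch of that behaviour, not the Solr implementation: `toy_stem` and `stem_unless_protected` are hypothetical names, and the stemmer is a toy that only strips a trailing "s".

```python
from typing import Callable, Iterable, Iterator, Set


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer; strips a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stem_unless_protected(
    words: Iterable[str],
    protected: Set[str],
    stem: Callable[[str], str] = toy_stem,
) -> Iterator[str]:
    """Stem every word except those listed as protected."""
    for word in words:
        # Protected words pass through untouched, like keyword-marked tokens.
        yield word if word in protected else stem(word)


result = list(stem_unless_protected(["news", "dogs", "cats"], {"news"}))
```

Here "news" is kept as-is while the other words are stemmed. Unlike keeping the original token alongside the stem, this approach needs the list of exceptions to be known up front.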

Re: Stemmer Question

2012-03-08 Thread Ahmet Arslan
> I was previously using the PorterStemmer to do stemming and ran into an issue where it was overly aggressive with some words or abbreviations which I needed to stop. I have recently switched to KStem and I believe the issue is less, but I was wondering still if there was a way to s