Thanks! It is good to know I did not do something in vain :)

On Wed, Jan 20, 2010 at 6:54 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> Ah, OK. I take the "unnecessary" comment back. If you require
> the original form of the tokens (not just the original text), then you
> do have to do something to preserve them, so I think you're on
> the right track....
>
> FWIW
> Erick
>
> On Wed, Jan 20, 2010 at 9:38 AM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
>
> > Hi Eric,
> >
> > I think I realize that and I am actually using this - I am using the
> > stemmed, cased etc. token from the stored "term vectors" and additionally I
> > am using the field values.
> > But the fields values are different from the tokens in the level of
> > granularity.
> > When I access the term vector for my field "body" I get the tokens:
> > "old", "man", "sea" (the rest is stopwords)
> > while if I use document.getter methods for my field I get the value of the
> > field "body", which is:
> > "The Old Man and the Sea"
> > But.. what I actually need is the original version of the tokens and not the
> > field value itself, in that example I need:
> > "Old" "Man", "Sea"
> > and not
> > "The Old Man and the Sea"
> > that is why I had to do my version of that filter so that during tokens
> > transformation (stemming, lowercasing) I store a map of the filtered term
> > -to- original term.
> >
> > I am using Apache Mahout to read from Solr index (term vectors) and cluster
> > Solr documents based on these terms (tokens) and the clustering process
> > itself works with the stemmed, lowercased terms while at the end I want to
> > present the original terms - and the only way I found is by using this
> > stemmed term-to-original-token-map which I build during stemming.
> > Am I missing some existing method to access stored tokens before they get
> > stemmed?
> >
> > Best regards,
> > Bogdan
> >
> > On Wed, Jan 20, 2010 at 2:39 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > This is completely unnecessary. Fields can be both indexed and
> > > stored, and the operations are orthogonal.
> > >
> > > That is, when you specify that a field is indexed, it is run through
> > > an analyzer and the *tokens* are indexed, after any
> > > stemming, casing, etc.
> > >
> > > Stored means that the original value, before any analysis
> > > whatsoever, is put in a completely separate location.
> > > It's only there for retrieval and display to the user. It's as if
> > > a copy of the original text was put into one place, and the
> > > tokens were put in another.
> > >
> > > Consider the problem of book titles. If I have a title "The Old
> > > Man and the Sea", I want to display that title as a result of
> > > searching for "old sea man". Rather than force the separate
> > > storage to be done programmatically, SOLR allows you to
> > > specify these two options. So if I specify indexing and storing,
> > > the tokens "old" "man" "sea" (assuming lowercasing,
> > > stopword removal, etc) are added to the searchable index.
> > > "The Old Man and the Sea" is copied somewhere else, and
> > > when you ask for the *value* of the field, you get "The Old Man
> > > and the Sea". This stored part of the index is never searched, it
> > > is solely there for retrieval/display.
> > >
> > > I'd really get a copy of the book, it'll save you lots of time and
> > > effort.
> > >
> > > HTH
> > > Erick
> > >
> > > On Tue, Jan 19, 2010 at 5:45 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> > >
> > > > I am using fields like:
> > > > <field name="msg_body" type="body_text" termVectors="true" indexed="true" stored="true"/>
> > > > which contain multi-line text, not just single strings, what does "stored
> > > > values" mean?
> > > > I am relatively new to Solr
> > > >
> > > > I solved my issue by copy/pasting and enhancing
> > > > the SnowballPorterFilterFactory class by
> > > > creating SnowballPorterWithUnstemLowerCaseFilterFactory
> > > > I added lowercasing inside the factory since I need to capture the original
> > > > terms store them in a side file and only then lowercase and stem.
> > > >
> > > > <fieldType name="body_text" class="solr.TextField" positionIncrementGap="100">
> > > >   <analyzer type="index">
> > > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > > >     <filter class="solr.StopFilterFactory"
> > > >             ignoreCase="true"
> > > >             words="stopwords.txt"
> > > >             enablePositionIncrements="true"/>
> > > >     <filter class="solr.WordDelimiterFilterFactory"
> > > >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > > >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > > >     <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
> > > >     <!-- <filter class="solr.SnowballPorterFilterFactory"
> > > >          language="English" protected="protwords.txt"/> -->
> > > >     <filter class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
> > > >             language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
> > > >   </analyzer>
> > > >
> > > > I was wondering if there is an easier way (without doing this custom filter
> > > > that I did).
> > > >
> > > > Best regards,
> > > > Bogdan
> > > >
> > > > On Wed, Jan 20, 2010 at 12:38 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> > > >
> > > > > Bogdan,
> > > > >
> > > > > You can get them from stored values of your fields, if you are storing
> > > > > them.
> > > > >
> > > > > Otis
> > > > > --
> > > > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > > > >
> > > > > ----- Original Message ----
> > > > > > From: Bogdan Vatkov <bogdan.vat...@gmail.com>
> > > > > > To: solr-user@lucene.apache.org
> > > > > > Sent: Tue, January 19, 2010 5:28:51 PM
> > > > > > Subject: Unstemming after solr.PorterStemFilterFactory
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am indexing with the solr.PorterStemFilterFactory included but then I need
> > > > > > to access the unstemmed versions of the terms, what would be the easiest way
> > > > > > to get the unstemmed version?
> > > > > > Thanks in advance.
> > > > > >
> > > > > > Best regards,
> > > > > > Bogdan
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Bogdan
> > > >
> > > > --
> > > > Best regards,
> > > > Bogdan
> >
> > --
> > Best regards,
> > Bogdan
>
--
Best regards,
Bogdan
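
For readers following the thread: the idea discussed above (keeping the stored field value, indexing the lowercased and stemmed tokens, and recording a side map from each indexed term back to the original tokens) might be sketched roughly as follows. This is only an illustration in Python, not the code of SnowballPorterWithUnstemLowerCaseFilterFactory. The names `STOPWORDS`, `toy_stem`, and `analyze` are hypothetical, and the stemmer is a trivial stand-in that merely strips a trailing "s"; it is not the Porter or Snowball algorithm.

```python
from collections import defaultdict

# Illustrative stopword list only; a real setup would read stopwords.txt.
STOPWORDS = {"the", "and", "a", "an", "of"}


def toy_stem(token: str) -> str:
    """Stand-in for a real stemmer: strips one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def analyze(text: str):
    """Return (stored_value, indexed_terms, unstem_map) for one field value.

    stored_value  : the untouched original text (what "stored" gives back)
    indexed_terms : lowercased, stopword-filtered, stemmed tokens
    unstem_map    : indexed term -> set of original tokens that produced it
    """
    indexed_terms = []
    unstem_map = defaultdict(set)
    for original in text.split():          # whitespace tokenization
        lowered = original.lower()
        if lowered in STOPWORDS:           # stopword check ignores case
            continue
        term = toy_stem(lowered)
        indexed_terms.append(term)
        unstem_map[term].add(original)     # remember the pre-analysis form
    return text, indexed_terms, dict(unstem_map)


stored, terms, unstem = analyze("The Old Man and the Sea")
print(stored)   # the original field value
print(terms)    # the terms a clustering step would work with
print(unstem)   # lookup used to present original tokens at the end
```

With a map like this, a clustering step can operate on the stemmed terms and translate them back to one of the recorded original tokens for display. Since several originals can collapse to one term, the sketch keeps a set per term; which original to show is left open here.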