Ah, OK. I take the "unnecessary" comment back. If you require the original form of the tokens (not just the original text), then you do have to do something to preserve them, so I think you're on the right track....
FWIW
Erick

On Wed, Jan 20, 2010 at 9:38 AM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:

> Hi Eric,
>
> I think I realize that and I am actually using this - I am using the
> stemmed, cased etc. token from the stored "term vectors" and additionally I
> am using the field values.
> But the fields values are different from the tokens in the level of
> granularity.
> When I access the term vector for my field "body" I get the tokens:
> "old", "man", "sea" (the rest is stopwords)
> while if I use document.getter methods for my field I get the value of the
> field "body", which is:
> "The Old Man and the Sea"
> But.. what I actually need is the original version of the tokens and not the
> field value itself, in that example I need:
> "Old" "Man", "Sea"
> and not
> "The Old Man and the Sea"
> that is why I had to do my version of that filter so that during tokens
> transformation (stemming, lowercasing) I store a map of the filtered term
> -to- original term.
>
> I am using Apache Mahout to read from Solr index (term vectors) and cluster
> Solr documents based on these terms (tokens) and the clustering process
> itself works with the stemmed, lowercased terms while at the end I want to
> present the original terms - and the only way I found is by using this
> stemmed term-to-original-token-map which I build during stemming.
> Am I missing some existing method to access stored tokens before they get
> stemmed?
>
> Best regards,
> Bogdan
>
> On Wed, Jan 20, 2010 at 2:39 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > This is completely unnecessary. Fields can be both indexed and
> > stored, and the operations are orthogonal.
> >
> > That is, when you specify that a field is indexed, it is run through
> > an analyzer and the *tokens* are indexed, after any
> > stemming, casing, etc.
> >
> > Stored means that the original value, before any analysis
> > whatsoever, is put in a completely separate location.
> > It's only there for retrieval and display to the user. It's as if
> > a copy of the original text was put into one place, and the
> > tokens were put in another.
> >
> > Consider the problem of book titles. If I have a title "The Old
> > Man and the Sea", I want to display that title as a result of
> > searching for "old sea man". Rather than force the separate
> > storage to be done programmatically, SOLR allows you to
> > specify these two options. So if I specify indexing and storing,
> > the tokens "old" "man" "sea" (assuming lowercasing,
> > stopword removal, etc) are added to the searchable index.
> > "The Old Man and the Sea" is copied somewhere else, and
> > when you ask for the *value* of the field, you get "The Old Man
> > and the Sea". This stored part of the index is never searched, it
> > is solely there for retrieval/display.
> >
> > I'd really get a copy of the book, it'll save you lots of time and
> > effort.
> >
> > HTH
> > Erick
> >
> > On Tue, Jan 19, 2010 at 5:45 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> >
> > > I am using fields like:
> > > <field name="msg_body" type="body_text" termVectors="true" indexed="true"
> > > stored="true"/>
> > > which contain multi-line text, not just single strings, what does "stored
> > > values" mean?
> > > I am relatively new to Solr
> > >
> > > I solved my issue by copy/pasting and enhancing
> > > the SnowballPorterFilterFactory class by
> > > creating SnowballPorterWithUnstemLowerCaseFilterFactory
> > > I added lowercasing inside the factory since I need to capture the original
> > > terms store them in a side file and only then lowercase and stem.
> > >
> > > <fieldType name="body_text" class="solr.TextField"
> > > positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory"
> > >             ignoreCase="true"
> > >             words="stopwords.txt"
> > >             enablePositionIncrements="true"
> > >     />
> > >     <filter class="solr.WordDelimiterFilterFactory"
> > >             generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > >             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > >     <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
> > >     <!-- <filter class="solr.SnowballPorterFilterFactory"
> > >          language="English" protected="protwords.txt"/> -->
> > >     <filter class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
> > >             language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
> > >   </analyzer>
> > >
> > > I was wondering if there is an easier way (without doing this custom filter
> > > that I did).
> > >
> > > Best regards,
> > > Bogdan
> > >
> > > On Wed, Jan 20, 2010 at 12:38 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> > >
> > > > Bogdan,
> > > >
> > > > You can get them from stored values of your fields, if you are storing
> > > > them.
> > > >
> > > > Otis
> > > > --
> > > > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > > >
> > > >
> > > > ----- Original Message ----
> > > > > From: Bogdan Vatkov <bogdan.vat...@gmail.com>
> > > > > To: solr-user@lucene.apache.org
> > > > > Sent: Tue, January 19, 2010 5:28:51 PM
> > > > > Subject: Unstemming after solr.PorterStemFilterFactory
> > > > >
> > > > > Hi,
> > > > >
> > > > > I am indexing with the solr.PorterStemFilterFactory included but then I need
> > > > > to access the unstemmed versions of the terms, what would be the easiest way
> > > > > to get the unstemmed version?
> > > > > Thanks in advance.
> > > > >
> > > > > Best regards,
> > > > > Bogdan
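
For readers following the thread, the "stemmed term -to- original term" map Bogdan describes could be sketched roughly as below. This is a small stand-alone Python illustration under stated assumptions, not Solr or Lucene code and not Bogdan's actual filter: `toy_stem`, `analyze`, `display_form`, and the stop list are hypothetical names made up for the example, and the suffix stripper is only a stand-in for a real Porter/Snowball stemmer. The idea is simply to record each token's surface form at the moment it is lowercased and stemmed, so the clustering can run on the stemmed terms while the display uses the originals.

```python
from collections import Counter, defaultdict

# Illustrative stop list (assumption); not the contents of Solr's stopwords.txt.
STOPWORDS = {"the", "and", "a", "of"}


def toy_stem(term):
    """Crude suffix stripper standing in for a real Porter/Snowball stemmer."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def analyze(text, unstem):
    """Return the indexed terms for `text`, recording each term's original forms."""
    terms = []
    for original in text.split():            # whitespace tokenization
        if original.lower() in STOPWORDS:    # case-insensitive stop filter
            continue
        term = toy_stem(original.lower())    # lowercase first, then stem
        unstem[term][original] += 1          # remember the surface form we saw
        terms.append(term)
    return terms


def display_form(term, unstem):
    """Pick the most frequent original form for a term, or the term itself."""
    forms = unstem.get(term)
    return forms.most_common(1)[0][0] if forms else term


unstem = defaultdict(Counter)
terms = analyze("The Old Man and the Sea", unstem)
print(terms)                                     # ['old', 'man', 'sea']
print([display_form(t, unstem) for t in terms])  # ['Old', 'Man', 'Sea']
```

Keeping a count per surface form, rather than a single value, is one possible design choice: several originals ("Sea", "seas", "Seas") can collapse to the same stemmed term, and the most frequent one is a reasonable label to show at the end of clustering.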