Re: No Analyzer, tokenizer or stemmer works at Solr

MitchK Fri, 08 Jan 2010 06:30:49 -0800

Okay, you're right. It really would be cleaner to do such things in the
code that populates the document for Solr.


Is there a way to prepare a document in the described way with Lucene/Solr
before it gets analyzed?
My use case is to categorize several documents automatically, which means
that I have to "create" data from the given input by doing some
information retrieval.

The problem is that I am really new to Solr and Lucene, as you can see, and
I do not know whether there are classes that fit my needs.

Any idea?
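
For illustration, here is a minimal sketch of what "doing it in the code
that populates the document" could look like for the two use cases quoted
below (the price category and the md5 of the title). It is plain Python
and does not talk to Solr at all. The helper prepare_document and the
field names price_category and title_md5 are assumptions made up for this
example; the threshold of 500 is taken from the quoted use case.

```python
import hashlib

# Threshold from the "higher than 500$" example in this thread.
EXPENSIVE_THRESHOLD = 500


def prepare_document(doc):
    """Return a copy of doc with derived fields added.

    This would run in the indexing client, before the document is sent
    to Solr, so no custom analyzer is involved. Field names are
    hypothetical.
    """
    prepared = dict(doc)
    # Derived category: strictly higher than the threshold is "expensive".
    prepared["price_category"] = (
        "expensive" if doc["price"] > EXPENSIVE_THRESHOLD else "cheap"
    )
    # Abstract form of the title, computed once instead of on every client.
    prepared["title_md5"] = hashlib.md5(
        doc["title"].encode("utf-8")
    ).hexdigest()
    return prepared


doc = {
    "title": "The good, the bad and the ugly",
    "label": "western",
    "owner": "mitch",
    "price": 650,
}
prepared = prepare_document(doc)
print(prepared["price_category"])  # prints "expensive"
```

The derived values would then go into their own fields in the schema
(stored, and indexed only if they should be searchable), which is what
Erick describes in the quoted reply below.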


Erick Erickson wrote:
> 
> Well, I'd approach either of these use cases
> by simply performing my computations on
> the input and storing the result in another
> (non-indexed unless I wanted to search it)
> field. This wouldn't happen in the Analyzer,
> but in the code that populated the document
> fields.....
> 
> Which is a much cleaner solution IMO than creating
> some sort of "index this but store that" capability.
> The purpose of analysis is to produce *searchable*
> tokens after all.
> 
> But we're getting into angels dancing on pins here. Do
> you actually have a use case you're trying to implement
> or is this mostly theoretical?
> 
> Erick
> 
> On Thu, Jan 7, 2010 at 2:08 PM, MitchK <mitc...@web.de> wrote:
> 
>>
>> The difference between stored and indexed is clear now.
>>
>> You are right, if you are responding only to "normal users".
>>
>> Use case:
>> You got a stored field "The good, the bad and the ugly".
>> And you got a really fantastic analyzer, which is doing some magic to
>> this movie title.
>> Let's say, the analyzer translates the title into md5 or into another
>> abstract expression.
>> Instead of doing the same magical function on the client's side again and
>> again, he only needs to take the prepared data from your response.
>>
>> Another use case could be:
>> Imagine you have got two categories, cheap and expensive, and your
>> document has a title, a label, an owner and a price field.
>> Imagine you would analyze, index and store them like you normally do and
>> afterwards you want to set, whether the document belongs to the expensive
>> item-group or not.
>> If the price for the item is higher than $500, it belongs to the
>> expensive ones, otherwise not.
>> I think, this would be a job for a special analyzer - and this only makes
>> sense, if I also store the analyzed data.
>>
>> I think information retrieval is a really interesting use case.
>>
>>
>> Erick Erickson wrote:
>> >
>> > What is your use case for "responding sometimes with the indexed
>> > value"? Other than reconstructing a field that hasn't been stored,
>> > I can't think of one.
>> >
>> > I still think you're missing the point. Indexing and storing are
>> > orthogonal operations that have (almost) nothing to do with each
>> > other, for all that they happen at the same time on the same field.
>> >
>> > You never search against the stored data in a field. You *always*
>> > search against the indexed data.
>> >
>> > Contrariwise, you never display the indexed form to the user, you
>> > *always* show the stored data (unless you come up with
>> > a really interesting use case).
>> >
>> > Step back and consider what happens when you index data,
>> > it gets broken up all kinds of ways. Stop words are removed,
>> > case may change, etc, etc, etc. It makes no sense to
>> > then display this data for a user. Would you really like
>> > to have, say a movie title "The Good, The Bad, and The
>> > Ugly". Remove stopwords, punctuation and lowercase
>> > and you index three tokens "good", "bad", "ugly".
>> > Even if you reconstruct this field, the user would see
>> > "good bad ugly". Bad, very bad.
>> >
>> > Yet I want to display the original title to the user in
>> > response to searching on "ugly", so I need the
>> > original, unanalyzed data.
>> >
>> > Perhaps it would help to think of it this way.
>> > 1> take some data and index it in f1
>> >     but do NOT store it in f1. Store it in f2
>> >     but do NOT index it in f2.
>> > 2> take that same data, index AND store
>> >     it in f3.
>> >
>> > <1> is almost entirely equivalent to <2>
>> > in terms of index resources.
>> >
>> > Practically though, <1> is harder to use,
>> > because you have to remember
>> > to use f1 for searching and f2 for getting
>> > the raw data.
>> >
>> > HTH
>> > Erick
>> >
>> > On Thu, Jan 7, 2010 at 12:11 PM, MitchK <mitc...@web.de> wrote:
>> >
>> >>
>> >> Thank you, Ryan. I will have a look at Lucene's material and luke.
>> >>
>> >> I think I got it. :)
>> >>
>> >> Sometimes there will be the need to respond with the value on the
>> >> one hand and the indexed version of the value on the other hand.
>> >> How can I fulfill such needs? Doing copyField on indexed-only fields?
>> >>
>> >>
>> >>
>> >> ryantxu wrote:
>> >> >
>> >> >
>> >> > On Jan 7, 2010, at 10:50 AM, MitchK wrote:
>> >> >
>> >> >>
>> >> >> Erick,
>> >> >>
>> >> >> you mean, everything is okay, but I do not see it?
>> >> >>
>> >> >>>> Internally for searching the analysis takes place and writes to the
>> >> >>>> index in an inverted fashion, but the stored stuff is left alone.
>> >> >>
>> >> >> if I use an analyzer, Solr "stores" its output in two ways?
>> >> >> One public output, which is similar to the original input
>> >> >> and one "hidden" or internal output, which is based on the
>> >> >> analyzer's work?
>> >> >> Did I understand that right?
>> >> >
>> >> > yes.
>> >> >
>> >> > indexed fields and stored fields are different.
>> >> >
>> >> > Solr results show stored fields in the results (however facets are
>> >> > based on indexed fields)
>> >> >
>> >> > Take a look at Lucene in Action for a better description of what is
>> >> > happening.  The best tool to get your head around what is happening
>> >> > is probably luke (http://www.getopt.org/luke/)
>> >> >
>> >> >
>> >> >>
>> >> >> If yes, I have got another problem:
>> >> >> I don't want to waste any disk space.
>> >> >
>> >> > You have control over what is stored and what is indexed -- how that
>> >> > is configured is up to you.
>> >> >
>> >> > ryan
>> >> >
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27063452.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27065305.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Custom-Analyzer-Tokenizer-works-but-results-were-not-saved-tp27026739p27076795.html
Sent from the Solr - User mailing list archive at Nabble.com.
