A more interesting use case:

Analyzing text and finding a number, like the mean word length or the mean
number of repeated words. These are standard tools for spam detection. To
create these, we would want to shovel text into a text processing chain that
creates an integer. We then want to both store that integer and index it. We
don't want to store the shoveled text.
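
As a rough illustration of the two statistics mentioned above, here is a minimal Python sketch. The function names and the regex tokenization are assumptions made for this example, not anything Solr provides. "Repeated words" is read here as the number of distinct words occurring more than once, which is only one possible interpretation. The mean is rounded because the goal is an integer that can be stored and indexed.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def mean_word_length(text):
    """Mean token length rounded to the nearest integer; 0 for empty text."""
    words = tokenize(text)
    if not words:
        return 0
    return round(sum(len(w) for w in words) / len(words))

def repeated_word_count(text):
    """Number of distinct words that occur more than once."""
    counts = Counter(tokenize(text))
    return sum(1 for n in counts.values() if n > 1)
```

Only the resulting integers would be kept; the input text is discarded once the numbers are computed.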

Solr does not currently do this. I don't know if the Solr processing stack has
this flexibility, or if it is worth adding it.

Lance

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 17, 2008 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: copyField limitation



: But, the <copyField> directive in the schema has a limitation. It will only
: copy data between fields with the same type. If the two fields are a
: different type, the copy is ignored. This example would require <copyField>
: to translate 'sint' to 'integer'.

i can't reproduce this problem. with the following additions to the example
schema...

   <field name="popularityI" type="integer" indexed="true" stored="true" default="0"/>
   ...
   <copyField source="popularity" dest="popularityI"/>

...i was able to see, sort, and search on the popularityI field with no
problems.

: Another case is days (not times):
        ...
: This would express the date as a string 2008-xx-xxT00:00:00Z and store that
: into the day field. It is not as optimal as using '2008-xx-xx' but is still
: useful for wildcards.
        ...

I'm not entirely sure i understand what you are asking ... but i believe your
point is that there is no easy way to do a copyField that reformats the data
(ie: changing date formats, or converting the date to an int)

In my opinion, this class of situations isn't a limitation of copyField as
much as it is a silly restriction in the way FieldTypes are handled by
IndexSchema ... currently "TextField" is a special case because it's the
only FieldType that can have an analyzer (i'm not even sure where this
special case logic is ... i thought it was when the IndexSchema is
initialized, but i can't find it now)

It would be nice if any FieldType could have an analyzer, and as long as the
token(s) produced by that analyzer met the necessary conditions for the
data type, things would go on their merry way ... DateReFormatFilter's could
be used to convert from any arbitrary date format to the one Solr expects,
etc.... you could have a detailedDate field and <copyField> from that
to a justDate string field that used a PatternReplaceFilter to strip off the
time.
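
To make the justDate idea concrete, here is a minimal Python sketch of the transformation such a pattern-replace step would perform. It only illustrates the string rewrite; it is not Solr configuration or Solr API, and the regex, the `TIME_SUFFIX` name, and the `just_date` function are assumptions made for this example.

```python
import re

# Hypothetical stand-in for a pattern-replace step: match the time portion
# of a timestamp such as 2008-01-17T18:11:00Z so it can be stripped off.
TIME_SUFFIX = re.compile(r"T\d{2}:\d{2}:\d{2}(\.\d+)?Z$")

def just_date(detailed_date):
    """Return only the date part; values without a time suffix pass through."""
    return TIME_SUFFIX.sub("", detailed_date)
```

Note that this changes only what would be indexed for the justDate field, which is the limitation raised in the next paragraph.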

This still wouldn't help change the "stored" value of those fields though so
that the data would look right when retrieving stored values.

Perhaps we should add an optional hook for mutating the "stored" value of a
fieldtype as well?  ... it could be an Analyzer (ie: 
tokenizer+filterchain) so that we get reuse of existing concepts, with
each resulting token being treated as a separate multivalue (for the common
case of rejoining all the tokens into a single string, we can add a
StringBufferConcatTokenFilter or something) 

        ?
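
A rough Python sketch of how such a stored-value hook might behave, assuming a tokenizer plus filter chain modeled as plain functions. Every name here is hypothetical and none of it is existing Solr code. Each token left at the end of the chain becomes one stored value, and a concatenating filter collapses them back into a single string.

```python
def whitespace_tokenizer(value):
    """Split the raw field value into tokens on whitespace."""
    return value.split()

def lowercase_filter(tokens):
    """Example filter: lowercase every token."""
    return [t.lower() for t in tokens]

def concat_filter(tokens, sep=" "):
    """Rejoin all tokens into one token (the common single-string case)."""
    return [sep.join(tokens)] if tokens else []

def stored_values(value, tokenizer, filters):
    """Run the chain; each resulting token is one stored (multi)value."""
    tokens = tokenizer(value)
    for apply_filter in filters:
        tokens = apply_filter(tokens)
    return tokens
```

Without `concat_filter` at the end of the chain, a value like "Foo BAR" would be stored as two separate values; with it, as one.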


-Hoss

