This may be possible to do with Lucene's new SinkTokenizer/TeeTokenFilter
functionality. You might find
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06863.html
useful in that context. Also, search the Lucene dev list for discussion.
-Grant
On Jan 22, 2008, at 3:13 PM, Lance Norskog wrote:
A more interesting use case:
Analyzing text and finding a number, like the mean word length or the mean
number of repeated words. These are standard tools for spam detection. To
create these, we would want to shovel text into a text processing chain that
creates an integer. We then want to both store that integer and index it. We
don't want to store the shoveled text.
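
As a rough illustration of the kind of "text in, integer out" chain described above, here is a minimal sketch in plain Java. It does not use any Solr or Lucene API; the class name SpamTextStats, the regex-based word splitting, and the reading of "mean number of repeated words" as "mean occurrences per distinct word" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Solr or Lucene.
class SpamTextStats {

    // Lowercases the text and splits on runs of non-letter/non-digit
    // characters; a stand-in for a real tokenizer.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Mean word length, rounded to the nearest integer (0 for empty text).
    static int meanWordLength(String text) {
        List<String> ws = words(text);
        if (ws.isEmpty()) {
            return 0;
        }
        long total = 0;
        for (String w : ws) {
            total += w.length();
        }
        return (int) Math.round((double) total / ws.size());
    }

    // Mean occurrences per distinct word, rounded (0 for empty text).
    // This is one possible reading of "mean number of repeated words".
    static int meanOccurrencesPerWord(String text) {
        List<String> ws = words(text);
        if (ws.isEmpty()) {
            return 0;
        }
        Set<String> distinct = new HashSet<>(ws);
        return (int) Math.round((double) ws.size() / distinct.size());
    }

    public static void main(String[] args) {
        String body = "buy now buy now buy cheap";
        System.out.println(meanWordLength(body));
        System.out.println(meanOccurrencesPerWord(body));
    }
}
```

Only the resulting integers would be stored and indexed; the input text is thrown away once the numbers are computed.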
Solr does not do this now. I don't know if the Solr processing stack has
this flexibility, or if it is worth adding it.
Lance
-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 17, 2008 6:11 PM
To: solr-user@lucene.apache.org
Subject: Re: copyField limitation
: But, the <copyField> directive in the schema has a limitation. It will only
: copy data between fields with the same type. If the two fields are a
: different type, the copy is ignored. This example would require <copyField>
: to translate 'sint' to 'integer'.
I can't reproduce this problem. With the following additions to the example
schema...
<field name="popularityI" type="integer" indexed="true" stored="true" default="0"/>
...
<copyField source="popularity" dest="popularityI"/>
...I was able to see, sort, and search on the popularityI field with no
problems.
: Another case is days (not times):
...
: This would express the date as a string 2008-xx-xxT00:00:00Z and store that
: into the day field. It is not as optimal as using '2008-xx-xx' but is still
: useful for wildcards.
...
I'm not entirely sure I understand what you are asking ... but I believe your
point is that there is no easy way to do a copyField that reformats the data
(ie: changing date formats, or converting the date to an int).
In my opinion, this class of situations isn't a limitation of copyField as
much as it is a silly restriction in the way FieldTypes are handled by
IndexSchema ... currently "TextField" is a special case because it's the
only FieldType that can have an analyzer (I'm not even sure where this
special case logic is ... I thought it was when the IndexSchema is
initialized, but I can't find it now).
It would be nice if any FieldType could have an analyzer, and as long as the
token(s) produced by that analyzer met the necessary conditions for the
data type, things would go on their merry way ... DateReFormatFilters could
be used to convert from any arbitrary date format to the one Solr expects,
etc. ... you could have a detailedDate field and <copyField> from that
to a justDate string field that used a PatternReplaceFilter to strip off the
time.
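
To make the "strip off the time" idea concrete, here is a small sketch of the kind of pattern replacement involved. It uses plain java.util.regex rather than Solr's actual filter, and the class name JustDate and the exact pattern are assumptions for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing the regex a pattern-replace step might apply.
class JustDate {

    // Matches the trailing time portion of a Solr-style date such as
    // 2008-01-17T18:11:00Z, with optional fractional seconds.
    private static final Pattern TIME_PART =
            Pattern.compile("T\\d{2}:\\d{2}:\\d{2}(\\.\\d+)?Z$");

    // Returns the date with the time portion removed; values that do not
    // end in a time portion are returned unchanged.
    static String stripTime(String detailedDate) {
        return TIME_PART.matcher(detailedDate).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTime("2008-01-17T18:11:00Z"));
    }
}
```

The same pattern and an empty replacement are what one would presumably hand to a pattern-replacing filter in the justDate field's analyzer.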
This still wouldn't help change the "stored" value of those fields though, so
that the data would look right when retrieving stored values.
Perhaps we should add an optional hook for mutating the "stored" value of a
fieldtype as well? ... it could be an Analyzer (ie: tokenizer+filterchain)
so that we get reuse of existing concepts, with each resulting token being
treated as a separate multivalue (for the common case of rejoining all the
tokens into a single string, we can add a StringBufferConcatTokenFilter or
something)?
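
A rough sketch of how such a hook might behave, in plain Java with no Solr classes. Everything here is hypothetical: StoredValueHook, mutate, and concat are illustration-only names, and the "analyzer" is just a function from the raw value to a list of tokens. Each token becomes one stored value, and the concat step stands in for the StringBufferConcatTokenFilter idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical model of a "stored value" hook; not an existing Solr API.
class StoredValueHook {

    // Runs the raw field value through the analyzer stand-in; each
    // resulting token is treated as a separate stored value.
    static List<String> mutate(String raw, Function<String, List<String>> analyzer) {
        return analyzer.apply(raw);
    }

    // Rejoins all tokens into a single stored value, for the common case
    // where the field should stay single-valued.
    static List<String> concat(List<String> tokens, String separator) {
        return Collections.singletonList(String.join(separator, tokens));
    }

    // Trivial analyzer stand-in: splits on whitespace, skipping empty tokens.
    static List<String> whitespaceTokens(String raw) {
        List<String> out = new ArrayList<>();
        for (String t : raw.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = mutate("red  green blue", StoredValueHook::whitespaceTokens);
        System.out.println(values);
        System.out.println(concat(values, " "));
    }
}
```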
-Hoss
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ