On 7/17/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
Hi,
On 7/17/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 7/17/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > Is there a way to pass arguments to analyzers per document? Let's say
> > that I have a field "foo" which is tokenized by WhitespaceTokenizer
> > and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
> > can stem more than one language but (obviously) it needs to know the
> > language of the document it is working on. So what I need is to
> > specify the language per document (actually per field).
> >
> > Here is an example:
> > <doc>
> > <field name="....
> > .....
> > <field name="foo" lang="en">My spam egg bars baz.</field>
> > </doc>
> >
> > Is something like this possible with Solr?
>
> You can pass extra args to a factory in the field-type definition, but
> that means you would need a separate field-type per language.
Thanks for the answer.
Your suggestion would work for this particular use case, but IMHO
there are other use cases that could benefit from this (for example,
one may process the whole document and add parameters for each field
based on document-level analysis).

Would this be a useful feature for Solr? I would actually like to work
on it if others consider it a useful add-on. It seems simple to
accomplish and it would probably be a good introduction to Solr
internals.

Regarding passing more info to the analyzer at runtime to alter its
behavior: analyzers are singletons per field-type, and
Analyzer.tokenStream(String fieldName, Reader reader) is called to
analyze a particular value. That signature carries only the field name
and the value, so there isn't really a good place to pass in extra info.
During XML parsing, we *could* build up a Map of the parameters we
don't know about, but then the question is what to do with them. One
hackish solution would be to store them in a thread-local where your
analyzer could check it. Perhaps a custom request processor could do
that task.
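
A minimal sketch of the thread-local idea, under stated assumptions: the
names FieldParams, ToyStemmer and ThreadLocalDemo are hypothetical and are
not part of Solr or Lucene, and the "stemmer" is a toy stand-in for
MyCustomStemmingFilter. It only shows the hand-off: the indexing code
stores the unknown attributes (such as lang) for the current thread, the
analysis code reads them, and the caller clears them afterwards.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical per-thread holder for field attributes; not a Solr/Lucene class. */
final class FieldParams {
    private static final ThreadLocal<Map<String, String>> CURRENT =
            ThreadLocal.withInitial(HashMap::new);

    private FieldParams() {}

    /** Called by the indexing code just before a field value is analyzed. */
    static void set(Map<String, String> params) {
        CURRENT.set(new HashMap<>(params));
    }

    /** Called from inside the analysis chain to look up a parameter. */
    static String get(String key, String defaultValue) {
        return CURRENT.get().getOrDefault(key, defaultValue);
    }

    /** Clear after analysis so values do not leak to the next document on a pooled thread. */
    static void clear() {
        CURRENT.remove();
    }
}

/** Toy stand-in for a language-aware stemming filter (assumption, not real stemming). */
final class ToyStemmer {
    String stem(String token) {
        String lang = FieldParams.get("lang", "en");
        String t = token.toLowerCase(Locale.ROOT);
        // Pretend English stemming: strip a trailing "s" from longer tokens.
        if (lang.equals("en") && t.length() > 3 && t.endsWith("s")) {
            return t.substring(0, t.length() - 1);
        }
        return t; // other languages: lowercase only in this sketch
    }
}

final class ThreadLocalDemo {
    /** Whitespace-tokenizes the value and stems each token using the per-field params. */
    static String analyze(String value, Map<String, String> params) {
        FieldParams.set(params);
        try {
            ToyStemmer stemmer = new ToyStemmer();
            StringBuilder out = new StringBuilder();
            for (String token : value.trim().split("\\s+")) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(stemmer.stem(token));
            }
            return out.toString();
        } finally {
            FieldParams.clear();
        }
    }
}
```

The try/finally is the important part: servlet containers reuse threads,
so a value left in the thread-local would silently apply to whatever
document that thread indexes next.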
It seems there does need to be some kind of framework more aligned
with parsing documents (Word docs, PDF, etc.), for adding metadata to
fields at runtime (how does UIMA or Tika fit into this?), and for
mapping the fields+metadata to Solr/Lucene document fields.
-Yonik