Re: Single multilingual field analyzed based on other field values

Trey Grainger Thu, 19 Dec 2013 19:02:05 -0800

Hi Dave,

Sorry for the delayed reply.  Did you end up trying the (scary) caching
idea?

Yeah, there's no reasonable way today for an analyzer to access data from
the document's other fields.  The way to do this today is to create an
update request processor that pulls the data before the field-by-field
analysis and injects it (in some format) into the field that needs it.
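
A minimal sketch of that pre-processing step, written in Python for brevity
(a real Solr update request processor would be a Java class).  The field
names "languages" and "multilingual_field" and the function name are
hypothetical; the "en,es|" prefix format is the one used in my examples.

```python
def inject_language_prefix(doc, text_field="multilingual_field",
                           lang_field="languages"):
    """Return a copy of doc whose text field starts with 'en,es|'.

    Stand-in for what an update request processor would do before the
    field-by-field analysis runs. Field names are hypothetical.
    """
    langs = doc.get(lang_field) or []
    if isinstance(langs, str):
        # Allow a single language code given as a plain string.
        langs = [langs]
    updated = dict(doc)
    if langs and text_field in doc:
        # Prepend the codes so the field's analyzer can read them later.
        updated[text_field] = ",".join(langs) + "|" + doc[text_field]
    return updated


doc = {
    "id": "1",
    "languages": ["en", "es"],
    "multilingual_field": "hables espanol is what she asks",
}
print(inject_language_prefix(doc)["multilingual_field"])
# en,es|hables espanol is what she asks
```

Documents without a language field pass through unchanged, so the analyzer
still has to decide what to do when no prefix is present.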

In my examples, I only inserted a prefix at the start of the field (e.g.
en,es|hables espanol is what she asks), but if you need something more
complicated that marks specific sections of the field for different
analyzers, then you could pull that off as well.  For example:
<field name="multilingual_field">[langs="en"]hello world
[langs="en,es"]hables espanol is what she asks.
[autodetectOtherLangs="true" fallbackLangs="en"]some unknown language text for identification</field>

Then, you would just have the analyzer for the field parse the content,
pass each chunk of text into the appropriate analyzer, and then modify the
term positions and offsets as necessary.  My example in chapter 14 of Solr
in Action assumed you would be using the same languages throughout the
whole field, but it would just require a little bit of pre-parsing work to
direct the use of specific analyzers only for specific parts of the content.
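
A rough sketch of that pre-parsing and dispatch, again in Python rather than
as an actual Lucene analyzer.  Everything here is an assumption made for
illustration: the marker regex, the function names, the whitespace
"analyzer" stand-in, and the choice to stack the tokens from several
languages at the same positions.  Language autodetection is not implemented;
a section without langs simply uses its fallbackLangs.

```python
import re

# Section marker such as [langs="en,es"] or
# [autodetectOtherLangs="true" fallbackLangs="en"].
MARKER = re.compile(r'\[((?:\s*\w+="[^"]*")+\s*)\]')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def split_sections(text):
    """Yield (attrs, start_offset, chunk) for each marked section.

    Text before the first marker is ignored in this sketch.
    """
    markers = list(MARKER.finditer(text))
    for i, m in enumerate(markers):
        start = m.end()
        end = markers[i + 1].start() if i + 1 < len(markers) else len(text)
        yield dict(ATTR.findall(m.group(1))), start, text[start:end]


def whitespace_analyzer(chunk):
    """Stand-in analyzer: (term, start, end) relative to the chunk."""
    for m in re.finditer(r"\S+", chunk):
        yield m.group().lower(), m.start(), m.end()


def analyze(text, analyzers, default=whitespace_analyzer):
    """Return (term, position, start, end, lang) for the whole field.

    Offsets are rebased onto the original field text; positions keep
    counting from one section to the next.
    """
    tokens = []
    position = 0
    for attrs, base, chunk in split_sections(text):
        langs = attrs.get("langs") or attrs.get("fallbackLangs", "")
        longest = 0
        for lang in [code for code in langs.split(",") if code]:
            section = list(analyzers.get(lang, default)(chunk))
            for i, (term, start, end) in enumerate(section):
                tokens.append((term, position + i,
                               base + start, base + end, lang))
            longest = max(longest, len(section))
        position += longest
    return tokens


field = '[langs="en"]hello world [langs="en,es"]hables espanol'
for token in analyze(field, analyzers={}):
    print(token)
```

Because the offsets are rebased by the start of each chunk, they point into
the field text as stored (markers included), so highlighting would still
line up with the raw value.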

Frankly, I'm not sure pulling the data from another field (particularly if
you want different sections processed with different languages) is going to
be much simpler than putting it all into the field to be analyzed to begin
with.  Better yet, have an update request processor do it for you inside of
Solr, including the detection of language boundaries, so the customer
doesn't have to worry about it.

-Trey

On Tue, Oct 29, 2013 at 12:18 PM, davetroiano <dtroi...@basistech.com> wrote:

> Hi Trey,
>
> I was reading v9 of the Solr in Action MEAP but browsing your github repo,
> so I think I'm looking at the latest stuff.
>
> Agreed that the thread caching idea is dangerous.  Perhaps it would work
> now, but it could easily break in a later version of Solr.
>
> I didn't mention another reason why I'd like to analyze based on other field
> values, which is that I'd like the ability to run analyzers on sub-sections
> of the MultiTextField.  e.g., given a multilingual document, run my
> text_english analyzer on the first half of a document and my text_french
> analyzer on the second half.  Of course, I could extend the prepend approach
> to take start and end offsets (e.g., <field
> name="myField">[en_0_1000,fr_1001_2500|]blah, blah, ...</field>), but if it
> were possible I'd rather grab that data from another field and simplify the
> tokenizer (in terms of the string manipulation and having to adjust position
> offsets to ignore the prepended data... though you've already done the
> tricky part).
>
> Based on what I'm seeing on the message boards and JIRA (e.g., SOLR-1536 /
> SOLR-1327 not being fixed), it seems like there isn't a clean way to run
> analyzers dynamically based on data in other field(s).  If I end up trying
> the caching idea, I'll report my findings here.
>
> Thanks,
> Dave
