Re: Application of different stemmers / stopword lists within a single field

Alexandre Rafalovitch Sun, 27 Apr 2014 20:01:18 -0700

If you can throw money at the problem:
http://www.basistech.com/text-analytics/rosette/language-identifier/ .
The Language Boundary Locator described at the bottom of that page seems
to cover part or all of what you need.

Otherwise, specifically for English and Arabic, you could play with
Unicode ranges to try detecting text blocks:
1) Create an UpdateRequestProcessor chain that
a) clones the text into field_EN and field_AR.
b) applies regular expression transformations that strip the other
language's Unicode range from each copy, so that field_EN only has
English characters left, etc. Of course, you need to decide what you
want to do with occasional EN or neutral characters happening in the
middle of Arabic text (numbers: Arabic or Indic? brackets, dashes,
etc). But if you just index the text, it might be OK even if it is not
perfect.
c) deletes empty fields, in case not every document has mixed-language text.
2) Use eDismax to search over both fields, each with its own analysis
chain (stemmer, stopword list).
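
To make the steps above concrete, here is a rough sketch in plain Python
of what the clone / strip / drop-empty logic and the eDismax request
might look like. It is not Solr code and only illustrates the idea. The
function names, the choice of Arabic Unicode blocks, and the decision to
leave digits and punctuation in both copies are my assumptions; inside
Solr, the same steps would presumably map onto stock processors such as
CloneFieldUpdateProcessorFactory, RegexReplaceProcessorFactory and
RemoveBlankFieldUpdateProcessorFactory.

```python
import re
from urllib.parse import urlencode

# Assumption: "Arabic" means these Unicode blocks; extend or trim as needed.
ARABIC = (
    "\u0600-\u06FF"   # Arabic
    "\u0750-\u077F"   # Arabic Supplement
    "\u08A0-\u08FF"   # Arabic Extended-A
    "\uFB50-\uFDFF"   # Arabic Presentation Forms-A
    "\uFE70-\uFEFF"   # Arabic Presentation Forms-B
)
ARABIC_RUN = re.compile(f"[{ARABIC}]+")
# Assumption: "English" means unaccented basic Latin letters only.
LATIN_RUN = re.compile(r"[A-Za-z]+")


def split_by_script(text):
    """Clone text into field_EN / field_AR and strip the other script.

    Neutral characters (digits, brackets, dashes) are left in both copies.
    A field is dropped if none of its own script survives (step c).
    """
    # Steps a + b: each copy has the *other* language's letters blanked out.
    en = " ".join(ARABIC_RUN.sub(" ", text).split())
    ar = " ".join(LATIN_RUN.sub(" ", text).split())

    fields = {}
    if LATIN_RUN.search(en):
        fields["field_EN"] = en
    if ARABIC_RUN.search(ar):
        fields["field_AR"] = ar
    return fields


def edismax_params(query):
    """Step 2: query-string parameters searching both fields with eDismax."""
    return urlencode({
        "defType": "edismax",
        "q": query,
        "qf": "field_EN field_AR",
    })


if __name__ == "__main__":
    print(split_by_script("The word كتاب means book (page 12)."))
    print(edismax_params("book"))
```

Each field would then get its own field type in the schema, so that
field_EN and field_AR are stemmed and stopword-filtered independently.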

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Fri, Apr 25, 2014 at 5:34 PM, Timothy Hill <timothy.d.h...@gmail.com> wrote:
> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words or
> paragraphs within a single field through language detection? Is this to all
> intents and purposes impossible within Solr? Or is another approach (using
> language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill
