Solr doesn't have such capabilities built in, as far as I know. There are
various language-recognition tools out there that you could fire the
extracted text blocks at and get a best guess back, but extracting the
text blocks in the first place would be a custom step on your part...
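
Something like this, roughly (untested; I'm using Tika's
LanguageIdentifier purely as an example of such a tool, and it only
recognizes the languages it ships profiles for, so check that whatever
detector you pick actually covers Arabic):

    import org.apache.tika.language.LanguageIdentifier;

    // Guess the language of one extracted text block. Any detection
    // library could be dropped in here; this one happens to be Tika's.
    public class BlockLangGuess {
        public static String guessLanguage(String textBlock) {
            LanguageIdentifier identifier = new LanguageIdentifier(textBlock);
            // getLanguage() returns an ISO 639 code like "en" or "fr";
            // isReasonablyCertain() tells you how far to trust it.
            return identifier.getLanguage();
        }
    }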

Hmmm, if you can solve the above (and you can use Tika in a SolrJ
client to get the text quite easily, see:
http://searchhub.org/2012/02/14/indexing-with-solrj/), it seems pretty
easy to at least use one of those tools to make a "best guess" at the
language, then use custom fields (e.g. text_ar, text_fr, whatever) so
the right language analysis chain is applied at index time.
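
Here's a rough sketch of the whole round trip, no warranties (SolrJ
4.x-era API; the URL, the id, and the file path are made up, and the
text_* field names are whatever you define in schema.xml):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class IndexByLanguage {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

            // Extract the raw text with Tika, as in the blog post above.
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // no size cap
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                parser.parse(in, handler, metadata);
            }
            String text = handler.toString();

            // Best-guess the language (the BlockLangGuess sketch above)
            // and route the text into the matching field, so the right
            // analysis chain runs at index time.
            String lang = BlockLangGuess.guessLanguage(text);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", args[0]);
            doc.addField("text_" + lang, text);
            server.add(doc);
            server.commit();
        }
    }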

Then, fire the incoming query at _all_ your language fields and count
on the scoring to bubble "best" documents to the top.
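
In SolrJ that looks roughly like this (again untested; the qf list is
just whatever text_* fields you actually ended up defining):

    import org.apache.solr.client.solrj.SolrQuery;

    public class CrossLanguageQuery {
        public static SolrQuery build(String userInput) {
            // One edismax query across every language field; scoring
            // decides which analysis chain matched best.
            SolrQuery query = new SolrQuery(userInput);
            query.set("defType", "edismax");
            query.set("qf", "text_en text_ar text_fr");
            return query;
        }
    }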

A lot of hand-waving here...

Best,
Erick

On Fri, Apr 25, 2014 at 3:34 AM, Timothy Hill <timothy.d.h...@gmail.com> wrote:
> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words or
> paragraphs within a single field through language detection? Is this to all
> intents and purposes impossible within Solr? Or is another approach (using
> language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill
