Hi Tim,

Step one is probably to detect language boundaries.  You know your data.
If they happen on paragraph breaks, your job will be easier.  If they
don't, it's a bit harder, but by no means impossible.  I'm sure there is a
ton of research on this topic out there, but the obvious approach would
involve dictionary lookups of individual terms or shingles, keeping track
of "the current language" or "the language of the last N terms" and
watching out for a switch.
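
To make that concrete, here is a rough sketch of the "language of the last
N terms" idea in Java.  Everything in it - the per-language dictionaries,
the tokenization, the exact switch rule - is made up to illustrate the
approach, not a recipe:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Classifies each token against per-language term dictionaries and flags
// a boundary whenever the majority language of a sliding window flips.
public class BoundaryDetector {
    private final Map<String, Set<String>> dictionaries; // language -> known terms
    private final int windowSize;                        // the "N" in "last N terms"

    public BoundaryDetector(Map<String, Set<String>> dictionaries, int windowSize) {
        this.dictionaries = dictionaries;
        this.windowSize = windowSize;
    }

    // Returns approximate token offsets where the dominant language switches.
    public List<Integer> findBoundaries(List<String> tokens) {
        Deque<String> window = new ArrayDeque<String>();
        List<Integer> boundaries = new ArrayList<Integer>();
        String current = null;
        for (int i = 0; i < tokens.size(); i++) {
            String lang = classify(tokens.get(i));
            if (lang == null) continue; // term in no dictionary, skip it
            window.addLast(lang);
            if (window.size() > windowSize) window.removeFirst();
            String majority = majorityLanguage(window);
            if (current != null && !majority.equals(current)) {
                boundaries.add(Math.max(0, i - windowSize + 1)); // rough switch point
            }
            current = majority;
        }
        return boundaries;
    }

    // Language whose dictionary matched the term, or null if none did.
    private String classify(String term) {
        for (Map.Entry<String, Set<String>> e : dictionaries.entrySet()) {
            if (e.getValue().contains(term.toLowerCase())) return e.getKey();
        }
        return null;
    }

    // Most frequent language in the current window.
    private String majorityLanguage(Deque<String> window) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String best = null;
        int bestCount = 0;
        for (String lang : window) {
            Integer c = counts.get(lang);
            int n = (c == null) ? 1 : c + 1;
            counts.put(lang, n);
            if (n > bestCount) {
                best = lang;
                bestCount = n;
            }
        }
        return best;
    }
}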

Once you have that, you'd know the language of each paragraph.  At that
point you'd feed those paragraphs into Solr in separate language-specific
fields, each with its own analysis chain (stemming, stopwords, etc.).
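
With SolrJ the indexing side might look something like this.  The core URL
and the text_en/text_ar field names are assumptions - the point is just
that each detected segment goes into the field whose analysis chain
matches its language:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class MixedLanguageIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/articles");

        // Pretend output of the boundary detection step:
        Map<String, List<String>> segments = new HashMap<String, List<String>>();
        segments.put("en", Arrays.asList("Scholarly discussion in English ..."));
        segments.put("ar", Arrays.asList("...the embedded Arabic passage..."));

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "article-1");
        // text_en and text_ar would be multiValued fields in schema.xml,
        // each with the appropriate language analysis chain.
        for (String s : segments.get("en")) doc.addField("text_en", s);
        for (String s : segments.get("ar")) doc.addField("text_ar", s);

        solr.add(doc);
        solr.commit();
    }
}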

Of course, the other side of this is often the more complicated one -
identifying the language of the query.  The problem is that queries are
short, which makes automatic detection unreliable.  But you can handle it
via the UI, via user preferences, via a combination of these things, etc.
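
If neither the UI nor preferences settle it, a reasonable fallback is to
query all the language-specific fields at once, e.g. with edismax.  Same
made-up URL and field names as in the indexing sketch above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MixedLanguageSearch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/articles");

        SolrQuery q = new SolrQuery("some user input");
        q.set("defType", "edismax");
        // Query both fields when the language is unknown; if the UI or a
        // user preference tells you the language, narrow qf to one field.
        q.set("qf", "text_en text_ar");

        QueryResponse rsp = solr.query(q);
        System.out.println("Hits: " + rsp.getResults().getNumFound());
    }
}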

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Apr 25, 2014 at 6:34 AM, Timothy Hill <timothy.d.h...@gmail.com> wrote:

> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words or
> paragraphs within a single field through language detection? Is this to all
> intents and purposes impossible within Solr? Or is another approach (using
> language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill
>
