If you can throw money at the problem: http://www.basistech.com/text-analytics/rosette/language-identifier/ . The Language Boundary Locator described at the bottom of that page seems to be part or all of your solution.

Otherwise, specifically for English and Arabic, you could play with Unicode ranges to try detecting text blocks:

1) Create an UpdateRequestProcessor chain that
   a) clones the text into field_EN and field_AR.
   b) applies regular expression transformations that strip the Arabic or English Unicode range respectively, so field_EN only has English characters left, and field_AR only Arabic ones. Of course, you need to decide what to do with the occasional English or neutral characters in the middle of Arabic text (numbers: Arabic or Indic? brackets, dashes, etc.). But if you just index the text, it might be OK even if it is not perfect.
   c) deletes empty fields, in case not all documents contain both languages.
2) Use eDismax to search over both fields, each with its own language-specific analysis chain.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Fri, Apr 25, 2014 at 5:34 PM, Timothy Hill <timothy.d.h...@gmail.com> wrote:
> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words or
> paragraphs within a single field through language detection? Is this to all
> intents and purposes impossible within Solr? Or is another approach (using
> language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill
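
To make steps 1a to 1c of the reply more concrete, here is a rough Python sketch of the same logic outside Solr. It is only an illustration under assumptions, not the actual processor chain: the function name `split_by_script` is hypothetical, "English" is taken to mean basic Latin letters only, and the Arabic ranges (the main Arabic block plus the Supplement, Extended-A and Presentation Forms blocks) are one possible choice. Neutral characters such as digits and punctuation are deliberately left in both fields, which is exactly the open decision mentioned in step 1b.

```python
import re

# Assumption: these Unicode blocks are treated as "Arabic":
# Arabic, Arabic Supplement, Arabic Extended-A, Presentation Forms A and B.
ARABIC = "\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF"
# Assumption: "English" means unaccented Latin letters only.
LATIN = "A-Za-z"

ARABIC_RUN = re.compile(f"[{ARABIC}]+")
LATIN_RUN = re.compile(f"[{LATIN}]+")


def split_by_script(text):
    """Clone text into field_EN / field_AR and strip the other script.

    A field is dropped when no character of its own script survives,
    mirroring the "delete empty fields" step.
    """
    candidates = {
        # (text with the other script stripped, pattern for own script)
        "field_EN": (ARABIC_RUN.sub(" ", text), LATIN_RUN),
        "field_AR": (LATIN_RUN.sub(" ", text), ARABIC_RUN),
    }
    fields = {}
    for name, (stripped, own_script) in candidates.items():
        if own_script.search(stripped):
            # Collapse the whitespace left behind by the removed runs.
            fields[name] = " ".join(stripped.split())
    return fields


if __name__ == "__main__":
    doc = "The word كتاب means book (plural: كتب)."
    for field, value in split_by_script(doc).items():
        print(field, "=>", value)
```

Note that brackets and colons from the English sentence still end up in field_AR here; whether that matters depends on the analyzer applied to that field, since most tokenizers discard such punctuation at index time anyway.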