Solr doesn't have such capabilities built in, as far as I know. There are various language-recognition tools out there that you could fire the extracted text blocks at to get a guess back, but extracting the text blocks would be a custom step on your part...
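For the specific English/Arabic case in the original question, though, even a crude Unicode-script check can stand in for the language-recognition step. A minimal, self-contained sketch (untested; the blank-line split is a stand-in for whatever block extraction you implement, and a general-purpose detector such as the one bundled with Tika would replace the regex if you need more languages):

    import java.util.regex.Pattern;

    public class BlockLanguageGuess {

        private static final Pattern ARABIC_CHAR = Pattern.compile("\\p{IsArabic}");

        // Tag a block "ar" if any Arabic-script characters appear, else "en".
        // Crude, but for English-with-embedded-Arabic it is usually enough;
        // swap in a real language detector for anything broader.
        static String guessLanguage(String block) {
            return ARABIC_CHAR.matcher(block).find() ? "ar" : "en";
        }

        public static void main(String[] args) {
            String doc = "An English paragraph about Islam.\n\n"
                       + "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647";  // an Arabic passage
            // Naive block extraction: split on blank lines; real segmentation
            // (sentence- or paragraph-level) is the custom step mentioned above.
            for (String block : doc.split("\\n\\s*\\n")) {
                System.out.println(guessLanguage(block) + ": " + block);
            }
        }
    }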
Hmmm, if you can solve the above (and you can use Tika in a SolrJ client to get the text quite easily; see http://searchhub.org/2012/02/14/indexing-with-solrj/), it seems pretty easy to at least use one of those tools to make a "best guess" at the language, and then use per-language fields (e.g. text_ar, text_fr, whatever) so the right language analysis chain is applied at index time. Then fire the incoming query at _all_ your language fields and count on the scoring to bubble the "best" documents to the top. A lot of hand-waving here... a rough sketch follows below the quoted message.

Best,
Erick

On Fri, Apr 25, 2014 at 3:34 AM, Timothy Hill <timothy.d.h...@gmail.com> wrote:
> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words or
> paragraphs within a single field through language detection? Is this to all
> intents and purposes impossible within Solr? Or is another approach (using
> language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill
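Here is that sketch, with all the hand-waving made concrete under some assumptions: the core URL, the document id, and the field names text_en/text_ar are placeholders (your schema would need to define those fields, multiValued, with the matching analysis chains), the script-check regex stands in for a real language detector, and fullText is whatever Tika handed you per the searchhub post above:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;
    import java.util.regex.Pattern;

    public class MixedLanguageIndexer {

        // Placeholder core URL; point at your own instance/core.
        private static final String SOLR_URL = "http://localhost:8983/solr/collection1";
        private static final Pattern ARABIC = Pattern.compile("\\p{IsArabic}");

        public static void main(String[] args) throws IOException, SolrServerException {
            HttpSolrServer solr = new HttpSolrServer(SOLR_URL);

            String fullText = "...";  // text extracted via Tika in your SolrJ client

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");

            // Route each paragraph to the field whose analysis chain matches
            // its (guessed) language; multiValued fields accumulate the blocks.
            for (String para : fullText.split("\\n\\s*\\n")) {
                String field = ARABIC.matcher(para).find() ? "text_ar" : "text_en";
                doc.addField(field, para);
            }
            solr.add(doc);
            solr.commit();

            // At query time, fire the query at _all_ the language fields via
            // edismax and count on scoring to bubble the best documents up.
            SolrQuery q = new SolrQuery("user query here");
            q.set("defType", "edismax");
            q.set("qf", "text_en text_ar");
            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }

The routing answers Tim's last question directly: the single large field gets split into language-differentiated smaller fields at index time, so stemming and stopword removal each apply only where they belong.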