Hi Tim,

Step one is probably to detect the language boundaries. You know your data: if the switches happen on paragraph breaks, your job will be easier; if they don't, it's a bit harder, but by no means impossible. I'm sure there is a ton of research on this topic out there, but the obvious approach involves dictionary lookups on individual terms or shingles, keeping track of "the current language" (or "the language of the last N terms") and watching for a switch. A rough sketch of that idea follows.
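Something like the following, purely as a sketch. The tiny stopword sets are placeholders (a real implementation would use full dictionaries or a trained language identifier), and detect_spans() is just a name I made up for illustration:

# Hypothetical sketch: detect language boundaries in a token stream by
# looking tokens up in per-language stopword sets and taking a majority
# vote over a sliding window of the last N recognized tokens.
from collections import deque

STOPWORDS = {
    "en": {"the", "and", "of", "in", "is", "to", "a"},
    "ar": {"في", "من", "على", "أن", "إلى", "عن", "هذا"},
}

def guess_language(token):
    """Return the language whose stopword set contains the token, else None."""
    for lang, words in STOPWORDS.items():
        if token in words:
            return lang
    return None

def detect_spans(tokens, window=10):
    """Split a token stream into (language, tokens) spans."""
    spans = []
    recent = deque(maxlen=window)  # languages of the last N recognized tokens
    current_lang, current_tokens = None, []
    for tok in tokens:
        lang = guess_language(tok.lower())
        if lang:
            recent.append(lang)
        # Majority vote over the window decides "the current language".
        majority = max(set(recent), key=recent.count) if recent else None
        if majority and majority != current_lang:
            if current_lang is not None and current_tokens:
                spans.append((current_lang, current_tokens))
                current_tokens = []
            current_lang = majority
        current_tokens.append(tok)
    if current_tokens:
        spans.append((current_lang, current_tokens))
    return spans

# e.g. detect_spans("the quick fox في البيت من جديد".split())
# yields per-language runs of the token stream.

The window size trades off responsiveness against stability: a small window switches quickly but flickers on isolated loanwords, a large one lags behind real boundaries.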
Once you have that, you'd know the language of each paragraph, and you could feed those paragraphs into Solr as separate language-specific fields (see the sketch after the quoted message below). Of course, the other side of this is often the trickier one: identifying the language of the query. The problem is that queries are short. But you can handle that via the UI, via user preferences, via a combination of these, and so on.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Apr 25, 2014 at 6:34 AM, Timothy Hill <timothy.d.h...@gmail.com> wrote:

> This may not be a practically solvable problem, but the company I work for
> has a large number of lengthy mixed-language documents - for example,
> scholarly articles about Islam written in English but containing lengthy
> passages of Arabic. Ideally, we would like users to be able to search both
> the English and Arabic portions of the text, using the full complement of
> language-processing tools such as stemming and stopword removal.
>
> The problem, of course, is that these two languages co-occur in the same
> field. Is there any way to apply different processing to different words
> or paragraphs within a single field through language detection? Is this to
> all intents and purposes impossible within Solr? Or is another approach
> (using language detection to split the single large field into
> language-differentiated smaller fields, for example) possible/recommended?
>
> Thanks,
>
> Tim Hill
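P.S. Here is the indexing-side sketch I mentioned, building on the detect_spans() helper above. The field names text_en/text_ar, the core name, and the use of pysolr are my assumptions for illustration; any Solr client works. The point is simply that each language's text lands in a field whose analyzer chain (English stemming/stopwords vs. Arabic) matches it:

# Minimal indexing sketch. Assumes a schema with fields named text_en and
# text_ar (hypothetical names), each backed by the corresponding
# language-specific field type, and the detect_spans() helper above.
import pysolr

LANG_FIELD = {"en": "text_en", "ar": "text_ar"}

def index_document(solr, doc_id, tokens):
    doc = {"id": doc_id}
    for lang, span_tokens in detect_spans(tokens):
        field = LANG_FIELD.get(lang)
        if field is None:
            continue  # unrecognized span; could fall back to a generic field
        # Concatenate all spans of the same language into one field value.
        doc[field] = (doc.get(field, "") + " " + " ".join(span_tokens)).strip()
    solr.add([doc])

solr = pysolr.Solr("http://localhost:8983/solr/mycore")  # assumed core name
index_document(solr, "article-1", open("article1.txt").read().split())

On the query side, when the query language is unknown, one simple option is to search both fields at once, e.g. eDisMax with qf=text_en text_ar, and let the per-field analysis do the right thing.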