On Wed, Jan 23, 2013 at 3:38 PM, Eduard Moraru <enygma2...@gmail.com> wrote:
> Hello,
>
> Here is my problem:
>
> I am trying to do multilingual indexing in Solr, and each document
> translation is indexed as an independent Solr/Lucene document, with some
> fields suffixed with the language code. Here is an example:
>
> English:
> id:SomeDocument_en
> lang:en
> title:"English version of the title" <-- text_general field
> title_en:"English version of the title" <-- text_en field, optimized for English
> content:"English version of the content"
> content_en:"English version of the content"
> author:SomeGuy
> ...
>
> French:
> id:SomeDocument_fr
> lang:fr
> title:"French version of the title" <-- text_general field
> title_fr:"French version of the title" <-- text_fr field, optimized for French
> content:"French version of the content"
> content_fr:"French version of the content"
> author:SomeGuy
> ...
>
> I am writing a search back-end that should allow my users to easily query
> this layout without having to care about the fact that *some* fields have
> funky names. To achieve this, I would like to write some sort of query
> expander that lets users write simple queries like:
>
> "title:version author:SomeGuy content:content"
>
> which would get automagically expanded to:
>
> "(title_en:version OR title_fr:version) author:SomeGuy
> (content_en:content OR content_fr:content)"
>
> The list of languages for which to do the expansion is configured
> separately and has to be kept manually in sync with the fields configured
> in schema.xml.
>
> I want users to go through my back-end/library when searching the Solr
> index, and I want this library to work with either a remote or an
> embedded Solr server, using solrj.
>
> My current/best approach to date is to use a custom Lucene QueryParser
> set up with the KeywordAnalyzer and "" as the default field. The result
> is a Query object on which I can call extractTerms(), so that I can
> inspect the terms and see which of them use fields from my list of fields
> to expand.
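[For context, a field layout like the one described above is typically wired up with per-language dynamic fields in schema.xml. A minimal, hypothetical sketch; the analyzed field types (text_general, text_en, text_fr) are assumed to be defined elsewhere in the schema:]

```xml
<!-- Hypothetical schema.xml fragment matching the layout above. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="lang" type="string" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="true" stored="true"/>
<!-- Language-suffixed copies, e.g. title_en, content_fr, picked up by suffix. -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
```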
> The actual expansion consists of doing a
> replaceAll("field:value", "(field:value OR field_en:value OR
> field_fr:value OR ...)") on the toString() of the Query instance. Since
> some of the parsed query types don't support extractTerms(), I have
> overridden getPrefixQuery, getWildcardQuery and getRangeQuery in my
> custom query parser to return a simple TermQuery instead. I've also
> overridden getFieldQuery to manually re-add quotes inside a Term's value,
> since parsing strips them. Basically, I've tried as much as possible to
> make the Query.toString() method output a query that resembles the input
> query as closely as possible and that does not alter the field value
> inside the terms (preserves quotes, preserves escaping, etc.).
>
> Reminder: this custom query parser is not run inside Solr (as a plugin or
> anything). It runs inside my search back-end module, and its purpose is
> to work no matter where the Solr instance is located (local or remote).
>
> This approach was looking pretty good in my tests, until I noticed some
> shortcomings:
> - Solr-specific queries containing local parameters are not parsed (I
> get an invalid query exception). This is most likely because QueryParser
> is Lucene-specific and does not understand the notion of local
> parameters.
> - Queries such as a single quote (") or a single escape character (\)
> also throw exceptions, even though such queries work (well, get cleaned
> up properly without throwing exceptions) as pure Solr queries.
> - Other differences between Solr and Lucene query syntax that this
> approach might miss.
>
> I tried as much as possible to *not* have to write some sort of Solr
> query parser plugin that needs to be installed inside the running Solr
> instance, and tried to do everything on the "client" side of the request.
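[To make the text-level rewrite concrete, here is a minimal, stdlib-only sketch of just the field-name expansion step. The class and method names are made up for this example, and the naive regex deliberately ignores quoted phrases, escaping and local params — exactly the cases the poster reports as breaking the toString()-based approach:]

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: expands each "field:value" clause into an OR over
// per-language variants, e.g. "title:x" -> "(title_en:x OR title_fr:x)".
// Naive on purpose: no handling of quoted phrases, escapes or local params.
public class FieldExpander {
    private final List<String> expandableFields; // e.g. ["title", "content"]
    private final List<String> languages;        // e.g. ["en", "fr"]

    public FieldExpander(List<String> expandableFields, List<String> languages) {
        this.expandableFields = expandableFields;
        this.languages = languages;
    }

    public String expand(String query) {
        // Match bare "field:value" tokens (no whitespace inside the value).
        Pattern p = Pattern.compile("(\\w+):(\\S+)");
        Matcher m = p.matcher(query);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String field = m.group(1);
            String value = m.group(2);
            if (expandableFields.contains(field)) {
                String ors = languages.stream()
                        .map(lang -> field + "_" + lang + ":" + value)
                        .collect(Collectors.joining(" OR "));
                m.appendReplacement(out, Matcher.quoteReplacement("(" + ors + ")"));
            } else {
                // Leave non-expandable fields (e.g. author) untouched.
                m.appendReplacement(out, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

[With fields ["title", "content"] and languages ["en", "fr"], expand("title:version author:SomeGuy") yields "(title_en:version OR title_fr:version) author:SomeGuy", matching the example above.]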
>
> I have noticed that all the Solr-specific query parsers
> (ExtendedSolrQueryParser and such) cannot be instantiated without
> supplying an IndexSchema and a SolrCore. If I do this on the client side,
> I end up with an embedded Solr server just so that I can expand a query,
> which is definitely not the overhead I want.
>
> Can somebody please suggest the best approach to handling this problem?
> Am I approaching the issue from a bad angle? Is there a different best
> practice for dealing with multilingual documents that makes it easier to
> query fields across all the languages?
>
> Thank you for your patience in reading this message, and I hope you'll
> help me find a good solution to this problem, which has been eating a lot
> of my time recently.
>
> -Eduard
>