On Wed, Jan 23, 2013 at 3:38 PM, Eduard Moraru <enygma2...@gmail.com> wrote:
> Hello,
>
> Here is my problem:
>
> I am trying to do multilingual indexing in Solr, and each document
> translation is indexed as an independent Solr/Lucene document, with some
> fields suffixed with the language code. Here is an example:
>
> English:
> id:SomeDocument_en
> lang:en
> title:"English version of the title" <-- text_general field
> title_en:"English version of the title" <-- text_en field, optimized for English
> content:"English version of the content"
> content_en:"English version of the content"
> author:SomeGuy
> ...
>
> French:
> id:SomeDocument_fr
> lang:fr
> title:"French version of the title" <-- text_general field
> title_fr:"French version of the title" <-- text_fr field, optimized for French
> content:"French version of the content"
> content_fr:"French version of the content"
> author:SomeGuy
> ...
>
> I am writing a search back-end that should allow my users to easily query
> this layout without having to care about the fact that *some* fields have
> funky names. To achieve this, I would like to write some sort of query
> expander that lets users write simple queries like:
>
> "title:version author:SomeGuy content:content"
>
> which would get automagically expanded to:
>
> "(title_en:version OR title_fr:version) author:SomeGuy
> (content_en:content OR content_fr:content)"
>
> The list of languages for which to do the expansion is configured
> separately and has to be kept manually in sync with the fields configured
> in schema.xml.
>
> I want users to go through my back-end/library when searching the Solr
> index, and I want this library to work with either a remote or an
> embedded Solr server, using solrj.
>
> My current/best approach to date is to use a custom Lucene QueryParser
> set up with the KeywordAnalyzer and "" as the default field. The result
> is a Query object on which I can call extractTerms(), so that I can
> inspect the terms and see which of them use fields from my list of fields
> to expand.
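[For context, a field layout like the one described above is typically wired up with per-language dynamic fields in schema.xml. A minimal, hypothetical sketch; the analyzed field types (text_general, text_en, text_fr) are assumed to be defined elsewhere in the schema:]

```xml
<!-- Hypothetical schema.xml fragment matching the layout above. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="lang" type="string" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="true" stored="true"/>
<!-- Language-suffixed copies, e.g. title_en, content_fr, picked up by suffix. -->
<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
```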
> The actual expansion consists of doing a
> replaceAll("field:value", "(field:value OR field_en:value OR
> field_fr:value OR ...)") on the toString() of the Query instance. Since
> some of the parsed query types don't support extractTerms(), I have
> overridden getPrefixQuery, getWildcardQuery and getRangeQuery in my
> custom query parser to return a simple TermQuery instead. I've also
> overridden getFieldQuery to manually re-add quotes inside a Term's value,
> since parsing strips them. Basically, I've tried as much as possible to
> make the Query.toString() method output a query that resembles the input
> query as closely as possible and that does not alter the field value
> inside the terms (preserves quotes, preserves escaping, etc.).
>
> Reminder: this custom query parser is not run inside Solr (as a plugin or
> anything). It runs inside my search back-end module, and its purpose is
> to work no matter where the Solr instance is located (local or remote).
>
> This approach was looking pretty good in my tests, until I noticed some
> shortcomings:
> - Solr-specific queries containing local parameters are not parsed (I
> get an invalid query exception). This is most likely because QueryParser
> is Lucene-specific and does not understand the notion of local
> parameters.
> - Queries such as a single quote (") or a single escape character (\)
> also throw exceptions, even though such queries work (well, get cleaned
> up properly without throwing exceptions) as pure Solr queries.
> - Other differences between Solr and Lucene query syntax that this
> approach might miss.
>
> I tried as much as possible to *not* have to write some sort of Solr
> query parser plugin that needs to be installed inside the running Solr
> instance, and tried to do everything on the "client" side of the request.
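[To make the text-level rewrite concrete, here is a minimal, stdlib-only sketch of just the field-name expansion step. The class and method names are made up for this example, and the naive regex deliberately ignores quoted phrases, escaping and local params — exactly the cases the poster reports as breaking the toString()-based approach:]

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: expands each "field:value" clause into an OR over
// per-language variants, e.g. "title:x" -> "(title_en:x OR title_fr:x)".
// Naive on purpose: no handling of quoted phrases, escapes or local params.
public class FieldExpander {
    private final List<String> expandableFields; // e.g. ["title", "content"]
    private final List<String> languages;        // e.g. ["en", "fr"]

    public FieldExpander(List<String> expandableFields, List<String> languages) {
        this.expandableFields = expandableFields;
        this.languages = languages;
    }

    public String expand(String query) {
        // Match bare "field:value" tokens (no whitespace inside the value).
        Pattern p = Pattern.compile("(\\w+):(\\S+)");
        Matcher m = p.matcher(query);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String field = m.group(1);
            String value = m.group(2);
            if (expandableFields.contains(field)) {
                String ors = languages.stream()
                        .map(lang -> field + "_" + lang + ":" + value)
                        .collect(Collectors.joining(" OR "));
                m.appendReplacement(out, Matcher.quoteReplacement("(" + ors + ")"));
            } else {
                // Leave non-expandable fields (e.g. author) untouched.
                m.appendReplacement(out, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

[With fields ["title", "content"] and languages ["en", "fr"], expand("title:version author:SomeGuy") yields "(title_en:version OR title_fr:version) author:SomeGuy", matching the example above.]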
>
> I have noticed that all the Solr-specific query parsers
> (ExtendedSolrQueryParser and such) cannot be instantiated without
> supplying an IndexSchema and a SolrCore. If I do this on the client side,
> I end up with an embedded Solr server just so that I can expand a query,
> which is definitely not the overhead I want.
>
> Can somebody please suggest the best approach to handling this problem?
> Am I approaching the issue from a bad angle? Is there a different best
> practice for dealing with multilingual documents that makes it easier to
> query fields across all the languages?
>
> Thank you for your patience in reading this message, and I hope you'll
> help me find a good solution to this problem, which has been eating a lot
> of my time recently.
>
> -Eduard
>