Re: Step by step tutorial for multi-language indexing and search

Jakub Godawa Sun, 24 Oct 2010 03:49:23 -0700

Hi Erick, thanks for your help!

I need some technical help though... let me put it that way:


1. I deleted everything in the index with:
curl http://localhost:8983/solr/update -F stream.body=' <delete><query>*:*</query></delete>'
curl http://localhost:8983/solr/update -F stream.body=' <commit />'

2. I created 2 documents with fields: name_en, answer_en, name_es, answer_es
3. I made a query through the admin page, with this response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">9</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Jakub</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <arr name="answer_en_t">
        <str>My name is Jakub</str>
      </arr>
      <arr name="answer_es_t">
        <str>Me llamo Jakub.</str>
      </arr>
      <arr name="id">
        <str>Question:1</str>
      </arr>
      <arr name="name_en_t">
        <str>What is your name?</str>
      </arr>
      <arr name="name_es_t">
        <str>Como te llamas?</str>
      </arr>
      <arr name="pk_s">
        <str>1</str>
      </arr>
      <arr name="spell">
        <str>What is your name?</str>
        <str>My name is Jakub</str>
        <str>Como te llamas?</str>
        <str>Me llamo Jakub.</str>
      </arr>
    </doc>
    <doc>
      <arr name="answer_en_t">
        <str>I am in the kitchen Jakub!</str>
      </arr>
      <arr name="answer_es_t">
        <str>Estoy en la cocina.</str>
      </arr>
      <arr name="id">
        <str>Question:2</str>
      </arr>
      <arr name="name_en_t">
        <str>Where are you?</str>
      </arr>
      <arr name="name_es_t">
        <str>Donde estas?</str>
      </arr>
      <arr name="pk_s">
        <str>2</str>
      </arr>
      <arr name="spell">
        <str>Where are you?</str>
        <str>I am in the kitchen Jakub!</str>
        <str>Donde estas?</str>
        <str>Estoy en la cocina.</str>
      </arr>
    </doc>
  </result>
</response>

4. Now I needed two dismax handlers to make it work in two separate
languages. Let's say I just want to search the *_en fields, so I created a
dismax handler:

<requestHandler name="/English" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      name_en_t^0.5 answer_en_t^1.0
    </str>
  </lst>
</requestHandler>


5. Hitting the URL http://localhost:8982/solr/English/?q=Jakub gave me an
error:

there are more terms than documents in field "name_en_t", but it's
impossible to sort on tokenized fields

6. I know that I should create a separate dismax for Spanish.
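
For step 6, I assume the Spanish handler would simply mirror the English one
but point at the *_es_t fields. This is only a sketch: the handler name
"/Spanish" and the boosts are my own assumption, copied from the English
handler, and I have not tested it.

```xml
<!-- Sketch only: "/Spanish" and the boosts are assumptions mirrored from /English -->
<requestHandler name="/Spanish" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      name_es_t^0.5 answer_es_t^1.0
    </str>
  </lst>
</requestHandler>
```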

My questions:
1. Why are those fields named with *_t? I saw in schema.xml that they are
created dynamically. Can/should I create my own predefined fields in
schema.xml? Is this the place where you define "HOW" the field should be
interpreted by the indexer?
2. Why is the error in step 5 thrown? I know that you cannot sort on
tokenized fields, but I don't see myself trying to index or tokenize
anything.
3. How should it be changed to work properly?
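
To make question 1 concrete, this is roughly what I imagine explicit
declarations in schema.xml could look like. It is a sketch, not something I
have running: the names text_es, name_es and answer_es are hypothetical, and
the choice of tokenizer and filters is my guess at a reasonable Spanish
analyzer chain.

```xml
<!-- Sketch for schema.xml; text_es, name_es and answer_es are hypothetical names -->

<!-- Inside <types>: a Spanish text type with lowercasing and a Snowball stemmer -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldType>

<!-- Inside <fields>: explicit fields instead of the dynamic *_t ones -->
<field name="name_es" type="text_es" indexed="true" stored="true"/>
<field name="answer_es" type="text_es" indexed="true" stored="true"/>
```

If that is the right idea, I suppose each language would get its own
fieldType and its own pair of fields in the same way.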

Thank you, and I ask for patience, as this can help many rookies like me to
get started.
Jakub.

2010/10/21 Erick Erickson <erickerick...@gmail.com>

> See below:
>
> But also search the archives for multilanguage; this topic has been
> discussed many times before. Lucid Imagination maintains a Solr-powered
> (of course) searchable list at: http://www.lucidimagination.com/search/
>
> On Wed, Oct 20, 2010 at 9:03 AM, Jakub Godawa <jakub.god...@gmail.com> wrote:
>
> > Hi everyone! (my first post)
> >
> > I am new, but really curious about the usefulness of lucene/solr for
> > document search from web applications. I use Ruby on Rails to create one,
> > with the plugin "acts_as_solr_reloaded" that makes the connection between
> > the web app and solr easy.
> >
> > So I am at a point where I know that a good solution is to prepare
> > multi-language documents with fields like:
> > question_en, answer_en,
> > question_fr, answer_fr,
> > question_pl,  answer_pl... etc.
> >
> > I need to create an index that would work with 6 languages: English,
> > French, German, Russian, Ukrainian and Polish.
> >
> > My questions are:
> > 1. Is it doable to have just one search field that behaves like Google's
> > for all those documents? It can be an option to indicate a language to
> > search.
> >
>
> This depends on what you mean by do-able. Are you going to allow a French
> user to search an English document (& etc)? But the real answer is "yes,
> you can if you .....". There'll be tradeoffs.
>
> Take a look at the dismax handler. It's kind of hard to grok all at once,
> but you can cause it to search across multiple fields. That is, the user
> types "language", and you can turn it into a complex query under the covers
> like lang_en:language lang_fr:language lang_ru:language, etc. You can also
> apply boosts. Note that this has obvious problems with, say, Russian. Half
> your job will be figuring out what will satisfy the user.....
>
> You could also have a #different# dismax handler defined for various
> languages. Say the user was coming from Spanish. Consider a browseES
> handler. See solrconfig.xml for the default dismax handler. The Solr book
> mentioned above describes this.
>
>
> > 2. How should I begin changing the solr/conf/schema.xml (or other) file
> > to tailor it to my needs? As I am a real rookie here, I am still a bit
> > confused about "fields", "fieldTypes" and their connection with a
> > particular field (e.g. answer_fr) and the "tokenizers" and "analyzers".
> > If someone can provide a basic step by step tutorial on how to make it
> > work in two languages I would be more than happy.
> >
>
> You have several choices here:
> - The books "Lucene in Action" and "Solr 1.4 Enterprise Search Server"
>   both have discussions here.
> - Spend some time on the solr/admin/analysis page. That page allows you to
>   see pretty much exactly what each of the steps in an analyzer chain
>   accomplishes.
>
>
> > 3. Are all those languages supported (officially/unofficially) by
> > lucene/solr?
> >
>
> See:
>
> http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/Analyzer.html
> Remember that Solr is built on Lucene, so these analyzers are available.
>
>
> >
> > Thank you for help,
> > Jakub Godawa.
> >
>
> Best
> Erick
>
