Re: Step by step tutorial for multi-language indexing and search

Lance Norskog Wed, 27 Oct 2010 00:53:41 -0700

Yes, you can declare each field with the Spanish, French, etc. types. The _t and other types are "dynamic" and don't have to be declared. This feature is generally used when you have hundreds or thousands of fields. It is clearer to declare your fields.
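
A minimal sketch of what such declarations might look like in schema.xml, assuming a Snowball stemmer is acceptable for Spanish. The names text_es, name_es and answer_es are illustrative, not taken from any shipped schema, and the analyzer chain is only one reasonable choice:

```xml
<!-- Goes inside the <types> section of schema.xml.
     "text_es" is a hypothetical type name. -->
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split text into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Make matching case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Reduce Spanish words to their stems -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldType>

<!-- Goes inside the <fields> section of schema.xml.
     Explicitly declared fields replace the dynamic *_t ones. -->
<field name="name_es" type="text_es" indexed="true" stored="true"/>
<field name="answer_es" type="text_es" indexed="true" stored="true"/>
```

The same pattern would be repeated per language, with a separate field type for each one so that every language gets its own stemmer.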

You're right- that error should not be thrown. You are not asking for a sort. I don't know that one. You could try starting over with the Solr 1.4.1 release binaries.


Jakub Godawa wrote:

Hi Erick, thanks for your help!

I need some technical help though... let me put it that way:

1. I deleted everything in the index with:
curl http://localhost:8983/solr/update -F stream.body='<delete><query>*:*</query></delete>'
curl http://localhost:8983/solr/update -F stream.body='<commit />'

2. I created 2 documents with fields: name_en, answer_en, name_es, answer_es
3. I made a query through admin page, with response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">9</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Jakub</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <arr name="answer_en_t">
        <str>My name is Jakub</str>
      </arr>
      <arr name="answer_es_t">
        <str>Me llamo Jakub.</str>
      </arr>
      <arr name="id">
        <str>Question:1</str>
      </arr>
      <arr name="name_en_t">
        <str>What is your name?</str>
      </arr>
      <arr name="name_es_t">
        <str>Como te llamas?</str>
      </arr>
      <arr name="pk_s">
        <str>1</str>
      </arr>
      <arr name="spell">
        <str>What is your name?</str>
        <str>My name is Jakub</str>
        <str>Como te llamas?</str>
        <str>Me llamo Jakub.</str>
      </arr>
    </doc>
    <doc>
      <arr name="answer_en_t">
        <str>I am in the kitchen Jakub!</str>
      </arr>
      <arr name="answer_es_t">
        <str>Estoy en la cocina.</str>
      </arr>
      <arr name="id">
        <str>Question:2</str>
      </arr>
      <arr name="name_en_t">
        <str>Where are you?</str>
      </arr>
      <arr name="name_es_t">
        <str>Donde estas?</str>
      </arr>
      <arr name="pk_s">
        <str>2</str>
      </arr>
      <arr name="spell">
        <str>Where are you?</str>
        <str>I am in the kitchen Jakub!</str>
        <str>Donde estas?</str>
        <str>Estoy en la cocina.</str>
      </arr>
    </doc>
  </result>
</response>

4. Now I needed two dismax handlers to make it work in two separate languages.
Let's say I just want to look up in *_en fields, so I created a dismax handler:

<requestHandler name="/English" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      name_en_t^0.5 answer_en_t^1.0
    </str>
  </lst>
</requestHandler>


5. Hitting the URL http://localhost:8982/solr/English/?q=Jakub gave me an
error:

there are more terms than documents in field "name_en_t", but it's
impossible to sort on tokenized fields

6. I know that I should create a separate dismax for Spanish.
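
For reference, the Spanish counterpart from step 6 would presumably mirror the English handler in step 4. This is only a sketch: the handler name "/Spanish" is an assumption, the boosts are copied from the English handler, and the field names are the *_es_t ones returned in the response in step 3. It does not by itself address the error in step 5.

```xml
<!-- Hypothetical Spanish handler for solrconfig.xml,
     modelled on the /English handler above. -->
<requestHandler name="/Spanish" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <!-- Search only the Spanish fields; boosts copied from /English -->
    <str name="qf">
      name_es_t^0.5 answer_es_t^1.0
    </str>
  </lst>
</requestHandler>
```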

My questions:
1. Why are those fields named with *_t? I saw in schema.xml that they are
created dynamically. Can/should I create my own predefined fields in
schema.xml? Is this the place where you specify how the field should be
interpreted by the indexer?
2. Why is the error in no. 5 being thrown? I know that you cannot sort on
tokenized fields, but I don't see myself trying to index or tokenize
anything.
3. How should it be changed to work properly?

Thank you, and I ask for patience, as this can help many rookies like me
get started.
Jakub.

2010/10/21 Erick Erickson<erickerick...@gmail.com>

See below:

But also search the archives for multilanguage, this topic has been
discussed many times before. Lucid Imagination maintains a Solr-powered
(of course) searchable list at: http://www.lucidimagination.com/search/

On Wed, Oct 20, 2010 at 9:03 AM, Jakub Godawa<jakub.god...@gmail.com> wrote:

Hi everyone! (my first post)

I am new, but really curious about the usefulness of lucene/solr for
document search from web applications. I use Ruby on Rails to create one,
with the plugin "acts_as_solr_reloaded", which makes the connection between
the web app and solr easy.

So I am at a point where I know that a good solution is to prepare
multi-language documents with fields like:
question_en, answer_en,
question_fr, answer_fr,
question_pl, answer_pl... etc.

I need to create an index that would work with 6 languages: English,
French, German, Russian, Ukrainian and Polish.

My questions are:
1. Is it doable to have just one search field that behaves like Google's
for all those documents? It can be an option to indicate a language to
search.

This depends on what you mean by do-able. Are you going to allow a French
user to search an English document (& etc)? But the real answer is "yes,
you can if you .....". There'll be tradeoffs.

Take a look at the dismax handler. It's kind of hard to grok all at once,
but you can cause it to search across multiple fields. That is, the user
types "language", and you can turn it into a complex query under the covers
like lang_en:language lang_fr:language lang_ru:language, etc. You can also
apply boosts. Note that this has obvious problems with, say, Russian. Half
your job will be figuring out what will satisfy the user.....
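
The cross-field expansion described above could be sketched as a single dismax handler in solrconfig.xml. The handler name "/AllLanguages" and the boost values are assumptions for illustration only; the field names are the lang_en, lang_fr and lang_ru examples from the paragraph above, and they would have to exist in schema.xml.

```xml
<!-- Hypothetical handler that searches every language field at once. -->
<requestHandler name="/AllLanguages" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <!-- qf lists the fields the user's terms are matched against;
         the ^ values are illustrative boosts, not recommendations -->
    <str name="qf">
      lang_en^1.0 lang_fr^1.0 lang_ru^1.0
    </str>
  </lst>
</requestHandler>
```

With a handler like this, a query for "language" is matched against each listed field, and the best-scoring field dominates the document's score.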

You could also have a #different# dismax handler defined for various
languages. Say the user was coming from Spanish. Consider a browseES
handler. See solrconfig.xml for the default dismax handler. The Solr book
mentioned above describes this.

2. How should I begin changing the solr/conf/schema.xml (or other) file to
tailor it to my needs? As I am a real rookie here, I am still a bit
confused about "fields", "fieldTypes" and their connection with a
particular field (ex. answer_fr) and the "tokenizers" and "analyzers". If
someone can provide a basic step by step tutorial on how to make it work in
two languages I would be more than happy.

You have several choices here:

The books "Lucene in Action" and "Solr 1.4 Enterprise Search Server" both
have discussions of this.

Spend some time on the solr/admin/analysis page. That page allows you to
see pretty much exactly what each of the steps in an analyzer chain
accomplishes.

3. Are all those languages supported (officially/unofficially) by
lucene/solr?

See:

http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/analysis/Analyzer.html
Remember that Solr is built on Lucene, so these analyzers are available.

Thank you for help,
Jakub Godawa.

Best
Erick
