I can think of two ways to tackle your problem:
Option 1: Keep everything in one index and give each document a field
indicating its language. When searching, you simply filter the query on
the language you're searching in. Advantages: everything is in one
index, so if you ever need a cross-language search you'll be able to do
it without changing anything. Disadvantages: depending on how your data
is structured, the index can grow big. If you always search in only one
language, each query uses only part of the index, which is to some
extent a performance penalty (how much depends on the size of the
index). Another disadvantage is that the schema configuration can get a
bit messy: since everything is in one index, for each field and field
type you'll probably need to define different versions for different
languages (each with its own language-specific analyzer). For example,
if you have a "title" field, you'll probably need to define "title_en"
(for English content) and "title_zh" (for Chinese content). You will
also need to make sure that when you index the content, you send the
right fields to Solr (although you could perhaps create a clever update
processor that updates the field names based on the language field).
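To make Option 1 a bit more concrete, here is a rough client-side sketch
in Python. It is only an illustration: the field names "language",
"title" and "body" are assumptions, not something taken from your
schema, and the query text is not escaped for Solr's query syntax. It
renames the language-specific fields before indexing and builds the
select parameters with a filter query (fq) on the language field.

```python
from urllib.parse import urlencode

# Assumed names for illustration only; adjust to your own schema.
LANGUAGE_FIELD = "language"
LOCALIZED_FIELDS = {"title", "body"}


def localize_doc(doc):
    """Return a copy of doc with localized fields renamed, e.g. title -> title_en."""
    lang = doc[LANGUAGE_FIELD]
    out = {}
    for name, value in doc.items():
        if name in LOCALIZED_FIELDS:
            # Route the value to the field analyzed for this language.
            out[f"{name}_{lang}"] = value
        else:
            out[name] = value
    return out


def build_query_params(text, lang):
    """Build URL-encoded select parameters restricted to one language."""
    return urlencode({
        # Search the title field that matches the requested language.
        "q": f"title_{lang}:({text})",
        # The filter query keeps only documents tagged with that language.
        "fq": f"{LANGUAGE_FIELD}:{lang}",
    })
```

The same renaming could live in an update processor on the Solr side
instead; doing it in the client is just the simplest place to show it.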
Option 2: Have a separate Solr core for each language. Advantages: as
opposed to Option 1, here you have smaller indexes, each dedicated to
one language. If the corpus is very big you may see performance gains.
Since these are separate indexes, each core has its own simple and clean
schema (no need for multiple per-language fields and field types).
Disadvantages: the main one is that you cannot perform a cross-language
search. You also need to remember to use the right Solr core when
indexing and querying.
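With Option 2, picking the right core can be centralized in one small
helper so it is harder to get wrong. The sketch below is in Python and
makes some assumptions: the base URL and the core names (docs_en,
docs_zh, docs_ar) are made up for illustration.

```python
# Assumed base URL and core names; replace with your own setup.
SOLR_BASE_URL = "http://localhost:8983/solr"
CORE_BY_LANGUAGE = {"en": "docs_en", "zh": "docs_zh", "ar": "docs_ar"}


def core_url(lang, handler):
    """Return the URL of the given handler on the core for this language."""
    try:
        core = CORE_BY_LANGUAGE[lang]
    except KeyError:
        # Fail loudly rather than silently indexing into the wrong core.
        raise ValueError(f"no Solr core configured for language {lang!r}") from None
    return f"{SOLR_BASE_URL}/{core}/{handler}"
```

For example, core_url("zh", "select") would be used for querying Chinese
documents and core_url("zh", "update") for indexing them.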
2) I posted some Chinese docs to the server. The query of my Chinese
word does not return any result. This happens to my Arabic docs too.
What filter should I look at for this type of problem? Thanks a lot!
Sorry, I don't have experience with Arabic or Chinese languages so I
don't know of any good analyzers for them.
Cheers,
Uri
Hi,
I have two questions.
1) Can Solr be configured so all my English docs will be saved in a
group, say group-en? My Chinese docs will be saved in group-cn. So my
search will only be conducted in the intended group, instead of
everywhere.
2) I posted some Chinese docs to the server. The query of my Chinese
word does not return any result. This happens to my Arabic docs too.
What filter should I look at for this type of problem? Thanks a lot!
Elaine