Solr 1.2 and MoreLikeThis

2007-10-29 Thread Gereon Steffens

Hi,

I've been trying to get MoreLikeThis running in Solr 1.2, so far without 
success. Since there is no mention of any special installation steps in 
the Wiki, I had assumed that it was built into 1.2, but that does not 
seem to be the case. So now I've downloaded the patches from SOLR-69, 
and am unable to apply them against the 1.2 source tree. (patch 
complains about missing and non-matching files)


What am I missing, or what am I doing wrong? Any help would be greatly 
appreciated.


Gereon


Re: Solr 1.2 and MoreLikeThis

2007-10-29 Thread Gereon Steffens

OK, I've mostly figured it out. Patching leaves me with one rejection:

apache-solr-1.2.0$ patch -p0

The rejected code appears to be non-vital, so I've just left it out. 
Since Solr 1.2 is based on Lucene 2.1, I've used the 
lucene-query.2.1.1-dev.jar to compile (after fixing the DEFALT/DEFAULT 
typo), and MLT seems to work. Is that the correct procedure? If so, I'll 
update the wiki accordingly.


Gereon

Gereon Steffens wrote:

Hi,

I've been trying to get MoreLikeThis running in Solr 1.2, so far without 
success. Since there is no mention of any special installation steps in 
the Wiki, I had assumed that it was built into 1.2, but that does not 
seem to be the case. So now I've downloaded the patches from SOLR-69, 
and am unable to apply them against the 1.2 source tree. (patch 
complains about missing and non-matching files)


What am I missing, or what am I doing wrong? Any help would be greatly 
appreciated.


Gereon


Re: multi-language searching with Solr

2008-05-07 Thread Gereon Steffens
I have the same requirement, and from what I understand the distributed 
search feature will help implementing this, by having one shard per 
language. Am I right?


Gereon


Mike Klaas wrote:

On 5-May-08, at 1:28 PM, Eli K wrote:


Wouldn't this impact both indexing and search performance and the size
of the index?
It is also probable that I will have more than one free-text field
later on, and with at least 20 languages this approach does not seem
very manageable.  Are there other options for making this work with
stemming?


If you want stemming, then you have to execute one query per language 
anyway, since the stemming will be different in every language.


This is a fundamental requirement: you somehow need to track the 
language of every token if you want correct multi-language stemming.  
The easiest way to do this would be to split each language into its own 
field.  But there are other options: you could prefix every indexed 
token with the language:


en:The en:quick en:brown en:fox en:jumped ...
fr:Le fr:brun fr:renard fr:vite fr:a fr:sauté ...

Separate fields seems easier to me, though.

-Mike
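
Mike's token-prefixing idea can be sketched in a few lines. This is a minimal illustration, not Solr's API: `prefix_tokens` is a hypothetical helper name, and a real deployment would do this inside a Solr analyzer rather than client-side.

```python
def prefix_tokens(lang: str, text: str) -> list[str]:
    """Prefix each whitespace-separated token with a language code,
    mimicking the "en:The en:quick ..." scheme described above."""
    return [f"{lang}:{tok}" for tok in text.split()]

# Index-time: store prefixed tokens so terms from different
# languages never collide in a single shared field.
print(prefix_tokens("en", "The quick brown fox"))

# Query-time: the query terms must carry the same prefix,
# so an English search only matches English tokens.
print(prefix_tokens("fr", "renard brun"))
```

Note that the query side must apply the same prefixing, which is one reason separate per-language fields end up simpler in practice.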

Re: multi-language searching with Solr

2008-05-08 Thread Gereon Steffens



> These are shards of one index and not multiple indexes.  There is
> probably a way to get each shard to contain one language but then you
> end up with x servers for x languages, and some will be under-utilized
> while others will be over-utilized.

Schemas will be identical, except for the analysers. The language 
distribution I'm dealing with is about 60% German, 40% English. For 
availability reasons, each shard needs to run on at least two instances 
anyway, with a load balancer in front, so I think I'll be able to adjust 
utilization that way.


Gereon
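
As a sketch of the one-shard-per-language setup discussed above, a client could route a distributed query via Solr's `shards` request parameter. The host names below are made up, and the routing helper is a hypothetical illustration, not part of Solr itself.

```python
from urllib.parse import urlencode

# Hypothetical shard locations, one per language (hosts are made up).
SHARDS = {
    "de": "solr-de.example.com:8983/solr",
    "en": "solr-en.example.com:8983/solr",
}

def build_query_url(q: str, langs: list[str]) -> str:
    """Build a distributed-search URL querying one shard per language."""
    params = {
        "q": q,
        # Solr's distributed search takes a comma-separated shard list.
        "shards": ",".join(SHARDS[lang] for lang in langs),
    }
    return "http://solr-de.example.com:8983/solr/select?" + urlencode(params)

print(build_query_url("fox", ["de", "en"]))
```

Pointing the load balancer at replicated instances of each shard, as described above, would happen behind these host names.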


UTF-8 2-byte vs 4-byte encodings

2007-05-02 Thread Gereon Steffens
Hi,

I have a question regarding UTF-8 encodings, illustrated by the
utf8-example.xml file. This file contains raw, unescaped UTF-8 characters,
for example the "e acute" character, represented as the two bytes 0xC3 0xA9.
When this file is added to Solr and retrieved later, the XML output
contains a four-byte representation of that character, namely 0xC3 0x83
0xC2 0xA9.

If, on the other hand, the input data contains this same character as the
entity &#xE9; the output contains the two-byte encoded representation 0xC3
0xA9.

Why is that so, and is there a way to always get characters like these out
of Solr as their two-byte representations?

The reason I'm asking is that I often have to deal with CDATA sections in
my input files that contain raw (two-byte) UTF-8 characters that can't be
encoded as entities.

Thanks,
Gereon
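
The four-byte sequence is the classic double-encoding pattern: the two UTF-8 bytes get misread as Latin-1 and re-encoded as UTF-8. A short Python sketch reproduces the byte arithmetic (it illustrates the effect only, not Solr's internals):

```python
# "e acute" (é) is two bytes in UTF-8.
two_byte = "é".encode("utf-8")
assert two_byte == b"\xc3\xa9"

# If those bytes are misread as Latin-1 and encoded to UTF-8 again,
# each original byte becomes its own character -> four bytes total.
double_encoded = two_byte.decode("latin-1").encode("utf-8")
print(double_encoded)  # b'\xc3\x83\xc2\xa9'
```

This is why declaring the charset correctly on the way in (see the follow-up below about the HTTP header) avoids the problem at the source.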



Re: AW: UTF-8 2-byte vs 4-byte encodings

2007-05-02 Thread Gereon Steffens
Hi Christian,

> It is not sufficient to set the encoding in the XML but
> you need an additional HTTP header to set the encoding ("Content-type:
> text/xml; charset=UTF-8")
Thanks, that's what I was missing.

Gereon
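
For reference, the fix Christian describes, declaring the charset in the HTTP header rather than only in the XML prolog, might look like this from a Python client. The URL is a placeholder and the document body is a made-up example; only the header line matters here.

```python
import urllib.request

# A made-up update document; the body must be raw UTF-8 bytes.
xml_doc = '<add><doc><field name="id">1</field></doc></add>'

# The XML declaration alone is not enough: the HTTP Content-Type
# header must also carry charset=UTF-8 so the server decodes the
# body correctly.
req = urllib.request.Request(
    "http://localhost:8983/solr/update",  # placeholder URL
    data=xml_doc.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=UTF-8"},
)
print(req.get_header("Content-type"))  # text/xml; charset=UTF-8
```

The same applies to any client: curl, for instance, would pass the equivalent header with `-H`.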