Re: Skewed IDF in multi lingual index, again

Walter Underwood Thu, 30 Nov 2017 08:29:53 -0800

I’ve occasionally considered using Unicode language tags (U+E0001 and friends) 
on each term. That would make a term specific to a language, so we would get 
[en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big 
hammer, because it restricts matches to the same language. If the entire 
document is in one language, you might as well use a filter query for that 
language. The tags would still be useful for documents that mix several 
languages.


Maybe make the untagged term a synonym of the tagged one. For cross-language 
terms like “LaserJet”, the untagged one would have a lower idf, because it 
occurs in documents of every language.
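
A rough sketch of the tagged-plus-untagged idea, in plain Python rather than 
an actual Solr analysis chain. The bracket prefix, the toy corpus, and the 
helper names are all made up for illustration; the idf formula is the 
classic Lucene-style one, 1 + ln((docCount + 1) / (docFreq + 1)).

```python
import math
from collections import Counter


def tag_terms(tokens, lang):
    """Emit each token twice: language-tagged and as an untagged synonym."""
    out = []
    for tok in tokens:
        out.append(f"[{lang}]{tok}")  # language-specific term (hypothetical notation)
        out.append(tok)               # untagged synonym
    return out


def idf(doc_freq, doc_count):
    """Classic Lucene-style idf: 1 + ln((docCount + 1) / (docFreq + 1))."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


# Hypothetical toy corpus: (language, tokens) per document.
docs = [
    ("en", ["LaserJet", "printer"]),
    ("en", ["printer", "paper"]),
    ("en", ["paper", "tray"]),
    ("fr", ["LaserJet", "imprimante"]),
    ("de", ["LaserJet", "Drucker"]),
]

# Document frequency per indexed term (each term counted once per document).
doc_freq = Counter()
for lang, tokens in docs:
    doc_freq.update(set(tag_terms(tokens, lang)))

n = len(docs)
idf_tagged = idf(doc_freq["[en]LaserJet"], n)   # only English documents count
idf_untagged = idf(doc_freq["LaserJet"], n)     # documents in every language count

print(f"[en]LaserJet idf={idf_tagged:.3f}  LaserJet idf={idf_untagged:.3f}")
```

The untagged term still matches across languages, but it contributes less to 
the score than the tagged term for the user's own language.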

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 30, 2017, at 8:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> Hello,
> 
> We already discussed this problem five years ago [1]. In short: documents in 
> foreign languages are scored higher for some terms.
> 
> It was solved back then by using docCount instead of maxDoc when calculating 
> idf, and it worked really well! But, probably due to index changes, the 
> problem is back for some terms, mostly proper nouns, just like five years ago.
> 
> We already deboost documents that are not in the user's preferred language 
> by 0.7, but in some cases it is not enough. I could reduce that boost 
> further, but that's not what I prefer.
> 
> I'd like to know if there are additional tricks to solve the problem.
> 
> Many thanks!
> Markus
> 
> [1] 
> http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html
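
For readers who missed the earlier thread, here is a small sketch of why 
docCount versus maxDoc matters for idf. It assumes a per-language field that 
only a fraction of the documents populate, which the messages above do not 
state; all the numbers are invented. maxDoc counts every document in the 
index, while docCount counts only documents that have the field.

```python
import math


def idf(doc_freq, n_docs):
    """Classic Lucene-style idf: 1 + ln((n_docs + 1) / (doc_freq + 1))."""
    return 1.0 + math.log((n_docs + 1) / (doc_freq + 1))


# Hypothetical numbers for a small foreign-language field in a large index.
max_doc = 1_000_000    # all documents in the index
doc_count = 50_000     # documents that actually have this field
doc_freq = 40_000      # documents whose field contains the term

idf_max_doc = idf(doc_freq, max_doc)      # term looks rare across the whole index
idf_doc_count = idf(doc_freq, doc_count)  # term is common within its own field

print(f"idf with maxDoc:   {idf_max_doc:.3f}")
print(f"idf with docCount: {idf_doc_count:.3f}")
```

With maxDoc, a term that appears in most documents of a small field still 
gets a high idf, which is one way foreign-language documents can end up 
scored higher than expected.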
