I’ve occasionally considered using Unicode language tags (U+E0001 and friends) on each term. That would make a term specific to a language, so we would get [en]LaserJet, [fr]LaserJet, [de]LaserJet, and so on. But that is a pretty big hammer, because it restricts matches to the same language. If the entire document is in one language, you might as well use a filter query for that language. The tags would only earn their keep when a single document contains multiple languages.
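A toy sketch of what per-language tagging does to document frequencies, in plain Python rather than Lucene code. The corpus, the `[en]` bracket prefix standing in for real Unicode tag characters, and the helper names (`tag_terms`, `doc_freqs`) are all made up for illustration. The idf formula is the BM25-style one, log(1 + (N - n + 0.5) / (n + 0.5)); treat the numbers as illustrative only.

```python
import math
from collections import Counter


def idf(doc_freq, doc_count):
    """BM25-style idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tag_terms(lang, tokens):
    """Prefix each token with its language, e.g. '[en]laserjet' (hypothetical scheme)."""
    return [f"[{lang}]{token}" for token in tokens]


# Hypothetical six-document index: (language, tokens).
DOCS = [
    ("en", ["laserjet", "printer"]),
    ("en", ["laserjet", "toner"]),
    ("en", ["laserjet", "driver"]),
    ("en", ["printer", "paper"]),
    ("fr", ["laserjet", "imprimante"]),
    ("de", ["drucker", "papier"]),
]


def doc_freqs(docs, tagged):
    """Count how many documents contain each term, with or without language tags."""
    freqs = Counter()
    for lang, tokens in docs:
        terms = tag_terms(lang, tokens) if tagged else tokens
        freqs.update(set(terms))  # set(): count each term once per document
    return freqs


untagged = doc_freqs(DOCS, tagged=False)
tagged = doc_freqs(DOCS, tagged=True)
doc_count = len(DOCS)

for term, freqs in [("laserjet", untagged), ("[en]laserjet", tagged), ("[fr]laserjet", tagged)]:
    print(term, freqs[term], round(idf(freqs[term], doc_count), 3))
```

In this toy index the untagged term is counted in every language, so it has the highest document frequency and the lowest idf, while each tagged variant only competes within its own language.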
Maybe make the untagged term a synonym. For cross-language terms like “LaserJet”, the untagged one would have worse idf.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 30, 2017, at 8:14 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> Hello,
>
> We already discussed this problem five years ago [1]. In short: documents in
> foreign languages are scored higher for some terms.
>
> It was solved back then by using docCount instead of maxDoc when calculating
> idf, it worked really well! But, probably due to index changes, the problem
> is back for some terms, mostly proper nouns, well, just like five years ago.
>
> We already deboost documents by 0.7 that are not in the user's preference
> language but in some cases it is not enough. I can go on by reducing that
> boost but that's not what i prefer.
>
> I'd like to know if there are additional tricks to solve the problem.
>
> Many thanks!
> Markus
>
> [1]
> http://lucene.472066.n3.nabble.com/Skewed-IDF-in-multi-lingual-index-td4019095.html