I know Russian better than Russians ;)
I currently use the default "dismax" configuration that ships with SOLR
1.1; I can add a few URLs to the crawler tonight to see what happens. As
far as I know, Lucene/Nutch can even detect the language of a web page
(pdf, txt, html) by checking the raw byte array (the raw HTTP response,
without "language" clues in the HTML). The code in Nutch trunk is huge,
with a lot of useful stuff...
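To make the idea of guessing a language from raw bytes concrete, here is a toy sketch in Python. It is not what Nutch does; the function name, the tiny word lists and the assumption that the bytes are UTF-8 are all mine, and a real detector would also have to work out the charset first.

```python
import re

# Tiny hand-picked function-word lists, purely illustrative (an assumption,
# not taken from Nutch or Lucene).
PROFILES = {
    "en": {"the", "and", "of", "to", "in", "is"},
    "fr": {"le", "la", "les", "et", "des", "est"},
    "ru": {"и", "в", "не", "на", "что", "с"},
}

def guess_language(raw: bytes, encoding: str = "utf-8") -> str:
    """Guess the language of a raw response body by counting function words."""
    # Assumes the encoding is already known; undecodable bytes are dropped.
    text = raw.decode(encoding, errors="ignore").lower()
    words = re.findall(r"\w+", text)
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in PROFILES.items()
    }
    best = max(scores, key=scores.get)
    # No function word matched at all: refuse to guess.
    return best if scores[best] > 0 else "unknown"
```

With lists this small it only works on whole sentences, but it shows that no HTML or HTTP language hint is needed.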
Quoting Daniel Alheiros:
My indexing process follows:
1. RussianTokenizer
2. RussianLowerCaseFilter
3. RussianStopFilter
4. RussianStemFilter
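For reference, the four stages quoted above chain together roughly as in the Python sketch below. These are toy stand-ins, not the Lucene classes: the stop list, the suffix list and every function name are assumptions made for illustration, and the real RussianStemFilter is far more thorough.

```python
import re

STOP_WORDS = {"и", "в", "на", "не"}        # tiny illustrative stop list
SUFFIXES = ("ами", "ов", "ы", "а", "и")    # toy suffixes, longest first

def tokenize(text):
    # Stage 1: keep runs of Cyrillic letters only.
    return re.findall(r"[а-яёА-ЯЁ]+", text)

def lowercase(tokens):
    # Stage 2: lower-case every token.
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    # Stage 3: drop tokens found in the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

def stem(tokens):
    # Stage 4: strip at most one known suffix, keeping at least 3 letters.
    out = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        out.append(t)
    return out

def analyze(text):
    # Each stage consumes the output of the previous one, in the quoted order.
    return stem(remove_stop_words(lowercase(tokenize(text))))
```

The order matters: the stop list is lower-case, so it has to run after the lower-case stage, and stemming comes last so that stop words are matched in their full form.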
I haven't tried it yet... I'll probably need a separate SOLR + website
for Russian (?)
Currently http://www.tokenizer.org has pages in French (Canadian shops
are bilingual), and Google correctly "understands" that such pages are
in French (without additional HTML/HTTP language clues); I don't know
French and can't test...
Unfortunately, the query "écran" does not retrieve anything. However, I
have a lot of "d'Intel", including "dIntel et écran".
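I have not confirmed why the query fails. If it comes down to accents or elisions being handled differently at index time and at query time, a normalization step along these lines, applied on both sides, is one thing to try. It is a Python sketch only; the function names and the elision list are my assumptions, not anything from the SOLR configuration.

```python
import re
import unicodedata

# Leading French elisions such as d', l', qu' (straight or curly apostrophe).
ELISION = re.compile(r"^(?:d|l|j|m|n|s|t|c|qu)['\u2019]", re.IGNORECASE)

def fold_accents(token):
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_french(token):
    # Strip a leading elision, fold accents, then lower-case.
    token = ELISION.sub("", token)
    return fold_accents(token).lower()
```

With this, "d'Intel" and "Intel" end up as the same term, and so do "écran" and "ecran".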
I need to work on it too... Thanks!