I know Russian better than Russians ;)
I currently use the default "dismax" configuration provided by SOLR 1.1; I can add a few URLs to the crawler tonight to see what happens. As far as I know, Lucene/Nutch can even detect the language of a web page (pdf, txt, html) by checking the raw byte array (the raw HTTP response, without "language" clues in the HTML). The code in Nutch trunk is huge, with a lot of useful stuff...

Quoting Daniel Alheiros:
My indexing process follows:
    1. RussianTokenizer
    2. RussianLowerCaseFilter
    3. RussianStopFilter
    4. RussianStemFilter
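
For readers unfamiliar with analyzer chains, the four stages quoted above can be sketched roughly as follows. This is only an illustration in plain Python, not Lucene's implementation: the function names, the stop-word set, and the suffix list are all made up for the example, and the "stemmer" is a toy suffix stripper rather than the real Russian stemming algorithm.

```python
import re

# Hypothetical toy data: NOT the stop list or stemming rules that Lucene ships.
STOP_WORDS = {"и", "в", "на", "не"}
SUFFIXES = ("ами", "ов", "ы", "а", "и")


def tokenize(text):
    # Stage 1: split the text into runs of Cyrillic letters.
    return re.findall(r"[А-Яа-яЁё]+", text)


def lowercase(tokens):
    # Stage 2: lower-case every token.
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    # Stage 3: drop very common words that carry little meaning.
    return [t for t in tokens if t not in STOP_WORDS]


def stem(tokens):
    # Stage 4: strip at most one known suffix, keeping at least 3 letters.
    result = []
    for t in tokens:
        for suffix in SUFFIXES:
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        result.append(t)
    return result


def analyze(text):
    # Apply the stages in the same order as the quoted list.
    return stem(remove_stop_words(lowercase(tokenize(text))))
```

The order matters: the stop filter and the stemmer both assume lower-cased input, so the lower-case stage has to run before them.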


I haven't tried it yet... I'll probably need a separate SOLR + website for Russian (?)

Currently http://www.tokenizer.org has pages in French (Canadian shops are bilingual), and Google correctly "understands" that such pages are in French (without additional HTML/HTTP language clues); I don't know French and can't test...

Unfortunately, the query "écran" does not retrieve anything. However, I have a lot of documents containing "d'Intel", including "d’Intel et écran".
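
I don't know yet why "écran" returns nothing. One thing worth checking is whether the indexed text and the query go through the same normalization. Below is a rough sketch, in plain Python and not SOLR configuration, of what such a normalization could look like for French: unify the curly and straight apostrophes, strip elided articles such as "d'", and fold accents. The function names and the elision list are assumptions for the example only.

```python
import re
import unicodedata

# Hypothetical list of elided prefixes; a real French analyzer may use a different set.
ELISIONS = {"d", "l", "j", "m", "n", "s", "t", "c", "qu"}


def fold_accents(token):
    # Decompose accented letters and drop the combining marks: "écran" becomes "ecran".
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    # Treat the curly apostrophe (U+2019) the same as the straight one.
    text = text.replace("\u2019", "'").lower()
    # Tokens are runs of letters, optionally joined by apostrophes.
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text)
    result = []
    for token in tokens:
        if "'" in token:
            prefix, rest = token.split("'", 1)
            if prefix in ELISIONS:
                token = rest  # "d'intel" becomes "intel"
        result.append(fold_accents(token))
    return result
```

If the same function is applied both at index time and at query time, then "écran" and "d’Intel et écran" end up sharing the token "ecran", so at least the normalization step would not be what blocks the match.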

I need to work on it too... Thanks!
