Re: Problems querying Russian content

funtick Thu, 28 Jun 2007 10:24:16 -0700

I know Russian better than Russians ;)

I currently use the default "dismax" configuration provided by Solr 1.1; I can add a few URLs to the crawler tonight to see what happens. As far as I know, Lucene/Nutch can even determine the language of a web page (PDF, TXT, HTML) by checking the raw byte array (the raw HTTP response, without "language" clues in the HTML). The code in Nutch trunk is huge, with a lot of useful stuff...
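
To illustrate the idea of guessing a language from content alone, here is a minimal sketch based on character trigram profiles. This is not Nutch's actual code: the function names, the two tiny sample profiles, and the assumption that the bytes are UTF-8 are all mine, chosen only for illustration. A real identifier would need much larger profiles and charset detection as well.

```python
from collections import Counter

def trigrams(text):
    """Count character trigrams after lowercasing and collapsing whitespace."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Hypothetical, very small training samples (illustration only).
PROFILES = {
    "ru": trigrams("я знаю русский язык и это хорошо что он говорит по русски"),
    "fr": trigrams("je ne sais pas le français et les pages sont en français"),
}

def guess_language(raw, encoding="utf-8"):
    """Decode raw bytes (encoding is assumed known) and pick the profile
    that shares the most trigram occurrences with the text."""
    text = raw.decode(encoding, errors="replace")
    grams = trigrams(text)
    scores = {
        lang: sum(min(count, profile[gram]) for gram, count in grams.items())
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)
```

For example, `guess_language("это русский текст".encode("utf-8"))` picks the Russian profile, because none of its trigrams occur in the French sample.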


Quoting Daniel Alheiros:

My indexing process follows:
    1. RussianTokenizer
    2. RussianLowerCaseFilter
    3. RussianStopFilter
    4. RussianStemFilter
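
The four steps above can be sketched as a plain pipeline. This is only a toy model, not Lucene's Russian analyzer: the stop list and suffix list are hypothetical and far too small for real use. The point it shows is that the same chain has to run on both the indexed text and the query, otherwise the terms will not match.

```python
import re

# Hypothetical, tiny resources for illustration only.
STOP_WORDS = {"и", "в", "на", "не", "что"}
SUFFIXES = ("ами", "ого", "ов", "ах", "ы", "а", "и", "е")

def tokenize(text):
    """Step 1: split the text into runs of Cyrillic letters."""
    return re.findall(r"[а-яёА-ЯЁ]+", text)

def lowercase(tokens):
    """Step 2: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Step 3: drop tokens that are in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(tokens):
    """Step 4: strip the first matching suffix, keeping at least 3 letters
    (a crude stand-in for a real stemmer)."""
    result = []
    for t in tokens:
        for suffix in SUFFIXES:
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        result.append(t)
    return result

def analyze(text):
    """Run the four steps in the order listed above."""
    return stem(remove_stop_words(lowercase(tokenize(text))))

indexed = analyze("Книги и журналы на столах")
query = analyze("Книга")
```

Here `indexed` is `["книг", "журнал", "стол"]` and `query` is `["книг"]`, so the query term matches only because both sides went through the same analysis.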

I haven't tried it yet... I'll probably need a separate Solr + website for Russian (?)

Currently http://www.tokenizer.org has pages in French (Canadian shops are bilingual), and Google correctly "understands" that such pages are in French (without additional HTML/HTTP language clues); I don't know French and can't test...

Unfortunately, the query "écran" does not retrieve anything. However, I have a lot of "d'Intel", including "dIntel et écran".
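
I have not diagnosed this, but one possible cause of a failed accented query is an encoding mismatch somewhere between the browser and the index. The sketch below only demonstrates that effect; the helper name and the sample set of indexed terms are hypothetical, and it assumes the query is sent as UTF-8 but decoded as Latin-1.

```python
def misdecode(query, actual="utf-8", assumed="latin-1"):
    """Encode with the real charset, then decode with the wrongly assumed one."""
    return query.encode(actual).decode(assumed)

# Hypothetical terms as they might sit in the index.
indexed_terms = {"dintel", "et", "écran"}

good_query = "écran"
bad_query = misdecode(good_query)  # the two UTF-8 bytes of "é" become two characters

print(good_query in indexed_terms)  # True
print(bad_query in indexed_terms)   # False
```

If the term really is in the index, comparing the bytes the server receives with the bytes stored for the term would confirm or rule out this explanation.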


I need to work on it too... Thanks!
