Re: looking for documentation on solr.JapaneseTokenizerFactory

Micheal Cooper Tue, 28 Jun 2016 01:05:14 -0700

Very nice. Thank you.

My non-Japanese devs had set Solr to use CJK analysis at index time and the 
Whitespace Tokenizer at query time, which does not work at all because Japanese 
does not separate words with whitespace. I was able to find settings that seem 
to be working well.

For reference for other knowledge-seekers:

I contacted the company that donated Kuromoji, the JapaneseTokenizer from 
Lucene that is used in Solr, and they directed me to 
https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-Japanese
which documents v6. The only problem I had was that 
JapaneseIterationMarkCharFilterFactory does not seem to be available in v4.10, 
so I just removed it. It is an edge case, and I can look into that later.
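
For illustration, here is a minimal sketch of a text_ja field type along the 
lines of the stock Solr 4.x example schema, with the iteration-mark char filter 
left out. It is an assumption about a typical setup, not the exact settings 
used on this install; the stoptags and stopwords file paths are the ones from 
the stock example config and may differ in yours.

```xml
<!-- Sketch only: modeled on the stock Solr 4.x example schema.xml. -->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <!-- One analyzer for both index and query, so both sides tokenize the same way. -->
  <analyzer>
    <!-- JapaneseIterationMarkCharFilterFactory intentionally omitted (see above). -->
    <!-- Kuromoji tokenizer; mode="search" also splits long compounds. -->
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <!-- Reduce inflected verbs and adjectives to their base form. -->
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <!-- Drop tokens by part-of-speech tag (file path assumed from the stock config). -->
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <!-- Normalize full-width Latin and half-width katakana. -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- Stopwords (file path assumed from the stock config). -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
    <!-- Strip the trailing long-vowel mark from katakana words. -->
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Using a single analyzer element rather than separate index and query analyzers 
avoids the kind of index/search mismatch described above.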

The other thing to be careful of is loading the library.
I could not reload the core because Solr could not load Kuromoji, and I found 
that the directory containing it was not listed in solrconfig.xml.
When I tried the default relative-path method, it did not work. It seems 
to have something to do with the Lucene libraries. The Japanese blog I found 
recommended using an absolute path, so I put that in the part of the ‘config’ 
section that loads library directories, and it worked.
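
As a rough sketch of what such an entry can look like: the lib element below 
goes inside the config element of solrconfig.xml. The directory is a 
hypothetical placeholder, not the path used on this install; substitute the 
absolute path of whichever directory actually holds the Kuromoji jar.

```xml
<config>
  <!-- Hypothetical absolute path: replace with the directory that contains
       the lucene-analyzers-kuromoji jar on your server. -->
  <lib dir="/opt/solr/lib/kuromoji" regex="lucene-analyzers-kuromoji-.*\.jar"/>

  <!-- ... rest of solrconfig.xml ... -->
</config>
```

The regex attribute is optional; without it, every file in the directory is 
added to the classpath.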

Here are some links that also helped:
https://cwiki.apache.org/confluence/display/solr/Language+Analysis#LanguageAnalysis-Japanese
http://d.hatena.ne.jp/kahnn/20130828/1377645204
http://blog.flect.co.jp/labo/2012/10/solr40schemaxml-bf12.html

Micheal

On 2016/06/28, 16:10, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:

Have you seen http://discovery-grindstone.blogspot.com.au/ ? It is a
series of articles on setting up CJK for library content.

Regards,
   Alex.
----
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/

On 28 June 2016 at 10:59, Micheal Cooper <micheal.coo...@oist.jp> wrote:
> I have a vendor-supplied Solr 4.10 set up for multisite search which indexes 
> two large Drupal 7 sites which have content in Japanese, English, and 
> Undefined.
>
> The English searches are OK, but the Japanese does not work well at all. The 
> vendors are in the US, so it is understandable that they cannot really test 
> it for themselves.
>
> I am trying to fix this config before setting userdict, synonyms, stopwords, 
> and the like. There is obviously a problem with the tokenization.
>
> I have searched Google in English and Japanese and Safari Books in English, 
> but I cannot find a definitive page or tutorial on setting up Solr with 
> Kuromoji (JapaneseTokenizerFactory) correctly, and the official documentation 
> is not helpful. The comments for text_ja in the config say "See 
> http://wiki.apache.org/solr/JapaneseLanguageSupport for more on Japanese 
> language support," but when you go there, it just says, "This page will 
> contain various information on Japanese support in Lucene/Solr 3.6 & 4.0, but 
> it currently just a filler...".
>
> Does anyone have a good source of info for setting up Solr for Japanese 
> content?
>
> Micheal
>
