RE: I18N with SOLR?

Dilip.TS Sun, 18 Nov 2007 22:09:53 -0800

           Hello,

              Also, is something like the following possible, i.e. having multiple
"defaultSearchField" entries in schema.xml, so that a keyword query
combining more than one language can be searched?


              <defaultSearchField>text</defaultSearchField>
              <defaultSearchField>text_french</defaultSearchField>...
  -----Original Message-----
  From: Dilip.TS [mailto:[EMAIL PROTECTED]
  Sent: Monday, November 19, 2007 11:29 AM
  To: solr-user@lucene.apache.org
  Subject: RE: I18N with SOLR?


            Hello,
                    Does Solr support searching for a keyword query that
combines more than one language within the same search page?



    -----Original Message-----
    From: Guglielmo Celata [mailto:[EMAIL PROTECTED]
    Sent: Thursday, November 15, 2007 7:39 PM
    To: solr-user@lucene.apache.org; [EMAIL PROTECTED]
    Subject: Re: I18N with SOLR?


    Hi Dilip,
    I don't know if this helps, but I have set up a textIt field type in the
config/schema.xml file in order to index Italian text.
    It works pretty well with non-ASCII characters (we do have some accented
vowels, even if not as many as French does).
    It also works with stopwords (and I assume with protwords as well,
though I didn't try). I created an italian-stopwords.txt file in the config/
path.
    I think SnowballPorterFilterFactory is a class that ships with Solr by
default, although I remember reading that it's a bit slower than other
libraries.
    But I am no expert.


        <fieldtype name="textIt" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.ISOLatin1AccentFilterFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1" catenateWords="1"
                    catenateNumbers="1" catenateAll="0"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.StopFilterFactory"
                    words="italian-stopwords.txt" ignoreCase="true"/>
            <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
          </analyzer>
        </fieldtype>
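
    A French variant would presumably follow the same pattern, swapping the
stopword file and the Snowball language. This is only a sketch: the textFr
type name, the body_fr field name and the french-stopwords.txt file are
assumptions for illustration, not part of any tested setup.

        <!-- hypothetical French counterpart of the textIt type above -->
        <fieldtype name="textFr" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- french-stopwords.txt is an assumed file in the same config/ path -->
            <filter class="solr.StopFilterFactory"
                    words="french-stopwords.txt" ignoreCase="true"/>
            <filter class="solr.SnowballPorterFilterFactory" language="French"/>
          </analyzer>
        </fieldtype>

        <!-- hypothetical field that uses the type -->
        <field name="body_fr" type="textFr" indexed="true" stored="true"/>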



    On 15/11/2007, Dilip.TS <[EMAIL PROTECTED]> wrote:
      Hi Ed,
        Thanks for the help, but I have some questions.
        I understand that we need stopwords_french.txt and
      protwords_french.txt files (for French, say) in the solr/conf directory.
        Do we need to write classes like FrenchStopFilterFactory and
      FrenchPorterFilterFactory for each language,
        or are these classes built into Solr? I didn't find them in the
      Solr/Lucene APIs.
        I found some classes like org.apache.lucene.analysis.fr.FrenchAnalyzer
      in lucene-analyzers.jar.
        Any idea what this class is used for?

      Thanks in advance,

      Regards
      Dilip

      -----Original Message-----
      From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Ed Summers
      Sent: Monday, November 12, 2007 7:00 PM
      To: solr-user@lucene.apache.org ; [EMAIL PROTECTED]
      Subject: Re: I18N with SOLR?


      I'd say yes. Solr supports Unicode and ships with language-specific
      analyzers, and allows you to provide your own custom analyzers if you
      need them. This allows you to create different <fieldType> definitions
      for the languages you want to support. For example, here is a
      field type for French text which uses a French stopword list and
      French stemming.

          <fieldType
            name="text_french"
            class="solr.TextField" >
            <analyzer>
              <tokenizer class="solr.WhitespaceTokenizerFactory" />
              <filter
                class="solr.FrenchStopFilterFactory"
                ignoreCase="true"
                words="stopwords_french.txt" />
              <filter
                class="solr.FrenchPorterFilterFactory"
                protected="protwords_french.txt" />
              <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
          </fieldType>

      Then you can create <dynamicField> definitions that allow you to
      index and query your documents using the correct field type:

          <dynamicField
            name="*_french"
            type="text_french"
            indexed="true"
            stored="true"/>

      This means that when you index you need to know what language your
      data is in so that you know what field names to use in your document
      (e.g. title_french). And at search time you need to know what language
      you are in so you know which fields to search.  Most user interfaces
      are in a single language context so from the query perspective you'll
      most likely know the language they want to search in. If you don't
      know the language context in either case you could try to guess using
      something like org.apache.nutch.analysis.lang.LanguageIdentifier.
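
      As a rough illustration of the indexing side (a sketch only: the id
      field and the sample values are assumptions, and title_french simply
      matches the *_french dynamicField above), an update message might
      look like this:

          <add>
            <doc>
              <!-- "id" is assumed to be the schema's uniqueKey field -->
              <field name="id">doc-1</field>
              <!-- matched by the *_french dynamicField, so analyzed as text_french -->
              <field name="title_french">Les Misérables</field>
            </doc>
          </add>

      A query in a French context would then name the field explicitly,
      e.g. q=title_french:misérables.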

      I hope this helps. We used this technique (without the guessing) quite
      effectively at the Library of Congress recently for a prototype
      application that needed to provide search functionality in 7 different
      languages.

      //Ed

      On Nov 12, 2007 1:56 AM, Dilip.TS < [EMAIL PROTECTED]> wrote:
      > Hello,
      >
      >   Does Solr support I18N (with multiple language support)?
      >   Thanks in advance.
      >
      > Regards,
      > Dilip TS
      >
      >
