On Sun, May 15, 2011 at 8:02 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> On Fri, May 6, 2011 at 8:49 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>
>> Shouldn't we  have field types in the eg schema for the different
>> languages?  Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.
>
> In fact, until we break out dedicated language field types, shouldn't
> we default autophrase to off in Solr?

I've taken a crack at adding a generic text field for
non-whitespace-delimited languages to the example schema:

   <!-- A general unstemmed text field that is better for non
whitespace delimited languages (nwd) due to
autoGeneratePhraseQueries=false -->
    <fieldType name="text_nwd" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

   <dynamicField name="*_nwd" type="text_nwd" indexed="true"  stored="true"/>

You can try it out on trunk with a query like:
http://localhost:8983/solr/select?q=name_nwd:F-11&debugQuery=true

And verify that it generates an OR of the parts rather than a phrase query:

<str name="querystring">name_nwd:F-11</str>
<str name="parsedquery">name_nwd:f name_nwd:11</str>
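
For anyone who wants to script that check, here is a minimal sketch in
Python. It assumes the XML response format with debugQuery=true, and it
assumes that a generated phrase query would show up in parsedquery as a
quoted string (e.g. name_nwd:"f 11"). The helper names and the trimmed
sample response are hypothetical, written by hand for illustration, not
captured from a running server.

```python
import xml.etree.ElementTree as ET


def parsed_query(response_xml: str) -> str:
    """Return the text of <str name="parsedquery"> from a Solr XML response."""
    root = ET.fromstring(response_xml)
    for node in root.iter("str"):
        if node.get("name") == "parsedquery":
            return node.text or ""
    raise ValueError("no parsedquery element; was debugQuery=true set?")


def is_phrase_query(parsed: str) -> bool:
    # Assumption: a phrase query is printed as field:"term1 term2", so a
    # double quote in the parsed query signals that a phrase was generated.
    return '"' in parsed


# Hand-written sample that mirrors the debug output quoted above.
SAMPLE = """<response>
  <lst name="debug">
    <str name="querystring">name_nwd:F-11</str>
    <str name="parsedquery">name_nwd:f name_nwd:11</str>
  </lst>
</response>"""

parsed = parsed_query(SAMPLE)
verdict = "phrase query" if is_phrase_query(parsed) else "no phrase query"
print(parsed, "->", verdict)
```

To run it against a live instance, fetch the URL above and pass the
response body to parsed_query() in place of SAMPLE.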

Can someone verify that the WDF params are OK?  (I didn't catenate,
since that wouldn't make sense if the word parts were actually whole
words in a non-whitespace-delimited language.)  Does that make sense?


As far as Solr defaults go... perhaps way, way back "text" should have
been named "text_en".
But any changes now should be comprehensive: we need to consider the
impact on the example data, the example schema, the Solr tutorial
(which relies on some of the current behavior), and a ton of
documentation on the wiki related to both analysis components
(multi-word synonyms, WDF, etc.) and other quickstart guides.

Anyway, changes to the example schema (or the behavior of the example
schema) can have a large impact.
I personally think that adding a new field is much easier and less
disruptive, and given the potential impact
we should hear what others have to say about it too (I'm out the rest
of today, and I know a lot of other
people aren't around this weekend either).

-Yonik
