On Sun, May 15, 2011 at 8:02 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> On Fri, May 6, 2011 at 8:49 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>
>> Shouldn't we have field types in the eg schema for the different
>> languages? Ie, text_zh, text_th, text_en, text_ja, text_nl, etc.
>
> In fact, until we break out dedicated language field types, shouldn't
> we default autophrase to off in Solr?

I've taken a crack at adding a generic text field for
non-whitespace-delimited languages to the example schema:

    <!-- A general unstemmed text field that is better for
         non-whitespace-delimited languages (nwd) due to
         autoGeneratePhraseQueries=false -->
    <fieldType name="text_nwd" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <dynamicField name="*_nwd" type="text_nwd" indexed="true" stored="true"/>

You can try it out on trunk with a query like:

    http://localhost:8983/solr/select?q=name_nwd:F-11&debugQuery=true

and verify that it generates an OR of the parts rather than a phrase query:

    <str name="querystring">name_nwd:F-11</str>
    <str name="parsedquery">name_nwd:f name_nwd:11</str>

Can someone verify that the WDF params are OK? I didn't catenate, since
that wouldn't make sense if the word parts were actually whole words in
a non-whitespace-delimited language. Does that make sense?

As far as Solr defaults go... perhaps way, way back "text" should have
been named "text_en". But any changes now should be comprehensive: we
need to consider the impact on the example data, the example schema, the
Solr tutorial (which relies on some of the current behavior), and a ton
of documentation on the wiki related to both analysis components
(multi-word synonyms, WDF, etc.) and other quickstart guides. Anyway,
changes to the example schema (or to the behavior of the example schema)
can have a large impact.
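
For readers who want to see the effect of autoGeneratePhraseQueries on the F-11 example without a running Solr, here is a toy Python sketch. It is not Solr code: word_parts and parsed_query are hypothetical helpers that only approximate the analyzer chain (split into letter and digit runs, no catenation, lowercase) and ignore stop words and the real tokenizer. The phrase form shown for the "true" case is what I would expect Solr to report, not output taken from this thread.

```python
import re


def word_parts(token):
    """Split a token into runs of letters and runs of digits.

    Rough stand-in for WordDelimiterFilter with generateWordParts=1,
    generateNumberParts=1, no catenation and splitOnCaseChange=0.
    """
    # [^\W\d_]+ matches letters (any script); \d+ matches digit runs.
    return re.findall(r"[^\W\d_]+|\d+", token)


def parsed_query(field, text, auto_generate_phrase_queries):
    """Build a debug-style query string for one whitespace-delimited chunk."""
    parts = [p.lower() for p in word_parts(text)]
    if auto_generate_phrase_queries and len(parts) > 1:
        # Several tokens from a single chunk become one phrase query.
        return '%s:"%s"' % (field, " ".join(parts))
    # Otherwise each part becomes its own optional (OR'd) term.
    return " ".join("%s:%s" % (field, p) for p in parts)


if __name__ == "__main__":
    print(parsed_query("name_nwd", "F-11", False))  # name_nwd:f name_nwd:11
    print(parsed_query("name_nwd", "F-11", True))   # name_nwd:"f 11"
```

With the flag off, the result matches the parsedquery shown above; with it on, the two parts would have to appear adjacent and in order, which is the behavior that hurts languages where the "parts" are really separate words.
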

I personally think that adding a new field is much easier and less
disruptive, and given the potential impact we should hear what others
have to say about it too (I'm out the rest of today, and I know a lot of
other people aren't around this weekend either).

-Yonik