Re: Solr Accent Insensitive and sensitive search

Denis WSRosa Thu, 18 Aug 2011 05:27:07 -0700

Hi! Thank you for your response!

Here is my full schema:


<?xml version="1.0" encoding="UTF-8" ?>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more
    contributor license agreements. See the NOTICE file distributed with
    this work for additional information regarding copyright ownership.
    The ASF licenses this file to You under the Apache License, Version 2.0
    (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License. -->

<!-- This is the Solr schema file. This file should be named "schema.xml"
    and should be in the conf directory under the solr home (i.e.
    ./solr/conf/schema.xml by default) or located where the classloader for
    the Solr webapp can find it. This example schema is the recommended
    starting point for users. It should be kept correct and concise, usable
    out-of-the-box. For more information on how to customize this file,
    please see http://wiki.apache.org/solr/SchemaXml

    PERFORMANCE NOTE: this schema includes many optional features and should
    not be used for benchmarking. To improve performance one could
    - set stored="false" for all fields possible (esp large fields) when you
      only need to search on the field but don't need to return the original
      value.
    - set indexed="false" if you don't need to search on the field, but only
      return the field as a result of searching on other indexed fields.
    - remove all unneeded copyField statements
    - for best index size and searching performance, set "index" to false
      for all general text fields, use copyField to copy them to the
      catchall "text" field, and use that for searching.
    - For maximum indexing performance, use the StreamingUpdateSolrServer
      java client.
    - Remember to run the JVM in server mode, and use a higher logging level
      that avoids logging every request -->

<schema name="example" version="1.4">

    <types>

        <fieldType name="uuid" class="solr.StrField" multiValued="false" />
        <!-- Non-analyzed field -->
        <fieldType name="string" class="solr.StrField" multiValued="false"
            omitNorms="true" />

        <!-- boolean type: "true" or "false" -->
        <fieldType name="boolean" class="solr.BoolField"
            sortMissingLast="true" omitNorms="true" />
        <!-- Binary data type. The data should be sent/retrieved as Base64
            encoded Strings -->
        <fieldType name="binary" class="solr.BinaryField" />
        <!-- Default numeric field types. For faster range queries, consider
            the tint/tfloat/tlong/tdouble types. -->
        <fieldType name="int" class="solr.TrieIntField"
            precisionStep="0" omitNorms="true" positionIncrementGap="0" />
        <fieldType name="float" class="solr.TrieFloatField"
            precisionStep="0" omitNorms="true" positionIncrementGap="0" />
        <fieldType name="long" class="solr.TrieLongField"
            precisionStep="0" omitNorms="true" positionIncrementGap="0" />
        <fieldType name="double" class="solr.TrieDoubleField"
            precisionStep="0" omitNorms="true" positionIncrementGap="0" />
        <fieldType name="date" class="solr.DateField"
            sortMissingLast="true" omitNorms="true" />

        <!-- Numeric field types that index each value at various levels of
            precision to accelerate range queries when the number of values
            between the range endpoints is large. See the javadoc for
            NumericRangeQuery for internal implementation details. Smaller
            precisionStep values (specified in bits) will lead to more tokens
            indexed per value, slightly larger index size, and faster range
            queries. A precisionStep of 0 disables indexing at different
            precision levels. -->
        <fieldType name="tint" class="solr.TrieIntField"
            precisionStep="8" omitNorms="true" positionIncrementGap="0" />
        <fieldType name="tfloat" class="solr.TrieFloatField"
            precisionStep="8" omitNorms="true" positionIncrementGap="0" />
        <fieldType name="tlong" class="solr.TrieLongField"
            precisionStep="8" omitNorms="true" positionIncrementGap="0" />
        <fieldType name="tdouble" class="solr.TrieDoubleField"
            precisionStep="8" omitNorms="true" positionIncrementGap="0" />
        <!-- A Trie based date field for faster date range queries and date
            faceting. -->
        <fieldType name="tdate" class="solr.TrieDateField"
            omitNorms="true" precisionStep="6" positionIncrementGap="0" />

        <!-- Key type fields, no filters -->
        <fieldType name="keytype" class="solr.TextField"
            multiValued="false" omitNorms="true">
            <analyzer>
                <tokenizer class="solr.KeywordTokenizerFactory" />
            </analyzer>
        </fieldType>

        <!-- A general text field with generic cross-language defaults: it
            tokenizes with StandardTokenizer, folds accented characters to
            their ASCII equivalents, and lowercases. Stop word removal and
            query-time synonyms are commented out below. -->
        <fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"
                    enablePositionIncrements="true" /> -->
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"
                    enablePositionIncrements="true" / -->
                <!--filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt"
                    ignoreCase="true" expand="true"/ -->
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
            </analyzer>
        </fieldType>

        <!-- Splits the field value on commas, then folds accented characters
            to ASCII and lowercases each resulting tag. -->
        <fieldType name="tags" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.PatternTokenizerFactory" pattern="," />
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
            </analyzer>
        </fieldType>

        <!-- Splits the field value on whitespace and trims each token. -->
        <fieldType name="number" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="solr.TrimFilterFactory" />
            </analyzer>
        </fieldType>

        <!-- A general content field used for search. Should be used for
            content strings. This sort of field will have the HTML tags
            removed. -->
        <fieldType name="content" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <charFilter class="solr.HTMLStripCharFilterFactory" />
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"
                    enablePositionIncrements="true" /> -->
                <!-- <filter class="solr.SynonymFilterFactory"
synonyms="index_synonyms.txt"
                    ignoreCase="true" expand="false"/> -->
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
            </analyzer>
        </fieldType>

        <!-- Just like text except it reverses the characters of each token,
            to enable more efficient leading wildcard queries. -->
        <fieldType name="text_rev" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"
                    enablePositionIncrements="true" /> -->
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.ReversedWildcardFilterFactory"
                    withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2"
                    maxFractionAsterisk="0.33" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"
                    enablePositionIncrements="true" /> -->
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
            </analyzer>
        </fieldType>

        <!-- Just like tags except it reverses the characters of each token,
            to enable more efficient leading wildcard queries. -->
        <fieldType name="tags_rev" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.PatternTokenizerFactory" pattern="," />
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.ReversedWildcardFilterFactory"
                    withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2"
                    maxFractionAsterisk="0.33" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.PatternTokenizerFactory" pattern="," />
                <filter class="solr.ASCIIFoldingFilterFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
            </analyzer>
        </fieldType>

    </types>


    <fields>

        <!-- Basic fields -->
        <field name="UUID" type="uuid" indexed="true" stored="false"
            multiValued="false" required="true" />
        <field name="DocumentType" type="keytype" indexed="true"
            stored="false" required="true" multiValued="false" />
        <field name="DocumentLocale" type="string" indexed="true"
            stored="true" required="false" />
        <field name="DocumentId" type="string" indexed="false" stored="true"
            required="true" />
        <field name="DocumentName" type="text" indexed="true" stored="true"
            required="false" />
        <field name="DocumentDisplayName" type="text" indexed="true"
            stored="true" required="true" />
        <field name="DocumentCreateDate" type="text" indexed="false"
            stored="true" required="false" />
        <field name="DocumentLastUpdateDate" type="text" indexed="false"
            stored="true" required="false" />
        <field name="DocumentContent" type="content" indexed="true"
            stored="false" required="false" />
        <field name="DocumentMIME" type="text" indexed="true" stored="true"
            required="false" />
        <field name="DocumentTAGS" type="tags" indexed="true" stored="true"
            required="false" />
        <field name="URL" type="string" indexed="false" stored="true"
            required="false" />
        <field name="DocumentUSER" type="long" indexed="false" stored="true"
            required="false" />
        <field name="DocumentAuthor" type="string" indexed="false"
            stored="true" required="false" />
        <field name="DocumentSpace" type="long" indexed="false"
            stored="true" required="false" />
        <field name="DocumentTenant" type="long" indexed="false"
            stored="true" required="false" />
        <field name="DocumentDescription" type="text" indexed="false"
            stored="true" required="false" />
        <field name="META.Content-Type" type="string" indexed="false"
            stored="false" required="false" />
        <field name="DELETED" type="string" indexed="true" stored="false"
            required="false" />

        <!-- Extra Fields -->

        <!-- Indexed general text field -->
        <dynamicField name="*_text_i" type="text" indexed="true"
            stored="false" />
        <!-- Stored general text field -->
        <dynamicField name="*_text_s" type="text" indexed="false"
            stored="true" />
        <!-- Indexed and stored general text field -->
        <dynamicField name="*_text_is" type="text" indexed="true"
            stored="true" />
        <!-- Indexed general number field -->
        <dynamicField name="*_long_i" type="long" indexed="true"
            stored="false" />
        <!-- Stored general number field -->
        <dynamicField name="*_long_s" type="long" indexed="false"
            stored="true" />
        <!-- Indexed and stored general number field -->
        <dynamicField name="*_long_is" type="long" indexed="true"
            stored="true" />
        <!-- Indexed general date field -->
        <dynamicField name="*_date_i" type="long" indexed="true"
            stored="false" />
        <!-- Stored general date field -->
        <dynamicField name="*_date_s" type="long" indexed="false"
            stored="true" />
        <!-- Indexed and stored general date field -->
        <dynamicField name="*_date_is" type="long" indexed="true"
            stored="true" />
        <!-- Indexed general boolean field -->
        <dynamicField name="*_boolean_i" type="boolean" indexed="true"
            stored="false" />
        <!-- Stored general boolean field -->
        <dynamicField name="*_boolean_s" type="boolean" indexed="false"
            stored="true" />
        <!-- Indexed and stored general boolean field -->
        <dynamicField name="*_boolean_is" type="boolean" indexed="true"
            stored="true" />
        <!-- Indexed general multivalued number fields -->
        <dynamicField name="*_number_i" type="number" indexed="true"
            stored="false" />
        <!-- Stored general multivalued number fields -->
        <dynamicField name="*_number_s" type="number" indexed="false"
            stored="true" />
        <!-- Indexed and stored general multivalued number fields -->
        <dynamicField name="*_number_is" type="number" indexed="true"
            stored="true" />

        <!-- Fields that index tokens both normally and in reverse for
            efficient leading wildcard queries. -->
        <field name="DocumentDisplayName_rev" type="text_rev" indexed="true"
            stored="false" multiValued="false" />
        <field name="DocumentDescription_rev" type="text_rev" indexed="true"
            stored="false" multiValued="false" />
        <field name="DocumentTAGS_rev" type="tags_rev" indexed="true"
            stored="false" multiValued="true" />
        <field name="DocumentContent_rev" type="text_rev" indexed="true"
            stored="false" multiValued="true" />

        <!-- All other fields -->
        <dynamicField name="*" type="string" indexed="true"
            stored="false" />

    </fields>

    <!-- Field to use to determine and enforce document uniqueness. Unless
        this field is marked with required="false", it will be a required
        field -->
    <uniqueKey>UUID</uniqueKey>

    <!-- field for the QueryParser to use when an explicit fieldname is
        absent -->
    <defaultSearchField>DocumentDisplayName</defaultSearchField>

    <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
    <solrQueryParser defaultOperator="AND" />

    <!-- copyField commands copy one field to another at the time a document
        is added to the index. It's used either to index the same field
        differently, or to add multiple fields to the same field for
        easier/faster searching. -->
    <copyField source="DocumentDescription" dest="DocumentDescription_rev" />
    <copyField source="DocumentDisplayName" dest="DocumentDisplayName_rev" />
    <copyField source="DocumentTAGS" dest="DocumentTAGS_rev" />
    <copyField source="DocumentContent" dest="DocumentContent_rev" />

</schema>


What am I doing wrong?
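
For comparison, one common way to offer both accent-insensitive and
accent-sensitive search is to index the same source text into two fields:
one whose analyzer folds accents (like the "text" type above) and one whose
analyzer does not. The fragment below is only a sketch of that idea, not a
confirmed fix for this schema. The names text_accent and
DocumentDisplayName_accent are hypothetical and do not exist in the schema
above.

```xml
<!-- Hypothetical type, to be placed inside <types>: same tokenizer and
     lowercasing as "text", but without ASCIIFoldingFilterFactory, so
     accented and unaccented forms remain distinct tokens. -->
<fieldType name="text_accent" class="solr.TextField"
    positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>

<!-- Hypothetical field, to be placed inside <fields>: holds the
     accent-sensitive copy of the display name. -->
<field name="DocumentDisplayName_accent" type="text_accent"
    indexed="true" stored="false" />

<!-- To be placed next to the other copyField entries: copies the raw
     source value, which is then analyzed by the destination's type. -->
<copyField source="DocumentDisplayName" dest="DocumentDisplayName_accent" />
```

Under these assumptions, a query against DocumentDisplayName would ignore
accents, while a query against DocumentDisplayName_accent would respect
them. Existing documents would need to be reindexed for the new field to be
populated.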




On Wed, Aug 17, 2011 at 5:37 PM, Michael Ryan <mr...@moreover.com> wrote:

> Are you using the same analyzer for both type="query" and type="index"? Can
> you show us the fieldType from your schema?
>
> -Michael
>



-- 
Denis Wilson Souza Rosa
----------------------------------------------------
Systems Architect
mobile: +55 11 8112 8284
email: deniswsr...@gmail.com / deniswsr...@hotmail.com
