Using terms and N-gram

openvictor Open Wed, 02 Feb 2011 21:02:36 -0800

Dear all,

I am trying to implement an autocomplete system for research, but I am stuck
on a problem that I can't solve.


Here is my problem:
I give Solr text like "the cat is black", and I want to index every word
1-gram to 8-gram of each text that is passed in:
the, cat, is, black, the cat, cat is, is black, etc.
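
To make the expected output concrete, here is a minimal Python sketch of the
terms I would like to end up with. This is not Solr code; the function name
word_ngrams and its parameters are only for illustration, and splitting on
whitespace is a simplifying assumption.

```python
def word_ngrams(text, min_n=1, max_n=8):
    """Return every run of min_n..max_n consecutive words, shortest first."""
    # Assumption: lowercase and split on whitespace, roughly what a
    # lowercasing tokenizer would do for plain ASCII text.
    words = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        # When n exceeds the number of words this range is empty,
        # so no gram of that size is produced.
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


print(word_ngrams("the cat is black"))
```

For "the cat is black" this gives the four single words, three 2-grams, two
3-grams and the full 4-word phrase.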

To do that, I have defined the following field type in my schema:

    <!--Custom fieldtype-->
    <fieldType name="ngram_field" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory" />
        <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
                ignoreCase="true" maxGramSize="8" minGramSize="1"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory" />
        <filter class="solr.CommonGramsFilterFactory" ignoreCase="true"
                maxGramSize="8" minGramSize="1"/>
      </analyzer>
    </fieldType>


Then the following field:

    <field name="p_title_ngram" type="ngram_field" indexed="true"
           stored="true"/>

Then I fed Solr some phrases and was really surprised to see that it
didn't behave as expected.
I went to the schema browser to see the result for the very profound query:
"the cat is black and it rains"

The results are quite disappointing: first, the 1-grams are not found. Some
2-grams are found, like "the_cat", "and_it", etc., but not what I expected.
Is there something I am missing here? (By the way, I also tried removing
the minGramSize and maxGramSize attributes, and even the words attribute.)

Thank you,
Victor Kabdebon
