Hi all,

I was trying out the MoreLikeThis support, and getting some odd results.

I realized that unless the fields being used for the similarity calculation have a stored term vector, the MoreLikeThis code from Lucene will re-analyze the field using the StandardAnalyzer, which, in my case, is quite different from the analyzer I'm using in the Solr schema.

So the first note is just for anybody using MoreLikeThis: make sure you also specify termVectors="true" in the Solr schema for any fields being passed to the query in the mlt.fl parameter.
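
As a rough sketch, a field set up this way might look like the following. The field name "description" and the type "text" are just placeholders for illustration, not something from my actual schema:

   <!-- hypothetical field; name and type are placeholders -->
   <field name="description" type="text" indexed="true" stored="true"
          termVectors="true"/>

With that in place, a query passing mlt.fl=description should be able to use the stored term vector instead of re-analyzing the field contents.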

The second note is that the Wiki page and the example schema might want to include some reference to the termVectors field attribute. For example, the sample schema says:

   <!-- Valid attributes for fields:
     name: mandatory - the name for the field
     type: mandatory - the name of a previously defined type from the <types> section
     indexed: true if this field should be indexed (searchable or sortable)
     stored: true if this field should be retrievable
     compressed: [false] if this field should be stored using gzip compression
       (this will only apply if the field type is compressable; among
       the standard field types, only TextField and StrField are)
     multiValued: true if this field may contain multiple values per document
     omitNorms: (expert) set to true to omit the norms associated with
       this field (this disables length normalization and index-time
       boosting for the field, and saves some memory).  Only full-text
       fields or fields that need an index-time boost need norms.

That initially made me think these were the only valid attributes for fields. Likewise, the wiki page at http://wiki.apache.org/solr/SchemaXml doesn't make any mention of termVectors, termPositions, or termOffsets. I would edit that page, but there currently isn't a section that talks about all the attributes, only the common ones.

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"