@Mahout experts: could you please elaborate on that? The stopwords mechanism in Solr seems to work for me (I get no search results when querying for stopwords through the localhost/solr/select interface), but it somehow has no effect when the Solr index is converted to vectors by the org.apache.mahout.utils.vectors.lucene.Driver class. As a result, I get clusters that contain (and are even mainly driven by) the stopwords. I am still not an expert in reading from a Lucene index: is it possible that the vector generation does some "raw" reading of the Solr/Lucene index and thus picks up the stopwords?

Best regards,
Bogdan

On Sun, Jan 3, 2010 at 3:51 AM, Lance Norskog <goks...@gmail.com> wrote:

> Fields are both stored and indexed. The stored copy is exactly what
> you sent in. The index is built with the "text" type's analysis stack
> and is not stored. This output has the stopwords removed. The output
> is not stored in one place, but parts of it are scattered around the
> Lucene index data structures. When you search for one of these
> stopwords, you should not get any documents.
>
> On Sat, Jan 2, 2010 at 5:20 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> > Hi,
> >
> > I am using a default (example) configuration of Solr and there the
> > stopwording seems to be enabled for both indexing and querying of
> > fields of type "text".
> > I have a custom field which is of the "text" type.
> > I have extended the stopwords.txt file with lots of words but when I
> > index some documents the index contains stopwords - I can see this
> > with the Luke tool.
> > Am I supposed to see these terms in the index after they are declared
> > in the stopwords.txt file?
> > What could be wrong?
> >
> > Best regards,
> > Bogdan
>
> --
> Lance Norskog
> goks...@gmail.com

--
Best regards,
Bogdan