Thanks Jack,
On 08/20/2012 06:41 PM, Jack Krupansky wrote:
How are you ingesting the office documents? SolrCell, or some other
method?
I am using pytika, a Python module that uses Tika to extract the content.
I then add it to Solr using a Python library called sunburnt.
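The ingestion step described here could look roughly like the sketch below. It only builds the document dictionary. The field names are the ones from the schema quoted further down in this message; `build_solr_doc` and its parameters are hypothetical names, and the Tika-extracted text is assumed to already be available as a plain string.

```python
def build_solr_doc(docid, extracted_text, docnum=None, titel=None,
                   fsname=None, directory=None):
    """Build one Solr document dict using the field names from the schema.

    `extracted_text` is assumed to be the plain text Tika returned for the file.
    """
    doc = {
        "docid": str(docid),         # uniqueKey, required
        "fulltext": extracted_text,  # indexed but not stored
    }
    optional = {"docnum": docnum, "titel": titel,
                "fsname": fsname, "directory": directory}
    # Leave out optional fields that have no value instead of sending None.
    doc.update({k: v for k, v in optional.items() if v is not None})
    return doc


doc = build_solr_doc(42, "quarterly report text", titel="Report Q2")
# With sunburnt, this dict would then be handed to SolrInterface.add()
# followed by commit(); that part needs a running Solr and is omitted here.
```

Dropping empty optional fields keeps the document limited to values that actually exist for the file.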
Do you have CopyFields?
Yes, I have a copyField like this:
<copyField source="fulltext" dest="text"/>
What fields are you querying on?
On the fulltext field.
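To make explicit which field a query actually hits, the search term can be qualified with the field name. The sketch below only builds the request URL. The base URL (default example port, single core) and the helper name `build_select_url` are assumptions; `q`, `fl`, `wt` and `debugQuery` are standard Solr query parameters. With `debugQuery=on` the response includes the parsed query, which shows the field and terms Solr really searched.

```python
from urllib.parse import urlencode


def build_select_url(term, field="fulltext",
                     base="http://localhost:8983/solr/select"):
    """Return a select URL that searches `term` in one explicit field.

    Note: `term` is not escaped for Solr query syntax in this sketch.
    """
    params = {
        "q": "%s:%s" % (field, term),  # field-qualified query
        "fl": "docid",                 # only the document ID is needed
        "wt": "json",                  # ask for a JSON response
        "debugQuery": "on",            # include the parsed query in the reply
    }
    return base + "?" + urlencode(params)


url = build_select_url("invoice")
```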
What does your "text" field type look like?
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
thanks again
robert
-- Jack Krupansky
-----Original Message----- From: robert rottermann
Sent: Monday, August 20, 2012 10:39 AM
To: solr-user@lucene.apache.org
Cc: robert rottermann
Subject: solr finds allways all documents
Hi there,
I am new to Solr et al. Besides, I am a Java noob.
What I am doing:
I want to do full-text retrieval on office documents. The metadata of
these documents is maintained in PostgreSQL.
So the only information I need to get out of Solr is a document ID.
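Since only the document ID is needed, the client side can be reduced to pulling `docid` values out of the search response. The sketch below assumes the response was requested with `wt=json`; the helper name `extract_docids` and the sample response are made up for illustration. It also returns `numFound`, which can be compared against the total number of indexed documents to see whether a query really matches everything.

```python
import json


def extract_docids(response_text):
    """Return (numFound, list of docid values) from a Solr JSON response."""
    data = json.loads(response_text)
    response = data["response"]
    ids = [d["docid"] for d in response["docs"]]
    return response["numFound"], ids


# Made-up sample in the shape Solr uses for wt=json responses.
sample = ('{"responseHeader": {"status": 0}, '
          '"response": {"numFound": 2, "start": 0, '
          '"docs": [{"docid": "17"}, {"docid": "23"}]}}')
num_found, docids = extract_docids(sample)
```

The returned IDs can then be used to look up the metadata in PostgreSQL.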
My problem now is that my index seems to be built badly.
Nearly whatever I look up returns all documents.
I would be very glad if somebody could give me an idea of what I should
change.
thanks
Robert
What I am using is the sample configuration that comes with Solr 3.6.
I removed all the fields and added the following:
<fields>
  <field name="docid" type="string" indexed="true" stored="true"
         required="true"/>
  <field name="docnum" type="text" indexed="true" stored="true"
         required="false"/>
  <field name="titel" type="text" indexed="true" stored="true"
         required="false"/>
  <field name="fsname" type="text" indexed="true" stored="true"
         required="false"/>
  <field name="directory" type="text" indexed="true" stored="true"
         required="false"/>
  <field name="fulltext" type="text" indexed="true" stored="false"
         required="false"/>
  <dynamicField name="*" type="ignored" />
</fields>
<!-- Field to use to determine and enforce document uniqueness.
     Unless this field is marked with required="false", it will be a
     required field
-->
<uniqueKey>docid</uniqueKey>