Hello everyone,

First off, sorry about the thread hijack earlier; it was not intentional.

Back to the point though: I'm having some issues getting Solr to work with our dataset. I'm using it to index ticket data for our technical support department. Below are a few of the problems I've been having; the wiki hasn't had much to say about them.

1) As an example, searching for "binarydata_groupdocument_fk" returns nothing, while searching for "BinaryData_GroupDocument_FK" returns results. I have the LowerCaseFilterFactory applied to both the index and query analyzers. Does this not actually set everything to lower case? The wiki at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters says "Creates tokens by lowercasing all letters and dropping non-letters", but that does not seem to be happening here. Am I forgetting to configure something?

2) Some of our data is one sentence. Some is over 5 MB of text. When searching for a term, the one-sentence documents come back first because the fieldNorm is so different (0.4 for one, 0.002 for others). Is there a way to disable using the fieldNorm in the score calculation? An alternative I tried was posting parts of the data as separate values of the field (so multiple tags with that field name in the add XML post), but that appeared to have zero effect on the results; even the query debugger showed the exact same calculation for the search. Does anyone know how to disable the fieldNorm, or how to have the score computed by adding the scores from each value of a multivalued field?

3) I discovered that searching for "certificate not found" (with the double quotes, as a phrase) did not return any results, even though the phrase exists in the data (and was lower case originally too, so this is different from my first issue). It turned out to be because of the stopword "not", but the same StopFilterFactory is applied to both the index and query analyzers. Am I doing something wrong there? As a workaround I have PHP manually removing stopwords from the query string, which is a real pain.
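
For reference, the stopword workaround from point 3 amounts to roughly the sketch below. The real code is PHP; this is a Python rendering for illustration only. The function name and the stopword set are assumptions made up for this example, not the actual contents of our stopwords.txt.

```python
import re

# Illustrative stopword set (an assumption); the real list lives in stopwords.txt.
STOPWORDS = {"a", "an", "and", "not", "of", "the", "to"}


def strip_stopwords(query: str) -> str:
    """Drop stopwords from a query string, keeping the double quotes around phrases."""
    # Split into double-quoted phrases and bare terms so the quotes survive.
    pieces = re.findall(r'"[^"]*"|\S+', query)
    cleaned = []
    for piece in pieces:
        quoted = len(piece) >= 2 and piece.startswith('"') and piece.endswith('"')
        words = piece.strip('"').split()
        # Case-insensitive comparison, mirroring ignoreCase="true" on the filter.
        kept = [w for w in words if w.lower() not in STOPWORDS]
        if not kept:
            continue  # the whole piece consisted of stopwords
        text = " ".join(kept)
        cleaned.append(f'"{text}"' if quoted else text)
    return " ".join(cleaned)


print(strip_stopwords('"certificate not found"'))  # prints: "certificate found"
```

The cleaned string is what then gets sent to Solr as the query, so the phrase no longer contains the stopword.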

I'm thinking my filters aren't being applied correctly, since this is similar to issue #1 but with a different filter. Here is the fieldType I do the actual searches on:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Any help or advice would be greatly appreciated, thanks!

-Reece