Hi,

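Your problem is right there in the WhitespaceTokenizer output: it splits only
on whitespace, so it does NOT strip away the trailing ".". The indexed term
becomes "ob10." while your query term is "ob10", and the two never match.
Try StandardTokenizerFactory instead, as it removes punctuation.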
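
To illustrate the mismatch, here is a minimal plain-Java sketch. It is not
Lucene or Solr code: the names TokenizerSketch, whitespaceTokens and
punctuationStrippedTokens are made up for this example, and the second method
is only a rough stand-in for a punctuation-aware tokenizer, not a
reimplementation of StandardTokenizer's rules. It tokenizes your sample
sentence both ways and lowercases the result, so you can check whether the
query term "ob10" is among the indexed tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: plain JDK code, NOT the Lucene/Solr analysis chain.
class TokenizerSketch {

    // Roughly what whitespace tokenization plus lowercasing does:
    // split on runs of whitespace and keep punctuation attached to the token.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for a punctuation-aware tokenizer: in addition,
    // strip leading and trailing non-alphanumeric characters from each token.
    static List<String> punctuationStrippedTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : whitespaceTokens(text)) {
            String stripped = t.replaceAll("^[^\\p{Alnum}]+|[^\\p{Alnum}]+$", "");
            if (!stripped.isEmpty()) {
                tokens.add(stripped);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String indexed = "To produce a downloadable file using a format suitable "
                + "for OB10. 8-26 Profiles";
        String queryTerm = "ob10";

        // Token 11 is "ob10." (with the dot), so the query term is not found.
        System.out.println(whitespaceTokens(indexed).contains(queryTerm));          // false
        // With edge punctuation stripped, token 11 is "ob10" and the term is found.
        System.out.println(punctuationStrippedTokens(indexed).contains(queryTerm)); // true
    }
}
```

The first check corresponds to what your Analysis page shows: term 11 is
"ob10." on the index side and "ob10" on the query side.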

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 19. aug. 2010, at 12.16, paul.mo...@dds.net wrote:

> Great! Now I'm getting somewhere, this worked! The others didn't.
> 
> http://localhost/solr/select?q=contents:"OB10."
> 
> Hope this makes sense to you. I'm still somewhat confused by the output
> here. I had 'highlight matches' checked, and from what I can tell, 'OB10'
> wasn't found. When I entered 'OB10.' as the query, column 11 'ob10.' became
> highlighted in the 'LowerCaseFilterFactory' table.
> 
> Am I using the wrong analyser, or supplying the wrong parameters to an
> analyser?
> 
> Thanks for your help so far!
> Paul
> 
> Index Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> 
> 
> org.apache.solr.analysis.StandardFilterFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
> 
> 
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11   |12   |13      |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |  term text   |to  |produce|a    |downloadable|file |using|a    |format|suitable|for  |ob10.|8-26 |profiles|
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word |word |word    |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64|65,69|70,78   |
> |  start,end   |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |   payload    |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> 
> 
> 
> Query Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> |--------------+-----------|
> |term position |1          |
> |--------------+-----------|
> |  term text   |OB10       |
> |--------------+-----------|
> |  term type   |word       |
> |--------------+-----------|
> |    source    |0,4        |
> |  start,end   |           |
> |--------------+-----------|
> |   payload    |           |
> |--------------+-----------|
> 
> 
> org.apache.solr.analysis.StandardFilterFactory {}
> |--------------+----------|
> |term position |1         |
> |--------------+----------|
> |  term text   |OB10      |
> |--------------+----------|
> |  term type   |word      |
> |--------------+----------|
> |    source    |0,4       |
> |  start,end   |          |
> |--------------+----------|
> |   payload    |          |
> |--------------+----------|
> 
> 
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> |--------------+-------------|
> |term position |1            |
> |--------------+-------------|
> |  term text   |ob10         |
> |--------------+-------------|
> |  term type   |word         |
> |--------------+-------------|
> |    source    |0,4          |
> |  start,end   |             |
> |--------------+-------------|
> |   payload    |             |
> |--------------+-------------|
> 
> 
> I did look at ExtractingRequestHandler a while ago, but I don't think it
> supported password-protected files then. I just looked at it again, and it
> looks like it does now.
> 
> 
> 
> 
> 
> From: Jan Høydahl / Cominvent <jan....@cominvent.com>
> To:   solr-user@lucene.apache.org
> Date: 18/08/2010 23:16
> Subject:      Re: Missing tokens
> 
> 
> 
> Cannot see anything obvious...
> 
> Try
> http://localhost/solr/select?q=contents:OB10*
> http://localhost/solr/select?q=contents:"OB 10"
> http://localhost/solr/select?q=contents:"OB10."
> http://localhost/solr/select?q=contents:ob10
> 
> Also, go to the Analysis page in admin, type in your field name, enable
> verbose output, copy and paste the problematic sentence into the "Index" part,
> then enter OB10 in the "Query" part, and see how your doc and query
> get processed.
> 
> PS: Why don't you try this instead of doing the PDF extraction yourself:
> http://wiki.apache.org/solr/ExtractingRequestHandler ?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
> 
> On 18. aug. 2010, at 16.25, paul.mo...@dds.net wrote:
> 
>> Here's my field description. I mentioned 'contents' field in my original
>> post. I've changed it to a different field, 'summary'. It's using the
>> 'text' fieldType as you can see below.
>> 
>>  <field name="summary" type="text" indexed="true" stored="true"/>
>> 
>> 
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <!-- in this example, we will only use synonyms at query time
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>             ignoreCase="true" expand="false"/>
>>     -->
>>     <!-- Case insensitive stop word removal.
>>          enablePositionIncrements=true ensures that a 'gap' is left to
>>          allow for accurate phrase queries.
>>     -->
>>     <filter class="solr.StopFilterFactory"
>>             ignoreCase="true"
>>             words="stopwords.txt"
>>             enablePositionIncrements="true"
>>             />
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1"
>>             catenateWords="1" catenateNumbers="1"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="1" generateNumberParts="1"
>>             catenateWords="0" catenateNumbers="0"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>> 
>> I parsed the PDF using PDFBox. I can see my alphanumeric search term 'OB10'
>> in the extracted text before I add it to the index. I can also go into Luke
>> and see 'OB10' in the contents of the 'summary' field, even though Luke
>> can't find it when I do a search.
>> 
>> I can also use the browser to do a search in http://localhost/solr/admin
>> and again that search term doesn't return any results. I thought it might be
>> an alphanumeric word-splitting issue, but that doesn't seem to be the case,
>> since I can search on ME26 and it returns a doc; in fact, I can see
>> the 'OB10' search term in the summary field of the doc returned.
>> 
>> Here's a snippet of the summary field from that returned doc
>> 
>> To produce a downloadable file using a format suitable
>> for OB10. 8-26 Profiles
>> 
>> I'm thinking that the extracted text from PDFBox may have hidden chars that
>> Solr can't parse. However, before I go down that road, I just want to be
>> sure I'm not making schoolboy errors with my Solr setup.
>> 
>> thanks
>> Paul
>> 
>> 
>> 
>> From:                 Jan Høydahl / Cominvent <jan....@cominvent.com>
>> To:           solr-user@lucene.apache.org
>> Date:                 18/08/2010 11:56
>> Subject:              Re: Missing tokens
>> 
>> 
>> 
>> Hi,
>> 
>> Can you share with us how your schema looks for this field? What FieldType?
>> What tokenizer and analyser?
>> How do you parse the PDF document? Before submitting to Solr? With what
>> tool?
>> How do you do the query? Do you get the same results when doing the query
>> from a browser, not SolrJ?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>> 
>> On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:
>> 
>>> 
>>> Hi, I'm having a problem with certain search terms not being found when I
>>> do a query. I'm using SolrJ to index a PDF document, and add the contents
>>> to the 'contents' field. If I query the 'contents' field on the
>>> SolrInputDocument doc object as below, I get 50k tokens.
>>> 
>>> StringTokenizer to = new StringTokenizer((String) doc.getFieldValue("contents"));
>>> System.out.println("Tokens:" + to.countTokens());
>>> 
>>> However, once the doc is indexed and I use Luke to analyse the index, it
>>> has only 3300 tokens in that field. Where did the other 47k go?
>>> 
>>> I read some other threads that suggested increasing maxFieldLength in
>>> solrconfig.xml, and my setting is below.
>>> 
>>> <maxFieldLength>2147483647</maxFieldLength>
>>> 
>>> Any advice is appreciated,
>>> Paul
>>> 
>> 
>> 
>> 
> 
> 
> 
