Re: Missing tokens

paul . moran Thu, 19 Aug 2010 03:18:24 -0700

Great! Now I'm getting somewhere, this worked! The others didn't.

http://localhost/solr/select?q=contents:"OB10."


I hope this makes sense to you; I'm still somewhat confused by the output
here. I had 'highlight matches' checked, and from what I can tell, 'OB10'
wasn't found. When I entered 'OB10.' into the query, column 11 ('ob10.')
became highlighted in the 'LowerCaseFilterFactory' table.

Am I using the wrong analyser, or supplying the wrong parameters to an
analyser?

Thanks for your help so far!
Paul

Index Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
|  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|


org.apache.solr.analysis.StandardFilterFactory {}
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|  term text   |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
|  start,end   |    |       |     |            |     |     |     |      |        |     |      |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|   payload    |    |       |     |            |     |     |     |      |        |     |      |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|


org.apache.solr.analysis.LowerCaseFilterFactory {}
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|term position |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11   |12   |13      |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|  term text   |to  |produce|a    |downloadable|file |using|a    |format|suitable|for  |ob10.|8-26 |profiles|
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|  term type   |word|word   |word |word        |word |word |word |word  |word    |word |word |word |word    |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|    source    |0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64|65,69|70,78   |
|  start,end   |    |       |     |            |     |     |     |      |        |     |     |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
|   payload    |    |       |     |            |     |     |     |      |        |     |     |     |        |
|--------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|



Query Analyzer
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
|--------------+-----------|
|term position |1          |
|--------------+-----------|
|  term text   |OB10       |
|--------------+-----------|
|  term type   |word       |
|--------------+-----------|
|    source    |0,4        |
|  start,end   |           |
|--------------+-----------|
|   payload    |           |
|--------------+-----------|


org.apache.solr.analysis.StandardFilterFactory {}
|--------------+----------|
|term position |1         |
|--------------+----------|
|  term text   |OB10      |
|--------------+----------|
|  term type   |word      |
|--------------+----------|
|    source    |0,4       |
|  start,end   |          |
|--------------+----------|
|   payload    |          |
|--------------+----------|


org.apache.solr.analysis.LowerCaseFilterFactory {}
|--------------+-------------|
|term position |1            |
|--------------+-------------|
|  term text   |ob10         |
|--------------+-------------|
|  term type   |word         |
|--------------+-------------|
|    source    |0,4          |
|  start,end   |             |
|--------------+-------------|
|   payload    |             |
|--------------+-------------|
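
The tables above suggest why 'OB10' misses while 'OB10.' hits: splitting
on whitespace leaves the trailing period attached to the indexed token, so
the index holds 'ob10.' and the query produces 'ob10'. Below is a minimal
Java sketch of that behaviour. It assumes the chain acts like plain
whitespace splitting followed by lowercasing; the class and method names
are hypothetical, and this is an illustration rather than Solr's own code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WhitespaceLowercaseSketch {

    // Split on runs of whitespace only, then lowercase each token.
    // Punctuation is not stripped, so "OB10." stays one token: "ob10.".
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze(
            "To produce a downloadable file using a format suitable for OB10. 8-26 Profiles");

        // The 11th token keeps its trailing period.
        System.out.println(indexed);

        // A query term only matches if the exact same token is in the index.
        System.out.println(indexed.containsAll(analyze("OB10")));   // false
        System.out.println(indexed.containsAll(analyze("OB10.")));  // true
    }
}
```

If this is what is happening, the fix would be on the analysis side (a
tokenizer or filter that strips trailing punctuation at index time) rather
than in the query.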


I did look at ExtractingRequestHandler a while ago, but I don't think it
supported password-protected files. I just looked at it again, and it looks
like it does now.





From:   Jan Høydahl / Cominvent <jan....@cominvent.com>
To:     solr-user@lucene.apache.org
Date:   18/08/2010 23:16
Subject:        Re: Missing tokens



Cannot see anything obvious...

Try
http://localhost/solr/select?q=contents:OB10*
http://localhost/solr/select?q=contents:"OB 10"
http://localhost/solr/select?q=contents:"OB10."
http://localhost/solr/select?q=contents:ob10
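
A browser encodes the quotes and the space in the URLs above
automatically. When the same queries are sent from code, the q parameter
needs to be encoded by hand. Here is a minimal Java sketch, assuming Java
10 or later for URLEncoder.encode(String, Charset); the class and method
names are hypothetical, and the base URL is simply the one used in this
thread.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SolrQueryUrlSketch {

    // Base URL as used in this thread; adjust host and port for your install.
    static final String SELECT_URL = "http://localhost/solr/select";

    // Build a select URL with the q parameter form-encoded
    // (':' becomes %3A, '"' becomes %22, a space becomes '+').
    static String selectUrl(String query) {
        return SELECT_URL + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] queries = {
            "contents:OB10*",
            "contents:\"OB 10\"",
            "contents:\"OB10.\"",
            "contents:ob10"
        };
        for (String q : queries) {
            System.out.println(selectUrl(q));
        }
    }
}
```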

Also, go to the Analysis page in admin, type in your field name, enable
verbose output, copy and paste the problematic sentence into the "Index"
part, then enter OB10 in the "Query" part and see how your doc and query
get processed.

PS: Why don't you try this instead of doing the PDF extraction yourself?
http://wiki.apache.org/solr/ExtractingRequestHandler

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 16.25, paul.mo...@dds.net wrote:

> Here's my field description. I mentioned 'contents' field in my original
> post. I've changed it to a different field, 'summary'. It's using the
> 'text' fieldType as you can see below.
>
>   <field name="summary" type="text" indexed="true" stored="true"/>
>
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>                ignoreCase="true" expand="false"/>
>        -->
>        <!-- Case insensitive stop word removal.
>          enablePositionIncrements=true ensures that a 'gap' is left to
>          allow for accurate phrase queries.
>        -->
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>                ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>                words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>                catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
> I parsed the pdf using pdfbox. I can see my alphanumeric search term 'OB10'
> in the extracted text before I add it to the index. I can also go into Luke
> and see the 'OB10' in the contents of the 'summary' field even though Luke
> can't find it when I do a search.
>
> I can also use the browser to do a search in http://localhost/solr/admin
> and again that search term doesn't return any results. I thought it may be
> an alphanumeric word splitting issue, but that doesn't seem to be the case
> since I can search on ME26, and it returns a doc, and in fact, I can see
> the 'OB10' search term in the summary field of the doc returned.
>
> Here's a snippet of the summary field from that returned doc
>
> To produce a downloadable file using a format suitable
> for OB10. 8-26 Profiles
>
> I'm thinking that the extracted text from pdfbox may have hidden chars that
> solr can't parse. However, before I go down that road, I just want to be
> sure I'm not making schoolboy errors with my solr setup.
>
> thanks
> Paul
>
>
>
> From:          Jan Høydahl / Cominvent <jan....@cominvent.com>
> To:            solr-user@lucene.apache.org
> Date:          18/08/2010 11:56
> Subject:               Re: Missing tokens
>
>
>
> Hi,
>
> Can you share with us how your schema looks for this field? What FieldType?
> What tokenizer and analyser?
> How do you parse the PDF document? Before submitting to Solr? With what
> tool?
> How do you do the query? Do you get the same results when doing the query
> from a browser, not SolrJ?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:
>
>>
>> Hi, I'm having a problem with certain search terms not being found when I
>> do a query. I'm using Solrj to index a pdf document, and add the contents
>> to the 'contents' field. If I query the 'contents' field on the
>> SolrInputDocument doc object as below, I get 50k tokens.
>>
>> StringTokenizer to = new StringTokenizer((String) doc.getFieldValue("contents"));
>> System.out.println("Tokens:" + to.countTokens());
>>
>> However, once the doc is indexed and I use Luke to analyse the index, it
>> has only 3300 tokens in that field. Where did the other 47k go?
>>
>> I read some other threads that suggest increasing maxFieldLength in
>> solrconfig.xml, and my setting is below.
>>
>> <maxFieldLength>2147483647</maxFieldLength>
>>
>> Any advice is appreciated,
>> Paul
>>
>
>
>
