Re: Missing tokens

Jan Høydahl / Cominvent Wed, 18 Aug 2010 15:16:40 -0700

Cannot see anything obvious...

Try
http://localhost/solr/select?q=contents:OB10*
http://localhost/solr/select?q=contents:"OB 10"
http://localhost/solr/select?q=contents:"OB10."
http://localhost/solr/select?q=contents:ob10
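
Some browsers and HTTP clients will not send the quotes and spaces in the phrase queries above unchanged, so it can help to percent-encode the q parameter explicitly. A minimal sketch, assuming Java 10 or later; the QueryUrls class and selectUrl method are hypothetical helpers for illustration, not part of Solr or SolrJ, and the base URL is simply the one used in the examples above:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Solr or SolrJ.
class QueryUrls {
    // Assumed base URL, copied from the example queries above.
    static final String BASE = "http://localhost/solr/select?q=";

    // Percent-encode the raw query so quotes, colons and spaces survive the request.
    static String selectUrl(String rawQuery) {
        return BASE + URLEncoder.encode(rawQuery, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] queries = {
            "contents:OB10*",
            "contents:\"OB 10\"",
            "contents:\"OB10.\"",
            "contents:ob10"
        };
        // Print one ready-to-paste URL per query variant.
        for (String q : queries) {
            System.out.println(selectUrl(q));
        }
    }
}
```

Replace contents with whichever field you actually index the text into.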


Also, go to the Analysis page in admin, type in your field name, enable 
verbose output, paste the problematic sentence into the "Index" part, 
enter OB10 in the "Query" part, and see how your doc and query get 
processed.
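
If a hand-typed sentence analyses as expected but the indexed document still does not match, the extracted text may contain characters that are not visible when printed. A minimal sketch for checking this, assuming you have the extracted sentence as a Java String; the CodePointDump class and describe method are hypothetical helpers, not part of Solr or PDFBox, and the zero-width space in the sample is only an illustration, not a diagnosis of this document:

```java
// Hypothetical helper for illustration; not part of Solr or PDFBox.
class CodePointDump {
    // List every code point as U+XXXX, appending '!' to anything
    // outside printable ASCII so hidden characters stand out.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            sb.append(String.format("U+%04X", cp));
            if (cp < 0x20 || cp > 0x7E) {
                sb.append('!');
            }
            sb.append(' ');
        });
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Illustrative input only: a zero-width space hidden inside the term.
        String sample = "for OB\u200B10. 8-26";
        System.out.println(describe(sample));
    }
}
```

Running describe over the text around the problematic term, just before it is added to the document, shows whether anything unexpected sits between the letters and digits.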

PS: Why don't you try this instead of doing the PDF extraction yourself?
http://wiki.apache.org/solr/ExtractingRequestHandler

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 16.25, paul.mo...@dds.net wrote:

> Here's my field description. I mentioned 'contents' field in my original
> post. I've changed it to a different field, 'summary'. It's using the
> 'text' fieldType as you can see below.
> 
>   <field name="summary" type="text" indexed="true" stored="true"/>
> 
> 
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
>                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <!-- Case insensitive stop word removal.
>             enablePositionIncrements=true ensures that a 'gap' is left to
>             allow for accurate phrase queries.
>        -->
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1" generateNumberParts="1"
>                catenateWords="1" catenateNumbers="1"
>                catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
>                protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory"
>                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true" words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="1" generateNumberParts="1"
>                catenateWords="0" catenateNumbers="0"
>                catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
>                protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> I parsed the pdf using pdfbox. I can see my alphanumeric search term 'OB10'
> in the extracted text before I add it to the index. I can also go into Luke
> and see the 'OB10' in the contents of the 'summary' field even though Luke
> can't find it when I do a search.
> 
> I can also use the browser to do a search in http://localhost/solr/admin
> and again that search term doesn't return any results. I thought it may be
> an alphanumeric word-splitting issue, but that doesn't seem to be the case
> since I can search on ME26, and it returns a doc, and in fact, I can see
> the 'OB10' search term in the summary field of the doc returned.
> 
> Here's a snippet of the summary field from that returned doc
> 
> To produce a downloadable file using a format suitable
> for OB10. 8-26 Profiles
> 
> I'm thinking that the extracted text from pdfbox may have hidden chars that
> solr can't parse. However, before I go down that road, I just want to be
> sure I'm not making schoolboy errors with my solr setup.
> 
> thanks
> Paul
> 
> 
> 
> From: Jan Høydahl / Cominvent <jan....@cominvent.com>
> To:   solr-user@lucene.apache.org
> Date: 18/08/2010 11:56
> Subject:      Re: Missing tokens
> 
> 
> 
> Hi,
> 
> Can you share with us how your schema looks for this field? What FieldType?
> What tokenizer and analyser?
> How do you parse the PDF document? Before submitting to Solr? With what
> tool?
> How do you do the query? Do you get the same results when doing the query
> from a browser, not SolrJ?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
> 
> On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:
> 
>> 
>> Hi, I'm having a problem with certain search terms not being found when I
>> do a query. I'm using Solrj to index a pdf document, and add the contents
>> to the 'contents' field. If I query the 'contents' field on the
>> SolrInputDocument doc object as below, I get 50k tokens.
>> 
>> StringTokenizer to = new StringTokenizer((String) doc.getFieldValue("contents"));
>> System.out.println("Tokens:" + to.countTokens());
>> 
>> However, once the doc is indexed and I use Luke to analyse the index, it
>> has only 3300 tokens in that field. Where did the other 47k go?
>> 
>> I read some other threads suggesting an increase to maxFieldLength in
>> solrconfig.xml, and my setting is below.
>> 
>> <maxFieldLength>2147483647</maxFieldLength>
>> 
>> Any advice is appreciated,
>> Paul
>> 
> 
> 
>
