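Hi,

The problem is visible in the WhitespaceTokenizer output: it splits on whitespace only, so the trailing "." stays attached and the indexed term ends up as "ob10." rather than "ob10". Your query for OB10 is analysed to "ob10", which does not match that term. Try StandardTokenizerFactory instead, as it strips punctuation such as the trailing period.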
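To illustrate the mismatch, here is a minimal sketch in plain Java. It does not use any Lucene or Solr classes: the class and method names (TokenMismatchDemo, whitespaceTokens, punctuationStrippedTokens) are made up for this example, and the regex only approximates what a punctuation-aware tokenizer does with a trailing period.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of Solr or Lucene.
public class TokenMismatchDemo {

    // Split on whitespace only, then lowercase each token.
    // Punctuation stays attached, so "OB10." becomes "ob10.".
    public static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Same as above, but also strip leading and trailing punctuation
    // (an assumption standing in for a punctuation-aware tokenizer).
    public static List<String> punctuationStrippedTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : whitespaceTokens(text)) {
            String stripped = t.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!stripped.isEmpty()) {
                tokens.add(stripped);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String indexed = "To produce a downloadable file using a format suitable for OB10. 8-26 Profiles";
        String query = "OB10";

        // Whitespace-only analysis: the index holds "ob10.", the query is "ob10".
        System.out.println(whitespaceTokens(indexed).containsAll(whitespaceTokens(query)));
        // prints "false"

        // With punctuation stripped on both sides the terms line up.
        System.out.println(punctuationStrippedTokens(indexed).containsAll(punctuationStrippedTokens(query)));
        // prints "true"
    }
}
```

The point is only that index-time and query-time analysis must produce the same term for a match; the real fix is in the tokenizer configured for the field.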
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 19. aug. 2010, at 12.16, paul.mo...@dds.net wrote:

> Great! Now I'm getting somewhere, this worked! The others didn't.
>
> http://localhost/solr/select?q=contents:"OB10."
>
> Hope this makes sense to you. I'm still somewhat confused by the output
> here. I had 'highlight matches' checked, and from what I can tell, 'OB10'
> wasn't found. When I enter 'OB10.' into the query, column 11 'ob10.' became
> highlighted in the 'LowerCaseFilterFactory' table.
>
> Am I using the wrong analyser, or supplying the wrong parameters to an
> analyser?
>
> Thanks for your help so far!
> Paul
>
> Index Analyzer
>
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term position   |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11   |12   |13      |
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term text       |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10.|8-26 |Profiles|
> |term type       |word|word   |word |word        |word |word |word |word  |word    |word |word |word |word    |
> |source start,end|0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64|65,69|70,78   |
> |payload         |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
>
> org.apache.solr.analysis.StandardFilterFactory {}
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term position   |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11   |12   |13      |
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term text       |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10.|8-26 |Profiles|
> |term type       |word|word   |word |word        |word |word |word |word  |word    |word |word |word |word    |
> |source start,end|0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64|65,69|70,78   |
> |payload         |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
>
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term position   |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11   |12   |13      |
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
> |term text       |to  |produce|a    |downloadable|file |using|a    |format|suitable|for  |ob10.|8-26 |profiles|
> |term type       |word|word   |word |word        |word |word |word |word  |word    |word |word |word |word    |
> |source start,end|0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64|65,69|70,78   |
> |payload         |    |       |     |            |     |     |     |      |        |     |     |     |        |
> |----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+-----+-----+--------|
>
> Query Analyzer
>
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> |----------------+----|
> |term position   |1   |
> |----------------+----|
> |term text       |OB10|
> |term type       |word|
> |source start,end|0,4 |
> |payload         |    |
> |----------------+----|
>
> org.apache.solr.analysis.StandardFilterFactory {}
> |----------------+----|
> |term position   |1   |
> |----------------+----|
> |term text       |OB10|
> |term type       |word|
> |source start,end|0,4 |
> |payload         |    |
> |----------------+----|
>
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> |----------------+----|
> |term position   |1   |
> |----------------+----|
> |term text       |ob10|
> |term type       |word|
> |source start,end|0,4 |
> |payload         |    |
> |----------------+----|
>
> I did look at ExtractingRequestHandler a while ago, but I don't think it
> supported password protected files. Just looked at it again, and it looks
> like it does now.
>
> From: Jan Høydahl / Cominvent <jan....@cominvent.com>
> To: solr-user@lucene.apache.org
> Date: 18/08/2010 23:16
> Subject: Re: Missing tokens
>
> Cannot see anything obvious...
>
> Try
> http://localhost/solr/select?q=contents:OB10*
> http://localhost/solr/select?q=contents:"OB 10"
> http://localhost/solr/select?q=contents:"OB10."
> http://localhost/solr/select?q=contents:ob10
>
> Also, go to the Analysis page in admin, type in your field name, enable
> verbose output, copy and paste the problematic sentence into the "Index"
> part, then enter OB10 in the "Query" part, and see how your doc and query
> get processed.
>
> PS: Why don't you try this instead of doing the PDF extraction yourself:
> http://wiki.apache.org/solr/ExtractingRequestHandler ?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 18. aug. 2010, at 16.25, paul.mo...@dds.net wrote:
>
>> Here's my field description. I mentioned the 'contents' field in my
>> original post. I've changed it to a different field, 'summary'. It's using
>> the 'text' fieldType as you can see below.
>>
>> <field name="summary" type="text" indexed="true" stored="true"/>
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <!-- in this example, we will only use synonyms at query time
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>     -->
>>     <!-- Case insensitive stop word removal.
>>          enablePositionIncrements=true ensures that a 'gap' is left to
>>          allow for accurate phrase queries.
>>     -->
>>     <filter class="solr.StopFilterFactory"
>>             ignoreCase="true"
>>             words="stopwords.txt"
>>             enablePositionIncrements="true"
>>             />
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> I parsed the pdf using pdfbox. I can see my alphanumeric search term 'OB10'
>> in the extracted text before I add it to the index. I can also go into Luke
>> and see the 'OB10' in the contents of the 'summary' field even though Luke
>> can't find it when I do a search.
>>
>> I can also use the browser to do a search in http://localhost/solr/admin
>> and again that search term doesn't return any results. I thought it may be
>> an alphanumeric word splitting issue, but that doesn't seem to be the case
>> since I can search on ME26, and it returns a doc, and in fact, I can see
>> the 'OB10' search term in the summary field of the doc returned.
>>
>> Here's a snippet of the summary field from that returned doc:
>>
>> To produce a downloadable file using a format suitable
>> for OB10. 8-26 Profiles
>>
>> I'm thinking that the extracted text from pdfbox may have hidden chars that
>> solr can't parse. However, before I go down that road, I just want to be
>> sure I'm not making schoolboy errors with my solr setup.
>>
>> thanks
>> Paul
>>
>> From: Jan Høydahl / Cominvent <jan....@cominvent.com>
>> To: solr-user@lucene.apache.org
>> Date: 18/08/2010 11:56
>> Subject: Re: Missing tokens
>>
>> Hi,
>>
>> Can you share with us how your schema looks for this field? What FieldType?
>> What tokenizer and analyser?
>> How do you parse the PDF document? Before submitting to Solr? With what
>> tool?
>> How do you do the query? Do you get the same results when doing the query
>> from a browser, not SolrJ?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Training in Europe - www.solrtraining.com
>>
>> On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:
>>
>>> Hi, I'm having a problem with certain search terms not being found when I
>>> do a query. I'm using Solrj to index a pdf document, and add the contents
>>> to the 'contents' field. If I query the 'contents' field on the
>>> SolrInputDocument doc object as below, I get 50k tokens.
>>>
>>> StringTokenizer to = new StringTokenizer((String)doc.getFieldValue("contents"));
>>> System.out.println( "Tokens:" + to.countTokens() );
>>>
>>> However, once the doc is indexed and I use Luke to analyse the index, it
>>> has only 3300 tokens in that field. Where did the other 47k go?
>>>
>>> I read some other threads that mention increasing maxFieldLength in
>>> solrconfig.xml, and my setting is below.
>>>
>>> <maxFieldLength>2147483647</maxFieldLength>
>>>
>>> Any advice is appreciated,
>>> Paul