Great! Now I'm getting somewhere. This one worked; the others didn't:

http://localhost/solr/select?q=contents:"OB10."

Hope this makes sense to you. I'm still somewhat confused by the output here.
I had 'highlight matches' checked, and from what I can tell, 'OB10' wasn't
found. When I enter 'OB10.' into the query, column 11 'ob10.' becomes
highlighted in the 'LowerCaseFilterFactory' table.

Am I using the wrong analyser, or supplying the wrong parameters to an
analyser?

Thanks for your help so far!
Paul

Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term position   |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term text       |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
|term type       |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
|source start,end|0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
|payload         |    |       |     |            |     |     |     |      |        |     |      |     |        |
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|

org.apache.solr.analysis.StandardFilterFactory {}
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term position   |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term text       |To  |produce|a    |downloadable|file |using|a    |format|suitable|for  |OB10. |8-26 |Profiles|
|term type       |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
|source start,end|0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
|payload         |    |       |     |            |     |     |     |      |        |     |      |     |        |
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|

org.apache.solr.analysis.LowerCaseFilterFactory {}
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term position   |1   |2      |3    |4           |5    |6    |7    |8     |9       |10   |11    |12   |13      |
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|
|term text       |to  |produce|a    |downloadable|file |using|a    |format|suitable|for  |ob10. |8-26 |profiles|
|term type       |word|word   |word |word        |word |word |word |word  |word    |word |word  |word |word    |
|source start,end|0,2 |3,10   |11,12|13,25       |26,30|31,36|37,38|39,45 |46,54   |55,58|59,64 |65,69|70,78   |
|payload         |    |       |     |            |     |     |     |      |        |     |      |     |        |
|----------------+----+-------+-----+------------+-----+-----+-----+------+--------+-----+------+-----+--------|

Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
|----------------+------|
|term position   |1     |
|----------------+------|
|term text       |OB10  |
|term type       |word  |
|source start,end|0,4   |
|payload         |      |
|----------------+------|

org.apache.solr.analysis.StandardFilterFactory {}
|----------------+------|
|term position   |1     |
|----------------+------|
|term text       |OB10  |
|term type       |word  |
|source start,end|0,4   |
|payload         |      |
|----------------+------|

org.apache.solr.analysis.LowerCaseFilterFactory {}
|----------------+------|
|term position   |1     |
|----------------+------|
|term text       |ob10  |
|term type       |word  |
|source start,end|0,4   |
|payload         |      |
|----------------+------|

I did look at ExtractingRequestHandler a while ago, but I don't think it
supported password protected files. Just looked at it again, and it looks
like it does now.


From:    Jan Høydahl / Cominvent <jan....@cominvent.com>
To:      solr-user@lucene.apache.org
Date:    18/08/2010 23:16
Subject: Re: Missing tokens


Cannot see anything obvious...

Try
http://localhost/solr/select?q=contents:OB10*
http://localhost/solr/select?q=contents:"OB 10"
http://localhost/solr/select?q=contents:"OB10."
http://localhost/solr/select?q=contents:ob10

Also, go to the Analysis page in admin, type in your field name, enable
verbose output, copy and paste the problematic sentence into the "Index" part,
then enter OB10 in the "Query" part, and see how your doc and query get
processed.

PS: Why don't you try this instead of doing the PDF extraction yourself:
http://wiki.apache.org/solr/ExtractingRequestHandler ??

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 16.25, paul.mo...@dds.net wrote:

> Here's my field description.
> I mentioned 'contents' field in my original post. I've changed it to a
> different field, 'summary'. It's using the 'text' fieldType as you can see
> below.
>
> <field name="summary" type="text" indexed="true" stored="true"/>
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!-- in this example, we will only use synonyms at query time
>     <filter class="solr.SynonymFilterFactory"
>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>     -->
>     <!-- Case insensitive stop word removal.
>          enablePositionIncrements=true ensures that a 'gap' is left to
>          allow for accurate phrase queries.
>     -->
>     <filter class="solr.StopFilterFactory"
>             ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"
>             />
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="0" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> I parsed the pdf using pdfbox.
> I can see my alphanumeric search term 'OB10' in the extracted text before I
> add it to the index. I can also go into Luke and see the 'OB10' in the
> contents of the 'summary' field even though Luke can't find it when I do a
> search.
>
> I can also use the browser to do a search in http://localhost/solr/admin
> and again that search term doesn't return any results. I thought it might be
> an alphanumeric word splitting issue, but that doesn't seem to be the case
> since I can search on ME26, and it returns a doc, and in fact, I can see
> the 'OB10' search term in the summary field of the doc returned.
>
> Here's a snippet of the summary field from that returned doc
>
> To produce a downloadable file using a format suitable
> for OB10. 8-26 Profiles
>
> I'm thinking that the extracted text from pdfbox may have hidden chars that
> solr can't parse. However, before I go down that road, I just want to be
> sure I'm not making schoolboy errors with my solr setup.
>
> thanks
> Paul
>
>
> From:    Jan Høydahl / Cominvent <jan....@cominvent.com>
> To:      solr-user@lucene.apache.org
> Date:    18/08/2010 11:56
> Subject: Re: Missing tokens
>
>
> Hi,
>
> Can you share with us how your schema looks for this field? What FieldType?
> What tokenizer and analyser?
> How do you parse the PDF document? Before submitting to Solr? With what
> tool?
> How do you do the query? Do you get the same results when doing the query
> from a browser, not SolrJ?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:
>
>>
>> Hi, I'm having a problem with certain search terms not being found when I
>> do a query. I'm using Solrj to index a pdf document, and add the contents
>> to the 'contents' field. If I query the 'contents' field on the
>> SolrInputDocument doc object as below, I get 50k tokens.
>>
>> StringTokenizer to = new StringTokenizer((String) doc.getFieldValue("contents"));
>> System.out.println("Tokens:" + to.countTokens());
>>
>> However, once the doc is indexed and I use Luke to analyse the index, it
>> has only 3300 tokens in that field. Where did the other 47k go?
>>
>> I read some other threads that suggest increasing the maxFieldLength in
>> solrconfig.xml, and my setting is below.
>>
>> <maxFieldLength>2147483647</maxFieldLength>
>>
>> Any advice is appreciated,
>> Paul
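
A note on the analysis output at the top of this thread: the index side ends up with the term 'ob10.' while the query side produces 'ob10', and a plain term query only hits when the two are identical. Below is a minimal sketch of that comparison, assuming the field is analysed by nothing more than whitespace splitting plus lower-casing, as the tables suggest. The class and method names (TokenMismatchSketch, analyze, matches) are made up for illustration; this is plain Java, not Solr or Lucene code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenMismatchSketch {

    // Simplified stand-in (an assumption, not Solr code) for the
    // WhitespaceTokenizerFactory + LowerCaseFilterFactory chain:
    // split on whitespace, then lower-case every token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A query matches here only if every analysed query term is
    // exactly equal to some analysed term of the indexed text.
    static boolean matches(String indexedText, String query) {
        return analyze(indexedText).containsAll(analyze(query));
    }

    public static void main(String[] args) {
        String summary =
            "To produce a downloadable file using a format suitable for OB10. 8-26 Profiles";

        // The trailing full stop stays attached to the indexed token.
        System.out.println(analyze(summary));

        System.out.println("OB10  -> " + matches(summary, "OB10"));  // false
        System.out.println("OB10. -> " + matches(summary, "OB10.")); // true
    }
}
```

If the assumption holds, this would be consistent with contents:"OB10." being the only query that returned the doc. It says nothing about the 'text' fieldType quoted above, whose filter chain is different from the one shown on the analysis page.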