Missing tokens
Hi,

I'm having a problem with certain search terms not being found when I run a query. I'm using SolrJ to index a PDF document and add its contents to the 'contents' field.

If I count the tokens in the 'contents' field of the SolrInputDocument doc object as below, I get 50k tokens.

    StringTokenizer to = new StringTokenizer((String) doc.getFieldValue("contents"));
    System.out.println("Tokens:" + to.countTokens());

However, once the doc is indexed and I use Luke to analyse the index, it has only 3300 tokens in that field. Where did the other 47k go?

I read some other threads that suggest increasing maxFieldLength in solrconfig.xml, and my setting is below.

    2147483647

Any advice is appreciated,
Paul
Re: Missing tokens
Here's my field description. I mentioned the 'contents' field in my original post. I've changed it to a different field, 'summary'. It's using the 'text' fieldType as you can see below.

I parsed the PDF using pdfbox. I can see my alphanumeric search term 'OB10' in the extracted text before I add it to the index. I can also go into Luke and see 'OB10' in the contents of the 'summary' field, even though Luke can't find it when I do a search. I can also use the browser to do a search at http://localhost/solr/admin and again that search term doesn't return any results.

I thought it might be an alphanumeric word-splitting issue, but that doesn't seem to be the case, since I can search on ME26 and it returns a doc. In fact, I can see the 'OB10' search term in the summary field of the doc returned. Here's a snippet of the summary field from that returned doc:

    To produce a downloadable file using a format suitable for OB10. 8-26 Profiles

I'm thinking that the extracted text from pdfbox may have hidden chars that Solr can't parse. However, before I go down that road, I just want to be sure I'm not making schoolboy errors with my Solr setup.

thanks
Paul

From: Jan Høydahl / Cominvent
To: solr-user@lucene.apache.org
Date: 18/08/2010 11:56
Subject: Re: Missing tokens

Hi,

Can you share with us how your schema looks for this field? What FieldType? What tokenizer and analyser?

How do you parse the PDF document? Before submitting to Solr? With what tool?

How do you do the query? Do you get the same results when doing the query from a browser, not SolrJ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com
Re: Missing tokens
Great! Now I'm getting somewhere, this worked! The others didn't.

    http://localhost/solr/select?q=contents:"OB10."

Hope this makes sense to you. I'm still somewhat confused by the output here. I had 'highlight matches' checked, and from what I can tell, 'OB10' wasn't found. When I enter 'OB10.' into the query, column 11 'ob10.' became highlighted in the 'LowerCaseFilterFactory' table.

Am I using the wrong analyser, or supplying the wrong parameters to an analyser?

Thanks for your help so far!
Paul

Index Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
| term position    | 1    | 2       | 3     | 4            | 5     | 6     | 7     | 8      | 9        | 10    | 11    | 12    | 13       |
| term text        | To   | produce | a     | downloadable | file  | using | a     | format | suitable | for   | OB10. | 8-26  | Profiles |
| term type        | word | word    | word  | word         | word  | word  | word  | word   | word     | word  | word  | word  | word     |
| source start,end | 0,2  | 3,10    | 11,12 | 13,25        | 26,30 | 31,36 | 37,38 | 39,45  | 46,54    | 55,58 | 59,64 | 65,69 | 70,78    |
| payload          |      |         |       |              |       |       |       |        |          |       |       |       |          |

org.apache.solr.analysis.StandardFilterFactory {}
| term position    | 1    | 2       | 3     | 4            | 5     | 6     | 7     | 8      | 9        | 10    | 11    | 12    | 13       |
| term text        | To   | produce | a     | downloadable | file  | using | a     | format | suitable | for   | OB10. | 8-26  | Profiles |
| term type        | word | word    | word  | word         | word  | word  | word  | word   | word     | word  | word  | word  | word     |
| source start,end | 0,2  | 3,10    | 11,12 | 13,25        | 26,30 | 31,36 | 37,38 | 39,45  | 46,54    | 55,58 | 59,64 | 65,69 | 70,78    |
| payload          |      |         |       |              |       |       |       |        |          |       |       |       |          |

org.apache.solr.analysis.LowerCaseFilterFactory {}
| term position    | 1    | 2       | 3     | 4            | 5     | 6     | 7     | 8      | 9        | 10    | 11    | 12    | 13       |
| term text        | to   | produce | a     | downloadable | file  | using | a     | format | suitable | for   | ob10. | 8-26  | profiles |
| term type        | word | word    | word  | word         | word  | word  | word  | word   | word     | word  | word  | word  | word     |
| source start,end | 0,2  | 3,10    | 11,12 | 13,25        | 26,30 | 31,36 | 37,38 | 39,45  | 46,54    | 55,58 | 59,64 | 65,69 | 70,78    |
| payload          |      |         |       |              |       |       |       |        |          |       |       |       |          |

Query Analyzer

org.apache.solr.analysis.WhitespaceTokenizerFactory {}
| term position    | 1    |
| term text        | OB10 |
| term type        | word |
| source start,end | 0,4  |
Re: Missing tokens
I did that and it worked. Thanks very much for your expert assistance, Jan!

Paul

From: Jan Høydahl / Cominvent
To: solr-user@lucene.apache.org
Date: 19/08/2010 16:15
Subject: Re: Missing tokens

Hi,

Your bug is right there in the WhitespaceTokenizer output, where you can see that it does NOT strip away the "." as if it were whitespace. The indexed term is therefore "ob10.", which a query for "OB10" cannot match. Try StandardTokenizerFactory instead, as it removes punctuation.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com
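
To illustrate the diagnosis in this thread, here is a minimal Java sketch of why the query 'OB10' misses the indexed term. It does not use Lucene or Solr classes: the class name TokenizerSketch and both methods are hypothetical stand-ins that only imitate "split on whitespace, then lowercase" and "also strip surrounding punctuation". The real StandardTokenizerFactory follows more elaborate rules, so treat this as an approximation of the effect, not of the implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizerSketch {

    // Hypothetical stand-in for a whitespace tokenizer plus lowercase filter:
    // splits on whitespace only, so punctuation stays attached to the term.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Hypothetical, simplified stand-in for a punctuation-aware tokenizer:
    // additionally strips leading and trailing characters that are neither
    // letters nor digits from each term.
    static List<String> punctuationStrippedTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String term : whitespaceTokens(text)) {
            String stripped =
                term.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (!stripped.isEmpty()) {
                tokens.add(stripped);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Snippet of the 'summary' field quoted earlier in the thread.
        String indexed = "To produce a downloadable file using a format "
                + "suitable for OB10. 8-26 Profiles";
        // The query goes through the same lowercasing as the indexed text.
        String query = "OB10".toLowerCase(Locale.ROOT);

        System.out.println("whitespace only: "
                + whitespaceTokens(indexed).contains(query));
        System.out.println("punctuation stripped: "
                + punctuationStrippedTokens(indexed).contains(query));
    }
}
```

Run as written, the first line prints false, because the indexed term is 'ob10.' with the trailing full stop, and the second prints true. This matches what the analysis page showed: only the query 'OB10.' lined up with the term in column 11.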