Cannot see anything obvious... Try
http://localhost/solr/select?q=contents:OB10*
http://localhost/solr/select?q=contents:"OB 10"
http://localhost/solr/select?q=contents:"OB10."
http://localhost/solr/select?q=contents:ob10
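If you would rather script these checks than paste them into a browser, note that the quotes and the space in the phrase queries need URL-encoding. A minimal Java sketch, assuming Java 10+ and the same base URL as the examples above; the class and method names (`QueryUrls`, `selectUrl`) are made up for illustration:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

class QueryUrls {
    // Assumed base URL, copied from the examples above; adjust host, port and path to your setup.
    static final String BASE = "http://localhost/solr/select?q=";

    // Returns the select URL with the query string URL-encoded
    // (':' becomes %3A, '"' becomes %22, a space becomes '+').
    static String selectUrl(String query) {
        return BASE + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<String> queries = List.of(
                "contents:OB10*",
                "contents:\"OB 10\"",
                "contents:\"OB10.\"",
                "contents:ob10");
        for (String q : queries) {
            System.out.println(selectUrl(q));
        }
    }
}
```

Fetching each printed URL and comparing which variants return the document should show whether the trailing period or the letter/digit split is what gets in the way.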
Also, go to the Analysis page in admin, type in your field name, enable verbose output, paste the problematic sentence into the "Index" part, then enter OB10 in the "Query" part, and see how your doc and query get processed.

PS: Why don't you try this instead of doing the PDF extraction yourself: http://wiki.apache.org/solr/ExtractingRequestHandler ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 18. aug. 2010, at 16.25, paul.mo...@dds.net wrote:

> Here's my field description. I mentioned 'contents' field in my original
> post. I've changed it to a different field, 'summary'. It's using the
> 'text' fieldType as you can see below.
>
> <field name="summary" type="text" indexed="true" stored="true"/>
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!-- in this example, we will only use synonyms at query time
>     <filter class="solr.SynonymFilterFactory"
>       synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>     -->
>     <!-- Case insensitive stop word removal.
>       enablePositionIncrements=true ensures that a 'gap' is left to
>       allow for accurate phrase queries.
>     -->
>     <filter class="solr.StopFilterFactory"
>       ignoreCase="true"
>       words="stopwords.txt"
>       enablePositionIncrements="true"
>     />
>     <filter class="solr.WordDelimiterFilterFactory"
>       generateWordParts="1" generateNumberParts="1"
>       catenateWords="1" catenateNumbers="1"
>       catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>       ignoreCase="true" expand="true"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>       words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>       generateWordParts="1" generateNumberParts="1"
>       catenateWords="0" catenateNumbers="0"
>       catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> I parsed the pdf using pdfbox. I can see my alphanumeric search term 'OB10'
> in the extracted text before I add it to the index. I can also go into Luke
> and see the 'OB10' in the contents of the 'summary' field even though Luke
> can't find it when I do a search.
>
> I can also use the browser to do a search in http://localhost/solr/admin
> and again that search term doesn't return any results. I thought it may be
> an alphanumeric word splitting issue, but that doesn't seem to be the case
> since I can search on ME26, and it returns a doc, and in fact, I can see
> the 'OB10' search term in the summary field of the doc returned.
>
> Here's a snippet of the summary field from that returned doc
>
> To produce a downloadable file using a format suitable
> for OB10.
8-26 Profiles
>
> I'm thinking that the extracted text from pdfbox may have hidden chars that
> solr can't parse. However, before I go down that road, I just want to be
> sure I'm not making schoolboy errors with my solr setup.
>
> thanks
> Paul
>
>
> From: Jan Høydahl / Cominvent <jan....@cominvent.com>
> To: solr-user@lucene.apache.org
> Date: 18/08/2010 11:56
> Subject: Re: Missing tokens
>
>
> Hi,
>
> Can you share with us how your schema looks for this field? What FieldType?
> What tokenizer and analyser?
> How do you parse the PDF document? Before submitting to Solr? With what
> tool?
> How do you do the query? Do you get the same results when doing the query
> from a browser, not SolrJ?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 18. aug. 2010, at 11.34, paul.mo...@dds.net wrote:
>
>> Hi, I'm having a problem with certain search terms not being found when I
>> do a query. I'm using Solrj to index a pdf document, and add the contents
>> to the 'contents' field. If I query the 'contents' field on the
>> SolrInputDocument doc object as below, I get 50k tokens.
>>
>> StringTokenizer to = new StringTokenizer((String) doc.getFieldValue("contents"));
>> System.out.println("Tokens:" + to.countTokens());
>>
>> However, once the doc is indexed and I use Luke to analyse the index, it
>> has only 3300 tokens in that field. Where did the other 47k go?
>>
>> I read some other threads mentioning to increase the maxfieldLength in
>> solrconfig.xml, and my setting is below.
>>
>> <maxFieldLength>2147483647</maxFieldLength>
>>
>> Any advice is appreciated,
>> Paul