Jack,

Thanks for the information. If I understand this correctly, the whitespace tokenizer will break a single token of size 300 into two tokens, one of size 256 and the other of size 44. If this is true, then for the single test document I have used, I should see two tokens in the portal_package field of the index rather than one large single token.
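To make the arithmetic concrete, here is a minimal Python sketch of that splitting. It assumes the tokenizer simply cuts a whitespace-free run at a fixed 256-character limit, as described in this thread. `split_long_token` and `MAX_WORD_LEN` are hypothetical names used only for illustration; this is not Lucene code.

```python
# Assumed per-token limit, taken from the 256-character figure discussed in this thread.
MAX_WORD_LEN = 256


def split_long_token(token, limit=MAX_WORD_LEN):
    """Hypothetical helper: cut one whitespace-free token into chunks of at most `limit` characters."""
    return [token[i:i + limit] for i in range(0, len(token), limit)]


# A single 300-character token with no whitespace.
token = "1234567890" * 30
chunks = split_long_token(token)
print([len(chunk) for chunk in chunks])  # [256, 44]
```

Under this assumption every chunk is at most 256 characters long, so a LengthFilterFactory configured with max="300" would never remove any of them.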
If my understanding is correct, then why in my production system, where we occasionally get a single very large token, do I see this error?

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="portal_package" (whose UTF8 encoding is longer than the max length 32766)

The existence of this error would lead me to conclude that a very large single token is making its way through the white space tokenizer and filters to the index, where it is rejected. I'm afraid my understanding is not complete. Can you fill in the gaps?

Thanks,
Charles

----- Original Message -----
From: "Jack Krupansky" <jack.krupan...@gmail.com>
To: solr-user@lucene.apache.org
Sent: Friday, May 15, 2015 4:31:22 PM
Subject: Re: Problem with solr.LengthFilterFactory

Sorry that my brain has turned to mush... the issue you are hitting is due to a known, undocumented limit in the whitespace tokenizer:

https://issues.apache.org/jira/browse/LUCENE-5785
"White space tokenizer has undocumented limit of 256 characters per token"

If you look at the parsed query you will see that two query terms were generated. This is because the whitespace tokenizer will simply split long tokens every 256 characters. So, your filter will never see a long term.

There is a note on the Jira (evidently by me!) that you can use the pattern tokenizer as a workaround. But... if your term is a string anyway, you could just use the keyword tokenizer.

-- Jack Krupansky

On Fri, May 15, 2015 at 4:06 PM, Charles Sanders <csand...@redhat.com> wrote:

> Shawn,
> Thanks a bunch for working with me on this.
>
> I have deleted all records from my index. Stopped solr. Made the schema
> changes as requested. Started solr. Then insert the one test record. Then
> search. Still see the same results. No portal_package is not the unique
> key, its uri. Which is a string field.
>
> <field name="portal_package" type="text_std" indexed="true" stored="true"
> multiValued="true"/>
>
> <fieldType name="text_std" class="solr.TextField"
> positionIncrementGap="100">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LengthFilterFactory" min="1" max="300" />
> </fieldType>
>
> {
> "documentKind": "test",
> "uri": "test300",
> "id": "test300",
> "portal_package}
>
> {
> "responseHeader": {
> "status": 0,
> "QTime": 47,
> "params": {
> "spellcheck": "true",
> "enableElevation": "false",
> "df": "allText",
> "echoParams": "all",
> "spellcheck.maxCollations": "5",
> "spellcheck.dictionary": "andreasAutoComplete",
> "spellcheck.count": "5",
> "spellcheck.collate": "true",
> "spellcheck.onlyMorePopular": "true",
> "rows": "10",
> "indent": "true",
> "q":
> "portal_packagedebug": "query",
> "wt": "json"
> }
> },
> "response": {
> "numFound": 1,
> "start": 0,
> "docs": [
> {
> "documentKind": "test",
> "uri": "test300",
> "id": "test300",
> "portal_packageversion_": 1501267024421060600,
> "timestamp": "2015-05-15T19:56:43.247Z",
> "language": "en"
> }
> ]
> },
> "debug": {
> "rawquerystring":
> "portal_packagequerystring":
> "portal_packageparsedquery":
> "portal_package:1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
> portal_package:7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
>
> "parsedquery_toString":
> "portal_package:1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
> portal_package:7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
>
> "QParser": "LuceneQParser"
> }
> }
>
> ----- Original Message -----
>
> From: "Shawn Heisey" <apa...@elyograg.org>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 15, 2015 3:29:19 PM
> Subject: Re: Problem with solr.LengthFilterFactory
>
> On 5/15/2015 1:23 PM, Shawn Heisey wrote:
> > Then I looked back at your fieldType definition and noticed that you
> > are only defining an index analyzer. Remove the 'type="index"' part of
> > the analyzer config so it happens at both index and query time,
> > reindex, then try again.
>
> The reindex may be very important here. I would actually completely
> delete your data directory and restart Solr before reindexing, to be
> sure you don't have old records from any previous reindexes.
>
> http://wiki.apache.org/solr/HowToReindex
>
> I think this next part is unlikely, but I'm going to ask it anyway: Is
> the portal_package field your schema uniqueKey? If it is, that might be
> an additional source of problems. Using a solr.Textfield for a
> uniqueKey field causes Solr to behave in unexpected ways.
>
> Thanks,
> Shawn