Jack, 
Thanks for the information. If I understand this correctly, the whitespace 
tokenizer will break a single token of size 300 into two tokens, one of size 
256 and the other of size 44. If this is true, then for the single test 
document I have used, I should see two tokens in the portal_package field of 
the index rather than one large single token. 

If my understanding is correct, then why in my production system, where we 
occasionally get a single very large token, do I see this error? 
Caused by: java.lang.IllegalArgumentException: Document contains at least one 
immense term in field="portal_package" (whose UTF8 encoding is longer than the 
max length 32766) 

The existence of this error would lead me to conclude that a very large single 
token is making its way through the whitespace tokenizer and filters to the 
index, where it is rejected. 

I'm afraid my understanding is not complete. Can you fill in the gaps? 

Thanks, 
Charles 


----- Original Message -----

From: "Jack Krupansky" <jack.krupan...@gmail.com> 
To: solr-user@lucene.apache.org 
Sent: Friday, May 15, 2015 4:31:22 PM 
Subject: Re: Problem with solr.LengthFilterFactory 

Sorry that my brain has turned to mush... the issue you are hitting is due 
to a known, undocumented limit in the whitespace tokenizer: 

https://issues.apache.org/jira/browse/LUCENE-5785 
"White space tokenizer has undocumented limit of 256 characters per token" 

If you look at the parsed query you will see that two query terms were 
generated. This is because the whitespace tokenizer will simply split long 
tokens every 256 characters. So, your filter will never see a long term. 

There is a note on the Jira (evidently by me!) that you can use the pattern 
tokenizer as a workaround. But... if your term is a string anyway, you 
could just use the keyword tokenizer. 
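
A minimal sketch of the two workarounds mentioned above, assuming the rest of
the schema stays as posted. The field type names text_std_pattern and
text_std_keyword are hypothetical; only one of the two would be needed, the
portal_package field would have to reference it, and a full reindex would be
required afterwards. Behavior is worth verifying on your own Solr version.

```xml
<!-- Option 1 (hypothetical name): split on whitespace with the pattern
     tokenizer instead of the whitespace tokenizer, so that long tokens are
     not chopped at 256 characters before the length filter sees them. -->
<fieldType name="text_std_pattern" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+"/>
    <filter class="solr.LengthFilterFactory" min="1" max="300"/>
  </analyzer>
</fieldType>

<!-- Option 2 (hypothetical name): treat each field value as one token.
     Only appropriate if the values are plain strings that never need to
     be split on whitespace. -->
<fieldType name="text_std_keyword" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="300"/>
  </analyzer>
</fieldType>
```

With either definition, a value longer than 300 characters that contains no
whitespace should reach the length filter as a single token and be dropped
there, rather than being indexed as two shorter terms.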


-- Jack Krupansky 

On Fri, May 15, 2015 at 4:06 PM, Charles Sanders <csand...@redhat.com> 
wrote: 

> Shawn, 
> Thanks a bunch for working with me on this. 
> 
> I have deleted all records from my index, stopped Solr, made the schema 
> changes as requested, and started Solr. I then inserted the one test record 
> and searched. I still see the same results. And no, portal_package is not the 
> unique key; that is uri, which is a string field. 
> 
> <field name="portal_package" type="text_std" indexed="true" stored="true" 
> multiValued="true"/> 
> 
> <fieldType name="text_std" class="solr.TextField" 
> positionIncrementGap="100"> 
> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
> <filter class="solr.LengthFilterFactory" min="1" max="300" /> 
> </fieldType> 
> 
> { 
> "documentKind": "test", 
> "uri": "test300", 
> "id": "test300", 
> "portal_package":"12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
>  
> } 
> 
> 
> { 
> "responseHeader": { 
> "status": 0, 
> "QTime": 47, 
> "params": { 
> "spellcheck": "true", 
> "enableElevation": "false", 
> "df": "allText", 
> "echoParams": "all", 
> "spellcheck.maxCollations": "5", 
> "spellcheck.dictionary": "andreasAutoComplete", 
> "spellcheck.count": "5", 
> "spellcheck.collate": "true", 
> "spellcheck.onlyMorePopular": "true", 
> "rows": "10", 
> "indent": "true", 
> "q": 
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
>  
> "_": "1431719989047", 
> "debug": "query", 
> "wt": "json" 
> } 
> }, 
> "response": { 
> "numFound": 1, 
> "start": 0, 
> "docs": [ 
> { 
> "documentKind": "test", 
> "uri": "test300", 
> "id": "test300", 
> "portal_package": [ 
> 
> "12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
>  
> ], 
> "_version_": 1501267024421060600, 
> "timestamp": "2015-05-15T19:56:43.247Z", 
> "language": "en" 
> } 
> ] 
> }, 
> "debug": { 
> "rawquerystring": 
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
>  
> "querystring": 
> "portal_package:12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
>  
> "parsedquery": 
> "portal_package:1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
>  
> portal_package:7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
>  
> "parsedquery_toString": 
> "portal_package:1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456
>  
> portal_package:7890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890",
>  
> "QParser": "LuceneQParser" 
> } 
> } 
> 
> 
> 
> 
> 
> ----- Original Message ----- 
> 
> From: "Shawn Heisey" <apa...@elyograg.org> 
> To: solr-user@lucene.apache.org 
> Sent: Friday, May 15, 2015 3:29:19 PM 
> Subject: Re: Problem with solr.LengthFilterFactory 
> 
> On 5/15/2015 1:23 PM, Shawn Heisey wrote: 
> > Then I looked back at your fieldType definition and noticed that you 
> > are only defining an index analyzer. Remove the 'type="index"' part of 
> > the analyzer config so it happens at both index and query time, 
> > reindex, then try again. 
> 
> The reindex may be very important here. I would actually completely 
> delete your data directory and restart Solr before reindexing, to be 
> sure you don't have old records from any previous reindexes. 
> 
> http://wiki.apache.org/solr/HowToReindex 
> 
> I think this next part is unlikely, but I'm going to ask it anyway: Is 
> the portal_package field your schema uniqueKey? If it is, that might be 
> an additional source of problems. Using a solr.TextField for a 
> uniqueKey field causes Solr to behave in unexpected ways. 
> 
> Thanks, 
> Shawn 
> 
> 
> 
