[ https://issues.apache.org/jira/browse/LUCENE-8947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-8947.
----------------------------------------
    Resolution: Won't Fix

It turns out we cannot find a safe way to fix this, so users must not write so 
many, or such large, custom term frequencies that their sum overflows a Java 
`int` within a single field of a single document.
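
On the application side, one workaround is to bound the sum before documents ever reach IndexWriter. Below is a minimal sketch of a TokenFilter that does this; TermFrequencyAttribute and TokenFilter are real Lucene APIs, but the class name and the clamping policy are illustrative assumptions, not something shipped with Lucene:

{noformat}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

/**
 * Hypothetical filter that keeps the running sum of custom term frequencies
 * for one field of one document at or below Integer.MAX_VALUE, so the int
 * accumulator in DefaultIndexingChain cannot overflow.
 */
public final class BoundedTermFrequencyFilter extends TokenFilter {

  private final TermFrequencyAttribute termFreqAtt = addAttribute(TermFrequencyAttribute.class);
  private long sum; // running total of term frequencies for this field/document

  public BoundedTermFrequencyFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int freq = termFreqAtt.getTermFrequency();
    long remaining = Integer.MAX_VALUE - sum;
    if (freq > remaining) {
      if (remaining >= 1) {
        // Clamp the last signal so the field still fits; rescaling the
        // signals or dropping the token are other reasonable policies.
        freq = (int) remaining;
        termFreqAtt.setTermFrequency(freq);
      } else {
        // No headroom left: fail fast with a clearer message than the
        // "too many tokens for field" thrown deep inside indexing.
        throw new IllegalStateException("sum of custom term frequencies exceeds Integer.MAX_VALUE");
      }
    }
    sum += freq;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    sum = 0;
  }
}
{noformat}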

> Indexing fails with "too many tokens for field" when using custom term 
> frequencies
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-8947
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8947
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 7.5
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> We are using custom term frequencies (LUCENE-7854) to index per-token scoring 
> signals; however, for one document that had many tokens with fairly large 
> (~998,000) scoring signals, we hit this exception:
> {noformat}
> 2019-08-05T21:32:37,048 [ERROR] (LuceneIndexing-3-thread-3) com.amazon.lucene.index.IndexGCRDocument: Failed to index doc:
> java.lang.IllegalArgumentException: too many tokens for field "foobar"
> at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:825)
> at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
> at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:394)
> at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:297)
> at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:450)
> at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1291)
> at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1264)
> {noformat}
> This is happening in this code in {{DefaultIndexingChain.java}}:
> {noformat}
>   try {
>     invertState.length = Math.addExact(invertState.length, invertState.termFreqAttribute.getTermFrequency());
>   } catch (ArithmeticException ae) {
>     throw new IllegalArgumentException("too many tokens for field \"" + field.name() + "\"");
>   }
> {noformat}
> This is where Lucene accumulates the total length (number of tokens) for the 
> field. But total length doesn't really make sense if you are using custom term 
> frequencies to hold arbitrary scoring signals? Or maybe it does make sense, if 
> the user is using this as simple boosting, but maybe we should allow this 
> length to be a {{long}}?
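
For context on this setup, the per-token values summed in the quoted code are supplied through TermFrequencyAttribute (added in LUCENE-7854). A minimal sketch of such a producing stream follows; the class name, the repeated synthetic terms, and the signal value are illustrative assumptions, not the reporter's actual code:

{noformat}
import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

/**
 * Hypothetical stream that emits numTokens terms, each carrying a large
 * custom term frequency, to show how the per-field sum grows.
 */
final class SignalTokenStream extends TokenStream {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute termFreqAtt = addAttribute(TermFrequencyAttribute.class);

  private final int numTokens;
  private final int signal;
  private int emitted;

  SignalTokenStream(int numTokens, int signal) {
    this.numTokens = numTokens;
    this.signal = signal;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (emitted >= numTokens) {
      return false;
    }
    clearAttributes();
    termAtt.setEmpty().append("token").append(Integer.toString(emitted));
    termFreqAtt.setTermFrequency(signal); // per-token scoring signal
    emitted++;
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    emitted = 0;
  }
}
{noformat}

Note that the field consuming such a stream must index term frequencies (IndexOptions.DOCS_AND_FREQS or higher) for custom frequencies to be accepted. With a signal of 998,000 per token, the int accumulator in DefaultIndexingChain overflows once the field carries 2,152 such tokens (998,000 × 2,152 > Integer.MAX_VALUE = 2,147,483,647), which is how a single large document trips the check even though its token count is modest.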


