You could develop an update processor to skip or trim long terms as you see
fit. You can even write the logic in JavaScript using the stateless script
update processor.
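As a sketch of that approach (the field name "body" and the 32766-byte limit are assumptions for illustration, and Solr's StatelessScriptUpdateProcessorFactory would call processAdd for each document):

```javascript
// Sketch of an update-processor script for StatelessScriptUpdateProcessorFactory.
// Field name "body" and the 32766 limit are illustrative assumptions.

// Count the UTF-8 bytes a string would occupy once encoded
// (a surrogate pair is 2 UTF-16 code units but 4 UTF-8 bytes).
function utf8ByteLength(s) {
  var bytes = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c <= 0x7F) bytes += 1;
    else if (c <= 0x7FF) bytes += 2;
    else if (c >= 0xD800 && c <= 0xDBFF) { bytes += 4; i++; } // surrogate pair
    else bytes += 3;
  }
  return bytes;
}

function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var value = doc.getFieldValue("body");
  // Drop values whose UTF-8 encoding would exceed Lucene's term limit.
  if (value !== null && utf8ByteLength(String(value)) > 32766) {
    doc.removeField("body");
  }
}
```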
Can you tell us more about the nature of your data? I mean, sometimes
analyzer filters strip or fold accented characters anyway, so the difference
between character count and UTF-8 byte count may turn out to be a non-problem.
-- Jack Krupansky
-----Original Message-----
From: Michael Ryan
Sent: Tuesday, July 1, 2014 9:49 AM
To: solr-user@lucene.apache.org
Subject: Best way to fix "Document contains at least one immense term"?
In LUCENE-5472, Lucene was changed to throw an error if a term is too long,
rather than just logging a message. I have fields with terms that are too
long, but I don't care - I just want to ignore them and move on.
The recommended solution in the docs is to use LengthFilterFactory, but this
limits the terms by the number of characters, rather than the number of
UTF-8 bytes. So you can't just do something clever like set max=32766, due
to the possibility of multibyte characters.
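For reference, wiring LengthFilterFactory into an analyzer looks roughly like this (the fieldType name and the max value are illustrative, and the filter counts characters, not UTF-8 bytes):

```xml
<!-- Sketch only: fieldType name and max value are illustrative.
     LengthFilter limits by character count, not UTF-8 byte count. -->
<fieldType name="text_trimmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="10922"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```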
So, is there a way of using LengthFilterFactory to do this such that an
error will never be thrown? I'm thinking I could use some max less than
32766 / 3, but I want to be absolutely sure that there isn't some edge case
that will break. I guess I could just set it to something sane like 1000. Or
is there another more direct solution to this problem?
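On the 32766 / 3 idea: LengthFilter counts UTF-16 code units, and each code unit encodes to at most 3 UTF-8 bytes (a supplementary character is 2 code units for 4 bytes, i.e. only 2 bytes per unit), so floor(32766 / 3) = 10922 should be a safe ceiling. A quick sanity check of that arithmetic, nothing Solr-specific assumed:

```javascript
// Worst case per UTF-16 code unit is a 3-byte UTF-8 sequence (e.g. most CJK).
// A term of 10922 such characters sits exactly at Lucene's 32766-byte limit.
var maxChars = Math.floor(32766 / 3);   // 10922
var worstCaseBytes = maxChars * 3;      // 32766
var withinLimit = worstCaseBytes <= 32766;
```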
-Michael