I don’t quite know how TolerantUpdateProcessor works with importing CSV
files; see: https://issues.apache.org/jira/browse/SOLR-445. That issue is
about not aborting a whole batch of docs sent to Solr when one of them is
bad, and frankly I don’t know what path your CSV import will take through
that code. It’s worth a try though.

Otherwise, I typically go with SolrJ and send batches. That does combine with
TolerantUpdateProcessor.
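
Roughly, the SolrJ side looks something like the sketch below. This is
untested, and the chain name ("tolerant-chain"), collection name, URL and
maxErrors value are all placeholders; the chain itself has to be defined
in your solrconfig.xml with solr.TolerantUpdateProcessorFactory in it.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      List<SolrInputDocument> batch = new ArrayList<>();

      // Build docs from your CSV rows; this single doc is just for
      // illustration.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "1");
      doc.addField("primary_dob", "1983-12-21T00:00:00Z");
      batch.add(doc);

      UpdateRequest req = new UpdateRequest();
      // Route the batch through the chain that includes
      // TolerantUpdateProcessorFactory (defined in solrconfig.xml).
      req.setParam("update.chain", "tolerant-chain");
      // Keep going until this many docs have failed, then abort.
      req.setParam("maxErrors", "100");
      req.add(batch);
      req.process(client, "your_collection");
    }
  }
}

If it works the way I expect, the response header comes back with a list
of the ids that failed and the reason for each, so you can log those and
move on.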

Best,
Erick

> On Feb 3, 2020, at 10:16 AM, Joseph Lorenzini <jalo...@gmail.com> wrote:
> 
> Hi Shawn/Erick,
> 
> This information has been very helpful. Thank you.
> 
> So I did some more investigation into our ETL process, and I verified
> that, with the exception of the value I sent above, the failing dates
> are all obviously invalid. For example, one field value had 00 for the
> day. So I would guess the value I sent above has a non-printable
> character in it. At least in the case of a record where a field has an
> invalid date, the entire import process is aborted. I'll adjust the ETL
> process to stop passing invalid dates, but this does lead me to a
> question about failure modes when importing large data sets into a
> collection. Is there any way to specify a "continue on failure" mode,
> such that Solr logs that it was unable to parse a record and why, and
> then continues on to the next record?
> 
> Thanks,
> Joe
> 
> On Sun, Feb 2, 2020 at 4:46 PM Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 2/2/2020 8:47 AM, Joseph Lorenzini wrote:
>>>         <autoSoftCommit>
>>>             <maxTime>1000</maxTime>
>>>             <maxDocs>10000</maxDocs>
>>>         </autoSoftCommit>
>> 
>> That autoSoftCommit setting is far too aggressive, especially for bulk
>> indexing.  I don't know whether it's causing the specific problem you're
>> asking about here, but it's still a setting that will cause problems,
>> because Solr will constantly be doing commit operations while bulk
>> indexing is underway.
>> 
>> Erick mentioned this as well.  Greatly increasing the maxTime, and
>> removing maxDocs, is recommended.  I would recommend starting at one
>> minute.  The maxDocs setting should be removed from autoCommit as well.
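>> 
>> For example, something like this as a starting point (one minute,
>> expressed in milliseconds, with the usual openSearcher=false on the
>> hard commit side):
>> 
>>         <autoCommit>
>>             <maxTime>60000</maxTime>
>>             <openSearcher>false</openSearcher>
>>         </autoCommit>
>>         <autoSoftCommit>
>>             <maxTime>60000</maxTime>
>>         </autoSoftCommit>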
>> 
>>> So I turned off two solr nodes, leaving a single solr node up. When I ran
>>> curl again, I noticed the import aborted with this exception.
>>> 
>>> Error adding field 'primary_dob'='1983-12-21T00:00:00Z' msg=Invalid Date
>> in
>>> Date Math String:'1983-12-21T00:00:00Z
>>> caused by: java.time.format.DateTimeParseException: Text
>>> '1983-12-21T00:00:00Z' could not be parsed at index 0'
>> 
>> That date string looks OK.  Which MIGHT mean there are characters in it
>> that are not visible.  Erick said in his message that the single quote
>> is balanced, which COULD mean that the character causing the problem is
>> one that deletes things when it is printed.
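>> 
>> One quick way to check is to dump the code points of the raw field
>> value.  For example, with a throwaway Java snippet (the string literal
>> here is a placeholder -- paste in the value copied straight from your
>> CSV):
>> 
>>     String s = "1983-12-21T00:00:00Z";
>>     s.codePoints().forEach(cp ->
>>         System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
>> 
>> Anything other than digits, hyphens, colons, 'T' and 'Z' is your
>> culprit.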
>> 
>> Thanks,
>> Shawn
>> 
