You’re opening new searchers very often, at least once a second. I don’t
recommend that except under very unusual circumstances. This shouldn’t be
the root of your problem, but it’s not helping either; I’d bump that up
to 60 seconds or so.

I usually specify just maxTime and not maxDocs; I think that’s a little
more predictable. In the situation you’re describing, where you’ll
occasionally be sending a bazillion docs, maxDocs means your commits will
come fast and furious over that period.
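
For instance, something along these lines is what I’d start with (the
exact values are just a straw man, adjust to your situation):

    <autoCommit>
        <maxTime>60000</maxTime>              <!-- hard commit every 60 seconds -->
        <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
        <maxTime>60000</maxTime>              <!-- new searcher at most once a minute -->
    </autoSoftCommit>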

But that’s a perfectly valid date string. Hmmm, is there any chance
that the leading single quote is actually part of the field value? When I
force a leading quote by adding a doc through the admin UI, it fails with
a similar message. That said, the quotes in your error message look
balanced, so this may be a long shot.
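
Just as a sanity check on my end, the exact value from your error parses
cleanly with plain java.time, and a stray leading quote reproduces the
"could not be parsed at index 0" failure. Solr’s date math parsing isn’t
literally Instant.parse, but the exception in your log is the same
java.time one. This is only a scratch test, nothing from your setup:

    import java.time.Instant;
    import java.time.format.DateTimeParseException;

    public class DateCheck {
        public static void main(String[] args) {
            // The value from the error message is a valid ISO instant.
            System.out.println(Instant.parse("1983-12-21T00:00:00Z"));

            // A stray leading quote fails at index 0, which is exactly
            // what the stack trace is reporting.
            try {
                Instant.parse("'1983-12-21T00:00:00Z");
            } catch (DateTimeParseException e) {
                System.out.println(e.getMessage());
            }
        }
    }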

So I’m stumped as well. Are there any _other_ errors in the logs? I’m
wondering if this is some weird effect from an earlier error.
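
If you do end up going the SolrJ route I suggested in my last reply
(quoted below), a loader along these lines would let you log and skip the
bad rows instead of letting one record abort the whole stream. This is
only a rough sketch: the collection name, column layout, batch size and
the quote-stripping are all made up for illustration, so treat it as a
starting point rather than anything definitive.

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class TsvLoader {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/example").build();
                 BufferedReader reader =
                     Files.newBufferedReader(Paths.get("/tmp/input.tsv"))) {

                List<SolrInputDocument> batch = new ArrayList<>();
                String line;
                long lineNo = 0;

                while ((line = reader.readLine()) != null) {
                    lineNo++;
                    try {
                        String[] cols = line.split("\t", -1);
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", cols[0]);       // hypothetical column layout

                        // Clean and validate the date client side so a bad row
                        // gets logged and skipped instead of failing the load.
                        String dob = cols[1].replace("'", "").trim();
                        Instant.parse(dob);                // throws if not ISO-8601
                        doc.addField("primary_dob", dob);

                        batch.add(doc);
                    } catch (Exception e) {
                        System.err.println("Skipping line " + lineNo + ": " + e.getMessage());
                    }
                    if (batch.size() >= 1000) {
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                client.commit();   // or rely on autoCommit/autoSoftCommit
            }
        }
    }

You’d still have to decide what to do when Solr rejects an entire batch,
but at least the obviously bad rows never make it over the wire, and you
get a line number for each one you skip.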

Best,
Erick


> On Feb 2, 2020, at 10:47 AM, Joseph Lorenzini <jalo...@gmail.com> wrote:
> 
> Hi Eric,
> 
> Thanks for the help.
> 
> For commit settings, you are referring to
> https://lucene.apache.org/solr/guide/8_3/updatehandlers-in-solrconfig.html.
> If so, yes, I have soft commits on. According to the docs, openSearcher is
> turned on by default. Here are the settings.
> 
>        <autoCommit>
>            <maxTime>600000</maxTime>
>            <maxDocs>180000</maxDocs>
>        </autoCommit>
>        <autoSoftCommit>
>            <maxTime>1000</maxTime>
>            <maxDocs>10000</maxDocs>
>        </autoSoftCommit>
> 
> 
> Please note, I am actually streaming a file from disk -- I am not sending
> the data via curl. curl is merely telling Solr what local file to read from.
> 
> So I turned off two Solr nodes, leaving a single Solr node up. When I ran
> curl again, I noticed the import aborted with this exception.
> 
> Error adding field 'primary_dob'='1983-12-21T00:00:00Z' msg=Invalid Date in
> Date Math String:'1983-12-21T00:00:00Z
> caused by: java.time.format.DateTimeParseException: Text
> '1983-12-21T00:00:00Z' could not be parsed at index 0'
> 
> This field is a DatePointField. I've verified that if I remove records with
> a DatePointField that has parsing problems, then the Solr upload proceeds
> further ... until it hits another record with a similar problem. I was
> surprised that a single record with an invalid DatePointField would abort the
> whole process, but that does seem to be what's happening.
> 
> So that would be easy enough to fix if I knew why the text was failing to
> parse. The date certainly seems valid to me based on this documentation.
> 
> http://lucene.apache.org/solr/7_2_1/solr-core/org/apache/solr/schema/DatePointField.html
> 
> Any ideas on why that won't parse?
> 
> Thanks,
> Joe
> 
> 
> On Sun, Feb 2, 2020 at 8:51 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> What are your commit settings? Solr keeps certain in-memory structures
>> between commits, so it’s important to commit periodically. Say every 60
>> seconds as a straw-man proposal (and openSearcher should be set to
>> true or soft commits should be enabled).
>> 
>> When firing a zillion docs at Solr, it’s also best that your commits (both hard
>> and soft) aren’t happening too frequently, thus my 60 second proposal.
>> 
>> The commit on the command you send will be executed after the last doc
>> is sent, so it’s irrelevant to the above.
>> 
>> Apart from that, every time you commit while indexing, background
>> merges are kicked off and there’s a limited number of threads that are
>> allowed to run concurrently. When that max is reached, the next update is
>> queued until one of the threads is free. So you _may_ be hitting a simple
>> timeout that’s showing up as a 400 error, which is something of a
>> catch-all return code. If this is the case, just lengthening the timeouts
>> might fix the issue.
>> 
>> Are you sending the documents to the leader? That’ll make the process
>> simpler since docs received by followers are simply forwarded to the
>> leader. That shouldn’t really matter, just a side-note.
>> 
>> Not all that helpful, I know. Does the failure happen in the same place?
>> I.e. is it possible that a particular doc is making this happen? Unlikely,
>> but worth asking. One bad doc shouldn’t stop the whole process, but it’d
>> be a clue if there were one.
>> 
>> If you’re particularly interested in performance, you should consider
>> indexing to a leader-only collection, either by deleting the followers or
>> shutting down the Solr instances. There’s a performance penalty due to
>> forwarding the docs (talking NRT replicas here) that can be quite
>> substantial. When you turn the Solr instances back on (or ADDREPLICA),
>> they’ll sync back up.
>> 
>> Finally, I mistrust just sending a large amount of data via HTTP, because
>> there’s not much you can do except hope it all works. If this is a recurring
>> process I’d seriously consider writing a SolrJ program that parsed the
>> csv file and sent it to Solr.
>> 
>> Best,
>> Erick
>> 
>> 
>> 
>>> On Feb 2, 2020, at 9:32 AM, Joseph Lorenzini <jalo...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> I have a three-node SolrCloud cluster. The collection has a single shard. I
>>> am importing a 140 GB CSV file into Solr using curl with a URL that looks
>>> roughly like this. I am streaming the file from disk for performance
>>> reasons.
>>> 
>>> http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C
>>> 
>>> There are 139 million records in that file. I am able to import about
>>> 800,000 records into Solr, at which point Solr hangs and then several
>>> minutes later returns a 400 bad request back to curl. I looked in the logs
>>> and I did find a handful of exceptions (e.g. invalid date, docValues field
>>> is too large, etc.) for particular records, but nothing that would explain
>>> why the processing stalled and failed.
>>> 
>>> My expectation is that if Solr encounters a record it cannot ingest, it
>>> will throw an exception for that particular record and continue processing
>>> the next record. Is that how the importing works, or do all records need to
>>> be valid? If invalid records should not abort the process, then does anyone
>>> have any idea what might be going on here?
>>> 
>>> Thanks,
>>> Joe
>> 
>> 
