Hi all,

I have a three-node SolrCloud cluster. The collection has a single shard. I
am importing a 140 GB CSV file into Solr using curl, with a URL that looks
roughly like this. I am having Solr stream the file from local disk
(stream.file) for performance reasons.

http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C
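In case the exact invocation matters, it is essentially just curl against that URL (the single quotes are only there so the shell does not interpret the ampersands; host, collection name, and file path are roughly as shown):

curl 'http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C'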

There are 139 million records in that file. I am able to import about
800,000 records into Solr, at which point Solr hangs and then, several
minutes later, returns a 400 Bad Request to curl. I looked in the logs
and did find a handful of exceptions (e.g. invalid date, docValues field
is too large, etc.) for particular records, but nothing that would explain
why the processing stalled and failed.

My expectation is that if Solr encounters a record it cannot ingest, it
will throw an exception for that particular record and continue with the
next one. Is that how importing works, or do all records need to be
valid? If invalid records should not abort the process, does anyone have
any idea what might be going on here?

Thanks,
Joe
