Hi all,

I have a three-node SolrCloud cluster, and the collection has a single shard. I am importing a 140 GB CSV file into Solr using curl with a URL that looks roughly like this (I am streaming the file from disk for performance reasons):
http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C

(The full curl invocation, with the query parameters annotated, is in the P.S. at the bottom.)

There are 139 million records in that file. I am able to import about 800,000 records into Solr, at which point Solr hangs and then, several minutes later, returns a 400 Bad Request to curl. I looked in the logs and did find a handful of exceptions for particular records (e.g. invalid date, DocValues field is too large, etc.), but nothing that would explain why the processing stalled and failed.

My expectation is that if Solr encounters a record it cannot ingest, it will throw an exception for that particular record and continue processing the next one. Is that how importing works, or do all records need to be valid? If invalid records should not abort the process, does anyone have any idea what might be going on here?

Thanks,
Joe
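P.S. In case it helps, here is the full curl invocation I am running, with the query parameters annotated. This is just a sketch of the same request as the URL above (nothing here beyond what that URL already encodes); because stream.file is used, curl sends no request body and Solr reads the file from its own local disk.

    # separator=%09              -> columns are tab-separated
    # stream.file=/tmp/input.tsv -> Solr reads this file from its own filesystem
    # stream.contentType=...     -> parse the stream with the CSV loader, UTF-8
    # commit=true                -> commit once the whole file has been processed
    # f.note.split=true / f.note.separator=%7C
    #                            -> split the "note" column into multiple values on "|"
    # (the URL is quoted because of the & and ; characters in the query string)
    curl "http://localhost:8983/solr/example/update?separator=%09&stream.file=/tmp/input.tsv&stream.contentType=text/csv;charset=utf-8&commit=true&f.note.split=true&f.note.separator=%7C"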