Hi,

Try http://wiki.apache.org/solr/UpdateCSV . It should be faster. See 'Tab-delimited importing' at the end of the wiki page.
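As a rough sketch of what a tab-delimited request to that handler could look like, the snippet below only builds the request URL. The host, port, file path and field names are made-up placeholders. `separator`, `header`, `fieldnames`, `stream.file` and `commit` are parameters described on the wiki page, and `stream.file` assumes that remote streaming is enabled on the server.

```javascript
// Sketch: build a request URL for Solr's CSV update handler for a
// tab-separated file. Host, path and field names are hypothetical.
function buildCsvUpdateUrl(baseUrl, filePath, fieldNames) {
  const params = new URLSearchParams();
  params.set("separator", "\t");                  // serialized as %09
  params.set("header", "false");                  // assumes the files have no header line
  params.set("fieldnames", fieldNames.join(",")); // column names, in file order
  params.set("stream.file", filePath);            // file is read by the Solr server itself
  params.set("commit", "true");
  return baseUrl + "/update/csv?" + params.toString();
}

const url = buildCsvUpdateUrl(
  "http://localhost:8983/solr",
  "/tmp/data.infile",
  ["field1", "field2", "field3"]
);
console.log(url);
```

The resulting URL can be requested with any HTTP client, for example curl.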
Cheers,
Ahmet

On Monday, May 19, 2014 1:31 PM, Hal Arres <hello.world.s...@gmail.com> wrote:

Hello there,

I am working on an import configuration for my Solr index and have run into some issues. As a first step I configured an import handler to import data from a database into the Solr index. It worked just fine, but it is very slow (7K documents per second). So I wanted to switch to a data import handler using a FileDataSource. (I am running Solr 4.6.1.)

I have to import nearly 150,000,000 lines each night, and each line has the following characteristics:

- fields are separated by tabs
- 70 fields per line
- one line is nearly 600 characters long
- each line contains multiple data types (date, int, string...)

At the moment the files are imported into the database, from which Solr imports them (database import handler). To improve the import performance I wanted to import the files directly.

This is the first approach I tested:

---------------
<entity name="files"
        dataSource="null"
        rootEntity="false"
        processor="FileListEntityProcessor"
        baseDir="/tmp"
        fileName=".*\.infile"
        onError="abort"
        recursive="false">
  <entity name="csv_file"
          processor="LineEntityProcessor"
          url="${files.fileAbsolutePath}"
          dataSource="fds"
          transformer="RegexTransformer">
    <field column="rawLine"
           regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
           groupNames="field1,,,field4,field5"/>
  </entity>
</entity>
-----------------

If I import fewer than 10 fields, this works just fine. But as soon as I extend the import to 30 fields, the time to import one line increases to more than 10 seconds!

So I tried another way, in which I moved the transformation into a script:

----------------
<script><![CDATA[
function parse(row) {
    var rawLine = row.get("rawLine");
    var arr = rawLine.split("\t");
    row.put("field1", arr[0]);
    row.put("field67", arr[67]);
    // row.remove("rawLine");
    return row;
}
]]></script>
-----------------

But this was only slightly faster than the database import.
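One plausible reason the regex variant degrades so sharply is that each greedy `(.*)` group can also match tab characters, so the matcher may have to backtrack through many combinations as more groups are added. Splitting on the tab avoids that. Below is a minimal sketch of the split-based mapping, run outside of Solr with made-up field names and a made-up sample line. Inside the script transformer the row is a Java map, so `row.get` and `row.put` would be used instead of a plain object.

```javascript
// Sketch: map one tab-separated line onto named fields by position.
// Field names and the sample line are hypothetical.
function parseLine(rawLine, fieldNames) {
  const values = rawLine.split("\t");
  const doc = {};
  fieldNames.forEach((name, i) => {
    // Skip unnamed columns (empty name) and missing or empty values.
    if (name && values[i] !== undefined && values[i] !== "") {
      doc[name] = values[i];
    }
  });
  return doc;
}

// An empty name means "ignore this column", like the empty
// entries in groupNames="field1,,,field4,field5".
const names = ["field1", "", "", "field4", "field5"];
const doc = parseLine("a\tb\tc\t2014-05-19\t42", names);
console.log(doc);
```

Driving the mapping from a list of names keeps the per-line work to a single split and one pass over the columns, however many of the 70 fields are kept.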
Does anyone have an idea how I can improve my import performance?

Thank you very, very much,
Sebastian