Hey Ahmet / Solr User Group,

I tried using the built-in UpdateCSV handler and it runs a lot faster than a FileDataSource DIH, as illustrated below. However, I am a bit confused about the numDocs/maxDoc values when doing an import this way.

Here's my GET command against a tab-delimited file (I removed server info and additional fields; everything else is the same):

http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields

My response from Solr:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
</response>

I am experimenting with two CSV files (one with 10 records, the other with 1000) to see if I can get this running correctly before loading my entire collection of data. I initially loaded the 1000 records into an empty core and that seemed to work. However, when I then run the URL above against the 10-record file, I would like to see only 10 active records in my core. What I get instead, looking at my stats page, is:

numDocs: 1000
maxDoc: 1010

If I run the same URL with optimize=true appended, I get:

numDocs: 1000
maxDoc: 1000

Perhaps the commit=true is not doing what it's supposed to, or am I missing something? I also tried passing a commit afterward, like this, but it didn't seem to do anything either:

http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E
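In case the stream.body form is getting mangled somewhere, here's the same explicit commit spelled out as a curl command, which is what I plan to try next. Treat it as a sketch: server/port/appname/solrcore are placeholders for my setup, and I'm assuming the 3.5 XML update handler wants a text/xml Content-Type here.

    # Post an explicit <commit/> to the XML update handler
    # (server, port, appname, solrcore are placeholders)
    curl 'http://server:port/appname/solrcore/update' \
         -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'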
From: Ahmet Arslan <iori...@yahoo.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Mike L. <javaone...@yahoo.com>
Sent: Saturday, June 29, 2013 7:20 AM
Subject: Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5

Hi Mike,

You could try http://wiki.apache.org/solr/UpdateCSV

And make sure you commit at the very end.

________________________________
From: Mike L. <javaone...@yahoo.com>
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Sent: Saturday, June 29, 2013 3:15 AM
Subject: FileDataSource vs JdbcDataSource (speed) Solr 3.5

I've been working on improving index time with a JdbcDataSource DIH-based config and found it not to be as performant as I'd hoped, for various reasons not specifically due to Solr. With that said, I decided to switch gears a bit and test out a FileDataSource setup. I assumed that by eliminating network latency I would see drastic improvements in import time, but I'm a bit surprised that this process seems to run much slower, at least the way I've initially coded it (below).

The below is a barebones file import that I wrote which consumes a tab-delimited file. Nothing fancy here; the regex just separates out the fields. Is there a faster approach to doing this? If so, what is it? Also, what is the "recommended" approach in terms of indexing/importing data? I know that may come across as a vague question, as there are various options available, but which one would be considered the "standard" approach within a production enterprise environment?

(below has been cleansed)

<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <!-- each raw line is split into 22 tab-separated fields -->
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />
    </entity>
  </document>
</dataConfig>

Thanks in advance,
Mike
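P.S. A guess at my own "faster approach" question above: with twenty-two greedy (.*) groups, the regex engine has to backtrack each group off the tabs it initially swallows, which could hurt on wide lines. Making each group ([^\t]*) means a group can never run past its field boundary. This is my untested sketch of the revised field element, using the same column and group names as the config above:

    <!-- same 22 fields, but each group stops at the next tab,
         so no backtracking across field boundaries -->
    <field column="rawLine"
           regex="^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)$"
           groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />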