The split/group implementation in RegexTransformer is not as efficient as CSVLoader's. Perhaps we need a specialized CSV loader in DIH. SOLR-2549 aims to add this support. I'll take a look.
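For a rough picture of the difference, the per-line work CSVLoader does is closer to a plain tab split, while RegexTransformer runs a multi-group regex match per line. A minimal self-contained sketch (not Solr's actual code; class and method names are made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitVsRegex {

    // A 22-group pattern like the DIH config in this thread.
    static final Pattern LINE =
            Pattern.compile("^(.*)\t" + "(.*)\t".repeat(20) + "(.*)$");

    // CSVLoader-style: one linear pass over the line.
    static String[] bySplit(String line) {
        return line.split("\t", -1); // -1 keeps trailing empty fields
    }

    // RegexTransformer-style: full-pattern match, one capture group per field.
    static String[] byRegex(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // line does not have exactly 22 tab-separated fields
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // Build one 22-field tab-delimited line: "f1\tf2\t...\tf22"
        StringBuilder sb = new StringBuilder("f1");
        for (int i = 2; i <= 22; i++) {
            sb.append('\t').append('f').append(i);
        }
        String line = sb.toString();

        System.out.println(bySplit(line).length); // prints 22
        System.out.println(byRegex(line).length); // prints 22
    }
}
```

Both produce the same fields, but `split` scans each line once, while each greedy `(.*)` group also matches tabs and forces the regex engine to backtrack across the whole line — the kind of per-line overhead CSVLoader avoids.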
On Tue, Jul 2, 2013 at 12:26 AM, Mike L. <javaone...@yahoo.com> wrote:
> Hey Ahmet / Solr User Group,
>
> I tried using the built-in UpdateCSV and it runs a lot faster than a
> FileDataSource DIH, as illustrated below. However, I am a bit confused about
> the numDocs/maxDoc values when doing an import this way. Here's my GET
> command against a tab-delimited file (I removed server info and additional
> fields; everything else is the same):
>
> http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields
>
> My response from Solr:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
> </response>
>
> I am experimenting with 2 CSV files (1 with 10 records, the other with 1000)
> to see if I can get this to run correctly before loading my entire collection
> of data. I initially loaded the first 1000 records to an empty core and that
> seemed to work. However, when running the above with a CSV file that has
> 10 records, I would like to see only 10 active records in my core. What I get
> instead, when looking at my stats page:
>
> numDocs 1000
> maxDoc 1010
>
> If I run the same URL above while appending optimize=true, I get:
>
> numDocs 1000
> maxDoc 1000
>
> Perhaps the commit=true is not doing what it's supposed to, or am I missing
> something? I also tried passing a commit afterward like this:
> http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E
> (didn't seem to do anything either)
>
>
> From: Ahmet Arslan <iori...@yahoo.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Mike L.
> <javaone...@yahoo.com>
> Sent: Saturday, June 29, 2013 7:20 AM
> Subject: Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5
>
> Hi Mike,
>
> You could try http://wiki.apache.org/solr/UpdateCSV
>
> And make sure you commit at the very end.
>
>
> ________________________________
> From: Mike L. <javaone...@yahoo.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Saturday, June 29, 2013 3:15 AM
> Subject: FileDataSource vs JdbcDataSource (speed) Solr 3.5
>
>
> I've been working on improving index time with a JdbcDataSource DIH-based
> config and found it not to be as performant as I'd hoped, for various
> reasons, not specifically due to Solr. With that said, I decided to switch
> gears a bit and test out a FileDataSource setup. I assumed that by
> eliminating network latency I should see drastic improvements in import
> time, but I'm a bit surprised that this process seems to run much slower,
> at least the way I've initially coded it (below).
>
> Below is a barebones file import that I wrote which consumes a
> tab-delimited file. Nothing fancy here; the regex just separates out the
> fields. Is there a faster approach to doing this? If so, what is it?
>
> Also, what is the "recommended" approach in terms of indexing/importing
> data? I know that may come across as a vague question, as there are various
> options available, but which one would be considered the "standard"
> approach within a production enterprise environment?
> (below has been cleansed)
>
> <dataConfig>
>   <dataSource name="file" type="FileDataSource" />
>   <document>
>     <entity name="entity1"
>             processor="LineEntityProcessor"
>             url="[location_of_file]/file.csv"
>             dataSource="file"
>             transformer="RegexTransformer,TemplateTransformer">
>       <field column="rawLine"
>              regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
>              groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />
>     </entity>
>   </document>
> </dataConfig>
>
> Thanks in advance,
> Mike

--
Regards,
Shalin Shekhar Mangar.
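An aside on the DIH config quoted above: each greedy `(.*)` group can itself match tab characters, so the engine may backtrack across all 22 groups on every line. Restricting each group to non-tab characters makes the match effectively single-pass. A sketch of the same `<field>` element with that one change (field names as in the original; a hedged suggestion, not tested against Solr 3.5):

```xml
<field column="rawLine"
       regex="^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)$"
       groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />
```

The `[^\t]*` groups cannot overlap a field boundary, so there is exactly one way to match each line and no backtracking.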