The split/group implementation in RegexTransformer is not as efficient as CSVLoader's. Perhaps we need a specialized CSV loader in DIH. SOLR-2549 aims to add this support. I'll take a look.
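For a rough picture of the difference, the per-line work CSVLoader does is closer to a plain tab split, while RegexTransformer runs a multi-group regex match per line. A minimal self-contained sketch (not Solr's actual code; class and method names are made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitVsRegex {

    // A 22-group pattern like the DIH config in this thread.
    static final Pattern LINE =
            Pattern.compile("^(.*)\t" + "(.*)\t".repeat(20) + "(.*)$");

    // CSVLoader-style: one linear pass over the line.
    static String[] bySplit(String line) {
        return line.split("\t", -1); // -1 keeps trailing empty fields
    }

    // RegexTransformer-style: full-pattern match, one capture group per field.
    static String[] byRegex(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null; // line does not have exactly 22 tab-separated fields
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        // Build one 22-field tab-delimited line: "f1\tf2\t...\tf22"
        StringBuilder sb = new StringBuilder("f1");
        for (int i = 2; i <= 22; i++) {
            sb.append('\t').append('f').append(i);
        }
        String line = sb.toString();

        System.out.println(bySplit(line).length); // prints 22
        System.out.println(byRegex(line).length); // prints 22
    }
}
```

Both produce the same fields, but `split` scans each line once, while each greedy `(.*)` group also matches tabs and forces the regex engine to backtrack across the whole line — the kind of per-line overhead CSVLoader avoids.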
On Tue, Jul 2, 2013 at 12:26 AM, Mike L. <javaone...@yahoo.com> wrote:
> Hey Ahmet / Solr User Group,
>
> I tried using the built-in UpdateCSV and it runs a lot faster than a
> FileDataSource DIH, as illustrated below. However, I am a bit confused about
> the numDocs/maxDoc values when doing an import this way. Here's my GET
> command against a tab-delimited file (I removed server info and additional
> fields; everything else is the same):
>
> http://server:port/appname/solrcore/update/csv?commit=true&header=false&separator=%09&escape=\&stream.file=/location/of/file/on/server/file.csv&fieldnames=id,otherfields
>
> My response from Solr:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">591</int></lst>
> </response>
>
> I am experimenting with 2 CSV files (1 with 10 records, the other with 1000)
> to see if I can get this to run correctly before loading my entire collection
> of data. I initially loaded the first 1000 records to an empty core and that
> seemed to work. However, when running the above with a CSV file that has
> 10 records, I would like to see only 10 active records in my core. What I get
> instead, when looking at my stats page:
>
> numDocs 1000
> maxDoc 1010
>
> If I run the same URL above while appending optimize=true, I get:
>
> numDocs 1000
> maxDoc 1000
>
> Perhaps the commit=true is not doing what it's supposed to, or am I missing
> something? I also tried passing a commit afterward like this:
> http://server:port/appname/solrcore/update?stream.body=%3Ccommit/%3E
> (didn't seem to do anything either)
>
>
> From: Ahmet Arslan <iori...@yahoo.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>; Mike L.
> <javaone...@yahoo.com>
> Sent: Saturday, June 29, 2013 7:20 AM
> Subject: Re: FileDataSource vs JdbcDataSource (speed) Solr 3.5
>
> Hi Mike,
>
> You could try http://wiki.apache.org/solr/UpdateCSV
>
> And make sure you commit at the very end.
>
>
> ________________________________
> From: Mike L. <javaone...@yahoo.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Saturday, June 29, 2013 3:15 AM
> Subject: FileDataSource vs JdbcDataSource (speed) Solr 3.5
>
>
> I've been working on improving index time with a JdbcDataSource DIH-based
> config and found it not to be as performant as I'd hoped, for various
> reasons, not specifically due to Solr. With that said, I decided to switch
> gears a bit and test out a FileDataSource setup. I assumed that by
> eliminating network latency I should see drastic improvements in import
> time, but I'm a bit surprised that this process seems to run much slower,
> at least the way I've initially coded it (below).
>
> Below is a barebones file import that I wrote which consumes a
> tab-delimited file. Nothing fancy here; the regex just separates out the
> fields. Is there a faster approach to doing this? If so, what is it?
>
> Also, what is the "recommended" approach in terms of indexing/importing
> data? I know that may come across as a vague question, as there are various
> options available, but which one would be considered the "standard"
> approach within a production enterprise environment?
> (below has been cleansed)
>
> <dataConfig>
>   <dataSource name="file" type="FileDataSource" />
>   <document>
>     <entity name="entity1"
>             processor="LineEntityProcessor"
>             url="[location_of_file]/file.csv"
>             dataSource="file"
>             transformer="RegexTransformer,TemplateTransformer">
>       <field column="rawLine"
>              regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
>              groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />
>     </entity>
>   </document>
> </dataConfig>
>
> Thanks in advance,
> Mike

--
Regards,
Shalin Shekhar Mangar.
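An aside on the DIH config quoted above: each greedy `(.*)` group can itself match tab characters, so the engine may backtrack across all 22 groups on every line. Restricting each group to non-tab characters makes the match effectively single-pass. A sketch of the same `<field>` element with that one change (field names as in the original; a hedged suggestion, not tested against Solr 3.5):

```xml
<field column="rawLine"
       regex="^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)$"
       groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22" />
```

The `[^\t]*` groups cannot overlap a field boundary, so there is exactly one way to match each line and no backtracking.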