I've been working on improving index time with a JdbcDataSource DIH-based
config and found it not to be as performant as I'd hoped, for various reasons
not specifically due to Solr. With that said, I decided to switch gears a bit
and test out a FileDataSource setup. I assumed that by eliminating network
latency I would see drastic improvements in import time, but I'm a bit
surprised that this process runs much slower, at least the way I've initially
coded it (below).
Below is a barebones file import I wrote that consumes a tab-delimited file.
Nothing fancy here; the regex just separates out the fields. Is there a faster
approach to doing this? If so, what is it?
Also, what is the "recommended" approach to indexing/importing data? I know
that may come across as a vague question since there are various options
available, but which one would be considered the "standard" approach in a
production enterprise environment?
(below has been cleansed)
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22"
      />
    </entity>
  </document>
</dataConfig>
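For what it's worth, one variant I'm planning to try replaces each greedy (.*)
group with ([^\t]*), so that no group can ever consume past a tab delimiter
and the regex engine has far less backtracking to do across the 22 fields. I
haven't benchmarked this yet, so treat it as a sketch rather than a confirmed
fix:

```xml
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <!-- ([^\t]*) stops at the tab delimiter, unlike the greedy (.*) groups -->
      <field column="rawLine"
             regex="^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22"
      />
    </entity>
  </document>
</dataConfig>
```

If the regex still turns out to be the bottleneck, the other option I'm
weighing is skipping the regex entirely and posting the file to Solr's CSV
update handler (which accepts a separator parameter), but I'd like to
understand the DIH numbers first.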
Thanks in advance,
Mike