I've been working on improving index time with a JdbcDataSource DIH-based
config and found it not to be as performant as I'd hoped, for various reasons
not specifically due to Solr. With that said, I decided to switch gears a bit
and test out a FileDataSource setup. I assumed that by eliminating network
latency I would see drastic improvements in import time, but I'm a bit
surprised that this process runs much slower, at least the way I've initially
coded it (below).
Below is a barebones file import I wrote that consumes a tab-delimited file.
Nothing fancy here; the regex just separates out the fields. Is there a faster
approach to doing this? If so, what is it?
Also, what is the "recommended" approach to indexing/importing data? I know
that may come across as a vague question since there are various options
available, but which one would be considered the "standard" approach in a
production enterprise environment?
(below has been cleansed)
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <field column="rawLine"
             regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22"
      />
    </entity>
  </document>
</dataConfig>
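For what it's worth, one variant I'm planning to try replaces each greedy (.*)
group with ([^\t]*), so that no group can ever consume past a tab delimiter
and the regex engine has far less backtracking to do across the 22 fields. I
haven't benchmarked this yet, so treat it as a sketch rather than a confirmed
fix:

```xml
<dataConfig>
  <dataSource name="file" type="FileDataSource" />
  <document>
    <entity name="entity1"
            processor="LineEntityProcessor"
            url="[location_of_file]/file.csv"
            dataSource="file"
            transformer="RegexTransformer,TemplateTransformer">
      <!-- ([^\t]*) stops at the tab delimiter, unlike the greedy (.*) groups -->
      <field column="rawLine"
             regex="^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)$"
             groupNames="field1,field2,field3,field4,field5,field6,field7,field8,field9,field10,field11,field12,field13,field14,field15,field16,field17,field18,field19,field20,field21,field22"
      />
    </entity>
  </document>
</dataConfig>
```

If the regex still turns out to be the bottleneck, the other option I'm
weighing is skipping the regex entirely and posting the file to Solr's CSV
update handler (which accepts a separator parameter), but I'd like to
understand the DIH numbers first.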
Thanks in advance,
Mike