That sounds like a good option. So the Spark job will connect to MySQL and create
Solr documents, which are pushed into Solr using SolrJ, probably in batches.
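Roughly, that flow could look like the sketch below (connection details, ZooKeeper
hosts, collection name, and field names are placeholders, not from this thread):

import java.util.Optional
import scala.collection.JavaConverters._

import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.sql.{Row, SparkSession}

object MySqlToSolr {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mysql-to-solr").getOrCreate()

    // Read the source table over JDBC (placeholder connection details and query).
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/mydb")
      .option("dbtable", "(SELECT id, title, body FROM docs) AS t")
      .option("user", "user")
      .option("password", "secret")
      .load()

    // Index each partition with its own SolrJ client, in batches of 1000 documents.
    df.foreachPartition { rows: Iterator[Row] =>
      val client = new CloudSolrClient.Builder(
        List("zk1:2181", "zk2:2181").asJava, Optional.of("/solr")).build()
      val batch = new java.util.ArrayList[SolrInputDocument](1000)

      def flush(): Unit = if (!batch.isEmpty) {
        client.add("mycollection", batch)   // rely on Solr autoCommit for visibility
        batch.clear()
      }

      rows.foreach { row =>
        val doc = new SolrInputDocument()
        doc.addField("id", row.getAs[Long]("id").toString)
        doc.addField("title", row.getAs[String]("title"))
        doc.addField("body", row.getAs[String]("body"))
        batch.add(doc)
        if (batch.size() == 1000) flush()
      }
      flush()
      client.close()
    }

    spark.stop()
  }
}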
On Thu, Apr 12, 2018 at 10:48 PM, Rahul Singh
wrote:
> If you want speed, Spark is the fastest and easiest way. You can connect to
> relational tables directly and import, or export to CSV / JSON and import
> from a distributed filesystem like S3 or HDFS.
CSV -> Spark -> Solr
https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc
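For readers following along, the linked example boils down to roughly the snippet
below (zkhost, collection, and the CSV path are placeholders):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("csv-to-solr").getOrCreate()

// Load the exported CSV files (path is a placeholder).
val csvDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///export/docs/*.csv")

// Write straight to Solr through the lucidworks spark-solr data source.
csvDF.write
  .format("solr")
  .option("zkhost", "zk1:2181,zk2:2181/solr")
  .option("collection", "mycollection")
  .option("gen_uniq_key", "true")      // generate an id field if the CSV has none
  .option("commit_within", "10000")    // soft-commit within 10s of indexing
  .mode(SaveMode.Overwrite)
  .save()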
If speed is not an issue, there are other methods. Spring Batch / Spring Data
might have all the tools you need to get decent throughput without Spark.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
If you want speed, Spark is the fastest and easiest way. You can connect to
relational tables directly and import, or export to CSV / JSON and import from a
distributed filesystem like S3 or HDFS.
Combining a DFS with Spark and a highly available Solr - you are maximizing all
available threads.
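As a rough illustration of that pipeline (table name, bounds, and paths below are
made up): read the MySQL table in parallel over JDBC, land it on the DFS, and
index from there.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mysql-to-dfs").getOrCreate()

// Partitioned JDBC read: one connection per partition keeps executor threads busy.
val orders = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/mydb")  // placeholder
  .option("dbtable", "orders")                         // placeholder table
  .option("user", "user")
  .option("password", "secret")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "50000000")
  .option("numPartitions", "16")
  .load()

// Land the export on the distributed filesystem; Solr indexing then reads from here.
orders.write
  .option("header", "true")
  .csv("hdfs:///export/orders")   // or s3a://bucket/export/orders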
--
Rahul Singh
Thanks Rahul. The data source is JdbcDataSource with a MySQL database. Data size
is around 100GB.
I am not very familiar with Spark, but are you suggesting that we should
create documents by merging distinct RDBMS tables using RDDs?
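(For illustration only: assuming the DataFrame API rather than raw RDDs, and with
made-up table and column names, that "merge" would typically be a join like this.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("denormalize").getOrCreate()

// Helper to load one MySQL table as a DataFrame (connection details are placeholders).
def table(name: String) = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://mysql-host:3306/mydb")
  .option("user", "user")
  .option("password", "secret")
  .option("dbtable", name)
  .load()

// Flatten the normalized tables into one row per Solr document, replacing the
// per-entity lookups DIH does with SortedMapBackedCache.
val docs = table("products")
  .join(table("categories"), "category_id")
  .join(table("vendors"), "vendor_id")
  .select("product_id", "name", "category_name", "vendor_name")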
On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh
wrote:
> How much data and what is the database source? Spark is probably the fastest
> way.
How much data and what is the database source? Spark is probably the fastest
way.
--
Rahul Singh
rahul.si...@anant.us
Anant Corporation
On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar wrote:
> Hi,
>
> We are using DIH with SortedMapBackedCache, but as data size increases we
> need to provide more heap memory to the Solr JVM.
Hi,
We are using DIH with SortedMapBackedCache, but as data size increases we
need to provide more heap memory to the Solr JVM.
Can we use multiple CSV files instead of database queries, with the data in the
CSV files joined later using zipper? So the bottom line would be to create a CSV
file for each entity in the data model?