Re: DIH with huge data

2018-04-12 Thread Sujay Bawaskar
That sounds like a good option. So the Spark job will connect to MySQL, create Solr documents, and push them into Solr using SolrJ, probably in batches.

On Thu, Apr 12, 2018 at 10:48 PM, Rahul Singh wrote: > If you want speed, Spark is the fastest, easiest way. You can connect to > relational tables direc
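For illustration, a minimal SolrJ batch-push sketch along those lines (spark-shell/script style; the ZooKeeper address, collection name, and fields are placeholders, and the documents are synthetic stand-ins for rows coming out of MySQL):

    import org.apache.solr.client.solrj.impl.CloudSolrClient
    import org.apache.solr.common.SolrInputDocument
    import scala.collection.JavaConverters._

    // Placeholder ZooKeeper ensemble and collection name
    val client = new CloudSolrClient.Builder(List("zk1:2181").asJava, java.util.Optional.empty[String]()).build()

    // Synthetic documents standing in for rows joined out of MySQL
    val docs = (1 to 100000).iterator.map { i =>
      val doc = new SolrInputDocument()
      doc.addField("id", i.toString)
      doc.addField("name_s", s"row $i")
      doc
    }

    // Push fixed-size batches so the full data set is never held in memory at once
    docs.grouped(1000).foreach(batch => client.add("my_collection", batch.asJava))
    client.commit("my_collection")
    client.close()

Committing once at the end (or relying on autoCommit) is usually preferable to committing per batch.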

Re: DIH with huge data

2018-04-12 Thread Rahul Singh
CSV -> Spark -> Solr: https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc

If speed is not an issue there are other methods. Spring Batch / Spring Data might have all the tools you need to get speed without Spark.

-- Rahul Singh rahul.si...@anant.us Anant Corporation On
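The linked example boils down to roughly the following (a sketch only; the CSV path, ZooKeeper hosts, and collection name are placeholders, and the spark-solr jar must be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-to-solr").getOrCreate()

    // Read the exported CSV files from a distributed filesystem
    val csvDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///export/entities/*.csv")

    // Write to Solr through the lucidworks/spark-solr data source
    csvDF.write
      .format("solr")
      .option("zkhost", "zk1:2181,zk2:2181,zk3:2181")
      .option("collection", "my_collection")
      .mode("overwrite")
      .save()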

Re: DIH with huge data

2018-04-12 Thread Rahul Singh
If you want speed, Spark is the fastest, easiest way. You can connect to relational tables directly and import, or export to CSV / JSON and import from a distributed filesystem like S3 or HDFS. Combining a DFS with Spark and a highly available Solr, you are maximizing all threads.

-- Rahul Singh
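For the export-to-DFS route, the Spark side is just a DataFrame write; a small sketch with made-up paths (s3a:// URIs work the same way for S3):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("export-to-dfs").getOrCreate()
    import spark.implicits._

    // Placeholder rows standing in for data read from the relational source
    val df = Seq((1, "alpha"), (2, "beta")).toDF("id", "name")

    // Land CSV or JSON exports on HDFS; the Solr indexing job then reads from there
    df.write.mode("overwrite").option("header", "true").csv("hdfs:///export/entities_csv")
    df.write.mode("overwrite").json("hdfs:///export/entities_json")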

Re: DIH with huge data

2018-04-12 Thread Sujay Bawaskar
Thanks Rahul. The data source is JdbcDataSource with a MySQL database. Data size is around 100GB. I am not very familiar with Spark, but are you suggesting that we should create documents by merging distinct RDBMS tables using RDDs?

On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh wrote: > How much data
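To make the DataFrame/RDD idea concrete, here is a sketch of reading two MySQL tables over JDBC and denormalising them into one row per Solr document; the connection details, table names, and columns are all invented for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("mysql-to-solr").getOrCreate()
    import spark.implicits._

    // Placeholder connection details and tables
    def readTable(name: String) = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb")
      .option("user", "etl")
      .option("password", "secret")
      .option("dbtable", name)
      .option("partitionColumn", "id")      // split the read across executors
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "16")
      .option("fetchsize", "10000")         // stream rows instead of buffering them
      .load()

    // One denormalised row per Solr document: parent fields plus child fields
    val docs = readTable("orders").as("o")
      .join(readTable("order_items").as("i"), $"o.id" === $"i.order_id", "left")
      .select($"o.id".as("id"), $"o.customer_name", $"i.sku", $"i.quantity")

    // docs can then be written with the spark-solr connector or pushed via SolrJ in batches

Note that a plain join produces one row per child, so a real job would typically aggregate child rows into multi-valued fields before indexing.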

Re: DIH with huge data

2018-04-12 Thread Rahul Singh
How much data, and what is the database source? Spark is probably the fastest way.

-- Rahul Singh rahul.si...@anant.us Anant Corporation

On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar wrote: > Hi, > > We are using DIH with SortedMapBackedCache but as data size increases we > need to provide mo

DIH with huge data

2018-04-12 Thread Sujay Bawaskar
Hi,

We are using DIH with SortedMapBackedCache, but as the data size increases we need to give more heap memory to the Solr JVM. Can we use multiple CSV files instead of database queries, with the data in the CSV files joined later using zipper? So the bottom line is to create CSV files for each entity in da
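For context, the zipper join is essentially a streaming merge of inputs that are already sorted on the join key, so neither side has to be cached in the heap the way SortedMapBackedCache does. A rough sketch of that merge idea (not DIH's actual implementation), assuming two made-up CSV files with key,value rows sorted by key:

    import scala.io.Source

    // Streaming merge of two CSV files that are both pre-sorted by the join key
    // (first column). Memory use stays constant regardless of file size.
    def zipperJoin(parentPath: String, childPath: String): Unit = {
      val parents  = Source.fromFile(parentPath).getLines().map(_.split(",", 2)).buffered
      val children = Source.fromFile(childPath).getLines().map(_.split(",", 2)).buffered

      while (parents.hasNext && children.hasNext) {
        val pKey = parents.head(0)
        val cKey = children.head(0)
        if (pKey < cKey) parents.next()        // parent row with no children
        else if (cKey < pKey) children.next()  // child row with no parent
        else {
          val child = children.next()          // handles 1:N by advancing the child only
          // Emit one joined record here, e.g. build a SolrInputDocument
          println(s"id=$pKey parent=${parents.head(1)} child=${child(1)}")
        }
      }
    }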