If you want speed, Spark is the fastest and easiest way. You can connect to relational tables directly, import or export CSV / JSON, and load from a distributed filesystem like S3 or HDFS.
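A minimal sketch of that pipeline in Scala (the JDBC URL, credentials, table name, partition bounds, and output path are all placeholders, and the MySQL JDBC driver is assumed to be on the classpath):

import org.apache.spark.sql.SparkSession

object MysqlToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mysql-to-hdfs")
      .getOrCreate()

    // Read the relational table directly over JDBC, in parallel partitions
    // so a ~100GB table is not pulled through a single connection.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb") // placeholder
      .option("dbtable", "orders")                     // placeholder
      .option("user", "etl_user")                      // placeholder
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .option("partitionColumn", "id")                 // numeric key column
      .option("lowerBound", "1")
      .option("upperBound", "100000000")               // rough max(id)
      .option("numPartitions", "16")
      .load()

    // Export as CSV onto a distributed filesystem (HDFS here; s3a:// also works).
    orders.write
      .option("header", "true")
      .csv("hdfs:///staging/orders_csv")

    spark.stop()
  }
}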
Combining a DFS with Spark and a highly available Solr, you are maximizing all threads.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>, wrote:
> Thanks Rahul. The data source is JdbcDataSource with a MySQL database, and
> the data size is around 100GB.
> I am not very familiar with Spark, but are you suggesting that we should
> create documents by merging distinct RDBMS tables using RDDs?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh <rahul.xavier.si...@gmail.com>
> wrote:
>
> > How much data, and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache, but as the data size grows
> > > we need to give the Solr JVM more and more heap memory.
> > > Could we use multiple CSV files instead of database queries, and join
> > > the data in those CSV files using zipper? So the bottom line is to
> > > create one CSV file for each entity in data-config.xml and join these
> > > CSV files using zipper.
> > > We also tried the EHCache-based DIH cache, but since EHCache uses MMap
> > > IO it does not combine well with MMapDirectoryFactory and ends up
> > > exhausting the physical memory on the machine.
> > > Please suggest how we can handle the use case of importing a huge
> > > amount of data into Solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M: +91-77091 53669
>
> --
> Thanks,
> Sujay P Bawaskar
> M: +91-77091 53669
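On the zipper question in the quoted thread: the per-key merge that DIH's zipper join performs over sorted inputs can be pushed into Spark instead, so nothing like SortedMapBackedCache has to live in the Solr JVM's heap. A minimal sketch, assuming the entities have already been exported to CSV as above (paths, the join column, the ZooKeeper string, and the collection name are placeholders) and that the Lucidworks spark-solr connector is on the classpath; without it, the joined frame can be written back out as JSON or CSV and loaded separately:

import org.apache.spark.sql.SparkSession

object ZipperStyleJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("zipper-style-join")
      .getOrCreate()

    // One CSV export per DIH entity; paths and columns are placeholders.
    val parents  = spark.read.option("header", "true").csv("hdfs:///staging/orders_csv")
    val children = spark.read.option("header", "true").csv("hdfs:///staging/order_items_csv")

    // The zipper-style merge is just an ordinary join on the key column,
    // executed across the cluster rather than inside Solr's heap.
    val docs = parents.join(children, Seq("order_id"), "left_outer")

    // Index via the spark-solr connector (assumed available).
    docs.write
      .format("solr")
      .option("zkhost", "zk1:2181,zk2:2181,zk3:2181/solr") // placeholder
      .option("collection", "orders")                      // placeholder
      .mode("overwrite")
      .save()

    spark.stop()
  }
}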