If you want speed, Spark is the fastest and easiest way. You can connect
to relational tables directly, import or export CSV / JSON, and read from
a distributed filesystem like S3 or HDFS.
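
As a rough sketch (Scala; the JDBC URL, credentials, and table names are
hypothetical placeholders, and the MySQL JDBC driver is assumed on the
classpath): read two MySQL tables in parallel over JDBC, join them into
flat rows, and stage CSV on HDFS for a later Solr import.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-to-csv").getOrCreate()

// Read one table over JDBC, splitting the scan across executors.
def readTable(table: String) =
  spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/mydb") // placeholder host/db
    .option("dbtable", table)
    .option("user", "etl")                           // placeholder creds
    .option("password", sys.env("DB_PASSWORD"))
    .option("partitionColumn", "id")   // numeric PK to partition reads on
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "16")
    .load()

// Denormalize parent/child tables into one flat row per document,
// much like DIH's nested entities would.
val docs = readTable("parent").join(readTable("child"), "parent_id")

// Stage the joined rows as CSV on the distributed filesystem.
docs.write.option("header", "true").csv("hdfs:///staging/solr_docs")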

Combining a distributed filesystem with Spark and a highly available Solr
cluster, you make full use of all available threads.
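
Continuing the sketch above, one hedged way to drive the indexing from
Spark itself is SolrJ inside foreachPartition (the spark-solr connector is
another option); the ZooKeeper address and collection name below are
placeholders, and SolrJ 7+ is assumed on the executor classpath.

import java.util.Optional
import scala.jdk.CollectionConverters._
import org.apache.spark.sql.Row
import org.apache.solr.client.solrj.impl.CloudSolrClient
import org.apache.solr.common.SolrInputDocument

docs.foreachPartition { rows: Iterator[Row] =>
  val client = new CloudSolrClient.Builder(
    List("zk1:2181").asJava, Optional.empty[String]()).build()
  rows.grouped(500).foreach { batch =>    // batch adds to cut round trips
    val solrDocs = batch.map { row =>
      val doc = new SolrInputDocument()
      row.schema.fieldNames.foreach(f => doc.addField(f, row.getAs[Any](f)))
      doc
    }
    client.add("mycollection", solrDocs.asJava) // placeholder collection
  }
  client.commit("mycollection")
  client.close()
}

Each partition indexes concurrently, so the write load spreads across the
Solr cluster's threads.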

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>, wrote:
> Thanks Rahul. The data source is JdbcDataSource with a MySQL database.
> Data size is around 100GB.
> I am not very familiar with Spark, but are you suggesting that we should
> create documents by merging distinct RDBMS tables using RDDs?
>
> > On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh
> > <rahul.xavier.si...@gmail.com> wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <sujaybawas...@gmail.com>,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache, but as data size grows we
> > > need to give the Solr JVM more heap memory.
> > > Can we use multiple CSV files instead of database queries, and later
> > > join the data in the CSV files using zipper? So the bottom line is to
> > > create a CSV file for each entity in data-config.xml and join these
> > > CSV files using zipper.
> > > We also tried the EHCache-based DIH cache, but since EHCache uses MMap
> > > IO it is not a good fit with MMapDirectoryFactory and can exhaust the
> > > machine's physical memory.
> > > Please suggest how we can handle the use case of importing a huge
> > > amount of data into Solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M:+91-77091 53669
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669
