Re: Solr + Parquets

2020-08-10 Thread Russell Jurney
Sorry, I'm a goofball. I use Parquet, but I use bzip2-compressed JSON for the last hop. Thanks, Russell Jurney @rjurney russell.jur...@gmail.com datasyndrome.com On Mon, Aug 10, 2020 at 7:56 PM Aroop

Re: Solr + Parquets

2020-08-10 Thread Aroop Ganguly
> script to iterate and load the files via the post command. You mean you load Parquet files over post? That sounds unbelievable … Do you mean you created a Solr doc for each Parquet record in a partition and used SolrJ or some other Java lib to post the docs to Solr? df.mapPartitions(p => { // batch th
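
The per-partition batching that the question describes can be sketched as follows. This is a minimal illustration in Python, not the Scala of the truncated fragment above and not anyone's actual code from the thread. The names `batched`, `index_partition`, `send` and the default batch size of 500 are all hypothetical; `send` stands in for whatever client call posts one batch of documents to Solr.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_partition(
    records: Iterable[dict],
    send: Callable[[List[dict]], None],
    batch_size: int = 500,  # hypothetical default, tune for your documents
) -> int:
    """Send one partition's records in batches and return how many were sent.

    `send` is a caller-supplied function (hypothetical) that posts one
    batch of documents to Solr.
    """
    sent = 0
    for batch in batched(records, batch_size):
        send(batch)
        sent += len(batch)
    return sent
```

A function of this shape could be called once per partition, so each partition sends a handful of batched requests instead of one request per record.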

Re: Solr + Parquets

2020-08-10 Thread Russell Jurney
There are ways to load data directly from Spark to Solr, but I didn't find any of them satisfactory, so I just create enough Spark partitions with repartition() (to increase the partition count) or coalesce() (to decrease it) that I get as many Parquet files as I want, and then I use a bash script to i
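
As a rough illustration of the "iterate over the output files and post them" step, here is a sketch in Python rather than bash. It assumes the last hop is written as bzip2-compressed JSON Lines (one document per line), that the part files match `part-*.json.bz2`, and that Solr is reachable at the hypothetical URL below. The collection name, host and helper names are all made up for the example. It sends each file as a JSON array to Solr's `/update` handler.

```python
import bz2
import json
import urllib.request
from pathlib import Path
from typing import Iterator, List

# Hypothetical values: adjust the host, port and collection name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/my_collection/update?commit=true"


def read_docs(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a bzip2-compressed JSON Lines file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def build_update_request(
    docs: List[dict], url: str = SOLR_UPDATE_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the documents as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_directory(directory: str, pattern: str = "part-*.json.bz2") -> None:
    """Post every matching part file to Solr, one request per file."""
    for path in sorted(Path(directory).glob(pattern)):
        request = build_update_request(list(read_docs(path)))
        with urllib.request.urlopen(request) as response:
            print(path.name, response.status)
```

Because each part file becomes one request, the partition count chosen in Spark also sets the request size here. Committing on every request, as this sketch does, is the simplest choice but probably not the fastest for large loads.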

Re: Solr + Parquets

2020-08-07 Thread Jörn Franke
DIH is deprecated and will be removed from Solr. You may still be able to install it as a plug-in; however, AFAIK nobody maintains it. Do not use it anymore. You can write a custom Spark data source that writes to Solr, or do it in a Spark map step using SolrJ. In both cases do not c