Yes, the post tool would also be an acceptable option, and it is one I am familiar with. However, I am not seeing exactly how I would query HDFS with it. The hadoop-solr tool by Lucidworks [0] looks the most promising. I have a meeting to attend shortly; maybe I can explore that further in the afternoon.
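For what it's worth, my understanding is that bin/post handles an XML file by sending it in an HTTP POST to the core's update handler, so the same request could in principle be issued from wherever the transformed XML already lives. Below is a minimal sketch in Python using only the standard library. The host, port, and core name (`mycore`) are placeholders I made up, and `build_update_request` and `post_update` are hypothetical helpers, not part of any Solr client:

```python
import urllib.parse
import urllib.request


def build_update_request(solr_base_url, core, add_xml, commit=False):
    """Build (but do not send) a POST to Solr's XML update handler.

    solr_base_url and core are illustrative placeholders,
    e.g. "http://localhost:8983/solr" and "mycore".
    add_xml is a string holding one <add>...</add> message.
    """
    url = solr_base_url.rstrip("/") + "/" + urllib.parse.quote(core) + "/update"
    if commit:
        # Ask Solr to commit as part of this request.
        url += "?commit=true"
    return urllib.request.Request(
        url,
        data=add_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_update(request, timeout=30):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    xml = '<add><doc><field name="id">doc-1</field></doc></add>'
    req = build_update_request("http://localhost:8983/solr", "mycore", xml, commit=True)
    print(req.get_method(), req.full_url)
    # post_update(req)  # needs a running Solr with a core named "mycore"
```

In the Spark job, something like this could be called once per batch of documents instead of writing the XML back to HDFS. I have not tried that, and it says nothing about how well a single node copes with many writers at once.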
I would also like to look further into SolrJ. I have no real reason to store the results of the XSL transformation anywhere other than Solr. I am simply not familiar with SolrJ. But on the surface it seems like it might be the most performant way to handle this problem. If I do pursue this with SolrJ and Spark, will Solr handle multiple SolrJ connections all trying to add documents?

[0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers

On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Perhaps the bin/post tool? See:
> https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
>
> On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth <mgrot...@gmail.com> wrote:
> > Hi All,
> >
> > Is there a DIH for HDFS? I see this old feature request [0] that never seems to have gone anywhere. Google searches and searches on this list don't get me too far.
> >
> > Essentially my workflow is that I have many thousands of XML documents stored in HDFS. I run an XSLT transformation in Spark [1]. This transforms them to the expected Solr input of <add><doc><field ... /></doc></add>. This is then written back to HDFS. Now how do I get it back to Solr? I suppose I could move the data back to the local fs, but on the surface that feels like the wrong way.
> >
> > I don't need to store the documents in HDFS after the Spark transformation. I wonder if I can write them using SolrJ. However, I am not really familiar with SolrJ. I am also running a single node. Most of the material I have read on spark-solr expects you to be running SolrCloud.
> >
> > Best,
> > Matt
> >
> > [0] https://issues.apache.org/jira/browse/SOLR-2096
> > [1] https://github.com/elsevierlabs-os/spark-xml-utils