Hi All,

Is there a DIH for HDFS? I see this old feature request [0] that never seems
to have gone anywhere. Google searches and searches on this list haven't
gotten me very far.

Essentially my workflow is that I have many thousands of XML documents
stored in HDFS. I run an XSLT transformation in Spark [1], which transforms
them into the expected Solr input format of <add><doc><field ... /></doc></add>.
This is then written back to HDFS. Now how do I get it into Solr? I suppose
I could move the data back to the local fs, but on the surface that feels
like the wrong way.
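
For example, would something along these lines be a reasonable way to push
the already-generated <add> XML at Solr's /update handler from the job
itself? This is just an untested sketch; the core name "mycore" and the
localhost URL are placeholders for whatever I actually end up running.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    public class PostSolrXml {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: a standalone Solr core, not SolrCloud.
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {

                // xml would hold one <add><doc>...</doc></add> block produced
                // by the Spark/XSLT step (here just a tiny hard-coded example).
                String xml = "<add><doc>"
                           + "<field name=\"id\">doc-1</field>"
                           + "<field name=\"title\">example</field>"
                           + "</doc></add>";

                // Send the raw XML to the /update handler.
                ContentStreamUpdateRequest req =
                        new ContentStreamUpdateRequest("/update");
                req.addContentStream(new ContentStreamBase.StringStream(xml));
                client.request(req);

                client.commit();
            }
        }
    }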

I don't actually need to store the documents in HDFS after the Spark
transformation, so I wonder if I can write them to Solr directly using
SolrJ. However, I am not really familiar with SolrJ, and I am running a
single node; most of the material I have read on spark-solr assumes you are
running SolrCloud.
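
What I am picturing is skipping the write back to HDFS entirely and building
SolrInputDocuments inside the Spark job (e.g. in a foreachPartition),
pointing SolrJ straight at the single standalone node with no ZooKeeper
involved. A rough sketch of the SolrJ side, again with a made-up core name:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWithSolrJ {
        public static void main(String[] args) throws Exception {
            // Standalone (non-cloud) client: point it at the core URL.
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {

                List<SolrInputDocument> batch = new ArrayList<>();

                // In the real job this would run per transformed record
                // inside a Spark foreachPartition.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-1");
                doc.addField("title", "example title");
                batch.add(doc);

                client.add(batch);   // send the batch, then commit
                client.commit();
            }
        }
    }

Does that approach make sense for a single node, or is there a better way?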

Best,
Matt



[0] https://issues.apache.org/jira/browse/SOLR-2096
[1] https://github.com/elsevierlabs-os/spark-xml-utils
