Re: indexing XML stored on HDFS

Matthew Roth Thu, 07 Dec 2017 07:04:07 -0800

Yes, the post tool would also be an acceptable option, and it is one I am
familiar with. However, I still don't see exactly how I would query HDFS
with it. The hadoop-solr [0] tool by Lucidworks looks the most promising. I
have a meeting to attend shortly, and maybe I can explore that further in
the afternoon.
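
For illustration, one way to read the transformed files without first copying
them to the local filesystem is WebHDFS, the REST interface to HDFS, assuming
it is enabled on the cluster. The sketch below is neither the post tool nor
hadoop-solr. It is plain Python that fetches one file over WebHDFS and sends
it to Solr's XML update handler. The host names, the core name "mycore", and
the file path are hypothetical placeholders, and an unsecured cluster may also
need a user.name query parameter.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoints; adjust to your own cluster.
# 9870 is the Hadoop 3 default NameNode HTTP port; Hadoop 2 used 50070.
WEBHDFS_BASE = "http://namenode.example.com:9870/webhdfs/v1"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"


def webhdfs_open_url(hdfs_path, base=WEBHDFS_BASE):
    """Build the WebHDFS URL that streams the contents of one HDFS file."""
    return f"{base}{urllib.parse.quote(hdfs_path)}?op=OPEN"


def read_hdfs_file(hdfs_path):
    """Fetch a file over WebHDFS; urllib follows the redirect to a DataNode."""
    with urllib.request.urlopen(webhdfs_open_url(hdfs_path)) as resp:
        return resp.read()


def build_update_request(xml_bytes, commit=False, url=SOLR_UPDATE_URL):
    """Wrap an <add>...</add> payload in a POST to Solr's update handler."""
    query = urllib.parse.urlencode({"commit": "true" if commit else "false"})
    return urllib.request.Request(
        f"{url}?{query}",
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def index_hdfs_file(hdfs_path, commit=False):
    """Read one already-transformed XML file from HDFS and send it to Solr."""
    req = build_update_request(read_hdfs_file(hdfs_path), commit)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Calling index_hdfs_file("/data/out/part-00000.xml", commit=True) would index
one file under these assumptions. Listing a directory would need a further
WebHDFS call (op=LISTSTATUS), which is left out here.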


I also would like to look further into SolrJ. I have no real reason to
store the results of the XSL transformation anywhere other than Solr. I am
simply not familiar with SolrJ. But on the surface it seems like it might be
the most performant way to handle this problem.

If I do pursue this with SolrJ and Spark, will Solr handle multiple SolrJ
connections all trying to add documents at the same time?
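
To make the scenario concrete, here is a minimal sketch of several workers
adding documents at once. It swaps SolrJ and Spark for plain Python threads
and HTTP posts, so it illustrates the access pattern only. Solr's update
handler accepts requests from several clients concurrently; the usual
caution is to send documents in batches and to commit from one place rather
than from every worker. The core name "mycore", the batch size, and the
worker count are assumptions chosen for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # hypothetical core


def build_add_xml(docs):
    """Serialize dicts into Solr's <add><doc><field name=.../></doc></add> form."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_xml(xml_bytes, url=SOLR_UPDATE_URL):
    """POST one XML update message to Solr and return the HTTP status."""
    req = urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_concurrently(docs, batch_size=500, workers=4, send=post_xml):
    """Send batches from several threads; no worker issues a commit."""
    payloads = [build_add_xml(batch) for batch in batches(docs, batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, payloads))
```

After index_concurrently(docs) returns, a single post_xml(b"<commit/>") would
make the documents searchable, unless autoCommit is already configured on the
core. In Spark the same shape would apply per partition, with each partition
sending its own batches.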

[0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers

On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Perhaps the bin/post tool? See:
> https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
>
> On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth <mgrot...@gmail.com> wrote:
> > Hi All,
> >
> > Is there a DIH for HDFS? I see this old feature request [0] that never
> > seems to have gone anywhere. Google searches and searches on this list
> > don't get me too far.
> >
> > Essentially my workflow is that I have many thousands of XML documents
> > stored in HDFS. I run an XSLT transformation in Spark [1]. This transforms
> > them to the expected Solr input of <add><doc><field ... /></doc></add>.
> > This is then written back to HDFS. Now how do I get it back to Solr? I
> > suppose I could move the data back to the local fs, but on the surface
> > that feels like the wrong way.
> >
> > I don't need to store the documents in HDFS after the Spark
> > transformation, so I wonder if I can write them using SolrJ. However, I
> > am not really familiar with SolrJ. I am also running a single node. Most
> > of the material I have read on spark-solr expects you to be running
> > SolrCloud.
> >
> > Best,
> > Matt
> >
> >
> >
> > [0] https://issues.apache.org/jira/browse/SOLR-2096
> > [1] https://github.com/elsevierlabs-os/spark-xml-utils
>
