Matthew,

The hadoop-solr project you mention would give you the ability to index
files stored in HDFS. It's a Job Jar: you submit it to Hadoop with the
params you need, and it processes the files and sends them to Solr. It
might not be the fastest thing in the world, since it uses MapReduce, but
we (I work at Lucidworks) do have a number of people using it.
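
Just to make that concrete, job submission looks roughly like this (this
is from memory, so double-check the project README; the collection name
and paths here are just examples):

  hadoop jar solr-hadoop-job-<version>.jar \
    com.lucidworks.hadoop.ingest.IngestJob \
    -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
    -c yourCollection \
    -i hdfs://namenode/path/to/docs/* \
    -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
    -s http://localhost:8983/solr

The -cls param picks the ingest mapper; the IngestMappers wiki page you
linked lists the other mappers if the directory one isn't the right fit
for your XML.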

However, you mention that you're already processing your files with Spark,
and you don't really need them in HDFS in the long run - have you seen the
Spark-Solr project at https://github.com/lucidworks/spark-solr/? It has an
RDD for indexing docs to Solr, so you can read the files from wherever
they originate, transform them in Spark, and send them straight to Solr.
That might be a better fit for your existing workflow.
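
A rough, untested sketch of what that could look like (the field names,
collection, and ZooKeeper address are made up, and the helper lives in
com.lucidworks.spark.util in the versions I've looked at - check the
README for yours):

  import com.lucidworks.spark.util.SolrSupport
  import org.apache.solr.common.SolrInputDocument

  // Assuming a spark-shell session where sc is the SparkContext.
  // Stand-in for the output of your XSLT step: (id, body) pairs.
  val records = sc.parallelize(Seq(("1", "first doc"), ("2", "second doc")))

  // Convert each record into a SolrInputDocument.
  val docs = records.map { case (id, body) =>
    val doc = new SolrInputDocument()
    doc.setField("id", id)
    doc.setField("text_txt", body)
    doc
  }

  // Send the docs to Solr in batches of 100. The first arg is the
  // ZooKeeper connection string for your SolrCloud cluster.
  SolrSupport.indexDocs("localhost:9983", "yourCollection", 100, docs)

Spark-solr does talk to Solr via ZooKeeper, so it expects cloud mode, but
a single node started with "bin/solr start -c" (which runs embedded
ZooKeeper on port 9983) is enough. And yes, Solr will happily accept
concurrent updates from multiple clients - that's exactly what each Spark
partition does here.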

Hope it helps -
Cassandra

On Thu, Dec 7, 2017 at 9:03 AM, Matthew Roth <mgrot...@gmail.com> wrote:

> Yes, the post tool would also be an acceptable option, and one I am
> familiar with. However, I am also not seeing exactly how I would query
> HDFS. The hadoop-solr [0
> <https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers>] tool by
> Lucidworks looks the most promising. I have a meeting to attend shortly;
> maybe I can explore that further in the afternoon.
>
> I would also like to look further into SolrJ. I have no real reason to
> store the results of the XSL transformation anywhere other than Solr; I
> am simply not familiar with it. But on the surface it seems like it
> might be the most performant way to handle this problem.
>
> If I do pursue this with SolrJ and Spark, will Solr handle multiple
> SolrJ connections all trying to add documents?
>
> [0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers
>
> On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Perhaps the bin/post tool? See:
> > https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
> >
> > On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth <mgrot...@gmail.com> wrote:
> > > Hi All,
> > >
> > > Is there a DIH for HDFS? I see this old feature request [0
> > > <https://issues.apache.org/jira/browse/SOLR-2096>] that never seems
> > > to have gone anywhere. Google searches and searches on this list
> > > don't get me too far.
> > >
> > > Essentially, my workflow is that I have many thousands of XML
> > > documents stored in HDFS. I run an XSLT transformation in Spark [1
> > > <https://github.com/elsevierlabs-os/spark-xml-utils>]. This
> > > transforms them to the expected Solr input of
> > > <add><doc><field ... /></doc></add>, which is then written back to
> > > HDFS. Now how do I get it back to Solr? I suppose I could move the
> > > data back to the local fs, but on the surface that feels like the
> > > wrong way.
> > >
> > > I don't need to store the documents in HDFS after the Spark
> > > transformation, so I wonder if I can write them using SolrJ.
> > > However, I am not really familiar with SolrJ. I am also running a
> > > single node, and most of the material I have read on spark-solr
> > > expects you to be running SolrCloud.
> > >
> > > Best,
> > > Matt
> > >
> > >
> > >
> > > [0] https://issues.apache.org/jira/browse/SOLR-2096
> > > [1] https://github.com/elsevierlabs-os/spark-xml-utils
> >
>