Re: indexing XML stored on HDFS

Matthew Roth Fri, 08 Dec 2017 11:49:59 -0800

Thanks Rick,

While long-term storage of the documents in HDFS is not necessary, you do
raise a good point: easy access to these documents during the development
phase will be useful.


Cassandra,

For spark-solr, I am under the impression that I must be running SolrCloud.
At this time I need some features that are not available in SolrCloud, e.g.
joining across cores. Additionally, the projected demands on Solr mean that
running it as a single node will be acceptable.

The hadoop-solr project does look the most promising at the moment. I am
hoping to play with it some this afternoon, but it may have to wait until
the new week.

Thanks for the help.

Best,
Matt

On Fri, Dec 8, 2017 at 1:36 PM, Cassandra Targett <casstarg...@gmail.com>
wrote:

> Matthew,
>
> The hadoop-solr project you mention would give you the ability to index
> files in HDFS. It's a Job Jar, so you submit it to Hadoop with the params
> you need and it processes the files and sends them to Solr. It might not be
> the fastest thing in the world since it uses MapReduce but we (I work at
> Lucidworks) do have a number of people using it.
>
> However, you mention that you're already processing your files with Spark,
> and you don't really need them in HDFS in the long run - have you seen the
> Spark-Solr project at https://github.com/lucidworks/spark-solr/? It has an
> RDD for indexing docs to Solr, so you would be able to get the files from
> wherever they originate, transform them in Spark, and get them into Solr.
> It might be a better solution for your existing workflow.
>
> Hope it helps -
> Cassandra
>
> On Thu, Dec 7, 2017 at 9:03 AM, Matthew Roth <mgrot...@gmail.com> wrote:
>
> > Yes, the post tool would also be an acceptable option and one I am familiar
> > with. However, I also am not seeing exactly how I would query hdfs. The
> > hadoop-solr [0
> > <https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers>] tool by
> > lucidworks looks the most promising. I have a meeting to attend to shortly,
> > and maybe I can explore that further in the afternoon.
> >
> > I also would like to look further into solrj. I have no real reason to
> > store the results of the XSL transformation anywhere other than solr. I am
> > simply not familiar with it. But on the surface it seems like it might be
> > the most performant way to handle this problem.
> >
> > If I do pursue this with solrj and spark, will solr handle multiple solrj
> > connections all trying to add documents?
> >
> > [0] https://github.com/lucidworks/hadoop-solr/wiki/IngestMappers
> >
> > On Wed, Dec 6, 2017 at 5:36 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > Perhaps the bin/post tool? See:
> > > https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
> > >
> > > On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth <mgrot...@gmail.com> wrote:
> > > > Hi All,
> > > >
> > > > Is there a DIH for HDFS? I see this old feature request [0
> > > > <https://issues.apache.org/jira/browse/SOLR-2096>] that never seems to
> > > > have gone anywhere. Google searches and searches on this list don't
> > > > get me too far.
> > > >
> > > > Essentially my workflow is that I have many thousands of XML documents
> > > > stored in hdfs. I run an xslt transformation in spark [1
> > > > <https://github.com/elsevierlabs-os/spark-xml-utils>]. This transforms
> > > > them to the expected solr input of <add><doc><field ... /></doc></add>.
> > > > This is then written back to hdfs. Now how do I get it into solr? I
> > > > suppose I could move the data back to the local fs, but on the surface
> > > > that feels like the wrong way.
> > > >
> > > > I don't need to store the documents in HDFS after the spark
> > > > transformation, so I wonder if I can write them using solrj. However, I
> > > > am not really familiar with solrj. I am also running a single node. Most
> > > > of the material I have read on spark-solr expects you to be running
> > > > SolrCloud.
> > > >
> > > > Best,
> > > > Matt
> > > >
> > > >
> > > >
> > > > [0] https://issues.apache.org/jira/browse/SOLR-2096
> > > > [1] https://github.com/elsevierlabs-os/spark-xml-utils
> > >
> >
>
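
A minimal sketch of the single-node route discussed in this thread: since the
Spark job already produces Solr's <add><doc> XML, each batch can be POSTed to
the core's /update handler over plain HTTP, with no SolrCloud requirement.
This is an illustration in Python, not something anyone in the thread ran. The
function names build_add_xml and post_update, the core URL, and the sample
field names are all assumptions made for the example.

```python
import urllib.request
from xml.etree import ElementTree as ET


def build_add_xml(docs):
    """Serialize an iterable of {field: value} dicts as a Solr <add> message.

    Hypothetical helper: in the thread's workflow the XSLT step already
    emits this XML, so this function only stands in for that output.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple is treated as a multi-valued field:
            # one <field> element is emitted per value.
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    # ElementTree escapes &, < and > in text; encode the result as UTF-8 bytes.
    return ET.tostring(add, encoding="unicode").encode("utf-8")


def post_update(solr_core_url, payload, commit=False):
    """POST an XML update message to <core>/update on a standalone Solr node.

    solr_core_url is assumed to look like http://localhost:8983/solr/mycore.
    Returns the HTTP status code; urllib raises HTTPError on 4xx/5xx.
    """
    url = solr_core_url.rstrip("/") + "/update"
    if commit:
        url += "?commit=true"
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

From PySpark, one request per partition could be sent with something like
rdd.foreachPartition(lambda docs: post_update(core_url, build_add_xml(docs))),
where core_url is a placeholder for your own core. Issuing a single commit
after the job finishes is usually cheaper than committing on every batch.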
