Matthew,

Oops, I should have mentioned re-indexing. With Solr, you want to be able to re-index quickly so that you can try out different analysis chains, and XSLT may not be fast enough for that if you have millions of docs. So I would be inclined to save the transformed docs to a normal filesystem, perhaps as JSONL, and then use the DIH, the post tool, or Python to post them to Solr.

Rick
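P.S. Here is a rough, untested sketch of the Python route. The core name, URL, file name, and batch size are all made up, so adjust them to your setup. It assumes the requests library is installed and that the JSONL file has one Solr document per line:

import json
import requests  # third-party; pip install requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # made-up core name
BATCH_SIZE = 1000  # docs per HTTP request

def post_batch(docs):
    # POST a JSON array of documents to Solr's update handler.
    resp = requests.post(SOLR_UPDATE_URL, json=docs,
                         params={"commitWithin": 10000})
    resp.raise_for_status()

batch = []
with open("docs.jsonl", encoding="utf-8") as fh:
    for line in fh:
        if line.strip():
            batch.append(json.loads(line))
        if len(batch) >= BATCH_SIZE:
            post_batch(batch)
            batch = []
if batch:
    post_batch(batch)

# one explicit commit at the end so everything becomes searchable
requests.post(SOLR_UPDATE_URL, params={"commit": "true"}, json=[]).raise_for_status()

Batching keeps the HTTP overhead down, and commitWithin lets Solr decide when to commit instead of committing on every request.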
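And if you would rather skip the JSONL step, Solr's XML update handler will take the <add><doc>...</doc></add> files your XSLT already produces as-is. Another untested sketch, assuming you have copied the files out of HDFS into a local directory (directory and core name are again made up):

from pathlib import Path
import requests  # third-party; pip install requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # made-up core name

for path in sorted(Path("/data/solr-xml").glob("*.xml")):  # made-up local directory
    # each file already contains <add><doc>...</doc></add>, so send it unchanged
    resp = requests.post(SOLR_UPDATE_URL,
                         data=path.read_bytes(),
                         headers={"Content-Type": "text/xml"})
    resp.raise_for_status()

# commit once at the end rather than per file
requests.post(SOLR_UPDATE_URL, data="<commit/>",
              headers={"Content-Type": "text/xml"}).raise_for_status()

Either way you keep a copy of the docs on a plain filesystem, so re-indexing is just a matter of re-running the loop.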
On December 7, 2017 10:14:37 AM EST, Rick Leir <rl...@leirtech.com> wrote:
>Matthew,
>Do you have some sort of script calling XSLT? Sorry, I do not know
>Scala and I did not have time to look into your Spark utils. The
>script, or the Scala code, could then shell out to curl, or if it is
>Python it could use the requests library to send a doc to Solr. Extra
>points for batching the documents.
>
>Erick,
>The last time I used the post tool, it was spinning up a JVM each time
>I called it (natch). Is there a simple way to launch it from a Java app
>server so you can call it repeatedly without the start-up overhead? It
>has been a few years, maybe I am wrong.
>Cheers -- Rick
>
>On December 6, 2017 5:36:51 PM EST, Erick Erickson
><erickerick...@gmail.com> wrote:
>>Perhaps the bin/post tool? See:
>>https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
>>
>>On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth <mgrot...@gmail.com> wrote:
>>> Hi All,
>>>
>>> Is there a DIH for HDFS? I see this old feature request [0] that
>>> never seems to have gone anywhere. Google searches and searches on
>>> this list don't get me too far.
>>>
>>> Essentially my workflow is that I have many thousands of XML
>>> documents stored in HDFS. I run an XSLT transformation in Spark [1].
>>> This transforms them to the expected Solr input of
>>> <add><doc><field ... /></doc></add>, which is then written back to
>>> HDFS. Now how do I get it back into Solr? I suppose I could move the
>>> data back to the local fs, but on the surface that feels like the
>>> wrong way.
>>>
>>> I don't need to store the documents in HDFS after the Spark
>>> transformation, so I wonder if I can write them using SolrJ.
>>> However, I am not really familiar with SolrJ. I am also running a
>>> single node. Most of the material I have read on spark-solr expects
>>> you to be running SolrCloud.
>>>
>>> Best,
>>> Matt
>>>
>>> [0] https://issues.apache.org/jira/browse/SOLR-2096
>>> [1] https://github.com/elsevierlabs-os/spark-xml-utils
>
>--
>Sorry for being brief. Alternate email is rickleir at yahoo dot com

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com