Matthew,

Oops, I should have mentioned re-indexing. With Solr, you want to be able to re-index quickly so that you can try out different analysis chains, and XSLT may not be fast enough for that if you have millions of docs. So I would be inclined to save the transformed docs to a normal filesystem, perhaps as JSONL, and then use the DIH, the post tool, or Python to post them to Solr.

Rick
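P.S. Here is a rough, untested sketch of the Python route. The core name, URL, file name, and batch size are all made up, so adjust them to your setup. It assumes the requests library is installed and that the JSONL file has one Solr document per line:

import json
import requests  # third-party; pip install requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # made-up core name
BATCH_SIZE = 1000  # docs per HTTP request

def post_batch(docs):
    # POST a JSON array of documents to Solr's update handler.
    resp = requests.post(SOLR_UPDATE_URL, json=docs,
                         params={"commitWithin": 10000})
    resp.raise_for_status()

batch = []
with open("docs.jsonl", encoding="utf-8") as fh:
    for line in fh:
        if line.strip():
            batch.append(json.loads(line))
        if len(batch) >= BATCH_SIZE:
            post_batch(batch)
            batch = []
if batch:
    post_batch(batch)

# one explicit commit at the end so everything becomes searchable
requests.post(SOLR_UPDATE_URL, params={"commit": "true"}, json=[]).raise_for_status()

Batching keeps the HTTP overhead down, and commitWithin lets Solr decide when to commit instead of committing on every request.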
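And if you would rather skip the JSONL step, Solr's XML update handler will take the <add><doc>...</doc></add> files your XSLT already produces as-is. Another untested sketch, assuming you have copied the files out of HDFS into a local directory (directory and core name are again made up):

from pathlib import Path
import requests  # third-party; pip install requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # made-up core name

for path in sorted(Path("/data/solr-xml").glob("*.xml")):  # made-up local directory
    # each file already contains <add><doc>...</doc></add>, so send it unchanged
    resp = requests.post(SOLR_UPDATE_URL,
                         data=path.read_bytes(),
                         headers={"Content-Type": "text/xml"})
    resp.raise_for_status()

# commit once at the end rather than per file
requests.post(SOLR_UPDATE_URL, data="<commit/>",
              headers={"Content-Type": "text/xml"}).raise_for_status()

Either way you keep a copy of the docs on a plain filesystem, so re-indexing is just a matter of re-running the loop.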
On December 7, 2017 10:14:37 AM EST, Rick Leir <rl...@leirtech.com> wrote:
>Matthew,
>Do you have some sort of script calling XSLT? Sorry, I do not know
>Scala and I did not have time to look into your Spark utils. The
>script, or the Scala code, could then shell out to curl, or if it is
>Python it could use the requests library to send a doc to Solr. Extra
>points for batching the documents.
>
>Erick,
>The last time I used the post tool, it was spinning up a JVM each time
>I called it (natch). Is there a simple way to launch it from a Java app
>server so you can call it repeatedly without the start-up overhead? It
>has been a few years, maybe I am wrong.
>Cheers -- Rick
>
>On December 6, 2017 5:36:51 PM EST, Erick Erickson
><erickerick...@gmail.com> wrote:
>>Perhaps the bin/post tool? See:
>>https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/
>>
>>On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth <mgrot...@gmail.com> wrote:
>>> Hi All,
>>>
>>> Is there a DIH for HDFS? I see this old feature request [0] that
>>> never seems to have gone anywhere. Google searches and searches on
>>> this list don't get me too far.
>>>
>>> Essentially my workflow is that I have many thousands of XML
>>> documents stored in HDFS. I run an XSLT transformation in Spark [1].
>>> This transforms them to the expected Solr input of
>>> <add><doc><field ... /></doc></add>, which is then written back to
>>> HDFS. Now how do I get it back into Solr? I suppose I could move the
>>> data back to the local fs, but on the surface that feels like the
>>> wrong way.
>>>
>>> I don't need to store the documents in HDFS after the Spark
>>> transformation, so I wonder if I can write them using SolrJ.
>>> However, I am not really familiar with SolrJ. I am also running a
>>> single node. Most of the material I have read on spark-solr expects
>>> you to be running SolrCloud.
>>>
>>> Best,
>>> Matt
>>>
>>> [0] https://issues.apache.org/jira/browse/SOLR-2096
>>> [1] https://github.com/elsevierlabs-os/spark-xml-utils
>
>--
>Sorry for being brief. Alternate email is rickleir at yahoo dot com

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com