Re: indexing XML stored on HDFS

2017-12-08 Thread Matthew Roth
Thanks Rick, While long-term storage of the documents in HDFS is not necessary, you do raise the point that easy access to these documents during the development phase will be useful. Cassandra, for spark-solr I am under the impression that I must be running SolrCloud. At this time I need some of the features

Re: indexing XML stored on HDFS

2017-12-08 Thread Cassandra Targett
Matthew, The hadoop-solr project you mention would give you the ability to index files in HDFS. It's a Job Jar, so you submit it to Hadoop with the params you need and it processes the files and sends them to Solr. It might not be the fastest thing in the world since it uses MapReduce, but we (I wo

Re: indexing XML stored on HDFS

2017-12-07 Thread Rick Leir
Matthew, Oops, I should have mentioned re-indexing. With Solr, you want to be able to re-index quickly so you can try out different analysis chains. XSLT may not be fast enough for this if you have millions of docs. So I would be inclined to save the docs to a normal filesystem, perhaps in JSONL
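
The idea of converting the XML once and keeping the docs as JSONL on a normal filesystem, so that each re-index skips the XSLT step, might be sketched as follows. This is only an illustration: the flat `<doc><id>…</id><title>…</title></doc>` layout and the field names are hypothetical, not taken from the thread.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator, TextIO


def xml_to_doc(xml_text: str) -> Dict[str, str]:
    """Flatten one XML record into a dict of field name -> text.

    Assumes a hypothetical flat layout: every child element of the
    root becomes one field, keyed by its tag name.
    """
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def write_jsonl(xml_records: Iterable[str], out: TextIO) -> int:
    """Write one JSON object per line; return the number of docs written."""
    count = 0
    for xml_text in xml_records:
        out.write(json.dumps(xml_to_doc(xml_text), ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(stream: TextIO) -> Iterator[Dict[str, str]]:
    """Yield docs back from a JSONL stream, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)
```

With the docs stored this way, trying out a different analysis chain only requires re-reading the JSONL file and re-sending the docs; the XML transformation is not repeated.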

Re: indexing XML stored on HDFS

2017-12-07 Thread Rick Leir
Matthew, Do you have some sort of script calling XSLT? Sorry, I do not know Scala and I did not have time to look into your spark utils. The script or Scala could then shell out to curl, or if it is Python it could use the requests library to send a doc to Solr. Extra points for batching the doc
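
A rough sketch of the batching approach described above, sending docs to Solr's JSON update handler in groups rather than one at a time. It uses the standard-library `urllib.request` in place of the requests library so that it has no third-party dependency. The base URL, the collection name `mycore`, and the batch size are assumptions for illustration only.

```python
import json
import urllib.request
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batched(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield successive lists of at most `size` docs."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_update_request(solr_url: str, collection: str, payload) -> urllib.request.Request:
    """Build a POST to <solr_url>/<collection>/update with a JSON body."""
    url = f"{solr_url.rstrip('/')}/{collection}/update"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_docs(docs: Iterable[Dict],
               solr_url: str = "http://localhost:8983/solr",  # assumed local Solr
               collection: str = "mycore",                    # hypothetical name
               batch_size: int = 500) -> int:
    """Send docs in batches, then commit once at the end."""
    sent = 0
    for batch in batched(docs, batch_size):
        # A JSON array of docs is treated as a set of adds by the update handler.
        with urllib.request.urlopen(build_update_request(solr_url, collection, batch)) as resp:
            resp.read()
        sent += len(batch)
    # A single explicit commit avoids committing after every batch.
    with urllib.request.urlopen(build_update_request(solr_url, collection, {"commit": {}})) as resp:
        resp.read()
    return sent
```

Committing once after the last batch, rather than on every request, is usually the cheaper choice when loading many docs; the right batch size depends on document size and would need to be tuned.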

Re: indexing XML stored on HDFS

2017-12-07 Thread Matthew Roth
Yes, the post tool would also be an acceptable option, and one I am familiar with. However, I am also not seeing exactly how I would query HDFS. The hadoop-solr [0] tool by Lucidworks looks the most promising. I have a meeting to attend t

Re: indexing XML stored on HDFS

2017-12-06 Thread Erick Erickson
Perhaps the bin/post tool? See: https://lucidworks.com/2015/08/04/solr-5-new-binpost-utility/

On Wed, Dec 6, 2017 at 2:05 PM, Matthew Roth wrote:
> Hi All,
>
> Is there a DIH for HDFS? I see this old feature request [0] that never seems to have