You'd probably get much more informed responses from the Cloudera folks, especially about Hue.
Best,
Erick

On Wed, Oct 11, 2017 at 6:05 AM, István <lecc...@gmail.com> wrote:
> Hi,
>
> I have Solr 4.10.3 as part of a CDH5 installation and I would like to
> index a huge number of CSV files on HDFS. I was wondering what the best
> way of doing that is.
>
> Here is the current approach:
>
> data.csv:
>
> id, fruit
> 10, apple
> 20, orange
>
> Indexing with the following command, using search-mr-1.0.0-cdh5.11.1-job.jar:
>
> hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
>   /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.11.1-job.jar \
>   org.apache.solr.hadoop.MapReduceIndexerTool \
>   -D 'mapred.child.java.opts=-Xmx500m' \
>   --log4j /opt/cloudera/parcels/CDH/share/doc/search/examples/solr-nrt/log4j.properties \
>   --morphline-file /home/user/readCSV.conf \
>   --output-dir hdfs://name-node.server.com:8020/user/solr/output \
>   --verbose --go-live \
>   --zk-host name-node.server.com:2181/solr --collection collection0 \
>   hdfs://name-node.server.com:8020/user/solr/input
>
> This leads to the following exception:
>
> 2219 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 1
> files using 1 real mappers into 1 reducers
> Error: java.io.IOException: Batch Write Failure
>     at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
>     ..
> Caused by: org.apache.solr.common.SolrException: ERROR: [doc=100] unknown
> field 'file_path'
>     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
>     at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
>
> It appears to me that the schema does not have file_path. The collection
> is created through Hue, and it properly identifies the two fields, id and
> fruit. I found that the search-mr tool has the following code referencing
> file_path:
>
> https://github.com/cloudera/search/blob/cdh5-1.0.0_5.2.0/search-mr/src/main/java/org/apache/solr/hadoop/HdfsFileFieldNames.java#L30
>
> I am not sure what to do in order to be able to index files on HDFS. I
> have two guesses:
>
> - add the fields defined in the search tool to the schema when I create
>   it (not sure how that works through Hue)
> - disable the HDFS metadata insertion when inserting data
>
> Has anybody seen this before?
>
> Thanks,
> Istvan
>
> --
> the sun shines for all
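
For what it's worth, the second guess can be implemented inside the morphline itself: Kite's sanitizeUnknownSolrFields command drops any record field (such as file_path) that is not defined in the collection's schema before loadSolr runs. A minimal readCSV.conf sketch along those lines, with the collection name and zkHost taken from the command above (the CSV column names are assumed from data.csv, everything else is illustrative):

  SOLR_LOCATOR : {
    collection : collection0
    zkHost : "name-node.server.com:2181/solr"
  }

  morphlines : [
    {
      id : morphline1
      importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
      commands : [
        {
          # parse each line into the two fields, skipping the CSV header
          readCSV {
            separator : ","
            columns : [id, fruit]
            ignoreFirstLine : true
            trim : true
            charset : UTF-8
          }
        }
        {
          # drop any field not present in the schema, e.g. the HDFS
          # metadata fields (file_path, ...) added by the indexer tool
          sanitizeUnknownSolrFields {
            solrLocator : ${SOLR_LOCATOR}
          }
        }
        {
          # send the sanitized records to Solr
          loadSolr {
            solrLocator : ${SOLR_LOCATOR}
          }
        }
      ]
    }
  ]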
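
Alternatively, for the first guess, the HDFS metadata fields can be declared in schema.xml before the collection is created. The authoritative list of names is in the HdfsFileFieldNames.java linked above; a sketch covering a few of them (the field types here are assumptions, adjust to the types available in your schema):

  <!-- HDFS file metadata fields populated by MapReduceIndexerTool -->
  <field name="file_path"          type="string" indexed="true" stored="true"/>
  <field name="file_name"          type="string" indexed="true" stored="true"/>
  <field name="file_length"        type="long"   indexed="true" stored="true"/>
  <field name="file_last_modified" type="long"   indexed="true" stored="true"/>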