Hi, I have Solr 4.10.3 as part of a CDH5 installation and I would like to index a huge number of CSV files stored on HDFS. I was wondering what the best way of doing that is.
Here is the current approach.

data.csv:

    id, fruit
    10, apple
    20, orange

I index it with the following command, using search-mr-1.0.0-cdh5.11.1-job.jar:

    hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
      /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.11.1-job.jar \
      org.apache.solr.hadoop.MapReduceIndexerTool \
      -D 'mapred.child.java.opts=-Xmx500m' \
      --log4j /opt/cloudera/parcels/CDH/share/doc/search/examples/solr-nrt/log4j.properties \
      --morphline-file /home/user/readCSV.conf \
      --output-dir hdfs://name-node.server.com:8020/user/solr/output \
      --verbose --go-live \
      --zk-host name-node.server.com:2181/solr \
      --collection collection0 \
      hdfs://name-node.server.com:8020/user/solr/input

This leads to the following exception:

    2219 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 1 files using 1 real mappers into 1 reducers
    Error: java.io.IOException: Batch Write Failure
        at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
    ..
    Caused by: org.apache.solr.common.SolrException: ERROR: [doc=100] unknown field 'file_path'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)

It appears that the schema does not define a file_path field. The collection was created through Hue, which correctly identified the two fields id and fruit.

I found that the search-mr tool has the following code that references file_path:

https://github.com/cloudera/search/blob/cdh5-1.0.0_5.2.0/search-mr/src/main/java/org/apache/solr/hadoop/HdfsFileFieldNames.java#L30

I am not sure what to do in order to be able to index files on HDFS. I have two guesses:

- add the fields defined in the search tool to the schema when I create the collection (I am not sure how that works through Hue)
- disable the insertion of HDFS metadata when indexing the data

Has anybody seen this before?

Thanks,
Istvan

--
the sun shines for all