Hi,

I have Solr 4.10.3 as part of a CDH5 installation and I would like to index
a large number of CSV files stored on HDFS. I was wondering what the best way
of doing that is.

Here is the current approach:

data.csv:

id, fruit
10, apple
20, orange

I am indexing with the following command, using
search-mr-1.0.0-cdh5.11.1-job.jar:

hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
  /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.11.1-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /opt/cloudera/parcels/CDH/share/doc/search/examples/solr-nrt/log4j.properties \
  --morphline-file /home/user/readCSV.conf \
  --output-dir hdfs://name-node.server.com:8020/user/solr/output \
  --verbose \
  --go-live \
  --zk-host name-node.server.com:2181/solr \
  --collection collection0 \
  hdfs://name-node.server.com:8020/user/solr/input

This leads to the following exception:

2219 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 1 files using 1 real mappers into 1 reducers
Error: java.io.IOException: Batch Write Failure
        at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
..
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=100] unknown field 'file_path'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)

It appears that the schema does not have a file_path field. The collection
was created through Hue, which properly identified the two fields id and
fruit. I found that the search-mr tool has the following code that
references file_path:

https://github.com/cloudera/search/blob/cdh5-1.0.0_5.2.0/search-mr/src/main/java/org/apache/solr/hadoop/HdfsFileFieldNames.java#L30

I am not sure what I need to do to be able to index files on HDFS. I have
two guesses:

- add the fields defined in the search-mr tool to the schema when I create
the collection (not sure how that works through Hue)
- disable the insertion of HDFS metadata fields when indexing the data
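
For the second guess, this is a sketch of what I imagine the morphline
could look like. It is untested. My understanding is that Kite Morphlines
provides a sanitizeUnknownSolrFields command that drops record fields the
Solr schema does not define, which should remove file_path and the other
HDFS metadata fields before the document is loaded. The overall layout,
the morphline id, and the readCSV settings are assumptions made for
illustration; only the collection name, the ZooKeeper address, and the two
column names come from the setup described above.

```
# Sketch of /home/user/readCSV.conf (assumed contents, not verified on a cluster)
SOLR_LOCATOR : {
  # Collection and ZooKeeper ensemble taken from the hadoop command above
  collection : collection0
  zkHost : "name-node.server.com:2181/solr"
}

morphlines : [
  {
    # Hypothetical morphline id
    id : readCsvMorphline
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        # Parse each CSV line into a record with the fields id and fruit
        readCSV {
          separator : ","
          columns : [id, fruit]
          ignoreFirstLine : true   # skip the "id, fruit" header row
          trim : true              # strip the space after each comma
          charset : UTF-8
        }
      }
      {
        # Drop any record field that the Solr schema does not know about,
        # such as file_path
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      {
        # Hand the remaining fields to Solr
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
```

If that is right, the hadoop command itself would stay unchanged, since it
already points --morphline-file at this file.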

Has anybody seen this before?

Thanks,
Istvan




-- 
the sun shines for all
