Hi Erick,

My question is not about Hue. It is about why search-mr adds a file_path field to documents read from HDFS, so that the schema has to contain it. What is the standard way of indexing files on HDFS?
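To make the second guess from my original mail (quoted below) concrete, this is roughly the morphline I would try. It is an untested sketch. It assumes that the Kite morphline command sanitizeUnknownSolrFields drops record fields that are missing from schema.xml, which would include file_path. The column names, collection name and ZooKeeper address come from my example; the morphline id and all other settings are assumptions.

```
# Sketch of readCSV.conf (untested; values other than the columns,
# collection and zkHost are assumptions)
SOLR_LOCATOR : {
  collection : collection0
  zkHost : "name-node.server.com:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        # Parse each CSV line into a record with the fields id and fruit
        readCSV {
          separator : ","
          columns : [id, fruit]
          ignoreFirstLine : true   # skip the "id, fruit" header row
          trim : true              # strip the space after each comma
          charset : UTF-8
        }
      }
      {
        # Remove fields that the Solr schema does not define,
        # e.g. the HDFS metadata fields such as file_path
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
      {
        # Hand the cleaned record to Solr
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
```

If this is not the recommended approach, the alternative would be to add the file_* fields to the schema, but I do not know whether that is the intended setup.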

Thanks,
Istvan

On Wed, Oct 11, 2017 at 4:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> You'll probably get much more informed responses from
> the Cloudera folks, especially about Hue.
>
> Best,
> Erick
>
> On Wed, Oct 11, 2017 at 6:05 AM, István <lecc...@gmail.com> wrote:
> > Hi,
> >
> > I have Solr 4.10.3 as part of a CDH5 installation and I would like to
> > index a huge number of CSV files on HDFS. I was wondering what the best
> > way of doing that is.
> >
> > Here is the current approach:
> >
> > data.csv:
> >
> > id, fruit
> > 10, apple
> > 20, orange
> >
> > I index with the following command, using
> > search-mr-1.0.0-cdh5.11.1-job.jar:
> >
> > hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
> >   /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.11.1-job.jar \
> >   org.apache.solr.hadoop.MapReduceIndexerTool \
> >   -D 'mapred.child.java.opts=-Xmx500m' --log4j \
> >   /opt/cloudera/parcels/CDH/share/doc/search/examples/solr-nrt/log4j.properties \
> >   --morphline-file \
> >   /home/user/readCSV.conf \
> >   --output-dir hdfs://name-node.server.com:8020/user/solr/output --verbose \
> >   --go-live \
> >   --zk-host name-node.server.com:2181/solr --collection collection0 \
> >   hdfs://name-node.server.com:8020/user/solr/input
> >
> > This leads to the following exception:
> >
> > 2219 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 1 files using 1 real mappers into 1 reducers
> > Error: java.io.IOException: Batch Write Failure
> >     at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
> > ..
> > Caused by: org.apache.solr.common.SolrException: ERROR: [doc=100] unknown field 'file_path'
> >     at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
> >     at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
> >
> > It appears to me that the schema does not have file_path. The collection
> > was created through Hue, which properly identifies the two fields id and
> > fruit. I found out that the search-mr tool has the following code that
> > references file_path:
> >
> > https://github.com/cloudera/search/blob/cdh5-1.0.0_5.2.0/search-mr/src/main/java/org/apache/solr/hadoop/HdfsFileFieldNames.java#L30
> >
> > I am not sure what to do in order to be able to index files on HDFS. I
> > have two guesses:
> >
> > - add the fields defined in the search tool to the schema when I create
> >   it (not sure how that works through Hue)
> > - disable the HDFS metadata insertion when inserting data
> >
> > Has anybody seen this before?
> >
> > Thanks,
> > Istvan
> >
> > --
> > the sun shines for all

--
the sun shines for all