Re: Indexing files from HDFS

István Thu, 12 Oct 2017 01:05:03 -0700

Hi Erick,

The question is not about Hue but about why search-mr expects file_path in
the schema for HDFS files. I am wondering what the standard way of indexing
files on HDFS is.
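
For reference, the second guess in the quoted message below (keeping the
HDFS metadata out of the Solr documents) might be approached in the
morphline itself rather than in the schema. The fragment below is only a
sketch. It assumes that the file_* fields are attached to each record before
the morphline runs, that the Kite Morphlines build shipped with this CDH
release includes the sanitizeUnknownSolrFields command, and that
readCSV.conf ends with a loadSolr step. The contents of readCSV.conf are not
shown in this thread, so the SOLR_LOCATOR block, the morphline id, and the
readCSV placeholder are assumptions.

```
# Assumed locator; collection and zkHost are taken from the command line
# quoted below.
SOLR_LOCATOR : {
  collection : collection0
  zkHost : "name-node.server.com:2181/solr"
}

morphlines : [
  {
    # Hypothetical id; use whatever readCSV.conf already declares.
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      # (the existing readCSV command from readCSV.conf goes here;
      #  it is not shown in the thread)

      # Drop record fields that the collection's schema.xml does not
      # define, which would include file_path.
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }

      # Send the remaining fields (id, fruit) to Solr.
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

The first guess would instead mean declaring file_path, and any other
metadata fields the tool adds, in schema.xml before creating the collection.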


Thanks,
Istvan

On Wed, Oct 11, 2017 at 4:53 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> You'll probably get much more informed responses from
> the Cloudera folks, especially about Hue.
>
> Best,
> Erick
>
> On Wed, Oct 11, 2017 at 6:05 AM, István <lecc...@gmail.com> wrote:
> > Hi,
> >
> > I have Solr 4.10.3 as part of a CDH5 installation and I would like to index
> > a huge number of CSV files on HDFS. I was wondering what the best way of
> > doing that is.
> >
> > Here is the current approach:
> >
> > data.csv:
> >
> > id, fruit
> > 10, apple
> > 20, orange
> >
> > Indexing with the following command using search-mr-1.0.0-cdh5.11.1-job.jar:
> >
> > hadoop --config /etc/hadoop/conf.cloudera.yarn jar \
> > /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.11.1-job.jar \
> > org.apache.solr.hadoop.MapReduceIndexerTool \
> > -D 'mapred.child.java.opts=-Xmx500m' \
> > --log4j /opt/cloudera/parcels/CDH/share/doc/search/examples/solr-nrt/log4j.properties \
> > --morphline-file /home/user/readCSV.conf \
> > --output-dir hdfs://name-node.server.com:8020/user/solr/output --verbose \
> > --go-live \
> > --zk-host name-node.server.com:2181/solr --collection collection0 \
> > hdfs://name-node.server.com:8020/user/solr/input
> >
> > This leads to the following exception:
> >
> > 2219 [main] INFO  org.apache.solr.hadoop.MapReduceIndexerTool  - Indexing 1 files using 1 real mappers into 1 reducers
> > Error: java.io.IOException: Batch Write Failure
> >         at org.apache.solr.hadoop.BatchWriter.throwIf(BatchWriter.java:239)
> > ..
> > Caused by: org.apache.solr.common.SolrException: ERROR: [doc=100] unknown field 'file_path'
> >         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
> >         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
> >
> > It appears to me that the schema does not have file_path. The collection is
> > created through Hue and it properly identifies the two fields id and fruit.
> > I found out that the search-mr tool has the following code that references
> > the file_path:
> >
> > https://github.com/cloudera/search/blob/cdh5-1.0.0_5.2.0/search-mr/src/main/java/org/apache/solr/hadoop/HdfsFileFieldNames.java#L30
> >
> > I am not sure what to do in order to be able to index files on HDFS. I have
> > two guesses:
> >
> > - add the fields defined in the search tool to the schema when I create it
> > (not sure how that works through Hue)
> > - disable the HDFS metadata insertion when inserting data
> >
> > Has anybody seen this before?
> >
> > Thanks,
> > Istvan
> >
> >
> >
> >
> > --
> > the sun shines for all
>



-- 
the sun shines for all
