Hello,

While this may be a question for Cloudera, I wanted to tap the brains of this very active community as well.
I am trying to use the MapReduceIndexerTool to index data from a Hive table into SolrCloud / Cloudera Search. The tool fails the job with the following error:

1799 [main] INFO org.apache.solr.hadoop.MapReduceIndexerTool - Indexing 1 files using 1 real mappers into 10 reducers
Error: MAX_ARRAY_LENGTH
Error: MAX_ARRAY_LENGTH
Error: MAX_ARRAY_LENGTH
36962 [main] ERROR org.apache.solr.hadoop.MapReduceIndexerTool - Job failed! jobName: org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper, jobId: job_1473161870114_0339

The error stack trace is:

2016-09-08 10:39:20,128 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.NoSuchFieldError: MAX_ARRAY_LENGTH
    at org.apache.lucene.codecs.memory.DirectDocValuesFormat.<clinit>(DirectDocValuesFormat.java:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at java.lang.Class.newInstance(Class.java:374)
    at org.apache.lucene.util.NamedSPILoader.reload(NamedSPILoader.java:67)
    at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:47)
    at org.apache.lucene.util.NamedSPILoader.<init>(NamedSPILoader.java:37)
    at org.apache.lucene.codecs.DocValuesFormat.<clinit>(DocValuesFormat.java:43)
    at org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:205)

My schema.xml looks like this:

<fields>
  <field name="dataset_id" type="string" indexed="true" stored="true" required="true" multiValued="false" docValues="true" />
  <field name="search_string" type="string" indexed="true" stored="true" docValues="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
</fields>

<!-- Field to use to determine and enforce document uniqueness.
     Unless this field is marked with required="false", it will be a required field -->
<uniqueKey>dataset_id</uniqueKey>

I am otherwise able to post documents using Solr APIs / upload methods; only the MapReduceIndexerTool is failing. The command I am using is:

hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/share/doc/search-1.0.0+cdh5.7.0+0/examples/solr-nrt/log4j.properties \
  --morphline-file /home/$USER/morphline2.conf \
  --output-dir hdfs://NNHOST:8020/user/$USER/outdir \
  --verbose \
  --zk-host ZKHOST:2181/solr1 \
  --collection dataCatalog_search_index \
  hdfs://NNHOST:8020/user/hive/warehouse/name.db/concatenated_index4/
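For context, the direct indexing path that does work for me looks roughly like the sketch below. This is not my exact code; it assumes a SolrJ 5.x-style CloudSolrClient (older CDH parcels ship the equivalent CloudSolrServer), and the sample field values are made up, but the ZooKeeper chroot and collection match the --zk-host / --collection flags above.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DirectAddSketch {
    public static void main(String[] args) throws Exception {
        // Same ZooKeeper chroot and collection as the MapReduceIndexerTool command above
        CloudSolrClient client = new CloudSolrClient("ZKHOST:2181/solr1");
        client.setDefaultCollection("dataCatalog_search_index");

        // Build a document using the two fields defined in schema.xml (sample values)
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("dataset_id", "ds-0001");           // uniqueKey field
        doc.addField("search_string", "some sample text");

        client.add(doc);
        client.commit();
        client.close();
    }
}

Adds like this succeed against the same collection, which is why I suspect the problem is specific to the MapReduce job rather than the collection itself.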
My morphline config looks like this:

SOLR_LOCATOR : {
  # Name of solr collection
  collection : search_index

  # ZooKeeper ensemble
  zkHost : "ZKHOST:2181/solr1"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more (potentially
# nested) commands. A morphline is a way to consume records (e.g. Flume events,
# HDFS files or blocks), turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : search_index
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        readCSV {
          separator : ","
          columns : [dataset_id, search_string]
          ignoreFirstLine : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # Command that deletes record fields that are unknown to Solr schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that isn't specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]

Please let me know if I am doing anything wrong.

--
Sincerely,
Darshan