Managed schema used with Cloudera MapreduceIndexerTool and morphlines?

Jay Hill Fri, 17 Mar 2017 21:02:21 -0700

I've got a very difficult project to tackle. I've been tasked with using
schemaless mode to index JSON files that we receive. The structure of the
JSON files varies widely, because we receive files from different customers
who are totally unrelated to one another. We are attempting to build a
"one size fits all" approach: receive documents from a wide variety of
sources and then index them into Solr.


We're running Solr 5.3. The schemaless approach works well enough -
until it doesn't. It seems to fail on type guessing and also gets confused
when indexing to different shards. If it were reliable it would be the
perfect solution for our task. But the larger the JSON file, the more likely
it is to fail. At a certain size it just doesn't work.
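
To make the type-guessing problem concrete, here is a toy Python sketch of
what I believe is going on. It is not Solr code: guess_type and index_docs
are hypothetical names, and the rules are deliberately simplified. The
assumption is that a field's type is fixed by the first value seen, so a
later value of a different type no longer fits.

```python
def guess_type(value):
    """Return a toy field type name for a JSON value (simplified, hypothetical rules)."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def index_docs(docs):
    """Fix each field's type on first sight; collect later values that do not fit."""
    schema = {}
    rejected = []
    for n, doc in enumerate(docs):
        for field, value in doc.items():
            guessed = guess_type(value)
            if field not in schema:
                # First time this field appears: its guessed type becomes the schema type.
                schema[field] = guessed
            elif schema[field] != guessed:
                # A later document disagrees with the type guessed earlier.
                rejected.append((n, field, schema[field], guessed))
    return schema, rejected


# Two customers send the same field name with different value types.
docs = [
    {"id": 1, "zip": 94107},
    {"id": 2, "zip": "94107-1234"},
]
schema, rejected = index_docs(docs)
print(schema)
print(rejected)
```

The more documents and customers in a file, the more chances there are for
this kind of mismatch, which would fit with larger files failing more often.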

Some experts and committers have advised me that schemaless is a good tool
for prototyping but risky to run in production. Still, we thought we would
try it for offline indexing, using the Cloudera MapReduceIndexerTool to
build the indexes offline while still using managed schemas. This MapReduce
tool uses morphlines, a nifty ETL tool that pipes together a series of
commands to transform data. As a simple example, a JSON or CSV file can be
processed and loaded into a Solr index with a "readJson" command piped to a
"loadSolr" command.
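
For anyone who hasn't seen morphlines, the "commands piped together" idea can
be sketched in a few lines of Python. This only illustrates the chaining
concept and is not the Kite morphlines API: read_json, make_loader and
run_pipeline are hypothetical stand-ins, and the "index" here is just an
in-memory list.

```python
import json


def read_json(record):
    """Stand-in for a JSON-reading command: parse raw text into a dict."""
    return json.loads(record)


def make_loader(index):
    """Stand-in for a load command: append the document to an in-memory list."""
    def load(doc):
        index.append(doc)
        return doc
    return load


def run_pipeline(commands, record):
    """Feed the record through each command in order, each output feeding the next."""
    for command in commands:
        record = command(record)
    return record


index = []
pipeline = [read_json, make_loader(index)]
run_pipeline(pipeline, '{"id": "doc1", "customer": "acme"}')
print(index)
```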

But the kite-sdk that manages the morphlines only seems to offer, as their
latest version, solr *4.10.3*-cdh5.10.0 (their customized version of
4.10.3).

So I can't see any way to integrate schemaless (which has dependencies
after 4.10.3) with the morphlines.

But I thought I would ask here: Has anybody had ANY experience using
morphlines to index into Solr? Any info would help me make sense of this.

Cheers to all!
