I've got a very difficult project to tackle. I've been tasked with using schemaless mode to index the JSON files that we receive. The structure of these files varies widely, because they come from different customers who are totally unrelated to one another. We are attempting to build a "one size fits all" approach: receive documents from a wide variety of sources and index them into Solr.
We're running Solr 5.3. The schemaless approach works well enough, until it doesn't. It seems to fail on type guessing, and it also gets confused when indexing to different shards. If it were reliable it would be the perfect solution for our task. But the larger the JSON file, the more likely it is to fail, and at a certain size it just doesn't work.
Some experts and committers have advised me that schemaless is a good tool for prototyping but risky to run in production. We thought we would try it anyway by indexing offline, using the Cloudera MapReduceIndexerTool to build the indexes while still using managed schemas. This MapReduce tool uses morphlines, a nifty ETL tool that pipes together a series of commands to transform data. As a simple example, a JSON or CSV file can be processed and loaded into a Solr index with a "readJson" command piped to a "loadSolr" command.
But the latest version that the kite-sdk (which manages the morphlines) seems to offer is Solr *4.10.3*-cdh5.10.0, their customized version of 4.10.3. So I can't see any way to integrate schemaless, which depends on Solr versions later than 4.10.3, with the morphlines.
I thought I would ask here: has anybody had ANY experience using morphlines to index to Solr? Any info would help me make sense of this. Cheers to all!