The OP asked about MapReduceIndexerTool. My understanding is that it is actually somewhat slower than the standard indexing path and is recommended only if the site is already invested in the Hadoop infrastructure, e.g., the input files are already distributed across the Hadoop/Search cluster via HDFS.

See also: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

Note that there is no coordination between HDFS replication and Solr replication. Thus, if you configure Solr replication N > 1 for each shard, and the HDFS replication factor is M > 1, then you get N * M copies of all your index data. For example, 2 Solr replicas per shard on top of 3-way HDFS block replication means 6 copies of every index file. That can add up fast ...

There is work underway to harmonize/mitigate Solr and HDFS replication:

Ability to set the replication factor for index files created by HDFSDirectoryFactory
https://issues.apache.org/jira/browse/SOLR-6305
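For reference, the setup described in the "Running Solr on HDFS" page above boils down to swapping in the HDFS directory factory in solrconfig.xml, roughly like this (the name node host and path are illustrative):

    <!-- illustrative host/path; see the cwiki page for the full option list -->
    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    </directoryFactory>

plus <lockType>hdfs</lockType> in the indexConfig section.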
To get a feel for the overall condition of the MR/Solr integration, I looked at the JIRA issues related to HDFS and Hadoop. It appears to be an area with a steady stream of solid bug fixes. There are some larger feature issues as well, but it isn't clear how much momentum those have. Can anyone (developers, current users) comment on the state of the Hadoop integration?

---------

Currently open JIRA issues for Solr containing "HDFS" or "Hadoop":
https://issues.apache.org/jira/browse/SOLR-5069?jql=project%20%3D%20SOLR%20AND%20status%20%3D%20OPEN%20AND%20%28text%20~%20%22HDFS%22%20OR%20text%20~%20%22Hadoop%22%29%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC%2C%20created%20ASC

Recently closed issues containing "HDFS" or "Hadoop":
https://issues.apache.org/jira/browse/SOLR-7458?jql=project%20%3D%20SOLR%20AND%20status%20!%3D%20OPEN%20AND%20%28text%20~%20%22HDFS%22%20OR%20text%20~%20%22Hadoop%22%29%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC

-----Original Message-----
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
Sent: Wednesday, July 15, 2015 11:24 AM
To: solr-user@lucene.apache.org
Subject: RE: copying data from one collection to another collection (solr cloud 521)

Since they explicitly want to search within a given "version" of the data, this seems like a textbook application for collection aliases.

You could have N public collection names: current_stuff, previous_stuff_1, previous_stuff_2, ...

At any given time, these are aliased to the "actual" collection names:

current_stuff -> stuff_20150712
previous_stuff_1 -> stuff_20150705
previous_stuff_2 -> stuff_20150628
...

Every weekend, you create a new collection and index everything current into it. Once done, reset all the aliases to point to the newest N collections, dropping the oldest:

current_stuff -> stuff_20150719
previous_stuff_1 -> stuff_20150712
previous_stuff_2 -> stuff_20150705
...

Collections API: Create or modify an Alias for a Collection
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4

Thus, you can keep the IDs the same and use them to compare against previous versions of any given document. Useful, if only for debugging purposes.

Curious if there are opportunities for optimization here. For example, would it be faster to make a file-system copy of the most recent collection and load only the changed documents (assuming the delta is available from the source system)?
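A minimal sketch of the weekly flip against the Collections API, in Python (the localhost URL and the collection names are illustrative):

    # Re-point the public aliases after the new weekly collection is built.
    from urllib.request import urlopen
    from urllib.parse import urlencode

    SOLR = "http://localhost:8983/solr/admin/collections"

    def create_alias(alias, collection):
        # CREATEALIAS also re-points an alias that already exists, so the
        # same call serves both initial setup and the weekly flip.
        qs = urlencode({"action": "CREATEALIAS", "name": alias,
                        "collections": collection, "wt": "json"})
        return urlopen(SOLR + "?" + qs).read()

    # Once stuff_20150719 is fully indexed:
    create_alias("current_stuff", "stuff_20150719")
    create_alias("previous_stuff_1", "stuff_20150712")
    create_alias("previous_stuff_2", "stuff_20150705")
    # ... then issue action=DELETE for the oldest collection (stuff_20150628).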
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, July 13, 2015 11:55 PM
To: solr-user@lucene.apache.org
Subject: Re: copying data from one collection to another collection (solr cloud 521)

bq: does offline....

No. I'm talking about "collection aliasing". You can create an entirely new collection, index to it however you want, then switch to using that new collection.

bq: Any updates to EXISTING document in the LIVE collection should NOT be replicated to the previous week(s) snapshot(s)

Then give it a new ID, maybe?

Best,
Erick

On Mon, Jul 13, 2015 at 3:21 PM, Raja Pothuganti
<rpothuga...@competitrack.com> wrote:
> Thank you Erick
>
>>Actually, my question is why do it this way at all? Why not index
>>directly to your "live" nodes? This is what SolrCloud is built for.
>>You can use "implicit" routing to create shards, say, for each week and
>>age out the ones that are "too old" as well.
>
> Any updates to an EXISTING document in the LIVE collection should NOT be
> replicated to the previous week(s) snapshot(s). Think of the
> snapshot(s) as an archive of sorts, searchable independently of LIVE.
> We're aiming to support at most 2 archives of data in the past.
>
>>Another option would be to use "collection aliasing" to keep an
>>offline index up to date then switch over when necessary.
>
> Does offline indexing refer to this link?
> https://github.com/cloudera/search/tree/0d47ff79d6ccc0129ffadcb50f9fe0b271f102aa/search-mr
>
> Thanks
> Raja
>
> On 7/13/15, 3:14 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>>Actually, my question is why do it this way at all? Why not index
>>directly to your "live" nodes? This is what SolrCloud is built for.
>>
>>There's the new backup/restore functionality that's still a work in
>>progress; see: https://issues.apache.org/jira/browse/SOLR-5750
>>
>>You can use "implicit" routing to create shards, say, for each week and
>>age out the ones that are "too old" as well.
>>
>>Another option would be to use "collection aliasing" to keep an
>>offline index up to date then switch over when necessary.
>>
>>I'd really like to know this isn't an XY problem, though: what's the
>>high-level problem you're trying to solve?
>>
>>Best,
>>Erick
>>
>>On Mon, Jul 13, 2015 at 12:49 PM, Raja Pothuganti
>><rpothuga...@competitrack.com> wrote:
>>>
>>> Hi,
>>> We are setting up a new SolrCloud environment with 5.2.1 on Ubuntu
>>> boxes. We currently ingest data into a large collection, call it LIVE.
>>> After the full ingest is done, we then trigger a delta ingestion
>>> every 15 minutes to get the documents & data that have changed into
>>> this LIVE instance.
>>>
>>> In Solr 4.x, using a Master/Slave setup, we had slaves that would
>>> periodically (weekly or monthly) refresh their data from the Master
>>> rather than every 15 minutes. We're now trying to figure out how to
>>> get this same type of setup using SolrCloud.
>>>
>>> Question(s):
>>> - Is there a way to copy data from one SolrCloud collection into
>>> another quickly and easily?
>>> - Is there a way to programmatically control when a replica receives
>>> its data, or possibly to move it to another collection (without losing
>>> data) that updates on a different interval? Ideally it would be
>>> another collection name, call it Week1 ... Week52 ..., to avoid a
>>> replica in the same collection serving old data.
>>>
>>> One option we thought of was to create a backup and then restore
>>> that into a new, clean cloud. This has a lot of moving parts and isn't
>>> nearly as neat as the Master/Slave controlled replication setup. It
>>> also has the side effect of potentially taking a very long time to
>>> back up and restore instead of just copying the indexes like the old
>>> M/S setup.
>>>
>>> Any ideas or thoughts? Thanks in advance for your help.
>>> Raja
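For completeness, a rough sketch of the "implicit" routing option Erick describes above, again via the Collections API (the URL, collection/shard names, and routing field are illustrative; CREATESHARD/DELETESHARD only apply to implicitly routed collections):

    # One shard per week with the implicit router; shards are added and
    # aged out via the Collections API.
    from urllib.request import urlopen
    from urllib.parse import urlencode

    SOLR = "http://localhost:8983/solr/admin/collections"

    def collections_api(params):
        qs = urlencode(dict(params, wt="json"))
        return urlopen(SOLR + "?" + qs).read()

    # Create the collection with named shards; documents are routed to a
    # shard by the value of their "week" field.
    collections_api({"action": "CREATE", "name": "stuff",
                     "router.name": "implicit", "router.field": "week",
                     "shards": "week28,week29", "maxShardsPerNode": 4})

    # Each week: add a shard for the new week, drop the oldest one.
    collections_api({"action": "CREATESHARD", "collection": "stuff",
                     "shard": "week30"})
    collections_api({"action": "DELETESHARD", "collection": "stuff",
                     "shard": "week28"})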