Hi Charles,
Thank you for the response. We will be using aliasing. We are looking into
ways to avoid ingesting into each of the collections, as you mentioned: "For
example, would it be faster to make a file system copy of the most recent
collection ..."

MapReduceIndexerTool is not an option at this point.


One option is to back up each shard of the current_stuff collection at the
end of the week to a particular location (say, directory /opt/data/) and then
1) empty/delete the existing documents in the previous_stuff_1 collection
2) restore each corresponding shard from /opt/data/ into the previous_stuff_1
collection, using backup & restore as suggested in
https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups+of+SolrCores
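
The backup and restore calls in the option above could be sketched roughly as follows. This only builds the replication handler URLs (command=backup and command=restore with location and name parameters); it does not send them. The base URL, the core names and the backup name are assumptions for illustration, not taken from our setup.

```python
from urllib.parse import urlencode

# Assumed base URL of a Solr node; adjust for the real cluster.
SOLR = "http://localhost:8983/solr"


def replication_url(core, command, **params):
    """Build a replication handler URL for one core."""
    query = urlencode({"command": command, **params})
    return f"{SOLR}/{core}/replication?{query}"


def weekly_copy_plan(core_pairs, location, name):
    """Return backup URLs for every source core, then restore URLs for
    every target core. core_pairs is a list of
    (shard, source_core, target_core) tuples (hypothetical names)."""
    backups, restores = [], []
    for shard, source_core, target_core in core_pairs:
        # One backup name per shard so snapshots in the shared
        # location do not collide.
        shard_name = f"{name}_{shard}"
        backups.append(
            replication_url(source_core, "backup",
                            location=location, name=shard_name))
        restores.append(
            replication_url(target_core, "restore",
                            location=location, name=shard_name))
    return backups + restores


if __name__ == "__main__":
    pairs = [
        ("shard1", "current_stuff_shard1_replica1",
         "previous_stuff_1_shard1_replica1"),
        ("shard2", "current_stuff_shard2_replica1",
         "previous_stuff_1_shard2_replica1"),
    ]
    for url in weekly_copy_plan(pairs, "/opt/data/", "week"):
        print(url)
```

A real script would also have to wait for each backup to finish before issuing the restore, since the backup command returns before the snapshot is complete.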


We are trying to find out whether there are better ways than the option above.

Thanks
Raja




On 7/15/15, 10:23 AM, "Reitzel, Charles" <charles.reit...@tiaa-cref.org>
wrote:

>Since they explicitly want to search within a given "version" of the data,
>this seems like a textbook application for collection aliases.
>
>You could have N public collection names: current_stuff,
>previous_stuff_1, previous_stuff_2, ...   At any given time, these will
>be aliased to reference the "actual" collection names:
>       current_stuff -> stuff_20150712,
>       previous_stuff_1 -> stuff_20150705,
>       previous_stuff_2 -> stuff_20150628,
>       ...
>
>Every weekend, you create a new collection and index everything current
>into it.  Once done, reset all the aliases to point to the newest N
>collections and drop the oldest:
>       current_stuff -> stuff_20150719
>       previous_stuff_1 -> stuff_20150712,
>       previous_stuff_2 -> stuff_20150705,
>       ...
>
>Collections API: Create or modify an Alias for a Collection
>https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4
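
The weekly alias rotation described above could be sketched like this. It picks the newest N dated collections, maps the public names onto them, and builds Collections API CREATEALIAS URLs without sending them. The base URL and the helper names are assumptions for illustration; the sketch also assumes the collection names end in a yyyymmdd date so that plain string sorting orders them by age.

```python
from urllib.parse import urlencode

# Assumed Collections API endpoint; adjust for the real cluster.
COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"


def alias_plan(collections, keep=3):
    """Map public alias names to the newest `keep` collections.

    Returns (aliases, dropped): aliases is an ordered dict of
    alias -> collection, dropped lists the collections that no alias
    points at any more."""
    newest_first = sorted(collections, reverse=True)
    kept, dropped = newest_first[:keep], newest_first[keep:]
    names = ["current_stuff"] + [f"previous_stuff_{i}" for i in range(1, keep)]
    return dict(zip(names, kept)), dropped


def alias_urls(aliases):
    """Build one CREATEALIAS URL per alias (CREATEALIAS also re-points
    an alias that already exists)."""
    return [
        f"{COLLECTIONS_API}?"
        + urlencode({"action": "CREATEALIAS", "name": alias, "collections": target})
        for alias, target in aliases.items()
    ]


if __name__ == "__main__":
    existing = ["stuff_20150628", "stuff_20150705",
                "stuff_20150712", "stuff_20150719"]
    aliases, dropped = alias_plan(existing)
    for url in alias_urls(aliases):
        print(url)
    print("no longer aliased:", dropped)
```

The collections returned in `dropped` are the candidates for deletion once the aliases have been switched.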
>
>Thus, you can keep the IDs the same and use them to compare to previous
>versions of any given document.   Useful, if only for debugging purposes.
>
>Curious if there are opportunities for optimization here.  For example,
>would it be faster to make a file system copy of the most recent
>collection and load only changed documents (assuming the delta is
>available from the source system)?
>
>-----Original Message-----
>From: Erick Erickson [mailto:erickerick...@gmail.com]
>Sent: Monday, July 13, 2015 11:55 PM
>To: solr-user@lucene.apache.org
>Subject: Re: copying data from one collection to another collection (solr
>cloud 521)
>
>bq: does offline....
>
>No. I'm talking about "collection aliasing". You can create an entirely
>new collection, index to it however you want, then switch to using that
>new collection.
>
>bq: Any updates to EXISTING document in the LIVE collection should NOT be
>replicated to the previous week(s) snapshot(s)
>
>then give it a new ID maybe?
>
>Best,
>Erick
>
>On Mon, Jul 13, 2015 at 3:21 PM, Raja Pothuganti
><rpothuga...@competitrack.com> wrote:
>> Thank you Erick
>>>Actually, my question is why do it this way at all? Why not index
>>>directly to your "live" nodes? This is what SolrCloud is built for.
>>>You can use "implicit" routing to create shards, say, for each week and
>>>age out the ones that are "too old" as well.
>>
>>
>> Any updates to an EXISTING document in the LIVE collection should NOT be
>> replicated to the previous week(s) snapshot(s). Think of the
>> snapshot(s) as an archive of sorts, searchable independently of LIVE.
>> We're aiming to support at most 2 archives of past data.
>>
>>
>>>Another option would be to use "collection aliasing" to keep an
>>>offline index up to date then switch over when necessary.
>>
>> Does offline indexing refer to this link?
>> https://github.com/cloudera/search/tree/0d47ff79d6ccc0129ffadcb50f9fe0b271f102aa/search-mr
>>
>>
>> Thanks
>> Raja
>>
>>
>>
>> On 7/13/15, 3:14 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>
>>>Actually, my question is why do it this way at all? Why not index
>>>directly to your "live" nodes? This is what SolrCloud is built for.
>>>
>>>There's the new backup/restore functionality that's still a work in
>>>progress, see: https://issues.apache.org/jira/browse/SOLR-5750
>>>
>>>You can use "implicit" routing to create shards, say, for each week and
>>>age out the ones that are "too old" as well.
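
The per-week shard idea could be sketched as follows, assuming a collection that was created with router.name=implicit. The sketch builds Collections API CREATESHARD and DELETESHARD URLs for adding the new week's shard and ageing out the oldest ones; it does not send them. The base URL, the collection name and the shard names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed Collections API endpoint; adjust for the real cluster.
COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"


def api_url(action, **params):
    """Build a Collections API URL for the given action."""
    return f"{COLLECTIONS_API}?{urlencode({'action': action, **params})}"


def rotate_weekly_shards(collection, shards, new_shard, keep):
    """Add `new_shard` and age out everything beyond the newest `keep`.

    `shards` lists the existing shard names, oldest first, and `keep`
    must be at least 1. Returns (urls, remaining_shards)."""
    urls = [api_url("CREATESHARD", collection=collection, shard=new_shard)]
    all_shards = shards + [new_shard]
    expired, remaining = all_shards[:-keep], all_shards[-keep:]
    urls += [
        api_url("DELETESHARD", collection=collection, shard=old)
        for old in expired
    ]
    return urls, remaining


if __name__ == "__main__":
    urls, remaining = rotate_weekly_shards(
        "stuff", ["week27", "week28", "week29"], "week30", keep=3)
    for url in urls:
        print(url)
    print("shards kept:", remaining)
```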
>>>
>>>Another option would be to use "collection aliasing" to keep an
>>>offline index up to date then switch over when necessary.
>>>
>>>I'd really like to know this isn't an XY problem though, what's the
>>>high-level problem you're trying to solve?
>>>
>>>Best,
>>>Erick
>>>
>>>On Mon, Jul 13, 2015 at 12:49 PM, Raja Pothuganti
>>><rpothuga...@competitrack.com> wrote:
>>>>
>>>> Hi,
>>>> We are setting up a new SolrCloud environment with 5.2.1 on Ubuntu
>>>>boxes. We currently ingest data into a large collection, call it LIVE.
>>>>After the full ingest is done we then trigger a delta ingestion
>>>>every 15 minutes to get the documents & data that have changed into
>>>>this LIVE instance.
>>>>
>>>> In Solr 4.X using a Master / Slave setup we had slaves that would
>>>>periodically (weekly, or monthly) refresh their data from the Master
>>>>rather than every 15 minutes. We're now trying to figure out how to
>>>>get this same type of setup using SolrCloud.
>>>>
>>>> Question(s):
>>>> - Is there a way to copy data from one SolrCloud collection into
>>>>another quickly and easily?
>>>> - Is there a way to programmatically control when a replica receives
>>>>its data or possibly move it to another collection (without losing
>>>>data) that updates on a different interval? It ideally would be
>>>>another collection name, call it Week1 ... Week52 ... to avoid a
>>>>replica in the same collection serving old data.
>>>>
>>>> One option we thought of was to create a backup and then restore
>>>>that into a new clean cloud. This has a lot of moving parts and isn't
>>>>nearly as neat as the Master / Slave controlled replication setup. It
>>>>also has the side effect of potentially taking a very long time to
>>>>backup and restore instead of just copying the indexes like the old
>>>>M/S setup.
>>>>
>>>> Any ideas or thoughts? Thanks in advance for your help.
>>>> Raja
>>
>
