The OP asked about MapReduceIndexerTool. My understanding is that it is actually somewhat slower than the standard indexing path and is recommended only if the site is already invested in the Hadoop infrastructure, e.g., the input files are already distributed across the Hadoop/Search cluster via HDFS.

See also: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

Note that there is no coordination between HDFS replication and Solr replication. Thus, if you configure Solr replication N > 1 for each shard, and the HDFS replication factor is M > 1, then you get N * M copies of all your index data. For example, 2 Solr replicas per shard on top of 3-way HDFS block replication means 6 copies of every index file. That can add up fast ...

There is work underway to harmonize/mitigate Solr and HDFS replication:

Ability to set the replication factor for index files created by HDFSDirectoryFactory
https://issues.apache.org/jira/browse/SOLR-6305
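For reference, the setup described in the "Running Solr on HDFS" page above boils down to swapping in the HDFS directory factory in solrconfig.xml, roughly like this (the name node host and path are illustrative):

    <!-- illustrative host/path; see the cwiki page for the full option list -->
    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    </directoryFactory>

plus <lockType>hdfs</lockType> in the indexConfig section.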
To get a feel for the overall condition of the MR/Solr integration, I looked at the JIRA issues related to HDFS and Hadoop. It appears to be an area with a steady stream of solid bug fixes. There are some larger feature issues as well, but it isn't clear how much momentum those have. Can anyone (developers, current users) comment on the state of the Hadoop integration?

---------

Currently open JIRA issues for Solr containing "HDFS" or "Hadoop":
https://issues.apache.org/jira/browse/SOLR-5069?jql=project%20%3D%20SOLR%20AND%20status%20%3D%20OPEN%20AND%20%28text%20~%20%22HDFS%22%20OR%20text%20~%20%22Hadoop%22%29%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC%2C%20created%20ASC

Recently closed issues containing "HDFS" or "Hadoop":
https://issues.apache.org/jira/browse/SOLR-7458?jql=project%20%3D%20SOLR%20AND%20status%20!%3D%20OPEN%20AND%20%28text%20~%20%22HDFS%22%20OR%20text%20~%20%22Hadoop%22%29%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC

-----Original Message-----
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
Sent: Wednesday, July 15, 2015 11:24 AM
To: solr-user@lucene.apache.org
Subject: RE: copying data from one collection to another collection (solr cloud 521)

Since they explicitly want to search within a given "version" of the data, this seems like a textbook application for collection aliases.

You could have N public collection names: current_stuff, previous_stuff_1, previous_stuff_2, ...

At any given time, these are aliased to the "actual" collection names:

current_stuff -> stuff_20150712
previous_stuff_1 -> stuff_20150705
previous_stuff_2 -> stuff_20150628
...

Every weekend, you create a new collection and index everything current into it. Once done, reset all the aliases to point to the newest N collections, dropping the oldest:

current_stuff -> stuff_20150719
previous_stuff_1 -> stuff_20150712
previous_stuff_2 -> stuff_20150705
...

Collections API: Create or modify an Alias for a Collection
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4

Thus, you can keep the IDs the same and use them to compare against previous versions of any given document. Useful, if only for debugging purposes.

Curious if there are opportunities for optimization here. For example, would it be faster to make a file-system copy of the most recent collection and load only the changed documents (assuming the delta is available from the source system)?
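A minimal sketch of the weekly flip against the Collections API, in Python (the localhost URL and the collection names are illustrative):

    # Re-point the public aliases after the new weekly collection is built.
    from urllib.request import urlopen
    from urllib.parse import urlencode

    SOLR = "http://localhost:8983/solr/admin/collections"

    def create_alias(alias, collection):
        # CREATEALIAS also re-points an alias that already exists, so the
        # same call serves both initial setup and the weekly flip.
        qs = urlencode({"action": "CREATEALIAS", "name": alias,
                        "collections": collection, "wt": "json"})
        return urlopen(SOLR + "?" + qs).read()

    # Once stuff_20150719 is fully indexed:
    create_alias("current_stuff", "stuff_20150719")
    create_alias("previous_stuff_1", "stuff_20150712")
    create_alias("previous_stuff_2", "stuff_20150705")
    # ... then issue action=DELETE for the oldest collection (stuff_20150628).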
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, July 13, 2015 11:55 PM
To: solr-user@lucene.apache.org
Subject: Re: copying data from one collection to another collection (solr cloud 521)

bq: does offline....

No. I'm talking about "collection aliasing". You can create an entirely new collection, index to it however you want, then switch to using that new collection.

bq: Any updates to EXISTING document in the LIVE collection should NOT be replicated to the previous week(s) snapshot(s)

Then give it a new ID, maybe?

Best,
Erick

On Mon, Jul 13, 2015 at 3:21 PM, Raja Pothuganti
<rpothuga...@competitrack.com> wrote:
> Thank you Erick
>
>>Actually, my question is why do it this way at all? Why not index
>>directly to your "live" nodes? This is what SolrCloud is built for.
>>You can use "implicit" routing to create shards, say, for each week and
>>age out the ones that are "too old" as well.
>
> Any updates to an EXISTING document in the LIVE collection should NOT be
> replicated to the previous week(s) snapshot(s). Think of the
> snapshot(s) as an archive of sorts, searchable independently of LIVE.
> We're aiming to support at most 2 archives of data in the past.
>
>>Another option would be to use "collection aliasing" to keep an
>>offline index up to date then switch over when necessary.
>
> Does offline indexing refer to this link?
> https://github.com/cloudera/search/tree/0d47ff79d6ccc0129ffadcb50f9fe0b271f102aa/search-mr
>
> Thanks
> Raja
>
> On 7/13/15, 3:14 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>>Actually, my question is why do it this way at all? Why not index
>>directly to your "live" nodes? This is what SolrCloud is built for.
>>
>>There's the new backup/restore functionality that's still a work in
>>progress; see: https://issues.apache.org/jira/browse/SOLR-5750
>>
>>You can use "implicit" routing to create shards, say, for each week and
>>age out the ones that are "too old" as well.
>>
>>Another option would be to use "collection aliasing" to keep an
>>offline index up to date then switch over when necessary.
>>
>>I'd really like to know this isn't an XY problem, though: what's the
>>high-level problem you're trying to solve?
>>
>>Best,
>>Erick
>>
>>On Mon, Jul 13, 2015 at 12:49 PM, Raja Pothuganti
>><rpothuga...@competitrack.com> wrote:
>>>
>>> Hi,
>>> We are setting up a new SolrCloud environment with 5.2.1 on Ubuntu
>>> boxes. We currently ingest data into a large collection, call it LIVE.
>>> After the full ingest is done, we then trigger a delta ingestion
>>> every 15 minutes to get the documents & data that have changed into
>>> this LIVE instance.
>>>
>>> In Solr 4.x, using a Master/Slave setup, we had slaves that would
>>> periodically (weekly or monthly) refresh their data from the Master
>>> rather than every 15 minutes. We're now trying to figure out how to
>>> get this same type of setup using SolrCloud.
>>>
>>> Question(s):
>>> - Is there a way to copy data from one SolrCloud collection into
>>> another quickly and easily?
>>> - Is there a way to programmatically control when a replica receives
>>> its data, or possibly to move it to another collection (without losing
>>> data) that updates on a different interval? Ideally it would be
>>> another collection name, call it Week1 ... Week52 ..., to avoid a
>>> replica in the same collection serving old data.
>>>
>>> One option we thought of was to create a backup and then restore
>>> that into a new, clean cloud. This has a lot of moving parts and isn't
>>> nearly as neat as the Master/Slave controlled replication setup. It
>>> also has the side effect of potentially taking a very long time to
>>> back up and restore instead of just copying the indexes like the old
>>> M/S setup.
>>>
>>> Any ideas or thoughts? Thanks in advance for your help.
>>> Raja
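For completeness, a rough sketch of the "implicit" routing option Erick describes above, again via the Collections API (the URL, collection/shard names, and routing field are illustrative; CREATESHARD/DELETESHARD only apply to implicitly routed collections):

    # One shard per week with the implicit router; shards are added and
    # aged out via the Collections API.
    from urllib.request import urlopen
    from urllib.parse import urlencode

    SOLR = "http://localhost:8983/solr/admin/collections"

    def collections_api(params):
        qs = urlencode(dict(params, wt="json"))
        return urlopen(SOLR + "?" + qs).read()

    # Create the collection with named shards; documents are routed to a
    # shard by the value of their "week" field.
    collections_api({"action": "CREATE", "name": "stuff",
                     "router.name": "implicit", "router.field": "week",
                     "shards": "week28,week29", "maxShardsPerNode": 4})

    # Each week: add a shard for the new week, drop the oldest one.
    collections_api({"action": "CREATESHARD", "collection": "stuff",
                     "shard": "week30"})
    collections_api({"action": "DELETESHARD", "collection": "stuff",
                     "shard": "week28"})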