Charles:

bq:  My understanding is that this is actually somewhat slower than
the standard indexing path...

Yes and no. If you just use a single thread, you're right: it'll be
slower, since it has to copy a bunch of stuff around. At the end, the
--go-live step copies the built index over to Solr and then runs a
MERGEINDEXES on it, and that copying can take some time. Not to
mention that a number of the intermediate steps also run MERGEINDEXES
several times to gather lots of sub-shards together, so you can end
up copying your index several times for each shard.
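
A minimal sketch of that final merge step, issued by hand against the
CoreAdmin API (host, core name, and index path below are made up for
illustration):

    import requests

    # Merge a built index directory (e.g. MRIT output) into an existing
    # core. The core name and indexDir here are hypothetical.
    resp = requests.get(
        "http://localhost:8983/solr/admin/cores",
        params={
            "action": "MERGEINDEXES",
            "core": "collection1_shard1_replica1",
            "indexDir": "/tmp/mrit-output/part-00000/data/index",
            "wt": "json",
        },
    )
    resp.raise_for_status()
    print(resp.json())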

However, in a situation where you need to index a zillion documents
into, say, 5 nodes but have 200 nodes available in your cluster, the
extra copying time is more than offset by being able to farm the
indexing out across those 200 nodes. MRIT actually uses
EmbeddedSolrServer under the covers, so you get a lot of parallelism.
Or in a situation where the amount of data is massive, copying it
somewhere the standard indexing path can find it may, in fact, be
prohibitive. Or in situations where the ETL pipeline is a bottleneck
that can be farmed out over a zillion commodity nodes. So
It Depends (tm).

bq:  ... is recommended only if the site is already invested in the
Hadoop infrastructure

That's mostly my feeling too. Hadoop adds its own complexity, although
there are some really
cool tools out there to help. I'm just not in favor of adding
complexity unless there's a
compelling use-case. M/R indexing by itself can be enough inducement
to move to Hadoop in some situations, though.

Best,
Erick


On Wed, Jul 15, 2015 at 8:28 AM, Reitzel, Charles
<charles.reit...@tiaa-cref.org> wrote:
> The OP asked about MapReduceIndexerTool.   My understanding is that this is 
> actually somewhat slower than the standard indexing path and is recommended 
> only if the site is already invested in the Hadoop infrastructure.  E.g. 
> input files are already distributed on the Hadoop/Search cluster via HDFS.
>
> See also:
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>
> Note, there is no coordination between HDFS replication and Solr 
> replication.  Thus, if you configure Solr with replication factor N > 1 for 
> each shard, and the HDFS replication factor is M > 1, then you get N * M 
> copies of all your index data.   That can add up fast ...
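>
> A toy illustration of how that multiplies (all numbers made up):
>
>     solr_replicas = 3    # Solr replicationFactor per shard
>     hdfs_replicas = 3    # HDFS dfs.replication
>     index_size_gb = 100  # logical index size
>     print(solr_replicas * hdfs_replicas * index_size_gb)  # 900 GB on disk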
>
> There is work underway to harmonize/mitigate Solr and HDFS replication:
> Ability to set the replication factor for index files created by 
> HDFSDirectoryFactory
> https://issues.apache.org/jira/browse/SOLR-6305
>
> To get a feel for the overall condition of MR/Solr integration, I looked at 
> JIRA issues related to HDFS and Hadoop.   It appears to be an area with a 
> steady stream of bug fixes.  There are some larger feature issues as well, but it isn't 
> clear how much momentum these have.   Can anyone (developers, current users) 
> comment on the state of Hadoop integration?
>
> ---------
>
> Currently open JIRA issues for Solr containing "HDFS" or "Hadoop":
> https://issues.apache.org/jira/browse/SOLR-5069?jql=project%20%3D%20SOLR%20AND%20status%20%3D%20OPEN%20AND%20%28text%20~%20%22HDFS%22%20OR%20text%20~%20%22Hadoop%22%29%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC%2C%20created%20ASC
>
> Recently closed issues containing "HDFS" or "Hadoop":
> https://issues.apache.org/jira/browse/SOLR-7458?jql=project%20%3D%20SOLR%20AND%20status%20!%3D%20OPEN%20AND%20%28text%20~%20%22HDFS%22%20OR%20text%20~%20%22Hadoop%22%29%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC
>
>
> -----Original Message-----
> From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
> Sent: Wednesday, July 15, 2015 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: RE: copying data from one collection to another collection (solr 
> cloud 521)
>
> Since they want to search explicitly within a given "version" of the data, this 
> seems like a textbook application for collection aliases.
>
> You could have N public collection names: current_stuff, previous_stuff_1, 
> previous_stuff_2, ...   At any given time, these will be aliased to reference 
> the "actual" collection names:
>         current_stuff -> stuff_20150712,
>         previous_stuff_1 -> stuff_20150705,
>         previous_stuff_2 -> stuff_20150628,
>         ...
>
> Every weekend, you create a new collection and index everything current into 
> it.  Once done, reset all the aliases to point to the newest N collections, 
> dropping the oldest:
>         current_stuff -> stuff_20150719
>         previous_stuff_1 -> stuff_20150712,
>         previous_stuff_2 -> stuff_20150705,
>         ...
>
> Collections API: Create or modify an Alias for a Collection
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api4
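>
> A rough sketch of that weekly rollover against the Collections API (host, 
> config name, and shard/replica counts below are made up; CREATEALIAS 
> simply repoints an alias that already exists):
>
>     import requests
>
>     SOLR = "http://localhost:8983/solr"  # hypothetical host
>
>     def collections_api(params):
>         resp = requests.get(SOLR + "/admin/collections",
>                             params=dict(params, wt="json"))
>         resp.raise_for_status()
>         return resp.json()
>
>     # 1. Create next week's collection (counts/config are assumptions).
>     collections_api({"action": "CREATE", "name": "stuff_20150719",
>                      "numShards": 2, "replicationFactor": 2,
>                      "collection.configName": "stuff_conf"})
>
>     # 2. ... index everything current into stuff_20150719 ...
>
>     # 3. Repoint the aliases to the newest N collections.
>     collections_api({"action": "CREATEALIAS", "name": "current_stuff",
>                      "collections": "stuff_20150719"})
>     collections_api({"action": "CREATEALIAS", "name": "previous_stuff_1",
>                      "collections": "stuff_20150712"})
>     collections_api({"action": "CREATEALIAS", "name": "previous_stuff_2",
>                      "collections": "stuff_20150705"})
>
>     # 4. Drop the collection that just fell off the end.
>     collections_api({"action": "DELETE", "name": "stuff_20150628"})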
>
> Thus, you can keep the IDs the same and use them to compare to previous 
> versions of any given document.   Useful, if only for debugging purposes.
>
> Curious if there are opportunities for optimization here.  For example, would 
> it be faster to make a file system copy of the most recent collection and 
> load only changed documents (assuming the delta is available from the source 
> system)?
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Monday, July 13, 2015 11:55 PM
> To: solr-user@lucene.apache.org
> Subject: Re: copying data from one collection to another collection (solr 
> cloud 521)
>
> bq: does offline....
>
> No. I'm talking about "collection aliasing". You can create an entirely new 
> collection, index to it however you want, then switch to using that new 
> collection.
>
> bq: Any updates to EXISTING documents in the LIVE collection should NOT be 
> replicated to the previous week(s) snapshot(s)
>
> Then give it a new ID, maybe?
>
> Best,
> Erick
>
> On Mon, Jul 13, 2015 at 3:21 PM, Raja Pothuganti 
> <rpothuga...@competitrack.com> wrote:
>> Thank you Erick
>>>Actually, my question is why do it this way at all? Why not index
>>>directly to your "live" nodes? This is what SolrCloud is built for.
>>>You can use "implicit" routing to create shards, say, one per week, and
>>>age out the ones that are "too old" as well.
>>
>>
>> Any updates to EXISTING documents in the LIVE collection should NOT be
>> replicated to the previous week(s) snapshot(s). Think of the
>> snapshot(s) as an archive of sorts, searchable independently of LIVE.
>> We're aiming to support at most 2 archives of data in the past.
>>
>>
>>>Another option would be to use "collection aliasing" to keep an
>>>offline index up to date then switch over when necessary.
>>
>> Does offline indexing refer to this link:
>> https://github.com/cloudera/search/tree/0d47ff79d6ccc0129ffadcb50f9fe0b271f102aa/search-mr
>>
>>
>> Thanks
>> Raja
>>
>>
>>
>> On 7/13/15, 3:14 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>
>>>Actually, my question is why do it this way at all? Why not index
>>>directly to your "live" nodes? This is what SolrCloud is built for.
>>>
>>>There's the new backup/restore functionality that's still a work in
>>>progress, see: https://issues.apache.org/jira/browse/SOLR-5750
>>>
>>>You can use "implicit" routing to create shards, say, one per week, and
>>>age out the ones that are "too old" as well.
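>>>
>>>A rough sketch of that shard-per-week idea (host, names, and the
>>>routing field below are made up for illustration):
>>>
>>>    import requests
>>>
>>>    SOLR = "http://localhost:8983/solr"  # hypothetical host
>>>
>>>    def collections_api(params):
>>>        resp = requests.get(SOLR + "/admin/collections",
>>>                            params=dict(params, wt="json"))
>>>        resp.raise_for_status()
>>>        return resp.json()
>>>
>>>    # Create a collection with the implicit router and named weekly shards.
>>>    collections_api({"action": "CREATE", "name": "live",
>>>                     "router.name": "implicit",
>>>                     "shards": "week28,week29",
>>>                     "router.field": "week_s",
>>>                     "collection.configName": "live_conf"})
>>>
>>>    # Each week: add the new shard, then age out the oldest one.
>>>    collections_api({"action": "CREATESHARD", "collection": "live",
>>>                     "shard": "week30"})
>>>    collections_api({"action": "DELETESHARD", "collection": "live",
>>>                     "shard": "week28"})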
>>>
>>>Another option would be to use "collection aliasing" to keep an
>>>offline index up to date then switch over when necessary.
>>>
>>>I'd really like to know this isn't an XY problem though, what's the
>>>high-level problem you're trying to solve?
>>>
>>>Best,
>>>Erick
>>>
>>>On Mon, Jul 13, 2015 at 12:49 PM, Raja Pothuganti
>>><rpothuga...@competitrack.com> wrote:
>>>>
>>>> Hi,
>>>> We are setting up a new SolrCloud environment with 5.2.1 on Ubuntu
>>>>boxes. We currently ingest data into a large collection, call it LIVE.
>>>>After the full ingest is done, we then trigger a delta ingestion
>>>>every 15 minutes to get the documents & data that have changed into
>>>>this LIVE instance.
>>>>
>>>> In Solr 4.X using a Master / Slave setup we had slaves that would
>>>>periodically (weekly, or monthly) refresh their data from the Master
>>>>rather than every 15 minutes. We're now trying to figure out how to
>>>>get this same type of setup using SolrCloud.
>>>>
>>>> Question(s):
>>>> - Is there a way to copy data from one SolrCloud collection into
>>>>another quickly and easily?
>>>> - Is there a way to programmatically control when a replica receives
>>>>its data, or possibly move it to another collection (without losing
>>>>data) that updates on a  different interval? It ideally would be
>>>>another collection name, call it Week1 ... Week52 ... to avoid a
>>>>replica in the same collection serving old data.
>>>>
>>>> One option we thought of was to create a backup and then restore
>>>>that into a new clean cloud. This has a lot of moving parts and isn't
>>>>nearly as neat as the Master / Slave controlled replication setup. It
>>>>also has the side effect of potentially taking a very long time to
>>>>backup and restore instead of just copying the indexes like the old M/S 
>>>>setup.
>>>>
>>>> Any ideas or thoughts? Thanks in advance for your help.
>>>> Raja
>>
>
> *************************************************************************
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender immediately 
> and then delete it.
>
> TIAA-CREF
> *************************************************************************
>
