bq:  So I guess with compositeId router I am out of luck.

No, not at all. Atomic updates are exactly about updating
a doc and NOT changing the id. A different uniqueKey is
a different doc by definition.

So you can easily use atomic updates with composite IDs:
you are changing a field of an existing doc, and since the id
(routing prefix included) doesn't change, the update is routed
to the same shard.
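To make that concrete, here is a sketch of what such an atomic-update request body could look like. The field names, the id format, and the update path mentioned in the comments are invented for the example:

```python
import json

# Hypothetical composite id: "<routing prefix>!<local id>". The part
# before "!" is what the compositeId router hashes, so as long as the
# id itself is unchanged, the update goes to the doc's existing shard.
doc_id = "en_201410!12345"

# Atomic update: "set" replaces the named field's value and leaves
# every other field of the stored document intact.
atomic_update = [{
    "id": doc_id,
    "title": {"set": "An updated title"},
}]

# This JSON body would be POSTed to /solr/<collection>/update
# with Content-Type: application/json.
payload = json.dumps(atomic_update)
print(payload)
```

Note that atomic updates require the document's other fields to be stored, since Solr rebuilds the document behind the scenes.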

But that may be irrelevant....

Take a look at LotsOfCores (WARNING: this is NOT
verified in SolrCloud!). The design there is exactly to
limit the number of cores loaded in memory at any one time,
having them load/unload themselves based on the limits you
set up. So you can fire queries blindly at your server, with
the core name in the URL, and be confident that you'll stay
within your hardware limits.

http://wiki.apache.org/solr/LotsOfCores
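For the archives, the knobs that page describes boil down to something like the following; the cache size is an arbitrary example value, and again this is stand-alone solr.xml, not SolrCloud:

```xml
<!-- solr.xml: cap the number of "transient" cores kept loaded at once.
     When the cap is exceeded, the least-recently-used transient core
     is unloaded to make room. -->
<solr>
  <int name="transientCacheSize">10</int>
</solr>
```

Each core that should participate is then marked in its core.properties with transient=true (eligible for unloading) and loadOnStartup=false (loaded lazily on first request).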

If you're using SolrCloud, though, there's really no concept
of loading and unloading specific cores/indexes on demand; it
presupposes that you've scaled your system such that you can
have them all active at once. So I don't really see how routing
to specific cores is going to help you.
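One knob that may be relevant to the custom-routing question quoted below: when a collection is created, the compositeId router can be told to hash on a field other than the uniqueKey via router.field. A sketch of building such a Collections API call (the host, collection name, shard count, and field name are all placeholders):

```python
from urllib.parse import urlencode

# All values here are placeholders for the example.
params = {
    "action": "CREATE",
    "name": "archive",
    "numShards": "4",
    "router.name": "compositeId",
    # Hash the value of this field (rather than the uniqueKey)
    # to decide which shard each document lands on.
    "router.field": "lang_span",
}
url = "http://localhost:8983/solr/admin/collections?" + urlencode(params)
print(url)
```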

Then again I don't know your problem space.

Best,
Erick

On Tue, Nov 11, 2014 at 11:33 AM, Michal Krajňanský
<michal.krajnan...@gmail.com> wrote:
> Hm. So I found that one can update stored fields with an "atomic update"
> operation; however, according to
> http://stackoverflow.com/questions/19058795/it-is-possible-to-update-uniquekey-in-solr-4
> this will not work for the uniqueKey. So I guess with the compositeId router I
> am out of luck.
>
> I have also been searching for a way to implement my own routing mechanism.
> Anyway, this seems to be the cleaner solution -- I would not need to modify
> the existing index, just compute the hash from other (stored) fields rather
> than just the document id. Can you confirm that this is possible? The
> documentation is, however, very sparse (I only found that it is possible to
> specify a custom hash function).
>
> Best,
>
> Michal
>
> 2014-11-11 16:48 GMT+01:00 Michael Della Bitta <
> michael.della.bi...@appinions.com>:
>
>> Yeah, Erick confused me a bit too, but I think what he's talking about
>> takes for granted that you'd have your various indexes directly set up as
>> individual collections.
>>
>> If instead you're considering one big collection, or a few collections
>> based on aggregations of your individual indexes, having big, multisharded
>> collections using compositeId should work, unless there's a use case we're
>> not discussing.
>>
>> Michael
>>
>>
>> On 11/11/14 10:27, Michal Krajňanský wrote:
>>
>>> Hi Eric, Michael,
>>>
>>> thank you both for your comments.
>>>
>>> 2014-11-11 5:05 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
>>>
>>>> bq: - the documents are organized in "shards" according to date (integer)
>>>> and language (a possibly extensible discrete set)
>>>>
>>>> bq: - the indexes are disjunct
>>>>
>>>> OK, I'm having a hard time getting my head around these two statements.
>>>>
>>>> If the indexes are disjunct in the sense that you only search one at a
>>>> time,
>>>> then they are different "collections" in SolrCloud jargon.
>>>>
>>>>
>>> I just meant that every document is contained in a single one of the
>>> indexes. I have a lot of Lucene indexes for various [language X timespan],
>>> but logically we are speaking about a single huge index. That is why I
>>> thought it would be natural to represent it as a single SolrCloud
>>> collection.
>>>
>>>> If, on the other hand, these are a big collection and you want to search
>>>> them all with a single query, I suggest that in SolrCloud land you don't
>>>> want them to be discrete shards. My reasoning here is that let's say you
>>>> have a bunch of documents for October, 2014 in Spanish. By putting these
>>>> all on a single shard, your queries all have to be serviced by that one
>>>> shard. You don't get any parallelism.
>>>>
>>>>
>>> That is right. Actually the parallelization is not the main issue right
>>> now. The queries are very sparse; currently our system does not support
>>> load balancing at all. I imagined that in the future it could be
>>> achievable via SolrCloud replication.
>>>
>>> The main consideration is to be able to plug the indexes in and out on
>>> demand. The total size of the data is in terabytes. We usually want to
>>> search only the latest indexes, but occasionally we need to plug in
>>> one of the older ones.
>>>
>>> Maybe (probably) I still have some misconceptions about the uses of
>>> SolrCloud...
>>>
>>>> If it really does make sense in your case to route all the docs to a
>>>> single shard, then Michael's comment is spot-on: use the compositeId router.
>>>>
>>>>
>>> You confuse me here. I was not thinking about a single shard; on the
>>> contrary, any [language X timespan] index would itself be a shard. I agree
>>> that the compositeId router seems natural for what I need. I am currently
>>> searching for a way to convert my indexes in such a way that my document
>>> IDs have the composite format. Currently these are just unique integers,
>>> so I would like to prefix all the document IDs of an index with its
>>> language and timespan. I do not know how, but I believe this should be
>>> possible, as it is a constant operation that would not change the
>>> structure of the index.
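As a sketch, the prefixing described above could look like this; the separator between language and timespan is illustrative, while the "!" follows Solr's compositeId convention of <prefix>!<id>:

```python
# Illustrative sketch of the id rewrite described above: prefix each
# existing integer id with the index's language and timespan. The part
# before "!" becomes the compositeId routing prefix, so all docs from
# one [language X timespan] index hash to the same shard.
def to_composite_id(lang: str, span: str, doc_id: int) -> str:
    return f"{lang}_{span}!{doc_id}"

print(to_composite_id("es", "201410", 12345))  # -> es_201410!12345
```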
>>>
>>> Best,
>>>
>>> Michal
>>>
>>>
>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Mon, Nov 10, 2014 at 11:50 AM, Michael Della Bitta
>>>> <michael.della.bi...@appinions.com> wrote:
>>>>
>>>>> Hi Michal,
>>>>>
>>>>> Is there a particular reason to shard your collections like that? If it
>>>>> was mainly for ease of operations, I'd consider just using CompositeId to
>>>>> prevent specific types of queries from hotspotting particular nodes.
>>>>>
>>>>> If your ingest rate is fast, you might also consider making each
>>>>> "collection" an alias that points to many actual collections, and
>>>>> periodically closing off a collection and starting a new one. This
>>>>> prevents cache churn and the impact of large merges.
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>>
>>>>> On 11/10/14 08:03, Michal Krajňanský wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have been working on a project that has long employed Lucene indexer.
>>>>>>
>>>>>> Currently, the system implements proprietary document routing and index
>>>>>> plugging/unplugging on top of Lucene, and of course contains a great
>>>>>> body of indexes. Recently an idea came up to migrate from Lucene to
>>>>>> SolrCloud, which appears to be more powerful than our proprietary
>>>>>> system.
>>>>>>
>>>>>> Could you suggest the best way to seamlessly migrate the system to
>>>>>> SolrCloud, when reindexing is not an option?
>>>>>>
>>>>>> - all the existing indexes represent a single collection in terms of
>>>>>> SolrCloud
>>>>>> - the documents are organized in "shards" according to date (integer)
>>>>>> and language (a possibly extensible discrete set)
>>>>>> - the indexes are disjunct
>>>>>>
>>>>>> I have been able to convert the existing indexes to the newest Lucene
>>>>>> version and plug them individually into SolrCloud. However, there is
>>>>>> the question of routing, sharding, etc.
>>>>>>
>>>>>> Any insight appreciated.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>>
>>>>>> Michal Krajnansky
>>>>>>
>>>>>>
>>
