Groovy, that's consensus then. Where we can use _db_updates to know which pending jobs are worth running, we will. If it returns a 404, we'll fall back to something less optimal: starting the job itself. Using the connection pool as discussed, and penalising jobs that complete very quickly or perform no useful work with a rising retry interval, should mitigate some of that cost. Happily, we discussed exactly those kinds of scheduler nuances this week.
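For concreteness, the penalty could look something like the sketch below. The module name and thresholds are invented; this is not the real scheduler code, just the shape of the idea:

    %% Sketch only; not the actual couch_replicator scheduler. A job that
    %% stops before it has run for a "useful" amount of time gets its retry
    %% interval doubled, up to a cap; a job that ran long enough to do real
    %% work starts fresh at the base interval.
    -module(repl_backoff_sketch).
    -export([next_interval/2]).

    -define(BASE_INTERVAL, 5000).        %% 5 seconds
    -define(MAX_INTERVAL, 3600000).      %% cap at 1 hour
    -define(MIN_USEFUL_RUNTIME, 30000).  %% anything shorter looks unhealthy

    %% RuntimeMs: how long the job ran before it stopped or crashed.
    %% PrevInterval: the delay we used before the previous attempt.
    next_interval(RuntimeMs, _PrevInterval) when RuntimeMs >= ?MIN_USEFUL_RUNTIME ->
        ?BASE_INTERVAL;
    next_interval(_RuntimeMs, PrevInterval) ->
        min(PrevInterval * 2, ?MAX_INTERVAL).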
> On 20 Mar 2016, at 14:54, Adam Kocoloski <[email protected]> wrote:
>
> I’ll never berate anyone for top-posting (or bottom-posting for that matter). I just follow suit with whatever the current thread is doing — in this, very very clearly top-posting ;)
>
> Thank you for making this distinction clear. Personally I was only ever interested in the first case. Scoping the replicator manager to only learn about _replicator docs on the local cluster through internal APIs is a smart move — I wasn’t suggesting anything different there. I do think we should have an efficient way for a client to learn about the existence of new updates to an arbitrary number of databases on a remote cluster using a single socket.
>
> Adam
>
>> On Mar 20, 2016, at 10:45 AM, Robert Samuel Newson <[email protected]> wrote:
>>
>> Final note, we've conflated two uses of /_db_updates that I want to be very clear on:
>>
>> 1) using /_db_updates to detect active source databases of a replication job.
>> 2) using /_db_updates to hear about new/updated/deleted _replicator documents.
>>
>> It was the 2nd case where the unreliability was a concern, since the update frequency is very low, one expects, for _replicator databases.
>>
>>> On 20 Mar 2016, at 14:44, Robert Samuel Newson <[email protected]> wrote:
>>>
>>> (I swear I'll stop soon...)
>>>
>>> Using /_db_updates as a cheap mechanism to detect activity at the source for any database we're interested in is an important optimization. We didn't discuss it this past week as we felt that /_db_updates wasn't sufficiently reliable. We can save a lot of churn in the scheduler by simply not resuming any job unless we have seen an update to the source database.
>>>
>>> B.
>>>
>>>> On 20 Mar 2016, at 14:36, Robert Samuel Newson <[email protected]> wrote:
>>>>
>>>> I missed a point in Adam's earlier post.
>>>>
>>>> The current scheme uses couch_event for runtime changes to _replicator docs but has to read all updates of all _replicator databases at startup. In the steady state it is just receiving couch_event notifications. The /_db_updates option would change that only slightly (we'd read /_db_updates from 0 to find all _replicator databases, rather than reading the changes feed for the node-local 'dbs' database).
>>>>
>>>> CouchDB itself has a single /_replicator database, of course, but the code will consider any database to be a _replicator database if the name ends that way. I.e., today, if you made a database called foo/_replicator it would be considered a _replicator database by the system (and we'd inject the ddoc, etc).
>>>>
>>>> B.
>>>>
>>>>> On 20 Mar 2016, at 14:31, Robert Samuel Newson <[email protected]> wrote:
>>>>>
>>>>> Since I'm typing anyway, and haven't yet been dinged for top-posting, I wanted to mention one other optimization we had in mind.
>>>>>
>>>>> Currently each replicator job has its own connection pool. When we introduce the notion that we can stop and restart jobs, those pools become approximately useless. So we will obviously hoist them 'up' and manage connection pools at the manager level.
>>>>>
>>>>> One optimization that seems obvious from the Cloudant perspective is to allow reuse of connections to the same destinations even though they are ostensibly for different domains. That is, a connection to rnewson.cloudant.com is ultimately a connection to lbX.jenever.cloudant.com. This connection could just as easily be used for any other user in the jenever cluster. Thus, if it's idle, we could borrow that connection rather than create a new one.
>>>>>
>>>>>> host rnewson.cloudant.com
>>>>> rnewson.cloudant.com is an alias for jenever.cloudant.com.
>>>>> jenever.cloudant.com is an alias for lb2.jenever.cloudant.com.
>>>>> lb2.jenever.cloudant.com has address 5.153.0.207
>>>>>
>>>>> Rather than add rnewson.cloudant.com -> 5.153.0.207 to the pool, we would add lb2.jenever.cloudant.com -> 5.153.0.207 and resolve rnewson.cloudant.com to its ultimate CNAME before consulting the pool.
>>>>>
>>>>> Does this optimization help elsewhere than Cloudant?
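To make the pool-keying idea concrete, a rough sketch follows. The module name is invented, and the exact hostent shape that inet_res:getbyname/2 returns for CNAME queries is worth verifying; treat this as a sketch, not a tested implementation:

    %% Key the connection pool on the canonical (post-CNAME) hostname, so
    %% that rnewson.cloudant.com and any other alias of
    %% lb2.jenever.cloudant.com share the same bucket of idle connections.
    -module(pool_key_sketch).
    -export([pool_key/1]).

    pool_key(Host) ->
        canonical_name(Host, 10).  %% bound the CNAME chase defensively

    canonical_name(Host, 0) ->
        Host;  %% suspiciously long alias chain; use what we have
    canonical_name(Host, N) ->
        case inet_res:getbyname(Host, cname) of
            {ok, {hostent, _, _, _, _, [Target | _]}} ->
                canonical_name(Target, N - 1);  %% follow the alias
            _ ->
                Host  %% no CNAME record: this name is already canonical
        end.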
>>>>>> On 20 Mar 2016, at 14:22, Robert Samuel Newson <[email protected]> wrote:
>>>>>>
>>>>>> My point is that we can (and currently do) trigger the replication manager on receipt of the database updated event, so it avoids all of the other parts of the sequence you describe which could fail.
>>>>>>
>>>>>> The obvious difference, and I suspect this is what motivates Adam's position, is that _db_updates can be called remotely. A solution using /_db_updates as its feed can run somewhere else; it wouldn't even need to be a couchdb cluster. With the current 2.0 scheme, the _replicator db has to live on the nodes performing replication management (and therefore it depends on couch_{btree,file} etc). That's a huge incentive to go the /_db_updates route and it would serve as a model for others like pouchdb that cannot choose to co-locate.
>>>>>>
>>>>>> One side-benefit we get from using database updated events from the _replicator shards, though, is that it helps us determine which node will run any particular job. We allocate a job to the lowest live erlang node that hosts the document. If we go with /_db_updates, we'll need some other scheme. That's not a bad thing (indeed, it could be a very good thing), but it would need more thought. While in Seattle we did discuss both directions at some length and believe we'd need some form of leader election system; the leader would then assign (and rebalance) replication jobs across the erlang cluster. I pointed at a proof-of-concept implementation of an algorithm I trust that I wrote a while back at https://github.com/cloudant/sobhuza as a possible starting point.
>>>>>>
>>>>>> B.
>>>>>>
>>>>>> P.S. I'm using Mail.app and simply replying where it sticks the cursor (at the top), but in other forums I've been berated for top-posting. Should I modify my reply style here?
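For reference, the current allocation rule is simple enough to sketch. Something like the below, noting that mem3:shards/2 and the #shard{} record are CouchDB internals assumed here, and this is a paraphrase of the idea rather than the actual couch_replicator_manager code:

    %% The job belongs to the lowest-sorting live node among those hosting
    %% the _replicator doc's shard.
    -module(owner_sketch).
    -export([owner/2]).

    -include_lib("mem3/include/mem3.hrl").  %% defines #shard{}

    owner(DbName, DocId) ->
        Live = [node() | nodes()],
        Hosting = [N || #shard{node = N} <- mem3:shards(DbName, DocId)],
        case lists:sort([N || N <- Hosting, lists:member(N, Live)]) of
            [] -> undefined;        %% no live copy right now; retry later
            [Owner | _] -> Owner    %% lowest-sorting live node wins
        end.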
>>>>>>> On 19 Mar 2016, at 21:42, Benjamin Bastian <[email protected]> wrote:
>>>>>>>
>>>>>>> When a shard is updated, it'll trigger a "database updated" event. CouchDB will hold those updates in memory for a configurable amount of time in order to dedupe updates. It'll then cast lists of updated databases to nodes which host the relevant _db_updates shards for further deduplication. It's only at that point that the updates are persisted. Only a single update needs to reach the _db_updates DB. IIRC, _db_updates triggers up to n^3 (assuming the _db_updates DB and the updated DB have the same N), so it may be a bit tricky for all of them to fail. You'd need coordinated node failure. Perhaps something like datacenter power loss. Another possible issue is if all the nodes which host a shard range of the _db_updates DB are unreachable by the nodes which host a shard range of any other DB. Even if it was momentary, it'd cause messages to be dropped from the _db_updates feed.
>>>>>>>
>>>>>>> For n=3 DBs, it seems like it'd be difficult for all of those things to go wrong (except perhaps in the case of power loss or catastrophic network failure). For n=1 DBs, you'd simply need to reboot a node soon after an update.
>>>>>>>
>>>>>>>> On Sat, Mar 19, 2016 at 1:31 PM, Adam Kocoloski <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Bob, comments inline:
>>>>>>>>
>>>>>>>>> On Mar 19, 2016, at 2:36 PM, Robert Samuel Newson <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> The problem is that _db_updates is not guaranteed to see every update, so I think it falls at the first hurdle.
>>>>>>>>
>>>>>>>> Do you mean to say that a listener of _db_updates is not guaranteed to see every updated *database*? I think it would be helpful for the discussion to describe the scenario in which an updated database permanently fails to show up in the feed. My recollection is that it’s quite byzantine.
>>>>>>>>
>>>>>>>>> What couch_replicator_manager does in couchdb 2.0 (though not in the version that Cloudant originally contributed) is to us ecouch_event, notice which are to _replicator shards, and trigger management work from that.
>>>>>>>>
>>>>>>>> Did you mean to say “couch_event”? I assume so. You’re describing how the replicator manager discovers new replication jobs, not how the jobs discover new updates to source databases specified by replication jobs. Seems orthogonal to me unless I missed something.
>>>>>>>>
>>>>>>>>> Some work I'm embarking on, with a few other devs here at Cloudant, is to enhance the replicator manager to not run all jobs at once, and it is indeed the plan to have each of those jobs run for a while, kill them (they checkpoint then close all resources) and reschedule them later. It's TBD whether we'd always strip feed=continuous from those. We _could_ let each job run to completion (i.e., caught up to the source db as of the start of the replication job) but I think we have to be a bit smarter and allow replication jobs that constantly have work to do (i.e., the source db is always busy) to run as they run today, with feed=continuous, unless forcibly ousted by a scheduler due to some configured concurrency setting.
>>>>>>>>
>>>>>>>> So I think this is really the crux of the issue. My contention is that permanently occupying a socket for each continuous replication with the same source and mediator is needlessly expensive, and that _db_updates could be an elegant replacement.
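For what it's worth, the single-socket consumer Adam describes is cheap to sketch. Everything below is illustrative: the module name is made up, jiffy is assumed for JSON (as in CouchDB itself), and reconnection and since-tracking are elided:

    %% Rough sketch of a single-socket listener on a remote cluster's
    %% /_db_updates feed. Requires inets (and ssl for https URLs).
    -module(db_updates_sketch).
    -export([listen/1]).

    listen(BaseUrl) ->
        {ok, _} = application:ensure_all_started(inets),
        Url = BaseUrl ++ "/_db_updates?feed=continuous",
        {ok, _Ref} = httpc:request(get, {Url, []}, [],
                                   [{sync, false}, {stream, self}]),
        loop(<<>>).

    loop(Buf) ->
        receive
            {http, {_Ref, stream_start, _Headers}} ->
                loop(Buf);
            {http, {_Ref, stream, Chunk}} ->
                %% the continuous feed is newline-delimited JSON; blank
                %% lines are heartbeats and are skipped below
                {Lines, Rest} = split_lines(<<Buf/binary, Chunk/binary>>),
                [handle(jiffy:decode(L, [return_maps]))
                 || L <- Lines, L =/= <<>>],
                loop(Rest);
            {http, {_Ref, stream_end, _Headers}} ->
                ok  %% a real consumer would reconnect with ?since=...
        end.

    handle(#{<<"db_name">> := Db, <<"type">> := Type}) ->
        io:format("~s: ~s~n", [Type, Db]);
    handle(_Other) ->
        ok.

    split_lines(Bin) ->
        Parts = binary:split(Bin, <<"\n">>, [global]),
        {lists:droplast(Parts), lists:last(Parts)}.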
>>>>>>>>> I note for completeness that the work we're planning explicitly includes "multi database" strategies; you'll hopefully be able to make a single _replicator doc that represents your entire intention (e.g., "replicate _all_ dbs from server1 to server2").
>>>>>>>>
>>>>>>>> Nice! It’ll be good to hear more about that design as it evolves, particularly in aspects like discovery of newly created source databases and reporting of 403s and other fatal errors.
>>>>>>>>
>>>>>>>> Adam
