Hi Aleksey,

I'm sorry, I don't have useful answers to your questions, so if that is all you're after you may as well skip this email.

On Thu, Jun 28, 2018 at 10:53:58AM -0400, Gene Heskett wrote:
> Given the history of ksplice, and my innate paranoia, I don't have a pole
> long enough to reach it. You shouldn't either.

In my opinion this is just fear of a process that is not understood. At another time and place, you could be cautioning people not to travel faster than 40mph because they will surely suffocate, or that photographic reproductions of a person's image trap their soul.

Live patching is a technique. It has trade-offs, like everything else.

> If something is patched and a reboot is needed to make it 100%
> functional, and you can't stand the thought of 2 minutes downtime while
> its rebooting, its time to mirror your app to a second machine and
> configure an automatic failover.

In some (maybe even many) scenarios this is absolutely true. As a thought experiment, though, imagine you have a server with 1,000 services on it, each of them, through virtualisation, belonging to a different entity (customer, organisation, user, whatever). We'll call the entities users for simplicity.

Each user pays $5 a month for their service to run on this platform. There is no redundancy. If a user wants redundancy then they can purchase more services and implement it themselves. Because most users do not see that as a priority, most users do not. The users accept that there will be inevitable occasional downtime, because they don't want to double their costs, or more, for what is a relatively rare event.

So this platform is raking in $5k a month per server. Then there's a kernel update for a serious security flaw and that requires a reboot. You, as operator of the platform, schedule maintenance and the users endure 5 minutes of outage.

Your competitor works out that they can pay someone to produce live kernel patches for $100 a month per server, and does so. Your competitor also has 1,000 users per server, so they're raking in $5k per server, less $100 to pay the live patching company: $4,900 a month. They don't reboot their servers and cause outages for their users; they just live patch¹.

Your users find out that your competitor's service is exactly the same as yours, with the same features and price, but they've heard that it's a lot more available! 20% of your users move to your competitor.

You're still making $5k per server but you have 20% fewer servers, because 20% of your users left. Your competitor is still making $4,900 per server but they have 20% more servers, because they took on a bunch of new users. Your competitor is crushing you.

You try to explain to your users that they could build in the availability they need by running multiple instances of their services and architecting them so that they can survive the failure of a percentage of those instances. Your users ask you why you expect them to pay more, do more, know more and make things more complex, when it is an inarguable fact that your competitor offers a more available service than yours for the same price they pay now.

Looks like you're going to have to either copy your competitor or lower your prices. You can talk until you are blue in the face about there being no free lunch while your user is paying your competitor what they used to pay you, only your competitor's lunch offering is better.

As I say, you are absolutely correct that for some scenarios it is right and proper to build the resilience into the app. However, what I suggest you have missed is that very few use cases require that level of engineering.
Most of the users of any software at all exist in a place with much more lax requirements, where it is *nice* when things don't fail but they aren't interested in building N copies of the software and altering it so that it can make use of that distributed nature. Just making things incrementally better is a big deal at scale, especially if that doesn't cost much. Hint: live patching services don't tend to cost $100 per server per month.

So we are left in a situation where a lot of things are running on "platforms", and the users like it when the platforms are highly available. Why isn't everything engineered to have near-perfect availability? That's possible, right? Yes, but it costs. Motor vehicles aren't space shuttles; the process around their design and manufacture does not attempt to make them near-perfect, merely good enough for what they cost. If a simple alteration to the process can save X lives per year while not costing very much, then it makes sense, and no one scoffs that because it falls short of perfection it isn't worth doing at all.

The thing about live kernel patching is that it's private to the kernel: the rest of the system doesn't necessarily know that anything has happened. When you start to build highly available distributed services, you tend to find that you need to alter the way you do things in order to work with the distributed nature of the service. Where there is a partition between the people running the platform and the people running the service (as in the example scenario I gave above), it becomes more likely that the platform operator simply cannot run things in a manner that suits every variety of service owner, and vice versa. Remaining as generic as possible results in the largest market possible. That is why techniques like live kernel patching increasingly do have a place, even though they may horrify purists.

> There are some OS's that can do that, QNX comes to mind, but they
> aren't free. Even the QNX microkernel has a dead time of 15 or 20
> seconds for a full reload of everything else.

There are further trade-offs here, in that you'd then be deploying software in a quite specialist environment rather than on plain old, well-understood Linux. While there are some who need, want and are capable of working with QNX, there's vastly more market in hosting things on Linux.

> I think the applicable keyword here is TANSTAAFL. Its a universal law,
> and there are no shortcuts around it. IOW, if you think the lunch is
> free, check the price of the beer.

The price of live patching at the moment is that, for every kernel update, someone has to work out the corresponding live patch. That work is not free, and that is why various organisations charge for it. The Linux kernel cannot go unpatched upstream, so the costs of generating regular patches are borne by the Linux kernel project. By contrast, most end-users' kernels *can* go without a live patch, so those interested in using live patches currently need to pay or employ someone to generate them.

No one is suggesting that anyone should expect to get anything for free, so the choice of using live patches, or live migrating things, or any other technique to increase the availability of a service is just a matter of choosing your poison. It's not black magic, and no one needs to spit and say an oath whenever its name is mentioned.

I am not currently aware of anyone providing free kernel live patches. These are things you pay for from sources like Red Hat, Canonical, CloudLinux and Oracle, or hire staff to produce for you, or make yourself.
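
For what it's worth, on mainline kernels with CONFIG_LIVEPATCH the end product of that work is just a special kind of kernel module built on the klp_* API. Here is a rough sketch, modelled on samples/livepatch/livepatch-sample.c from the kernel source tree; the function being replaced and the string it prints are purely illustrative, and the details vary between kernel versions, so treat it as a shape rather than something to load on a production box:

/*
 * Sketch of a kernel live patch module: it replaces cmdline_proc_show()
 * (the function behind /proc/cmdline) with a version that prints a
 * fixed string instead.  Based on the upstream livepatch sample.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/livepatch.h>
#include <linux/seq_file.h>

/* The replacement function, called in place of the original. */
static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
{
	seq_printf(m, "%s\n", "this has been live patched");
	return 0;
}

/* Which functions to replace, and with what. */
static struct klp_func funcs[] = {
	{
		.old_name = "cmdline_proc_show",
		.new_func = livepatch_cmdline_proc_show,
	}, { }
};

/* A NULL object name means the functions live in vmlinux itself. */
static struct klp_object objs[] = {
	{
		.funcs = funcs,
	}, { }
};

static struct klp_patch patch = {
	.mod = THIS_MODULE,
	.objs = objs,
};

static int livepatch_init(void)
{
	/* Register and enable the patch; running tasks are switched over
	 * to the new function as they reach a safe point. */
	return klp_enable_patch(&patch);
}

static void livepatch_exit(void)
{
}

module_init(livepatch_init);
module_exit(livepatch_exit);
MODULE_LICENSE("GPL");
MODULE_INFO(livepatch, "Y");

The boilerplate above is not the expensive part. The expensive part is analysing each kernel security fix, working out which functions need replacing, and writing replacements that are safe to slot in at runtime; that is what the money buys.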

Perhaps one day the process will become so simple that volunteers in a project like Debian could do it for free, nearly as fast as the regular binary package updates come out. We're not there yet, and information on how to do it seems quite scarce, which is why people are currently paying for it.

Cheers,
Andy

¹ At this point some may say, "Well, if the users ran their services in virtual machines then the VMs could be migrated to already-patched hardware without the users actually noticing. No need for this live patching stuff!" That's true, but I think it gets bogged down in details. A person could equally say, "This live migration thing is the Devil's work; just make every application distributed so it can survive failure, or else you don't really care about the application!" Like live patching, live migration is a technique that has its trade-offs and will not be suitable for all scenarios.
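
(If it helps make that concrete: live migration is something the platform side can drive through an API such as libvirt's, without the guest needing to be involved at all. A rough sketch in C, with made-up host and guest names and only token error handling:

/*
 * Sketch of a live migration driven through the libvirt C API.  The
 * connection URIs, host names and guest name are invented for the sake
 * of the example.
 */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
	virConnectPtr src, dst;
	virDomainPtr dom, migrated;

	/* Source host (needs the reboot) and destination (already patched). */
	src = virConnectOpen("qemu+ssh://host-a.example.com/system");
	dst = virConnectOpen("qemu+ssh://host-b.example.com/system");
	if (src == NULL || dst == NULL) {
		fprintf(stderr, "failed to connect to one of the hosts\n");
		return 1;
	}

	dom = virDomainLookupByName(src, "customer-vm-42");
	if (dom == NULL) {
		fprintf(stderr, "no such guest\n");
		return 1;
	}

	/* VIR_MIGRATE_LIVE keeps the guest running while its memory is
	 * copied over to the destination host. */
	migrated = virDomainMigrate(dom, dst, VIR_MIGRATE_LIVE, NULL, NULL, 0);
	if (migrated == NULL) {
		fprintf(stderr, "migration failed\n");
		return 1;
	}

	virDomainFree(migrated);
	virDomainFree(dom);
	virConnectClose(dst);
	virConnectClose(src);
	return 0;
}

Real deployments layer things like shared storage, bandwidth limits and tunnelled connections on top of that call, which is where the trade-offs mentioned above come in.)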